In the field of empirical software engineering, the ability to analyze vast amounts of data from software repositories is crucial. With platforms like GitHub providing access to millions of repositories, researchers face the challenge of creating manageable and representative datasets.
Software repositories are one of the main data sources for empirical software engineering, particularly in the Mining Software Repositories (MSR) field. However, the sheer volume of available data makes the use of sampling techniques mandatory. Traditional methods often rely on random selection or on variables such as popularity, which can lead to biased samples that do not truly represent the diversity of the repositories. Our research addresses this gap by proposing a structured methodology for creating representative samples.
In our paper “On the Creation of Representative Samples of Software Repositories”, to be presented at the International Symposium on Empirical Software Engineering and Measurement (ESEM ’24), we introduce a methodology for creating representative samples of software repositories. This approach ensures that the samples are aligned with both the characteristics of the repository population and the specific requirements of empirical studies.
What is Sampling?
In empirical software engineering, the concept of sampling is fundamental to the research process. Sampling involves systematically selecting a subset of elements from a larger population to make inferences about the entire population.
Stratified sampling is a technique where the population is partitioned into independent regions or strata, and a sample is then drawn within each stratum, ensuring that all relevant subgroups are adequately represented. It is widely used to improve precision and promote representativeness of the inferences. We say a sample is representative when each sampled element represents the variables of a known number of elements in the population.
This method is particularly useful in software engineering studies, where different types of repositories or projects may exhibit unique characteristics.
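As a toy illustration of stratified sampling with proportional allocation, consider the sketch below. The population, field name, and counts are made-up assumptions for illustration, not data from the paper:

```python
import random

# Hypothetical population: repositories labeled by type (the stratum).
population = (
    [{"type": "model"}] * 600
    + [{"type": "dataset"}] * 300
    + [{"type": "space"}] * 100
)

def stratified_sample(population, key, sample_size, seed=0):
    """Draw a stratified random sample, allocating observations
    to each stratum proportionally to its share of the population."""
    rng = random.Random(seed)
    strata = {}
    for item in population:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for items in strata.values():
        n = round(sample_size * len(items) / len(population))
        sample.extend(rng.sample(items, n))
    return sample

sample = stratified_sample(population, "type", 100)
```

Because allocation is proportional, the 100-element sample keeps the 60/30/10 split of the population, so no repository type is over- or under-represented.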
Our Approach
We have developed a methodology employing stratified random sampling. Our methodology is divided into four phases:
- Variable Selection: This phase identifies the key variables that will drive the sample creation. This step is crucial as the representativeness of the sample depends on these variables. The selected variables can be either numerical, categorical, or a combination of both.
- Variable Analysis: We consider two types of variables: numerical and categorical. Numerical variables refer to data measured on a continuous scale (e.g., number of likes or downloads), while categorical variables refer to data that can be divided into groups (e.g., programming languages used in the code). This phase is further divided into two steps: (1) Preprocessing, which studies the descriptive characteristics of the data to handle non-sampling errors (e.g., missing data or non-numeric values in numerical variables); and (2) Stratification, which creates the strata for the stratified random sampling.
- Composition: This phase involves composing the strata of the selected variables to form a new strata distribution. This step involves generating all possible combinations and selecting the valid ones. Although several variables can be used for the composition, we recommend using no more than four to six variables, as the probability of some variables canceling the effects of others increases with the number of variables.
- Sampling: To create the samples, for numerical variables we apply probabilistic sampling methods to collect observations for each stratum, ensuring the quality of the sample through an iterative process. For categorical variables, we extract samples from each stratum, applying simple random sampling while maintaining their proportions relative to the population. When mixing numerical and categorical variables, we apply the same process as for categorical variables, but to each stratum of the resulting combination.
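The Composition phase above can be sketched with `itertools.product`: generate every candidate combination of strata, then keep only the combinations that actually occur in the population. The strata labels and the toy population below are illustrative assumptions, not the paper's data:

```python
from itertools import product

# Hypothetical strata: likes bucketed into three clusters, and three repo types.
likes_strata = ["low", "medium", "high"]
type_strata = ["model", "dataset", "space"]

# Hypothetical population of (likes_stratum, type) observations.
population = [
    ("low", "model"), ("low", "dataset"), ("medium", "model"),
    ("high", "model"), ("low", "space"),
]

# Composition: all candidate combinations of the two variables' strata...
candidates = list(product(likes_strata, type_strata))
# ...keeping only those actually present in the population (the valid ones).
valid = [c for c in candidates if c in set(population)]
```

Here 3 × 3 = 9 candidate strata are generated, but only the 5 combinations observed in the population survive; sampling is then performed within each valid combined stratum.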
Case Studies
We illustrate our approach using Hugging Face Hub (HFH) repositories. We use HFCommunity, which collects data from HFH and Git repositories (to learn more, you can read this post). The following figure shows a snippet of the conceptual schema of HFCommunity:
As a running example, we propose three use cases based on HFH repositories:
- Numerical Variable (Likes): We used the number of likes as the variable and applied the k-means clustering algorithm to generate strata. The sample size was determined using a margin of error and confidence interval.
- Categorical Variable (Type): We used the type of repository (dataset, model, space) as the variable. The sample size was calculated to ensure representativeness across all categories.
- Mixed Variables (Likes and Type): We combined the numerical and categorical variables to create a comprehensive sample. This involved generating all possible combinations of strata and applying proportional allocation.
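A rough sketch of the numerical use case follows, assuming a plain 1-D Lloyd's k-means for stratification and Cochran's formula (with finite-population correction) for the sample size; the paper's library may differ in these details, and the like counts below are invented:

```python
import math
import random

def kmeans_1d(values, k, iters=50, seed=0):
    """Cluster a numerical variable (e.g., likes) into k strata
    using Lloyd's k-means algorithm in one dimension."""
    rng = random.Random(seed)
    centroids = rng.sample(sorted(set(values)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if a cluster emptied.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

def sample_size(N, e=0.05, z=1.96, p=0.5):
    """Cochran's sample-size formula with finite-population correction.
    e: margin of error, z: z-score of the confidence level, p: expected proportion."""
    n0 = (z ** 2) * p * (1 - p) / e ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / N))

likes = [0, 1, 2, 3, 120, 130, 5000]   # hypothetical like counts
strata = kmeans_1d(likes, k=3)         # strata for the numerical variable
n = sample_size(N=10_000)              # 95% confidence, 5% margin -> 370
```

For a population of 10,000 repositories at a 95% confidence level and a 5% margin of error, the formula yields a sample of 370 observations, which would then be allocated across the k-means strata.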
Tool Support
To facilitate the application of our methodology, we developed a Python library that automates the process. We include a replicability package for the use cases presented, making it easier for researchers to apply our approach to their studies.
Conclusion
Our methodology provides a robust framework for creating representative samples of software repositories, addressing the limitations of current sampling methods. By ensuring that samples are aligned with the variables of interest, our approach enhances the quality and reliability of empirical studies in software engineering. As future work, we plan to extend our approach with support for sampling from multiple datasets or from evolving datasets (i.e., adding or removing observations or variables). The current implementation focuses on building samples given a margin of error and a confidence level, which ensures high representativeness but may result in large samples. We plan to allow parameterization of the sample size and to report how representativeness may be affected.
We look forward to presenting our findings at ESEM’24 and exploring further applications of our methodology.
Research Engineer at the IN3-SOM Research Team in Barcelona, Spain. Currently pursuing a master's degree in data science. Interested in open-source dynamics and collaboration.