The development of empirical studies in Open-Source Software (OSS) requires large amounts of data regarding software development events and developer actions, which are typically collected from code hosting platforms. Code hosting platforms are built on top of a version control system, such as Git, and provide collaboration tools such as issue trackers, discussions, and wikis, as well as social features such as the possibility to watch, follow, and like other users and projects. Among them, GitHub has emerged as the largest code hosting site in the world, with more than 100 million users and 180 million (public) repositories.

The development of empirical studies in software engineering mainly relies on the data available on code hosting platforms, with GitHub being the most representative. Nevertheless, in recent years, the emergence of Machine Learning (ML) has led to the development of platforms specifically designed for ML-based projects, with Hugging Face Hub (HFH) being the most popular one. With over 600k repositories, and growing fast, HFH is becoming a promising ecosystem of ML artifacts and therefore a potential source of data for empirical studies. However, so far, there are no studies evaluating the potential of HFH for such empirical analyses.

In our paper “On the Suitability of Hugging Face Hub for Empirical Studies” (preprint available here), published in the Empirical Software Engineering journal, we address the question of whether HFH is suitable for performing empirical studies. To this end, we conducted a qualitative and quantitative analysis. In the following, we briefly explain the motivation, method, and results of the paper.

So… What’s Hugging Face Hub?

Many of you may know this emerging platform, but for those who don’t, HFH is a Git-based online code hosting platform aimed at providing a hosting site for all kinds of ML artifacts, namely: (1) models, pretrained models that can be used with the Transformers library; (2) datasets, which can be used to train ML models; and (3) spaces, demo apps to showcase ML models.
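For readers who want to explore these artifact types directly, the hub exposes them through its official Python client, huggingface_hub. The snippet below is a minimal, illustrative sketch (assuming a reasonably recent version of the library; parameter names may vary across releases):

```python
# Minimal sketch: listing the three kinds of HFH artifacts with the official
# huggingface_hub client (pip install huggingface_hub).
from huggingface_hub import HfApi

api = HfApi()

models = api.list_models(limit=5)      # pretrained models
datasets = api.list_datasets(limit=5)  # datasets used to train/evaluate models
spaces = api.list_spaces(limit=5)      # demo apps showcasing models

for m in models:
    print(m.id)  # repository identifier, e.g. "<owner>/<model-name>"
```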

As of November 2023, the platform hosts more than 625k repositories, and this number is growing fast. To illustrate this growth, the following figures show the monthly and cumulative number of new project registrations in HFH, respectively.

Number of project registrations per month in HFH

Cumulative number of projects in HFH


HFH has been evolving and incorporating features typically found in GitHub, such as the ability to create discussions or submit pull requests, enabling more complex interactions and development workflows. This evolution, its growing popularity, and its ML-specific features make HFH a promising source of data for empirical studies. Although the usage of HFH in empirical studies is promising, the current status of the platform may involve relevant perils.

Our proposal

At ESEM ’23 we presented a registered report with the purpose of evaluating the potential of HFH for empirical studies (click here to read the full report). The resulting study has been accepted at the EMSE journal. The goal of our study is to assess the current state of HFH and analyze its adequacy for use in empirical studies. To this end, we proposed two research questions. The first one is:

  • What features does HFH provide as a code hosting platform to enable empirical studies?

This RQ aims to properly comprehend the key features that characterize HFH both for individual projects (i.e., features oriented towards end-users planning to use HFH for their software development projects) and at the platform level (i.e., features that facilitate the retrieval and analysis of global HFH usage information). This analysis allows us to characterize the platform and identify potential use cases for empirical studies. Hence, we subdivided this research question into two:

    1. What features does HFH offer to facilitate the collaborative development of ML-oriented projects?
    2. What features does HFH offer at the platform level to facilitate access to the hosted projects’ data?

The first subquestion involves an exploratory study of the features offered by HFH to projects hosted on the platform. Here, we focus on the features serving project development tasks, such as pull requests for managing code contributions or issue trackers for reporting bugs and requests. To this end, we study current code hosting platforms to define a feature framework to be used as a reference for analyzing HFH. The second subquestion examines the features provided by HFH for retrieving its internal data, derived from the activity of the projects it hosts.

The second research question is defined as:

  • How is HFH currently being exploited?

In this RQ we are interested in studying how HFH is being used so far at the platform and project levels. At each level, we analyze the data from two perspectives: volume and diversity. To measure volume, we define quantitative variables, such as the number of repositories and users at the platform level, or the number of files, contributors, and commits at the project level. To measure diversity, we define categorical variables, such as the programming languages used in the repositories or the type of contributions (i.e., issues or discussions) in the projects. Following this distinction of levels, we subdivided this research question into two:

    1. What is the current state of the platform data in HFH?
    2. What is the current state of the project data in HFH?

The first one analyzes the platform as a whole, while the second one explores the usage of HFH at the project level, with the goal of characterizing the average project (or averages, if we detect different typologies) on HFH via the analysis of its number of files and commits, number of users, temporal evolution, etc.

Addressing the Research Questions

To address our research questions, we conducted a qualitative and quantitative analysis of HFH. The former addresses RQ1, as it focuses on identifying the features of HFH and the options available to retrieve HFH data. The latter addresses RQ2 and allows us to analyze the data available in HFH via the reported data retrieval solution. The results of both analyses are then discussed in a set of semi-structured interviews, from which we highlight a few points below.

Qualitative Analysis

During the qualitative analysis we build a feature framework aimed at identifying the characteristics that define a code hosting platform. The framework is built by analyzing different code hosting platforms and identifying the features offered by each one. The extracted features rely on the authors’ experience with the platforms, the platforms’ documentation and usage, and the literature. We review a number of platforms, leveraging existing literature. The result is a superset of features backed by literature from relevant venues, which underlines the importance of specific features in empirical studies. A version of our framework is depicted below:

Feature framework

Besides the literature, we validate this framework by conducting a survey with relevant actors of each analyzed platform (i.e., developers relying on code hosting platforms for their day-to-day development, and researchers from the empirical software engineering and mining software repositories (MSR) communities).

The framework, composed of 34 features grouped into six categories (the full feature framework is available in the paper), characterizes HFH in terms of its features and allows us to analyze their importance in the context of empirical studies. We also study the intersection of these features with the ones offered by other code hosting platforms. From the characterization of HFH we can highlight:

  1. HFH provides limited support for coding and project management features; according to its internal team, the platform is focused on providing further support for social interaction rather than replicating the coding support other platforms provide.
  2. The platform mainly promotes social support, providing tools like discussions to enable interactions between users in a repository. Furthermore, it provides HF Posts, a blog-like page where users can publish about the latest trends in the ML community.
  3. Regarding the retrieval of the platform’s data, HFH provides an API and a Python library to interact with the repositories hosted on the platform (see the sketch below). However, only one community dataset exists to access the whole HFH data: HFCommunity.
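As an illustration of that third point, the sketch below uses the huggingface_hub client to retrieve the discussions and pull requests of a repository. The repo_id is a placeholder, and exact attribute names may vary across library versions:

```python
# Minimal sketch: retrieving discussions and pull requests of an HFH repository.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "some-org/some-model"  # placeholder repository identifier

# On HFH, discussions and pull requests are served by the same endpoint.
for d in api.get_repo_discussions(repo_id=repo_id):
    kind = "pull request" if d.is_pull_request else "discussion"
    print(f"#{d.num} [{kind}] {d.title} ({d.status})")
```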

Quantitative Analysis

In the quantitative analysis we examine the HFH data to provide an overview of the current usage of the platform, performed at the platform and project levels. The former shows the actual usage of the features identified in the previous research question and allows us to conclude on the level of exploitation of such features. The latter gives insight into how the development process is currently carried out in HFH, thus helping to understand why users use this platform. To perform this analysis we leverage the data provided by HFCommunity.

We define quantitative variables to analyze the platform and the repositories. The Platform category is designed to characterize the HFH environment. Since HFH is specifically designed for ML-based artifacts, some platform aspects may differ from other code hosting platforms, such as having dedicated repository types for each kind of ML artifact. The Project category aims to give insight into the status of the repositories. Empirical analyses usually rely on a reduced subset of the repositories in a code hosting platform; therefore, we find it appropriate to perform the analysis from a repository perspective and identify whether there is a way to select prolific repositories for empirical analysis, as is done on other code hosting platforms. Some example variables of each level are shown below:

Quantitative variables
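To make a few of these variables concrete, the following sketch computes some project-level volume variables (files, commits, contributors) for a single repository through the huggingface_hub client. In the paper these variables are computed at scale from HFCommunity, so this is only an illustrative approximation, and the repo_id is a placeholder:

```python
# Minimal sketch: project-level volume variables for one HFH repository.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "some-org/some-model"  # placeholder repository identifier

files = api.list_repo_files(repo_id)      # files in the repository
commits = api.list_repo_commits(repo_id)  # Git history of the repository
contributors = {author for c in commits for author in c.authors}

print(f"files: {len(files)}, commits: {len(commits)}, contributors: {len(contributors)}")
```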

As with the qualitative analysis, we validate the identification of the metrics via a survey. We consider different sets of repositories: the top 100 most liked, the top 100 most downloaded, and all the public repositories uploaded to HFH (see the selection sketch below). After applying the metrics, we further discuss the results in the semi-structured interviews. The quantitative analysis provides a view on the use of HFH, which we summarize in the following:
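The top-100 samples can be approximated with the hub's sorting options, as in the sketch below (restricted to model repositories for brevity; the study itself relies on HFCommunity snapshots rather than live API calls):

```python
# Minimal sketch: selecting the top-100 most liked and most downloaded models.
from huggingface_hub import HfApi

api = HfApi()

top_liked = list(api.list_models(sort="likes", direction=-1, limit=100))
top_downloaded = list(api.list_models(sort="downloads", direction=-1, limit=100))

print(top_liked[0].id, top_liked[0].likes)
print(top_downloaded[0].id, top_downloaded[0].downloads)
```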

  1. At the platform level, HFH is undergoing exponential growth, hosting hundreds of thousands of repositories. Despite its early stage, it provides valuable and diverse repositories.
  2. One key characteristic is the presence of dependencies between repositories, potentially building an interconnected ecosystem and highlighting relationships caused by how ML artifacts are developed (e.g., models trained with datasets hosted in HFH and showcased by spaces); see the sketch after this list.
  3. At the project level, the development activity mainly consists of Git commits and discussions.
  4. Project activity usually does not last more than a month, which might indicate that repositories are uploaded to HFH for hosting rather than development purposes.
  5. Projects are usually maintained by one or two contributors, while in the top-100 repositories there is more community participation.
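Regarding point 2, one way such dependencies surface is through the metadata of a repository's card. The sketch below reads a model card and prints the datasets (and, if declared, the base model) it depends on; this assumes the card actually declares this metadata in a recent huggingface_hub version, and the repo_id is a placeholder:

```python
# Minimal sketch: recovering cross-repository dependencies from model card metadata.
from huggingface_hub import ModelCard

card = ModelCard.load("some-org/some-model")  # placeholder repository identifier

print("datasets:", card.data.datasets or [])         # dataset repositories used for training
print("base model:", card.data.base_model or "n/a")  # model repository this one derives from
```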

Key Findings and Suggestions

We identified a set of strong and weak points of HFH, which help determine which empirical studies are (and are not) suitable on HFH. We summarize a few points in the following:

  1. We noticed the support for social interactions and an interface friendly to all kinds of users. Spaces might be used as a landing ground for new users, potentially engaging them to contribute through discussions.
  2. The features provided, along with other integrations such as Posts, give users a centralized place for ML-related topics. HFH aims to foster the ML communities usually found on other social sites (e.g., Twitter/X). Furthermore, the features provided to identify dependencies between repositories can help build and identify such communities.
  3. While HFH is following an exponential growth similar to the one GitHub had, it is still in an early stage of adoption.
  4. The limited support for coding forces developers to use other platforms with further coding support in parallel to HFH.
  5. A few interviewees highlighted that the normal procedure for selecting models from HFH is based on the number of likes or downloads. Such approaches are far from providing an optimal selection, introducing a critical issue: the Matthew effect (commonly known as “the rich get richer”).
  6. There is a notable presence of empty and inactive repositories. In a previous work, we analyzed the survival rate of GitHub projects and noticed that most repositories turn inactive a few months after their upload, which might also be happening in HFH.

From the strong and weak points identified in the paper, we also discuss which empirical studies are suitable on HFH and which are not. Some examples are:

  1. Given the focus on establishing HFH as a social rather than a development platform, empirical studies on its collaborative and networking aspects would be ideal. Furthermore, studies targeting social aspects could also be a good fit for HFH.
  2. Thanks to its specific focus on hosting ML artifacts, HFH allows studies on specific ML concepts such as pre-trained model (PTM) reuse.
  3. The features provided by HFH to identify dependencies between repositories allow the identification of community clusters, enabling studies on community detection and graph analysis.
  4. As mentioned above, studies aiming at a complete picture of the end-to-end development of an ML artifact may need to leverage HFH in combination with GitHub or others, rather than using HFH (or GitHub) as a standalone data source.
  5. Furthermore, we believe that it would be interesting to replicate existing studies done on GitHub or other platforms.

Conclusion

In this article, we presented our study of the suitability of HFH for empirical studies. The study comprises a feature-based framework comparison to characterize the HFH functionality, together with an analysis of the mechanisms to retrieve information on how such features are used. This allows evaluating the suitability of HFH from a feature-availability perspective. Besides this feature-level study, we conducted a second, more quantitative one, based on the volume and diversity of the data stored on HFH. We conducted this analysis at both the platform and project levels, looking at the overall volume and richness of the data and at how the average project uses the platform.

Beyond a deeper understanding of how the collaborative development of ML-related projects takes place on HFH, the conclusion of this work is presented as a discussion of whether HFH can be a suitable data source for empirical studies. Also, given that empirical studies usually focus on a specific characteristic of code hosting platforms, beyond a boolean answer, the goal is to discuss which types of empirical studies could benefit from HFH data, either as a standalone data source or in combination with GitHub or other data sources.

