{"id":119004,"date":"2023-11-07T05:00:31","date_gmt":"2023-11-07T05:00:31","guid":{"rendered":"https:\/\/livablesoftware.com\/?p=119004"},"modified":"2023-11-07T05:05:37","modified_gmt":"2023-11-07T05:05:37","slug":"huggingface-hub-empirical-studies-ml","status":"publish","type":"post","link":"https:\/\/livablesoftware.com\/huggingface-hub-empirical-studies-ml\/","title":{"rendered":"Is Hugging Face Hub a good source of data for empirical studies on ML development?"},"content":{"rendered":"

The development of empirical studies in Open-Source Software (OSS) requires large amounts of data regarding software development events and developer actions, which are typically collected from code hosting platforms. Code hosting platforms are built on top of a version control system, such as Git<\/a>, and provide collaboration tools such as issue trackers, discussions, and wikis; as well as social features such as the possibility to watch, follow and like other users and projects. Among them, GitHub<\/a> has emerged as the largest code hosting site in the world, with more than 100 million users<\/a> and 180 million (public) repositories<\/a>.<\/p>\n

The development of empirical studies in software engineering mainly relies on the data available on code hosting platforms, with GitHub being the most representative<\/strong>. Nevertheless, in recent years, the rise of Machine Learning (ML) has led to platforms specifically designed for developing ML-based projects, with Hugging Face Hub<\/a> (HFH) being the most popular one. With over 600k<\/strong> repositories, and growing fast, HFH is becoming a promising ecosystem of ML artifacts and therefore a potential source of data for empirical studies. However, so far, there are no studies evaluating the potential of HFH for such empirical analysis<\/strong>.<\/p>\n

We need your help!<\/h1>\n

We are currently performing a study to evaluate HFH (more details in the following sections) and its potential as a source of data for empirical studies.<\/p>\n

As part of this study, we want to better understand, in your opinion:<\/p>\n

    \n
  1. the essential characteristics of code hosting platforms,<\/li>\n
  2. and which statistics about a project you’d like to see when choosing among several projects or libraries that offer the functionality you need.<\/li>\n<\/ol>\n

    For this, we created a survey to collect the opinions of developers and researchers.\u00a0If you are a developer who frequently uses a code hosting platform, or a researcher in the empirical software engineering (EMSE) or mining software repositories (MSR) areas, we would be very grateful if you could answer the survey (click here<\/a> to answer). <\/strong><\/p>\n

    And if, on top of answering the survey, you’d be open to a short interview with us to elaborate on your opinions, we would be very grateful. If so, leave your email in the last question of the survey (or just contact us anytime).<\/p>\n

    So… What’s Hugging Face Hub?<\/h1>\n

    Many of you may know this emergent platform, but for those who don’t, HFH is a Git-based online code hosting platform<\/strong> aimed at providing a hosting site for all kinds of ML artifacts, namely: (1) models<\/a>, pretrained models that can be used with the Transformers<\/a> library; (2) datasets<\/a>, which can be used to train ML models; and (3) spaces<\/a>, demo apps to showcase ML models.<\/p>\n

    As of November 2023, the platform hosts more than 625k repositories<\/a>, and this number is growing fast<\/strong>. The following figures show the number of new project registrations per month in HFH and the cumulative number of projects, respectively.<\/p>\n


    Number of project registrations per month in HFH<\/p><\/div>\n


    Cumulative number of projects in HFH<\/p><\/div>\n

     <\/p>\n

    HFH has been evolving and incorporating features that are typically found in GitHub, such as the ability to create discussions or submit pull requests, enabling more complex interactions and development workflows. This evolution, its growing popularity and its ML-specific features make HFH a promising source of data for empirical studies<\/strong>. However, the current state of the platform may pose relevant perils for such studies.<\/p>\n

    Our proposal<\/h1>\n

    In ESEM ’23<\/a> we presented a registered report to evaluate the potential of HFH for empirical studies (click here<\/a> to read the full report). Its goal is to assess the current state of HFH and analyze its adequacy for empirical studies. To this end, we proposed two research questions. The first one is:<\/p>\n