{"id":119004,"date":"2023-11-07T05:00:31","date_gmt":"2023-11-07T05:00:31","guid":{"rendered":"https:\/\/livablesoftware.com\/?p=119004"},"modified":"2023-11-07T05:05:37","modified_gmt":"2023-11-07T05:05:37","slug":"huggingface-hub-empirical-studies-ml","status":"publish","type":"post","link":"https:\/\/livablesoftware.com\/huggingface-hub-empirical-studies-ml\/","title":{"rendered":"Is Hugging Face Hub a good source of data for empirical studies on ML development?"},"content":{"rendered":"
Empirical studies of Open-Source Software (OSS) require large amounts of data on software development events and developer actions, which are typically collected from code hosting platforms. Code hosting platforms are built on top of a version control system, such as Git<\/a>, and provide collaboration tools such as issue trackers, discussions, and wikis, as well as social features such as the ability to watch, follow, and like other users and projects. Among them, GitHub<\/a> has emerged as the largest code hosting site in the world, with more than 100 million users<\/a> and 180 million (public) repositories<\/a>.<\/p>\n Empirical studies in software engineering mainly rely on the data available on code hosting platforms, with GitHub being the most representative<\/strong>. In recent years, however, the rise of Machine Learning (ML) has led to platforms specifically designed for developing ML-based projects, with Hugging Face Hub<\/a> (HFH) being the most popular one. With over 600k<\/strong> repositories and growing fast, HFH is becoming a promising ecosystem of ML artifacts and therefore a potential source of data for empirical studies. However, so far, there are no studies evaluating the potential of HFH for such empirical analysis<\/strong>.<\/p>\n We are currently performing a study to evaluate HFH (more details in the following sections) and its potential as a source of data for empirical studies.<\/p>\n As part of this study, we want to better understand what are, in your opinion:<\/p>\n For this, we created a survey to collect the opinions of developers and researchers.\u00a0If you are a developer who frequently uses a code hosting platform, or a researcher in the empirical software engineering (EMSE) or mining software repositories (MSR) areas, we would be very grateful if you could answer the survey (click here<\/a> to answer). 
<\/strong><\/p>\n And if, on top of answering the survey, you’d be open to a short interview with us to elaborate on your opinions, we would be very grateful. In the last question, leave your email (or just contact us anytime) if you agree to be interviewed.<\/p>\n Many of you may know this emerging platform, but for those who don’t, HFH is a Git-based online code hosting platform<\/strong> aimed at hosting all kinds of ML artifacts, namely: (1) models<\/a>, pretrained models that can be used with the Transformers<\/a> library; (2) datasets<\/a>, which can be used to train ML models; and (3) spaces<\/a>, demo apps to showcase ML models.<\/p>\nWe need your help!<\/h1>\n
\n
So… What’s Hugging Face Hub?<\/h1>\n