Libraries.io monitors over 2 million open source libraries/packages from 36 package managers (npm, Maven, PyPI, etc.) and gathers relevant information about each of them, including their licenses, releases, contributors and the dependencies among them. Libraries.io was started by Andrew Nesbitt and Benjamin Nickolls and is now part of Tidelift.
The key contribution of Libraries.io is that it goes beyond providing individual project information (something other tools provide) by offering key insights on the ecosystem surrounding that project. A software ecosystem is defined as a collection of software projects which are developed and co-evolve together (due to technical dependencies and shared developer communities) in the same environment.
As mentioned above, the dataset contains data on approximately 2.5 million unique software components, including 9 million tracked versions and 39 million tagged releases. It also contains data on 25 million repositories that use these projects in their own code, with 100 million declared dependencies upon projects. The complete dataset is approximately 5GB compressed, 25GB uncompressed. Beyond providing an open API, the full dataset has been made available as comma-separated values (CSV) tables hosted by Zenodo. The data is licensed for use under the terms of the Creative Commons Attribution-ShareAlike 4.0 licence.
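Given the size of the uncompressed export, loading a whole table into memory is rarely practical. The following is a minimal sketch, using only the Python standard library, of streaming a CSV table row by row and counting rows per value of a column. The column name `Platform` and the file name mentioned in the comment are assumptions for illustration; check the headers of the files you actually download.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, TextIO


def iter_rows(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per CSV row without loading the whole file into memory."""
    yield from csv.DictReader(handle)


def count_by_column(rows: Iterable[dict], column: str) -> Counter:
    """Count how many rows share each value of `column`."""
    counts: Counter = Counter()
    for row in rows:
        # Missing or empty cells are grouped under a placeholder label.
        counts[row.get(column) or "(unknown)"] += 1
    return counts


def top_values(path: str, column: str, top: int = 5) -> list:
    """Open a CSV file (e.g. a hypothetical 'projects.csv') and rank a column."""
    with open(path, newline="", encoding="utf-8") as handle:
        return count_by_column(iter_rows(handle), column).most_common(top)
```

Calling `top_values(path, "Platform")` on a projects table would then list the package managers with the most projects, assuming such a column exists in the file.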
Libraries.io data schema
Data is organized in six packages. We summarize here the content of each one; see https://libraries.io/data for a full description. Some data is only available for certain package managers, since not all of them offer the same kind of data. The CSV export also includes aggregated summary data and a few additional fields to reduce the number of joins needed to analyze the data.
A project corresponds to the definition of a package/library from one of the supported package managers. It includes data on:
- Package manager where the project is available
- Created and updated timestamps
- Homepage and repository URLs
A version is an immutable published version of a project. Some packages instead of an explicit version concept directly rely on tags/branches in the repo. Version data includes:
- Number (especially useful for semantic versioning)
- Created, published and updated timestamps
- Host type
- Repository name and owner
- Tag git sha
- Published, created and updated timestamps
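Since version numbers are stored as strings, sorting them alphabetically gives the wrong order ("1.10.0" would come before "1.9.0"). The sketch below shows one way to turn semver-like numbers into comparable tuples. It is a simplification under stated assumptions: it only reads the leading `major.minor.patch` part, ignores pre-release and build suffixes, and silently drops numbers that do not follow that pattern.

```python
import re
from typing import List, Optional, Tuple

# Matches an optional leading "v" followed by major.minor.patch.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)")


def semver_key(number: str) -> Optional[Tuple[int, ...]]:
    """Return (major, minor, patch), or None if the string is not semver-like."""
    match = _SEMVER.match(number.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def sort_versions(numbers: List[str]) -> List[str]:
    """Sort semver-like strings numerically, dropping the ones that do not parse."""
    parsed = [(semver_key(number), number) for number in numbers]
    return [number for key, number in sorted(p for p in parsed if p[0] is not None)]
```

For example, `sort_versions(["1.10.0", "1.9.0", "latest"])` keeps only the two parseable numbers and places "1.9.0" first.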
A repository in Libraries.io represents a publicly accessible source code repository from either GitHub, GitLab or Bitbucket. Repository data includes:
- Host type
- Repository name and owner
- Created, updated and last pushed timestamps
Dependencies between projects. Dependencies belong to versions of a project, since each version can declare a different set of dependencies. Almost all dependencies are internal to the package manager. They are parsed from the project manifest file (a Gemfile, package.json, or similar). Available data:
- Repository name and owner
- Manifest platform, filepath and kind
- SCM type
- Git branch
- Whether the dependency is optional
- Dependency project name and requirements (e.g. versions it requires)
- Dependency kind (runtime, build, test, development,…)
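Because dependencies are attached to versions rather than to projects, the same project/dependency pair typically appears once per published version. A minimal sketch of collapsing those rows into a "who depends on what" map is shown below. The column names `Project Name`, `Dependency Name` and `Dependency Kind` are assumptions made for illustration and should be checked against the headers of the actual export.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def direct_dependents(rows: Iterable[dict], kind: str = "runtime") -> Dict[str, Set[str]]:
    """Map each dependency name to the set of projects that declare it."""
    dependents: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        if row["Dependency Kind"] != kind:
            continue
        # Using a set removes the duplicates created by multiple versions.
        dependents[row["Dependency Name"]].add(row["Project Name"])
    return dependents


def most_depended_upon(dependents: Dict[str, Set[str]], top: int = 3) -> List[Tuple[str, int]]:
    """Rank dependencies by number of distinct dependent projects."""
    ranked = sorted(dependents.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(projects)) for name, projects in ranked[:top]]
```

The rows can come from any iterable of dicts, for instance `csv.DictReader` over the dependencies table, so the same function works on a small sample or on the full export.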
I believe that the general availability of ecosystem data such as the data described here could trigger much more research in this area. Right now, collecting this kind of data requires plenty of intensive manual work to interact with individual package managers. For instance, we could use the data to answer research questions on software ecosystems, either by looking at the projects’ relationships within a given package or by performing cross-package comparisons. Some possible questions would be:
- Classification: Can we classify projects into types (according to different dimensions, e.g. maturity, risk, quality,…) depending upon the relationships with other projects and repositories? Can we improve search and recommendation services based upon this knowledge?
- Comparison: Do we see similar behaviours between package managers, languages, or within smaller frameworks? Are there any common patterns that can be observed in “successful” projects (for whatever definition of success you choose to rely on)? Which projects are the most depended upon? Which projects represent the most value to the rest of the ecosystem? What are the critical projects in an ecosystem? Would the ecosystem be able to recover if a project were deleted?
- Temporal evolution: Is software getting more and more complex? How does this compare across different ecosystems, languages, and project types (i.e. are some languages more prone to quickly increase in complexity? Is there a tipping point after which projects start to quickly increase in complexity?).
- Licences: What are the most popular open source licenses in use today? Do they change over time? Do projects frequently ignore the licence requirements within their own dependencies?
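As a rough illustration of the "critical projects" question, one could estimate the impact of deleting a project by collecting everything that depends on it, directly or indirectly. The sketch below assumes a reverse dependency map (dependency name to the set of its direct dependents) has already been built from the dependency data; the project names in the usage note are made up. It is only one possible proxy for criticality, not an established metric.

```python
from collections import deque
from typing import Dict, Set


def transitive_dependents(dependents: Dict[str, Set[str]], project: str) -> Set[str]:
    """Return every project that depends on `project`, directly or indirectly."""
    seen: Set[str] = set()
    queue = deque([project])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            # Skip already visited nodes so dependency cycles terminate.
            if parent not in seen and parent != project:
                seen.add(parent)
                queue.append(parent)
    return seen
```

With a hypothetical map such as `{"core": {"a", "b"}, "a": {"c"}}`, calling `transitive_dependents(reverse_map, "core")` returns the three projects `a`, `b` and `c`, which gives a first idea of how far a removal would propagate.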
Some examples of cool applications relying on this data are available in the Experiments section of their website. Or, if you’re up for some scientific reading, check this paper: An Empirical Comparison of Dependency Network Evolution in Seven Software Packaging Ecosystems by Alexandre Decan, Tom Mens and Philippe Grosjean. (And if you’re also using libraries.io data in one of your projects and want to be listed here, just ping me back!)