The network of collaborations in an open source project can reveal relevant emergent properties that influence its prospects of success. In this work, we analyze open source projects to determine whether they exhibit a rich-club behavior, i.e., a phenomenon where contributors with a high number of collaborations (i.e., strongly connected within the collaboration network) are likely to cooperate with other well-connected individuals. The presence or absence of a rich-club has an impact on the sustainability and robustness of the project.
For this analysis, we build and study a dataset with the 100 most popular projects on GitHub, exploiting connectivity patterns in the graph structure of collaborations that arise from commits, issues and pull requests. Results show that rich-club behavior is present in all the projects, but only a few of them have an evident club structure. We compute coefficients both for single-source graphs and for the overall interaction graph, showing that rich-club behavior varies across different layers of software development. We provide possible explanations of our results, as well as implications for further analysis.
This work has been accepted at OpenSym 2019. You can download the full paper here or keep reading to see the slides and an extended summary.
Motivation
GitHub is the most popular service for developing and maintaining open source projects: it allows users to create public Git repositories, modify code through commits, send pull requests that update other users’ repositories, and notify them about issues in the code. Each user interacts with many other users during project development, and these interactions define collaboration networks that can be used to describe software projects. Studying collaboration networks helps uncover properties that influence the success of a project. In particular, we are interested in the mesoscopic level of network analysis. At this level, the main goal is to identify relevant connectivity patterns, which may reveal emergent properties that stem from the way the nodes in the network interact.
It is in this context that the well-known rich-club behavior arises as an interesting structural signature to study. This behavior reflects the tendency of well-connected nodes (i.e., hubs) to interact with other well-connected nodes.
The rich-club behavior has been studied in many diverse domains: scientific collaboration networks, migration flows, brain connectivity, air transportation and the Internet topology. We believe that a similar behavior can occur in software development communities too. More precisely, if we consider a network representing collaborations between developers, rich-club behavior may appear when developers collaborate mostly with the same fixed subset of other important colleagues, instead of spreading the cooperation across every member of the team. The availability of GitHub data through its public API and other services (GHArchive) makes it possible to study in detail how rich-club behavior manifests in open source software development.
Our goal is to analyze the interaction networks of the most popular software development projects available on GitHub, in order to verify the presence of the rich-club behavior and its potential implications for open source projects.
We built a data collection process that integrates the commit data coming from Git and the activity data coming from GitHub into a graph structure suitable for our purpose, which considers users as nodes and collaborations on the project’s activities as edges. We then computed the rich-club coefficient for three kinds of networks: issues-based, pull-requests-based and commits-based. Finally, we verified how the rich-club coefficient changes when all the interactions are combined in a collaboration supergraph.
Our results reveal that the rich-club behavior is present in all the supergraphs, but only a few of them have an evident club structure. We also compute coefficients for each single-source graph, showing that rich-club behavior varies across different layers of software development. Moreover, as a further validation step, we manually compare two projects with very different rich-club behavior against external data about those projects.
Rich Club Coefficient
The rich-club coefficient was first introduced in [16] as a non-normalized metric that depends on a degree k:
$$\phi(k) = \frac{2 E_k}{N_k (N_k - 1)} \tag{1}$$
where N_k is the number of nodes with degree greater than or equal to k and E_k is the number of edges between these nodes. Intuitively, ϕ(k) measures how far the set of nodes with degree at least k is from being a complete subgraph, i.e., a clique. The value of ϕ(k) ranges from 0 (all nodes are disconnected) to 1 (a clique), with higher values showing a stronger rich-club behavior in the network.
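To make the definition concrete, the following is a minimal sketch that computes ϕ(k) directly from an edge list, following the "degree greater than or equal to k" wording used above. The function name `rich_club_phi` and the toy graph are illustrative assumptions, not part of the study.

```python
def rich_club_phi(edges, k):
    """Non-normalized rich-club coefficient phi(k) of an undirected simple graph.

    `edges` is an iterable of (u, v) pairs. The club contains the nodes whose
    degree is >= k, as in the definition given in the text.
    Returns None when the club has fewer than two nodes (phi is undefined).
    """
    # Deduplicate edges and ignore self-loops.
    edge_set = {frozenset((u, v)) for u, v in edges if u != v}

    # Degree of every node that appears in at least one edge.
    degree = {}
    for edge in edge_set:
        for node in edge:
            degree[node] = degree.get(node, 0) + 1

    club = {node for node, d in degree.items() if d >= k}
    n_k = len(club)
    if n_k < 2:
        return None

    # E_k: edges whose two endpoints both belong to the club.
    e_k = sum(1 for edge in edge_set if edge <= club)
    return 2 * e_k / (n_k * (n_k - 1))


# Hypothetical toy graph: a triangle a-b-c plus a pendant node d attached to a.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
print(rich_club_phi(toy_edges, 2))  # club {a, b, c} is a clique -> 1.0
```

For k = 1 the club is the whole graph (4 nodes, 4 edges out of 6 possible), so the value drops to 2/3.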
However, this coefficient tends to increase as the network grows [4], so a null model is needed to normalize the previous formula. Normalization is given by:
$$\rho(k) = \frac{\phi(k)}{\phi_{random}(k)} \tag{2}$$
where ϕ(k) is computed using Equation 1 and ϕ_random(k) is the coefficient computed for the same k on a random network with the same degree distribution as the original network. This coefficient ρ(k) is a non-negative number for which the value 1 is the baseline: if ρ(k) > 1, then the rich-club behavior of the network is above that of a random case.
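The normalization can be sketched with NetworkX by building the null model explicitly: the graph is randomized with degree-preserving double edge swaps, and the two non-normalized coefficients are divided degree by degree. This is an illustrative sketch under stated assumptions, not the code used in the study: the helper name `normalized_rich_club`, the synthetic Barabási-Albert graph and the parameter values are hypothetical. Note also that NetworkX documents its club as the nodes with degree strictly greater than k, which differs slightly from the wording above.

```python
import networkx as nx


def normalized_rich_club(G, Q=10, seed=None):
    """Return {k: rho(k)} for the degrees k where the random baseline is non-zero."""
    phi = nx.rich_club_coefficient(G, normalized=False)

    # Null model: same degree sequence, edges rewired by Q * |E| double edge swaps.
    R = G.copy()
    n_edges = R.number_of_edges()
    nx.double_edge_swap(R, nswap=Q * n_edges, max_tries=10 * Q * n_edges, seed=seed)
    phi_random = nx.rich_club_coefficient(R, normalized=False)

    # Skip degrees where the randomized graph has no edges inside the club,
    # since the ratio is undefined there.
    return {k: phi[k] / phi_random[k] for k in phi if phi_random.get(k, 0) > 0}


# Hypothetical stand-in for a project collaboration graph.
G = nx.barabasi_albert_graph(n=200, m=3, seed=42)
rho = normalized_rich_club(G, Q=10, seed=1)
above_random = sorted(k for k, value in rho.items() if value > 1)
print(f"rho(k) > 1 for {len(above_random)} of {len(rho)} degree values")
```

Skipping the degrees with a zero baseline is a design choice of this sketch; a stricter pipeline could instead treat the whole graph as having no defined normalized value.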
Research Method
Data Collection
To collect a representative sample of open source projects, we built a dataset comprising the 100 most popular projects on GitHub. On GitHub, users can star a project to show their interest and follow its progress, so we measure the popularity of a project by its number of stars (i.e., the more stars, the more popular the project).
The construction of the dataset involved three phases: (1) cloning, (2) import and (3) enrichment. After these steps, we applied a graph generation process to produce the graphs on which the rich-club coefficient is calculated. The figure illustrates the construction process; each step is described next.
Cloning. In the first phase, we obtained the list of the 100 most popular projects on GitHub (at the time of data collection) via its API and cloned them to collect the corresponding Git repositories.
Import. To facilitate the query and exploration of the projects we imported the Git repositories into a relational database using Gitana [6]. In the Gitana database, Git repositories are represented in terms of users (i.e., contributors with a name and an email), files, commits (i.e., changes performed in one or more files), references (i.e., branches and tags), and file modifications (i.e., they link commits with files). For two projects, the import process failed to complete due to missing data.
Enrichment. Our study needs a clear identification of the author of each commit so that we can properly link contributors and the files they modified. Unfortunately, Git does not validate contributors’ names and emails when commits are pushed, which results in many clashing and duplicity problems in the data. Clashing appears when two or more different contributors have set the same value for their names (again, note that in Git the contributor name is manually configured), so commits coming from different contributors appear under the same author name (e.g., this often happens with common names such as “Mike” or “John Smith”). Duplicity, on the other hand, appears when a contributor has used several emails, so commits are linked to different emails, suggesting different contributors, while in fact they come from the same one.
After evaluating the impact of this issue, we found that on average around 60% of the commits in each project were authored by contributors affected by a clashing/duplicity problem (with a similar ratio of files affected). Therefore, for our analysis to be meaningful, we applied corrective actions to uniquely identify contributors.
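The corrective actions themselves are not detailed here. As a hedged illustration of the first step such a process might take, the sketch below only flags author names that appear with more than one email. Such a name may hide either a clash (different people) or a duplicity (one person, several emails), and telling them apart needs extra evidence. The function name and the sample identities are hypothetical.

```python
from collections import defaultdict


def ambiguous_names(identities):
    """Return {name: sorted emails} for author names seen with more than one email.

    `identities` is an iterable of (name, email) pairs as recorded in commits.
    Names and emails are compared case-insensitively after trimming whitespace.
    """
    emails_by_name = defaultdict(set)
    for name, email in identities:
        emails_by_name[name.strip().lower()].add(email.strip().lower())
    return {
        name: sorted(emails)
        for name, emails in emails_by_name.items()
        if len(emails) > 1
    }


# Hypothetical commit identities.
commit_identities = [
    ("John Smith", "john@example.com"),
    ("John Smith", "jsmith@example.org"),
    ("Mike", "mike@example.com"),
    ("mike", "Mike@example.com"),
]
print(ambiguous_names(commit_identities))
```

Here only "john smith" is flagged; the two "Mike" entries collapse into a single identity after normalization.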
Graph Generation
We rely on the so-called collaboration supergraphs to calculate the rich-club coefficient in our study, which are obtained from two sets of graphs: (a) commit graphs using Git data and (b) activity graphs using GitHub data. They are both weighted, undirected graphs that include users based on different kinds of interactions. We have defined three generation steps to create these graphs, as illustrated in Figure 1b. Next, we describe the graphs generated in each step.
Commit Graph Generation. Commit graphs are composed of nodes, which represent contributors, and edges, each of which joins two authors who have edited the same file in the repository. Nodes are weighted according to the number of contributors, while the edge weight represents the number of files edited by both authors.
Activity Graph Generation. Activity graphs are of two types: issues-based and pull-request-based. In the first type, edges connect contributors that interacted on the same issues, either by commenting on the issue or by performing other actions such as opening, closing and assigning it. Similarly, pull-request graphs connect contributors that worked on the same pull requests, commenting on or reviewing others’ work. Activity graphs are also weighted, based on the number of common issues or pull requests each pair of users has interacted with.
Collaboration Supergraph Generation. The previous three graphs (i.e., commit, issue and pull request graphs) are merged together into a collaboration supergraph: each pair of users has an edge if they have committed to the same file or worked on the same issue or pull request. Nodes are matched across the three source graphs via the GitHub username, which uniquely identifies the contributor. Note that this is possible thanks to having addressed the duplicity and clashing problem in Git, as described before.
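As an illustration, the edge construction described above can be sketched as follows, assuming a mapping from each file to the authors who edited it is already available. The function name, the input shape and the sample data are assumptions made for this example, and node weights are left out.

```python
from collections import Counter
from itertools import combinations


def build_commit_graph(file_authors):
    """Return {(author_a, author_b): number of files edited by both authors}.

    `file_authors` maps a file path to an iterable of the authors who edited it.
    Each undirected edge is stored once, with its endpoints in sorted order.
    """
    edge_weights = Counter()
    for authors in file_authors.values():
        # Every pair of distinct authors of the same file shares one more file.
        for a, b in combinations(sorted(set(authors)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights


# Hypothetical repository contents.
file_authors = {
    "README.md": ["alice", "bob"],
    "src/app.py": ["alice", "bob", "carol", "alice"],
    "LICENSE": ["dave"],
}
commit_graph = build_commit_graph(file_authors)
print(dict(commit_graph))
```

In this toy input, alice and bob share two files, so their edge gets weight 2, while dave edited a file alone and has no edges.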
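The merge step can be sketched as below, assuming each source graph is given as a mapping from a pair of GitHub usernames to an edge weight. How weights are combined is not stated in the text; summing them is an assumption of this sketch, as are the function name and the sample data.

```python
from collections import Counter


def merge_into_supergraph(*layers):
    """Merge weighted undirected edge maps into a single supergraph edge map.

    Each layer maps (user_a, user_b) to a weight. An edge exists in the result
    if it exists in at least one layer. Weights are summed (an assumption).
    """
    supergraph = Counter()
    for layer in layers:
        for (u, v), weight in layer.items():
            # Normalize the endpoint order so (u, v) and (v, u) are the same edge.
            supergraph[tuple(sorted((u, v)))] += weight
    return supergraph


# Hypothetical source graphs keyed by GitHub username.
commit_edges = {("alice", "bob"): 2}
issue_edges = {("bob", "alice"): 1, ("bob", "carol"): 3}
pull_request_edges = {("alice", "carol"): 1}

supergraph = merge_into_supergraph(commit_edges, issue_edges, pull_request_edges)
print(dict(supergraph))
```

If only the presence of an edge matters, as in the non-weighted coefficient, the keys of the result are enough and the summed weights can be ignored.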
Computing the Rich-Club Coefficient
The coefficient is computed using the implementation in the NetworkX [7] Python package for network analysis, which provides both the non-normalized and the normalized version of the coefficient.
Note that the rich-club coefficient is computed for each project graph, but results are not always available: if the graph is too simple, the randomization process used for normalization fails. We consider only the projects with a defined normalized value for the supergraph, indicated as ρG(k); otherwise, results could not be validated. For this set of projects, the rich-club coefficient is also calculated for the issues, pull-requests and commits graphs, indicated as ρis(k), ρpr(k) and ρc(k), respectively.
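A minimal sketch of this filtering step is shown below: it calls the NetworkX function and treats any failure of the normalization as "no defined value". Which exceptions the original pipeline encountered is not stated, so the set caught here is an assumption, as are the helper name and the example graphs. The helper returns the maximum of ρ(k) over k, the per-project quantity used in the discussion.

```python
import networkx as nx


def max_normalized_rich_club(G, Q=100, seed=None):
    """Return the maximum of rho(k) over k, or None if normalization fails."""
    try:
        rho = nx.rich_club_coefficient(G, normalized=True, Q=Q, seed=seed)
    except (nx.NetworkXException, ZeroDivisionError):
        # Randomization failed (for example, the graph is too small to rewire)
        # or the random baseline was zero for some degree.
        return None
    return max(rho.values()) if rho else None


# A three-node path is too small for the edge-swap randomization.
tiny_graph = nx.path_graph(3)
print(max_normalized_rich_club(tiny_graph))  # None

# Hypothetical larger graph; the result is a float, or None if normalization fails.
larger_graph = nx.barabasi_albert_graph(n=50, m=2, seed=7)
print(max_normalized_rich_club(larger_graph, Q=10, seed=7))
```

Projects for which the helper returns None on the supergraph would be excluded from the analysis, following the selection rule described above.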
Discussion
Presence of rich-club behavior on the overall contributions’ graph
A total of 60 projects have a defined ρG value. The distribution of the maximum coefficient for each project is shown in Figure 5.
Focusing on the supergraph (blue line), we can see that each project has a maximum coefficient slightly higher than 1: the rich-club behavior is present in all the inspected projects, but it is relevant only for a few of them, i.e., those in the right tail of the distribution, which are the most distant from a random network. Table 1 lists the top-10 projects with respect to ρG, together with the maximum rich-club coefficient for each of the other source graphs. Intuitively, the higher the coefficient, the more prominent the effect of the club on the network, but describing this effect quantitatively is hard, because the distribution of coefficients is not Gaussian and confidence values cannot be directly applied.
We believe this diversity of results reflects the different maturity of the projects and the alternative ways open source projects can grow and evolve. For instance, many open source projects start as a collaboration effort from a small team of developers. As the project grows and gains popularity, it attracts new contributors. At this stage, in some projects, the original team retains “ownership” of the codebase, with external developers submitting small contributions. Meanwhile, in other cases, the project matures and is able to attract developers that become core contributors, diluting the presence of the team of founders.
Other projects, typically high-profile ones such as React, reach GitHub already in a mature state, after the initial development has been performed privately in a company or a closed community. In this scenario, once it becomes public, the project evolves rather than grows, and the role of a core team of developers has a lower impact.
Therefore, the project history must be taken into account when reacting internally to a rich-club measure.
Rich-club implications at the individual project level
We have assessed the presence of rich-clubs inside some of the most popular projects on GitHub. In the literature, the presence of rich-club behavior has shown both positive and negative effects on the network. On the one hand, the rich club is thought to be critical for global communication, given that its nodes have high betweenness centrality: if the shortest paths between all pairs of nodes are found, many of them involve rich-club members. On the other hand, the presence of this behavior relates to the tendency of the most active community members to control the network.
Therefore, project owners should evaluate their rich-club coefficient, understand where it is coming from (e.g., see some of the potential interpretations given above) and then react accordingly. If the rich-club coefficient is low, they should make sure that project information still flows across all nodes even if the hubs are not connected and therefore do not play this central role. If it is high, governance policies should be put in place to guarantee that all important decisions require broad community participation and cannot be dominated by a few hub nodes colluding together.
Future work
As further work, we plan to go deeper into this rich-club analysis by exploring weighted rich-club calculations and the rich-club effect at both the module and the ecosystem level. The former would aim to detect “local” rich-club behaviors that may be hidden when looking at the project from a global view but are still important for the success and evolution of specific project components. The latter looks for potentially coordinated rich-club behaviors that span multiple projects in the same domain. Additionally, further sources of information (e.g., email exchanges in public project mailing lists) will also be considered to enrich our graphs and check whether rich-club structures exist but manifest beyond the project data tracked by GitHub.