GitHub provides an unprecedented opportunity to study the collaboration patterns of software developers. Understanding collaboration patterns can support the design of better tools for collaborative coding and help measure developer’s productivity and impact.
A traditional tool for analyzing collaborative environments is co-authorship networks: a graph in which each node represents an author and each edge represents a collaboration between the two authors. This tool has been used extensively in studying scientific collaboration patterns and has been applied to many other collaborative environments. However, previous work on Wikipedia , an online environment that shares much of the above challenges with GitHub, found that the traditional co-authorship networks face challenges in scaling to the size and nature of Wikipedia. In Wikipedia, the fact that two authors contribute to the same page is not sufficient to establish a collaborative relationship. Long time gaps between the contributions, very small contributions, and contributions that are being cut-out shortly after should not be used to establish a collaboration between authors.
In our work, we performed a comprehensive study of the collaboration patterns in GitHub that attempts to address the described challenges. We carefully employ participation criteria for repositories (to eliminate personal and inactive repositories) and developers (to eliminate minor contributors), and use temporal co-authorship networks to account for temporal changes in patterns.
However, mining GitHub poses several challenges that can negatively affect the analysis of collaboration patterns  (see also this related post):
- A large number of inactive repositories
- A large number of personal projects
- Many small contributions that are often rolled-back shortly after
- GitHub is a fast-growing network and the collaboration patterns constantly change
These issues have also been taken into account when performing the analysis.
Data collection to detect collaboration patterns
We collected data for repositories in a set of 17 different programming languages that represent diverse characteristics (e.g., imperative/functional, systems/web/scientific, established/recent) and covers the top 10 languages based on IEEE spectrum ranking 2016.
GitHub API v3 was used to collect the data. We took advantage of the developer statistics API that provides a weekly summary of commits for each developer. As this API only provides information for the top 100 developers, we had to mine this information from the commit log of repositories that have more than 100 developers.
Reputation-based participation criteria for repositories
We focus on the collaboration patterns in the top repositories (based on star ranking). For each programming language, we collected data for the top 1000 repositories with the highest star ranking. Next, we filter out 16% of the repositories that we identify to be private or inactive. As previous work  found that 71.6% of the repositories on GitHub have only one developer and many more are inactive, we attribute the much lower percentage of filtered repositories in our dataset to the selection of top repositories.
Activity-based participation criteria for developers
A key part of our analysis is the selection of active developers. The purpose of the activity threshold is to get rid of the long tail phenomenon that is often associated with activities in large social networks. In GitHub, we find a long tail of developers that make very few and minor code commits. We, therefore, propose participation criteria (relative to the commit volume in each repository) that successfully filter out approximately 70% of the developers, while still retaining approximately 90% of the code commits, validating the existence of significant long tail behavior.
Temporal co-authorship networks
To account for temporal changes in collaboration patterns, and to analyze the evolution of patterns over time, we break the collected data into 10 nine-month periods with six-month overlap. We think nine-month is a reasonable period to observe collaboration patterns, and the overlapping time windows with a three-month shift to allow a smoother analysis of the changes observed after each quarter.
Analysis method to detect the patterns
We use the following metrics, often used in social network analysis, to analyze the constructed temporal networks:
- Connected components
- Degree distribution
- Network centralization
- Community structure
- Repository and language overlap
Our paper includes a detailed analysis of the constructed temporal networks. The main findings are as follows:
- We found that the active developers are less collaborative, more centralized, and have a stronger community structure compared to the population of all GitHub developers. They tend to have a smaller “giant component”, higher centralization, and smaller communities.
- Our analysis of the changes in collaboration patterns over time indicates that the collaboration patterns of the active developers evolve differently from the ones of the population of all GitHub developers.
- We compare the collaboration patterns between each programming language and find that different programming languages exhibit different collaboration patterns (which also evolve differently over time).
In addition to these findings, the methodology we develop can be used to better understand developer productivity and collaboration in collaborative coding environments.
For more information, including the detailed results, please refer to our published paper:
Cohen, Eldan, and Mariano P. Consens. “Large-Scale Analysis of the Co-Commit Patterns of the Active Developers in GitHub’s Top Repositories.” Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 2018. [Pre-print is available here]
 Kalliamvakou, Eirini, et al. “An in-depth study of the promises and perils of mining GitHub.” Empirical Software Engineering 21.5 (2016): 2035-2071.
 Laniado, David, and Riccardo Tasso. “Co-authorship 2.0: Patterns of collaboration in Wikipedia.” Proceedings of the 22nd ACM conference on Hypertext and hypermedia. ACM, 2011.
I am a PhD student in the Toronto Intelligent Decision Engineering Laboratory at the University of Toronto. My research interests include machine and deep learning, heuristic search and optimization, and scalable data mining.