Mining GitHub for fun and profit

by Gousios, Georgios


The advent of distributed version control systems has led to a new paradigm for distributed collaboration: instead of pushing changes to a central repository, developers pull them from other repositories and merge them locally. Various code hosting sites, notably GitHub, have seized the opportunity to facilitate pull-based development by offering workflow support tools, such as code review systems and integrated issue trackers. Interestingly, GitHub provides all its data through a comprehensive REST API. In this talk, we describe our work on the GHTorrent project, an effort to collect, process, mine and learn from the massive datasets that GitHub offers. We also present preliminary results of our in-depth analysis of pull requests.
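To make the REST API mentioned above concrete, the following is a minimal Python sketch of listing a repository's pull requests and tallying how many were merged. It is an illustration only, not the GHTorrent implementation. The helper names (pulls_url, fetch_pulls, summarize_pulls) are hypothetical, and the sketch assumes the public endpoint GET /repos/{owner}/{repo}/pulls with its state, per_page and page parameters, and the merged_at field in each returned pull request.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def pulls_url(owner, repo, state="all", per_page=100, page=1):
    """Build the URL that lists pull requests for one repository."""
    query = urlencode({"state": state, "per_page": per_page, "page": page})
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls?{query}"


def fetch_pulls(owner, repo, token=None, **params):
    """Download one page of pull requests (performs a network request).

    The token is optional; unauthenticated requests are rate limited.
    """
    request = urllib.request.Request(pulls_url(owner, repo, **params))
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_pulls(pulls):
    """Count merged, closed-without-merge, and open pull requests.

    Assumption: a pull request counts as merged when merged_at is set.
    """
    summary = {"merged": 0, "closed_unmerged": 0, "open": 0}
    for pull in pulls:
        if pull.get("merged_at"):
            summary["merged"] += 1
        elif pull.get("state") == "closed":
            summary["closed_unmerged"] += 1
        else:
            summary["open"] += 1
    return summary
```

A call such as summarize_pulls(fetch_pulls("some-owner", "some-repo")) would summarize the first page of results; the owner and repository names here are placeholders. Collecting a full history would also require following pagination and respecting rate limits, which this sketch leaves out.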

