The GHTorrent dataset and tool suite

by Gousios, Georgios

You can get a pre-print version from here.
See the paper's associated code repository: gousiosg/github-mirror

This paper received the "MSR2013: Best data showcase paper" award.

Abstract

A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects’ repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub’s event streams and persistent data, and offer it to the research community as a service. In this paper, we present the project’s design and initial implementation and demonstrate how the provided datasets can be queried and processed.

BibTeX record

@inproceedings{G13,
  author = {Gousios, Georgios},
  title = {The {GHT}orrent dataset and tool suite},
  year = {2013},
  month = may,
  pages = {233--236},
  numpages = {4},
  booktitle = {Proceedings of the 10th Working Conference on Mining Software Repositories},
  series = {MSR '13},
  isbn = {978-1-4673-2936-1},
  location = {San Francisco, CA, USA},
  url = {/pub/ghtorrent-dataset-toolsuite.pdf},
  github = {gousiosg/github-mirror},
  speakerdeck = {75bea5909fbb0130f0eb364613f6f036},
  award = {MSR2013: Best data showcase paper},
  note = {Best data showcase paper award}
}
