08 Mar 2016
The Issue 32 incident – An update

Many of you are aware of the GHTorrent issue 32. To sum up the discussion in a couple of lines, various developers included in GHTorrent wanted their email removed from it (which I did) and then wanted all emails to be excluded from the dataset (which I refused to do). The reasons behind the requests were privacy and the right to do whatever one wants with their personal data (email is considered personal data in many jurisdictions). What caused the whole thread was that researchers used GHTorrent as a source of emails for research surveys which were sent to... Read more

26 Jun 2015
How do project contributors use pull requests on Github?

with Alberto Bacchelli

Distributed software development projects employ collaboration models and patterns to streamline the process of integrating incoming contributions. Classic forms of code contributions to collaborative projects include change sets sent to development mailing lists or issue tracking systems and direct access to the version control system. More recently, however, a big portion of open source development happens on GitHub. One of the main reasons for this is the fact that contributing to a GitHub project is a relatively pain-free experience. Or is it? In Apr 2014, we ran a survey among contributors (also: integrators) to Github projects trying... Read more

02 Apr 2015
How to run a large scale survey

If you know me well, this blog post might seem strange. I have always been a proponent of quantitative methods and big data. Despite this, in April 2014, I ran a survey that got filled in by 1,500 people. One part of the survey analysis will be presented at ICSE 2015 this year, while we submitted the second part to FSE 2015 (still twiddling our thumbs about the results). In the wake of the ICSE 2015 publication, many colleagues asked me how I managed to get so many responses. Here is how I did it. Target an audience: The broader the... Read more

03 Oct 2014
How do project owners use pull requests on Github?

Pull-based development as a distributed development model is a distinct way of collaborating in software development. In this model, the project’s main repository is not shared among potential contributors; instead, contributors fork (clone) the repository and make their changes independently of each other. In the pull-based model, the role of the integrator is crucial. The integrator must act as a guardian for the project’s quality while at the same time keeping several (often, more than ten) contributions "in-flight" by communicating modification requirements to the original contributors. Being a part of a development team, the integrator must facilitate consensus-reaching discussions and... Read more

07 Jul 2014
The computer scientist's guide to speech development

During the last 20 months, I've been having fun with my daughter's (from now on: little λ) efforts to learn to speak. Up to now, the whole process can be split into 4 phases. The random noise phase: This starts at around 4 months. The baby mumbles random noises initially (aaa, usually) and, as the brain develops, more focused 2-letter syllables (ma-ma, pa-pa etc). Nothing interesting here, apart from the fact the baby can combine various stimuli (noise, vision etc) with oral expressions (say ma-ma when she hears mummy whispering at night), which computers are not very capable... Read more

29 May 2014
What's new in GHTorrent land?

A lot of people (around 30 at last count) have been using GHTorrent lately as an easy-to-use source for accessing the wealth of data Github has. Portions of the dataset appear in the MSR14 and VISSOFT14 data challenges, while at least 15 papers at this year's MSR and ICSE conferences are based on it. In this blog post, I summarize the long list of changes that happened in GHTorrent land since Sep 2013. Introducing Lean GHTorrent: Obtaining and restoring the full GHTorrent dataset is serious business: one has to download and restore more than 3TB of MongoDB... Read more

27 Mar 2014
The triumph of online collaboration

For a research paper I am working on, we wanted to analyze the top 30 "most collaborative" projects on Github. Defining a quantitative metric of collaboration and sorting projects according to it is not an easy task, as collaboration is in many cases implicit and not recorded, while not all actions of collaboration are equal. As a proxy, we chose to measure the number of people that perform changes that mutate the state of a repository. On Github, we could identify the following: A: Create a commit to a repository B: Perform a code review on an individual commit C:... Read more

27 Jan 2014
How projects use pull requests on Github

Pull requests form a new method for collaboration on distributed software development. The novelty lies in the decoupling of the development effort from the decision to incorporate the results of the development in the code base. Several code hosting sites, including Github and BitBucket, tapped into the opportunity to make the pull-based development model more accessible to programmers. A unique characteristic of such sites is that they allow any user to fork any public repository. The clone creates a public project that belongs to the user that cloned it, so the user can modify the repository without being part of... Read more

31 Oct 2013
The SEFUNC project final report

by Georgios Gousios and Arie van Deursen

This is the publishable version of the final report submitted as part of my Marie Curie IEF project. It summarizes what I did during the 16 months I was funded by it. The advent of distributed version control systems has led to the development of a new paradigm for distributed software development; instead of pushing changes to a central repository, developers pull them from other repositories and merge them locally. Various code hosting sites, notably Github, have tapped into the opportunity to facilitate pull-based development by offering workflow support tools, such as code... Read more

15 Oct 2013
Lazy hacker's service analytics

A week ago, I had trouble with the GHTorrent data retrieval process. Specifically, while scripts were performing as expected and the event processing error rate was within reasonable bounds, API requests took forever to complete, in many cases as much as 20 seconds. I know that Github's API is very snappy, and even though the response times I get are slower than what Github reports, it is reasonably fast if we take into account the packet trip over (or under) the Atlantic (usually, around 500 msec). My main hypothesis was that Github started employing some kind of tarpitting strategy... Read more
