General description

Software repositories archive valuable software engineering data, such as source code, execution traces, historical code changes, mailing lists, and bug reports. This data contains a wealth of information about a project’s status and history. By applying data science to software repositories, researchers can gain an empirically grounded understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.

IN4334 is a seminar course that aims to give students deep, hands-on experience in software analytics.

Learning Objectives

This course enables the student to:

Course Organization

The course projects

During the course, students will engage in two collaborative projects:

Survey of software analytics

Every year, tens of papers are published in the area of software analytics. This leads to a high noise-to-signal ratio: many papers contain only marginal insights, and outsiders find it really difficult to get an overview of what software analytics has to offer to software projects.

To make things easier for newcomers, we will collaboratively work on a high-quality summary of the area, outlining the current state of the art and future challenges. To make this work, the course instructor will provide an outline of the area, pointers to important papers, and a paper skeleton; you will have to summarize a sub-area of software analytics in an easily digestible format.

Task duration: 3 weeks

Extending the CodeFeedr platform

CodeFeedr is a state-of-the-art software analytics platform developed by the Software Analytics Lab at TU Delft. What is really interesting about CodeFeedr is that it works in a real-time, stream-processing fashion. Instead of analysing historical archives (the typical mining software repositories, or MSR, task), it analyses events: these can originate in real-time feeds (e.g. the event stream on GitHub or the Android app store) or in archived data (e.g. all commits in a GitHub repo).
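To make the event-based style concrete, here is a minimal Python sketch. It is not CodeFeedr code and does not use its API: the event shape and the function names `archived_events` and `pushes_per_repo` are assumptions made for illustration. It shows how one incremental computation can consume events one at a time, whether they are replayed from an archive or arrive from a live feed.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical event shape (an assumption): {"type": "push", "repo": "owner/name"}
Event = Dict[str, str]


def archived_events(records: Iterable[Event]) -> Iterator[Event]:
    """Replay stored records (e.g. all commits in a repo) as a stream of events."""
    for record in records:
        yield record


def pushes_per_repo(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Consume events one by one and emit the running push count per repository.

    The result is updated after every matching event instead of being
    computed once over a complete historical archive.
    """
    counts: Counter = Counter()
    for event in events:
        if event["type"] == "push":
            counts[event["repo"]] += 1
            yield dict(counts)


# Made-up sample data standing in for an archive.
sample = [
    {"type": "push", "repo": "a/x"},
    {"type": "fork", "repo": "a/x"},
    {"type": "push", "repo": "b/y"},
    {"type": "push", "repo": "a/x"},
]

for snapshot in pushes_per_repo(archived_events(sample)):
    print(snapshot)
```

Because `pushes_per_repo` only requires an iterable of events, a generator that yields events from a live feed could be passed in instead of `archived_events(sample)` without changing the analysis code.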

The basic architecture of CodeFeedr, along with some data plug-ins, is already in place. To showcase its full power, CodeFeedr needs i) more data plug-ins, and ii) a REPL that allows users to specify data processing operations using SQL.
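As a rough illustration of what a SQL REPL over event data could look like, the sketch below uses Python's standard `sqlite3` module with an in-memory table as a stand-in for the platform. The table layout and the functions `load_events`, `evaluate`, and `repl` are hypothetical; a real REPL for CodeFeedr would run queries against its streams rather than a static table.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple


def load_events(conn: sqlite3.Connection, events: Iterable[Tuple[str, str]]) -> None:
    """Create a hypothetical `events(type, repo)` table and fill it."""
    conn.execute("CREATE TABLE events (type TEXT, repo TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)


def evaluate(conn: sqlite3.Connection, statement: str) -> List[tuple]:
    """Evaluate one SQL statement and return all result rows."""
    return conn.execute(statement).fetchall()


def repl(
    conn: sqlite3.Connection,
    lines: Iterable[str],
    write: Callable[[str], None] = print,
) -> None:
    """Read statements line by line, evaluate each, and write the results.

    Stops at "quit" or "exit"; SQL errors are reported and the loop continues.
    """
    for line in lines:
        statement = line.strip()
        if not statement:
            continue
        if statement.lower() in {"quit", "exit"}:
            break
        try:
            for row in evaluate(conn, statement):
                write(str(row))
        except sqlite3.Error as exc:
            write(f"error: {exc}")
```

For interactive use, `repl(conn, sys.stdin)` reads one statement per line from standard input; passing a list of strings instead makes the loop easy to test.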

The purpose of this task is to extend CodeFeedr in either of the two directions proposed above.

Task duration: 4 weeks

Study material

The following papers / books / websites are must-reads in the study of software analytics.


Week Lecture Topic Lecturer
1 1 Course introduction GG
1 2 Mining software repository data GG
2 1
2 2
3 1
3 2
4 1
4 2
5 1
5 2
6 1
6 2
7 1
7 2



The final course grade will be calculated as:

All deliverables will be peer-reviewed by 2 other teams. For each graded item, the peer-review grade counts for 50% of the grade for that item.


[1] C. Bird, T. Menzies, and T. Zimmermann, The art and science of analyzing software data, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2015.