General description

Software repositories archive valuable software engineering data, such as source code, execution traces, historical code changes, mailing lists, and bug reports. This data contains a wealth of information about a project’s status and history. By doing data science on software repositories, researchers can gain an empirically based understanding of software development practices, and practitioners can better manage, maintain, and evolve complex software projects.

In recent years, advances in machine learning and AI technologies, as demonstrated by the successful application of deep neural networks (DNNs) in various domains, have not gone unnoticed in the field of software engineering. Researchers have applied DNNs to tackle problems such as automated program repair, code summarization, and code structure representation.

IN4334 is a seminar course that aims to give students a deep understanding of, and hands-on experience with, how deep neural networks and NLP techniques are used by today’s industry leaders to represent knowledge and solve existing problems in novel ways.

Learning Objectives

This course will enable students to:

Course Organization

The course projects

During the course, you will need to replicate an existing paper.

Replication is much touted but seldom practiced in the software engineering community. It is, however, a core aspect of science, especially empirical science.

The purpose of this task is to attempt a replication of a recent paper by downloading readily available data sets published together with the paper, by requesting the data from the original authors, or by applying the same techniques to different data. We recommend that you select a paper from the list that you studied for your literature survey.

Required reading for week 1:


Date   Week  Lecture  Topic                                                Lecturer
2/9    1     1        Course Introduction, How to read a paper in a group  MA
4/9    1     2        Enough data science to become dangerous              GG
9/9    2     1        Enough neural networks to become dangerous           MA
11/9   2     2        Enough NLP to become dangerous                       MA / GG
16/9   3     1        Representing code                                    students
18/9   3     2        Code embeddings                                      students
23/9   4     1        Source code analysis                                 students
25/9   4     2        NLP-based program analysis                           students
30/9   5     1        Finding bugs                                         guest lecture (MA)
2/10   5     2        Repairing bugs                                       students
7/10   6     1        Program synthesis                                    students
9/10   6     2        Code Completion                                      students
14/10  7     1        Program translation                                  students
16/10  7     2        Code summarization                                   students


Guest lecturer

Suggested replication papers / New topics

Here is an indicative list of papers/topics for replication. We specifically encourage you to propose new ideas!


The final course grade will be calculated as:

All deliverables will be peer-reviewed by two other teams. For each graded item, the peer-review grade counts for 50% of that item's final grade.

