Here is a list of papers/problems to study.

Again, we specifically encourage you to propose new ideas! source{d}, a company that uses ML4SE extensively, also keeps a repository of interesting papers on the topic. Feel free to explore it!


  1. Code translation. Translate source code from one language to another, e.g., from Java to C# or, perhaps more interesting to industry, from Cobol to Java. See Chen et al. [1] for reference. As a possible dataset, coding websites such as Codeforces and Rosetta Code contain the same problem implemented in multiple languages.

  2. Code completion. IDEs have been suggesting code completions for years. However, DL opens up new possibilities, such as more context-aware completions. Researchers have shown that this is indeed a tricky task [2]. This project is about replicating (or improving upon) this paper.

  3. Type inference. Inferring the type of a variable, especially in dynamically typed languages, can be a challenge. Hellendoorn et al. [3] have shown that DL techniques can be very precise at this task. This project aims at replicating this paper.

  4. API usage. Developers often need help in learning how to use an API. Can we suggest API usages to developers, given some natural-language text? Gu et al. [4] and Liu et al. [5] showed that this is possible. Your project here is to replicate one of these papers.

  5. Mutation testing. In mutation testing, we mutate the original program and check whether the existing test cases detect the injected error. Large companies, such as Google, have been adopting mutation testing, but not without challenges [6]. In particular, given the size of programs, the number of possible mutants is enormous; thus, prioritizing which mutants to generate is currently an open problem. Tufano et al. [7] proposed the use of deep learning to learn, from bug fixes, which mutants are really relevant.
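To make the mutation testing idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the approach of [6] or [7]: the function `price_with_tax`, the single mutation operator `AddToSub`, and the two toy tests are all hypothetical. The sketch mutates `+` into `-` and checks which test suite detects ("kills") the mutant.

```python
import ast

# Hypothetical program under test.
SOURCE = """
def price_with_tax(price, tax):
    return price + tax
"""


class AddToSub(ast.NodeTransformer):
    """Hypothetical mutation operator: replace every `+` with `-`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def load(tree):
    """Compile a module AST and return the function it defines."""
    namespace = {}
    exec(compile(tree, "<program>", "exec"), namespace)
    return namespace["price_with_tax"]


def weak_test(f):
    # Passes for both `+` and `-`, so it cannot detect the mutant.
    return f(10, 0) == 10


def strong_test(f):
    # Fails on the mutant, because 10 - 2 != 12.
    return f(10, 2) == 12


def is_killed(tests, f):
    """A mutant is killed if at least one test fails on it."""
    return not all(test(f) for test in tests)


original = load(ast.parse(SOURCE))
mutant_tree = ast.fix_missing_locations(AddToSub().visit(ast.parse(SOURCE)))
mutant = load(mutant_tree)

print("killed by weak suite:  ", is_killed([weak_test], mutant))
print("killed by strong suite:", is_killed([weak_test, strong_test], mutant))
```

A mutant that survives, as with the weak suite here, points at a gap in the tests. Real programs admit a very large number of such mutants, which is why choosing the relevant ones matters.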

New ideas

  1. Logging strategies. Identifying where to log is a hard task in large systems: on the one hand, you don’t want to log too much; on the other hand, if you don’t log an important part of the code, you might lack the information needed to debug a crash. Researchers have been empirically studying how developers decide where to log [8], and have been proposing supervised ML techniques to suggest improvements to log lines [9] [10]. In this project, you will study whether NLP-based approaches provide better results.

  2. Anomaly detection in logs. Under construction.

  3. Log reduction. Modern software systems generate lots of runtime information that developers need to examine in order to identify the causes of failures when they happen. In this project, you will build a tool that, given a log and a label (e.g., pass/fail), learns a model that automatically identifies the important lines in an input log.

  4. Code refactoring. Maintaining (bad) source code is not an easy task. And although industry has widely adopted linters, they have a well-known problem: the number of false positives [11]. We conjecture that ML-based techniques will be able to provide more useful refactoring recommendations to developers. In this task, you will train ML models to recommend (or maybe even automatically apply) refactorings. See the RefactoringMiner tool, which might help you in collecting real-world refactorings.

  5. Flaky tests. Flaky tests are tests that present non-deterministic behavior (i.e., tests that sometimes pass and sometimes fail). Mark Harman, Facebook’s senior scientist, mentioned that flaky tests are an important problem at Facebook. Google says that 1.5% of their 4.2M tests present flaky behavior at some point [12]. Researchers have been empirically investigating the problem. Luo et al. [13] noticed that async wait, concurrency, and test order dependency are the most common causes of flakiness. Lam et al. [14] recently developed iDFlakies, a tool that aims at identifying tests that are flaky due to their execution order. We ask: can ML help us identify flaky tests?

  6. Algorithm tagging. Whenever you solve coding challenges, e.g., the ones from Codeforces, you have to choose a strategy: will you apply dynamic programming? Will you apply brute force? Does it involve probabilities? Labeling a piece of code with such tags might be really useful for education. Or, somewhat related, given the textual description of the problem, can we suggest solution strategies? The closest current (non-ML) related work on this topic aims at inferring algorithmic complexity from Java bytecode [15].

  7. Programming styles. A recent paper [16] in the PL field caused quite a stir by (very vocally) refuting the findings of a Comm. ACM highlight paper [17]. Both papers try to quantify the effect of programming language choice on the error-proneness of software. However, their approach is too coarse-grained, as we can write any style of code in any programming language. What is needed is a more fine-grained approach that, e.g., links code styles (e.g., functional, imperative, or declarative) to bugs. In this project, you will build an automated program style detector by feeding an ML solution with code in functional and imperative styles and letting it learn to differentiate between the two.
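As an illustration of what "the same code in two styles" can look like for the programming styles idea, the sketch below writes one computation imperatively and functionally, then counts a few syntax-tree node types in each version. Everything here is an assumption made for illustration: the example function, the choice of node types, and the `style_features` helper are hypothetical, and hand-picked counts like these are at best a naive baseline or an input to a classifier, not a learned style detector.

```python
import ast
from collections import Counter

# The same computation written in two styles (hypothetical examples).
IMPERATIVE = """
def sum_even_squares(xs):
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x * x
    return total
"""

FUNCTIONAL = """
def sum_even_squares(xs):
    return sum(map(lambda x: x * x, filter(lambda x: x % 2 == 0, xs)))
"""

# Assumed, hand-picked indicators of each style.
IMPERATIVE_NODES = (ast.For, ast.While, ast.Assign, ast.AugAssign)
FUNCTIONAL_NODES = (ast.Lambda, ast.ListComp, ast.GeneratorExp)


def style_features(source):
    """Return (imperative_count, functional_count) for a code snippet."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, IMPERATIVE_NODES):
            counts["imperative"] += 1
        elif isinstance(node, FUNCTIONAL_NODES):
            counts["functional"] += 1
    return counts["imperative"], counts["functional"]


print("imperative version:", style_features(IMPERATIVE))
print("functional version:", style_features(FUNCTIONAL))
```

Feature vectors of this kind could be fed to any off-the-shelf classifier. The more interesting question for the project is whether a model can learn such distinctions directly from code, without hand-written rules.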



[1] X. Chen, C. Liu, and D. Song, “Tree-to-tree neural networks for program translation,” in Advances in Neural Information Processing Systems, 2018, pp. 2547–2557.

[2] V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli, “When code completion fails: A case study on real-world completions,” in Proceedings of the 41st International Conference on Software Engineering, 2019, pp. 960–970.

[3] V. J. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, “Deep learning type inference,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 152–162.

[4] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016, pp. 631–642.

[5] J. Liu, S. Kim, V. Murali, S. Chaudhuri, and S. Chandra, “Neural query expansion for code search,” in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, 2019, pp. 29–37.

[6] G. Petrović and M. Ivanković, “State of mutation testing at Google,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, 2018, pp. 163–171.

[7] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk, “Learning how to mutate source code from bug-fixes,” arXiv preprint arXiv:1812.10772, 2018.

[8] Q. Fu et al., “Where do developers log? An empirical study on logging practices in industry,” in Companion Proceedings of the 36th International Conference on Software Engineering, 2014, pp. 24–33.

[9] H. Li, W. Shang, Y. Zou, and A. E. Hassan, “Towards just-in-time suggestions for log changes,” Empirical Software Engineering, vol. 22, no. 4, pp. 1831–1865, 2017.

[10] H. Li, W. Shang, and A. E. Hassan, “Which log level should developers choose for a new logging statement?” Empirical Software Engineering, vol. 22, no. 4, pp. 1684–1716, 2017.

[11] B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge, “Why don’t software developers use static analysis tools to find bugs?” in Proceedings of the 2013 International Conference on Software Engineering, 2013, pp. 672–681.

[12] J. Listfield, “Where do our flaky tests come from?”, 2017.

[13] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 643–653.

[14] W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie, “iDFlakies: A framework for detecting and partially classifying flaky tests,” in 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), 2019, pp. 312–322.

[15] E. Albert, P. Arenas, S. Genaim, G. Puebla, and D. Zanardini, “Cost analysis of Java bytecode,” in European Symposium on Programming, 2007, pp. 157–172.

[16] E. D. Berger, C. Hollenbeck, P. Maj, O. Vitek, and J. Vitek, “On the impact of programming languages on code quality,” arXiv preprint arXiv:1901.10220, 2019.

[17] B. Ray, D. Posnett, V. Filkov, and P. Devanbu, “A large scale study of programming languages and code quality in GitHub,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 155–165.