Talk: Neural Approaches in Software Engineering: Pretraining and Vocabulary

Romain Robbes
March 18, 2019, 16:00
Grace Hopper Room No. 307, Edificio Poniente
Prof. Alexandre Bergel


In this talk, we discuss two issues that make neural approaches more difficult to apply in software engineering, and propose solutions to them. The first issue is that many labelled software engineering datasets are small. The reason is that high-quality labels must often be produced manually, a labour-intensive effort that only experts can perform, and which is thus costly. As a consequence, some software engineering datasets are limited to a few hundred or a few thousand data points, which limits the performance of neural approaches and makes them likely to overfit. We show that recent transfer learning results from Natural Language Processing carry over to the software engineering domain. These approaches leverage a large unlabelled corpus of data in a pre-training phase to train a neural language model, which is then converted and fine-tuned on a small labelled dataset, yielding significant gains in classification performance.
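The pretrain-then-fine-tune workflow described above can be sketched in miniature. The toy below is a library-free illustration of the idea only, not the speaker's actual method: "pretraining" here is just learning bigram statistics from a large unlabelled corpus, and "fine-tuning" is fitting a threshold classifier on a tiny labelled set on top of that pretrained representation. All corpora and function names are illustrative assumptions.

```python
# Toy sketch of transfer learning: pretrain on unlabelled data,
# then fine-tune a classifier on a small labelled dataset.
# Everything here is an illustrative assumption, not the talk's actual setup.
from collections import Counter

def pretrain_lm(unlabelled_corpus):
    """'Pretraining': learn bigram statistics from plentiful unlabelled text."""
    counts = Counter()
    for text in unlabelled_corpus:
        tokens = text.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def featurize(text, lm_counts):
    """Reuse the pretrained model: score a text by how familiar its bigrams are."""
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return sum(lm_counts[b] for b in bigrams) / len(bigrams)

def fine_tune(labelled_data, lm_counts):
    """'Fine-tuning': fit a decision threshold on a small labelled set,
    on top of the pretrained representation."""
    scored = [(featurize(t, lm_counts), y) for t, y in labelled_data]
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    return (min(pos) + max(neg)) / 2

# Large unlabelled corpus for pretraining ...
unlabelled = ["the build failed again", "the build passed", "tests failed on ci"] * 10
lm = pretrain_lm(unlabelled)

# ... and a tiny labelled dataset for fine-tuning (1 = "about a failure").
labelled = [("the build failed", 1), ("tests failed on ci", 1),
            ("zebra quantum pancake", 0)]
threshold = fine_tune(labelled, lm)

def classify(text):
    return int(featurize(text, lm) > threshold)

print(classify("the build failed on ci"))  # → 1
```

A real instantiation would replace the bigram counts with a neural language model (e.g. an LSTM) and the threshold with a fine-tuned classification head, but the division of labour between the large unlabelled corpus and the small labelled one is the same.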


The second issue relates to modelling source code. Many approaches apply NLP techniques to source code, since there is a straightforward mapping from sequences of tokens in a programming language to sequences of tokens in natural language. These approaches are attractive because results have shown programming languages to be even more repetitive than natural ones. However, modelling source code in this way breaks down at scale: the size of the vocabulary grows linearly with the size of the corpus, which rules out a pre-training approach such as the one proposed in the first part. To address this, we show that careful modelling choices in terms of vocabulary allow a drastic reduction in vocabulary size (by up to three orders of magnitude in our experiments), and allow us to train neural language models on very large amounts of source code (more than 10,000 GitHub projects).
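One plausible instance of the "careful vocabulary modelling" mentioned above is subword segmentation, e.g. byte-pair encoding (BPE): instead of one vocabulary entry per unique identifier (so the vocabulary grows with every new name in the corpus), tokens are built from a fixed budget of learned character merges, which caps the vocabulary size regardless of corpus growth. The minimal sketch below is an assumption about the technique, not the talk's exact algorithm; the corpus and merge budget are illustrative.

```python
# Minimal byte-pair-encoding (BPE) sketch: learn merge rules from an
# identifier-heavy corpus, then segment any token (seen or unseen) into
# subwords. The vocabulary is bounded by characters + merge budget.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency dict."""
    vocab = {tuple(w) + ("</w>",): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges in order to split a token into subwords."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Token-level vocabulary would need a new entry for every new identifier;
# the subword vocabulary is capped by the merge budget.
corpus = Counter(["getName", "getValue", "setName", "setValue", "getName"])
merges = learn_bpe(corpus, 10)
print(segment("getCount", merges))  # unseen identifier, no out-of-vocabulary
```

Because any unseen identifier decomposes into known subwords (in the worst case, single characters), there are no out-of-vocabulary tokens, which is what makes pretraining on thousands of projects feasible.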