Open Vocabulary Learning on Source Code with a Graph-Structured Cache

10/18/2018
by   Milan Cvitkovic, et al.

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters, with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task (with over 100% relative improvement on the latter), at the cost of a moderate increase in computation time.
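To make the cache structure concrete, here is a minimal Python sketch of the idea described in the abstract: one cache node per distinct word, with an edge to each place that word occurs in the code. It is an illustration under stated assumptions, not the authors' implementation. The function name `build_graph_structured_cache`, the regex-based notion of a "word", and the tuple layout of an occurrence are all hypothetical choices made for this example, and the Graph-Neural-Network model that would consume the graph is omitted.

```python
import re
from collections import defaultdict

# Assumption: a "word" is any identifier-like token. The abstract does not
# specify how words are extracted, so this regex is illustrative only.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_graph_structured_cache(source):
    """Return (occurrences, cache) for a source string.

    occurrences: list of (word, line_number, column), one per occurrence node.
    cache: maps each distinct word (a cache node) to the indices of its
           occurrence nodes, i.e. the word-to-occurrence edges.
    """
    occurrences = []
    cache = defaultdict(list)
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in IDENTIFIER.finditer(line):
            word = match.group()
            # Add an edge from the word's cache node to this new occurrence.
            cache[word].append(len(occurrences))
            occurrences.append((word, line_no, match.start()))
    return occurrences, dict(cache)


snippet = "total = price + tax\nprint(total)"
occurrences, cache = build_graph_structured_cache(snippet)
for word, edges in cache.items():
    print(word, "->", [occurrences[i][1:] for i in edges])
```

Because a previously unseen name such as `total` simply becomes a new cache node when it is first encountered, the vocabulary does not need to be fixed in advance, which is the property the abstract highlights.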


research
10/23/2020

Neural Code Completion with Anonymized Variable Names

Source code processing heavily relies on the methods widely used in natu...
research
10/23/2020

A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

There is an emerging interest in the application of deep learning models...
research
03/17/2020

Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Statistical language modeling techniques have successfully been applied ...
research
01/29/2023

Composer's Assistant: Interactive Transformers for Multi-Track MIDI Infilling

We consider the task of multi-track MIDI infilling when arbitrary (track...
research
01/18/2019

Chinese Word Segmentation: Another Decade Review (2007-2017)

This paper reviews the development of Chinese word segmentation (CWS) in...
research
04/06/2023

GI Software with fewer Data Cache Misses

By their very name caches are often overlooked and yet play a vital role...
research
03/13/2019

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied ...
