Pre-trained Contextual Embedding of Source Code

12/21/2019
by Aditya Kanade, et al.

The source code of a program not only serves as a formal description of an executable task; it also communicates developer intent in a human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation, which makes it possible to apply sequence-modeling approaches, shown to be effective in natural-language processing, to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and related work have further shown that pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare fine-tuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by a margin of 2.9-22%, even when fine-tuned on smaller datasets and for fewer epochs. We further evaluate CuBERT's effectiveness on a joint classification, localization, and repair task involving the prediction of two pointers.
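To make the baseline side of the setup concrete, below is a minimal sketch of training Word2Vec embeddings over Python token streams, roughly in the spirit of the paper's Word2Vec baseline. It is not the authors' pipeline: it uses the standard-library tokenizer, a tiny inline corpus instead of the GitHub corpus, illustrative hyperparameters, and assumes gensim >= 4.0.

```python
# Hedged sketch: lex Python source into tokens and train baseline Word2Vec
# embeddings on the token sequences. Corpus, vocabulary filtering, and
# hyperparameters are placeholders, not the settings used in the paper.
import io
import tokenize

from gensim.models import Word2Vec


def python_tokens(source: str) -> list[str]:
    """Lex a Python source string into a flat list of token strings."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip structural tokens (NEWLINE, INDENT, ENDMARKER) with no visible text.
        if tok.string.strip():
            tokens.append(tok.string)
    return tokens


# Tiny stand-in corpus; the paper pre-trains on a large GitHub Python corpus.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "def greet(name):\n    print('hello', name)\n",
]
sentences = [python_tokens(src) for src in corpus]

# Small embedding size chosen only for the sketch.
model = Word2Vec(sentences=sentences, vector_size=64, window=5, min_count=1, epochs=10)
print(model.wv.most_similar("def", topn=3))
```

The fine-tuned CuBERT model itself is a BERT encoder pre-trained with masked language modeling on the same tokenized corpus; the Word2Vec vectors above serve only to initialize the non-contextual baseline sequence models it is compared against.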

