Contrastive Code Representation Learning

07/09/2020
by Paras Jain, et al.

Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on the raw text of programs. ContraCode optimizes for a representation that is invariant to semantics-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not the form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines by up to 8% and the top-1 accuracy of type inference baselines by up to 13%. ContraCode achieves 9% higher accuracy than the current state-of-the-art static type analyzer for TypeScript.
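To make the pretext task concrete, below is a minimal sketch of an InfoNCE-style contrastive objective over program variants, assuming PyTorch. The toy rename_identifier transform, the info_nce_loss helper, and all other names here are hypothetical illustrations rather than the authors' implementation; ContraCode generates its variants with an automated source-to-source compiler, not string replacement.

```python
# Minimal sketch (PyTorch) of an InfoNCE-style contrastive pretext task over
# program variants, in the spirit of the description above. The program encoder
# is omitted; rename_identifier and all names are hypothetical stand-ins, not
# the authors' code.
import torch
import torch.nn.functional as F


def rename_identifier(source: str, old: str, new: str) -> str:
    """Toy semantics-preserving transform: rename one identifier.

    Stand-in for compiler transforms such as variable renaming,
    minification, or dead-code insertion; plain string replacement
    is only illustrative and not scope-aware.
    """
    return source.replace(old, new)


def info_nce_loss(anchor_emb: torch.Tensor,
                  variant_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Score each anchor against its own variant vs. in-batch negatives."""
    a = F.normalize(anchor_emb, dim=1)    # (B, D) anchor program embeddings
    v = F.normalize(variant_emb, dim=1)   # (B, D) embeddings of transformed variants
    logits = a @ v.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(a.size(0))     # row i's positive is column i
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    print(rename_identifier("function add(a, b) { return a + b; }", "add", "f0"))

    # Pretend embeddings in place of a real program encoder.
    batch, dim = 8, 128
    anchors = torch.randn(batch, dim)
    variants = anchors + 0.1 * torch.randn(batch, dim)  # noisy "augmented views"
    print(info_nce_loss(anchors, variants).item())
```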

