XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence

06/16/2022
by Ming Zhu, et al.

Recent advances in machine learning have significantly improved the understanding of source code data and achieved good performance on a number of downstream tasks. Open source repositories like GitHub enable this process with rich unlabeled code data. However, the lack of high-quality labeled data has largely hindered progress on several code-related tasks, such as program translation, summarization, synthesis, and code search. This paper introduces XLCoST (Cross-Lingual Code SnippeT dataset), a new benchmark dataset for cross-lingual code intelligence. Our dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English) and supports 10 cross-lingual code tasks. To the best of our knowledge, it is the largest parallel dataset for source code in terms of both size and number of languages. We also report the performance of several state-of-the-art baseline models on each task. We believe this new dataset can be a valuable asset for the research community and can facilitate the development and validation of new methods for cross-lingual code intelligence.


