On the Evaluation of Neural Code Translation: Taxonomy and Benchmark

by Mingsheng Jiao, et al.

In recent years, neural code translation has gained increasing attention. While most research focuses on improving model architectures and training processes, we notice that the evaluation process and benchmarks for code translation models are severely limited: they primarily treat source code as natural language and provide a single holistic accuracy score, disregarding the full spectrum of model capabilities across different translation types and levels of complexity. In this paper, we present a comprehensive investigation of four state-of-the-art models and analyze in depth the advantages and limitations of three existing benchmarks. Based on the empirical results, we develop a taxonomy that categorizes code translation tasks into four primary types according to their complexity and knowledge dependence: token level (type 1), syntactic level (type 2), library level (type 3), and algorithm level (type 4). We then conduct a thorough analysis of how existing approaches perform across these four categories. Our findings indicate that while state-of-the-art code translation models excel at type-1 and type-2 translations, they struggle with knowledge-dependent ones such as type-3 and type-4. Existing benchmarks are biased towards trivial translations, such as keyword mapping. To overcome these limitations, we construct G-TransEval, a new benchmark, by manually curating type-3 and type-4 translation pairs and unit test cases. Results on our new benchmark suggest that G-TransEval reveals the capabilities of code translation models more comprehensively and at a finer granularity, and thus provides a more rigorous evaluation. Our study also yields further findings and suggestions for future research, such as building type-3 and type-4 training data and ensembling multiple pretraining approaches.
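To illustrate why a per-type breakdown can be more informative than a single holistic score, here is a minimal sketch. It is not the paper's evaluation code: the function names, the `(type_id, passed)` record format, and the sample outcomes are all hypothetical, and only the four type labels come from the abstract.

```python
from collections import defaultdict

# The four taxonomy levels named in the abstract; the dict itself is illustrative.
TYPES = {1: "token", 2: "syntactic", 3: "library", 4: "algorithm"}


def per_type_pass_rate(results):
    """Return {type_id: fraction passed} for (type_id, passed) records.

    Assumption: each translation task has already been labeled with one
    of the four types and judged pass/fail (e.g. by unit tests).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for type_id, passed in results:
        if type_id not in TYPES:
            raise ValueError(f"unknown translation type: {type_id}")
        totals[type_id] += 1
        if passed:
            passes[type_id] += 1
    return {t: passes[t] / totals[t] for t in sorted(totals)}


def holistic_pass_rate(results):
    """Return the single overall pass rate, ignoring translation type."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for _, passed in results if passed) / len(results)


# Made-up outcomes for illustration only; these are not results from the paper.
sample = [
    (1, True), (1, True),
    (2, True), (2, True),
    (3, False), (3, True),
    (4, False), (4, False),
]

if __name__ == "__main__":
    print("holistic:", holistic_pass_rate(sample))
    for type_id, rate in per_type_pass_rate(sample).items():
        print(f"type {type_id} ({TYPES[type_id]}): {rate}")
```

On this made-up sample the overall score sits in the middle, while the per-type view shows that every failure falls on the library-level and algorithm-level tasks, which is the kind of distinction a holistic score hides.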


