CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation

by Yihong Dong et al.

General-purpose code generation (GPCG) aims to automatically convert a natural language description into source code in a general-purpose language (GPL) such as Python. Intrinsically, code generation is a particular type of text generation that produces grammatically defined text, namely code. However, existing sequence-to-sequence (Seq2Seq) approaches neglect grammar rules when generating GPL code. In this paper, we make the first attempt to consider grammatical Seq2Seq (GSS) models for GPCG and propose CODEP, a GSS code generation framework equipped with a pushdown automaton (PDA) module. The PDA module (PDAM) contains a PDA and an algorithm that bounds the model's prediction at each generation step to a valid token set, thereby ensuring the grammatical correctness of the generated code. During training, CODEP additionally incorporates state representations and a state prediction task, which leverage PDA states to help CODEP comprehend the PDA's parsing process. At inference time, our method outputs code satisfying grammatical constraints via PDAM and the joint prediction of PDA states. Furthermore, PDAM can be applied directly to Seq2Seq models without any additional training. To evaluate the effectiveness of the proposed method, we construct a PDA for Python, the most popular GPL, and conduct extensive experiments on four benchmark datasets. Experimental results demonstrate the superiority of CODEP over state-of-the-art approaches without pre-training, and PDAM also achieves significant improvements over pre-trained models.
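The core idea of the PDA module, restricting each decoding step to a grammatically valid token set, can be illustrated with a toy example. The sketch below is not the paper's implementation: it uses a minimal pushdown automaton for balanced parentheses (the stack of unmatched `(` tokens serving as the PDA stack) rather than a full Python grammar, and a plain score function standing in for the Seq2Seq model's logits. The names `valid_next_tokens` and `constrained_decode` are illustrative, not from the paper.

```python
# Toy sketch of PDA-constrained decoding (illustrative, not CODEP itself).
# The PDA here accepts balanced-parenthesis strings; the stack holds
# unmatched "(" tokens. At each step, the decoder may only choose among
# tokens that keep the prefix grammatically valid.

VOCAB = ("(", ")", "x", "<eos>")

def valid_next_tokens(stack):
    """Return the subset of VOCAB that keeps the sequence PDA-valid."""
    allowed = set()
    for tok in VOCAB:
        if tok == "(":
            allowed.add(tok)          # pushing is always legal
        elif tok == ")":
            if stack:                 # pop only if a "(" is open
                allowed.add(tok)
        elif tok == "<eos>":
            if not stack:             # may terminate only when balanced
                allowed.add(tok)
        else:
            allowed.add(tok)          # ordinary terminal symbol
    return allowed

def constrained_decode(score_fn, max_len=10):
    """Greedy decoding where each step is masked to the PDA-valid set."""
    stack, out = [], []
    for _ in range(max_len):
        allowed = valid_next_tokens(stack)
        # Argmax over the valid set only: invalid tokens are masked out,
        # which is the role PDAM plays for the Seq2Seq model's predictions.
        tok = max(allowed, key=score_fn)
        if tok == "<eos>":
            break
        out.append(tok)
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            stack.pop()
    return "".join(out)

# A fixed score table stands in for the model's per-step logits.
scores = {"(": 0.9, ")": 0.95, "x": 0.5, "<eos>": 0.2}
print(constrained_decode(scores.get, max_len=4))  # -> ()()
```

Note that masking only guarantees that every generated *prefix* is valid; a full system must also ensure the sequence can be completed (e.g., by forcing closing tokens as the length budget runs out), which is part of what the paper's PDA construction and algorithm handle for real Python grammar.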

