Private-Library-Oriented Code Generation with Large Language Models

by Daoguang Zan, et al.

Large language models (LLMs), such as Codex and GPT-4, have recently shown remarkable code generation abilities, significantly boosting coding efficiency. This paper investigates the use of LLMs for code generation with private libraries, which are widely used in everyday programming. Despite the capabilities of LLMs, generating code that invokes private APIs remains a formidable challenge, because the models have no exposure to these private libraries during pre-training. To address this challenge, we propose a novel framework that emulates how programmers write private code. The framework comprises two modules: APIFinder first retrieves potentially useful APIs from the API documentation, and APICoder then uses the retrieved APIs to generate private code. Specifically, APIFinder employs vector retrieval techniques and allows user involvement in the retrieval process. APICoder can directly use off-the-shelf code generation models. To further cultivate explicit proficiency in invoking APIs provided in prompts, we continually pre-train a reinforced version of APICoder, named CodeGenAPI. Our goal is to train both modules on a large collection of public libraries so that they generalize to private ones. We also create four private library benchmarks, TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval, and handcraft test cases for each benchmark to support comprehensive evaluation. Extensive experiments on the four benchmarks consistently confirm the effectiveness of our approach. We also conduct further analysis to gain additional insights.
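The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that APIFinder uses vector retrieval, so the bag-of-words cosine similarity below is a stand-in for whatever learned embeddings the paper uses. The `privlib` library, its API names and descriptions, and the prompt layout are all hypothetical. The final generation step (passing the prompt to a code generation model) is left out.

```python
import math
import re
from collections import Counter

# Hypothetical private-library documentation: API name -> one-line description.
API_DOCS = {
    "privlib.load_table": "Read a CSV file into a table.",
    "privlib.drop_missing": "Remove rows that contain missing values.",
    "privlib.plot_hist": "Draw a histogram of one column.",
}


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (assumption, not the paper's encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_apis(query, api_docs, top_k=2):
    """Retrieval step: rank documented APIs by similarity to the task description."""
    q = embed(query)
    ranked = sorted(
        api_docs,
        key=lambda name: cosine(q, embed(name + " " + api_docs[name])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, api_docs, top_k=2):
    """Generation-side input: place the retrieved API docs ahead of the task as comments."""
    lines = ["# Potentially useful APIs:"]
    for name in find_apis(query, api_docs, top_k):
        lines.append(f"# {name}: {api_docs[name]}")
    lines.append(f"# Task: {query}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_prompt("remove rows with missing values", API_DOCS))
```

In this sketch, the user involvement mentioned in the abstract would correspond to letting a person edit the list returned by `find_apis` before the prompt is built.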




When Language Model Meets Private Library

With the rapid development of pre-training techniques, a number of langu...

CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation

Code generation is a longstanding challenge, aiming to generate a code s...

ToolCoder: Teach Code Generation Models to use API search tools

Automatically generating source code from natural language descriptions ...

CodeT: Code Generation with Generated Tests

The task of generating code solutions for a given programming problem ca...

Towards Enhancing In-Context Learning for Code Generation

In-context learning (ICL) with pre-trained language models (PTLMs) has s...

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

The task of repository-level code completion is to continue writing the ...

Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures

The transformative influence of Large Language Models (LLMs) is profound...
