An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions

by   Hammond Pearce, et al.

There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described `AI pair programmer', GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot's code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE's "Top 25" list). We explore Copilot's performance on three distinct code generation axes – examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,692 programs. Of these, we found approximately 40


page 1

page 4

page 6


Large Language Models and Simple, Stupid Bugs

With the advent of powerful neural language models, AI-based systems to ...

Systematically Finding Security Vulnerabilities in Black-Box Code Generation Models

Recently, large language models for code generation have achieved breakt...

Natural Language to Code Generation in Interactive Data Science Notebooks

Computational notebooks, such as Jupyter notebooks, are interactive comp...

Evaluating Large Language Models Trained on Code

We introduce Codex, a GPT language model fine-tuned on publicly availabl...

Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs?

Human developers can produce code with cybersecurity weaknesses. Can eme...

Towards the Scalable Evaluation of Cooperativeness in Language Models

It is likely that AI systems driven by pre-trained language models (PLMs...

Please sign up or login with your details

Forgot password? Click here to reset