StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

03/26/2018
by Ziyu Yao, et al.

Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to be of low quality. In this paper, we investigate a new problem: systematically mining question-code pairs from Stack Overflow, in contrast to collecting them heuristically. The problem is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network that captures both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in the Python and SQL domains, our framework substantially outperforms heuristic methods, with at least 15% higher accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date, with 148K Python and 120K SQL question-code pairs automatically mined from SO using our framework. In various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language.
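To make the problem formulation concrete, the following is a minimal sketch of how an answer post might be turned into per-snippet prediction inputs with two views: the snippet itself and its surrounding text. It is not the paper's model or data format. The `Candidate` class, the `build_candidates` helper, the `("text" | "code", content)` block representation, and the trivial `accept_all_baseline` heuristic are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One prediction input: is `code` a standalone solution to `question`?"""
    question: str      # question title
    code: str          # the snippet itself (programming-content view)
    text_before: str   # answer text right before the snippet (textual-context view)
    text_after: str    # answer text right after the snippet (textual-context view)


def build_candidates(question: str, blocks: List[Tuple[str, str]]) -> List[Candidate]:
    """Split an answer into candidates.

    `blocks` is an assumed representation of an answer post: an ordered list
    of (kind, content) pairs where kind is "text" or "code".
    """
    candidates = []
    for i, (kind, content) in enumerate(blocks):
        if kind != "code":
            continue
        # Use the neighbouring text blocks, if any, as the snippet's context.
        before = blocks[i - 1][1] if i > 0 and blocks[i - 1][0] == "text" else ""
        after = (
            blocks[i + 1][1]
            if i + 1 < len(blocks) and blocks[i + 1][0] == "text"
            else ""
        )
        candidates.append(Candidate(question, content, before, after))
    return candidates


def accept_all_baseline(candidate: Candidate) -> bool:
    """Hypothetical heuristic: treat every snippet as a standalone solution."""
    return True


# Example answer with two snippets; only the second one answers the question.
blocks = [
    ("text", "First import the module:"),
    ("code", "import json"),
    ("text", "Then parse the string:"),
    ("code", "data = json.loads(s)"),
]
candidates = build_candidates("How do I parse JSON in Python?", blocks)
labels = [accept_all_baseline(c) for c in candidates]
```

In this example the heuristic labels both snippets as solutions, although `import json` alone does not answer the question. A learned classifier that reads both the code and its surrounding text is meant to avoid exactly this kind of error.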


