Evaluating and Explaining Large Language Models for Code Using Syntactic Structures

by   David N. Palacio, et al.

Large Language Models (LLMs) for code are a family of high-parameter, transformer-based neural networks pre-trained on massive datasets of both natural and programming languages. These models are rapidly being employed in commercial AI-based developer tools, such as GitHub Copilot. However, measuring and explaining their effectiveness on programming tasks is a challenging proposition, given their size and complexity. The methods for evaluating and explaining LLMs for code are inextricably linked. That is, to explain a model's predictions, those predictions must be reliably mapped to fine-grained, understandable concepts. Once this mapping is achieved, new methods for detailed model evaluations become possible. However, most current explainability techniques and evaluation benchmarks focus on model robustness or individual task performance, rather than on interpreting model predictions. To this end, this paper introduces ASTxplainer, an explainability method specific to LLMs for code that enables both new methods for LLM evaluation and visualizations of LLM predictions that aid end-users in understanding model behavior. At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes by extracting and aggregating normalized model logits within AST structures. To demonstrate the practical benefit of ASTxplainer, we illustrate the insights that our framework can provide by performing an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects. Additionally, we perform a user study examining the usefulness of an ASTxplainer-derived visualization of model predictions aimed at enabling model users to explain predictions. The results of these studies illustrate the potential for ASTxplainer to provide insights into LLM effectiveness and to aid end-users in understanding predictions.
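The core alignment step described above can be sketched in miniature: each predicted token, identified by its character span, is mapped to its smallest enclosing AST node, and the normalized probabilities are aggregated per node type. The sketch below is illustrative only, not ASTxplainer's actual implementation: it uses Python's built-in `ast` module as a stand-in parser, and the token offsets and log-probabilities are hypothetical values of the kind an LLM would produce.

```python
import ast
import math

def node_spans(tree):
    """Collect (node_type, start, end) character spans for AST nodes
    of a single-line snippet (column offsets double as char offsets)."""
    spans = []
    for node in ast.walk(tree):
        end = getattr(node, "end_col_offset", None)
        if hasattr(node, "col_offset") and end is not None:
            spans.append((type(node).__name__, node.col_offset, end))
    return spans

def aggregate(tokens, spans):
    """Assign each token to its smallest enclosing AST node and average
    the normalized (exponentiated) log-probabilities per node type."""
    per_node = {}
    for start, end, logprob in tokens:
        enclosing = [s for s in spans if s[1] <= start and end <= s[2]]
        if not enclosing:
            continue  # token falls outside any recorded node span
        name = min(enclosing, key=lambda s: s[2] - s[1])[0]
        per_node.setdefault(name, []).append(math.exp(logprob))
    return {name: sum(ps) / len(ps) for name, ps in per_node.items()}

source = "x = a + b"
tree = ast.parse(source)
# Hypothetical per-token log-probabilities from an LLM: (start, end, logprob)
tokens = [(0, 1, -0.1), (4, 5, -0.7), (6, 7, -0.3), (8, 9, -1.2)]
scores = aggregate(tokens, node_spans(tree))
# `scores` now holds an average predicted probability per AST node type,
# e.g. one entry for Name nodes and one for the BinOp node.
```

In the full approach, a production-grade parser (e.g. tree-sitter) and real model logits would replace the toy parser and hand-written values, and aggregation can be performed at any level of the syntactic hierarchy rather than only at the smallest enclosing node.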



