A Function Interpretation Benchmark for Evaluating Interpretability Methods

09/07/2023
by Sarah Schwettmann, et al.

Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in the loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and natural-language descriptions of function behavior. We find that an off-the-shelf LM augmented with only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results show that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.
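To make the setup concrete, the sketch below is hypothetical: the function, names, and values are illustrative and not taken from the FIND release. It shows the flavor of a procedurally constructed numeric function with a local corruption, probed through the kind of black-box interface an LM interpreter would use.

def mystery_function(x: float) -> float:
    """Globally f(x) = 2x + 3, with a hard-coded corruption on [10, 12]."""
    if 10.0 <= x <= 12.0:  # the local corruption a global description misses
        return -5.0
    return 2.0 * x + 3.0

def probe(f, inputs):
    """Black-box access: the interpreter sees only (input, output) pairs."""
    return [(x, f(x)) for x in inputs]

# Round 1: a coarse sweep. Every sample here is consistent with the
# global hypothesis f(x) = 2x + 3.
print(probe(mystery_function, [-100.0, -1.0, 0.0, 1.0, 100.0]))

# Round 2: a targeted follow-up sweep exposes the corrupted interval
# that a purely global description would miss.
print(probe(mystery_function, [9.0, 10.0, 11.0, 12.0, 13.0]))

The coarse first sweep illustrates why LM-based descriptions tend to capture only global behavior: unless the interpreter proposes experiments near the corrupted region, nothing in the observed data contradicts the simple linear hypothesis.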

