Evaluating Neuron Interpretation Methods of NLP Models

by   Yimin Fan, et al.

Neuron Interpretation has gained traction in the field of interpretability, and have provided fine-grained insights into what a model learns and how language knowledge is distributed amongst its different components. However, the lack of evaluation benchmark and metrics have led to siloed progress within these various methods, with very little work comparing them and highlighting their strengths and weaknesses. The reason for this discrepancy is the difficulty of creating ground truth datasets, for example, many neurons within a given model may learn the same phenomena, and hence there may not be one correct answer. Moreover, a learned phenomenon may spread across several neurons that work together – surfacing these to create a gold standard challenging. In this work, we propose an evaluation framework that measures the compatibility of a neuron analysis method with other methods. We hypothesize that the more compatible a method is with the majority of the methods, the more confident one can be about its performance. We systematically evaluate our proposed framework and present a comparative analysis of a large set of neuron interpretation methods. We make the evaluation framework available to the community. It enables the evaluation of any new method using 20 concepts and across three pre-trained models.The code is released at https://github.com/fdalvi/neuron-comparative-analysis


page 7

page 8

page 15

page 16

page 17


NeuroX Library for Neuron Analysis of Deep NLP Models

Neuron analysis provides insights into how knowledge is structured in re...

Fine-grained Interpretation and Causation Analysis in Deep NLP Models

This paper is a write-up for the tutorial on "Fine-grained Interpretatio...

Emergent Modularity in Pre-trained Transformers

This work examines the presence of modularity in pre-trained Transformer...

Neuron to Graph: Interpreting Language Model Neurons at Scale

Advances in Large Language Models (LLMs) have led to remarkable capabili...

Are Interpretations Fairly Evaluated? A Definition Driven Pipeline for Post-Hoc Interpretability

Recent years have witnessed an increasing number of interpretation metho...

RethinkCWS: Is Chinese Word Segmentation a Solved Task?

The performance of the Chinese Word Segmentation (CWS) systems has gradu...

Edge Deep Learning Model Protection via Neuron Authorization

With the development of deep learning processors and accelerators, deep ...

Please sign up or login with your details

Forgot password? Click here to reset