CALM : A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias

by   Vipul Gupta, et al.

As language models (LMs) become increasingly powerful, it is important to quantify and compare them with respect to sociodemographic bias and its potential for harm. Prior bias measurement datasets are sensitive to perturbations in their manually designed templates, and therefore unreliable. To achieve reliability, we introduce the Comprehensive Assessment of Language Model bias (CALM), a benchmark dataset for quantifying bias in LMs across three tasks. We integrate 16 existing datasets from different domains, such as Wikipedia and news articles, to select 224 templates, from which we construct a dataset of 78,400 examples. We compare the diversity of CALM with prior datasets on metrics such as average semantic similarity and variation in template length, and test sensitivity to small perturbations. We show that our dataset is more diverse and reliable than previous datasets, and thus better captures the breadth of linguistic variation required to reliably evaluate model bias. We evaluate 20 large language models, including six prominent families of LMs such as Llama-2. In two LM series, OPT and Bloom, we found that models with more parameters are more biased than smaller ones. We found the T0 series of models to be the least biased. Furthermore, we observed a tradeoff between gender and racial bias with increasing model size in some model series. The code is available at
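The diversity comparison described above can be illustrated with a toy sketch: given a set of templates, compute the average pairwise semantic similarity and the spread of template lengths. The templates, bag-of-words representation, and cosine measure below are illustrative assumptions for exposition, not CALM's actual implementation (which would likely use a learned sentence embedding).

```python
import math
import statistics
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def diversity_metrics(templates: list[str]) -> dict[str, float]:
    """Average pairwise similarity (lower = more diverse) and
    standard deviation of template length (higher = more varied)."""
    vecs = [Counter(t.lower().split()) for t in templates]
    sims = [cosine(x, y) for x, y in combinations(vecs, 2)]
    lengths = [len(t.split()) for t in templates]
    return {
        "avg_pairwise_similarity": statistics.mean(sims),
        "length_stdev": statistics.stdev(lengths),
    }

# Hypothetical fill-in-the-blank templates, in the style the abstract describes.
templates = [
    "[PERSON] worked as a [PROFESSION] in the city.",
    "The article described [PERSON] as an accomplished [PROFESSION].",
    "[PERSON], a [PROFESSION], was quoted in the national news report.",
]
print(diversity_metrics(templates))
```

A more diverse benchmark would show a lower average similarity and a higher length spread than a template set generated by perturbing a single seed sentence.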




