SLING: Sino Linguistic Evaluation of Large Language Models

10/21/2022
by Yixiao Song, et al.

To understand what kinds of linguistic knowledge are encoded by pretrained Chinese language models (LMs), we introduce the benchmark of Sino LINGuistics (SLING), which consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., "The keys are lost" vs. "The keys is lost"), and an LM should assign lower perplexity to the acceptable sentence. In contrast to the CLiMP dataset (Xiang et al., 2021), which also contains Chinese minimal pairs and was created by translating the vocabulary of the English BLiMP dataset, the minimal pairs in SLING are derived primarily by applying syntactic and lexical transformations to naturally occurring, linguist-annotated sentences from the Chinese Treebank 9.0, thus addressing severe issues in CLiMP's data generation process. We test 18 publicly available pretrained monolingual (e.g., BERT-base-zh, CPM) and multilingual (e.g., mT5, XLM) language models on SLING. Our experiments show that the average accuracy for LMs is far below human performance (69.7% vs. 97.1%), while BERT-base-zh achieves the highest accuracy (84.8%) of all tested LMs, even much larger ones. Additionally, we find that most LMs have a strong gender and number (singular/plural) bias, and they perform better on local phenomena than hierarchical ones.
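The scoring rule described in the abstract (the acceptable sentence of each pair should receive the lower perplexity) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the helper names `perplexity`, `minimal_pair_accuracy`, and `toy_scorer` are hypothetical, the log-probabilities in `TOY_LOG_PROBS` are made-up numbers rather than output from any real model, and the sketch assumes a scorer that returns one natural-log probability per token.

```python
import math
from typing import Callable, Iterable, List, Tuple


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity = exp of the mean negative natural-log token probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], List[float]],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs in which the
    acceptable sentence gets strictly lower perplexity."""
    correct = 0
    total = 0
    for acceptable, unacceptable in pairs:
        total += 1
        if perplexity(score(acceptable)) < perplexity(score(unacceptable)):
            correct += 1
    if total == 0:
        raise ValueError("need at least one minimal pair")
    return correct / total


# Hypothetical per-token log-probabilities, invented for illustration only.
TOY_LOG_PROBS = {
    "The keys are lost": [-1.0, -2.0, -1.5, -2.5],
    "The keys is lost": [-1.0, -2.0, -4.0, -2.5],
}


def toy_scorer(sentence: str) -> List[float]:
    """Stand-in for a real LM scorer (an assumption, not a library API)."""
    return TOY_LOG_PROBS[sentence]


if __name__ == "__main__":
    pairs = [("The keys are lost", "The keys is lost")]
    print(minimal_pair_accuracy(pairs, toy_scorer))
```

In practice, `toy_scorer` would be replaced by a function that queries a pretrained LM for token log-probabilities; how that is done for masked versus autoregressive models is outside what the abstract specifies.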

research
01/26/2021

CLiMP: A Benchmark for Chinese Language Model Evaluation

Linguistically informed analyses of language models (LMs) contribute to ...
research
10/16/2020

Linguistically-Informed Transformations (LIT): A Method for Automatically Generating Contrast Sets

Although large-scale pretrained language models, such as BERT and RoBERT...
research
10/05/2020

Investigating representations of verb bias in neural language models

Languages typically provide more than one grammatical construction to ex...
research
12/02/2019

BLiMP: A Benchmark of Linguistic Minimal Pairs for English

We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLi...
research
01/16/2022

COLD: A Benchmark for Chinese Offensive Language Detection

Offensive language detection and prevention becomes increasingly critical ...
research
12/02/2022

Event knowledge in large language models: the gap between the impossible and the unlikely

People constantly use language to learn about the world. Computational l...
research
11/18/2022

Context Variance Evaluation of Pretrained Language Models for Prompt-based Biomedical Knowledge Probing

Pretrained language models (PLMs) have motivated research on what kinds ...
