RuCoLA: Russian Corpus of Linguistic Acceptability

10/23/2022
by Vladislav Mikhailov, et al.

Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard (rucola-benchmark.com) to assess the linguistic competence of language models for Russian.
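The binary framing mentioned above (each sentence is labeled acceptable or unacceptable) and the use of a classifier to filter generated text can be sketched as follows. This is only an illustration under stated assumptions: `toy_score` is a hypothetical stand-in heuristic, not one of the paper's baselines, the function names are invented for this example, the two sentences are made up rather than drawn from RuCoLA, and plain accuracy is used purely for simplicity.

```python
from typing import Callable, List, Sequence


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for an acceptability classifier.

    Returns 0.0 if any word is immediately repeated, else 1.0.
    A real system would use a trained model here.
    """
    words = sentence.lower().split()
    has_repeat = any(a == b for a, b in zip(words, words[1:]))
    return 0.0 if has_repeat else 1.0


def predict(
    sentences: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[int]:
    """Map each sentence to a binary label: 1 = acceptable, 0 = unacceptable."""
    return [1 if score(s) >= threshold else 0 for s in sentences]


def filter_acceptable(
    sentences: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the sentences the scorer judges acceptable."""
    return [s for s in sentences if score(s) >= threshold]


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold binary labels."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


if __name__ == "__main__":
    # Invented examples: the second sentence repeats a word.
    generated = ["Кошка спит на диване.", "Кошка кошка спит на диване."]
    gold_labels = [1, 0]
    print(filter_acceptable(generated, toy_score))
    print(accuracy(gold_labels, predict(generated, toy_score)))
```

Passing the scorer in as a function keeps the filtering and evaluation logic independent of the model, so the toy heuristic could be swapped for any classifier that returns a score per sentence.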
