Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages

03/01/2021
by Rajaswa Patil, et al.

While there has been significant progress in developing NLU datasets and benchmarks for Indic languages, syntactic evaluation remains relatively underexplored. Unlike English, Indic languages have rich morphosyntax, grammatical gender, free linear word order, and highly inflectional morphology. In this paper, we introduce Vyākarana: a benchmark of gender-balanced Colorless Green sentences in Indic languages for the syntactic evaluation of multilingual language models. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We use the datasets from these tasks to probe five multilingual language models of varying architectures for syntax in Indic languages. Our results show that the token-level and sentence-level representations from the Indic language models (IndicBERT and MuRIL) do not capture the syntax of Indic languages as efficiently as those from the other, highly multilingual language models. Further, our layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R localize syntax in their middle layers, the Indic language models show no such syntactic localization.
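
As a rough illustration of what layer-wise probing means in practice, the sketch below fits one small classifier per layer and compares held-out accuracy across layers. It is not the authors' setup: the "hidden states" are synthetic Gaussian vectors rather than outputs of mBERT or any other model, the per-layer signal strengths in SIGNALS are invented so that the middle layer carries the most label information, and a nearest-centroid classifier stands in for whatever probe the paper uses. All function and variable names are hypothetical.

```python
import random


def make_layer(labels, signal, dim, rng, directions):
    """Synthetic 'hidden states': unit Gaussian noise plus a label-dependent shift."""
    return [
        [rng.gauss(0.0, 1.0) + signal * directions[y][d] for d in range(dim)]
        for y in labels
    ]


def centroid_probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit one centroid per label on the training split, then score the test split."""
    dim = len(train_x[0])
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * dim)
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Assign the label whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def layerwise_probe(signals, n_train=300, n_test=100, dim=16, n_labels=3, seed=0):
    """Return one held-out probe accuracy per layer (one entry per signal strength)."""
    rng = random.Random(seed)
    labels = [rng.randrange(n_labels) for _ in range(n_train + n_test)]
    # One random direction per label; each layer shifts along it by its own strength.
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_labels)]
    accuracies = []
    for signal in signals:
        x = make_layer(labels, signal, dim, rng, directions)
        accuracies.append(
            centroid_probe_accuracy(
                x[:n_train], labels[:n_train], x[n_train:], labels[n_train:]
            )
        )
    return accuracies


# Assumed, invented signal strengths: weak at the edges, strongest in the middle layer.
SIGNALS = [0.0, 0.3, 2.0, 0.3, 0.0]
accuracies = layerwise_probe(SIGNALS)
best_layer = max(range(len(accuracies)), key=accuracies.__getitem__)

for layer, accuracy in enumerate(accuracies):
    print(f"layer {layer}: probe accuracy {accuracy:.2f}")
print("most informative layer:", best_layer)
```

With real models, the synthetic vectors would be replaced by per-layer token or sentence representations and the labels by task annotations such as PoS tags or case markers. A clear accuracy peak at some depth is what "localization" refers to, while a flat profile across layers corresponds to its absence.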


