A New Dataset for Natural Language Inference from Code-mixed Conversations

04/10/2020
by Simran Khanuja, et al.

Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and a hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and the hypotheses are in code-mixed Hindi-English. We use conversations from Hindi (Bollywood) movies as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol, which is based on observations from the pilot. The dataset currently consists of 400 premises, in the form of code-mixed conversation snippets, and 2240 code-mixed hypotheses. We analyze the dataset extensively to identify the linguistic phenomena commonly observed in it. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.
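As a rough illustration of the data layout described above, the sketch below shows one way a conversational premise and a crowd-sourced hypothesis could be represented and turned into the (premise, hypothesis) text pair that a BERT-style sentence-pair classifier typically expects. Everything here is an assumption made for illustration: the names `NLIExample`, `to_pair`, and `label_id` are hypothetical, the two-label set follows only the "entailment or contradiction" wording of the abstract, and the Hindi-English utterances are invented and not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed label set, based only on the abstract's wording.
LABELS = ("entailment", "contradiction")


@dataclass
class NLIExample:
    """One hypothetical record: a conversation snippet plus one hypothesis."""
    premise_turns: List[str]  # one string per utterance in the snippet
    hypothesis: str
    label: str

    def __post_init__(self) -> None:
        # Reject labels outside the assumed label set.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def to_pair(example: NLIExample) -> Tuple[str, str]:
    """Flatten the conversation into one premise string, paired with the hypothesis."""
    premise = " ".join(turn.strip() for turn in example.premise_turns)
    return premise, example.hypothesis.strip()


def label_id(example: NLIExample) -> int:
    """Map the string label to an integer class index."""
    return LABELS.index(example.label)


# Invented example, not drawn from the dataset.
example = NLIExample(
    premise_turns=[
        "A: Kal meeting hai kya?",
        "B: Nahi, meeting cancel ho gayi.",
    ],
    hypothesis="Kal meeting nahi hai.",
    label="entailment",
)

premise, hypothesis = to_pair(example)
print(premise)
print(hypothesis)
print(label_id(example))
```

The resulting pair could then be passed as the two text segments of a multilingual BERT tokenizer; how the paper's own pipeline formats conversational premises is not stated in the abstract.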

research
06/15/2018

A Dataset for Building Code-Mixed Goal Oriented Conversation Systems

There is an increasing demand for goal-oriented conversation systems whi...
research
02/23/2023

MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

The multi-sentential long sequence textual data unfolds several interest...
research
07/24/2021

MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation

Code-mixing is a phenomenon of mixing words and phrases from two or more...
research
03/06/2018

Annotation Artifacts in Natural Language Inference Data

Large-scale datasets for natural language inference are created by prese...
research
04/17/2021

GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations

Code-switching is the communication phenomenon where speakers switch bet...
research
05/19/2021

Detection of Emotions in Hindi-English Code Mixed Text Data

In recent times, we have seen an increased use of text chat for communic...
research
10/13/2020

Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options

Large-scale natural language inference (NLI) datasets such as SNLI or MN...
