Leveraging Language Identification to Enhance Code-Mixed Text Classification

06/08/2023
by Gauri Takawane, et al.

Code-mixing is the use of more than one language in the same text. Code-mixed text, especially English mixed with a regional language, is increasingly common on social media platforms. Existing deep-learning models do not take advantage of the implicit language information in code-mixed text. Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets by experimenting with language augmentation approaches. We propose a pipeline for improving code-mixed systems that comprises data preprocessing, word-level language identification, language augmentation, and model training on downstream tasks such as sentiment analysis. For language augmentation in BERT models, we explore word-level interleaving and post-sentence placement of language information. We examine the performance of vanilla BERT-based models and their code-mixed HingBERT counterparts on the respective benchmark datasets, comparing results with and without word-level language information. The models are evaluated using accuracy, precision, recall, and F1 score. Our findings show that the proposed language augmentation approaches work well across different BERT models. We demonstrate the importance of augmenting code-mixed text with language information on five code-mixed Hindi-English downstream datasets covering sentiment analysis, hate speech detection, and emotion detection.
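The two augmentation schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact input format, so everything concrete here is an assumption for illustration: the tag strings `EN` and `HI`, the use of `[SEP]` as the separator, the function names `interleave` and `post_sentence`, and the example sentence. The word-level language tags are assumed to have been produced already by a separate language identification step.

```python
from typing import List, Tuple

# Assumed tag set for illustration: "EN" for English, "HI" for Hindi.
# Each item pairs a word with the language tag assigned to it upstream.
TaggedToken = Tuple[str, str]


def interleave(tagged: List[TaggedToken]) -> str:
    """Word-level interleaving: each word is followed by its language tag."""
    return " ".join(f"{word} {lang}" for word, lang in tagged)


def post_sentence(tagged: List[TaggedToken], sep: str = "[SEP]") -> str:
    """Post-sentence placement: the whole sentence, then the tag sequence.

    The separator string is an assumption; "[SEP]" is BERT's usual
    segment separator, but the source does not say what is used.
    """
    words = " ".join(word for word, _ in tagged)
    langs = " ".join(lang for _, lang in tagged)
    return f"{words} {sep} {langs}"


# Hypothetical code-mixed sentence with hand-assigned tags.
example = [("movie", "EN"), ("bahut", "HI"), ("acchi", "HI"), ("thi", "HI")]

print(interleave(example))     # movie EN bahut HI acchi HI thi HI
print(post_sentence(example))  # movie bahut acchi thi [SEP] EN HI HI HI
```

In a sketch like this, interleaving keeps each tag adjacent to its word but doubles the sequence length inside the sentence, whereas post-sentence placement leaves the original word order intact and appends the tags as a second segment.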


