AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

Africa is home to over 2000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, annotation process, and related challenges when curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a huggingface datasets (https://huggingface.co/datasets/shmuhammad/AfriSenti).

READ FULL TEXT

page 1

page 14

page 15

research
11/01/2020

ASAD: A Twitter-based Benchmark Arabic Sentiment Analysis Dataset

This paper provides a detailed description of a new Twitter-based benchm...
research
04/13/2023

SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)

We present the first Africentric SemEval Shared task, Sentiment Analysis...
research
11/19/2022

ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture

This paper introduces ArtELingo, a new benchmark and dataset, designed t...
research
04/26/2023

HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource TweetData for Sentiment Analysis

We present the findings of SemEval-2023 Task 12, a shared task on sentim...
research
10/13/2021

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

The NLP pipeline has evolved dramatically in the last few years. The fir...
research
05/17/2023

Smart Word Suggestions for Writing Assistance

Enhancing word usage is a desired feature for writing assistance. To fur...
research
09/26/2022

Digital Audio Forensics: Blind Human Voice Mimicry Detection

Audio is one of the most used way of human communication, but at the sam...

Please sign up or login with your details

Forgot password? Click here to reset