Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis

01/18/2022
by Hang Jiang, et al.

Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. While there are some publicly available annotated datasets of tweets, they are all purpose-built for solving one task at a time. As yet there is no complete training corpus for both syntactic analysis (e.g., part of speech tagging, dependency parsing) and NER of tweets. In this study, we aim to create Tweebank-NER, an NER corpus based on Tweebank V2 (TB2), and we use these datasets to train state-of-the-art NLP models. We first annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We train a Stanza NER model on the new benchmark, achieving competitive performance against other non-transformer NER systems. Finally, we train other Twitter NLP models (a tokenizer, lemmatizer, part of speech tagger, and dependency parser) on TB2 based on Stanza, and achieve state-of-the-art or competitive performance on these tasks. We release the dataset and make the models available to use in an "off-the-shelf" manner for future Tweet NLP research. Our source code, data, and pre-trained models are available at: <https://github.com/social-machines/TweebankNLP>.
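As a rough illustration of how NER output on tweets is often scored, the sketch below computes entity-level F1 from BIO-tagged sequences. The abstract does not state which metric or tag scheme was used, so both are assumptions here. The function names (`bio_to_spans`, `span_f1`) and the example tags are hypothetical and are not taken from the TweebankNLP repository.

```python
from typing import List, Set, Tuple

# A span is (start_index, end_index_exclusive, label). Hypothetical helper type.
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of labeled spans.

    Assumption: tags look like "B-PER", "I-PER", "O". A stray "I-X" with no
    matching open span is treated leniently as the start of a new span.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((start, i, label))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged entity-level F1 over a list of tagged sentences."""
    gold_spans, pred_spans = set(), set()
    for sent_id, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with the sentence index so spans stay distinct.
        gold_spans |= {(sent_id, *s) for s in bio_to_spans(g)}
        pred_spans |= {(sent_id, *s) for s in bio_to_spans(p)}
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical tags for a four-token tweet; labels are illustrative only.
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(round(span_f1(gold, pred), 4))
```

A predicted entity counts as correct only when both its boundaries and its label match the gold span exactly, so partially overlapping predictions receive no credit under this scheme.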


research
04/08/2023

WikiGoldSK: Annotated Dataset, Baselines and Few-Shot Learning Experiments for Slovak Named Entity Recognition

Named Entity Recognition (NER) is a fundamental NLP task with a wide ra...
research
10/07/2022

Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

Recent progress in language model pre-training has led to important impr...
research
06/26/2023

Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language

In this paper we address the scarcity of annotated data for NArabizi, a ...
research
04/20/2021

Mitigating Temporal-Drift: A Simple Approach to Keep NER Models Crisp

Performance of neural models for named entity recognition degrades over ...
research
04/11/2021

NorDial: A Preliminary Corpus of Written Norwegian Dialect Use

Norway has a large amount of dialectal variation, as well as a general t...
research
03/27/2019

ner and pos when nothing is capitalized

For those languages which use it, capitalization is an important signal ...
research
04/04/2022

Product Market Demand Analysis Using NLP in Banglish Text with Sentiment Analysis and Named Entity Recognition

Product market demand analysis plays a significant role for originating ...
