UniParma @ SemEval 2021 Task 5: Toxic Spans Detection Using CharacterBERT and Bag-of-Words Model

03/17/2021
by Akbar Karimi, et al.

With the ever-increasing availability of digital information, toxic content is also on the rise. Detecting this type of language is therefore of paramount importance. We tackle this problem by combining a state-of-the-art pre-trained language model (CharacterBERT) with a traditional bag-of-words technique. Since the content is full of toxic words that are not written according to their dictionary spelling, attention to individual characters is crucial. We therefore use CharacterBERT to extract features based on the characters of each word. It consists of a CharacterCNN module that learns character embeddings from the context; these are then fed into the well-known BERT architecture. The bag-of-words method, in turn, further improves on this by making sure that some frequently used toxic words are labeled accordingly.
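The bag-of-words step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the word list is built or how it is combined with the model output, so everything below is an assumption made for illustration: the function names (`build_lexicon`, `lexicon_offsets`, `merge_predictions`), the frequency threshold `min_count`, the simple `\w+` tokenization, and the choice to take the union of character offsets. Predictions are represented as lists of toxic character offsets, the output format used in the Toxic Spans Detection task.

```python
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumed tokenization: runs of word characters


def build_lexicon(examples, min_count=2):
    """Collect words that lie fully inside annotated toxic spans.

    `examples` is an iterable of (text, toxic_offsets) pairs.
    `min_count` is a hypothetical frequency threshold, not a value from the paper.
    """
    counts = Counter()
    for text, toxic_offsets in examples:
        toxic = set(toxic_offsets)
        for m in WORD.finditer(text):
            if all(i in toxic for i in range(m.start(), m.end())):
                counts[m.group().lower()] += 1
    return {word for word, count in counts.items() if count >= min_count}


def lexicon_offsets(text, lexicon):
    """Return sorted character offsets of every lexicon word found in `text`."""
    offsets = set()
    for m in WORD.finditer(text):
        if m.group().lower() in lexicon:
            offsets.update(range(m.start(), m.end()))
    return sorted(offsets)


def merge_predictions(model_offsets, text, lexicon):
    """Union the model's offsets with the lexicon matches (assumed combination rule)."""
    return sorted(set(model_offsets) | set(lexicon_offsets(text, lexicon)))
```

In this sketch, a model that flags only the misspelled word in "You are a stupid idi0t" would still end up with "stupid" labeled, provided that word occurs often enough in the training spans to enter the lexicon.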

research
01/06/2020

Improving Entity Linking by Modeling Latent Entity Type Information

Existing state of the art neural entity linking models employ attention-...
research
07/20/2020

Morphological Skip-Gram: Using morphological knowledge to improve word representation

Natural language processing models have attracted much interest in the d...
research
05/19/2023

Persian Typographical Error Type Detection using Many-to-Many Deep Neural Networks on Algorithmically-Generated Misspellings

Digital technologies have led to an influx of text created daily in a va...
research
09/12/2018

Generalizing Word Embeddings using Bag of Subwords

We approach the problem of generalizing pre-trained word embeddings beyo...
research
02/23/2019

Fixed-Size Ordinally Forgetting Encoding Based Word Sense Disambiguation

In this paper, we present our method of using fixed-size ordinally forge...
research
04/05/2017

Bag-of-Words Method Applied to Accelerometer Measurements for the Purpose of Classification and Energy Estimation

Accelerometer measurements are the prime type of sensor information most...
research
04/28/2021

MelBERT: Metaphor Detection via Contextualized Late Interaction using Metaphorical Identification Theories

Automated metaphor detection is a challenging task to identify metaphori...
