Automated Detection of Non-Relevant Posts on the Russian Imageboard "2ch": Importance of the Choice of Word Representations

07/16/2017
by Amir Bakarov, et al.

This study considers the problem of automatically detecting non-relevant posts on Web forums. We approach the problem by approximating it with the task of detecting semantic relatedness between a given post and the opening post of the forum discussion thread. The approximated task can be solved by training a supervised classifier on the composed word embeddings of the two posts. Since success in this task may be quite sensitive to the choice of word representations, we compare the performance of different word embedding models. We train 7 models (Word2Vec, GloVe, Word2Vec-f, Wang2Vec, AdaGram, FastText, Swivel), evaluate the embeddings they produce on a dataset of human judgements, and compare their performance on the task of non-relevant post detection. To make the comparison, we propose a dataset of semantic relatedness built from posts on one of the most popular Russian Web forums, the imageboard "2ch", which has challenging lexical and grammatical features.
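The approach described in the abstract (compose the word embeddings of a reply and the opening post, then train a supervised classifier on the result) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy English vocabulary and its 3-dimensional vectors, averaging as the post representation, the composition (element-wise absolute difference concatenated with element-wise product), and the small logistic regression trained by stochastic gradient descent. The names `EMBEDDINGS`, `post_vector`, `compose`, `predict` and `train` are hypothetical.

```python
import math

# Hypothetical toy embeddings; a real setup would load vectors produced by
# one of the trained models (Word2Vec, FastText, ...).
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "pet": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}
DIM = 3


def post_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(column) / len(known) for column in zip(*known)]


def compose(opening_post, reply):
    """Compose two post vectors as |u - v| followed by u * v (an assumption)."""
    u, v = post_vector(opening_post), post_vector(reply)
    difference = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return difference + product


def predict(weights, bias, features):
    """Logistic regression: estimated probability that the reply is relevant."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=200, learning_rate=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * (2 * DIM), 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# (opening post tokens, reply tokens, label); 1 = relevant, 0 = non-relevant.
PAIRS = [
    (["cat", "pet"], ["kitten"], 1),
    (["cat", "pet"], ["stock", "market"], 0),
    (["stock"], ["market"], 1),
    (["stock"], ["kitten", "pet"], 0),
]
SAMPLES = [(compose(op, reply), label) for op, reply, label in PAIRS]
WEIGHTS, BIAS = train(SAMPLES)

for (op, reply, label), (features, _) in zip(PAIRS, SAMPLES):
    print(op, reply, label, round(predict(WEIGHTS, BIAS, features), 3))
```

The composition is deliberately not a plain concatenation of the two post vectors: relevance depends on how the two posts relate to each other, and a linear classifier over a bare concatenation cannot capture that interaction. Swapping the `EMBEDDINGS` table for vectors from a different model while keeping the rest fixed is one way to see how the choice of word representation might affect the classifier.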


research
07/02/2021

DUKweb: Diachronic word representations from the UK Web Archive corpus

Lexical semantic change (detecting shifts in the meaning and usage of wo...
research
07/02/2020

Lightme: Analysing Language in Internet Support Groups for Mental Health

Background: Assisting moderators to triage harmful posts in Internet Sup...
research
11/19/2021

Toxicity Detection can be Sensitive to the Conversational Context

User posts whose perceived toxicity depends on the conversational contex...
research
01/10/2012

Sentence based semantic similarity measure for blog-posts

Blogs-Online digital diary like application on web 2.0 has opened new an...
research
07/07/2021

POSLAN: Disentangling Chat with Positional and Language encoded Post Embeddings

Most online message threads inherently will be cluttered and any new use...
research
12/09/2021

Combining Textual Features for the Detection of Hateful and Offensive Language

The detection of offensive, hateful and profane language has become a cr...
research
10/13/2022

Early Discovery of Disappearing Entities in Microblogs

We make decisions by reacting to changes in the real world, in particula...
