Persian Typographical Error Type Detection using Many-to-Many Deep Neural Networks on Algorithmically-Generated Misspellings

05/19/2023
by Mohammad Dehghani, et al.

Digital technologies have led to an influx of text created daily in a variety of languages, styles, and formats. Much of the popularity of spell-checking systems can be attributed to this phenomenon, since they are crucial to polishing digitally produced text. In this study, we tackle Typographical Error Type Detection in Persian, which has been relatively understudied. We present a public dataset named FarsTypo, containing 3.4 million chronologically ordered and part-of-speech tagged words covering diverse topics and linguistic styles. An algorithm for applying Persian-specific errors is developed and applied to a scalable portion of these words, forming a parallel dataset of correct and incorrect words. Using FarsTypo, we establish a firm baseline and compare different methodologies using various architectures. In addition, we present a novel Many-to-Many Deep Sequential Neural Network that performs token classification using both word and character embeddings in combination with bidirectional LSTM layers to detect typographical errors across 51 classes. We compare our approach with highly advanced industrial systems that, unlike this study, have been developed utilizing a variety of resources. The results of our final method were competitive: we achieved an accuracy of 97.62% and a recall of 98.61%.
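
To make the idea of a parallel dataset of correct and incorrect words concrete, here is a minimal sketch of how misspellings could be generated algorithmically with one label per token. It is not the authors' algorithm: the five error types, the `corrupt` and `make_parallel` helpers, and the letter inventory are all hypothetical, chosen only for illustration, and the abstract does not list the paper's actual 51 classes.

```python
import random

# Hypothetical error types for illustration only; the 51 classes used in the
# paper are not enumerated in the abstract.
ERROR_TYPES = ("none", "deletion", "insertion", "substitution", "transposition")

# Basic Persian letters, used as the pool for inserted or substituted characters.
PERSIAN_LETTERS = "ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"


def corrupt(word, error_type, rng):
    """Return `word` with a single error of the given type applied."""
    if error_type == "none" or len(word) < 2:
        return word
    # Pick a position that always has a right-hand neighbour.
    i = rng.randrange(len(word) - 1)
    if error_type == "deletion":
        return word[:i] + word[i + 1:]
    if error_type == "insertion":
        return word[:i] + rng.choice(PERSIAN_LETTERS) + word[i:]
    if error_type == "substitution":
        choices = [c for c in PERSIAN_LETTERS if c != word[i]]
        return word[:i] + rng.choice(choices) + word[i + 1:]
    if error_type == "transposition":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown error type: {error_type}")


def make_parallel(words, error_rate=0.3, seed=0):
    """Build (correct, noisy, label) triples, one per input token."""
    rng = random.Random(seed)
    rows = []
    for word in words:
        error_type = "none"
        if rng.random() < error_rate:
            error_type = rng.choice(ERROR_TYPES[1:])
        noisy = corrupt(word, error_type, rng)
        # If the edit left the word unchanged (for example, swapping two
        # identical letters), label the token as error-free.
        label = error_type if noisy != word else "none"
        rows.append((word, noisy, label))
    return rows


if __name__ == "__main__":
    for correct, noisy, label in make_parallel(["کتاب", "مدرسه", "دانشگاه"], 1.0, 1):
        print(correct, noisy, label)
```

Because every input token yields exactly one label, the output has the shape a many-to-many token classifier expects: a sequence of noisy words in, a sequence of error-type labels of the same length out.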

research
10/17/2017

EffectiveSan: Type and Memory Error Detection using Dynamically Typed C/C++

Low-level programming languages such as C and C++ are vulnerable to erro...
research
03/17/2021

UniParma @ SemEval 2021 Task 5: Toxic Spans Detection Using CharacterBERT and Bag-of-Words Model

With the ever-increasing availability of digital information, toxic cont...
research
09/09/2020

Unconstrained Text Detection in Manga: a New Dataset and Baseline

The detection and recognition of unconstrained text is an open problem i...
research
11/22/2019

Multilingual Culture-Independent Word Analogy Datasets

In text processing, deep neural networks mostly use word embeddings as a...
research
03/29/2022

Automatic Detection of Speech Sound Disorder in Child Speech Using Posterior-based Speaker Representations

This paper presents a macroscopic approach to automatic detection of spe...
research
10/25/2018

The Logoscope: a Semi-Automatic Tool for Detecting and Documenting French New Words

In this article we present the design and implementation of the Logoscop...
research
04/01/2021

HLE-UPC at SemEval-2021 Task 5: Multi-Depth DistilBERT for Toxic Spans Detection

This paper presents our submission to SemEval-2021 Task 5: Toxic Spans D...
