Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic

04/16/2020
by Jón Friðrik Daðason, et al.
In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, which negatively impacts the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be applied recursively to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.
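The step from a two-way splitter to a full constituent structure, and from there to task-specific splits, can be illustrated with a minimal sketch. It is not the Kvistur implementation: the lookup table `TOY_SPLITS` is a hypothetical stand-in for the trained BiLSTM, and the names `build_tree`, `full_split`, and `largest_known_head` are invented for illustration. The head search also assumes that the morphological head is the rightmost constituent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

# A binary splitter returns (left, right) for a compound, or None for a
# non-compound. In the paper this role is played by the BiLSTM model;
# here a hand-written table is used purely for illustration.
BinarySplitter = Callable[[str], Optional[Tuple[str, str]]]

TOY_SPLITS = {
    "bókasafnsskírteini": ("bókasafns", "skírteini"),
    "bókasafns": ("bóka", "safns"),
}


def toy_splitter(word: str) -> Optional[Tuple[str, str]]:
    """Hypothetical stand-in for the trained model."""
    return TOY_SPLITS.get(word)


@dataclass
class Node:
    word: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(word: str, split: BinarySplitter) -> Node:
    """Apply the binary splitter recursively to get a constituent tree."""
    parts = split(word)
    if parts is None:
        return Node(word)
    left, right = parts
    return Node(word, build_tree(left, split), build_tree(right, split))


def full_split(node: Node) -> List[str]:
    """Return the leaves of the tree, e.g. for subword tokenization."""
    if node.left is None or node.right is None:
        return [node.word]
    return full_split(node.left) + full_split(node.right)


def largest_known_head(node: Node, vocab: Set[str]) -> Optional[str]:
    """Descend along right-hand constituents (assumed to be the heads)
    and return the first, i.e. largest, one found in the vocabulary."""
    current: Optional[Node] = node
    while current is not None:
        if current.word in vocab:
            return current.word
        current = current.right
    return None


tree = build_tree("bókasafnsskírteini", toy_splitter)
print(full_split(tree))
print(largest_known_head(tree, {"skírteini"}))
```

Keeping the tree rather than only the flat list of parts is what lets the same analysis serve both use cases: the leaves give the full split, while the right-hand spine gives progressively smaller head candidates for a tagger whose vocabulary lacks the whole compound.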

