Comparison of Turkish Word Representations Trained on Different Morphological Forms

02/13/2020
by Gökhan Güler, et al.

The growing popularity of text representations has brought many improvements to Natural Language Processing (NLP) tasks. Without requiring supervised data, embeddings trained on large corpora capture meaningful relations that can be used in different NLP tasks. Even though training these vectors is relatively easy with recent methods, the information gained from the data depends heavily on the structure of the corpus language. Since the most widely researched languages have similar morphological structure, problems that arise for morphologically rich languages are largely disregarded in studies. For such languages, context-free word vectors ignore morphological structure. In this study, we prepared texts in morphologically different forms in a morphologically rich language, Turkish, and compared the results on different intrinsic and extrinsic tasks. To see the effect of morphological structure, we trained a word2vec model on texts in which lemmas and suffixes are treated differently. We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
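The idea of preparing one text in several morphological forms can be illustrated with a minimal sketch. The abstract does not list the exact forms the authors used, so the three variants below (surface form, lemma only, and lemma with suffixes as separate tokens) are assumptions chosen for illustration. The hand-segmented sentence, the `+` suffix marker, and all function names are likewise hypothetical; in practice the segmentation would come from a morphological analyzer.

```python
from typing import List, Tuple

# Hypothetical hand-segmented tokens: (lemma, [suffixes]).
# A real pipeline would obtain these from a morphological analyzer.
Analysis = Tuple[str, List[str]]

SENTENCE: List[Analysis] = [
    ("ev", ["ler", "de"]),  # evlerde  "in the houses"
    ("kitap", ["lar"]),     # kitaplar "books"
    ("var", []),            # var      "there is/are"
]


def surface_form(sentence: List[Analysis]) -> List[str]:
    """Rebuild each word as it appears in running text."""
    return [lemma + "".join(suffixes) for lemma, suffixes in sentence]


def lemma_only(sentence: List[Analysis]) -> List[str]:
    """Drop all suffixes and keep only the lemma of each word."""
    return [lemma for lemma, _ in sentence]


def lemma_and_suffix_tokens(sentence: List[Analysis]) -> List[str]:
    """Emit the lemma, then each suffix as its own '+'-marked token."""
    tokens: List[str] = []
    for lemma, suffixes in sentence:
        tokens.append(lemma)
        tokens.extend("+" + suffix for suffix in suffixes)
    return tokens


if __name__ == "__main__":
    for build in (surface_form, lemma_only, lemma_and_suffix_tokens):
        print(build.__name__, build(SENTENCE))
```

Each variant yields a different token sequence for the same sentence, so an embedding model trained on one variant sees a different vocabulary and different co-occurrence statistics than a model trained on another. Token lists of this shape could then be passed to any word2vec implementation that accepts pre-tokenized sentences.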


research
05/05/2021

Evaluation Of Word Embeddings From Large-Scale French Web Content

Distributed word representations are popularly used in many tasks in nat...
research
07/27/2019

Nefnir: A high accuracy lemmatizer for Icelandic

Lemmatization, finding the basic morphological form of a word in a corpu...
research
08/13/2018

Comparing morphological complexity of Spanish, Otomi and Nahuatl

We use two small parallel corpora for comparing the morphological comple...
research
07/14/2023

MorphPiece : Moving away from Statistical Language Representation

Tokenization is a critical part of modern NLP pipelines. However, contem...
research
04/29/2020

Evaluating the Role of Language Typology in Transformer-Based Multilingual Text Classification

As NLP tools become ubiquitous in today's technological landscape, they ...
research
04/29/2020

Evaluating Transformer-Based Multilingual Text Classification

As NLP tools become ubiquitous in today's technological landscape, they ...
research
08/03/2023

Lexicon and Rule-based Word Lemmatization Approach for the Somali Language

Lemmatization is a Natural Language Processing (NLP) technique used to n...
