More Romanian word embeddings from the RETEROM project

by Vasile Pais, et al.

Automatically learned vector representations of words, also known as "word embeddings", are becoming a basic building block for a growing number of natural language processing algorithms. There are different methods and tools for constructing word embeddings. Most approaches rely on raw text, with word occurrences and/or letter n-grams as the construction items. More elaborate research uses additional linguistic features extracted through text preprocessing. Morphology is clearly served by vector representations constructed from raw text and letter n-grams. Studies of syntax and semantics may profit more from vector representations constructed with additional features such as the lemma, part of speech, and syntactic or semantic dependents associated with each word. One of the key objectives of the ReTeRom project is the development of advanced technologies for Romanian natural language processing, including morphological, syntactic and semantic analysis of text. As such, we plan to develop a large open-access library of ready-to-use word embedding sets, each set characterized by different parameters: the features used (wordforms, letter n-grams, lemmas, POS tags, etc.), vector length, window/context size and frequency threshold. To this end, the previously created sets of word embeddings (based on word occurrences) on the CoRoLa corpus (Păiş and Tufiş, 2018) are being, and will continue to be, augmented with new representations learned from the same corpus using specific features such as lemmas and parts of speech. Furthermore, to better understand and explore the vectors, graphical representations will be available through customized interfaces.
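
As a rough illustration of how such a library of embedding sets might be parameterized, the following Python sketch builds training tokens from different features and enumerates the resulting configurations. It is not the project's actual pipeline: the `Token` class, the `to_feature` and `embedding_set_grid` helpers, the `lemma_POS` token format, the POS tag names and all parameter values are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Token:
    """One annotated corpus token (hypothetical structure, not the CoRoLa format)."""
    wordform: str
    lemma: str
    pos: str


def to_feature(token: Token, feature: str) -> str:
    """Return the string an embedding trainer would see for the chosen feature."""
    if feature == "wordform":
        return token.wordform.lower()
    if feature == "lemma":
        return token.lemma
    if feature == "lemma_pos":
        # Joining lemma and POS keeps homographs with different POS apart.
        return f"{token.lemma}_{token.pos}"
    raise ValueError(f"unknown feature: {feature}")


def embedding_set_grid(features, vector_lengths, window_sizes, min_counts):
    """Enumerate one configuration per combination of the four parameters."""
    return [
        {"feature": f, "vector_length": d, "window": w, "min_count": m}
        for f, d, w, m in product(features, vector_lengths, window_sizes, min_counts)
    ]


# "Copiii citesc cărți" ("The children read books"); tags are illustrative.
sentence = [
    Token("Copiii", "copil", "NOUN"),
    Token("citesc", "citi", "VERB"),
    Token("cărți", "carte", "NOUN"),
]

lemma_pos_tokens = [to_feature(t, "lemma_pos") for t in sentence]

# Illustrative values only; the source does not state the parameters used.
grid = embedding_set_grid(
    features=["wordform", "lemma", "lemma_pos"],
    vector_lengths=[100, 300],
    window_sizes=[5],
    min_counts=[1, 20],
)
```

Each dictionary in `grid` describes one candidate embedding set. The token sequences produced by `to_feature` could be passed to any standard embedding trainer in place of the raw wordforms.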

