Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu

02/22/2021
by Usama Khalid, et al.

Urdu is a widely spoken language in South Asia. Although a considerable body of literature exists in Urdu, the available data is still insufficient for processing the language with NLP techniques. Very efficient language models exist for English, a high-resource language, but Urdu and other under-resourced languages have long been neglected. Creating efficient language models for these languages requires good word embedding models. For Urdu, the only word embeddings we could find were trained and developed using the skip-gram model. In this paper, we build a corpus for Urdu by scraping and integrating data from various sources, and we compile a vocabulary for the language. We also modify fastText embeddings and N-gram models to enable training them on this corpus. We use the trained embeddings for a word similarity task and compare the results with existing techniques.
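
As a rough illustration of two ingredients the abstract refers to, the sketch below shows how fastText-style character n-grams can be extracted from a word and how a word similarity score is commonly computed as the cosine between two embedding vectors. It is not the authors' implementation: the function names `char_ngrams` and `cosine_similarity`, the n-gram range of 3 to 6, and the toy vectors are assumptions made for illustration only.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word`, fastText-style.

    The word is wrapped in '<' and '>' boundary markers before slicing.
    The 3..6 range is an assumed default, not a value taken from the paper.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical 3-dimensional vectors standing in for trained embeddings.
    vec_a = [0.2, 0.7, 0.1]
    vec_b = [0.3, 0.6, 0.0]
    print(char_ngrams("کتاب", 3, 3))
    print(cosine_similarity(vec_a, vec_b))
```

In a word similarity evaluation, scores like these are typically compared against human similarity judgments for the same word pairs; which benchmark and metric the paper uses is stated in the full text, not here.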


