Learning Word Embeddings from the Portuguese Twitter Stream: A Study of some Practical Aspects

09/04/2017
by   Pedro Saleiro, et al.

This paper describes a preliminary study on producing and distributing a large-scale database of embeddings built from the Portuguese Twitter stream. We begin with a relatively small sample and focus on three challenges: the volume of training data, vocabulary size, and intrinsic evaluation metrics. Using a single GPU, we scaled the vocabulary from 2048 embedded words over 500K training examples to 32768 words over 10M training examples, while keeping validation loss stable and training time per epoch growing approximately linearly. We also observed that using less than 50% of the available training examples for a given vocabulary size may lead to overfitting. Intrinsic evaluation results show promising performance for a vocabulary of 32768 words. Nevertheless, the intrinsic evaluation metrics are over-sensitive to their corresponding cosine similarity thresholds, indicating that a wider range of metrics is needed to track progress.
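The intrinsic evaluation mentioned above thresholds the cosine similarity between embedding vectors to decide whether a word pair counts as related. A minimal sketch of such a threshold-based check, using hypothetical toy vectors and a hypothetical threshold value (not the paper's actual embeddings or settings), could look like:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings, purely for illustration.
embeddings = {
    "bom":   np.array([0.9, 0.1, 0.0, 0.2]),
    "otimo": np.array([0.8, 0.2, 0.1, 0.3]),
    "carro": np.array([0.0, 0.9, 0.8, 0.1]),
}

# A pair is judged "related" when its cosine similarity exceeds a
# chosen threshold. As the paper notes, such metrics can be highly
# sensitive to this threshold choice.
THRESHOLD = 0.8  # hypothetical value, not taken from the paper

def related(w1, w2, thr=THRESHOLD):
    return cosine_similarity(embeddings[w1], embeddings[w2]) >= thr
```

Because the metric collapses a continuous similarity score into a binary decision, small shifts in the threshold can flip many pairs at once, which is one way the over-sensitivity described in the abstract arises.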


Related research

11/28/2019 · A New Corpus for Low-Resourced Sindhi Language with Word Embeddings
Representing words and phrases into dense vectors of real numbers which ...

05/23/2019 · Misspelling Oblivious Word Embeddings
In this paper we present a method to learn word embeddings that are resi...

06/24/2022 · Using BERT Embeddings to Model Word Importance in Conversational Transcripts for Deaf and Hard of Hearing Users
Deaf and hard of hearing individuals regularly rely on captioning while ...

08/10/2022 · How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?
Neural Machine Translation (NMT) is an open vocabulary problem. As a res...

03/31/2021 · Few-shot learning through contextual data augmentation
Machine translation (MT) models used in industries with constantly chang...

02/25/2020 · Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction
Language-independent tokenisation (LIT) methods that do not require labe...

10/02/2021 · A Case Study to Reveal if an Area of Interest has a Trend in Ongoing Tweets Using Word and Sentence Embeddings
In the field of Natural Language Processing, information extraction from...
