Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

03/30/2022
by Elena Alvarez-Mellado, et al.

This work presents a new resource for lexical borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings – words from one language that are introduced into another without orthographic adaptation – and use it to evaluate several sequence labeling models: CRF, BiLSTM-CRF, and Transformer-based models. The corpus contains 370,000 tokens and is larger, more borrowing-dense, richer in out-of-vocabulary (OOV) words, and more topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings, together with either Transformer-based embeddings pretrained on code-switched data or a combination of contextualized word embeddings, outperforms a multilingual BERT-based model.
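To make the sequence labeling framing concrete, here is a minimal sketch of how token-level span annotations could be turned into per-token tags that a CRF or BiLSTM-CRF tagger would learn to predict. The abstract does not specify the tagging scheme, so the BIO encoding, the `BORROWING` label name, the `spans_to_bio` helper, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        tags[start] = f"B-{label}"          # first token of the borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Illustrative sentence (not taken from the corpus) with two
# multiword borrowings written without orthographic adaptation.
tokens = ["Las", "fake", "news", "llegan", "al", "prime", "time"]
spans = [(1, 3, "BORROWING"), (5, 7, "BORROWING")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this encoding, a tagger emits one label per token, and multiword borrowings are recovered by joining each `B-` tag with the `I-` tags that follow it.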


