Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance

01/31/2020
by   Ahmed El-Kishky, et al.

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or are translations of each other. Such aligned data can be used for a variety of NLP tasks, from training cross-lingual representations to mining parallel bitexts for machine translation training. In this paper, we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages. These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low-, mid-, and high-resource language pairs. Recognizing that our proposed scoring function and other state-of-the-art methods are computationally intractable for long web documents, we utilize a more tractable greedy algorithm that performs comparably. We experimentally demonstrate that our distance metric aligns documents better than current baselines, outperforming them by 7 pairs, 15 pairs
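To make the idea in the abstract concrete, here is a minimal sketch of a greedy mover-style distance between two documents, followed by a greedy pairing of documents by that distance. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the sentence weighting, the base distance, or the exact greedy procedure, so uniform sentence mass, Euclidean distance, and the function names `greedy_mover_distance` and `greedy_align` are all hypothetical choices. Sentence embeddings are assumed to be precomputed by some cross-lingual encoder and are represented here as plain tuples of floats.

```python
import math
from itertools import product


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_mover_distance(src, tgt):
    """Greedy approximation of a mover's distance between two documents.

    src, tgt: non-empty lists of sentence embeddings.
    Assumption: every sentence carries equal mass (1 / number of sentences).
    """
    src_mass = [1.0 / len(src)] * len(src)
    tgt_mass = [1.0 / len(tgt)] * len(tgt)
    # Visit sentence pairs from closest to farthest.
    pairs = sorted(
        product(range(len(src)), range(len(tgt))),
        key=lambda ij: euclidean(src[ij[0]], tgt[ij[1]]),
    )
    total = 0.0
    for i, j in pairs:
        # Move as much remaining mass as both sentences allow.
        flow = min(src_mass[i], tgt_mass[j])
        if flow <= 1e-12:
            continue
        total += flow * euclidean(src[i], tgt[j])
        src_mass[i] -= flow
        tgt_mass[j] -= flow
    return total


def greedy_align(src_docs, tgt_docs):
    """Pair documents one-to-one, always taking the closest unused pair."""
    scored = sorted(
        (greedy_mover_distance(s, t), i, j)
        for i, s in enumerate(src_docs)
        for j, t in enumerate(tgt_docs)
    )
    used_src, used_tgt, aligned = set(), set(), []
    for _, i, j in scored:
        if i in used_src or j in used_tgt:
            continue
        aligned.append((i, j))
        used_src.add(i)
        used_tgt.add(j)
    return aligned


# Toy 2-D "embeddings"; real sentence embeddings would be much larger.
src_docs = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0), (6.0, 5.0)]]
tgt_docs = [[(5.1, 5.0), (6.1, 5.0)], [(0.1, 0.0), (1.1, 0.0)]]
alignment = greedy_align(src_docs, tgt_docs)
```

In this toy input, each source document has one clearly closer target document, so the pairing matches source 0 with target 1 and source 1 with target 0. The greedy transport step avoids solving an exact optimal-transport problem, which is the kind of trade-off the abstract describes for long web documents.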


