What do Language Representations Really Represent?

01/09/2019
by Johannes Bjerva, et al.

A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, because the model picks up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on the one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships, a convenient benchmark used for evaluation in previous work, appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.
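
The kind of comparison described above, relating similarity between language representations to another notion of similarity between languages, can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the distance measure or the statistic, so cosine distance and Pearson correlation are assumptions here, the function names are hypothetical, and the vectors and the "structural" distances are randomly generated toy data, not results.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Pairwise cosine distances between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def upper_triangle(matrix):
    """Entries above the diagonal, so each language pair is counted once."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def pairwise_distance_pearson(dist_a, dist_b):
    """Pearson correlation between two pairwise distance matrices."""
    a = upper_triangle(dist_a)
    b = upper_triangle(dist_b)
    return float(np.corrcoef(a, b)[0, 1])

# Toy data (assumption): 5 hypothetical languages with 8-dimensional vectors.
rng = np.random.default_rng(0)
language_vectors = rng.normal(size=(5, 8))
representation_dist = cosine_distance_matrix(language_vectors)

# Made-up "structural" distances: a noisy, symmetric copy of the above.
noise = rng.normal(scale=0.05, size=representation_dist.shape)
structural_dist = representation_dist + (noise + noise.T) / 2
np.fill_diagonal(structural_dist, 0.0)

print(pairwise_distance_pearson(representation_dist, structural_dist))
```

Because the entries of a distance matrix are not independent of one another, a significance test on such a correlation would normally rely on a permutation scheme such as a Mantel test rather than a standard p-value.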


