A Crosslingual Investigation of Conceptualization in 1335 Languages

05/15/2023
by Yihong Liu, et al.

Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for 'belly' and 'womb'. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept ('bird') and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity of two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracy between 54% and 87%.
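To make the notion of crosslingual stability concrete, the following is a minimal toy sketch, not the Conceptualizer method itself. It assumes a simplified reading of "1-1 correspondence": a source concept counts as 1-1 in a language when it aligns to exactly one target string and that string aligns back to no other concept. The language names (`lang_a`, `lang_b`, `lang_c`), the target strings, and the function names are hypothetical placeholders; `lang_a` mirrors the Swahili case above, where 'belly' and 'womb' share a single string.

```python
from collections import defaultdict


def build_backward_edges(forward):
    """Invert concept -> target strings into target string -> concepts."""
    backward = defaultdict(set)
    for concept, strings in forward.items():
        for s in strings:
            backward[s].add(concept)
    return backward


def is_one_to_one(concept, forward):
    """True if the concept maps to one string that maps back only to it."""
    strings = forward.get(concept, set())
    if len(strings) != 1:
        return False
    (s,) = strings
    return build_backward_edges(forward)[s] == {concept}


def stability(concept, alignments):
    """Fraction of languages in which the concept has a 1-1 correspondence.

    This ratio is an assumed simplification, not the paper's exact measure.
    """
    langs = [lang for lang, fwd in alignments.items() if concept in fwd]
    if not langs:
        return 0.0
    hits = sum(is_one_to_one(concept, alignments[lang]) for lang in langs)
    return hits / len(langs)


# Hypothetical toy alignments: language -> concept -> set of target strings.
alignments = {
    "lang_a": {"belly": {"a1"}, "womb": {"a1"}, "bird": {"a2"}},
    "lang_b": {"belly": {"b1"}, "womb": {"b2"}, "bird": {"b3"}},
    "lang_c": {"belly": {"c1"}, "womb": {"c2"}, "bird": {"c3"}},
}

for concept in ("bird", "belly", "womb"):
    print(concept, stability(concept, alignments))
```

In this toy data, 'bird' is 1-1 in every language, while 'belly' and 'womb' lose their 1-1 status in `lang_a` because they share one target string, so they come out as less stable.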


