Richer Countries and Richer Representations

05/10/2022
by Kaitlyn Zhou, et al.

We examine whether some countries are more richly represented in embedding space than others. We find that countries whose names occur with low frequency in training corpora are more likely to be tokenized into subwords, are less semantically distinct in embedding space, and are less likely to be correctly predicted: e.g., Ghana (the correct answer and in-vocabulary) is not predicted for "The country producing the most cocoa is [MASK]." Although these performance discrepancies and representational harms are due to frequency, we find that frequency is highly correlated with a country's GDP, thus perpetuating historic power and wealth inequalities. We analyze the effectiveness of mitigation strategies, recommend that researchers report training word frequencies, and recommend future work for the community to define and design representational guarantees.
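
To make "tokenized into subwords" concrete, here is a minimal sketch of a greedy longest-match-first subword splitter in the WordPiece style. It assumes a WordPiece-like scheme, which the abstract does not specify. The vocabulary and the place names `alphaland` and `betaland` are invented for illustration and do not reflect any real model or country. The function names `wordpiece_tokenize` and `is_split` are likewise hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first.

    Pieces that do not start the word carry a "##" prefix, following the
    WordPiece convention. This is an illustrative sketch, not the
    tokenizer used in the paper.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def is_split(word, vocab):
    """Return True when the word is represented by more than one piece."""
    return len(wordpiece_tokenize(word, vocab)) > 1


# Invented toy vocabulary: one name is stored whole, the other only in parts.
TOY_VOCAB = {"alphaland", "beta", "##land"}

for name in ("alphaland", "betaland"):
    print(name, wordpiece_tokenize(name, TOY_VOCAB), is_split(name, TOY_VOCAB))
```

In this toy setup, the name stored whole keeps a single dedicated vocabulary entry, while the other is assembled from pieces shared with unrelated words. That is the kind of difference the abstract associates with low-frequency names.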

