Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction

05/03/2020
by Cristina España-Bonet, et al.

We propose an automatic, language-independent, graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph-based model outperforms a retrieval-based approach and reaches an average precision of 84%. We introduce the concept of "domainness" and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with the human-judged precision, making it a reasonable automatic alternative for assessing the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities. WikiTailor makes it easy to obtain multilingual in-domain data from the Wikipedia.
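
To make the idea of exploring a category graph concrete, the following is a minimal sketch of a breadth-first traversal from a root category that gathers the articles attached to each visited category. It is not the WikiTailor implementation: the function name, the toy graph, and the fixed `max_depth` cut-off are all assumptions made for illustration, and the abstract does not say how the actual model decides where to stop.

```python
from collections import deque


def collect_domain_articles(root, subcategories, articles, max_depth):
    """Breadth-first walk over a category graph (hypothetical helper).

    subcategories: dict mapping a category to its child categories.
    articles: dict mapping a category to the articles filed under it.
    max_depth: assumed stopping criterion; deeper categories are ignored.
    """
    seen = {root}                  # guards against cycles in the graph
    queue = deque([(root, 0)])
    collected = set()
    while queue:
        category, depth = queue.popleft()
        collected.update(articles.get(category, ()))
        if depth == max_depth:
            continue               # do not expand beyond the cut-off
        for child in subcategories.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return collected


# Toy data, invented for this example only.
SUBCATEGORIES = {
    "Astronomy": ["Planets", "Astronomers"],
    "Planets": ["Astronomy"],                  # deliberate cycle
    "Astronomers": ["People by occupation"],   # drifts out of the domain
}
ARTICLES = {
    "Astronomy": ["Telescope"],
    "Planets": ["Mars", "Venus"],
    "Astronomers": ["Galileo Galilei"],
    "People by occupation": ["Plumber"],
}

if __name__ == "__main__":
    print(sorted(collect_domain_articles("Astronomy", SUBCATEGORIES, ARTICLES, 1)))
```

In the toy graph, raising `max_depth` from 1 to 2 pulls in an unrelated article through a very general category, which is the kind of drift that a measure of collection quality is meant to expose.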
