DBpedia NIF: Open, Large-Scale and Multilingual Knowledge Extraction Corpus

12/26/2018
by Milan Dojchinovski, et al.

In the past decade, the DBpedia community has put a significant amount of effort into developing technical infrastructure and methods for the efficient extraction of structured information from Wikipedia. These efforts have primarily focused on harvesting, refining and publishing the semi-structured information found in Wikipedia articles, such as information from infoboxes, categorization information, images, wikilinks and citations. Nevertheless, a vast amount of valuable information is still contained in the unstructured text of Wikipedia articles. In this paper, we present DBpedia NIF - a large-scale and multilingual knowledge extraction corpus. The aim of the dataset is two-fold: to dramatically broaden and deepen the amount of structured information in DBpedia, and to provide a large-scale, multilingual language resource for the development of various NLP and IR tasks. The dataset provides the content of all articles for 128 Wikipedia languages. We describe the dataset creation process and the NLP Interchange Format (NIF) used to model the content, links and structure of the information in the Wikipedia articles. The dataset has been further enriched with about 25 partitions published as Linked Data. Finally, we describe the maintenance and sustainability plans, and selected use cases of the dataset from the TextExt knowledge extraction challenge.
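To make the NIF modelling concrete, below is a minimal sketch, not taken from the paper or its extraction tooling: it uses Python with rdflib (assumed available) to express one sentence and one wikilink using NIF Core terms such as nif:Context, nif:beginIndex/nif:endIndex, nif:anchorOf and itsrdf:taIdentRef. The example.org URIs and the sentence are hypothetical placeholders; the published corpus derives its context URIs from DBpedia resources and covers whole articles.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# NIF Core and ITS RDF vocabularies used for text annotation as Linked Data
NIF = Namespace("http://persistence.uni-leipzig.de/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

g = Graph()
g.bind("nif", NIF)
g.bind("itsrdf", ITSRDF)

# The nif:Context carries the plain text of the article (here a single sentence)
text = "Berlin is the capital of Germany."
# Hypothetical context URI for illustration only
ctx = URIRef("http://example.org/wiki/Berlin?nif=context")
g.add((ctx, RDF.type, NIF.Context))
g.add((ctx, NIF.isString, Literal(text)))
g.add((ctx, NIF.beginIndex, Literal(0, datatype=XSD.nonNegativeInteger)))
g.add((ctx, NIF.endIndex, Literal(len(text), datatype=XSD.nonNegativeInteger)))

# One wikilink, modelled as a nif:Phrase anchored by character offsets in the context
begin = text.index("Germany")
end = begin + len("Germany")
phrase = URIRef(f"http://example.org/wiki/Berlin?nif=phrase&char={begin},{end}")
g.add((phrase, RDF.type, NIF.Phrase))
g.add((phrase, NIF.referenceContext, ctx))
g.add((phrase, NIF.anchorOf, Literal("Germany")))
g.add((phrase, NIF.beginIndex, Literal(begin, datatype=XSD.nonNegativeInteger)))
g.add((phrase, NIF.endIndex, Literal(end, datatype=XSD.nonNegativeInteger)))
# The link target is expressed with itsrdf:taIdentRef, pointing at a DBpedia resource
g.add((phrase, ITSRDF.taIdentRef, URIRef("http://dbpedia.org/resource/Germany")))

# Serialise as Turtle (returns a str with rdflib 6+)
print(g.serialize(format="turtle"))
```

Serialised this way, the sentence and its link produce triples of roughly the same shape as the corpus partitions that carry article text, link anchors and structural offsets.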
