Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl

10/04/2017
by Alexander Panchenko, et al.

We present DepCC, the largest linguistically analyzed English corpus to date: 365 million documents comprising 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences, drawn from a web-scale crawl of the CommonCrawl project. The sentences are processed with a dependency parser and a named entity tagger and carry provenance information, enabling applications ranging from training syntax-based word embeddings to open information extraction and question answering. We demonstrate the utility of the corpus on the verb similarity task: a distributional model trained on our corpus yields better results than models trained on smaller corpora, such as Wikipedia, and outperforms state-of-the-art models of verb similarity trained on smaller corpora on the SimVerb3500 dataset.
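
To illustrate what "syntax-based" contexts for word embeddings can look like, here is a minimal sketch in Python. It is not the authors' pipeline: the toy sentence, the tuple layout (token id, form, head id, relation), the relation labels, and the function name `dependency_contexts` are all assumptions made for illustration. The idea is that each word is paired with its syntactic neighbours in the dependency tree rather than with its linear neighbours in the sentence.

```python
from collections import Counter

# Hypothetical toy parse (not taken from DepCC):
# each token is (id, form, head_id, relation); head_id 0 marks the root.
SENTENCE = [
    (1, "dogs", 2, "nsubj"),
    (2, "chase", 0, "root"),
    (3, "cats", 2, "dobj"),
]


def dependency_contexts(sentence):
    """Yield (word, context) pairs derived from dependency arcs."""
    forms = {tok_id: form for tok_id, form, _, _ in sentence}
    for tok_id, form, head, rel in sentence:
        if head == 0:
            continue  # the root token has no governing word
        head_form = forms[head]
        # The head observes its dependent through the relation label.
        yield head_form, f"{rel}_{form}"
        # The dependent observes its head through the inverse relation.
        yield form, f"{rel}-1_{head_form}"


# Co-occurrence counts like these could feed a count-based or
# prediction-based distributional model.
counts = Counter(dependency_contexts(SENTENCE))
for (word, context), n in sorted(counts.items()):
    print(word, context, n)
```

With contexts of this kind, a verb such as "chase" is characterized by its subjects and objects, which is one plausible reason syntax-based models are attractive for a verb similarity task.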


research
01/19/2018

Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

In this paper, we present a distributional word embedding model trained ...
research
09/03/2019

Introducing RONEC -- the Romanian Named Entity Corpus

We present RONEC - the Named Entity Corpus for the Romanian language. Th...
research
05/23/2023

WebIE: Faithful and Robust Information Extraction on the Web

Extracting structured and grounded fact triples from raw text is a funda...
research
07/14/2020

Using Holographically Compressed Embeddings in Question Answering

Word vector representations are central to deep learning natural languag...
research
12/14/2022

Building and Evaluating Universal Named-Entity Recognition English corpus

This article presents the application of the Universal Named Entity fram...
research
06/18/2020

AMALGUM – A Free, Balanced, Multilayer English Web Corpus

We present a freely available, genre-balanced English web corpus totalin...
research
04/23/2018

Can Eye Movement Data Be Used As Ground Truth For Word Embeddings Evaluation?

In recent years a certain success in the task of modeling lexical semant...