Extracting Parallel Paragraphs from Common Crawl

04/27/2018
by Jakub Kúdela, et al.

Most current methods for mining parallel texts from the web assume that the pages of a web site share the same structure across languages. We believe that a non-negligible amount of parallel data is still spread across sources that do not satisfy this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing, which allows us to efficiently identify pairs of parallel segments located anywhere on the pages of a given web domain, regardless of their structure. We validate our method by realigning segments from a large parallel corpus. Another experiment, with real-world data provided by the Common Crawl Foundation, confirms that our solution scales to hundreds of terabytes of web-crawled data.
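To make the idea of hashing paragraph vectors more concrete, here is a minimal sketch of one common locality-sensitive hashing scheme: random-hyperplane hashing for cosine similarity. The abstract does not say which LSH family the authors use, so this choice is an assumption. The function names (`make_hyperplanes`, `signature`, `candidate_pairs`) and the toy three-dimensional vectors are hypothetical as well. In a real setting the vectors would come from a bilingual embedding model such as bivec, which is not shown here.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def candidate_pairs(src_vecs, tgt_vecs, hyperplanes):
    """Return (source_id, target_id) pairs whose vectors share a signature.

    Only target segments in the same bucket are compared with a source
    segment, which avoids scoring every source against every target.
    """
    buckets = defaultdict(list)
    for tgt_id, vec in tgt_vecs.items():
        buckets[signature(vec, hyperplanes)].append(tgt_id)

    pairs = []
    for src_id, vec in src_vecs.items():
        for tgt_id in buckets.get(signature(vec, hyperplanes), []):
            pairs.append((src_id, tgt_id))
    return pairs


# Toy vectors standing in for paragraph embeddings in a shared bilingual space.
planes = make_hyperplanes(dim=3, n_bits=8, seed=1)
source = {"en_1": [0.9, 0.1, 0.0], "en_2": [0.0, 0.2, 0.9]}
target = {"cs_1": [1.8, 0.2, 0.0], "cs_2": [-0.9, -0.1, 0.0]}
print(candidate_pairs(source, target, planes))
```

Vectors that point in the same direction always land in the same bucket, and vectors with a small angle between them do so with high probability. Candidate pairs found this way would still need to be scored and filtered by a separate step before being accepted as parallel.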


research
03/25/2011

User Modeling Combining Access Logs, Page Content and Semantics

The paper proposes an approach to modeling users of large Web sites base...
research
01/04/2020

Locality-Sensitive Hashing for Efficient Web Application Security Testing

Web application security has become a major concern in recent years, as ...
research
10/22/2020

Transform Data Complexity into Profitability through Data Mining Services

Data Mining experts are able to efficiently search and extract data from...
research
05/10/2011

The Hidden Web, XML and Semantic Web: A Scientific Data Management Perspective

The World Wide Web no longer consists just of HTML pages. Our work sheds...
research
06/12/2011

Evolutionary Biclustering of Clickstream Data

Biclustering is a two way clustering approach involving simultaneous clu...
research
01/04/2013

Similarity Assessment through blocking and affordance assignment in Textual CBR

It has been conceived that children learn new objects through their affo...
research
01/26/2018

Can Common Crawl reliably track persistent identifier (PID) use over time?

We report here on the results of two studies using two and four monthly ...
