Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords

07/06/2020
by Tom Kocmi, et al.

We present CzEng 2.0, a new release of the Czech-English parallel corpus, consisting of over 2 billion words (2 "gigawords") in each language. The corpus contains document-level information and is filtered with several techniques to reduce noise. In addition to the data from the previous version of CzEng, it contains new authentic data as well as high-quality synthetic parallel data. CzEng is freely available for research and educational purposes.


