The Danish Gigaword Project

Danish is a North Germanic (Scandinavian) language spoken primarily in Denmark, a country with a tradition of technological and scientific innovation. From a technological perspective, however, the Danish language has received relatively little attention. As a result, Danish language technology is hard to develop, in part because large or broad-coverage Danish corpora are lacking. This paper describes the Danish Gigaword project, which aims to construct a freely available one-billion-word corpus of Danish text that represents the breadth of the written language.
