A Chinese Dataset with Negative Full Forms for General Abbreviation Prediction

12/18/2017
by Yi Zhang, et al.

Abbreviation is a common phenomenon across languages, especially in Chinese. In most cases, if an expression can be abbreviated, its abbreviation is used more often than its fully expanded form, since people tend to convey information as concisely as possible. For many language processing tasks, abbreviations are an obstacle to better performance, because the textual form of an abbreviation does not convey useful information unless it is expanded to the full form. Abbreviation prediction means associating fully expanded forms with their abbreviations. However, because abbreviation corpora are scarce, current studies of this task are limited, especially given that general abbreviation prediction should also cover full-form expressions that have no valid abbreviation, namely negative full forms (NFFs). Few corpora incorporate negative full forms for general abbreviation prediction. To promote research in this area, we build a dataset for general Chinese abbreviation prediction, which requires a few preprocessing steps, and evaluate several different models on it. The dataset is available at https://github.com/lancopku/Chinese-abbreviation-dataset
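To make the task definition concrete, the following is a minimal Python sketch of how full forms, abbreviations, and negative full forms might be represented. The `Entry` record and the `keep_labels` helper are hypothetical names introduced here for illustration; they are not the format of the released dataset or the models evaluated by the authors. The sketch also assumes that an abbreviation is an ordered subsequence of the characters of its full form, which holds for many Chinese abbreviations but not all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical record: a full form and its abbreviation, if one exists."""
    full_form: str
    abbreviation: Optional[str]  # None marks a negative full form (NFF)

    @property
    def is_negative(self) -> bool:
        return self.abbreviation is None


def keep_labels(full_form: str, abbreviation: str) -> Optional[List[int]]:
    """Align an abbreviation to its full form, left to right.

    Returns one label per character of the full form (1 = kept in the
    abbreviation, 0 = dropped), or None when the abbreviation is not an
    ordered subsequence of the full form (the assumption stated above).
    """
    labels = [0] * len(full_form)
    position = 0
    for char in abbreviation:
        index = full_form.find(char, position)
        if index == -1:
            return None
        labels[index] = 1
        position = index + 1
    return labels


# A well-known pair: 北京大学 (Peking University) is abbreviated as 北大.
positive = Entry("北京大学", "北大")
# Placeholder NFF, made up for illustration only.
negative = Entry("某个没有缩写的短语", None)

print(keep_labels(positive.full_form, positive.abbreviation))
print(negative.is_negative)
```

Under this framing, a general predictor would first have to decide whether an entry is an NFF at all, and only then choose which characters to keep.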


