Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards

08/16/2021
by Angelina McMillan-Major, et al.

Developing documentation guidelines and easy-to-use templates for datasets and models is a challenging task, especially given the variety of backgrounds, skills, and incentives of the people involved in building natural language processing (NLP) tools. Nevertheless, the adoption of standard documentation practices across the field of NLP promotes more accessible and detailed descriptions of NLP datasets and models, while supporting researchers and developers in reflecting on their work. To help with the standardization of documentation, we present two case studies of efforts that aim to develop reusable documentation templates: the HuggingFace data card, a general-purpose card for datasets in NLP, and the GEM benchmark data and model cards, which focus on natural language generation. We describe our process for developing these templates, including the identification of relevant stakeholder groups, the definition of a set of guiding principles, the use of existing templates as our foundation, and iterative revisions based on feedback.

research
05/28/2021

Natural Language Processing 4 All (NLP4All): A New Online Platform for Teaching and Learning NLP Concepts

Natural Language Processing offers new insights into language data acros...
research
05/07/2023

LatinCy: Synthetic Trained Pipelines for Latin NLP

This paper introduces LatinCy, a set of trained general purpose Latin-la...
research
11/16/2021

STAMP 4 NLP – An Agile Framework for Rapid Quality-Driven NLP Applications Development

The progress in natural language processing (NLP) research over the last...
research
09/15/2017

Harvesting Creative Templates for Generating Stylistically Varied Restaurant Reviews

Many of the creative and figurative elements that make language exciting...
research
08/03/2020

Towards a Semantic Model of the GDPR Register of Processing Activities

A core requirement for GDPR compliance is the maintenance of a register ...
research
03/26/2021

An Automated Multiple-Choice Question Generation Using Natural Language Processing Techniques

Automatic multiple-choice question generation (MCQG) is a useful yet cha...
research
03/10/2019

Contextualised concept embedding for efficiently adapting natural language processing models for phenotype identification

Many efforts have been put to use automated approaches, such as natural ...
