Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation

by Damien Sileo, et al.

The HuggingFace Datasets Hub hosts thousands of datasets, which provides exciting opportunities for language model training and evaluation. However, datasets for a given type of task are stored with different schemas, and harmonization is harder than it seems (https://xkcd.com/927/). Multi-task training or evaluation requires manual work to fit data into task templates. Various initiatives independently address this problem by releasing harmonized datasets or harmonization code that preprocesses datasets into a common format. We identify patterns across previous preprocessing efforts, e.g. mapping of column names and extraction of a specific sub-field from structured data in a column, and propose a structured annotation framework that keeps our annotations fully exposed rather than buried in unstructured code. We release a dataset annotation framework and dataset annotations for more than 400 English tasks (https://github.com/sileod/tasksource). These annotations provide metadata, such as the names of the columns that should be used as inputs or labels for each dataset, and can save time for future dataset preprocessing, even for work that does not use our framework. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size on an external evaluation (https://hf.co/sileod/deberta-v3-base-tasksource-nli).
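The two patterns named in the abstract, column-name mapping and sub-field extraction, can be illustrated with a minimal sketch. This is not the tasksource API: the `ClassificationAnnotation` class, its field names (`sentence1`, `sentence2`, `labels`), and the toy rows below are all hypothetical, chosen only to show how an annotation can be kept as declarative data instead of ad hoc preprocessing code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Union

Row = Dict[str, Any]
# A field spec is either a source column name (column mapping)
# or a function pulling a value out of a row (sub-field extraction).
FieldSpec = Union[str, Callable[[Row], Any]]


@dataclass
class ClassificationAnnotation:
    """Hypothetical annotation: which source field feeds each target field."""
    sentence1: FieldSpec
    labels: FieldSpec
    sentence2: Optional[FieldSpec] = None

    def apply(self, rows: List[Row]) -> List[Row]:
        """Return rows rewritten into the shared target schema."""
        harmonized = []
        for row in rows:
            new_row = {}
            for target in ("sentence1", "sentence2", "labels"):
                spec = getattr(self, target)
                if spec is None:
                    continue  # target field not used by this dataset
                new_row[target] = spec(row) if callable(spec) else row[spec]
            harmonized.append(new_row)
        return harmonized


# Two toy datasets (made up) for similar tasks but with different schemas.
rows_a = [{"premise": "A man sleeps.", "hypothesis": "A person rests.", "gold": 0}]
rows_b = [{"text": "Great movie", "meta": {"sentiment": "pos"}}]

# Pattern 1: plain column renaming.
ann_a = ClassificationAnnotation(
    sentence1="premise", sentence2="hypothesis", labels="gold"
)
# Pattern 2: extracting a sub-field from structured data in a column.
ann_b = ClassificationAnnotation(
    sentence1="text", labels=lambda row: row["meta"]["sentiment"]
)

harmonized_a = ann_a.apply(rows_a)
harmonized_b = ann_b.apply(rows_b)
```

Because each annotation is a small data object, the input and label columns of a dataset can be read directly from it, even by code that never calls `apply`.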




