UniTabE: Pretraining a Unified Tabular Encoder for Heterogeneous Tabular Data

07/18/2023
by Yazheng Yang, et al.

Recent advancements in Natural Language Processing (NLP) have demonstrated the groundbreaking impact of pretrained models, yielding impressive outcomes across a variety of tasks. This study seeks to extend the power of pretraining methodologies to tabular data, a domain that has traditionally been overlooked yet is inherently challenging due to the plethora of table schemas intrinsic to different tasks. The primary research questions underpinning this work concern adaptation to heterogeneous table structures, the establishment of a universal pretraining protocol for tabular data, the generalizability and transferability of learned knowledge across tasks, adaptation to diverse downstream applications, and the incorporation of incremental columns over time. In response to these challenges, we introduce UniTabE, a pioneering method designed to process tables in a uniform manner, free of constraints imposed by specific table structures. UniTabE's core idea is to represent each basic table element with a module termed TabUnit, followed by a Transformer encoder that refines the representation. The model further supports pretraining and finetuning through free-form prompts. For the pretraining phase, we curated an expansive tabular dataset comprising approximately 13 billion samples, meticulously gathered from the Kaggle platform. Rigorous experiments and analyses were conducted across a wide range of scenarios to validate the effectiveness of our methodology. The results demonstrate UniTabE's superior performance over several baseline models across a multitude of benchmark datasets, underscoring its potential to significantly enhance the semantic representation of tabular data and marking a significant stride in the field of tabular data analysis.
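The abstract describes an architecture in which every basic table element is encoded by a TabUnit module and the resulting sequence of cell representations is then refined by a Transformer encoder. The PyTorch sketch below is a minimal illustration of that idea only: the class names, the gated fusion of column-name and cell-value embeddings, the mean pooling, and all hyperparameters are illustrative assumptions, not the authors' reference implementation, and the paper's actual tokenization, prompt handling, and pretraining objectives are not reproduced here.

```python
# Hypothetical sketch of the "TabUnit per cell, then Transformer encoder" idea.
# Design details below are assumptions for illustration, not UniTabE's actual code.
import torch
import torch.nn as nn


class TabUnit(nn.Module):
    """Encodes one table cell (column name, data type, cell value) into a vector."""

    def __init__(self, vocab_size: int, n_dtypes: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # shared token embeddings
        self.dtype_emb = nn.Embedding(n_dtypes, d_model)    # numeric / categorical / text ...
        self.gate = nn.Linear(2 * d_model, d_model)         # fuse column-name and value info

    def forward(self, name_ids, value_ids, dtype_id):
        # name_ids, value_ids: (batch, tokens); dtype_id: (batch,)
        name = self.tok_emb(name_ids).mean(dim=1)           # pool column-name tokens
        value = self.tok_emb(value_ids).mean(dim=1)         # pool cell-value tokens
        fused = torch.tanh(self.gate(torch.cat([name, value], dim=-1)))
        return fused + self.dtype_emb(dtype_id)             # (batch, d_model)


class UniTabEEncoder(nn.Module):
    """Stacks per-cell TabUnit embeddings and refines them with a Transformer encoder."""

    def __init__(self, vocab_size=30522, n_dtypes=4, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tab_unit = TabUnit(vocab_size, n_dtypes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, cells):
        # cells: list of (name_ids, value_ids, dtype_id) tuples, one entry per column
        units = [self.tab_unit(n, v, d) for n, v, d in cells]  # each (batch, d_model)
        seq = torch.stack(units, dim=1)                         # (batch, n_cols, d_model)
        return self.encoder(seq)                                # contextualized cell vectors
```

Because each column contributes one TabUnit vector regardless of its type or name, the Transformer can attend across heterogeneous columns without assuming any fixed schema, which is the table-structure-agnostic property the abstract emphasizes.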


