An Efficient Active Learning Pipeline for Legal Text Classification

11/15/2022
by Sepideh Mamooler, et al.

Active Learning (AL) is a powerful tool for learning with less labeled data, particularly in specialized domains such as legal documents, where unlabeled data is abundant but annotation requires domain expertise and is thus expensive. Recent work has shown the effectiveness of AL strategies for pre-trained language models. However, most AL strategies require a set of labeled samples to start with, which is expensive to acquire. In addition, pre-trained language models have been shown to be unstable when fine-tuned on small datasets, and their embeddings are not semantically meaningful. In this work, we propose a pipeline for effectively using active learning with pre-trained language models in the legal domain. To this end, we leverage the available unlabeled data in three phases. First, we continue pre-training the model to adapt it to the downstream task. Second, we use knowledge distillation to guide the model's embeddings toward a semantically meaningful space. Finally, we propose a simple yet effective strategy to find the initial set of labeled samples with fewer annotation actions than existing methods. Our experiments on Contract-NLI, adapted to the classification task, and on the LEDGAR benchmark show that our approach outperforms standard AL strategies and is more efficient. Furthermore, our pipeline reaches results comparable to the fully supervised approach, with a small performance gap and a dramatically reduced annotation cost. Code and the adapted data will be made available.
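The two sampling steps mentioned above can be illustrated in code. Below is a minimal sketch, not the paper's exact method: a cold-start seed selector that clusters unlabeled embeddings with a small k-means and returns the example nearest each centroid (a common way to obtain a diverse initial labeled set without any labels), and a standard entropy-based uncertainty-sampling step for subsequent AL rounds. Function names and parameters are illustrative assumptions.

```python
import numpy as np

def coldstart_select(embeddings, k, n_iter=10, seed=0):
    """Pick k diverse seed examples for the initial labeled set.

    Sketch: run a small k-means over the (unlabeled) embeddings and
    return the index of the point nearest each centroid. This is an
    assumed illustration of cold-start selection, not the paper's
    exact strategy.
    """
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            members = embeddings[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
    # nearest real example to each centroid (deduplicated)
    return sorted(set(dists.argmin(axis=0).tolist()))

def entropy_sampling(probs, budget):
    """Standard uncertainty sampling for later AL rounds: select the
    `budget` unlabeled examples whose predicted class distribution has
    the highest entropy."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]
```

In a full loop, `coldstart_select` would provide the first batch sent to annotators; after each fine-tuning round, `entropy_sampling` (or any other acquisition function) would pick the next batch from the remaining pool.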

Related research

Bayesian Active Learning with Pretrained Language Models (04/16/2021)
Active Learning (AL) is a method to iteratively select data for annotati...

Cold-start Active Learning through Self-supervised Language Modeling (10/19/2020)
Active learning strives to reduce annotation costs by choosing the most ...

ATM: An Uncertainty-aware Active Self-training Framework for Label-efficient Text Classification (12/16/2021)
Despite the great success of pre-trained language models (LMs) in many n...

Smooth Sailing: Improving Active Learning for Pre-trained Language Models with Representation Smoothness Analysis (12/20/2022)
Developed as a solution to a practical need, active learning (AL) method...

Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification (07/20/2023)
Recent work has shown that language models' (LMs) prompt-based learning ...

Deep Active Learning with Budget Annotation (07/31/2022)
Digital data collected over the decades and data currently being produce...

ALLWAS: Active Learning on Language models in WASserstein space (09/03/2021)
Active learning has emerged as a standard paradigm in areas with scarcit...
