LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

by Tiezhu Sun et al.

Transformer-based models have revolutionized the performance of a wide range of language tasks. Intuitively, one might expect text classification, which does not require as many high-level representations as generative tasks, to be comprehensively addressed by the powerful representation capabilities of Transformers. In reality, however, there remains significant room for improvement, particularly in multi-class and multi-label classification of lengthy textual documents and other large files. The performance of Transformer-based models is mainly hindered by a major limitation: a restricted input length, e.g., 512 tokens for BERT. While more GPU memory can marginally extend this limit, practical real-world applications often operate under constrained GPU resources. In this work, we tackle the input-limit problem from the perspective of correlated multiple instance learning. The proposed approach, LaFiCMIL, serves as a versatile framework for large file classification, covering binary, multi-class, and multi-label tasks across domains including Natural Language Processing, Programming Language Processing, and Android Analysis. To evaluate its effectiveness, we employ eight benchmark datasets pertaining to Long Document Classification, Code Defect Detection, and Android Malware Detection. Leveraging BERT-family models as feature extractors, our experimental results demonstrate that LaFiCMIL achieves new state-of-the-art performance across all benchmark datasets. This is largely attributable to its capability of scaling BERT up to nearly 20K tokens, running on a single Tesla V100 GPU with 32 GB of memory.
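The multiple-instance-learning framing described above can be sketched in a few lines: a long token sequence is split into fixed-size "instances" that each fit a BERT-style encoder's 512-token limit, the instances are encoded independently, and their features are aggregated with attention pooling into a single bag-level representation for classification. The sketch below is an illustration of this general idea, not the authors' implementation; the random-projection instance encoder, parameter shapes, and the plain attention aggregator are all stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
CHUNK, DIM, VOCAB, NUM_CLASSES = 512, 64, 30522, 2

# Random stand-ins for learned parameters. In the paper the instance
# encoder is a BERT-family model; here we use a mean of random token
# embeddings so the sketch runs anywhere, with no model download.
W_emb = rng.standard_normal((VOCAB, DIM)) * 0.02   # token embedding table
w_attn = rng.standard_normal(DIM)                  # attention scoring vector
W_head = rng.standard_normal((DIM, NUM_CLASSES)) * 0.02  # classifier head

def mil_classify(token_ids):
    # Pad so the long sequence splits evenly into 512-token instances.
    pad = (-len(token_ids)) % CHUNK
    ids = np.concatenate([token_ids, np.zeros(pad, dtype=int)])
    instances = ids.reshape(-1, CHUNK)            # (n_inst, 512)
    # Encode each instance independently (stand-in for BERT features).
    feats = W_emb[instances].mean(axis=1)         # (n_inst, DIM)
    # Attention pooling over instances: softmax of per-instance scores.
    scores = feats @ w_attn                       # (n_inst,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    bag = weights @ feats                         # (DIM,) bag-level feature
    return bag @ W_head                           # (NUM_CLASSES,) logits

logits = mil_classify(rng.integers(0, VOCAB, size=2000))  # ~4 instances
print(logits.shape)  # (2,)
```

Because each 512-token instance is encoded separately, peak memory grows with the chunk size rather than the full document length, which is what allows a fixed-length encoder to cover sequences of tens of thousands of tokens.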

