Text Data Augmentation: Towards better detection of spear-phishing emails

07/04/2020
by   Mehdi Regina, et al.
0

Text data augmentation, i.e. the creation of synthetic textual data from an original text, is challenging as augmentation transformations should take into account language complexity while being relevant to the target Natural Language Processing (NLP) task (e.g. Machine Translation, Question Answering, Text Classification, etc.). Motivated by a business application of Business Email Compromise (BEC) detection, we propose a corpus and task agnostic text augmentation framework combining different methods, utilizing BERT language model, multi-step back-translation and heuristics. We show that our augmentation framework improves performances on several text classification tasks using publicly available models and corpora (SST2 and TREC) as well as on a BEC detection task. We also provide a comprehensive argumentation about the limitations of our augmentation framework.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset