Language Model for Text Analytics in Cybersecurity

04/06/2022
by Ehsan Aghaei, et al.

Natural language processing (NLP) is a branch of artificial intelligence and machine learning concerned with a computer's ability to understand and interpret human language. Language models are crucial in text analytics and NLP because they allow computers to interpret qualitative text and convert it into quantitative representations that can be used in downstream tasks. In the transfer-learning paradigm, language models are first trained on a large generic corpus (the pre-training stage) and then fine-tuned for a specific downstream task. As a result, pre-trained language models commonly serve as base models that encode a broad grasp of context and can be further customized for a new NLP task. The majority of pre-trained models are trained on corpora from general domains, such as Twitter, newswire, Wikipedia, and the Web, and such off-the-shelf models can be inefficient and inaccurate in specialized fields. In this paper, we propose a cybersecurity language model called SecureBERT, which captures the connotations of cybersecurity text and can therefore support automation of many important cybersecurity tasks that would otherwise rely on human expertise and tedious manual effort. SecureBERT is trained on a large corpus of cybersecurity text that we collected and preprocessed from a variety of sources in the cybersecurity and general computing domains. Using our proposed methods for tokenization and model-weight adjustment, SecureBERT not only preserves the understanding of general English that most pre-trained language models provide, but is also effective when applied to text with cybersecurity implications.
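The abstract's mention of tokenization and model-weight adjustment maps onto a common pattern for domain adaptation of masked language models. The sketch below, in Python with the Hugging Face transformers library, is a generic illustration of that pattern rather than the authors' exact procedure; the token list is hypothetical.

```python
# Generic domain-adaptation sketch (not the paper's exact method): extend a
# general-purpose tokenizer with domain terms, then grow the embedding matrix
# so the new vocabulary entries get trainable weight rows.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical cybersecurity terms that a general-domain vocabulary may split
# into many subwords or miss entirely.
new_tokens = ["ransomware", "exfiltration", "keylogger"]
num_added = tokenizer.add_tokens(new_tokens)

# New embedding rows are randomly initialized by default and would be learned
# during continued pre-training on the domain corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

Once such a model is trained, its domain grasp can be probed with masked-token prediction. Assuming the released checkpoint is available on the Hugging Face Hub under an ID like "ehsanaghaei/SecureBERT" (an assumption, not confirmed by this page):

```python
from transformers import pipeline

# fill-mask returns the highest-scoring completions for the <mask> slot;
# a cybersecurity-tuned model should favor domain-appropriate words here.
fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT")
for candidate in fill_mask("The malware achieves <mask> by modifying registry run keys."):
    print(f"{candidate['token_str']:>15}  score={candidate['score']:.3f}")
```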


Related Research

05/16/2023
Pre-Training to Learn in Context
In-context learning, where pre-trained language models learn to perform ...

02/22/2023
Learning from Multiple Sources for Data-to-Text and Text-to-Data
Data-to-text (D2T) and text-to-data (T2D) are dual tasks that convert st...

02/10/2021
Customizing Contextualized Language Models for Legal Document Reviews
Inspired by the inductive transfer learning on computer vision, many eff...

04/03/2023
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
In recent years, pre-trained language models (PLMs) achieve the best per...

05/13/2022
PathologyBERT – Pre-trained Vs. A New Transformer Language Model for Pathology Domain
Pathology text mining is a challenging task given the reporting variabil...

07/15/2021
Spanish Language Models
This paper presents the Spanish RoBERTa-base and RoBERTa-large models, a...

04/18/2021
Documenting the English Colossal Clean Crawled Corpus
As language models are trained on ever more text, researchers are turnin...
