Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

by   Arpita Roy, et al.

Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a vari-ety of NLP tasks such as Named Entity Recognition, Syntac-tic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this pa-per, we describe a novel method to train domain-specificword embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifi-cally, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we de-velop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word em-bedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerability and Exposure (CVE) corpus. Our evaluation re-sults have demonstrated the effectiveness of our method in learning domain-specific word embeddings.


page 1

page 2

page 3

page 4


UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

Contextual word embedding models, such as BioBERT and Bio_ClinicalBERT, ...

Learning Word Embeddings with Domain Awareness

Word embeddings are traditionally trained on a large corpus in an unsupe...

Anchor Transform: Learning Sparse Representations of Discrete Objects

Learning continuous representations of discrete objects such as text, us...

Word Embeddings for Sentiment Analysis: A Comprehensive Empirical Survey

This work investigates the role of factors like training method, trainin...

Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain

Word embeddings are already well studied in the general domain, usually ...

Knowledge-Base Enriched Word Embeddings for Biomedical Domain

Word embeddings have been shown adept at capturing the semantic and synt...

Not just about size - A Study on the Role of Distributed Word Representations in the Analysis of Scientific Publications

The emergence of knowledge graphs in the scholarly communication domain ...

Please sign up or login with your details

Forgot password? Click here to reset