kTrans: Knowledge-Aware Transformer for Binary Code Embedding

by   Wenyu Zhu, et al.

Binary Code Embedding (BCE) has important applications in various reverse engineering tasks such as binary code similarity detection, type recovery, control-flow recovery and data-flow analysis. Recent studies have shown that the Transformer model can comprehend the semantics of binary code to support downstream tasks. However, existing models overlooked the prior knowledge of assembly language. In this paper, we propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding. By feeding explicit knowledge as additional inputs to the Transformer, and fusing implicit knowledge with a novel pre-training task, kTrans provides a new perspective to incorporating domain knowledge into a Transformer framework. We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR). Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2 https://github.com/Learner0x5a/kTrans-release


jTrans: Jump-Aware Transformer for Binary Code Similarity

Binary code similarity detection (BCSD) has important applications in va...

BinBert: Binary Code Understanding with a Fine-tunable and Execution-aware Transformer

A recent trend in binary code analysis promotes the use of neural soluti...

Semantic-aware Binary Code Representation with BERT

A wide range of binary analysis applications, such as bug discovery, mal...

PalmTree: Learning an Assembly Language Model for Instruction Embedding

Deep learning has demonstrated its strengths in numerous binary analysis...

Learning to Reverse DNNs from AI Programs Automatically

With the privatization deployment of DNNs on edge devices, the security ...

Asteria-Pro: Enhancing Deep-Learning Based Binary Code Similarity Detection by Incorporating Domain Knowledge

The widespread code reuse allows vulnerabilities to proliferate among a ...

LibAM: An Area Matching Framework for Detecting Third-party Libraries in Binaries

Third-party libraries (TPLs) are extensively utilized by developers to e...

Please sign up or login with your details

Forgot password? Click here to reset