Semantic-aware Binary Code Representation with BERT

by Hyungjoon Koo, et al.

A wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection, require recovering the contextual meaning of binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary instead of manually crafting the specifics of the analysis algorithm. However, existing machine-learning approaches are still specialized to a single problem domain, so a new model has to be built for each type of binary analysis. In this paper, we propose DeepSemantic, which uses BERT to produce a semantic-aware code representation of binary code. To this end, we introduce a well-balanced instruction normalization that preserves rich information for each instruction while minimizing the out-of-vocabulary (OOV) problem. DeepSemantic has been carefully designed based on our study of large swaths of binaries. In addition, DeepSemantic follows the essence of the BERT architecture: a generic model is pre-trained once and can then be re-purposed quickly for specific downstream tasks through fine-tuning. We demonstrate DeepSemantic with two downstream tasks, namely binary similarity comparison and compiler provenance (i.e., compiler and optimization level) prediction. Our experimental results show that the binary similarity model outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, 49.84 respectively.
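To make the trade-off behind instruction normalization concrete, here is a minimal Python sketch. It is not DeepSemantic's normalization: the abstract does not list the actual rules, so the `immval` placeholder, the regular expression, and the `normalize` helper are all illustrative assumptions. The sketch only shows the general idea that replacing concrete constants with a placeholder keeps the opcode and register information while collapsing many rare tokens into a few common ones, which shrinks the vocabulary a BERT-style model has to learn.

```python
import re

# Hypothetical placeholder token; the paper's real rule set is not given in the abstract.
IMM = "immval"

# Matches standalone hex or decimal constants, but not digits inside names such as r8 or xmm0.
_CONSTANT = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def normalize(instruction: str) -> str:
    """Turn one assembly instruction into a single vocabulary token."""
    text = _CONSTANT.sub(IMM, instruction.strip().lower())
    opcode, _, rest = text.partition(" ")
    operands = [op.replace(" ", "") for op in rest.split(",") if op.strip()]
    return "_".join([opcode] + operands)


def vocabulary(instructions):
    """Collect the set of distinct tokens after normalization."""
    return {normalize(ins) for ins in instructions}


if __name__ == "__main__":
    sample = ["mov eax, 0x10", "mov eax, 0x20", "add rsp, 8", "mov r8, rax", "ret"]
    for ins in sample:
        print(f"{ins!r:20} -> {normalize(ins)}")
    print(len(set(sample)), "raw instructions,", len(vocabulary(sample)), "tokens")
```

In this toy sample the two `mov eax, <constant>` instructions map to the same token, so unseen constants at inference time no longer produce out-of-vocabulary tokens, while register names such as `r8` are left intact.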




