Improving type information inferred by decompilers with supervised machine learning

01/19/2021
by Javier Escalada, et al.

In software reverse engineering, decompilation is the process of recovering source code from binary files. Decompilers are used when it is necessary to understand or analyze software for which the source code is not available. Although existing decompilers commonly obtain source code with the same behavior as the binaries, that source code is usually hard to interpret and certainly differs from the original code written by the programmer. Massive codebases could be used to build supervised machine learning models aimed at improving existing decompilers. In this article, we build different classification models capable of inferring the high-level type returned by functions, with significantly higher accuracy than existing decompilers. We automatically instrument C source code to allow the association of binary patterns with their corresponding high-level constructs. A dataset is created with a collection of real open-source applications plus a huge number of synthetic programs. Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure. Moreover, we document the binary patterns used by our classifier to allow their addition in the implementation of existing decompilers.
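
To make the idea of associating binary patterns with high-level return types concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the pattern list, the toy training set, and the hand-written Bernoulli naive Bayes classifier are all assumptions chosen for illustration. The only background fact it relies on is that, under the x86-64 System V calling convention, floating-point values are returned in xmm0 and integer or pointer values in eax/rax.

```python
import math

# Hypothetical binary patterns (NOT taken from the paper): substrings searched
# for in the instructions that precede `ret` in a function body.
PATTERNS = ["movsx eax", "movss xmm0", "movsd xmm0", "mov rax", "mov eax"]


def extract_features(asm_lines):
    """Turn a list of assembly lines into one 0/1 flag per pattern."""
    text = "\n".join(asm_lines)
    return tuple(int(pattern in text) for pattern in PATTERNS)


class BernoulliNaiveBayes:
    """Tiny Bernoulli naive Bayes classifier with Laplace smoothing."""

    def fit(self, samples, labels):
        self.classes = sorted(set(labels))
        self.prior = {}
        self.likelihood = {}
        for cls in self.classes:
            rows = [x for x, y in zip(samples, labels) if y == cls]
            self.prior[cls] = len(rows) / len(samples)
            # P(pattern present | class), smoothed so no probability is 0 or 1.
            self.likelihood[cls] = [
                (sum(row[i] for row in rows) + 1) / (len(rows) + 2)
                for i in range(len(PATTERNS))
            ]
        return self

    def predict(self, features):
        def log_score(cls):
            score = math.log(self.prior[cls])
            for present, p in zip(features, self.likelihood[cls]):
                score += math.log(p if present else 1.0 - p)
            return score

        return max(self.classes, key=log_score)


# Toy labeled data standing in for the instrumented C programs: each entry
# pairs the tail of a compiled function with its source-level return type.
TRAINING = [
    (["movsx eax, byte [rbp-1]", "ret"], "char"),
    (["mov eax, [rbp-4]", "ret"], "int"),
    (["mov rax, [rbp-8]", "ret"], "pointer"),
    (["movss xmm0, [rbp-4]", "ret"], "float"),
    (["movsd xmm0, [rbp-8]", "ret"], "double"),
]

model = BernoulliNaiveBayes().fit(
    [extract_features(asm) for asm, _ in TRAINING],
    [label for _, label in TRAINING],
)

predicted = model.predict(extract_features(["movss xmm0, [rbp-12]", "ret"]))
print(predicted)
```

A real system would need far richer features and far more data than this, but the shape is the same: labeled (binary pattern, source type) pairs are used to train a classifier, which is then queried on unseen functions.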


