A Pre-Trained BERT Model for Android Applications

by   Tiezhu Sun, et al.

The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in two distinct class-level software engineering tasks: Malicious Code Localization and Defect Prediction. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.


page 1

page 2

page 3

page 4


What do pre-trained code models know about code?

Pre-trained models of code built on the transformer architecture have pe...

An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code

Large language models trained on source code can support a variety of so...

MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning

Representation learning of source code is essential for applying machine...

On The Cross-Modal Transfer from Natural Language to Code through Adapter Modules

Pre-trained neural Language Models (PTLM), such as CodeBERT, are recentl...

CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations

Pre-trained Language Models (PLMs) have achieved remarkable performance ...

BURT: BERT-inspired Universal Representation from Learning Meaningful Segment

Although pre-trained contextualized language models such as BERT achieve...

Text2App: A Framework for Creating Android Apps from Text Descriptions

We present Text2App – a framework that allows users to create functional...

Please sign up or login with your details

Forgot password? Click here to reset