Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

by   Savino Dambra, et al.

Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting an almost-perfect performance. However, they assemble ground-truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influence performance, and how well static and dynamic features complement each other. This work sheds light on those open questions. by investigating the key factors influencing ML-based malware detection and classification. For this, we collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features. We discover no correlation between packing and classification accuracy, and that missing behaviors in dynamically-extracted features highly penalize their performance. We also demonstrate how a larger number of families to classify make the classification harder, while a higher number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family better generalize on unseen data.


page 11

page 12


Integration of Static and Dynamic Analysis for Malware Family Classification with Composite Neural Network

Deep learning has been used in the research of malware analysis. Most cl...

MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

Malware family classification is a significant issue with public safety ...

Local Model Feature Transformations

Local learning methods are a popular class of machine learning algorithm...

Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security

Despite the potential of Machine learning (ML) to learn the behavior of ...

Fast and Efficient Malware Detection with Joint Static and Dynamic Features Through Transfer Learning

In malware detection, dynamic analysis extracts the runtime behavior of ...

"Influence Sketching": Finding Influential Samples In Large-Scale Regressions

There is an especially strong need in modern large-scale data analysis t...

Malware Detection Using Frequency Domain-Based Image Visualization and Deep Learning

We propose a novel method to detect and visualize malware through image ...

Please sign up or login with your details

Forgot password? Click here to reset