Machine Learning is a subfield of computer science that aims to give computers the ability to learn from data instead of being explicitly programmed, thus leveraging the petabytes of data that exists on the internet nowadays to make decisions, and do tasks that are somewhere impossible or just complicated and time consuming for us humans.
Malware is one the imminent threats that companies and users face every day. Whether it is a phishing email or an exploit delivered throughout the browser, coupled with multiple evasion methods and other security vulnerabilities, it is a proven fact that nowadays defense systems cannot compete. The availability of frameworks such as Veil, Shelter, and others are known to be used by professionals when conducting pentesting work and are known to be quite effective.
Today I am going to show you that indeed Machine Learning can be used to detect Malware without having to use neither a signature detection nor a behavioral analysis.
P.S: Many products nowadays like CylanceProtect, SentinelOne, Carbon Black are known to leverage these capabilities the framework we are going to develop trough out this session is not at any level capable of doing what these products do, and I will explain shortly why.
Machine Learning a brief Introduction
Machine Learning is a subfield that mixes many domains of mathematics mainly Statistics and Probabilities and Linear Algebra and Computation (Algorithms, Data Processing, Numerical Calculations). To gain insight from data it is used to detect fraud, spam and recommending movies and meals and products to buy, Amazon, Facebook, Google to name a few of the hundreds of companies that use Machine learning to improve their products.
Machine Learning can be split into two major methods supervised learning and unsupervised learning the first means that the data we are going to work with is labeled the second means it is unlabeled, detecting malware can be attacked using both methods, but we will focus on the first one since our goal is to classify files.
Classification is a sub domain of supervised learning it can be either binary (malware-not malware) or multi-class (cat-dog-pig-lama…) thus malware detection falls under binary classification.
Explaining Machine Learning is beyond this article, and nowadays you can find a large amount of resources to know more about it, and you can check the Appendix for more of these resources.
The Problem Set
Machine Learning works by defining a problem, collecting the data, processing the data to make it usable and then feeding it to the algorithms. This makes it quite hard to implement in everything for the extensive amount of resources you may need to do this; this is called the machine learning workflow it is the minimal steps you need to start doing Machine Learning.
In our case let’s define our workflow:
First, we need to collect malware samples and clean samples we cannot work with less than 10k samples of both, and it is advisable to use even more of these
We need to extract meaningful features from our samples these features will be the basis of our study; features are what describe something, for example, the features of a house are:
number of rooms
SQ foot of the house
After extracting these features, we need to process all our samples to build a dataset it can be a database file or a CSV file this way it will be easier to turn it into vectors since the algorithms work by performing computation on vectors
Lastly, we need metrics in this binary classification there are a multitude of metrics to benchmark the performance of an algorithm (ROC/AUC, Confusion Matrix…) we will use a confusion matrix since it represents the rates of True Positives and True Negatives as well as False Positives and False Negatives.
Collecting Samples and Feature Extraction
I assume the reader knows about the PE File Format if you do not you can read about it here, collecting samples is quite easy you can either use a paid service like (VirusTotal) or one of the links here
Okay, let’s start on by discussing our model.
For our algorithm to learn from the data you feed it we need to make that data understandable and clear, in our case, we will use 12 features to teach our algorithm these features will be extracted from each binary and organized into a CSV file once.
To extract features, we will be using pefile. First Step is to download pefile I assume you know some Python and how to use pip.
From your terminal run:
pip install pefile
Now that you have the necessary tools let’s write some code, but first let’s discuss what kind of information we want to extract. We are interested in extracting the following fields of a PE File:
Major Image Version: Used to indicate the major version number of the application; in Microsoft Excel version 4.0, it would be 4.
Virtual Adress and Size of the IMAGE_DATA_DIRECTORY
Import Adress Table Adress
Number Of Sections
Size of Stack Reserve
Export Table Size and Adresshttps://github.com/erocarrera/pefile/
Read more : https://resources.infosecinstitute.com/machine-learning-malware-detection/