Author Topic: Machine Learning for Malware Detection  (Read 392 times)


Offline Flavio58

Machine Learning for Malware Detection
« Reply #1 on: May 21, 2018, 08:56:02 pm »

Machine Learning is a subfield of computer science that aims to give computers the ability to learn from data instead of being explicitly programmed. It leverages the petabytes of data that exist on the internet today to make decisions and to perform tasks that are either impossible or simply too complicated and time-consuming for us humans.

Malware is one of the most pressing threats that companies and users face every day. Whether it arrives as a phishing email or as an exploit delivered through the browser, malware combined with multiple evasion methods and other security vulnerabilities regularly gets past today's defense systems. Frameworks such as Veil, Shellter, and others are freely available, are used by professionals when conducting pentesting work, and are known to be quite effective.

Today I am going to show you that Machine Learning can indeed be used to detect malware without relying on either signature detection or behavioral analysis.

P.S.: Many products nowadays, such as CylanceProtect, SentinelOne, and Carbon Black, are known to leverage these capabilities. The framework we are going to develop throughout this session is in no way capable of doing what these products do, and I will explain why shortly.

Machine Learning: A Brief Introduction
Machine Learning mixes several domains of mathematics, mainly statistics, probability, and linear algebra, with computation (algorithms, data processing, numerical calculation). It is used to gain insight from data: detecting fraud and spam, and recommending movies, meals, and products to buy. Amazon, Facebook, and Google are just a few of the hundreds of companies that use Machine Learning to improve their products.

Machine Learning can be split into two major approaches: supervised learning and unsupervised learning. In the first, the data we work with is labeled; in the second, it is unlabeled. Malware detection can be tackled with either approach, but we will focus on the first one, since our goal is to classify files.

Classification is a subdomain of supervised learning. It can be either binary (malware / not malware) or multi-class (cat, dog, pig, llama…). Malware detection therefore falls under binary classification.
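To make "labeled data" and "binary classification" concrete, here is a toy sketch. It is not the model this article builds: it uses a nearest-centroid rule chosen only because it fits in a few lines, and the two-dimensional feature vectors are made up for illustration.

```python
# Toy illustration of supervised binary classification (not the article's model).
# Each sample is a feature vector; each label is 1 (malware) or 0 (clean).
# All numbers below are invented for the example.

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(samples, labels):
    """Learn one centroid per class from the labeled samples."""
    return {
        label: centroid([s for s, l in zip(samples, labels) if l == label])
        for label in set(labels)
    }

def predict(model, sample):
    """Return the label whose centroid is closest to the sample."""
    return min(model, key=lambda label: squared_distance(model[label], sample))

train_x = [[1.0, 1.0], [1.5, 0.5], [8.0, 9.0], [9.0, 8.5]]
train_y = [0, 0, 1, 1]  # the labels are what make this "supervised"

model = fit(train_x, train_y)
print(predict(model, [1.2, 0.8]))  # close to the clean samples
print(predict(model, [8.5, 9.2]))  # close to the malware samples
```

The point is only the shape of the problem: vectors in, one of two labels out, with the mapping learned from labeled examples.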

Explaining Machine Learning in depth is beyond the scope of this article. Plenty of resources are available nowadays if you want to know more, and the Appendix lists some of them.

The Problem Set
Machine Learning works by defining a problem, collecting the data, processing the data to make it usable, and then feeding it to the algorithms. This sequence is called the machine learning workflow: the minimal set of steps you need to start doing Machine Learning. The extensive amount of resources these steps may require is what makes Machine Learning hard to apply to everything.

In our case let’s define our workflow:

First, we need to collect malware samples and clean samples. We cannot work with fewer than 10k samples of each, and it is advisable to use even more.
Next, we need to extract meaningful features from our samples; these features will be the basis of our study. Features are what describe something. For example, the features of a house are:
number of rooms
square footage of the house
price
After extracting these features, we need to process all our samples to build a dataset. It can be a database file or a CSV file; either makes it easier to turn the data into vectors, since the algorithms work by performing computations on vectors.
Lastly, we need metrics. For binary classification there is a multitude of metrics to benchmark the performance of an algorithm (ROC/AUC, confusion matrix…). We will use a confusion matrix, since it reports the numbers of True Positives and True Negatives as well as False Positives and False Negatives.
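As a minimal sketch of the last step, the four cells of a binary confusion matrix can be counted by hand. The function names and the sample labels below are hypothetical and only serve the illustration; 1 stands for malware and 0 for clean.

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = malware, 0 = clean)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            counts["TP"] += 1   # malware correctly flagged
        elif truth == 0 and guess == 0:
            counts["TN"] += 1   # clean file correctly passed
        elif truth == 0 and guess == 1:
            counts["FP"] += 1   # clean file wrongly flagged
        else:
            counts["FN"] += 1   # malware missed
    return counts

def rates(counts):
    """Turn raw counts into true-positive and false-positive rates."""
    positives = counts["TP"] + counts["FN"]
    negatives = counts["TN"] + counts["FP"]
    return {
        "TPR": counts["TP"] / positives if positives else 0.0,
        "FPR": counts["FP"] / negatives if negatives else 0.0,
    }

# Made-up ground truth and predictions for eight files.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)         # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
print(rates(cm))  # {'TPR': 0.75, 'FPR': 0.25}
```

For a malware detector both error types matter: a false negative lets a malicious file through, while a false positive blocks a legitimate one.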

Collecting Samples and Feature Extraction
I assume the reader knows about the PE file format; if you do not, you can read about it here. Collecting samples is quite easy: you can either use a paid service such as VirusTotal or one of the links here

Okay, let’s start by discussing our model.

For our algorithm to learn from the data we feed it, we need to make that data understandable and clear. In our case, we will use 12 features to teach our algorithm. These features will be extracted from each binary and organized into a CSV file.
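A minimal sketch of that CSV step, using only the standard library. The column names (`feature_a`, `feature_b`, `label`) and the helper names are placeholders invented for this example, and the values are assumed to be integers.

```python
import csv
import os
import tempfile

def write_dataset(rows, path, label_column="label"):
    """Write a list of feature dictionaries to a CSV file, label column last."""
    fieldnames = [key for key in rows[0] if key != label_column] + [label_column]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

def read_dataset(path, label_column="label"):
    """Read the CSV back as (feature_vectors, labels), parsing values as ints."""
    vectors, labels = [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            labels.append(int(row.pop(label_column)))
            vectors.append([int(value) for value in row.values()])
    return vectors, labels

# Two made-up samples: one labeled malware (1), one labeled clean (0).
rows = [
    {"feature_a": 4, "feature_b": 5, "label": 1},
    {"feature_a": 0, "feature_b": 3, "label": 0},
]
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
write_dataset(rows, path)
vectors, labels = read_dataset(path)
print(vectors, labels)
```

Keeping one row per binary and one column per feature means each row can be read back directly as the vector the algorithm works on.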

Feature Extraction
To extract features, we will be using pefile. The first step is to download pefile; I assume you know some Python and how to use pip.

From your terminal run:

pip install pefile

Now that you have the necessary tools, let’s write some code, but first let’s discuss what kind of information we want to extract. We are interested in extracting the following fields of a PE file:

Major Image Version: Used to indicate the major version number of the application; for Microsoft Excel version 4.0, it would be 4.
Virtual Address and Size of the IMAGE_DATA_DIRECTORY
OS Version
Import Address Table Address
Resources Size
Number Of Sections
Linker Version
Size of Stack Reserve
DLL Characteristics
Export Table Size and Address
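The fields listed above can be read with pefile roughly as follows. This is a sketch, not the original article's code: the dictionary keys and helper names are invented here, and because the list does not say which data directory the "Virtual Address and Size" entry refers to, the import directory is used purely as an assumption. The major version is likewise assumed for "OS Version" and "Linker Version".

```python
# Indices into the optional header's data directory array, as defined by the
# PE format: export = 0, import = 1, resource = 2, import address table = 12.
EXPORT, IMPORT, RESOURCE, IAT = 0, 1, 2, 12

def directory(pe, index):
    """Return (virtual address, size) of one data directory, or (0, 0) if absent."""
    entries = pe.OPTIONAL_HEADER.DATA_DIRECTORY
    if index < len(entries):
        return entries[index].VirtualAddress, entries[index].Size
    return 0, 0

def features_from_pe(pe):
    """Build a 12-entry feature dictionary from a parsed pefile.PE object."""
    opt = pe.OPTIONAL_HEADER
    dir_rva, dir_size = directory(pe, IMPORT)  # assumption: import directory
    export_rva, export_size = directory(pe, EXPORT)
    iat_rva, _ = directory(pe, IAT)
    _, resource_size = directory(pe, RESOURCE)
    return {
        "MajorImageVersion": opt.MajorImageVersion,
        "DirectoryRVA": dir_rva,
        "DirectorySize": dir_size,
        "MajorOSVersion": opt.MajorOperatingSystemVersion,
        "IatRVA": iat_rva,
        "ResourceSize": resource_size,
        "NumberOfSections": pe.FILE_HEADER.NumberOfSections,
        "MajorLinkerVersion": opt.MajorLinkerVersion,
        "SizeOfStackReserve": opt.SizeOfStackReserve,
        "DllCharacteristics": opt.DllCharacteristics,
        "ExportRVA": export_rva,
        "ExportSize": export_size,
    }

def extract_features(path):
    """Parse the file at `path` with pefile and return its feature dictionary."""
    import pefile  # imported here so the helpers above work without pefile
    # fast_load parses the headers but skips the directory contents,
    # which is enough for the header fields read above.
    pe = pefile.PE(path, fast_load=True)
    try:
        return features_from_pe(pe)
    finally:
        pe.close()
```

Calling `extract_features("sample.exe")` (a hypothetical path) would return one row for the dataset; pefile raises `pefile.PEFormatError` for files that are not valid PE binaries, so a batch run over many samples should catch that and skip the file.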

https://github.com/erocarrera/pefile/


Read more : https://resources.infosecinstitute.com/machine-learning-malware-detection/


