My Cybersecurity Machine Learning Research / Part1

Nov 8, 2019

http://thenewsdoctors.com/wp-content/uploads/2014/10/mad-scientist.jpeg

Author: Christian Camilo Urcuqui López

Date: 04 November 2019

In this post I would like to share some of my research results. I will explain, for readers interested in this area of cybersecurity, how I used data science to train machine learning models to detect threats. For this part, I drew on ideas from my book [1] and other articles, which I will refer to throughout the explanation.

This research is my philanthropic work, so the idea of sharing it is to get feedback, networking, and citations of these results (hacker culture ☠).

Android

Android is one of the most used mobile operating systems worldwide. Due to its technological impact, its open-source code, and the possibility of installing third-party applications without any central control, Android has become a malware target. Although it includes security mechanisms, recent news about malicious activities and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks that improve its security.

To prevent malware attacks, researchers and developers have proposed different security solutions that apply static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.

We can analyze cyber threats using two techniques: static analysis and dynamic analysis. Most importantly, these are the approaches that give us the features we are going to use in data science.

Static analysis: the methods that allow us to get information about the software we want to analyze without executing it, for example by studying its code, calls, resources, etc.
Dynamic analysis: the approach where the idea is to analyze the cyber threat during its execution, in other words, to get information about its behavior; network flows (netflows) are one example of such features.
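
To make the static side concrete, here is a minimal sketch of reading the permissions an application requests without executing it. It assumes the manifest has already been decoded to plain XML (inside a real APK it is stored as binary XML); the `MANIFEST` string, the package name, and the `extract_permissions` helper are hypothetical and are not part of the framework described in this post.

```python
import xml.etree.ElementTree as ET

# Namespace used by Android manifest attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical, already-decoded manifest (real APKs store it as binary XML).
MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET" />
    <uses-permission android:name="android.permission.READ_SMS" />
    <application android:label="Demo" />
</manifest>
"""

def extract_permissions(manifest_xml):
    """Return the permissions requested in a decoded AndroidManifest.xml."""
    root = ET.fromstring(manifest_xml.strip())
    return [
        node.get(ANDROID_NS + "name")
        for node in root.iter("uses-permission")
    ]

permissions = extract_permissions(MANIFEST)
print(permissions)
```

The application is never run here: everything comes from a file shipped inside the package, which is what makes this a static feature.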

State of the Art

In 2016 we published an article [2] about the state of the art of frameworks and results in Android malware detection. This work reviews different static analysis tools (TaintDroid, Stowaway, Crowdroid, and Airmid), dynamic analysis systems (Paranoid and DroidMOSS), frameworks (MobSafe, SAAF, and ASEF), and some research results on the use of machine learning. From this article we concluded that the best approach is to use both static and dynamic analysis in order to obtain a broad spectrum of features. Moreover, some works have explored the use of virtual devices in the cloud.

Datasets

In 2016 we explored [3] the Android Genome Project (MalGenome), a dataset that was active from 2012 until the end of 2015. This set of malware contains 1260 applications grouped into 49 families. Today we can find other works such as: Drebin, a research project offering 5560 applications from 179 malware families; AndroZoo, a collection of 5669661 Android applications from different sources (including Google Play); VirusShare, another repository that provides malware samples for cybersecurity researchers; and DroidCollector, another set that provides around 8000 benign applications and 5560 malware samples and, moreover, offers samples of network traffic as pcap files.

Static Analysis

In this first step, I'm going to analyze some features in order to test the following hypothesis: there is a difference between the permissions used by a set of malware samples and a set of benign samples, in other words…

For this approach, I developed a script that extracts the permissions of each application and writes them to a CSV file. Through this script you can map each APK (Android Application Package) against a list of permissions. You can find more information about the proposed framework in [3].

https://github.com/urcuqui/WhiteHat
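
The mapping of APKs against a list of permissions can be pictured as a binary matrix with one row per application and one column per permission. The following is only a sketch under assumptions: the `apps` dictionary is invented sample data, and the semicolon delimiter and the `type` label column are chosen to resemble the dataset used later in this post, not taken from the actual script in the repository.

```python
import csv
import io

# Hypothetical extraction results: APK name -> (requested permissions, label).
apps = {
    "benign_app.apk": ({"android.permission.INTERNET"}, 0),
    "malware_app.apk": ({"android.permission.INTERNET",
                         "android.permission.READ_SMS"}, 1),
}

# Every permission seen in any app becomes one column.
columns = sorted(set().union(*(perms for perms, _ in apps.values())))

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(columns + ["type"])
for perms, label in apps.values():
    # 1 if the app requests the permission, 0 otherwise.
    writer.writerow([int(p in perms) for p in columns] + [label])

print(buffer.getvalue())
```

A table in this shape can be loaded directly with `pd.read_csv(..., sep=";")` and passed to a classifier.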

Exploratory

For the next analysis, I'm going to explore the MalGenome dataset. As I said, nowadays we can find other sources with many more samples and malware families, which would be important for future analysis; the idea of the following experiment and results is to show our first approach. I'm going to use the dataset that I uploaded to Kaggle years ago.

Android Malware Analysis

www.kaggle.com

import pandas as pd

# I'm going to load the dataset
df = pd.read_csv("../input/datasetandroidpermissions/train.csv", sep=";")
df = df.astype("int64")
df.type.value_counts()

Out[4]:

1    199
0    199
Name: type, dtype: int64

Type is the label that indicates whether an application is malware (1) or not (0). As we can see, this dataset is balanced.

Let's get the top 10 permissions used by our malware samples.

Malicious

In [6]:

# Position 0 of the sorted sums is the 'type' column itself (199 for malware), so skip it
pd.Series.sort_values(df[df.type==1].sum(axis=0), ascending=False)[1:11]

Out[6]:

android.permission.INTERNET                  195
android.permission.READ_PHONE_STATE          190
android.permission.ACCESS_NETWORK_STATE      167
android.permission.WRITE_EXTERNAL_STORAGE    136
android.permission.ACCESS_WIFI_STATE         135
android.permission.READ_SMS                  124
android.permission.WRITE_SMS                 104
android.permission.RECEIVE_BOOT_COMPLETED    102
android.permission.ACCESS_COARSE_LOCATION     80
android.permission.CHANGE_WIFI_STATE          75
dtype: int64

Benign

In [7]:

pd.Series.sort_values(df[df.type==0].sum(axis=0), ascending=False)[:10]

Out[7]:

android.permission.INTERNET                  104
android.permission.WRITE_EXTERNAL_STORAGE     76
android.permission.ACCESS_NETWORK_STATE       62
android.permission.WAKE_LOCK                  36
android.permission.RECEIVE_BOOT_COMPLETED     30
android.permission.ACCESS_WIFI_STATE          29
android.permission.READ_PHONE_STATE           24
android.permission.VIBRATE                    21
android.permission.ACCESS_FINE_LOCATION       18
android.permission.READ_EXTERNAL_STORAGE      15
dtype: int64

In [8]:

import matplotlib.pyplot as plt
fig, axs =  plt.subplots(nrows=2, sharex=True)

pd.Series.sort_values(df[df.type==0].sum(axis=0), ascending=False)[:10].plot.bar(ax=axs[0])
pd.Series.sort_values(df[df.type==1].sum(axis=0), ascending=False)[1:11].plot.bar(ax=axs[1], color="red")

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3dea5e7390>

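The previous outputs show a difference between the permissions used by malware and by benign applications. For example, READ_PHONE_STATE is requested by 190 of the malware samples but only 24 of the benign ones, and READ_SMS and WRITE_SMS appear only in the malware top 10.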
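
One way to quantify that difference, instead of comparing two top-10 lists by eye, is to compute the share of apps in each class that request each permission and subtract. This is a sketch on a toy frame (the `toy` data is invented); it assumes the same layout as the dataset above, with 0/1 permission columns and a `type` label.

```python
import pandas as pd

# Toy stand-in for the permissions matrix (1 = permission requested).
toy = pd.DataFrame({
    "android.permission.INTERNET": [1, 1, 1, 1],
    "android.permission.READ_SMS": [1, 1, 0, 0],
    "android.permission.VIBRATE":  [0, 0, 1, 0],
    "type":                        [1, 1, 0, 0],
})

# Mean of a 0/1 column per class = fraction of apps requesting the permission.
rates = toy.groupby("type").mean()

# Positive values: more common in malware; negative: more common in benign.
gap = (rates.loc[1] - rates.loc[0]).sort_values(ascending=False)
print(gap)
```

Permissions with a gap near zero (INTERNET in this toy example) are requested by both classes and are unlikely to help a classifier on their own.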

Modeling

In [9]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:330], df['type'], test_size=0.20, random_state=42)

Naive Bayes algorithm

In [10]:

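from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Naive Bayes algorithm
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# pred
pred = gnb.predict(X_test)
# accuracy
accuracy = accuracy_score(pred, y_test)
print("naive_bayes")
print(accuracy)
# Note: scikit-learn expects (y_true, y_pred). Here the arguments are swapped,
# so the precision and recall columns of the report below are interchanged.
print(classification_report(pred, y_test, labels=None))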

Out [10]:

naive_bayes
0.8375
              precision    recall  f1-score   support

           0       0.91      0.76      0.83        41
           1       0.78      0.92      0.85        39

    accuracy                           0.84        80
   macro avg       0.85      0.84      0.84        80
weighted avg       0.85      0.84      0.84        80

K-neighbors algorithm

In [11]:

from sklearn.neighbors import KNeighborsClassifier

# kneighbors algorithm
for i in range(3,15,3):

    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    # accuracy
    accuracy = accuracy_score(pred, y_test)
    print("kneighbors {}".format(i))
    print(accuracy)
    print(classification_report(pred, y_test, labels=None))
    print("")

Out [11]:

kneighbors 3
0.8875
              precision    recall  f1-score   support

           0       0.94      0.82      0.88        39
           1       0.85      0.95      0.90        41

    accuracy                           0.89        80
   macro avg       0.89      0.89      0.89        80
weighted avg       0.89      0.89      0.89        80

kneighbors 6
0.85
              precision    recall  f1-score   support

           0       0.94      0.76      0.84        42
           1       0.78      0.95      0.86        38

    accuracy                           0.85        80
   macro avg       0.86      0.85      0.85        80
weighted avg       0.87      0.85      0.85        80

kneighbors 9
0.8375
              precision    recall  f1-score   support

           0       0.94      0.74      0.83        43
           1       0.76      0.95      0.84        37

    accuracy                           0.84        80
   macro avg       0.85      0.85      0.84        80
weighted avg       0.86      0.84      0.84        80

kneighbors 12
0.825
              precision    recall  f1-score   support

           0       0.91      0.74      0.82        42
           1       0.76      0.92      0.83        38

    accuracy                           0.82        80
   macro avg       0.84      0.83      0.82        80
weighted avg       0.84      0.82      0.82        80

Decision Tree

In [12]:

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(classification_report(pred, y_test, labels=None))

Out [12]:


              precision    recall  f1-score   support

           0       0.94      0.89      0.91        36
           1       0.91      0.95      0.93        44

    accuracy                           0.93        80
   macro avg       0.93      0.92      0.92        80
weighted avg       0.93      0.93      0.92        80

The results above show how we trained different classifiers to detect malware from its permissions. As I said, this is only a first approximation: I didn't tune the hyperparameters or try other things that could improve the results.
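
As a pointer for that missing step, here is a minimal sketch of a cross-validated grid search over decision tree hyperparameters with scikit-learn's `GridSearchCV`. It was not part of the original experiment: the data is synthetic (`make_classification`), and the grid values are hypothetical and purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the permissions matrix; not the real dataset.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation
    scoring="f1",   # F1 for the positive class (label 1)
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data the search would be fitted on `X_train` and `y_train` only, keeping the test split untouched for the final report.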

Dynamic Analysis

For this approach, we used a set of pcap files from the DroidCollector project, made up of 4705 benign and 7846 malicious applications. All of the files were processed by our feature extractor script (a result from [4]). The idea of this analysis is to answer the following question: according to the static analysis seen previously, many applications use a network connection, in other words, they try to communicate or transmit information. So, is it possible to distinguish between malware and benign applications using network traffic?

In [13]:

import pandas as pd
data = pd.read_csv("../input/network-traffic-android-malware/android_traffic.csv", sep=";")
data.columns

Out [13]:

Index(['name', 'tcp_packets', 'dist_port_tcp', 'external_ips', 'vulume_bytes',
       'udp_packets', 'tcp_urg_packet', 'source_app_packets',
       'remote_app_packets', 'source_app_bytes', 'remote_app_bytes',
       'duracion', 'avg_local_pkt_rate', 'avg_remote_pkt_rate',
       'source_app_packets.1', 'dns_query_times', 'type'],
      dtype='object')

data.shape

Out[15]:

(7845, 17)

In [16]:

data.type.value_counts()

Out[16]:

benign       4704
malicious    3141
Name: type, dtype: int64

In this case, we have an unbalanced dataset, so accuracy alone is not enough and an additional evaluation metric (the Cohen kappa score) will be used.
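
A quick way to see why accuracy can mislead here is to compute what a trivial model would score. The sketch below uses only the class counts from the output above; the "always predict the majority class" baseline is an illustration, not one of the models trained in this post.

```python
# Class counts taken from the value_counts() output above.
counts = {"benign": 4704, "malicious": 3141}
total = sum(counts.values())

# A "classifier" that always answers with the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

Such a baseline reaches roughly 60% accuracy without detecting a single malicious sample, while its kappa score would be 0, because kappa discounts the agreement expected by chance.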

Data Cleaning and Processing

In [17]:

data.isna().sum()

Out[17]:

name                       0
tcp_packets                0
dist_port_tcp              0
external_ips               0
vulume_bytes               0
udp_packets                0
tcp_urg_packet             0
source_app_packets         0
remote_app_packets         0
source_app_bytes           0
remote_app_bytes           0
duracion                7845
avg_local_pkt_rate      7845
avg_remote_pkt_rate     7845
source_app_packets.1       0
dns_query_times            0
type                       0
dtype: int64

When we processed each pcap, the feature processing script failed to compute three features (duration, average remote packet rate, and average local packet rate), which is why these columns are empty. We don't have this issue nowadays.

In [18]:

data = data.drop(['duracion','avg_local_pkt_rate','avg_remote_pkt_rate'], axis=1).copy()
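
The three empty columns were dropped by name. A minimal sketch of an alternative, assuming you would rather not list them by hand, is to let pandas remove every column that contains no values at all. The `toy` frame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one entirely empty column, mimicking 'duracion'.
toy = pd.DataFrame({
    "tcp_packets": [10, 25, 3],
    "duracion": [np.nan, np.nan, np.nan],
    "dns_query_times": [2, np.nan, 1],   # partially missing: kept
})

# how="all" removes only columns where every value is missing.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Columns with only some missing values survive this call, so they still need their own handling.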

Now, the idea is to look for outliers in the data.

In [20]:

import seaborn as sns
sns.boxplot(x=data.tcp_urg_packet)

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3dea43f208>

In [21]:

data.loc[data.tcp_urg_packet > 0].shape[0]

Out[21]:

That column will not be used for the analysis: only two rows are different from zero. They may be interesting for future analysis.

In [22]:

data = data.drop(columns=["tcp_urg_packet"]).copy()
data.shape

Out[22]:

(7845, 13)

In [23]:

sns.pairplot(data)

Out[23]:

<seaborn.axisgrid.PairGrid at 0x7f3dea4aae80>

We have many outliers in some features. I will omit the in-depth analysis and simply keep the subset of the data without the noise.

data=data[data.tcp_packets<20000].copy()
data=data[data.dist_port_tcp<1400].copy()
data=data[data.external_ips<35].copy()
data=data[data.vulume_bytes<2000000].copy()
data=data[data.udp_packets<40].copy()
data=data[data.remote_app_packets<15000].copy()
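
The same filters can be kept in one place, which makes the cut-offs easier to review and change. This is a sketch, assuming the upper bounds used above; the `LIMITS` dictionary, the `drop_above_limits` helper, and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

# Upper bounds copied from the filters above.
LIMITS = {
    "tcp_packets": 20000,
    "dist_port_tcp": 1400,
    "external_ips": 35,
    "vulume_bytes": 2000000,
    "udp_packets": 40,
    "remote_app_packets": 15000,
}

def drop_above_limits(frame, limits):
    """Keep rows whose value is strictly below every listed limit."""
    mask = pd.Series(True, index=frame.index)
    for column, limit in limits.items():
        mask &= frame[column] < limit
    return frame[mask].copy()

# Toy rows: the second one exceeds the tcp_packets limit.
toy = pd.DataFrame({name: [1, 1] for name in LIMITS})
toy.loc[1, "tcp_packets"] = 50000
print(len(drop_above_limits(toy, LIMITS)))
```

Because every condition is combined with a logical AND, the result is the same set of rows as applying the filters one after another.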

In [25]:

data[data.duplicated()].sum()

Out[25]:

name                    AntiVirusAntiVirusAntiVirusAntiVirusAntiVirusA...
tcp_packets                                                         15038
dist_port_tcp                                                        3514
external_ips                                                         1434
vulume_bytes                                                      2061210
udp_packets                                                            38
source_app_packets                                                  21720
remote_app_packets                                                  18841
source_app_bytes                                                  8615120
remote_app_bytes                                                  2456160
source_app_packets.1                                                21720
dns_query_times                                                      5095
type                    benignbenignbenignbenignbenignbenignbenignbeni...
dtype: object

In [26]:

data=data.drop('source_app_packets.1',axis=1).copy()

In [27]:

from sklearn import preprocessing
scaler = preprocessing.RobustScaler()
scaledData = scaler.fit_transform(data.iloc[:,1:11])
scaledData = pd.DataFrame(scaledData, columns=['tcp_packets','dist_port_tcp','external_ips','vulume_bytes','udp_packets','source_app_packets','remote_app_packets','source_app_bytes','remote_app_bytes','dns_query_times'])
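
With its default settings, RobustScaler centers each feature on its median and divides by the interquartile range, which is why it is less sensitive to the outliers seen earlier than a mean and standard deviation scaling would be. The following sketch reproduces that computation by hand with NumPy on an invented `values` array; it illustrates the formula and is not a replacement for the scaler.

```python
import numpy as np

# Toy feature with one extreme value, similar to the packet counters.
values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])

# Center on the median and divide by the interquartile range (IQR).
scaled = (values - median) / (q3 - q1)
print(scaled)
```

The extreme value stays extreme after scaling, but it no longer distorts how the typical values are scaled.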

From [6] we concluded that the best network features are:

(R1) TCP packets: the number of TCP packets sent and received during communication.
(R2) Different TCP packets: the total number of packets different from TCP.
(R3) External IPs: the number of external addresses (IPs) with which the application tried to communicate.
(R4) Volume of bytes: the number of bytes sent from the application to external sites.
(R5) UDP packets: the total number of UDP packets transmitted in a communication.
(R6) Source application packets: the number of packets sent from the application to a remote server.
(R7) Remote application packets: the number of packets received from external sources.
(R8) Source application bytes: the volume (in bytes) of the communication between the application and the server.
(R9) Remote application bytes: the volume (in bytes) of the data from the server to the emulator.
(R10) DNS queries: the number of DNS queries.
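
To make some of these definitions concrete, here is a sketch of how a few of them could be derived from a list of packets. Everything in it is an assumption made for illustration: the packet records are invented dictionaries (a real pipeline would parse them from a pcap file), `APP_IP` is a hypothetical emulator address, and the logic is not the feature extractor script used in this research.

```python
# Hypothetical packet records; a real pipeline would parse them from a pcap.
APP_IP = "10.0.2.15"  # assumed address of the emulator running the app
packets = [
    {"src": APP_IP, "dst": "203.0.113.5", "proto": "TCP", "size": 60, "dns": False},
    {"src": "203.0.113.5", "dst": APP_IP, "proto": "TCP", "size": 1500, "dns": False},
    {"src": APP_IP, "dst": "198.51.100.7", "proto": "UDP", "size": 70, "dns": True},
    {"src": "198.51.100.7", "dst": APP_IP, "proto": "UDP", "size": 120, "dns": False},
]

sent = [p for p in packets if p["src"] == APP_IP]
received = [p for p in packets if p["dst"] == APP_IP]

features = {
    "tcp_packets": sum(p["proto"] == "TCP" for p in packets),   # R1
    "external_ips": len({p["dst"] for p in sent}),              # R3
    "udp_packets": sum(p["proto"] == "UDP" for p in packets),   # R5
    "source_app_packets": len(sent),                            # R6
    "remote_app_packets": len(received),                        # R7
    "source_app_bytes": sum(p["size"] for p in sent),           # R8
    "remote_app_bytes": sum(p["size"] for p in received),       # R9
    "dns_query_times": sum(p["dns"] for p in sent),             # R10
}
print(features)
```

Each application's capture collapses into one row of counters like this, which is the shape of the dataset loaded earlier in this section.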

Modeling

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaledData.iloc[:,0:10], data.type.astype("str"), test_size=0.25, random_state=45)

In [29]:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score

gnb = GaussianNB()
gnb.fit(X_train, y_train)
pred = gnb.predict(X_test)
## accuracy
accuracy = accuracy_score(y_test,pred)
print("naive_bayes")
print(accuracy)
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score")
print(cohen_kappa_score(y_test, pred))

Out [29]:

naive_bayes
0.44688457609805926
              precision    recall  f1-score   support

      benign       0.81      0.12      0.20      1190
   malicious       0.41      0.96      0.58       768

    accuracy                           0.45      1958
   macro avg       0.61      0.54      0.39      1958
weighted avg       0.66      0.45      0.35      1958

cohen kappa score
0.06082933470572538

In [30]:

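from sklearn.neighbors import KNeighborsClassifier

# kneighbors algorithm

for i in range(3,15,3):

    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    # accuracy
    accuracy = accuracy_score(pred, y_test)
    print("kneighbors {}".format(i))
    print(accuracy)
    # Note: the arguments are swapped with respect to scikit-learn's (y_true, y_pred),
    # so the precision and recall columns of this report are interchanged.
    print(classification_report(pred, y_test, labels=None))
    print("cohen kappa score")
    print(cohen_kappa_score(y_test, pred))
    print("")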

Out [30]:

kneighbors 3
0.8861082737487231
              precision    recall  f1-score   support

      benign       0.90      0.91      0.91      1173
   malicious       0.87      0.85      0.86       785

    accuracy                           0.89      1958
   macro avg       0.88      0.88      0.88      1958
weighted avg       0.89      0.89      0.89      1958

cohen kappa score
0.7620541314671169

kneighbors 6
0.8784473953013279
              precision    recall  f1-score   support

      benign       0.92      0.88      0.90      1240
   malicious       0.81      0.87      0.84       718

    accuracy                           0.88      1958
   macro avg       0.87      0.88      0.87      1958
weighted avg       0.88      0.88      0.88      1958

cohen kappa score
0.7420746759356631

kneighbors 9
0.8707865168539326
              precision    recall  f1-score   support

      benign       0.89      0.90      0.89      1175
   malicious       0.85      0.83      0.84       783

    accuracy                           0.87      1958
   macro avg       0.87      0.86      0.86      1958
weighted avg       0.87      0.87      0.87      1958

cohen kappa score
0.729919255030886

kneighbors 12
0.8615934627170582
              precision    recall  f1-score   support

      benign       0.88      0.89      0.89      1185
   malicious       0.83      0.82      0.82       773

    accuracy                           0.86      1958
   macro avg       0.86      0.85      0.86      1958
weighted avg       0.86      0.86      0.86      1958

cohen kappa score
0.7100368862537227

In [31]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rdF=RandomForestClassifier(n_estimators=250, max_depth=50,random_state=45)
rdF.fit(X_train,y_train)
pred=rdF.predict(X_test)
cm=confusion_matrix(y_test, pred)

accuracy = accuracy_score(y_test,pred)
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score")
print(cohen_kappa_score(y_test, pred))
print(cm)

Out [31]:


              precision    recall  f1-score   support

      benign       0.93      0.94      0.93      1190
   malicious       0.90      0.88      0.89       768

    accuracy                           0.92      1958
   macro avg       0.91      0.91      0.91      1958
weighted avg       0.92      0.92      0.92      1958

cohen kappa score
0.8258206083396299
[[1117   73]
 [  89  679]]
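
The kappa score can be recomputed by hand from the confusion matrix printed above, which helps to see what the metric measures: the observed agreement between labels and predictions, corrected by the agreement expected by chance. A minimal sketch, assuming the scikit-learn convention of rows as true classes and columns as predicted classes:

```python
# Confusion matrix printed above: rows = true class, columns = predicted.
cm = [[1117, 73],
      [89, 679]]

n = sum(sum(row) for row in cm)

# Observed agreement: fraction of samples on the diagonal.
p_observed = (cm[0][0] + cm[1][1]) / n

# Chance agreement: for each class, P(true = c) * P(predicted = c).
row_totals = [sum(row) for row in cm]
col_totals = [cm[0][j] + cm[1][j] for j in range(2)]
p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 4))
```

The result should agree with the value that `cohen_kappa_score` reported for the random forest.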

Conclusions

Great 💪🤖!! We have now seen the two approaches to analyzing a cyber threat. Of course, there are many variables that we didn't use in this case, for example netflows, method calls, graph analysis, and many others. The idea behind this work is to understand that we need to pay attention to the whole environment, because in cybersecurity we face a complex problem.

I'm going to continue with my cybersecurity posts. Although this is the first one, it will be updated whenever I find an improvement or new results.

Some of the next posts that I will publish are:

Malicious Websites Detection
Secure Learning
Cyber Web Attacks Detection
Deep Fakes Detection

I would like to mention my former undergraduate students, who helped me continue my Android research:

Melisa Garcia
Jose Luis Osorio
Andres Felipe Perez
Jhoan Steven Delgado

Moreover, I would like to acknowledge my co-authors and my English teacher:

Andres Navarro
Sebastian Londoño
Samir Riascos

References

[1] Urcuqui López, C. C., García Peña, M., Osorio Quintero, J. L., & Navarro Cadavid, A. (2018). Ciberseguridad: un enfoque desde la ciencia de datos-Primera edición.
[2] Navarro Cadavid, A., Londoño, S., Urcuqui López, C. C., & Gomez, J. (2014, June). Análisis y caracterización de frameworks para detección de aplicaciones maliciosas en Android. In Conference: XIV Jornada Internacional de Seguridad Informática ACIS-2014 (Vol. 14). ResearchGate.
[3] Urcuqui-López, C., & Cadavid, A. N. (2016). Framework for malware analysis in Android.
[4] Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. CEUR Workshop Proceedings. Vol 1950, 14-17.
[5] López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.

___   ___    __   ____  ___    ___    __    __    
|   | /  |   |  |  |  |  |  |  /  |   |  |  |  |       
|   |/  /    |  |  |  |  |  | /  /    |  |  |  |      
|      /     |  |  |  |  |      |      \  \_/  /  
|  |\  \     |  '--'  |  |  |\  \        \   /        
| _| `.__\   |________|  | _| `.__\      |___|
[0][1][0]
[0][0][1]
[1][1][1]
Happy Hacking

About me

I am a security data scientist, hacker, and researcher passionate about ICTs; moreover, I'm an assistant professor at Universidad Icesi. I hold a master's degree in informatics and telecommunications and an undergraduate degree in software engineering. My research topics are cybersecurity and data science, and I'm really excited about malware (anomaly) detection, adversarial techniques, secure learning, and ethical hacking.

Welcome to my world 💻~(-_-~)

https://www.linkedin.com/in/christianurcuqui/

Written by Christian Urcuqui