My Cybersecurity Machine Learning Research / Part1

Christian Urcuqui
11 min read · Nov 8, 2019
http://thenewsdoctors.com/wp-content/uploads/2014/10/mad-scientist.jpeg

Author: Christian Camilo Urcuqui López

Date: 04 November 2019

GitHub: https://github.com/urcuqui/

In this post I would like to share some of my research results. I will explain to readers interested in this area of cybersecurity how I used data science to train machine learning models to detect threats. For this part, I draw on ideas from my book [1] and other articles, which I will refer to throughout the explanation.

This research is philanthropic work for me, so the idea of sharing it is to get feedback, networking and citations for these results (hacker culture ☠).

My first book 💪🤖

Android

Android is one of the most used mobile operating systems worldwide. Due to its technological impact, its open-source code and the possibility of installing applications from third parties without any central control, Android has become a malware target. Although it includes security mechanisms, recent news about malicious activities and Android's vulnerabilities points to the importance of continuing to develop methods and frameworks to improve its security.

To prevent malware attacks, researchers and developers have proposed different security solutions, applying static analysis, dynamic analysis, and artificial intelligence. Indeed, data science has become a promising area in cybersecurity, since analytical models based on data allow for the discovery of insights that can help to predict malicious activities.

We can analyze cyber threats using two techniques: static analysis and dynamic analysis. Most importantly, these are the approaches we use to obtain the features for data science.

  • Static analysis: methods that let us obtain information about the software under analysis without executing it, for example by studying its code, its calls, its resources, etc.
  • Dynamic analysis: an approach that analyzes the cyber threat during its execution, in other words, it gathers information about its behavior; netflows are one example of such features.

State of the Art

In 2016 we published an article [2] on the state of the art of frameworks and results in Android malware detection. This work covers different static analysis tools (TaintDroid, Stowaway, Crowdroid and Airmid), dynamic analysis systems (Paranoid and DroidMOSS), frameworks (MobSafe, SAAF, and ASEF) and some research results on the use of machine learning. From this article we concluded that the idea is to use both static and dynamic analysis in order to obtain a broad spectrum of features; moreover, some works have explored the use of virtual devices in the cloud.

Datasets

In 2016 we explored [3] the Android Genome Project (MalGenome), a dataset that was active from 2012 until the end of 2015; this malware set contains 1260 applications grouped into 49 families. Today, we can find other works such as: Drebin, a research project offering 5560 applications from 179 malware families; AndroZoo, a collection of 5669661 Android applications from different sources (including Google Play); VirusShare, another repository that provides malware samples for cybersecurity researchers; and DroidCollector, another set that provides around 8000 benign applications and 5560 malware samples and, moreover, supplies network traffic samples as pcap files.

Static Analysis

In this first step, I'm going to analyze some features in order to test the following hypothesis: there is a difference between the permissions used by a set of malware samples and those used by benign samples, in other words…

For this approach, I developed a script that extracts the permissions of applications and writes them to a CSV file; through this script you can map each APK (Android Application Package) against a list of permissions. You can find more information about the proposed framework in [3].

https://github.com/urcuqui/WhiteHat
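
As a rough illustration of the mapping described above, the following sketch turns a per-APK permission list into a binary matrix and serializes it as CSV. It is not the WhiteHat script itself: the `apk_permissions` dictionary, the APK names and the `build_permission_rows` helper are hypothetical, and the actual extraction of permissions from each APK is assumed to have happened already.

```python
import csv
import io

# Hypothetical input: APK name -> permissions it requests.
apk_permissions = {
    "sample_a.apk": ["android.permission.INTERNET", "android.permission.READ_SMS"],
    "sample_b.apk": ["android.permission.INTERNET"],
}


def build_permission_rows(apk_permissions):
    """Return (header, rows) for a binary APK-by-permission matrix."""
    # The sorted union of all permissions gives a stable column order.
    columns = sorted({p for perms in apk_permissions.values() for p in perms})
    rows = []
    for name, perms in apk_permissions.items():
        requested = set(perms)
        rows.append([name] + [1 if c in requested else 0 for c in columns])
    return ["name"] + columns, rows


header, rows = build_permission_rows(apk_permissions)

# Use ";" as the separator, the same one passed to read_csv in this post.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Each row is one application and each column one permission, which is the shape the classifiers in this post expect once a label column is added.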

Exploratory

For the next analysis, I'm going to explore the MalGenome dataset. As I said, nowadays we can find other sources with many more examples and malware families, which would be important for future analysis; the idea of the next experiment and its results is to show our first approach. I'm going to use the dataset that I uploaded to Kaggle years ago.

import pandas as pd

# Load the dataset
df = pd.read_csv("../input/datasetandroidpermissions/train.csv", sep=";")
df = df.astype("int64")
df.type.value_counts()

Out[4]:

1    199
0 199
Name: type, dtype: int64

The type column is the label that indicates whether an application is malware (1) or not (0); as we can see, this dataset is balanced.

Let's get the top 10 permissions used by our malware samples.

Malicious

In [6]:

# Position 0 is the 'type' label column itself (its sum is the largest), so [1:11] skips it.
df[df.type == 1].sum(axis=0).sort_values(ascending=False)[1:11]

Out[6]:

android.permission.INTERNET                  195
android.permission.READ_PHONE_STATE          190
android.permission.ACCESS_NETWORK_STATE      167
android.permission.WRITE_EXTERNAL_STORAGE    136
android.permission.ACCESS_WIFI_STATE         135
android.permission.READ_SMS                  124
android.permission.WRITE_SMS                 104
android.permission.RECEIVE_BOOT_COMPLETED    102
android.permission.ACCESS_COARSE_LOCATION     80
android.permission.CHANGE_WIFI_STATE          75
dtype: int64

Benign

In [7]:

# 'type' sums to 0 over the benign rows, so no offset is needed here.
df[df.type == 0].sum(axis=0).sort_values(ascending=False)[:10]

Out[7]:

android.permission.INTERNET                  104
android.permission.WRITE_EXTERNAL_STORAGE     76
android.permission.ACCESS_NETWORK_STATE       62
android.permission.WAKE_LOCK                  36
android.permission.RECEIVE_BOOT_COMPLETED     30
android.permission.ACCESS_WIFI_STATE          29
android.permission.READ_PHONE_STATE           24
android.permission.VIBRATE                    21
android.permission.ACCESS_FINE_LOCATION       18
android.permission.READ_EXTERNAL_STORAGE      15
dtype: int64

In [8]:

import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=2, sharex=True)

pd.Series.sort_values(df[df.type==0].sum(axis=0), ascending=False)[:10].plot.bar(ax=axs[0])
pd.Series.sort_values(df[df.type==1].sum(axis=0), ascending=False)[1:11].plot.bar(ax=axs[1], color="red")

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3dea5e7390>

The outputs above suggest a difference between the permissions used by the malware and by the benign applications.
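
To make that difference concrete, here is a small sketch that compares how often each permission appears in the two classes. The counts are copied from the Out[6] and Out[7] listings (199 samples per class), restricted to the permissions that appear in both top-10 lists; the `usage_gap` helper is a hypothetical name introduced for this illustration.

```python
# Counts copied from the Out[6] (malware) and Out[7] (benign) listings above.
SAMPLES_PER_CLASS = 199
malware_counts = {
    "INTERNET": 195,
    "READ_PHONE_STATE": 190,
    "ACCESS_NETWORK_STATE": 167,
    "WRITE_EXTERNAL_STORAGE": 136,
    "ACCESS_WIFI_STATE": 135,
    "RECEIVE_BOOT_COMPLETED": 102,
}
benign_counts = {
    "INTERNET": 104,
    "READ_PHONE_STATE": 24,
    "ACCESS_NETWORK_STATE": 62,
    "WRITE_EXTERNAL_STORAGE": 76,
    "ACCESS_WIFI_STATE": 29,
    "RECEIVE_BOOT_COMPLETED": 30,
}


def usage_gap(malware_counts, benign_counts, n_per_class):
    """Rank shared permissions by (malware share - benign share), largest first."""
    shared = malware_counts.keys() & benign_counts.keys()
    gaps = {p: (malware_counts[p] - benign_counts[p]) / n_per_class for p in shared}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = usage_gap(malware_counts, benign_counts, SAMPLES_PER_CLASS)
for permission, gap in ranked:
    print(f"{permission:<24} {gap:+.3f}")
```

Under these numbers the largest gap belongs to READ_PHONE_STATE, which is requested by most malware samples but by few benign ones.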

Modeling

In [9]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:330], df['type'], test_size=0.20, random_state=42)

Naive Bayes algorithm

In [10]:

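from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Naive Bayes algorithm
gnb = GaussianNB()
gnb.fit(X_train, y_train)
# pred
pred = gnb.predict(X_test)
# accuracy
accuracy = accuracy_score(pred, y_test)
print("naive_bayes")
print(accuracy)
# Note: scikit-learn expects (y_true, y_pred). Predictions are passed first here,
# so in the report below precision and recall are swapped and "support" counts predictions.
print(classification_report(pred, y_test, labels=None))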

Out [10]:

naive_bayes
0.8375
              precision    recall  f1-score   support

           0       0.91      0.76      0.83        41
           1       0.78      0.92      0.85        39

    accuracy                           0.84        80
   macro avg       0.85      0.84      0.84        80
weighted avg       0.85      0.84      0.84        80

K-neighbors algorithm

In [11]:

from sklearn.neighbors import KNeighborsClassifier

# k-neighbors algorithm: try k = 3, 6, 9 and 12
for i in range(3, 15, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    # accuracy
    accuracy = accuracy_score(pred, y_test)
    print("kneighbors {}".format(i))
    print(accuracy)
    print(classification_report(pred, y_test, labels=None))
    print("")

Out [11]:

kneighbors 3
0.8875
              precision    recall  f1-score   support

           0       0.94      0.82      0.88        39
           1       0.85      0.95      0.90        41

    accuracy                           0.89        80
   macro avg       0.89      0.89      0.89        80
weighted avg       0.89      0.89      0.89        80

kneighbors 6
0.85
              precision    recall  f1-score   support

           0       0.94      0.76      0.84        42
           1       0.78      0.95      0.86        38

    accuracy                           0.85        80
   macro avg       0.86      0.85      0.85        80
weighted avg       0.87      0.85      0.85        80

kneighbors 9
0.8375
              precision    recall  f1-score   support

           0       0.94      0.74      0.83        43
           1       0.76      0.95      0.84        37

    accuracy                           0.84        80
   macro avg       0.85      0.85      0.84        80
weighted avg       0.86      0.84      0.84        80

kneighbors 12
0.825
              precision    recall  f1-score   support

           0       0.91      0.74      0.82        42
           1       0.76      0.92      0.83        38

    accuracy                           0.82        80
   macro avg       0.84      0.83      0.82        80
weighted avg       0.84      0.82      0.82        80
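
For readers unfamiliar with the k-nearest-neighbors idea used above, here is a minimal dependency-free sketch on binary permission vectors. It is an illustration, not scikit-learn's implementation: the toy vectors, labels and the `knn_predict` helper are hypothetical. Hamming distance is used because, for 0/1 vectors, it equals the squared Euclidean distance and therefore ranks neighbors in the same order (tie-breaking may differ).

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k):
    """Label the query by majority vote among its k closest training vectors."""
    # sorted() is stable, so equally distant samples keep their training order.
    order = sorted(range(len(train_X)), key=lambda i: hamming(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each vector marks 3 permissions; label 1 = malware, 0 = benign.
train_X = [[1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [1, 0, 1]]
train_y = [1, 1, 0, 0, 1]

print(knn_predict(train_X, train_y, [1, 1, 0], k=3))
print(knn_predict(train_X, train_y, [0, 0, 0], k=3))
```

An odd k with two classes avoids tied votes, which is one reason small odd values are a common starting point.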

Decision Tree

In [12]:

from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(classification_report(pred, y_test, labels=None))

Out [12]:


              precision    recall  f1-score   support

           0       0.94      0.89      0.91        36
           1       0.91      0.95      0.93        44

    accuracy                           0.93        80
   macro avg       0.93      0.92      0.92        80
weighted avg       0.93      0.93      0.92        80

These results show how we trained different classifiers to detect malware from its permissions but, as I said, this is only a first approximation: I didn't tune the hyperparameters or explore other ways to improve the results.

Dynamic Analysis

For this approach, we used a set of pcap files from the DroidCollector project comprising 4705 benign and 7846 malicious applications. All of the files were processed by our feature extractor script (a result from [4]). The idea of this analysis is to answer the following question: according to the static analysis seen previously, a lot of applications use a network connection, in other words, they are trying to communicate or transmit information, so is it possible to distinguish between malware and benign applications using network traffic?

In [13]:

import pandas as pd
data = pd.read_csv("../input/network-traffic-android-malware/android_traffic.csv", sep=";")
data.columns

Out [13]:

Index(['name', 'tcp_packets', 'dist_port_tcp', 'external_ips', 'vulume_bytes',
       'udp_packets', 'tcp_urg_packet', 'source_app_packets',
       'remote_app_packets', 'source_app_bytes', 'remote_app_bytes',
       'duracion', 'avg_local_pkt_rate', 'avg_remote_pkt_rate',
       'source_app_packets.1', 'dns_query_times', 'type'],
      dtype='object')

In [15]:

data.shape

Out[15]:

(7845, 17)

In [16]:

data.type.value_counts()

Out[16]:

benign       4704
malicious 3141
Name: type, dtype: int64

In this case we have an unbalanced dataset, so an additional evaluation metric, Cohen's kappa, will be reported alongside accuracy.
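
To see why accuracy alone can mislead on unbalanced data, the sketch below computes Cohen's kappa by hand for a hypothetical baseline that labels every application as benign. The class counts come from Out[16]; the `cohen_kappa` helper is written for this illustration and is only meant to mirror the standard definition, kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Agreement expected if the predictions were independent of the true labels.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Class counts taken from Out[16] above.
y_true = ["benign"] * 4704 + ["malicious"] * 3141
# Hypothetical baseline: predict "benign" for every application.
y_pred = ["benign"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(round(accuracy, 3), kappa)
```

The baseline reaches an accuracy of about 0.60 without learning anything, while its kappa is 0, which is why kappa is a useful companion metric here.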

Data Cleaning and Processing

In [17]:

data.isna().sum()

Out[17]:

name                       0
tcp_packets                0
dist_port_tcp              0
external_ips               0
vulume_bytes               0
udp_packets                0
tcp_urg_packet             0
source_app_packets         0
remote_app_packets         0
source_app_bytes           0
remote_app_bytes           0
duracion                7845
avg_local_pkt_rate      7845
avg_remote_pkt_rate     7845
source_app_packets.1       0
dns_query_times            0
type                       0
dtype: int64

When we processed each pcap we had problems extracting three features (duration, average remote packet rate, average local packet rate), which is why these columns are empty. The issue came from the feature processing script, and we no longer have it.

In [18]:

data = data.drop(['duracion','avg_local_pkt_rate','avg_remote_pkt_rate'], axis=1).copy()

Now, the idea is to look at the outliers in the data.

In [20]:

import seaborn as sns

sns.boxplot(x=data.tcp_urg_packet)

Out[20]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f3dea43f208>

In [21]:

data.loc[data.tcp_urg_packet > 0].shape[0]

Out[21]:

2

That column will not be used for the analysis: only two rows are different from zero, although they may be interesting for future analysis.

In [22]:

data = data.drop(columns=["tcp_urg_packet"]).copy()
data.shape

Out[22]:

(7845, 13)

In [23]:

sns.pairplot(data)

Out[23]:

<seaborn.axisgrid.PairGrid at 0x7f3dea4aae80>

We have many outliers in some features. I will omit the in-depth analysis and only keep the subset of the data without the noise.

data=data[data.tcp_packets<20000].copy()
data=data[data.dist_port_tcp<1400].copy()
data=data[data.external_ips<35].copy()
data=data[data.vulume_bytes<2000000].copy()
data=data[data.udp_packets<40].copy()
data=data[data.remote_app_packets<15000].copy()
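
The six filters above all have the same shape (keep rows whose value is strictly below an upper bound), so they can be expressed as one table of bounds. The sketch below shows that idea on plain dictionaries. The bounds are copied from the statements above, while the sample rows and the `within_bounds` helper are hypothetical.

```python
# Upper bounds copied from the filtering statements above.
UPPER_BOUNDS = {
    "tcp_packets": 20000,
    "dist_port_tcp": 1400,
    "external_ips": 35,
    "vulume_bytes": 2000000,
    "udp_packets": 40,
    "remote_app_packets": 15000,
}


def within_bounds(row, bounds):
    """True when every bounded feature is strictly below its upper limit."""
    return all(row[column] < limit for column, limit in bounds.items())


# Hypothetical rows: app_b has too many TCP packets, app_c sits exactly on a limit.
rows = [
    {"name": "app_a", "tcp_packets": 120, "dist_port_tcp": 3, "external_ips": 4,
     "vulume_bytes": 5000, "udp_packets": 2, "remote_app_packets": 90},
    {"name": "app_b", "tcp_packets": 25000, "dist_port_tcp": 3, "external_ips": 4,
     "vulume_bytes": 5000, "udp_packets": 2, "remote_app_packets": 90},
    {"name": "app_c", "tcp_packets": 120, "dist_port_tcp": 3, "external_ips": 4,
     "vulume_bytes": 5000, "udp_packets": 40, "remote_app_packets": 90},
]

kept = [row for row in rows if within_bounds(row, UPPER_BOUNDS)]
print([row["name"] for row in kept])
```

Keeping the thresholds in one place makes it easier to revisit them later, since they were chosen by eye from the pair plot rather than by a formal outlier rule.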

In [25]:

data[data.duplicated()].sum()

Out[25]:

name                    AntiVirusAntiVirusAntiVirusAntiVirusAntiVirusA...
tcp_packets 15038
dist_port_tcp 3514
external_ips 1434
vulume_bytes 2061210
udp_packets 38
source_app_packets 21720
remote_app_packets 18841
source_app_bytes 8615120
remote_app_bytes 2456160
source_app_packets.1 21720
dns_query_times 5095
type benignbenignbenignbenignbenignbenignbenignbeni...
dtype: object

In [26]:

data=data.drop('source_app_packets.1',axis=1).copy()

In [27]:

from sklearn import preprocessing

# RobustScaler centers each feature on its median and scales by its interquartile range.
scaler = preprocessing.RobustScaler()
scaledData = scaler.fit_transform(data.iloc[:,1:11])
scaledData = pd.DataFrame(scaledData, columns=['tcp_packets','dist_port_tcp','external_ips','vulume_bytes','udp_packets','source_app_packets','remote_app_packets','source_app_bytes','remote_app_bytes','dns_query_times'])
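
As a sketch of what robust scaling does to a single feature, the function below subtracts the median and divides by the interquartile range, using only the standard library. This is my own illustration of the default behaviour as I understand it, not scikit-learn's code; the `robust_scale` name and the sample values are hypothetical, and the `inclusive` quantile method is chosen on the assumption that it matches linear interpolation between data points.

```python
import statistics


def robust_scale(values):
    """Scale values as (x - median) / IQR; assumes the IQR is not zero."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Hypothetical feature with one extreme value.
packets = [1, 2, 3, 4, 100]
print(robust_scale(packets))
```

The extreme value stays extreme after scaling, but it does not distort the scale of the other points, because the median and the quartiles barely react to it. That is why this scaler suits traffic features with heavy outliers.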

From [6] we concluded that the best network features are:

  • (R1) TCP packets: the number of TCP packets sent and received during communication.
  • (R2) Different TCP packets: the total number of packets different from TCP.
  • (R3) External IPs: the number of external addresses (IPs) with which the application tried to communicate.
  • (R4) Volume of bytes: the number of bytes sent from the application to the external sites.
  • (R5) UDP packets: the total number of UDP packets transmitted in a communication.
  • (R6) Source application packets: the number of packets sent from the application to a remote server.
  • (R7) Remote application packets: the number of packets received from external sources.
  • (R8) Source application bytes: the volume (in bytes) of the communication between the application and the server.
  • (R9) Remote application bytes: the volume (in bytes) of the data from the server to the emulator.
  • (R10) DNS queries: the number of DNS queries.
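
To give a feel for how some of these features could be derived from a capture, the sketch below aggregates a handful of packet records into R1, R3, R5, R6, R7 and R10. It is not our feature extractor: the `Packet` record, its fields and the sample capture are hypothetical, and a real pcap parser is assumed to have produced the records.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical, already-parsed packet record."""
    protocol: str              # "TCP" or "UDP"
    remote_ip: str             # address of the external endpoint
    outbound: bool             # True when the application sent the packet
    is_dns_query: bool = False


def summarize(packets):
    """Aggregate one application's packets into a few of the features listed above."""
    return {
        "tcp_packets": sum(p.protocol == "TCP" for p in packets),        # R1
        "external_ips": len({p.remote_ip for p in packets}),             # R3
        "udp_packets": sum(p.protocol == "UDP" for p in packets),        # R5
        "source_app_packets": sum(p.outbound for p in packets),          # R6
        "remote_app_packets": sum(not p.outbound for p in packets),      # R7
        "dns_query_times": sum(p.is_dns_query for p in packets),         # R10
    }


# Hypothetical capture: one DNS lookup followed by a short TCP exchange.
capture = [
    Packet("UDP", "8.8.8.8", outbound=True, is_dns_query=True),
    Packet("UDP", "8.8.8.8", outbound=False),
    Packet("TCP", "203.0.113.5", outbound=True),
    Packet("TCP", "203.0.113.5", outbound=False),
    Packet("TCP", "203.0.113.5", outbound=True),
]

features = summarize(capture)
print(features)
```

The byte-volume features (R4, R8, R9) would follow the same pattern with a packet-size field, which is left out here to keep the example short.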

Modeling

X_train, X_test, y_train, y_test = train_test_split(scaledData.iloc[:,0:10], data.type.astype("str"), test_size=0.25, random_state=45)

In [29]:

from sklearn.metrics import cohen_kappa_score

gnb = GaussianNB()
gnb.fit(X_train, y_train)
pred = gnb.predict(X_test)
## accuracy
accuracy = accuracy_score(y_test,pred)
print("naive_bayes")
print(accuracy)
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score")
print(cohen_kappa_score(y_test, pred))

Out [29]:

naive_bayes
0.44688457609805926
precision recall f1-score support

benign 0.81 0.12 0.20 1190
malicious 0.41 0.96 0.58 768

accuracy 0.45 1958
macro avg 0.61 0.54 0.39 1958
weighted avg 0.66 0.45 0.35 1958

cohen kappa score
0.06082933470572538

In [30]:

# k-neighbors algorithm: try k = 3, 6, 9 and 12
for i in range(3, 15, 3):
    neigh = KNeighborsClassifier(n_neighbors=i)
    neigh.fit(X_train, y_train)
    pred = neigh.predict(X_test)
    # accuracy
    accuracy = accuracy_score(pred, y_test)
    print("kneighbors {}".format(i))
    print(accuracy)
    # Note: predictions are passed first here, so in this report precision and
    # recall are swapped and "support" counts predicted labels.
    print(classification_report(pred, y_test, labels=None))
    print("cohen kappa score")
    print(cohen_kappa_score(y_test, pred))
    print("")

Out [30]:

kneighbors 3
0.8861082737487231
precision recall f1-score support

benign 0.90 0.91 0.91 1173
malicious 0.87 0.85 0.86 785

accuracy 0.89 1958
macro avg 0.88 0.88 0.88 1958
weighted avg 0.89 0.89 0.89 1958

cohen kappa score
0.7620541314671169

kneighbors 6
0.8784473953013279
precision recall f1-score support

benign 0.92 0.88 0.90 1240
malicious 0.81 0.87 0.84 718

accuracy 0.88 1958
macro avg 0.87 0.88 0.87 1958
weighted avg 0.88 0.88 0.88 1958

cohen kappa score
0.7420746759356631

kneighbors 9
0.8707865168539326
precision recall f1-score support

benign 0.89 0.90 0.89 1175
malicious 0.85 0.83 0.84 783

accuracy 0.87 1958
macro avg 0.87 0.86 0.86 1958
weighted avg 0.87 0.87 0.87 1958

cohen kappa score
0.729919255030886

kneighbors 12
0.8615934627170582
precision recall f1-score support

benign 0.88 0.89 0.89 1185
malicious 0.83 0.82 0.82 773

accuracy 0.86 1958
macro avg 0.86 0.85 0.86 1958
weighted avg 0.86 0.86 0.86 1958

cohen kappa score
0.7100368862537227

In [31]:

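from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rdF=RandomForestClassifier(n_estimators=250, max_depth=50,random_state=45)
rdF.fit(X_train,y_train)
pred=rdF.predict(X_test)
# Rows of the confusion matrix are true labels, columns are predicted labels.
cm=confusion_matrix(y_test, pred)

accuracy = accuracy_score(y_test,pred)
print(classification_report(y_test,pred, labels=None))
print("cohen kappa score")
print(cohen_kappa_score(y_test, pred))
print(cm)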

Out [31]:


precision recall f1-score support

benign 0.93 0.94 0.93 1190
malicious 0.90 0.88 0.89 768

accuracy 0.92 1958
macro avg 0.91 0.91 0.91 1958
weighted avg 0.92 0.92 0.92 1958

cohen kappa score
0.8258206083396299
[[1117 73]
[ 89 679]]
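
The figures in the report can be recovered from the confusion matrix alone, which is a useful sanity check. The sketch below does that with plain Python, assuming the usual scikit-learn layout (rows are true labels, columns are predictions, in the order benign, malicious); the `per_class_metrics` helper is a hypothetical name.

```python
# Confusion matrix copied from Out[31]; rows = true labels, columns = predictions.
cm = [[1117, 73],
      [89, 679]]
labels = ["benign", "malicious"]


def per_class_metrics(cm, labels):
    """Return (accuracy, {label: {"precision": ..., "recall": ...}})."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    report = {}
    for i, label in enumerate(labels):
        predicted_as_label = sum(row[i] for row in cm)   # column total
        actually_label = sum(cm[i])                      # row total
        report[label] = {
            "precision": cm[i][i] / predicted_as_label,
            "recall": cm[i][i] / actually_label,
        }
    return correct / total, report


accuracy, report = per_class_metrics(cm, labels)
print(round(accuracy, 2))
for label, scores in report.items():
    print(label, round(scores["precision"], 2), round(scores["recall"], 2))
```

Rounded to two decimals, these values agree with the classification report above: 89 malicious applications were labelled benign and 73 benign applications were flagged as malicious.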

Conclusions

Great 💪🤖!! Now we have seen the two approaches to analyzing a cyber threat. Of course, there are many variables that we didn't use in this case, for example netflows, method calls, graph analysis, and many others, but the idea behind this work is to understand that we need to pay attention to all of the environments, because in cybersecurity we face a complex problem.

I'm going to continue with my cybersecurity posts. Although this is the first one, it will be updated once I find improvements and new results.

Some of the next posts that I will publish are:

  • Malicious Websites Detection
  • Secure Learning
  • Cyber Web Attacks Detection
  • Deep Fakes Detection

I would like to mention my former undergraduate students who helped me to continue my Android research:

  • Melisa Garcia
  • Jose Luis Osorio
  • Andres Felipe Perez
  • Jhoan Steven Delgado

Moreover, I would like to acknowledge the co-authors and my English teacher:

  • Andres Navarro
  • Sebastian Londoño
  • Samir Riascos

References

  • [1] Urcuqui López, C. C., García Peña, M., Osorio Quintero, J. L., & Navarro Cadavid, A. (2018). Ciberseguridad: un enfoque desde la ciencia de datos (Primera edición).
  • [2] Navarro Cadavid, A., Londoño, S., Urcuqui López, C. C., & Gomez, J. (2014, June). Análisis y caracterización de frameworks para detección de aplicaciones maliciosas en Android. In Conference: XIV Jornada Internacional de Seguridad Informática ACIS-2014 (Vol. 14). ResearchGate.
  • [3] Urcuqui-López, C., & Cadavid, A. N. (2016). Framework for malware analysis in Android.
  • [4] Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. CEUR Workshop Proceedings, Vol. 1950, 14-17.
  • [5] López, C. C. U., Villarreal, J. S. D., Belalcazar, A. F. P., Cadavid, A. N., & Cely, J. G. D. (2018, May). Features to Detect Android Malware. In 2018 IEEE Colombian Conference on Communications and Computing (COLCOM) (pp. 1-6). IEEE.
___   ___    __   ____  ___    ___    __    __    
| | / | | | | | | | / | | | | |
| |/ / | | | | | | / / | | | |
| / | | | | | | \ \_/ /
| |\ \ | '--' | | |\ \ \ /
| _| `.__\ |________| | _| `.__\ |___|
[0][1][0]
[0][0][1]
[1][1][1]
Happy Hacking

About me

I am a security data scientist, hacker and researcher passionate about ICTs; moreover, I'm an assistant professor at Universidad Icesi. I hold a master's degree in informatics and telecommunications and an undergraduate degree in software engineering. My research topics are cybersecurity and data science, and I'm really excited about malware (anomaly) detection, adversarial techniques, secure learning, and ethical hacking.

Welcome to my world 💻~(-_-~)

https://www.linkedin.com/in/christianurcuqui/


Cybersecurity data scientist with more than 9 years in the software industry. My passion is to create solutions to make cyberspace safer using data science.