Data Holds the Answer — A Journey Through Cybersecurity and Machine Learning

Christian Urcuqui
Sep 29, 2021
The hacker cat. Source: own picture

It is time to implement proactive scenarios; we must be two steps ahead of the digital issues. Urcuqui, C.

Preface

This is my second publication on this site in a series on research findings in cybersecurity and data science that started two years ago [1]. The projects were carried out between 2019 and 2021 by systems engineering students at ICESI University as part of their theses, under my supervision as their teacher, as indicated in each project. The purpose of this article is to introduce the projects, highlight each student’s results, and give some examples.

The general idea for this article started when I was reading “The Art of War” by Sun Tzu, a well-known book about Chinese military techniques and tactics. The text includes the phrase “if you know the enemy and know yourself, you need not fear the result of a hundred battles. If you know yourself but not the enemy, for every victory gained you will also suffer a defeat. If you know neither the enemy nor yourself, you will succumb in every battle”. This thought allowed me to see a new approach to security: a proactive one instead of a reactive one.

Christian Urcuqui

Introduction

This document is divided into two main parts. The first explores the application of data science to cybersecurity, specifically by proposing new defensive models to detect cyber threats such as cryptojacking and malicious websites. The second shows the security perspective on AI, and includes a project on adversarial machine learning and an introduction to unfairness.

Part One

Malicious Websites Detection

Authors: Melisa Garcia Peña and Jose Quintero.

People interact with websites every day and may end up visiting malicious ones through web browsing, opening emails, or clicking ads, among other ways. Digital controls help here by blocking and alerting on malicious URLs, typically through a blacklist of sites previously analyzed by security experts. Nevertheless, cybercriminals can hide their malicious activities in diverse ways (for example, by injecting code into files and links), so it can be difficult for security research teams to identify each one.

To tackle this identification problem, the work was divided into several parts, one for each variant in which a web attack can occur. The idea was to develop a mechanism that learns malicious patterns and reduces the time needed to analyze each site.

The project was published in two articles. The first one [2] presents a state-of-the-art analysis of anti-defacement techniques and outlines an approach that uses network data and machine learning models. During the process, honeypots were analyzed as mechanisms to simulate the execution of a set of malicious URLs.

The second article, presented at a congress in Chile [3], reports the results of the proposed methodology; its dataset is openly available for download [4]. The following table shows the accuracy of four classification models trained on data that combines features from the network and application layers.

Table I. Machine learning classifiers to identify malicious websites

Web Application Attack Detection

Authors: Brayan Henao and Juan Prada

Network information is an interesting resource because of its properties, principally its volume, velocity, and variety. A single record can carry features that are crucial in a cyber-threat analysis (threat intelligence, vulnerability research, and other approaches), and additional information (including metadata) can help explain the behavior and history of an incident.

Researchers explored several honeypots and analyzed their properties to identify which one was best suited to simulating web attack scenarios [14, 15]. In the same way, they analyzed several vulnerable web applications [16].

The project focused on detecting web application attacks, principally cross-site scripting (XSS) and SQL injection (SQLi). Researchers developed a framework that included a vulnerable web application, a set of web attacks, and Python scripts.

The architecture of the system

The framework simulated each attack and collected the resulting network traffic (pcap files). The data were explored and a set of machine learning classifiers was proposed [17]. The next table presents some of the results for each classification model evaluated.

Machine learning classifiers to detect web attacks (SQLi and XSS)
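
As a rough illustration of the kind of input such classifiers could consume, the sketch below turns a URL-encoded query string into a few numeric features. It assumes the query string has already been extracted from the captured traffic; the keyword lists, the feature names, and the function payload_features are hypothetical choices made for this example and are not taken from the project in [17].

```python
from urllib.parse import unquote_plus

# Hypothetical marker lists, chosen only for illustration.
SQL_KEYWORDS = ("select", "union", "insert", "drop", "or 1=1", "--")
XSS_MARKERS = ("<script", "onerror=", "onload=", "javascript:")


def payload_features(raw_query: str) -> dict:
    """Turn one URL-encoded query string into a small numeric feature dict."""
    # Decode %xx escapes and '+' so the markers can be matched as plain text.
    text = unquote_plus(raw_query).lower()
    return {
        "length": len(text),
        "quotes": text.count("'") + text.count('"'),
        "angle_brackets": text.count("<") + text.count(">"),
        "sql_keywords": sum(text.count(k) for k in SQL_KEYWORDS),
        "xss_markers": sum(text.count(m) for m in XSS_MARKERS),
    }


if __name__ == "__main__":
    print(payload_features("id=1%27+OR+1%3D1+--"))
    print(payload_features("q=%3Cscript%3Ealert(1)%3C%2Fscript%3E"))
```

Each dictionary could then become one row of a training table, with the label (benign, SQLi, or XSS) coming from the attack that was simulated.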

Netflows, Logs and Analysis

Author: Christian Urcuqui

Other projects explore the use of netflows, proxy records, and DNS queries. One example is Apache Spot [5], which proposes using these network resources to find anomalies in communications with machine learning models. Pcap files contain more information, so processing them involves greater resource consumption, which makes these lighter sources attractive.

This approach was applied in a project that processed log files from different proxies and resulted in a framework. The framework included a model based on Latent Dirichlet Allocation, a dictionary of words, and a function for detecting anomalies in proxy communications. The results were shared at a conference in 2019 [6].

Following [5], and given that the processed logs were in Blue Coat format, the words built for each IP were made up of:

  • DNS (1 for Alexa, 2 for user domains, 0 for others)
  • Hour of the day
  • Request method
  • Entropy of the URI
  • Type of content
  • User-agent frequency
  • Response code

A security data science consultancy project was carried out for a company that had well-defined processes, frameworks, controls, and teams for handling digital issues. Some organizations use defensive controls managed by external institutions, which is not bad in itself but brings limitations, such as less control over the data life cycle. The information was obtained from different endpoints and had already been filtered by the security controls. The project therefore faced two challenges: the information covered only a limited period, and it was not raw data. These are some conclusions from the projects mentioned:

  1. The proposed data science mechanisms need to support other controls, not replace them (especially when there is no security research team).
  2. There are important areas still to explore, such as social engineering using AI and remote work vulnerabilities.
  3. Incorporating cognitive security solutions into software development processes such as DevSecOps (DevSecAIOps, DevSecDataOps …?) can be interesting; the idea is to find vulnerabilities or digital issues using data, AI, and an adversarial approach.
  4. Evaluating different scenarios (made up of controls and processes) with a model and data to quantify risk in real time can give an idea of the current status.
  5. The data science approach proposed included:
  • Unsupervised learning models and a framework that involved the business in order to reduce the false-positive ratio.
  • A set of security databases and controls to increase the true-positive ratio (cyber threat detection) and reduce the percentage of false negatives.
  • Refinement of the model using the benign data from the organization and MLOps.

Botnets

Authors: Julio Gaviria and Anderson Ramírez

A network of bots (botnet) is a set of infected devices that can be controlled by someone (the owner) through command and control (C&C) software, and it is typically used for malicious purposes.

In recent years, botnets have been used as weapons in cybercrime. They allow cybercriminals to use sensitive data and digital assets without authorization to commit computer crimes such as DDoS attacks, spam, and cryptojacking.

The use of netflows was explored again in a project to detect malicious botnets, based on a PhD thesis [7]. The authors created a framework with Python scripts to (1) transform network traffic into netflows, (2) simulate benign information, and (3) train machine learning algorithms. Moreover, they developed a web application with interfaces for end users who want to look for anomalies in a pcap file. This project was presented at DragonJar, a recognized security conference in Latin America, in 2020 [8].
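
Step (1) can be pictured with the following sketch, which groups already-parsed packets into unidirectional flows per time window. It is only an approximation of what a netflow exporter does: the Packet fields, the 60-second default window, and the function packets_to_flows are assumptions for illustration, and reading the pcap file itself is left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    ts: float     # capture timestamp in seconds
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    size: int     # bytes on the wire


def packets_to_flows(packets, window=60.0):
    """Group packets by time window and 5-tuple, then summarize each group."""
    groups = defaultdict(list)
    for p in packets:
        key = (int(p.ts // window), p.src, p.dst, p.sport, p.dport, p.proto)
        groups[key].append(p)

    flows = []
    for key in sorted(groups):
        pkts = groups[key]
        flows.append({
            "window": key[0],
            "src": key[1], "dst": key[2],
            "sport": key[3], "dport": key[4], "proto": key[5],
            "packets": len(pkts),
            "bytes": sum(p.size for p in pkts),
            "duration": max(p.ts for p in pkts) - min(p.ts for p in pkts),
        })
    return flows


if __name__ == "__main__":
    sample = [
        Packet(0.5, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 100),
        Packet(10.0, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 200),
        Packet(61.0, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 50),
    ]
    for flow in packets_to_flows(sample):
        print(flow)
```

The window length is a tuning choice: shorter windows give more flows with less context each, which is presumably the kind of trade-off analyzed in [8].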

Cryptojacking

Author: Steven Bernal

Cryptojacking is a cyber threat that uses the Internet to hijack a computer's processor, degrading its performance. Its purpose is to solve the mathematical puzzles that validate cryptocurrency transactions, and because it uses system resources without authorization, it also increases the victim's energy costs. There is a lot of news about the use of cryptocurrency, its implications, and cryptojacking; some examples are [18, 19, 20, 21, 22].

News about cryptocurrency and cryptojacking

The author explored the use of netflows from different sets of cryptojacking variants and their integration with hardware features to train machine learning models. The results were presented at UNAD [9], at the third edition of the Sombreros Blancos Conference (2021), and at Unitalks Ekoparty Colombia (2021).

Part Two

Deepfakes

Authors: Bayron Campaz, Juan Diaz, and Santiago Gutierrez

Artificial intelligence has advanced significantly, especially in its sub-area of machine learning. These advances owe much to computational infrastructure that is now affordable thanks to cloud computing and, even more, to the amount of data available and the Internet. At the same time, social websites and personal devices generate a lot of information, and that data can come from real or non-real sources; in other words, we live in a digital world where fake news passes before our eyes on the Internet every second.

Non-real information has evolved thanks to the same AI improvements: mathematical models can now produce fake resources by manipulating images, videos, and sounds. These techniques might be used to perform social engineering attacks such as spoofing, intimidation, and blackmail, among others.

Use of a GAN model to generate videos

Currently available detection methods work well on first-generation datasets, which contain poor-quality deepfakes, but they do not achieve satisfactory results on next-generation datasets. The researchers proposed a deepfake detection method based on a deep learning model with an Xception architecture, trained using a transfer learning approach. The model was developed using a dataset that combines first- and second-generation deepfake sets, and it achieved an accuracy of 92.12% and an AUC of 92.15%.
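
Since accuracy and AUC are reported side by side, it may help to see how the two metrics are computed. The sketch below uses invented toy scores, not the project's data: accuracy depends on a decision threshold (0.5 is assumed here), while AUC only depends on how the scores rank real against fake samples.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: 1 = deepfake, 0 = real video; scores are model outputs.
    y = [1, 1, 0, 0, 1, 0]
    s = [0.9, 0.4, 0.6, 0.2, 0.7, 0.1]
    print(accuracy(y, s), roc_auc(y, s))
```

The pairwise loop is quadratic in the number of samples, which is fine for an illustration; libraries typically compute the same quantity from sorted ranks.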

Adversarial Machine Learning

Author: Jhoan Steven

This project was based on the idea of gaming: security is seen as a set of scenarios in which security agents face adversaries (such as cybercriminals). It explored the application of reinforcement learning to build proactive scenarios that provide information for improving defensive security mechanisms.

AI versus AI (fun representation)

First, a set of machine learning models was evaluated successfully, reaching detection rates of 99% and 81%. Although the results showed superior performance on a set of malware, the models could still be vulnerable to adversarial attacks. The project therefore developed a reinforcement learning strategy to exploit a defensive machine learning model for detecting Android malware, and then presented a more robust model obtained through adversarial training, in other words, by applying a secure learning approach. The results were presented at different recognized security conferences [10, 11].
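
To give a feel for the attack side only, the sketch below uses an epsilon-greedy bandit, which is a much simpler stand-in for the reinforcement learning strategy used in the project. Everything in it is hypothetical: the linear detector, its weights and threshold, the feature names, and the function learn_best_addition. The agent learns which harmless-looking feature, when added to a malware sample, lowers the detector's score the most.

```python
import random

# Hypothetical linear detector: positive weights look malicious, negative benign.
WEIGHTS = {"send_sms": 2.0, "read_contacts": 1.5, "internet": 0.5,
           "vibrate": -0.5, "set_wallpaper": -1.0, "flashlight": -1.5}
THRESHOLD = 3.0


def detector_score(features):
    return sum(WEIGHTS[f] for f in features)


def is_flagged(features):
    return detector_score(features) >= THRESHOLD


def learn_best_addition(sample, actions, episodes=200, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: estimate how much each added feature lowers the score."""
    rng = random.Random(seed)
    estimates = {a: 0.0 for a in actions}
    pulls = {a: 0 for a in actions}
    base = detector_score(sample)
    for step in range(episodes):
        if step < len(actions):
            action = actions[step]                      # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(actions)                # explore
        else:
            action = max(actions, key=estimates.get)    # exploit
        reward = base - detector_score(sample | {action})
        pulls[action] += 1
        # Incremental mean of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return max(actions, key=estimates.get), estimates


if __name__ == "__main__":
    malware = {"send_sms", "read_contacts", "internet"}
    best, _ = learn_best_addition(malware, ["vibrate", "set_wallpaper", "flashlight"])
    print(best, is_flagged(malware), is_flagged(malware | {best}))
```

In adversarial training, evasive samples found this way would be added back to the training set with their true label so that the retrained detector stops being fooled by them.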

Fairness

Author: Christian Urcuqui

During a security project, the idea was to build a model to predict crimes in the capital city of Colombia. Many projects on crime analysis have been proposed in recent years, but there are also ongoing discussions about the use of surveillance systems, AI, and math. There is excellent literature on the topic; some examples are:

Books of reference about AI, math, data, privacy, and surveillance systems

Two articles were published from this work [12, 13]: the first analyzed the incorporation of a penalization mechanism during the training phase of a model to improve fairness, and the second studied a way to interpret the model's results.
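
The general idea of penalizing unfairness during training can be sketched as a loss with an extra term. The example below is an assumption-laden illustration, not the mechanism from [12]: it adds the gap between the mean predicted score of two groups (a demographic-parity style penalty) to an ordinary log loss, weighted by a hypothetical parameter lam.

```python
import math


def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def parity_gap(probs, groups):
    """Absolute difference between the mean predicted score of group 0 and group 1."""
    g0 = [p for p, g in zip(probs, groups) if g == 0]
    g1 = [p for p, g in zip(probs, groups) if g == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))


def penalized_loss(labels, probs, groups, lam=1.0):
    """Prediction error plus lam times the fairness penalty."""
    return log_loss(labels, probs) + lam * parity_gap(probs, groups)


if __name__ == "__main__":
    y = [1, 1, 0, 0]
    p = [0.8, 0.6, 0.4, 0.2]
    g = [0, 0, 1, 1]          # hypothetical group membership per sample
    print(log_loss(y, p), parity_gap(p, g), penalized_loss(y, p, g, lam=2.0))
```

With lam set to 0 the model is trained for accuracy alone; larger values trade some accuracy for a smaller gap between groups.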

Conclusions and Future Works

We saw different cybersecurity and data science projects that take both offensive and defensive approaches. According to The Art of War, we must know our environment, and one way to do this is to gather information from inside and outside the organization. If we have the data, measure our risks in real time, and build proactive scenarios while thinking like an adversary, then we need not fear the result of a hundred battles.

All the previous projects are in a book that is going to be published at the end of this year (a tentative date). The book also has content aimed at introducing anyone interested in data science and cybersecurity to the field, especially to research from academia. For now, the new edition is expected to be titled “data are the answer”.

About the Author

As a cybersecurity data scientist, hacker, researcher, and assistant professor, Christian Camilo Urcuqui López has been working on cybersecurity, data science, and e-health. He has worked in the software industry, and his experience includes different projects for public and private institutions, in which he has participated as a software engineer, researcher, director of IT, and now data scientist. He is interested in malware (anomaly) detection, adversarial techniques, secure learning, privacy, threat hunting, responsible AI, and ethical hacking. He is the author of the book “Ciberseguridad: un enfoque desde la ciencia de datos”, published by Editorial ICESI in 2019. Urcuqui is currently a Data Scientist at Globant (an IT and software development company).

Email: ulcamilo@gmail.com

GitHub: https://github.com/urcuqui/

LinkedIn: https://www.linkedin.com/in/christianurcuqui/

Acknowledgments

I would like to finish this article by thanking everyone who helped me by giving their time and ideas to these projects: each current and graduate student, the professors, and the professionals, as well as the places where we were invited and accepted as speakers. Special thanks to my family and my girlfriend, who have supported me during these years, and to my hacker cat, with whom I share a connection to this field. Finally, I appreciate the help of my English teacher Nelly Manosalva for her time reviewing this post.

References

[1] Urcuqui, C. My Cybersecurity Machine Learning Research / Part1. Via https://medium.com/@ulcamilo/my-android-malware-research-d20f16af4d1f

[2] López, C. C. U., Peña, M. G., Quintero, J. L. O., & Cadavid, A. N. (2016). Antidefacement-State of art. Sistemas & Telemática, 14(39), 9–27.

[3] Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Machine Learning Classifiers to Detect Malicious Websites. In SSN (pp. 14–17).

[4] Urcuqui, C., Navarro, A., Osorio, J., & García, M. (2017). Malicious and Benign Websites. IEEE Dataport.

[5] Apache Spot. Via https://spot.incubator.apache.org/

[6] Illuminating Threats in the Network Using Data Analytics. Via https://www.researchgate.net/publication/341231964_ILLUMINATING_THREATS_IN_THE_NETWORK_USING_data_analytics_PRESENTACION

[7] Garcia, S. (January 2014). ResearchGate. Retrieved from https://researchgate.net/publication/271204142_Identifying_Modeling_and_Detecting_Botnet_Behaviors_in_the_Network/download

[8] Gaviria, J., Mauricio, A., Urcuqui, C. (2020). Analysis of Time Windows to Detect Botnet’s Behaviour. DragonJar Conference. Via https://youtu.be/vJ9AzfudQK8

[9] Bernal, S., Urcuqui, C. (2021). Detección de Cryptojacking Aplicando Técnicas de Ciencia de Datos. Universidad Nacional Abierta y a Distancia UNAD. Via https://youtu.be/2VTNDG-ClXM

[10] Hacking Machine Learning Models and How It Leads Us to Make Them Less Compromised. (2020). HoneyCON, and CoronaCon 2020.

[11] Hacking Machine Learning Models and How It Leads Us to Make Them Less Compromised. (2020). 8.8 Computer Security Conference — Andina Edition.

[12] Urcuqui, C. & Moreno, J. & Montenegro, C. & Riascos, A.& Dulce, M. (2020). Accuracy and Fairness in a Conditional Generative Adversarial Model of Crime Prediction. In: The 7th International Conference on Behavioral and Social Computing (BESC).

[13] Dulce, M. & Gomez, O. & Moreno, J. & Urcuqui, C. & Riascos, A. (2021). Interpreting a Conditional Generative Adversarial Network Model for Crime Prediction. 25th Iberoamerican Congress on Pattern Recognition.

[14] Henao, S., & Prada, S. (2018). Tabla comparativa Honeypots https://github.com/i2tResearch/Ciberseguridad_web/blob/master/Sistema%20para%20el%20estudio%20de%20ciberataques%20web%20(PDG)/Documentos/Tabla%20comparativa%20Honeypots.docx

[15] Henao, S., & Prada, S. (2018). Tabla comparativa Web Honeypots https://github.com/i2tResearch/Ciberseguridad_web/blob/master/Sistema%20para%20el%20estudio%20de%20ciberataques%20web%20(PDG)/Documentos/Tabla%20comparativa%20Web%20Honeypots.docx

[16] Henao, S., & Prada, S. (2018). Comparación de aplicaciones. https://github.com/i2tResearch/Ciberseguridad_web/blob/master/Sistema%20para%20el%20estudio%20de%20ciberataques%20web%20(PDG)/Documentos/Comparacion%20aplicaciones%20vulnerables.docx

[17] Henao, S., & Prada, S. (2018). Sistema para el estudio de ciberataques web. https://github.com/i2tResearch/Ciberseguridad_web/blob/master/Sistema%20para%20el%20estudio%20de%20ciberataques%20web%20(PDG)/Documento%20final%20proyecto%20de%20grado.pdf

[18] bbc. Via https://www.bbc.com/mundo/noticias-56732881

[19] larepublica. Via https://www.larepublica.co/globoeconomia/las-multinacionales-pizza-hut-starbucks-y-mcdonalds-reciben-bitcoin-en-el-salvador-3229560

[20] marca. Via https://www.marca.com/en/football/international-football/2020/12/22/5fe211e922601d0e638b45d3.html

[21] cointelegraph. https://es.cointelegraph.com/news/browser-based-cryptojacking-is-back-as-attacks-spike-163

[22] bbc. Via https://www.bbc.com/mundo/noticias-52861246
