Applications of machine learning in cybersecurity

Abstract

In recent years, machine learning has experienced tremendous progress due to an increase in raw computing power, progress in algorithms and, most importantly, an increase in the volume of data available for training. The applications of machine learning continue to expand day by day, and cybersecurity is one of them. Machine learning is a promising tool for tackling cybersecurity issues because of its ability to adapt to new and unknown circumstances. This paper covers such applications. Specifically, the topics covered are authentication with keystroke dynamics, malware detection in Android applications, phishing detection, breaking Human Interaction Proofs, and the security of machine learning itself.

Keywords: machine learning; cybersecurity; internet; cyber threats

Introduction

Problems in machine learning are usually divided into three main categories based on the data used for training:

Supervised learning
Unsupervised learning
Reinforcement learning

Problems in machine learning can also be categorized by their output:

Classification
Clustering
Regression

Machine learning categorisation by training data

Supervised learning

In supervised learning, an algorithm is presented with input data together with the desired output data and produces a model that maps each input to an output. An example of supervised learning in cybersecurity is classification for malicious email detection. The training set for this problem would consist of emails labeled as 'safe' or 'malicious'. The algorithm would produce a model able to classify new emails into the categories 'safe' and 'malicious'.
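As a minimal sketch of this idea, the snippet below trains a small Naive Bayes style word-count classifier on labeled emails. The example emails, the helper names `train` and `classify`, and the add-one smoothing are illustrative assumptions and are not taken from any of the cited works.

```python
from collections import Counter
import math

def train(labelled_emails):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of words
    label_counts = Counter()  # label -> number of emails
    for text, label in labelled_emails:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    word_counts, label_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_emails = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_emails)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training set (inputs with desired outputs).
training_set = [
    ("verify your account password now", "malicious"),
    ("urgent click this link to claim your prize", "malicious"),
    ("meeting notes attached for tomorrow", "safe"),
    ("lunch tomorrow with the project team", "safe"),
]
model = train(training_set)
prediction = classify(model, "click to verify your password")
```

The labeled pairs play the role of the training set, and `classify` is the learned mapping from a new input to one of the two labels.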

Chapters “Authentication with Keystroke Dynamics”, “Android applications malware detection”, “Phishing Detection” and “Breaking Human Interaction Proofs (HIPs)” cover topics in which supervised learning was used.

Unsupervised learning

Unsupervised learning deals with unlabelled data. The most common application of unsupervised learning is finding groups within a given dataset. An example of unsupervised learning in cybersecurity is the BotMiner system [8] which detects real-world botnets by clustering network flows along with behaviors and actions of machines on the internet.

Chapter “Android applications malware detection” covers the usage of unsupervised learning in Android malware detection.

Reinforcement learning

Reinforcement learning is a reward-based learning system. The goal of this system is to create an intelligent agent that acts in an environment of states with a set of available actions. The agent chooses an action based on previous experience, that is, on whether the action led to reward or punishment. An example of a reinforcement learning application in cybersecurity can be found in [4].
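The reward-driven loop can be illustrated with tabular Q-learning on a toy environment. This is a generic sketch and not the method of [4]: the five-state chain, the reward of 1 at the last state, the uniform exploration, and the values of `alpha` and `gamma` are all assumptions chosen for illustration.

```python
import random

N_STATES = 5           # states 0..4; state 4 is the rewarded goal
ACTIONS = (-1, +1)     # move left or right along the chain

def step(state, action):
    """Apply an action; reaching the goal yields reward 1, anything else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values from experience gathered by random exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            a = rng.randrange(2)  # explore uniformly at random
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = q_learning()
# Greedy policy for the non-goal states: the action with the highest value.
policy = [ACTIONS[row.index(max(row))] for row in q_table[:-1]]
```

After training, the greedy policy prefers the action that previously led toward the reward in every non-goal state.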

Machine learning categorisation by output

Classification

In a classification problem, the inputs are mapped to user-specified outputs. An example of classification, already mentioned in the Supervised learning chapter, is the classification of emails as 'safe' or 'malicious'.

Clustering

In a clustering problem, the inputs are grouped into clusters that are not defined in advance. Many clustering algorithms require the number of clusters to be specified before training.
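A minimal k-means sketch shows both points: the data carries no labels, and the number of clusters `k` has to be supplied up front. The 2-D points and the choice to seed the centroids with the first `k` points are assumptions made for the example.

```python
def k_means(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            assignment[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assignment, centroids

# Hypothetical unlabeled data forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.9, 1.1), (8.8, 9.1), (9.2, 8.9)]
assignment, centroids = k_means(points, k=2)
```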

Chapter “Android applications malware detection” covers the usage of clustering in Android malware detection.

Regression

In a regression problem, the inputs are used to predict an output from a continuous set rather than a discrete set, which is the case in classification and clustering. Regression is not discussed further in this paper.

Authentication with Keystroke Dynamics

Researchers [7] proposed applying a Probabilistic Neural Network (PNN) to identify imposters using keystroke dynamics. Keystroke dynamics consist of multiple behavioral biometrics that capture the typing style of a user.

Eight different behavioral biometrics were monitored during authentication attempts (username and password). These biometrics were: digraphs (two-letter combinations), trigraphs (three-letter combinations), total username time, total password time, total entry time, scan code, speed, and edit distance. The system was evaluated on a dataset containing authentication attempt keystrokes of 50 people.
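A few of these features can be sketched from a log of key presses. The event format `(key, press_time_ms)`, the interpretation of a digraph feature as the delay between two consecutive key presses, and the use of Levenshtein distance for the edit distance are assumptions; [7] may define them differently.

```python
def digraph_latencies(events):
    """Map each two-key sequence to the delays between its key presses.

    `events` is a list of (key, press_time_ms) tuples in typing order.
    """
    latencies = {}
    for (key_a, time_a), (key_b, time_b) in zip(events, events[1:]):
        latencies.setdefault(key_a + key_b, []).append(time_b - time_a)
    return latencies

def total_entry_time(events):
    """Time from the first key press to the last one."""
    return events[-1][1] - events[0][1] if events else 0

def edit_distance(a, b):
    """Levenshtein distance between the typed and the expected string."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical key-press log for the string "root".
events = [("r", 0), ("o", 120), ("o", 210), ("t", 390)]
features = {
    "digraphs": digraph_latencies(events),
    "total_time": total_entry_time(events),
    "edit_distance": edit_distance("roto", "root"),
}
```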

The researchers then asked 30 of them to attempt authentication as a different user. The data was fed into a trained PNN and tested. The PNN classified attempts as legitimate or imposter with an accuracy of 90%.
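For readers unfamiliar with PNNs, the following sketch shows the general structure of such a network: a pattern layer with one Gaussian kernel per training sample, a summation layer that averages the kernels per class, and a decision layer that picks the class with the largest response. The feature vectors and the smoothing parameter `sigma` are made up for illustration and do not reproduce the configuration used in [7].

```python
import math

def pnn_classify(training, sample, sigma=0.5):
    """Classify `sample` with a probabilistic neural network.

    `training` maps each class label to a list of feature vectors.
    """
    scores = {}
    for label, vectors in training.items():
        # Pattern layer: one Gaussian kernel per stored training vector.
        kernels = [
            math.exp(-sum((s - v) ** 2 for s, v in zip(sample, vec))
                     / (2 * sigma ** 2))
            for vec in vectors
        ]
        # Summation layer: average kernel response of the class.
        scores[label] = sum(kernels) / len(kernels)
    # Decision layer: pick the class with the largest estimated density.
    return max(scores, key=scores.get), scores

# Hypothetical timing features (in seconds) for two classes of attempts.
training = {
    "legitimate": [[0.12, 0.09, 0.40], [0.11, 0.10, 0.42], [0.13, 0.08, 0.39]],
    "imposter": [[0.25, 0.20, 0.80], [0.30, 0.18, 0.75]],
}
label, scores = pnn_classify(training, [0.12, 0.10, 0.41], sigma=0.1)
```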

Android applications malware detection

Researchers [6] presented two machine learning aided approaches for static analysis of mobile applications: one based on permissions and the other based on source code analysis using a bag of words representation model. They used both classification and clustering in their work. Classification was used to label software as safe or malicious based on existing labeled examples. Clustering was used to group similar examples of unlabeled data, which gave them more data for the classification model.

Permission analysis

All Android applications have an AndroidManifest.xml file that lists all permissions the application needs to access certain features on the phone. During installation, the user is notified about those permissions and has to allow or deny them. Malicious applications usually request Internet access, contact information, and access to other sensitive data. The researchers extracted the permissions as a list of permission names and built classification and clustering models using the permission names as features. The algorithms used for classification were support vector machine (SVM), Naive Bayes, decision trees, JRIP, random forest and logistic regression. The results, measured with three different measures (precision, recall and F1), are shown in Table 1. The researchers also applied three different clustering algorithms: Farthest First, Simple K-means and Expectation maximization (EM). Their success rates, measured by correctly and incorrectly clustered instances, are shown in Table 2.
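The feature extraction step might look roughly like the sketch below. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and the helper names, the sample manifest, and the three-entry vocabulary are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in manifest files.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_permissions(manifest_xml):
    """Return the permission names requested in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    return sorted(
        element.get(ANDROID_NS + "name")
        for element in root.iter("uses-permission")
        if element.get(ANDROID_NS + "name")
    )

def to_feature_vector(permissions, vocabulary):
    """Binary vector: 1 if the app requests the permission, else 0."""
    requested = set(permissions)
    return [1 if name in requested else 0 for name in vocabulary]

# Hypothetical manifest of a demo application.
manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <uses-permission android:name="android.permission.INTERNET" />
  <uses-permission android:name="android.permission.READ_CONTACTS" />
</manifest>"""

vocabulary = [
    "android.permission.CAMERA",
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
]
permissions = extract_permissions(manifest)
vector = to_feature_vector(permissions, vocabulary)
```

Each application then becomes one row of 0/1 values that a classifier or clustering algorithm can consume.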

Table 1 [6]:

Algorithm	Precision	Recall	F1
Decision trees	0.827	0.827	0.827
Random forest	0.871	0.866	0.865
Naive Bayes	0.747	0.747	0.747
SVM	0.879	0.879	0.879
JRIP	0.821	0.819	0.819
Logistic regression	0.823	0.822	0.821
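For reference, the three measures can be computed from true and predicted labels as sketched below. The label lists are invented, and treating 'malicious' as the positive class is an assumption; how [6] averaged the measures across classes is not stated here.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="malicious"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ground truth and classifier output for eight applications.
truth = ["malicious", "malicious", "malicious", "safe",
         "safe", "safe", "safe", "malicious"]
predicted = ["malicious", "malicious", "safe", "safe",
             "malicious", "safe", "safe", "malicious"]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```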

Table 2 [6]:

Algorithm	Correctly clustered instances	Incorrectly clustered instances
Simple K-means	229 (59.17%)	158 (40.83%)
Farthest First	199 (51.42%)	188 (48.58%)
EM	250 (64.60%)	137 (35.40%)

Source code analysis

In this approach, the Android application package (APK) file containing the app is unzipped to obtain the Dalvik Executable (dex) file. The dex file is then transformed into a Java archive (jar) using the dex2jar tool. Finally, the .class files from the jar are decompiled into .java files using the Procyon decompiler. This workflow is shown in Image 1.
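The first step of this workflow needs nothing beyond a zip library, since an APK is a zip archive. The sketch below covers only that step; the function name and paths are hypothetical, and the dex2jar and Procyon steps are left out because their invocation depends on the installed tool versions.

```python
import zipfile
from pathlib import Path

def extract_dex_files(apk_path, output_dir):
    """Unzip every .dex entry of an APK into `output_dir`."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            if name.endswith(".dex"):
                # Write the entry to disk and remember where it went.
                apk.extract(name, output_dir)
                extracted.append(output_dir / name)
    return extracted
```

A call such as `extract_dex_files("app.apk", "out")` would return the paths of the extracted dex files, which can then be passed to the conversion and decompilation tools.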

Image 1 [6]:

The researchers' idea was to extract services, methods and API calls and discover their potentially malicious usage patterns. They processed the whole code using a technique called bag of words. In this technique, code is represented as a set of words, disregarding grammar and word order. The whole source code was tokenized into unigrams, which form the bag of words that is then fed into classification or clustering algorithms. The classification algorithms used were decision trees, Naive Bayes, support vector machines (SVM) with sequential minimal optimization (SMO), random forests, JRIP and logistic regression. Their results, measured with precision, recall and F1, are shown in Table 3. The researchers also applied three different clustering algorithms: Farthest First, Simple K-means and Expectation maximization (EM). Their success rates, measured by correctly and incorrectly clustered instances, are shown in Table 4.
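A bag of words over source code can be sketched as follows. The tokenization rule (identifier-like runs of characters), the Java fragment, and the small vocabulary are assumptions made for the example; the exact tokenizer used in [6] is not described here.

```python
import re
from collections import Counter

def bag_of_words(source_code):
    """Tokenize code into identifier-like unigrams and count them."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_code)
    return Counter(tokens)

def vectorize(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over `vocabulary`."""
    return [bag[token] for token in vocabulary]

# Hypothetical decompiled Java fragment; word order is discarded.
java_source = """
SmsManager sms = SmsManager.getDefault();
sms.sendTextMessage(number, null, text, null, null);
"""
bag = bag_of_words(java_source)
vector = vectorize(bag, ["SmsManager", "sendTextMessage", "getDeviceId"])
```

Only the counts survive, so two files that call the same APIs in a different order produce the same vector.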

Table 3 [6]:

Algorithm	Precision	Recall	F1
Decision trees	0.886	0.886	0.886
Random forest	0.937	0.935	0.935
Naive Bayes	0.825	0.821	0.820
SVM with SMO	0.952	0.951	0.951
JRIP	0.916	0.916	0.916
Logistic regression	0.935	0.935	0.935

Table 4 [6]:

Algorithm	Correctly clustered instances	Incorrectly clustered instances
Simple K-means	303 (82.3%)	65 (17.66%)
Farthest First	296 (80.44%)	72 (19.56%)
EM	300 (81.53%)	68 (18.47%)

Phishing Detection

Phishing is the term used for a fraudulent attempt at obtaining sensitive data, such as passwords and credit card details. The attacker, posing as a trustworthy entity, contacts the target by email, telephone or text message and lures them into providing sensitive data.

Researchers [9] compared 6 machine learning classifiers used for classifying phishing emails: Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNets).

The data used for training consisted of phishing emails and legitimate emails. Emails were parsed using text indexing techniques: all attachments were removed, bodies and specific elements were extracted, a stemming algorithm was applied, and all irrelevant words were removed. Finally, all items were sorted according to their frequency in the emails.
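The preprocessing steps can be approximated as in the sketch below. The stop-word list, the sample emails, and in particular the naive suffix-stripping `stem` function are stand-ins chosen for brevity; [9] does not specify them here, and a real stemming algorithm would behave differently.

```python
import re
from collections import Counter

# Hypothetical list of irrelevant words to drop.
STOP_WORDS = {"a", "an", "and", "the", "to", "your", "is", "of", "in", "for", "or"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemming algorithm

def stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(email_bodies):
    """Return (term, count) pairs sorted from most to least frequent."""
    counts = Counter()
    for body in email_bodies:
        for word in re.findall(r"[a-z]+", body.lower()):
            if word not in STOP_WORDS:
                counts[stem(word)] += 1
    return counts.most_common()

emails = [
    "Verify your account to avoid account suspension",
    "Your accounts need verifying",
]
ranked = term_frequencies(emails)
```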

Table 5 compares the classifiers by precision, recall and F1 measure.

Table 5 [9]:

Classifier	Precision	Recall	F1
LR	95.11 %	82.96 %	88.59 %
CART	92.32 %	87.07 %	89.59 %
SVM	92.08 %	82.74 %	87.07 %
NNet	94.15 %	78.28 %	85.45 %
BART	94.18 %	81.08 %	87.09 %
RF	91.71 %	88.88 %	90.24 %

The researchers concluded that Logistic Regression is the most preferable option due to its low false positive rate. It would be a bad experience for users to have their legitimate email misclassified as junk or spam.

Breaking Human Interaction Proofs (HIPs)

Researchers [5] proposed a machine learning approach for breaking Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) and Human Interaction Proofs (HIPs). The proposed approach is aimed at locating the characters (segmentation step) and employing a neural network for character recognition (recognition step).

Each experiment was therefore split into two parts:

segmentation
recognition

The segmentation part was relatively difficult for the following reasons:

it is computationally expensive
the segmentation function is complex
valid characters are difficult to identify

Their method for breaking HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation was simple for many HIPs which made the process of breaking the HIP particularly easy. Once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Their recognition engine is based on a neural network.

In the segmentation stage, different computer vision techniques were applied, such as converting to grayscale, thresholding to black and white, dilating and eroding, and selecting large connected components (CCs) with sizes close to HIP character sizes. An example of the segmentation process is shown in Image 2. The first image shows the original HIP, the second image shows the processed HIP, and the third image shows the HIP with segmented characters.
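Two of these steps, thresholding and connected component selection, can be sketched in plain Python. The tiny grayscale image, the cutoff of 128, the 4-connectivity, and the minimum component size are assumptions for illustration; dilation and erosion are omitted, and the custom algorithms in [5] were tailored to each HIP.

```python
from collections import deque

def threshold(gray, cutoff=128):
    """Binarize a grayscale image: dark pixels (ink) become 1."""
    return [[1 if pixel < cutoff else 0 for pixel in row] for row in gray]

def connected_components(binary):
    """Group foreground pixels into 4-connected components."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] and not seen[y][x]:
                # Breadth-first flood fill from an unvisited ink pixel.
                queue, pixels = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components

def character_candidates(gray, min_size=3):
    """Keep components large enough to be characters, as bounding boxes."""
    boxes = []
    for pixels in connected_components(threshold(gray)):
        if len(pixels) >= min_size:
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)

# Hypothetical 3x7 image: a vertical bar, one noise pixel, and a C-like shape.
image = [
    [255,  20, 255, 255, 255,  30,  30],
    [255,  20, 255,  10, 255,  30, 255],
    [255,  20, 255, 255, 255,  30,  30],
]
boxes = character_candidates(image)
```

Each bounding box `(x_min, y_min, x_max, y_max)` marks a region that would be cropped and passed to the recognition step, while the isolated noise pixel is discarded for being too small.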

Image 2 [5]:

The first 3 segmented images from the previous example, which are fed to the neural network, are shown in Image 3.

Image 3 [5]:

Six experiments were conducted with EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google HIPs. The segmentation success rates and recognition success rates of each experiment are shown in Table 6.

Table 6 [5]:

HIP	Segmentation success rate	Recognition success rate (after segmentation)	Total
Mailblocks	88.8 %	95.9 %	66.2 %
Register	95.4 %	87.1 %	47.8 %
Yahoo/EZ-Gimpy	56.2 %	90.3 %	34.4 %
Ticketmaster	16.6 %	82.3 %	4.9 %
Yahoo ver. 2	58.4 %	95.2 %	45.7 %
Google/Gmail	10.2 %	89.3 %	4.89 %

Researchers concluded that CAPTCHAs and HIPs that emphasize the segmentation problem are much stronger than the HIPs examined in their paper, which rely on recognition being difficult. A simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to.

Security of Machine Learning

There are multiple types of attacks aimed at exploiting machine learning systems:

Causative attacks - alter the training process through influence over the training data
Attacks on integrity - result in intrusion points being classified as normal (false negatives)
Attacks on availability - cause so many classification errors that the system becomes effectively unusable
Exploratory attacks - exploit existing vulnerabilities
Targeted attacks - are directed at a particular input
Indiscriminate attacks - cause all inputs to fail

The researchers [10] proposed defenses against exploratory and causative attacks. In an exploratory attack, the attacker can create an evaluation distribution that the learner predicts poorly. To defend against it, the defender can limit access to the training procedure and data, making it harder for the attacker to reverse engineer the system. In a causative attack, the attacker can manipulate both the training and evaluation distributions. To defend against it, the defender can employ the Reject On Negative Impact (RONI) defense, which ignores all training data points that have a substantial negative impact on classification accuracy. The RONI defense uses two classifiers: one is trained on the base training set, and the other is trained on the base set together with the potentially malicious data. If the errors of the two classifiers differ significantly, the data is labeled as malicious.
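The RONI comparison can be sketched with any classifier. The example below uses a nearest-centroid classifier as a stand-in, with made-up one-dimensional data and an assumed `tolerance` threshold for what counts as a significant increase in error; none of these choices come from [10].

```python
def train_centroids(data):
    """Nearest-centroid classifier: store the mean feature vector per label."""
    sums, counts = {}, {}
    for features, label in data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest to `features`."""
    return min(
        centroids,
        key=lambda label: sum((f - c) ** 2
                              for f, c in zip(features, centroids[label])),
    )

def error_rate(centroids, validation):
    wrong = sum(1 for features, label in validation
                if predict(centroids, features) != label)
    return wrong / len(validation)

def roni_accepts(base_set, candidate, validation, tolerance=0.05):
    """Reject the candidate if adding it raises validation error too much."""
    error_without = error_rate(train_centroids(base_set), validation)
    error_with = error_rate(train_centroids(base_set + [candidate]), validation)
    return error_with - error_without <= tolerance

# Hypothetical base training set, validation set and candidate points.
base_set = [([1.0], "safe"), ([2.0], "safe"),
            ([8.0], "malicious"), ([9.0], "malicious")]
validation = [([1.5], "safe"), ([3.0], "safe"), ([4.0], "safe"),
              ([7.0], "malicious"), ([8.0], "malicious")]
poisoned = ([2.0], "malicious")   # mislabeled point that shifts the boundary
benign = ([8.5], "malicious")     # consistent with the existing data
```

With these values, `roni_accepts(base_set, poisoned, validation)` rejects the mislabeled point, while the consistent point is accepted.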

Conclusion

Machine learning is a powerful and adaptive tool that has made it possible to tackle problems that previously required humans. It has also enabled the automation of threat recognition tasks. In this paper, multiple applications of machine learning in cybersecurity were shown. Most of the problems were solved using supervised learning and classification, since they required classifying input into safe or malicious categories. For classification tasks, researchers tested multiple classifiers, each with its own pros and cons, and chose the ones they considered best for the task at hand.