In recent years, machine learning has experienced tremendous progress due to an increase in raw computing power, progress in algorithms, and, most importantly, an increase in the volume of data available for training. The applications of machine learning continue to expand day by day, and one of them is cybersecurity. Machine learning seems like a good tool for tackling cybersecurity issues due to its ability to adapt to new and unknown circumstances. This paper covers such applications; specifically, it discusses authentication with keystroke dynamics, Android application malware detection, phishing detection, breaking Human Interaction Proofs, and the security of machine learning itself.
Keywords: machine learning; cybersecurity; internet; cyber threats
Problems in machine learning are usually categorized into three main categories based on the data used for training: supervised learning, unsupervised learning, and reinforcement learning.
Problems in machine learning can also be categorized by their output: classification, clustering, and regression.
Supervised learning problems consist of presenting an algorithm with input data, as well as the desired outputs, and allowing it to produce a model that maps each input to an output. An example of supervised learning in cybersecurity is classification for malicious email detection. The training set for this problem would consist of emails labeled as 'safe' or 'malicious', and the algorithm would produce a model able to classify new emails into the 'safe' and 'malicious' categories.
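As a minimal sketch of this idea (not the setup used in any of the cited works), a supervised classifier can be trained on labeled email texts; the emails, labels, and choice of a bag-of-words model with Naive Bayes below are illustrative assumptions.

```python
# Minimal sketch of supervised learning for email classification.
# The emails and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Your invoice for last month is attached",         # safe
    "Click here to verify your account password now",  # malicious
    "Meeting moved to 3pm, see updated agenda",         # safe
    "You won a prize, send your credit card details",   # malicious
]
labels = ["safe", "malicious", "safe", "malicious"]

# Bag-of-words features + Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Classify a new, unseen email
print(model.predict(["Please verify your password to claim the prize"]))
```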
Chapters “Authentication with Keystroke Dynamics”, “Android applications malware detection”, “Phishing Detection” and “Breaking Human Interaction Proofs (HIPs)” cover topics in which supervised learning was used.
Unsupervised learning deals with unlabelled data. The most common application of unsupervised learning is finding groups within a given dataset. An example of unsupervised learning in cybersecurity is the BotMiner system [8] which detects real-world botnets by clustering network flows along with behaviors and actions of machines on the internet.
Chapter “Android applications malware detection” covers the usage of unsupervised learning in Android malware detection.
Reinforcement learning is a reward-based learning system. The goal of this system is to create an intelligent agent that acts in an environment of states with a set of available actions. The agent chooses an action based on previous experience and on whether that action led to a reward or a punishment. An example of a reinforcement learning application in cybersecurity can be found in [4].
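As a generic illustration of this reward-based setup (not the approach used in [4]), the tabular Q-learning sketch below shows how an agent adjusts its action values from rewards; the states, actions, rewards, and hyperparameters are hypothetical.

```python
# Minimal tabular Q-learning sketch with hypothetical states, actions, and rewards.
import random
from collections import defaultdict

states = ["normal", "suspicious"]
actions = ["allow", "block"]

# Hypothetical reward: blocking suspicious traffic and allowing normal traffic is rewarded.
def reward(state, action):
    return 1.0 if (state, action) in {("suspicious", "block"), ("normal", "allow")} else -1.0

Q = defaultdict(float)              # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for _ in range(1000):
    s = random.choice(states)
    # Epsilon-greedy action selection
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q[(s, act)])
    r = reward(s, a)
    s_next = random.choice(states)  # toy transition: next state is random
    # Q-learning update rule
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, act)] for act in actions) - Q[(s, a)])

print({k: round(v, 2) for k, v in Q.items()})
```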
In a classification problem, the inputs are mapped to user-specified outputs. An example of classification, already mentioned in the chapter on supervised learning, is the classification of emails as 'safe' or 'malicious'.
Chapters “Authentication with Keystroke Dynamics”, “Android applications malware detection”, “Phishing Detection” and “Breaking Human Interaction Proofs (HIPs)” cover topics in which classification was used.
In a clustering problem, the inputs are grouped into clusters that are not previously defined. Most clustering algorithms require the number of clusters to be specified prior to training.
Chapter “Android applications malware detection” covers the usage of clustering in Android malware detection.
In a regression problem, the inputs are used to predict an output from a continuous set rather than a discrete one, as is the case in classification and clustering. Regression is not discussed further in this paper.
Researchers [7] proposed applying a Probabilistic Neural Network (PNN) for identifying impostors using keystroke dynamics. Keystroke dynamics consist of multiple behavioral biometrics that capture the typing style of a user.
Eight different behavioral biometrics were monitored during authentication attempts (username and password). These biometrics were: digraphs (two-letter combinations), trigraphs (three-letter combinations), total username time, total password time, total entry time, scan code, speed, and edit distance. The system was evaluated on a dataset containing authentication attempt keystrokes of 50 people.
The researchers then asked 30 of them to attempt authentication as a different user. The data was fed into the trained PNN for testing, and the accuracy of classifying attempts as legitimate or impostor was 90%.
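To make the feature types concrete, the sketch below derives a few of the listed biometrics (digraph times, total entry time, speed) from hypothetical key press/release timestamps; the exact feature definitions used in [7] may differ.

```python
# Sketch of extracting keystroke-dynamics features from (key, press_time, release_time)
# events; the timestamps below are hypothetical and given in milliseconds.
events = [
    ("p", 0,   95),
    ("a", 180, 260),
    ("s", 340, 430),
    ("s", 520, 600),
]

# Digraph latency: time between consecutive key presses (two-letter combinations).
digraphs = [
    (events[i][0] + events[i + 1][0], events[i + 1][1] - events[i][1])
    for i in range(len(events) - 1)
]

total_entry_time = events[-1][2] - events[0][1]      # first press to last release
speed = len(events) / (total_entry_time / 1000.0)    # keystrokes per second

print(digraphs)                                      # e.g. [('pa', 180), ('as', 160), ('ss', 180)]
print(total_entry_time, round(speed, 2))
```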
Researchers [6] presented two machine-learning-aided approaches to static analysis of mobile applications: one based on permissions and the other based on source code analysis using a bag-of-words representation model. The researchers used both classification and clustering in their work: classification to label applications as safe or malicious using existing labeled examples, and clustering to group similar unlabeled examples and thereby acquire more data for the classification model.
All Android applications have an AndroidManifest.xml file that lists all the permissions the application needs in order to access certain features of the phone. During installation, the user is notified about those permissions and has to allow or deny them. Malicious applications usually request Internet access, contact information, and access to other sensitive data. The researchers extracted the permissions as a list of permission names and built classification and clustering models using those names as features. The algorithms used for classification were: support vector machine (SVM), Naive Bayes, decision trees, JRIP, random forest, and logistic regression. The results, measured with three different measures (precision, recall, and F1), are shown in Table 1. The researchers also applied three different clustering algorithms: Farthest First, Simple K-means, and Expectation Maximization (EM). Their success rates, measured by correctly and incorrectly clustered instances, are shown in Table 2.
Table 1 [6]:
| Algorithm | Precision | Recall | F1 |
|---|---|---|---|
| Decision trees | 0.827 | 0.827 | 0.827 |
| Random forest | 0.871 | 0.866 | 0.865 |
| Naive Bayes | 0.747 | 0.747 | 0.747 |
| SVM | 0.879 | 0.879 | 0.879 |
| JRIP | 0.821 | 0.819 | 0.819 |
| Logistic regression | 0.823 | 0.822 | 0.821 |
Table 2 [6]:
| Algorithm | Correctly clustered instances | Incorrectly clustered instances |
|---|---|---|
| Simple K-means | 229 (59.17%) | 158 (40.83%) |
| Farthest First | 199 (51.42%) | 188 (48.58%) |
| EM | 250 (64.6%) | 137 (35.4%) |
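A minimal sketch of the permission-based approach described above, assuming permission names have already been extracted from AndroidManifest.xml; the permission lists, labels, and choice of algorithms (a linear SVM and K-means) are illustrative placeholders, not the exact setup of [6].

```python
# Sketch: classify apps as safe/malicious using permission names as features,
# then cluster the same feature vectors. Permission lists and labels are hypothetical.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

apps = [
    "INTERNET READ_CONTACTS SEND_SMS",       # hypothetical malicious app
    "INTERNET ACCESS_FINE_LOCATION",         # hypothetical safe app
    "READ_SMS SEND_SMS READ_CONTACTS",       # hypothetical malicious app
    "CAMERA WRITE_EXTERNAL_STORAGE",         # hypothetical safe app
]
labels = ["malicious", "safe", "malicious", "safe"]

# Each permission name becomes a count feature, much like a bag of words.
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(apps, labels)
print(clf.predict(["INTERNET SEND_SMS READ_CONTACTS"]))

# Clustering the same feature vectors into two groups without using the labels.
X = CountVectorizer().fit_transform(apps)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```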
In this approach, the application package file (APK) containing the Android app is unzipped to obtain the Dalvik Executable (dex) file. The dex file is then transformed into a Java archive (jar) using the dex2jar tool, and the .class files from the jar are decompiled into .java files using the Procyon decompiler. This workflow is shown in Image 1.
Image 1 [6]:
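A rough sketch of that workflow driven from Python is shown below. The file paths are hypothetical, and the exact command-line flags of dex2jar and Procyon are assumptions that may differ between tool versions.

```python
# Sketch of the APK -> dex/jar -> .java workflow described above.
# Paths, tool locations, and CLI flags are assumptions; check your tool versions.
import subprocess
import zipfile

apk_path = "app.apk"          # hypothetical input APK
jar_path = "app.jar"
out_dir = "decompiled_src"

# 1. An APK is a zip archive; the .dex files can be listed or extracted directly.
with zipfile.ZipFile(apk_path) as apk:
    dex_files = [n for n in apk.namelist() if n.endswith(".dex")]
    print("dex files found:", dex_files)

# 2. Convert the APK/dex to a jar with dex2jar (assumed CLI: d2j-dex2jar.sh -o <out> <apk>).
subprocess.run(["d2j-dex2jar.sh", "-o", jar_path, apk_path], check=True)

# 3. Decompile the .class files to .java with Procyon (assumed CLI: -o <output dir>).
subprocess.run(["java", "-jar", "procyon-decompiler.jar", "-o", out_dir, jar_path], check=True)
```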
The researchers' idea was to extract services, methods, and API calls and to discover potentially malicious usage patterns in them. They processed the whole code using a technique called bag of words, in which code is represented as a set of words, disregarding grammar and word order. The whole source code was tokenized into unigrams that form the bag of words, which is then fed into the classification or clustering algorithms. The classification algorithms used were: decision trees, Naive Bayes, support vector machine (SVM) with sequential minimal optimization (SMO), random forest, JRIP, and logistic regression. Their results, measured with precision, recall, and F1, are shown in Table 3. The researchers also applied three different clustering algorithms: Farthest First, Simple K-means, and Expectation Maximization (EM). Their success rates, measured by correctly and incorrectly clustered instances, are shown in Table 4.
Table 3 [6]:
| Algorithm | Precision | Recall | F1 |
|---|---|---|---|
| Decision trees | 0.886 | 0.886 | 0.886 |
| Random forest | 0.937 | 0.935 | 0.935 |
| Naive Bayes | 0.825 | 0.821 | 0.820 |
| SVM with SMO | 0.952 | 0.951 | 0.951 |
| JRIP | 0.916 | 0.916 | 0.916 |
| Logistic regression | 0.935 | 0.935 | 0.935 |
Table 4 [6]:
| Algorithm | Correctly clustered instances | Incorrectly clustered instances |
|---|---|---|
| Simple K-means | 303 (82.3%) | 65 (17.66%) |
| Farthest First | 296 (80.44%) | 72 (19.56%) |
| EM | 300 (81.53%) | 68 (18.47%) |
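A minimal sketch of the bag-of-words source-code approach, assuming the decompiled .java files have already been read into strings; the code snippets, labels, and the linear SVM (standing in for the SVM-with-SMO setup of Table 3) are illustrative assumptions.

```python
# Sketch: unigram bag of words over decompiled source code, fed to a classifier.
# The source snippets and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sources = [
    "SmsManager getDefault sendTextMessage getDeviceId",    # hypothetical malicious pattern
    "setContentView findViewById setOnClickListener",       # hypothetical benign UI code
    "getDeviceId openConnection sendTextMessage",            # hypothetical malicious pattern
    "onCreate getSupportFragmentManager beginTransaction",   # hypothetical benign code
]
labels = ["malicious", "safe", "malicious", "safe"]

# Unigram bag of words (word order and grammar are ignored) + linear SVM.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(sources, labels)
print(model.predict(["sendTextMessage getDeviceId openConnection"]))
```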
Phishing is the term used for a fraudulent attempt at obtaining sensitive data, such as passwords and credit card details. The attacker, posing as a trustworthy entity, contacts the target by email, telephone or text message and lures them into providing sensitive data.
Researchers [9] compared 6 machine learning classifiers used for classifying phishing emails: Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF), and Neural Networks (NNets).
The data used for training consisted of phishing emails and legitimate emails. The emails were parsed using text indexing techniques: all attachments were removed, the message bodies and specific elements were extracted, a stemming algorithm was applied, and irrelevant words were removed. Finally, all items were sorted according to their frequency in the emails.
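A rough sketch of that preprocessing is shown below, using NLTK's Porter stemmer and a hand-written stop-word list as stand-ins for the tools actually used in [9]; the sample email body is hypothetical.

```python
# Sketch of the described preprocessing: tokenize the body text, stem words,
# drop irrelevant (stop) words, and sort the remaining terms by frequency.
# The email body and stop-word list are simplified placeholders.
import re
from collections import Counter

from nltk.stem import PorterStemmer  # assumes nltk is installed

body = "Please verify your account. Verification is required to keep your account active."
stop_words = {"is", "to", "your", "the", "a", "please"}

stemmer = PorterStemmer()
tokens = re.findall(r"[a-z]+", body.lower())                  # crude tokenization
terms = [stemmer.stem(t) for t in tokens if t not in stop_words]

# Sort terms by their frequency, as described above.
print(Counter(terms).most_common())
```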
The comparison of the classifiers, based on precision, recall, and the F1 measure, is shown in Table 5.
Table 5 [9]:
| Classifier | Precision | Recall | F1 |
|---|---|---|---|
| LR | 95.11 % | 82.96 % | 88.59 % |
| CART | 92.32 % | 87.07 % | 89.59 % |
| SVM | 92.08 % | 82.74 % | 87.07 % |
| NNet | 94.15 % | 78.28 % | 85.45 % |
| BART | 94.18 % | 81.08 % | 87.09 % |
| RF | 91.71 % | 88.88 % | 90.24 % |
The researchers concluded that Logistic Regression is the most preferable option due to its low false positive rate: it would be a bad experience for users to have legitimate email misclassified as junk or spam.
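The kind of comparison reported in Table 5 can be reproduced in outline as below: several classifiers evaluated with precision, recall, and F1 under cross-validation. The dataset is synthetic and the classifier parameters are defaults, so this is a sketch of the methodology rather than the experiment in [9]; BART is omitted because it has no standard scikit-learn implementation.

```python
# Sketch: compare several classifiers with precision/recall/F1 via 5-fold cross-validation.
# The synthetic dataset stands in for the parsed email features of [9].
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(),
    "SVM": SVC(),
    "NNet": MLPClassifier(max_iter=1000),
    "RF": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=5, scoring=["precision", "recall", "f1"])
    print(name,
          round(scores["test_precision"].mean(), 3),
          round(scores["test_recall"].mean(), 3),
          round(scores["test_f1"].mean(), 3))
```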
Researchers [5] proposed a machine learning approach for breaking Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) and Human Interaction Proofs (HIPs). The proposed approach is aimed at locating the characters (segmentation step) and employing a neural network for character recognition (recognition step).
Each experiment was therefore split into two parts: a segmentation part and a recognition part.
The segmentation part was expected to be the more difficult of the two, since the characters have to be located despite distortions and clutter.
Their method for breaking HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation was simple for many HIPs which made the process of breaking the HIP particularly easy. Once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Their recognition engine is based on a neural network.
In the segmentation stage, different computer vision techniques were applied: converting to grayscale, thresholding to black and white, dilating and eroding, and selecting large connected components (CCs) with sizes close to the HIP character sizes. An example of the segmentation process is shown in Image 2: the first image shows the original HIP, the second shows the processed HIP, and the third shows the HIP with segmented characters.
Image 2 [5]:
The first 3 segmented images from the previous example, which are fed to the neural network, are shown in Image 3.
Image 3 [5]:
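A rough sketch of the segmentation stage using OpenCV is given below, assuming a HIP image file is available locally; the filename, threshold, kernel size, and character-size bounds are illustrative guesses rather than the values used in [5].

```python
# Sketch of the segmentation stage: grayscale, threshold, dilate/erode,
# then keep connected components whose size is close to an expected character size.
# The filename, threshold, kernel size, and size bounds are illustrative assumptions.
import cv2
import numpy as np

img = cv2.imread("hip_sample.png")                        # hypothetical HIP image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)              # convert to grayscale
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

kernel = np.ones((3, 3), np.uint8)
bw = cv2.dilate(bw, kernel, iterations=1)                 # dilate, then erode
bw = cv2.erode(bw, kernel, iterations=1)

# Connected components; keep those with an area close to an expected character size.
num, labels, stats, _ = cv2.connectedComponentsWithStats(bw)
chars = []
for i in range(1, num):                                   # label 0 is the background
    x, y, w, h, area = stats[i]
    if 100 < area < 2000:                                 # assumed character-size bounds
        chars.append(bw[y:y + h, x:x + w])                # crop for the recognition step

print(f"{len(chars)} candidate character regions found")
```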
Six experiments were conducted with EZ-Gimpy/Yahoo, Yahoo v2, Mailblocks, Register, Ticketmaster, and Google HIPs. The segmentation success rates and recognition success rates of each experiment are shown in Table 6.
Table 6 [5]:
| HIP | Segmentation success rate | Recognition success rate (after segmentation) | Total success rate |
|---|---|---|---|
| Mailblocks | 88.8 % | 95.9 % | 66.2 % |
| Register | 95.4 % | 87.1 % | 47.8 % |
| Yahoo/EZ-Gimpy | 56.2 % | 90.3 % | 34.4 % |
| Ticketmaster | 16.6 % | 82.3 % | 4.9 % |
| Yahoo ver. 2 | 58.4 % | 95.2 % | 45.7 % |
| Google/Gmail | 10.2 % | 89.3 % | 4.89 % |
Researchers concluded that CAPTCHAs and HIPs that emphasize the segmentation problem are much stronger than the HIPs examined in their paper, which rely on recognition being difficult. A simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to.
There are multiple types of attacks aimed at exploiting machine learning systems; two important ones are exploratory attacks and causative attacks.
The researchers [10] proposed defenses against exploratory and causative attacks. To defend against exploratory attacks, in which an attacker can craft an evaluation distribution that the learner predicts poorly, the defender can limit access to the training procedure and data, making it harder for the attacker to apply reverse engineering. To defend against causative attacks, in which an attacker can manipulate both the training and the evaluation distribution, the defender can employ the Reject On Negative Impact (RONI) defense. RONI ignores all training data points that have a substantial negative impact on classification accuracy. It uses two classifiers: one trained on the base training set and the other trained on the base set plus the potentially malicious data. If the errors of the two classifiers differ significantly, the added data is labeled as malicious.
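A simplified sketch of the RONI idea follows, assuming a held-out set is used to measure the impact of a candidate training point; the dataset, classifier, and threshold are placeholders, and the actual RONI procedure in [10] is more involved.

```python
# Simplified Reject On Negative Impact (RONI) sketch: compare a classifier trained on
# the base set with one trained on the base set plus a candidate point, and reject the
# candidate if accuracy on a held-out set drops by more than a threshold.
# The dataset, classifier, and threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_base, y_base = X[:200], y[:200]           # base training set
X_hold, y_hold = X[200:], y[200:]           # held-out set for measuring impact

def holdout_accuracy(X_train, y_train):
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_hold, y_hold)

base_acc = holdout_accuracy(X_base, y_base)

# A hypothetical poisoned point: plausible features with a flipped label.
x_new, y_new = X_hold[0], 1 - y_hold[0]
new_acc = holdout_accuracy(np.vstack([X_base, x_new]), np.append(y_base, y_new))

threshold = 0.01                            # assumed tolerance for an accuracy drop
if base_acc - new_acc > threshold:
    print("Rejected: candidate point has a negative impact on accuracy")
else:
    print("Accepted: candidate point kept in the training set")
```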
Machine learning is a powerful and adaptive tool that has enabled tackling problems that previously required humans, and it has enabled the automation of threat recognition tasks. In this paper, multiple applications of machine learning in cybersecurity were shown. Most of the problems were solved using supervised learning and classification, since they required classifying input into safe or malicious categories. For the classification tasks, researchers tested multiple classifiers, each with its own pros and cons, and chose the ones they considered best for the task at hand.