Malicious PDF Detection using Machine Learning

Abstract

The complexity and structure of modern documents make it possible to hide malicious code or disguise it as data. For that reason, so-called trojan documents, which often appear legitimate and useful, are frequently used as a vehicle for distributing malicious code. The goal is to exploit vulnerabilities in the client application in order to execute arbitrary code. The PDF file format, one of the most widely used file formats, has become popular due to its ease of use and broad set of functionalities. This seminar covers a method for static analysis of PDF documents that employs machine learning algorithms to discriminate between benign and malicious PDF documents, described in [1]. Besides benign/malicious classification, the same method is used to discriminate between malicious documents designed for large-scale phishing attacks and those designed for targeted attacks.

Introduction

Trojan documents exploit vulnerabilities in the growing number of document viewer applications and are often combined with social engineering to convince victims of the documents' legitimacy, for example by packaging malware in fake bank statements or company reports. There are generally two types of attacks. The first is large-scale phishing attacks, whose goal is to spy on and collect data from a large number of random victims. The second is targeted (or victim-specific) attacks, which use knowledge about a specific person or entity.

PDF is one of the most popular file formats for performing these types of attacks. Attackers use many different methods and strategies to carry them out with PDF documents. To list a few:

using the document purely to exploit a vulnerability in the client reader application
using the document to transfer the complete malware code to the victim's machine
using the document to carry code that downloads the remaining components of the malware

There are generally two main approaches to detecting malicious PDF documents. The first is static analysis of the document, which employs signature analysis or pattern matching. The second is dynamic analysis, which observes the behavior of the decoded PDF document.

The method in this seminar takes a static analysis approach. Using regex matching on a set of documents, features are extracted from document metadata and structural elements, without ever decoding the PDF document. These features are then used by a machine learning algorithm called Random forests to distinguish between benign and malicious documents.
The fundamental assumption is that any two benign documents have similar features, and that the same holds for any two malicious documents, whereas a benign document and a malicious document do not have similar features.
The main benefit of the machine learning approach is the ability to generalize to new types of malware. The method is vulnerability and exploit agnostic, requiring no prior knowledge of any malware families.

Feature extraction and feature selection

Two different datasets are used, one for the training phase and one for the testing phase. The distribution is shown in the table below. Detailed descriptions of the two datasets can be found in the original paper [1].

For each document, simple string matching was used to extract features from metadata and structural elements, giving 202 features in total per document. A few of those features are:

Count of font objects (“/Font” markers)
Average length of stream objects (difference between “stream” and “endstream” markers)
Dimensions of JavaScript objects (“/JavaScript” markers)
Dimensions of JS objects (“/JS” markers)
Dimensions of box and image objects
Number of lower case letters in the title
Sum of pixels in all images
…
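
The string-matching extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature names, the `extract_features` function, and the choice of only a handful of markers are assumptions made for the example, and the stream length is only approximated from the raw bytes between the `stream` and `endstream` keywords.

```python
import re

# Hypothetical subset of markers; the method in [1] uses 202 features in total.
MARKERS = {
    "count_font": rb"/Font\b",
    "count_javascript": rb"/JavaScript\b",
    "count_js": rb"/JS\b",
}

# Bytes between a "stream" keyword (not part of "endstream") and the next "endstream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def extract_features(raw: bytes, title: str = "") -> dict:
    """Return a small numeric feature vector computed from raw, undecoded PDF bytes."""
    # Count how often each structural marker occurs in the file.
    features = {
        name: len(re.findall(pattern, raw)) for name, pattern in MARKERS.items()
    }
    # Approximate stream sizes without decompressing or decoding anything.
    stream_lengths = [len(m.group(1)) for m in STREAM_RE.finditer(raw)]
    features["avg_stream_len"] = (
        sum(stream_lengths) / len(stream_lengths) if stream_lengths else 0.0
    )
    # A metadata feature: number of lower case letters in the title.
    features["title_lowercase"] = sum(ch.islower() for ch in title)
    return features
```

The title is passed in separately here to keep the sketch short; locating it in the document's metadata would need additional matching.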

This kind of feature extraction also works well on encrypted documents because structure and metadata are not encrypted. Most of the features are numeric; those that are not are transformed into numeric values.

Features are designed to eliminate reliance on specific strings or byte sequences. A metadata field such as the author is therefore described by properties like the number of characters in the field rather than matched against a specific name. Likewise, no feature is tied to a specific vulnerability or malware family.

Classification using Random forests algorithm

The Random forests algorithm is an ensemble classification algorithm. Its result is based on the output of many decision trees, each trained using a random subset of the feature set. The final classification is determined by a vote among the trees. More about Random forests can be found in [2].
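
The voting step can be illustrated with a short sketch. The three "trees" below are hand-written placeholders with made-up thresholds, not trees learned from data, and the feature names are hypothetical; the point is only to show how individual tree outputs are combined by majority vote.

```python
from collections import Counter
from typing import Callable, Sequence

# A "tree" is modelled here as any function mapping a feature dict to a label.
Tree = Callable[[dict], str]


def forest_predict(trees: Sequence[Tree], features: dict) -> str:
    """Return the label that receives the most votes from the trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained trees, each using a different feature.
toy_trees = [
    lambda f: "mal" if f["count_js"] > 0 else "ben",
    lambda f: "mal" if f["count_font"] == 0 else "ben",
    lambda f: "mal" if f["avg_stream_len"] > 10_000 else "ben",
]
```

In a real forest each tree would be trained on a random sample of the data and a random subset of the features, which is what makes the votes diverse.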

This method involves training two different Random forests classifiers. The first classifier determines whether a PDF document is benign or malicious. The second classifier is used only if the document is malicious, and it determines whether the PDF document was designed for a large-scale attack or a targeted attack. The image below illustrates this setup. A benign document is denoted with “ben”, a malicious document with “mal”, a large-scale attack with “opp”, and a targeted attack with “tar”.
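
The two-stage setup can be sketched as a small function. The classifiers are passed in as plain callables, and the names `classify_document`, `is_malicious` and `attack_type` are assumptions made for this illustration.

```python
from typing import Callable

# A classifier is modelled as a function from a feature dict to a label.
Classifier = Callable[[dict], str]


def classify_document(
    features: dict, is_malicious: Classifier, attack_type: Classifier
) -> str:
    """Return "ben", "opp" or "tar" using the two-stage setup."""
    # Stage 1: benign ("ben") versus malicious ("mal").
    if is_malicious(features) == "ben":
        return "ben"
    # Stage 2 runs only for malicious documents: "opp" versus "tar".
    return attack_type(features)
```

Because the second classifier never sees documents that the first one labels as benign, it can be trained on malicious documents only.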

To get the best results from the Random forests algorithm, the authors search the hyperparameter space [4] for the set of hyperparameters that maximizes accuracy on the test set.
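
One simple way to search a hyperparameter space is an exhaustive grid search, sketched below. The source does not say which search strategy or which hyperparameters were used, so the grid search itself, the parameter names and the `score` callback are all assumptions; `score` stands in for "train a forest with these settings and return its accuracy".

```python
from itertools import product
from typing import Callable, Mapping, Sequence, Tuple


def grid_search(
    grid: Mapping[str, Sequence], score: Callable[[dict], float]
) -> Tuple[dict, float]:
    """Try every combination in the grid and return the best one with its score."""
    names = list(grid)
    best_params, best_score = {}, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)  # e.g. accuracy of a forest trained with params
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical grid; these names and values are illustrative only.
example_grid = {"n_trees": [100, 500, 1000], "features_per_split": [10, 20, 40]}
```

The cost grows with the product of the list lengths, so the grid above requires nine training runs.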

Results

Each classifier is evaluated on the test set. The results are shown in the following tables using standard classification metrics [5].

The table below shows the performance of the benign/malicious document classifier. Recall for malicious documents is very high, at the price of a very small false positive rate (0.24%).

The table below shows the performance of the large-scale/targeted attack classifier. Recall for documents designed for a targeted attack is also high, with a false positive rate of 1%, which is a very good result.
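
For reference, the metrics mentioned above follow directly from the four confusion-matrix counts. The sketch below shows the standard definitions [5]; the function name is an assumption, and any counts passed to it in an example are hypothetical rather than results from the paper.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard metrics from confusion-matrix counts.

    tp/fn: positive documents classified correctly/incorrectly,
    tn/fp: negative documents classified correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    return {
        # Share of documents flagged as positive that really are positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of positive documents that were detected.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of negative documents that were wrongly flagged.
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

For the first classifier the positive class is "malicious"; for the second it is "targeted".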

After training, the Random forests model can be exported and packaged as desktop software or as an online service with an associated API. Predicting the class of a new document takes 1 s.

It is important for any detection mechanism to be able to detect malicious documents even when the attacker attempts to evade detection. This method shows strong resilience to adversarial attacks. To make it even stronger, the authors of the paper perturb the training data. Essentially, they add noise to the data, which increases the variance of the features and makes it harder for the Random forests algorithm to strongly prefer some features over others.
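
The idea of adding noise to the training data can be sketched as follows. The exact perturbation scheme is not described here, so the uniform relative noise, the `rel_noise` parameter and the function name are assumptions chosen only to make the idea concrete.

```python
import random
from typing import List, Sequence


def perturb_rows(
    rows: Sequence[Sequence[float]], rel_noise: float = 0.1, seed: int = 0
) -> List[List[float]]:
    """Return a noisy copy of the training rows.

    Each value is scaled by a random factor in [1 - rel_noise, 1 + rel_noise].
    The input rows are left unchanged.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return [
        [x * (1.0 + rng.uniform(-rel_noise, rel_noise)) for x in row]
        for row in rows
    ]
```

A classifier trained on the perturbed rows (possibly together with the original ones) sees a wider spread of values for every feature.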

Conclusion

This seminar explores an approach to malicious PDF detection using machine learning algorithms. Features are extracted from metadata and structural elements using static analysis (string matching). The result of this approach is a pair of classifiers. The first classifier is used for the initial discovery of malicious PDF documents. Once a malicious PDF document is detected, the second classifier predicts which type of attack the document was designed for, large-scale or targeted. This method achieves high accuracy and shows robustness to detection evasion.

Sources

[1] Smutz, Charles, and Angelos Stavrou. "Malicious PDF detection using metadata and structural features." Proceedings of the 28th annual computer security applications conference. ACM, 2012.

[2] Wikipedia contributors. "Random forest." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 8 Nov. 2019. Web. 3 Jan. 2020.

[3] Wikipedia contributors. "PDF." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 1 Jan. 2020. Web. 3 Jan. 2020.

[4] Wikipedia contributors. "Hyperparameter optimization." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 7 Nov. 2019. Web. 3 Jan. 2020.

[5] Wikipedia contributors. "Precision and recall." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 3 Dec. 2019. Web. 3 Jan. 2020.
