Malicious Software Prediction

Traditional antivirus systems rely on matching the MD5 hashes of malicious binaries that have already been identified and manually added to the AV’s registry. As the threat landscape evolves, polymorphic samples can evade hash-based detection entirely, so this approach leaves systems increasingly vulnerable. This raises the need for dynamic analysis of suspicious code and for heuristics that identify malicious behavior at runtime.
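To make the weakness of hash-based detection concrete, here is a minimal sketch of a registry lookup. The registry contents, the payload bytes, and the function name `is_known_malware` are all hypothetical and chosen only for illustration; real antivirus engines are considerably more involved.

```python
import hashlib

# Hypothetical payload standing in for a known-malicious binary.
original = b"example malicious payload"

# Hypothetical registry of MD5 digests for samples that were already identified.
KNOWN_BAD_MD5 = {hashlib.md5(original).hexdigest()}


def is_known_malware(data: bytes) -> bool:
    """Return True if the MD5 digest of `data` appears in the registry."""
    return hashlib.md5(data).hexdigest() in KNOWN_BAD_MD5


# A variant that differs by a single padding byte, as a polymorphic
# sample might, produces a different digest.
mutated = original + b"\x00"

print(is_known_malware(original))
print(is_known_malware(mutated))
```

Because the digest changes whenever any byte of the input changes, the mutated sample no longer matches the registry even though its behavior would be the same.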

We use a dataset of malicious samples collected from VxHeaven and VirusTotal. The former is a popular malware repository for cybersecurity hobbyists, and the latter is an antivirus search engine. The dataset is available via the UCI ML Repository. Our feature space includes static features extracted from a hex dump, as well as dynamic features captured at runtime in a Cuckoo sandbox.
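As an illustration of what a static feature derived from raw bytes can look like, the sketch below computes a normalized byte histogram and its Shannon entropy. This is an assumption made for illustration only: the specific features in the dataset are not listed here, and the function names are hypothetical.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte distribution, in bits per byte."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)


# Hypothetical inputs: one with every byte value once, one with a single repeated byte.
uniform_sample = bytes(range(256))
constant_sample = b"\x00" * 16

print(shannon_entropy(uniform_sample))
print(shannon_entropy(constant_sample))
```

High entropy is sometimes used as a hint that a binary is packed or encrypted, which is one reason byte-level statistics appear in static analysis.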

Using a densely connected neural network, we achieve 95% accuracy in classifying binary samples as malicious or benign. A random forest model on the same dataset yielded 97.4% accuracy. The neural network also had lower precision than the random forest: it produced 91 false positives vs. the random forest's 8. Overall, our random forest model outperformed deep learning techniques for this problem space.
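The link between false positives and precision can be sketched as follows. Only the false positive counts (91 and 8) come from the results above. The true positive count is a hypothetical placeholder, since it is not reported here, so the printed values are illustrative and are not the project's actual precision figures.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP): the share of 'malicious' predictions that were correct."""
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        raise ValueError("precision is undefined with no positive predictions")
    return true_positives / predicted_positive


# HYPOTHETICAL value, not taken from the project results. The same count is
# used for both models purely to isolate the effect of false positives.
ASSUMED_TRUE_POSITIVES = 1000

nn_precision = precision(ASSUMED_TRUE_POSITIVES, false_positives=91)
rf_precision = precision(ASSUMED_TRUE_POSITIVES, false_positives=8)

print(f"neural network (illustrative): {nn_precision:.3f}")
print(f"random forest  (illustrative): {rf_precision:.3f}")
```

For any fixed number of true positives, more false positives means lower precision, which is why the false positive counts alone indicate the direction of the gap between the two models.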

You can see the Jupyter notebook for this problem, including a more extensive writeup, here.

This project was completed as a part of CS-434 Applied Machine Learning @ OSU-Cascades.