Computer Science Research and Development

Support Vector Machine Classification of Data Dependency Graphs

Dr. John Musgrave — Wed, 09 Jul 2025 01:10:00 GMT

I will be presenting this paper at the IEEE NAECON 2025 conference.

Abstract: Support Vector Machine approaches provide efficient classification in data sets with high dimensionality. In prior studies we have presented classification results in a high-dimensional metric space constructed from data dependency graphs. These features were demonstrated to be tied to ground truth class labels, and therefore correlated to operational semantics. The dimensionality of the metric space is derived from the quantity of isomorphically unique data dependency graphs within the data set. While successive refinement can reduce the dimensionality and search space, dimensionality at the most coarse-grained level remains very high. We present results from Support Vector Machine classifiers and show high accuracy with low false positive rates using features correlated to operational semantics. We train classifiers on the Kaggle 2015 Microsoft Malware data set using a linear SVM classifier (One-vs-Rest and One-vs-One), an SVM classifier with an RBF kernel, an SVM classifier with a polynomial kernel, and an SVM classifier with a custom kernel based on computing the pairwise Hamming distance. This study obtains a total accuracy of over 93% for a multi-class classification problem using a Linear SVM (One-vs-Rest) classifier. The Linear SVM classifier has the lowest false positive rate and high precision, with 6 of 9 classes having precision above 90%, high F1 scores, and a ROC AUC of 0.98. Non-linear SVM kernels show a decrease in total accuracy, which indicates linearity in the associated feature space. The classifier was trained on features correlated with binary operational semantics, which are demonstrated to be tied to ground truth class labels.
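The abstract does not say how the pairwise Hamming distance is turned into a kernel. As a minimal sketch, assume each program is a binary presence vector over the isomorphically unique data dependency graphs, and take the kernel value to be the number of agreeing positions (vector length minus Hamming distance). That choice is an assumption made here for illustration; it is a valid positive semidefinite kernel on binary vectors, but it may differ from the kernel used in the paper. The function names are hypothetical.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_kernel_matrix(
    X: Sequence[Sequence[int]], Y: Sequence[Sequence[int]]
) -> List[List[int]]:
    """Gram matrix K[i][j] = len(x_i) - hamming_distance(x_i, y_j).

    Assumption (not from the paper): similarity is the count of agreeing
    positions. For binary vectors this equals the inner product of the
    feature map (x, 1 - x), so the resulting matrix is positive semidefinite.
    """
    return [[len(x) - hamming_distance(x, y) for y in Y] for x in X]


if __name__ == "__main__":
    # Toy presence vectors: 1 means the program contains that unique DDG.
    programs = [
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
    ]
    for row in hamming_kernel_matrix(programs, programs):
        print(row)
```

A Gram matrix of this form can be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.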

kNN Classification of Malware Data Dependency Graph Features

Dr. John Musgrave — Tue, 04 Jun 2024 19:36:00 GMT

I will be presenting this paper at the IEEE NAECON 2024 conference.

https://ieeexplore.ieee.org/document/10670673

Abstract: “Explainability of classification results is dependent upon the features used in classification. Data dependency graph features representing data movement are directly correlated with operational semantics, and subject to fine grained analysis. This study obtains accurate classification from the use of features tied to structure and semantics. By training an accurate model using labeled data, this feature representation of semantics is shown to be correlated with ground truth labels. This was performed using non-parametric learning with a novel feature representation on a large scale dataset, the Kaggle 2015 Malware dataset. The features used enable fine grained analysis, increase in resolution, and explainable inferences. This allows for the body of the term frequency distribution to be further analyzed and to provide an increase in feature resolution over term frequency features. This method obtains high accuracy from analysis of a single instruction, a method that can be repeated for additional instructions to obtain further increases in accuracy. This study evaluates the hypothesis that the semantic representation and analysis of structure are able to make accurate predictions that are also correlated to ground truth labels. Additionally, similarity in the metric space can be calculated directly without prior training. Our results provide evidence that data dependency graphs accurately capture both semantic and structural information for increased explainability in classification results.”

@INPROCEEDINGS{10670673,
  author={Musgrave, John and Ralescu, Anca},
  booktitle={NAECON 2024 - IEEE National Aerospace and Electronics Conference},
  title={kNN Classification of Malware Data Dependency Graph Features},
  year={2024},
  pages={206-213},
  keywords={Training;Measurement;Accuracy;Semantics;Aerospace electronics;Feature extraction;Malware;machine learning;feature extraction;malware analysis},
  doi={10.1109/NAECON61878.2024.10670673}
}
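The non-parametric classification described in the abstract above can be illustrated with a small k-Nearest Neighbors sketch. It assumes binary presence vectors and Hamming distance as the metric, with a simple majority vote; the distance function, the value of k, and the function names are assumptions for illustration rather than details taken from the paper. There is no training step: prediction only ranks labeled examples by distance to the query.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[int], str]  # (binary feature vector, class label)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def knn_predict(query: Sequence[int], examples: List[Example], k: int = 3) -> str:
    """Return the majority label among the k examples closest to the query.

    sorted() is stable, so examples at equal distance keep their input order.
    Vote ties go to the label encountered first among the nearest neighbours.
    """
    if k < 1 or not examples:
        raise ValueError("need k >= 1 and at least one labeled example")
    ranked = sorted(examples, key=lambda ex: hamming(query, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    labeled = [
        ([1, 1, 0, 0], "A"),
        ([1, 1, 1, 0], "A"),
        ([0, 0, 1, 1], "B"),
        ([0, 1, 1, 1], "B"),
        ([0, 0, 0, 1], "B"),
    ]
    print(knn_predict([1, 0, 0, 0], labeled, k=3))
```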

Some recent papers

Dr. John Musgrave — Wed, 20 Mar 2024 18:00:00 GMT

Search and Retrieval in Semantic-Structural Representations of Novel Malware

Abstract: “In this study, we present a novel representation for binary programs which captures semantic similarity and structural properties. This representation enables the search and retrieval of binary executable programs based on their similarity of behavioral properties. The proposed representation is composed in a bottom-up approach: we begin by extracting data dependency graphs (DDG), which are representative of both program structure and operational semantics. We then encode each program as a set of graph hashes representing isomorphic uniqueness, a method we have labeled DDG Fingerprinting. We present experimental results of search using k-Nearest Neighbors in a metric space constructed from a set of binary executables. Searches in the dataset are based on the operational semantics of specific malware examples.”

http://dx.doi.org/10.54364/aaiml.2024.41117

@article{musgrave2024search, title={Search and Retrieval in Semantic-Structural Representations of Novel Malware}, author={Musgrave, John and Campan, Alina and Messay-Kebede, Temesguen and Kapp, David and Wang, Boyang}, journal={Advances in Artificial Intelligence and Machine Learning}, volume={4}, number={1}, pages={117}, year={2024} }
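The fingerprinting idea in the abstract above (encode each program as a set of graph hashes, then search by similarity) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the hash is a simplified Weisfeiler-Lehman style refinement, which gives isomorphic graphs the same hash but can also collide for some non-isomorphic graphs, and the search ranks programs by Jaccard distance between hash sets. All function names, the label scheme, and the choice of Jaccard distance are hypothetical.

```python
import hashlib
from typing import Dict, FrozenSet, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]
Graph = Tuple[List[Edge], Dict[Hashable, str]]  # (directed edges, node labels)


def _digest(payload: str) -> str:
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def graph_hash(edges: List[Edge], labels: Dict[Hashable, str], rounds: int = 3) -> str:
    """Hash a labeled directed graph independently of its node names.

    Every node that appears in an edge must have an entry in `labels`.
    """
    succ = {n: [] for n in labels}
    pred = {n: [] for n in labels}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    current = dict(labels)
    for _ in range(rounds):
        # Refine each node label from its own label plus the sorted labels
        # of its successors and predecessors.
        current = {
            n: _digest(repr((
                current[n],
                sorted(current[m] for m in succ[n]),
                sorted(current[m] for m in pred[n]),
            )))
            for n in current
        }
    # The graph hash depends only on the multiset of refined labels.
    return _digest(repr(sorted(current.values())))


def fingerprint(graphs: Iterable[Graph]) -> FrozenSet[str]:
    """Encode a program as the set of hashes of its dependency graphs."""
    return frozenset(graph_hash(edges, labels) for edges, labels in graphs)


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest(query: FrozenSet[str], corpus: Dict[str, FrozenSet[str]],
            k: int = 3) -> List[Tuple[str, float]]:
    """Return the k corpus entries closest to the query fingerprint."""
    scored = [(name, jaccard_distance(query, fp)) for name, fp in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

For example, two graphs that differ only in node naming hash to the same value, so a query program that shares dependency graphs with a corpus program ranks that program ahead of unrelated ones.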

Empirical Network Structure of Malicious Programs

Abstract: “A modern binary executable is a composition of various types of networks. Control flow graphs are a commonly used representation of an executable program used for classification tasks. Control flow and term frequency representations are widely adopted, but provide only a partial view of program semantics and present challenges to increases in resolution. By performing a quantitative analysis of program networks, we enable the identification of patterns within these features that are correlated to structure. This allows for increases in feature resolution and pattern recognition in classification tasks. These are necessary steps in order to obtain greater explainability in classification results. We demonstrate the presence of Scale-Free properties of network structure for program data dependency and control flow graphs, and show that data dependency graphs also have Small-World structural properties. We show that program data dependency graphs have a degree correlation that is structurally disassortative, and that control flow graphs have a neutral degree assortativity, indicating the use of random graphs to model the structural properties of program control flow graphs would show increased accuracy. An increase in feature resolution allows for the structural properties of program classes to be analyzed for patterns as well as their component parts. By providing an increase in feature resolution within labeled datasets of executable programs we provide a quantitative basis to interpret the results of classifiers trained on CFG graph features. By capturing a complete picture of program networks we can enable future work in mapping a program’s operational semantics to its structure.”

http://dx.doi.org/10.54364/aaiml.2024.41112

@article{musgrave2024empirical, title={Empirical Network Structure of Malicious Programs}, author={Musgrave, John and Campan, Alina and Messay-Kebede, Temesguen and Kapp, David and Wang, Boyang}, journal={Advances in Artificial Intelligence and Machine Learning}, volume={4}, number={1}, pages={112}, year={2024} }
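The degree correlation mentioned in the abstract above is commonly measured with Newman's degree assortativity coefficient: the Pearson correlation between the degrees found at the two ends of each edge. Negative values indicate disassortative structure, and values near zero indicate neutral mixing. The sketch below is a minimal pure-Python version that treats the graph as simple and undirected, which is an assumption made here for brevity; the paper's graphs are directed program graphs, and its exact measurement procedure is not given in the abstract.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def degree_assortativity(edges: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Pearson correlation of endpoint degrees over an undirected edge list.

    With m edges and endpoint degrees (j, k) per edge:
        S1 = sum(j * k), S2 = sum(j + k), S3 = sum(j**2 + k**2)
        r  = (4*m*S1 - S2**2) / (2*m*S3 - S2**2)
    This is Newman's formula with the common factor 1 / (4*m**2) cancelled,
    so the intermediate sums stay exact integers.
    """
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("graph has no edges")
    degree = Counter()
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1
    m = len(edge_list)
    s1 = sum(degree[u] * degree[v] for u, v in edge_list)
    s2 = sum(degree[u] + degree[v] for u, v in edge_list)
    s3 = sum(degree[u] ** 2 + degree[v] ** 2 for u, v in edge_list)
    denominator = 2 * m * s3 - s2 ** 2
    if denominator == 0:
        # Every edge endpoint has the same degree, e.g. a cycle.
        raise ValueError("assortativity is undefined for this degree sequence")
    return (4 * m * s1 - s2 ** 2) / denominator


if __name__ == "__main__":
    star = [(0, 1), (0, 2), (0, 3)]  # one hub connected to three leaves
    print(degree_assortativity(star))
```

A star graph is the extreme disassortative case, since every edge joins the high-degree hub to a degree-one leaf.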