Support Vector Machine Classification of Data Dependency Graphs
I will be presenting this paper at the IEEE NAECON 2025 conference.
Abstract: Support Vector Machine (SVM) approaches provide efficient classification in data sets with high dimensionality. In prior studies we presented classification results in a high-dimensional metric space constructed from data dependency graphs. These features were demonstrated to be tied to ground-truth class labels and are therefore correlated with operational semantics. The dimensionality of the metric space is derived from the number of isomorphically unique data dependency graphs in the data set. While successive refinement can reduce the dimensionality and search space, dimensionality at the coarsest-grained level remains very high. We present results from SVM classifiers and show high accuracy with low false positive rates using features correlated with operational semantics. We train classifiers on the Kaggle 2015 Microsoft Malware data set using a linear SVM (both One-vs-Rest and One-vs-One), an SVM with an RBF kernel, an SVM with a polynomial kernel, and an SVM with a custom kernel based on computing the pairwise Hamming distance. This study obtains a total accuracy of over 93% on a multi-class classification problem using a linear SVM (One-vs-Rest) classifier. The linear SVM classifier has the lowest false positive rate and high precision, with 6 of 9 classes achieving precision above 90%, high F1 scores, and a ROC AUC of 0.98. Non-linear SVM kernels show a decrease in total accuracy, which indicates linearity in the associated feature space. The classifier was trained on features correlated with binary operational semantics, which are demonstrated to be tied to ground-truth class labels.
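To illustrate the kind of setup described above, here is a minimal sketch of how a custom Hamming-distance-based kernel can be plugged into a scikit-learn SVM. The exact kernel form used in the paper is not specified here, so this example assumes binary feature vectors and uses the number of matching positions (feature count minus Hamming distance) as the similarity; the data and labels are hypothetical stand-ins, not the Kaggle 2015 features.

```python
import numpy as np
from sklearn.svm import SVC

def hamming_kernel(X, Y):
    """Similarity kernel for binary feature vectors.

    Computes pairwise Hamming distance between rows of X (n, d) and
    Y (m, d), then converts it to a similarity: d - distance, i.e. the
    number of positions where the two vectors agree. For binary inputs
    this count of agreements is a valid positive semi-definite kernel.
    """
    dist = np.count_nonzero(X[:, None, :] != Y[None, :, :], axis=2)
    return X.shape[1] - dist

# Hypothetical binary feature vectors standing in for the
# data-dependency-graph features; labels are a toy rule.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 16))
y = X[:, 0]  # toy label: first feature bit

# SVC accepts a callable kernel computing the Gram matrix.
clf = SVC(kernel=hamming_kernel).fit(X, y)
preds = clf.predict(X)
```

Passing a callable as `kernel` makes scikit-learn call it once on the training data to build the Gram matrix and again at prediction time between test and training rows, so the same function covers both phases.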