SFCWGAN-BiTCN with Sequential Features for Malware Detection
Abstract
1. Introduction
- SFCWGAN-BiTCN, a malware detection approach based on SFCWGAN and BiTCN, is proposed to mitigate the effect of imbalanced malware family samples on detection accuracy.
- A new word-embedding method built on the Word2Vec algorithm extracts API and Opcode call sequences in the order of their virtual addresses and maps them into a vector space, exploiting the sequential semantics of the API and Opcode call contexts.
- Feature selection and merging with the whale optimization algorithm combined with extreme gradient boosting (WOA-XGBoost) and with Spearman correlation coefficients removes redundant features, simplifies the Word2Vec features, and improves detection accuracy and efficiency.
- CWGAN generates samples for the under-represented malware families to supplement the dataset, which enhances model training, reduces the effect of malware family sample imbalance on detection, and improves detection accuracy.
- BiTCN, an improved TCN model, extracts time-varying features from malware sequences for deep feature mining of the time series, so that temporal features are fully exploited.
- The malware detection structure is shown in Figure 1.
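The sequence-preparation step described in the contributions above (ordering API and Opcode tokens by virtual address before embedding them) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `order_by_address` and `skipgram_pairs`, the toy addresses, and the window size are all assumptions, and the (center, context) pairs only stand in for the training input that a Word2Vec skip-gram model would consume.

```python
from typing import List, Tuple


def order_by_address(records: List[Tuple[int, str]]) -> List[str]:
    """Return tokens sorted by their virtual address (ascending)."""
    return [token for _, token in sorted(records, key=lambda r: r[0])]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs of the kind used for skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical (virtual address, token) records extracted from a disassembly.
records = [
    (0x401010, "mov"),
    (0x401000, "push"),
    (0x401020, "call"),
    (0x401018, "CreateFileA"),
]

sequence = order_by_address(records)
pairs = skipgram_pairs(sequence, window=1)
print(sequence)
print(pairs)
```

Sorting before pairing is what preserves the call context: each token is paired only with its neighbors in address order, so the embedding is trained on the sequential semantics rather than on extraction order.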
2. Related Work
2.1. Unbalanced Datasets
2.2. Generative Adversarial Networks (GANs)
3. SFCWGAN and BiTCN
3.1. CWGAN
3.2. BiTCN
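As a rough illustration of the building block behind a bidirectional TCN, the sketch below applies a dilated causal convolution to a sequence in the forward direction and again to the reversed sequence, then pairs the two outputs per time step. It assumes that "bidirectional" means combining a forward and a backward causal pass; the function names, the single-channel setting, and the toy kernel are hypothetical and do not reproduce the model in this paper.

```python
from typing import List, Tuple


def causal_dilated_conv(seq: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """y[t] = sum_k kernel[k] * seq[t - k * dilation], treating positions before the start as zero."""
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * seq[idx]
        out.append(acc)
    return out


def bidirectional_features(seq: List[float], kernel: List[float], dilation: int = 1) -> List[Tuple[float, float]]:
    """Pair the forward causal output with the output of a causal pass over the reversed sequence."""
    forward = causal_dilated_conv(seq, kernel, dilation)
    # Run the same causal filter on the reversed sequence, then flip back so
    # that index t again refers to the original time step.
    backward = causal_dilated_conv(seq[::-1], kernel, dilation)[::-1]
    return list(zip(forward, backward))


seq = [1.0, 2.0, 3.0, 4.0]
features = bidirectional_features(seq, kernel=[1.0, 1.0], dilation=1)
dilated = causal_dilated_conv(seq, kernel=[1.0, 1.0], dilation=2)
print(features)
print(dilated)
```

The forward output at step t depends only on steps up to t, while the backward output depends only on steps from t onward, so each position sees context from both directions without either pass leaking information across its own causal boundary.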
3.3. Feature Pre-Processing
3.4. Feature Pre-Processing
3.4.1. WOA-XGBoost
3.4.2. Spearman’s Correlation Coefficient
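One plausible way to use Spearman's correlation coefficient for removing redundant features is sketched below: compute the rank correlation between feature columns and keep a feature only if it is not strongly correlated with any feature already kept. The greedy strategy, the threshold of 0.9, and the feature names are assumptions made for illustration; the source does not specify them. The helper assumes non-constant columns of equal length.

```python
import math
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def drop_redundant(features: Dict[str, List[float]], threshold: float = 0.9) -> List[str]:
    """Greedily keep features whose |rho| with every kept feature stays below the threshold."""
    kept: List[str] = []
    for name, column in features.items():
        if all(abs(spearman(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns (one value per sample).
features = {
    "api_freq": [1, 2, 3, 4, 5],
    "api_freq_sq": [1, 4, 9, 16, 25],  # monotone in api_freq, so rho = 1
    "opcode_len": [2, 1, 4, 3, 5],
}
kept = drop_redundant(features, threshold=0.9)
print(kept)
```

Because Spearman's coefficient works on ranks, it flags any monotone relationship as redundant, not only linear ones, which is why the squared feature is dropped here.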
3.5. Sample Generation
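Algorithm 1: Minority class sample generation based on CWGAN |
Input: $(z, y)$, where $z$ is noise data and $y$ is class label |
1. Initialize the generator and the discriminator: generator parameters, discriminator parameters, the gradient penalty coefficient, and the Adam learning rate |
2. While the stopping criterion (a value approaching 0.05) is not met do /* CWGAN training */ |
3. for each discriminator iteration do /* optimize discriminator */ |
4. Sample a batch of real samples together with their class labels $y$ |
5. Sample a batch of noise $z$ |
6. Compute the discriminator gradient, including the gradient penalty /* calc gradient */ |
7. Update the discriminator parameters /* update parameters */ |
8. end |
9. Sample a batch of noise $z$ /* optimize generator */ |
10. Compute the generator gradient /* calc gradient */ |
11. Update the generator parameters /* update parameters */ |
end |
Return the generated samples /* generate samples */ |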
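The discriminator (critic) objective with a gradient penalty that Algorithm 1 optimizes can be illustrated with a deliberately simplified sketch. It assumes the standard WGAN-GP form of the loss and uses a linear critic so that the gradient with respect to the input is known in closed form; a real implementation would use a neural network, automatic differentiation, and points interpolated between real and generated samples. All names, the one-hot label encoding, and the penalty coefficient of 10 are assumptions.

```python
import math
from typing import List


def critic(w: List[float], b: float, x: List[float], y_onehot: List[float]) -> float:
    """Linear critic score for a sample x conditioned on a one-hot class label."""
    inputs = list(x) + list(y_onehot)
    return sum(wi * vi for wi, vi in zip(w, inputs)) + b


def critic_loss(
    w: List[float],
    b: float,
    real: List[List[float]],
    fake: List[List[float]],
    labels: List[List[float]],
    lam: float = 10.0,
) -> float:
    """Assumed WGAN-GP critic loss: E[D(fake)] - E[D(real)] + lam * (||grad_x D|| - 1)^2."""
    n = len(real)
    d_real = sum(critic(w, b, x, y) for x, y in zip(real, labels)) / n
    d_fake = sum(critic(w, b, x, y) for x, y in zip(fake, labels)) / n
    dim_x = len(real[0])
    # For a linear critic the gradient w.r.t. x is the constant vector w[:dim_x],
    # so the penalty is identical at every interpolated point.
    grad_norm = math.sqrt(sum(wi * wi for wi in w[:dim_x]))
    penalty = lam * (grad_norm - 1.0) ** 2
    return d_fake - d_real + penalty


# Toy batch: one real and one generated 2-D sample, both labeled as class 0 of 2.
w = [3.0, 4.0, 0.0, 0.0]
real = [[1.0, 1.0]]
fake = [[0.0, 0.0]]
labels = [[1.0, 0.0]]
loss = critic_loss(w, 0.0, real, fake, labels)
print(loss)
```

In this toy case the gradient norm is 5, far from the target of 1, so the penalty term dominates the loss; minimizing it pushes the critic toward the unit-gradient constraint that the gradient penalty is meant to enforce.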
3.6. Feature Extraction and Training
4. Experiments and Analysis of Results
4.1. Datasets
4.1.1. Kaggle
4.1.2. DataCon
4.2. Experimental Assessment Criteria
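The four criteria reported in the result tables (accuracy, precision, recall, and F1-score) can be computed as in the sketch below. Macro-averaging over classes is an assumption here, since the averaging scheme is not stated in this section; the function name and the toy labels are hypothetical.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with two classes.
metrics = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(metrics)
```

With imbalanced families, macro-averaging weights every class equally, so a small family such as Simda influences the score as much as a large one; a weighted or micro average would behave differently.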
4.3. Experimental Results and Analysis
- Accuracy analysis experiments
- Noise analysis experiments
- Comparison of different sample balance algorithms
- Ablation experiments
- Comparison of different classification algorithms
- Comparison of existing methods
4.3.1. Accuracy Analysis Experiments
4.3.2. Noise Analysis Experiments
4.3.3. Comparison of Different Sample Balance Algorithms
4.3.4. Ablation Experiments
4.3.5. Comparison of Different Classification Algorithms
4.3.6. Comparison of Existing Methods
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kim, S.; Hong, S.; Oh, J. Obfuscated VBA macro detection using machine learning. In Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Luxembourg, 25–28 June 2018; pp. 490–501.
- Wang, S.; Chen, Z.; Yan, Q. Deep and broad URL feature mining for android malware detection. Inf. Sci. 2020, 513, 600–613.
- Demetrio, L.; Coull, S.E.; Biggio, B. Adversarial examples: A survey and experimental evaluation of practical attacks on machine learning for windows malware detection. ACM Trans. Priv. Secur. 2021, 24, 1–31.
- Li, D.; Li, Q.; Ye, Y. Arms race in adversarial malware detection: A survey. ACM Comput. Surv. 2021, 55, 1–35.
- Mimura, M.; Ohminami, T. Using LSI to detect unknown malicious VBA macros. J. Inf. Process. 2020, 28, 493–501.
- Mimura, M. Using fake text vectors to improve the sensitivity of minority class for macro malware detection. J. Inf. Secur. Appl. 2020, 54, 102600.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Bunkhumpornpa, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482.
- Graa, O.; Rekik, I. Multi-view learning-based data proliferator for boosting classification using highly imbalanced classes. J. Neurosci. Methods 2019, 327, 108344.
- Fu, G.H.; Wu, Y.J.; Zong, M.J. Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data. BMC Bioinform. 2020, 21, 1–14.
- Cui, Z.; Xue, F.; Cai, X. Detection of malicious code variants based on deep learning. IEEE Trans. Ind. Inform. 2018, 14, 3187–3196.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Kim, J.Y.; Bu, S.J.; Cho, S.B. Malware detection using deep transferred generative adversarial networks. In Proceedings of the 2017 International Conference on Neural Information Processing, Long Beach, CA, USA, 4–9 December 2017; pp. 556–564.
- Kim, J.Y.; Bu, S.J.; Cho, S.B. Zero-day malware detection using transferred generative adversarial networks based on deep autoencoders. Inf. Sci. 2018, 460, 83–102.
- Liu, Y.; Li, J.; Liu, B. Malware detection method based on image analysis and generative adversarial networks. Concurr. Comput.: Pract. Exp. 2022, 34, e7170.
- Suciu, O.; Coull, S.E.; Johns, J. Exploring adversarial examples in malware detection. In Proceedings of the 2019 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 19–23 May 2019; pp. 8–14.
- Hu, W.; Tan, Y. Generating adversarial malware examples for black-box attacks based on GAN. Comput. Sci. 2017, 99, 8–14.
- Tang, C.; Zhang, Y.; Yang, Y.X. DroidGAN: Android adversarial sample generation framework based on DCGAN. J. Commun. 2018, 39, 64–69. (In Chinese)
- Rosenberg, I.; Shabtai, A.; Rokach, L. Generic black-box end-to-end attack against state of the art API call based malware classifiers. In Proceedings of the 2018 International Symposium on Research in Attacks, Intrusions, and Defenses, Crete, Greece, 10–12 September 2018; pp. 490–510.
- Jha, S.; Prashar, D.; Long, H.V. Recurrent neural network for detecting malware. Comput. Secur. 2020, 99, 102037.
- Gibert, D.; Mateu, C.; Planes, J. HYDRA: A multimodal deep learning framework for malware classification. Comput. Secur. 2020, 95, 101873.
- Yu, L.; Zhang, W.; Wang, J. Seqgan: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 23–30.
- Liao, D.; Huang, S.; Tan, Y. Network intrusion detection method based on gan model. In Proceedings of the 2020 International Conference on Computer Communication and Network Security (CCNS), Xi’an, China, 21–23 August 2020; pp. 153–156.
- Huang, S.; Lei, K. IGAN-IDS: An imbalanced generative adversarial network towards intrusion detection system in ad-hoc networks. Ad Hoc Netw. 2020, 105, 102177.
- Solis, D.; Vicens, R. Convolutional neural networks for classification of malware assembly code. In Proceedings of the 20th International Conference of the Catalan Association for Artificial Intelligence, Terres de L’Ebre, Spain, 25–27 October 2017.
- McLaughlin, N.; Martinez del Rincon, J.; Kang, B.J. Deep android malware detection. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, Scottsdale, AZ, USA, 22–24 March 2017; pp. 301–308.
- Bhati, B.S.; Chugh, G.; Al-Turjman, F. An improved ensemble based intrusion detection technique using XGBoost. Trans. Emerg. Telecommun. Technol. 2021, 32, e4076.
- Ikram, S.T.; Cherukuri, A.K.; Poorva, B. Anomaly detection using XGBoost ensemble of deep neural network models. Cybern. Inf. Technol. 2021, 21, 175–188.
- Mirjalili, S.; Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 2016, 95, 51–67.
- Qiu, Y.; Zhou, J.; Khandelwal, M. Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Eng. Comput. 2021, 1–18.
- Mirjalili, S.; Mirjalili, S.M.; Hatamlou, A. Multi-verse optimizer: A nature-inspired algorithm for global optimization. Neural Comput. Appl. 2016, 27, 495–513.
- Dubey, G.P.; Bhujade, R.K. Optimal feature selection for machine learning based intrusion detection system by exploiting attribute dependence. Mater. Today Proc. 2021, 47, 6325–6331.
- Ronen, R.; Radu, M.; Feuerstein, C. Microsoft Malware Classification Challenge 2018. Comput. Secur. 2020, 95, 101873. Available online: https://www.kaggle.com/c/malware-classification/data (accessed on 28 December 2022).
- Qi An Xin Technology Research Institute. DataCon: Multidomain Large-Scale Competition Open Data for Security Research. Available online: https://datacon.qianxin.com/opendata (accessed on 11 November 2021). (In Chinese).
- Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887.
- Mease, D.; Wyner, A.J.; Buja, A. Boosted classification trees and class probability/quantile estimation. J. Mach. Learn. Res. 2007, 8.
- He, H.; Bai, Y.; Garcia, E.A. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–6 June 2008; pp. 1322–1328.
- Yu, Y.; Tang, B.; Lin, R. CWGAN: Conditional wasserstein generative adversarial nets for fault data generation. In Proceedings of the 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dali, China, 6–8 December 2019; pp. 2713–2718.
- Lu, W.; Li, J.; Wang, J. A CNN-BiLSTM-AM method for stock price prediction. Neural Comput. Appl. 2021, 33, 4741–4753.
- She, D.; Jia, M. A BiGRU method for remaining useful life prediction of machinery. Measurement 2021, 167, 108277.
- Gibert, D.; Mateu, C.; Planes, J. Orthrus: A Bimodal Learning Architecture for Malware Classification. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
- Yan, J.; Qi, Y.; Rao, Q. Detecting malware with an ensemble method based on deep neural network. Secur. Commun. Netw. 2018, 2018, 7247095.
- Marastoni, N.; Giacobazzi, R.; Dalla, P.M. Data augmentation and transfer learning to classify malware images in a deep learning context. J. Comput. Virol. Hacking Tech. 2021, 17, 279–297.
- Darem, A.; Abawajy, J.; Makkar, A. Visualization and deep-learning-based malware variant detection using OpCode-level features. Future Gener. Comput. Syst. 2021, 125, 314–323.
- Lin, W.C.; Yeh, Y.R. Efficient Malware Classification by Binary Sequences with One-Dimensional Convolutional Neural Networks. Mathematics 2022, 10, 608.
- Chen, X.; Hao, Z.; Li, L. CruParamer: Learning on Parameter-Augmented API Sequences for Malware Detection. IEEE Trans. Inf. Forensics Secur. 2022, 17, 788–803.
Family Name | Samples | Type |
---|---|---|
Ramnit | 1541 | Worm |
Lollipop | 2478 | Adware |
Kelihos_ver3 | 2942 | Backdoor |
Vundo | 475 | Trojan |
Simda | 42 | Backdoor |
Tracur | 751 | TrojanDownloader |
Kelihos_ver1 | 398 | Backdoor |
Obfuscator.ACY | 1228 | Any kind of obfuscated malware |
Gatak | 1013 | Backdoor |
Family Name | Samples | Type |
---|---|---|
White | 15,759 | No_Miner |
Black | 7896 | Miner |
Algorithm | Accuracy/% | Precision/% | Recall/% | F1-Score/% |
---|---|---|---|---|
BO-XGBoost | 97.16 | 97.16 | 97.16 | 97.16 |
GWO-XGBoost | 97.62 | 97.62 | 97.62 | 97.62 |
WOA-XGBoost | 98.34 | 98.34 | 98.34 | 98.34 |
Dataset \ Noise level | 0 | 0.01 | 0.03 | 0.05 |
---|---|---|---|---|
Kaggle | 99.16 ± 0.18 | 98.87 ± 0.21 | 98.46 ± 0.26 | 98.01 ± 0.31 |
DataCon | 96.53 ± 0.21 | 96.63 ± 0.25 | 95.81 ± 0.28 | 95.20 ± 0.32 |
Algorithm | Method | Accuracy/% | Precision/% | Recall/% | F1-Score/% |
---|---|---|---|---|---|
BOS-BiTCN | Synthesizes samples at random near the class borderline to balance the samples | 99.13 | 99.13 | 99.13 | 99.13 |
ROS-BiTCN | Resamples the existing samples | 99.03 | 99.03 | 99.03 | 99.03 |
ADASYN-BiTCN | Uses KNN to create samples according to density | 99.22 | 99.22 | 99.22 | 99.22 |
SMOTE-BiTCN | Uses KNN for random synthesis to balance the samples | 99.35 | 99.35 | 99.35 | 99.35 |
CWGAN-BiTCN | Samples generated by CWGAN | 99.46 | 99.46 | 99.46 | 99.46 |
Model in this paper | Uses feature selection and mitigates the vanishing-gradient problem | 99.55 | 99.57 | 99.54 | 99.53 |
Algorithm | Method | Accuracy/% | Precision/% | Recall/% | F1-Score/% |
---|---|---|---|---|---|
BOS-BiTCN | Synthesizes samples at random near the class borderline to balance the samples | 96.63 | 96.63 | 96.63 | 96.63 |
ROS-BiTCN | Resamples the existing samples | 96.53 | 96.50 | 96.51 | 96.55 |
ADASYN-BiTCN | Uses KNN to create samples according to density | 96.72 | 96.72 | 96.72 | 96.72 |
SMOTE-BiTCN | Uses KNN for random synthesis to balance the samples | 96.76 | 96.76 | 96.76 | 96.76 |
CWGAN-BiTCN | Samples generated by CWGAN | 96.84 | 96.84 | 96.84 | 96.84 |
Model in this paper | Uses feature selection and mitigates the vanishing-gradient problem | 96.96 | 96.97 | 96.95 | 96.98 |
Algorithm \ Malware Family | Vundo | Simda | Tracur | Kelihos_ver1 |
---|---|---|---|---|
BiTCN | 97.68 | 85.92 | 98.01 | 97.89 |
GAN-BiTCN | 98.53 | 88.92 | 99.21 | 98.50 |
CWGAN-BiTCN | 99.03 | 89.92 | 99.01 | 99.02 |
Model in this paper | 99.46 | 98.92 | 99.43 | 99.51 |
Algorithm \ Malware Type | White | Black |
---|---|---|
BiTCN | 98.23 | 95.66 |
GAN-BiTCN | 98.93 | 96.22 |
CWGAN-BiTCN | 99.35 | 96.51 |
Model in this paper | 99.96 | 96.92 |
Algorithm | Accuracy/% | Precision/% | Recall/% | F1-Score/% |
---|---|---|---|---|
SFCWGAN-LSTM | 98.32 | 98.34 | 98.33 | 98.32 |
SFCWGAN-BiLSTM | 98.87 | 98.82 | 98.88 | 98.86 |
SFCWGAN-GRU | 98.56 | 98.51 | 98.53 | 98.55 |
SFCWGAN-BiGRU | 98.96 | 98.97 | 98.96 | 98.93 |
SFCWGAN-TCN | 99.14 | 99.13 | 99.15 | 99.13 |
Model in this paper | 99.55 | 99.57 | 99.54 | 99.53 |
Algorithm | Accuracy/% | Precision/% | Recall/% | F1-Score/% |
---|---|---|---|---|
SFCWGAN-LSTM | 96.63 | 96.61 | 96.58 | 96.60 |
SFCWGAN-BiLSTM | 96.72 | 96.71 | 96.70 | 96.71 |
SFCWGAN-GRU | 96.69 | 96.68 | 96.68 | 96.67 |
SFCWGAN-BiGRU | 96.82 | 96.81 | 96.80 | 96.83 |
SFCWGAN-TCN | 96.8 | 96.80 | 96.78 | 96.80 |
Model in this paper | 96.96 | 96.97 | 96.95 | 96.98 |
Reference | Method | Accuracy/% |
---|---|---|
Ref. [20] | CNN + Gray | 97.49 |
Ref. [21] | LeNet5 + RGB + Word2Vec | 98.76 |
Ref. [41] | Byte + Opcode | 98.24 |
Ref. [42] | Opcode + Grayscale map | 99.36 |
Ref. [43] | Image + CNN and LSTM | 98.5 |
Ref. [44] | Opcode + image+ segment + other | 99.12 |
Ref. [45] | Byte sequence + EfficientNet | 97.47 |
Ref. [46] | API + LSTM + BiLSTM | 99.38 |
Ours | CWGAN + BiTCN | 99.56 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Xuan, B.; Li, J.; Song, Y. SFCWGAN-BiTCN with Sequential Features for Malware Detection. Appl. Sci. 2023, 13, 2079. https://doi.org/10.3390/app13042079