Deep Forest and Pruned Syntax Tree-Based Classification Method for Java Code Vulnerability
Abstract
1. Introduction
- We use a breadth-first algorithm to prevent the loss of semantic structure information when processing abstract syntax trees. To the best of our knowledge, we are the first to extract statement subtrees from an AST by breadth-first traversal, which avoids the information loss that occurs under depth-first processing, where different code structures can be flattened into the same sequence of tree nodes.
- We propose a statement tree pruning algorithm to remove the information redundancy caused by irrelevant source-code elements such as package references and comments. The corresponding AST nodes do not help classification, but they lengthen training time and encourage overfitting; we therefore prune them from the statement trees obtained from the AST.
2. Related Work
3. Model Building
3.1. Parse Source Code into AST
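As a minimal sketch of this step, assuming the third-party javalang library (an assumption; the paper does not name its parser):

```python
# Hedged sketch: parse Java source into an AST with javalang (assumed tool).
import javalang

code = "public class Example { int add(int a, int b) { return a + b; } }"
ast_root = javalang.parse.parse(code)   # CompilationUnit: the AST root

# javalang's filter API walks the tree and yields (path, node) pairs.
for path, node in ast_root.filter(javalang.tree.MethodDeclaration):
    print(node.name)                    # -> add
```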
3.2. Obtain the Sequence of Statement Trees
Algorithm 1 Original AST parsing algorithm.
Input: AST root node
Output: sequence of statement trees
1: function dfs(root, sequence)
2:   if root is null then  // base case
3:     return
4:   end if
5:   sequence.add(root)
6:   for child in root.children do  // recursive traversal
7:     dfs(child, sequence)
8:   end for
9: end function
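For concreteness, a minimal Python transcription of Algorithm 1 follows; the `children` attribute is an assumed stand-in for the AST node interface.

```python
# Python transcription of Algorithm 1: depth-first traversal of an AST.
def dfs(root, sequence):
    """Append root and all of its descendants to `sequence`, depth-first."""
    if root is None:                 # base case
        return
    sequence.append(root)
    for child in root.children:      # recursive traversal
        dfs(child, sequence)
```

As noted in the Introduction, this flattening can map structurally different subtrees to the same node sequence, which is the limitation Algorithm 2 addresses.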
Algorithm 2 Improved AST parsing algorithm.
Input: AST root node
Output: sequence of statement trees
1: res = []; queue = []  // init
2: queue.add(root)  // add root into queue
3: while queue is not empty do  // breadth-first traversal
4:   node = queue.pop()  // head element out
5:   if node is a statement node then
6:     res.add(node)
7:   end if
8:   for child in node.children do
9:     queue.add(child)
10:   end for
11: end while
12: return res
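A corresponding Python sketch of Algorithm 2, where `is_statement` is an assumed predicate standing in for the paper's statement-node test:

```python
# Python transcription of Algorithm 2: breadth-first extraction of
# statement nodes, each of which roots a statement tree.
from collections import deque

def bfs_statement_trees(root, is_statement):
    res = []
    queue = deque([root])            # add root into the queue
    while queue:                     # breadth-first traversal
        node = queue.popleft()       # head element out
        if is_statement(node):       # keep statement nodes only
            res.append(node)
        for child in node.children:
            queue.append(child)
    return res
```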
3.3. Irrelevant Node Pruning
Algorithm 3 Node pruning algorithm.
Input: sequence of statement trees
Output: pruned sequence of statement trees
1: queue = []  // init output sequence
2: for root in sequence do
3:   if root is not an irrelevant statement tree then  // discard useless statement trees
4:     for child in root.children do
5:       if child is an irrelevant node then
6:         root.remove(child)  // discard useless tree node
7:       end if
8:     end for
9:     queue.add(root)
10:   end if
11: end for
12: return queue
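A Python sketch of Algorithm 3; both predicates are assumptions standing in for the paper's irrelevance criteria (package references, comments, and similar nodes):

```python
# Python transcription of Algorithm 3: prune irrelevant statement trees
# and irrelevant child nodes before encoding.
def prune_statement_trees(trees, is_irrelevant_tree, is_irrelevant_node):
    pruned = []
    for root in trees:
        if is_irrelevant_tree(root):             # discard useless statement tree
            continue
        root.children = [child for child in root.children
                         if not is_irrelevant_node(child)]  # discard useless nodes
        pruned.append(root)
    return pruned
```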
3.4. Encode Sequence of Statement Trees as Vectors
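The paper's encoder is not reproduced here; as a hedged sketch, assuming each pruned statement tree has been serialized into a token list, token embeddings can be trained with word2vec [9] via gensim (an assumed toolkit):

```python
# Hedged sketch: word2vec embeddings over serialized statement-tree tokens.
# The corpus below is a toy stand-in; dimensions are illustrative.
from gensim.models import Word2Vec

corpus = [
    ["MethodInvocation", "getParameter", "String"],
    ["Assignment", "LocalVariable", "MethodInvocation"],
]

w2v = Word2Vec(sentences=corpus, vector_size=128, window=5, min_count=1)
vec = w2v.wv["MethodInvocation"]   # 128-dimensional token vector
```

A statement-tree vector can then be obtained by combining its token vectors (e.g., by pooling), so that each function becomes a sequence of such vectors.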
3.5. Vulnerability Classification
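As a minimal sketch of this stage, assuming each code sample has already been reduced to a fixed-length feature vector, a cascade deep forest [21] can be trained with the open-source deep-forest package (an assumed implementation; the paper does not name one):

```python
# Hedged sketch: cascade deep forest on placeholder feature vectors.
import numpy as np
from deepforest import CascadeForestClassifier  # pip install deep-forest

X = np.random.rand(200, 128)           # pooled code vectors (placeholder data)
y = np.random.randint(0, 2, size=200)  # vulnerable / not vulnerable (placeholder)

clf = CascadeForestClassifier(random_state=0)
clf.fit(X, y)                          # cascade depth grows adaptively
print(clf.predict(X[:5]))
```

A cascade forest adds layers only while they improve validation performance, so model depth adapts to the data rather than being fixed in advance [21].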
4. Experiments
4.1. Experiment Settings and Dataset Description
4.2. Experiment 1
4.3. Experiment 2
4.4. Experiment 3
- In TextCNN, LSTM, and BiLSTM, the code is treated as plain text and converted into a token-sequence representation as input. For TextCNN, we set the kernel size to 3 and the number of filters to 100; for LSTM and BiLSTM, we set the dimensionality of the hidden state to 100 (see the TextCNN sketch after this list).
- For SVM, we combine the classifier with traditional statistical feature extraction methods: the TF-IDF (term frequency–inverse document frequency) algorithm, the N-gram [31] algorithm, and the LDA (latent Dirichlet allocation) [32] algorithm, each of which extracts tokens from the Java source code files. For LDA, the number of topics is set to 300; for N-gram, the maximum number of features is set to 1000 and the gram size to 2 (see the feature-pipeline sketch after this list).
- In MCDF, the Java source code is treated as a character stream: each 8-bit binary number is read as a decimal integer, and these integers are reshaped into fixed-line-width vectors that serve as the deep forest's input (see the byte-encoding sketch after this list).
- We also tested the performance of graph neural networks such as TextGCN. In TextGCN, the documents and words of the code files are treated as nodes of a single large graph: edges between word nodes are weighted by word co-occurrence frequency, and edges between word nodes and document nodes by document frequency and word frequency. The graph is then modeled with GCNs (graph convolutional networks), which turns code function classification into a node classification problem (see the edge-construction sketch after this list).
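The baseline configurations above can be illustrated with short, hedged sketches. None of them reproduce the authors' implementations; the frameworks (Keras, scikit-learn, NumPy) and every unspecified size are assumptions.

A TextCNN [29] with the stated kernel size 3 and 100 filters:

```python
# Hedged sketch of the TextCNN baseline; vocabulary size and class count
# are placeholders, not the paper's values.
import tensorflow as tf

VOCAB, NUM_CLASSES = 10_000, 11

textcnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, 100),              # token embeddings
    tf.keras.layers.Conv1D(filters=100, kernel_size=3,  # paper's settings
                           activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
textcnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

The SVM feature pipelines (bigrams capped at 1000 features; 300 LDA topics):

```python
# Hedged sketch of the SVM baselines with scikit-learn (assumed toolkit).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

svm_tfidf = make_pipeline(TfidfVectorizer(), LinearSVC())
svm_ngram = make_pipeline(
    CountVectorizer(ngram_range=(2, 2), max_features=1000),  # 2-grams, 1000 features
    LinearSVC())
svm_lda = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=300),             # 300 topics
    LinearSVC())
```

The MCDF [25] byte-stream encoding (the line width of 100 is illustrative):

```python
# Hedged sketch: read a .java file as raw bytes (0-255) and reshape the
# decimal integers into fixed-line-width rows for the deep forest input.
import numpy as np

def bytes_to_matrix(path, width=100):
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    data = np.pad(data, (0, (-len(data)) % width))  # pad to a multiple of width
    return data.reshape(-1, width)                  # one row per fixed-width line
```

TextGCN-style [30] edge construction from window co-occurrence and per-document word frequencies (the PMI and TF-IDF reweighting of Yao et al. is omitted here):

```python
# Hedged sketch of building the heterogeneous text graph's edge weights.
from collections import Counter
from itertools import combinations

def build_edges(docs, window=5):
    word_word, doc_word = Counter(), Counter()
    for d, tokens in enumerate(docs):
        doc_word.update((d, w) for w in tokens)      # document-word edges
        for i in range(max(1, len(tokens) - window + 1)):
            for a, b in combinations(sorted(set(tokens[i:i + window])), 2):
                word_word[(a, b)] += 1               # word-word co-occurrence edges
    return word_word, doc_word
```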
4.5. Experiment 4
4.6. Experiment 5
5. Discussion
5.1. Effects of PSTDF
5.2. Privacy Issues and Limitations
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
J2EE | Java 2 Platform Enterprise Edition
PSTDF | Pruned Statement Tree-based Deep Forest
CVE | Common Vulnerabilities and Exposures
AST | Abstract Syntax Tree
ASTNN | AST-based Neural Network
RNN | Recurrent Neural Network
BiLSTM | Bi-directional Long Short-Term Memory
LSTM | Long Short-Term Memory
GAN | Generative Adversarial Network
SAE | Sparse Auto Encoder
SIGMA | Semantic-complete Graph Matching
GSC | Graph-embedded Semantic Completion
GRU | Gated Recurrent Unit
ST | Statement Tree
PST | Pruned Statement Tree
CNN | Convolutional Neural Network
SVM | Support Vector Machine
MCDF | Malicious Code classification method based on Deep Forest
LDA | Latent Dirichlet Allocation
TF-IDF | Term Frequency–Inverse Document Frequency
GCN | Graph Convolutional Network
OWASP | Open Web Application Security Project
SARD | Software Assurance Reference Dataset
References
- CVE Details. Available online: https://www.cvedetails.com/browse-by-date.php (accessed on 18 December 2022).
- Younis, A.; Malaiya, Y.; Anderson, C.; Ray, I. To fear or not to fear that is the question: Code characteristics of a vulnerable function with an existing exploit. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, New Orleans, LA, USA, 9–11 March 2016; pp. 97–104.
- Anbiya, D.R.; Purwarianti, A.; Asnar, Y. Vulnerability detection in PHP web application using lexical analysis approach with machine learning. In Proceedings of the 2018 5th International Conference on Data and Software Engineering (ICoDSE), Mataram, Indonesia, 7–8 November 2018; pp. 1–6.
- Kim, S.; Zhao, J.; Tian, Y.; Chandra, S. Code prediction by feeding trees to transformers. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 22–30 May 2021; pp. 150–162.
- Liang, J.; Wang, M.; Zhou, C.; Wu, Z.; Jiang, Y.; Liu, J.; Liu, Z.; Sun, J. PATA: Fuzzing with Path Aware Taint Analysis. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; IEEE Computer Society: Los Alamitos, CA, USA, 2022; pp. 154–170.
- Lin, B.; Wang, S.; Wen, M.; Mao, X. Context-aware code change embedding for better patch correctness assessment. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2022, 31, 1–29.
- Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Wang, K.; Liu, X. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Montreal, QC, Canada, 25–31 May 2019; pp. 783–794.
- Meng, Y.; Liu, L. A deep learning approach for a source code detection model using self-attention. Complexity 2020, 2020, 5027198.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
- Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338.
- Hua, W.; Liu, G. Transformer-based networks over tree structures for code classification. Appl. Intell. 2022, 52, 8895–8909.
- Xing, Y.; Qian, X.; Guan, Y.; Zhang, S.; Zhao, M.; Lin, W. Cross-project Defect Prediction Method Using Adversarial Learning. J. Softw. 2022, 33, 2097–2112.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Bui, D.Q.N.; Yu, Y.; Jiang, L. TreeCaps: Tree-based capsule networks for source code processing. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Virtual Conference, 2–9 February 2021; pp. 2–9.
- Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
- Peng, Z.; Yao, Y.; Xiao, B.; Guo, S.; Yang, Y. When urban safety index inference meets location-based data. IEEE Trans. Mob. Comput. 2018, 18, 2701–2713.
- Kang, J.; Xiong, Z.; Niyato, D.; Zou, Y.; Zhang, Y.; Guizani, M. Reliable federated learning for mobile networks. IEEE Wirel. Commun. 2020, 27, 72–80.
- Li, W.; Liu, X.; Yuan, Y. SIGMA: Semantic-complete Graph Matching for Domain Adaptive Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5291–5300.
- Li, W.; Liu, X.; Yao, X.; Yuan, Y. SCAN: Cross Domain Object Detection with Semantic Conditioned Adaptation. In Proceedings of the AAAI, Virtual, 22 February–1 March 2022; Volume 6, p. 7.
- Ye, Z.B.; Yan, B. Survey of Symbolic Execution. Comput. Sci. 2018, 45, 28–35. (In Chinese with English abstract).
- Zhou, Z.H.; Feng, J. Deep forest. Natl. Sci. Rev. 2019, 6, 74–86.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Barchi, F.; Parisi, E.; Urgese, G.; Ficarra, E.; Acquaviva, A. Exploration of convolutional neural network models for source code classification. Eng. Appl. Artif. Intell. 2021, 97, 104075.
- Lu, X.; Duan, Z.; Qian, Y.; Zhou, W. Malicious Code Classification Method Based on Deep Forest. J. Softw. 2020, 31, 1454–1464.
- Lin, G.; Zhang, J.; Luo, W.; Pan, L.; Xiang, Y.; De Vel, O.; Montague, P. Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Ind. Inform. 2018, 14, 3289–3297.
- Zaremba, W.; Sutskever, I. Learning to execute. arXiv 2014, arXiv:1410.4615.
- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 207–212.
- Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882.
- Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7370–7377.
- Baygın, M. Classification of text documents based on Naive Bayes using N-Gram features. In Proceedings of the 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), Malatya, Turkey, 28–30 September 2018; pp. 1–5.
- Wang, W.; Guo, B.; Shen, Y.; Yang, H.; Chen, Y.; Suo, X. Twin labeled LDA: A supervised topic model for document classification. Appl. Intell. 2020, 50, 4602–4615.
- Tufano, M.; Watson, C.; Bavota, G.; Di Penta, M.; White, M.; Poshyvanyk, D. Deep learning similarities from different representations of source code. In Proceedings of the 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), Gothenburg, Sweden, 28–29 May 2018; pp. 542–553.
- Ruberg, P.; Meinberg, E.; Ellervee, P. Software Parser and Analyser for Hardware Performance Estimations. In Proceedings of the 2022 International Conference on Electrical, Computer and Energy Technologies (ICECET), Prague, Czech Republic, 20–22 July 2022; pp. 1–6.
- Garion, C.; Hattenberger, G.; Pollien, B.; Roux, P.; Thirioux, X. A Gentle Introduction to C Code Verification Using the Frama-C Platform. ISAE-SUPAERO; ONERA–The French Aerospace Lab; ENAC. 2022. Available online: https://hal.science/hal-03625208/ (accessed on 6 January 2023).
- Feng, J.; Rong, C.; Sun, F.; Guo, D.; Li, Y. PMF: A privacy-preserving human mobility prediction framework via federated learning. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2020, 4, 1–21.
- Chen, Z.; Yang, C.; Zhu, M.; Peng, Z.; Yuan, Y. Personalized Retrogress-Resilient Federated Learning Toward Imbalanced Medical Data. IEEE Trans. Med. Imaging 2022, 41, 3663–3674.
Composition of the first dataset (the OWASP Benchmark categories):

CWE Number | Vulnerability Type | Positive Samples | Negative Samples | Total |
---|---|---|---|---|
CWE78 | cmdi | 906 | 1802 | 2708 |
CWE327 | crypto | 720 | 720 | 1440 |
CWE328 | hash | 707 | 714 | 1421 |
CWE90 | LDAP | 215 | 521 | 736 |
CWE22 | pathtraver | 924 | 1706 | 2630 |
CWE614 | securecookie | 215 | 201 | 416 |
CWE89 | sqli | 1232 | 2297 | 3529 |
CWE501 | trustbound | 220 | 505 | 725 |
CWE330 | weakrand | 2028 | 1612 | 3640 |
CWE643 | XPATH | 130 | 217 | 347 |
CWE79 | XSS | 1909 | 1540 | 3449 |
Composition of the second dataset (drawn from SARD):

CWE Number | Vulnerability Type | Positive Samples | Negative Samples | Total |
---|---|---|---|---|
CWE78 | cmdi | 906 | 1802 | 2708 |
CWE327 | crypto | 720 | 720 | 1440 |
CWE328 | hash | 707 | 714 | 1421 |
CWE90 | LDAP | 215 | 521 | 736 |
CWE643 | XPATH | 264 | 264 | 528 |
CWE614 | securecookie | 17 | 17 | 34 |
CWE89 | sqli | 1320 | 1320 | 2640 |
CWE80 | XSS | 792 | 792 | 1584 |
CWE15 | External Control of System | 264 | 264 | 528 |
CWE113 | HTTP Request Splitting | 792 | 792 | 1584 |
CWE129 | Improper Validation of Array | 1584 | 1584 | 3168 |
Code Representation | Classifier | Accuracy (%) | Recall (%) | F1 (%) | Time Cost (s) |
---|---|---|---|---|---|
ST | GRU | 93.35 | 93.54 | 93.33 | 675.89 |
PST | GRU | 97.59 | 97.61 | 97.48 | 428.31 |
Classifier | Accuracy (%) | Recall (%) | F1 (%) | Time Cost (s) |
---|---|---|---|---|
LSTM | 27.20 | 27.20 | 24.01 | 345.19 |
BiLSTM | 32.26 | 32.26 | 26.00 | 322.96 |
CNN | 33.31 | 33.31 | 25.14 | 505.17 |
SVM | 7.28 | 7.28 | 5.22 | 213.34 |
Decision Tree | 13.72 | 13.72 | 8.93 | 216.91 |
GRU | 97.59 | 97.61 | 97.48 | 504.78 |
DeepForest | 99.13 | 99.13 | 99.13 | 376.33 |
Method | Accuracy (%) | Recall (%) | F1 (%) |
---|---|---|---|
SVM+TF-IDF | 44.16 | 44.16 | 27.06 |
SVM+N-gram | 49.93 | 49.93 | 33.26 |
SVM+LDA | 43.91 | 43.91 | 26.80 |
LSTM | 43.81 | 8.33 | 5.08 |
BiLSTM | 89.26 | 89.26 | 89.90 |
MCDF | 91.97 | 85.83 | 86.59 |
TextCNN | 86.36 | 86.36 | 86.30 |
ASTNN | 93.35 | 93.54 | 93.33 |
TextGCN | 90.60 | 90.04 | 89.96 |
PSTDF | 99.13 | 99.13 | 99.13 |
Method | Accuracy (%) | Recall (%) | F1 (%) |
---|---|---|---|
SVM+TF-IDF | 49.40 | 49.40 | 33.62 |
SVM+N-gram | 49.76 | 49.76 | 33.06 |
SVM+LDA | 47.28 | 47.28 | 33.94 |
LSTM | 50.02 | 7.69 | 5.13 |
BiLSTM | 93.68 | 93.68 | 93.50 |
MCDF | 97.63 | 89.74 | 92.64 |
TextCNN | 94.76 | 94.76 | 93.58 |
ASTNN | 95.95 | 95.21 | 95.71 |
TextGCN | 89.39 | 89.05 | 89.05 |
PSTDF | 99.32 | 99.32 | 99.33 |
Method | Accuracy (%) | Recall (%) | F1 (%) |
---|---|---|---|
SVM+TF-IDF | 85.45 | 85.45 | 85.56 |
SVM+N-gram | 84.66 | 84.66 | 84.99 |
SVM+LDA | 0.79 | 0.008 | 0.001 |
TextGCN | 79.16 | 78.31 | 78.21 |
PDG+GGNN | 79.61 | 79.61 | 79.74 |
TBCNN | 94.01 | 94.01 | 94.14 |
ASTNN | 97.22 | 97.29 | 97.23 |
PSTDF | 97.70 | 97.73 | 97.72 |