In this section, the performance of the proposed improved DBO-Catboost detection model is compared and evaluated using publicly available datasets.
4.3. Data Preprocessing
Data preprocessing is an essential step in machine learning, involving cleaning, transforming, and standardizing raw data before feeding it into a model. The main purpose of data preprocessing is to improve data quality, enhance model performance, and increase generalization ability.
Typically, data preprocessing involves several steps:
Step 1: Data cleaning. The primary objective is to remove duplicate and invalid records from the dataset; remaining missing values are handled by imputation.
Step 2: Data type conversion. Machine learning algorithms typically require numerical inputs, so non-numeric features are converted into a numerical format before training.
Step 3: Standardization. Features in the dataset may span very different scales or value ranges, so they are standardized to simplify the learning process of the model.
Step 4: Feature engineering. The Botnet dataset consists of 83 features, many of which are redundant, and these redundant features can negatively impact model training. In this study, principal component analysis (PCA) [36] is used for feature dimensionality reduction. A sketch of this preprocessing pipeline is shown below.
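A minimal sketch of these four steps, assuming the Botnet data is loaded from a CSV file with a Label target column (the file name, column name, and the 95% retained-variance threshold are illustrative assumptions, not the paper's exact settings):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("botnet.csv")  # hypothetical file name

# Step 1: data cleaning -- drop duplicates, impute remaining numeric gaps.
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Step 2: type conversion -- encode non-numeric features as integers.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Step 3: standardization -- zero mean, unit variance per feature.
X = df.drop(columns=["Label"])
y = df["Label"]
X_scaled = StandardScaler().fit_transform(X)

# Step 4: feature engineering -- PCA keeps enough components to explain
# 95% of the variance (the retained ratio is assumed for illustration).
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
```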
It is worth noting that Catboost has strong built-in handling of missing values: it automatically detects and processes missing data, eliminating the need for additional imputation and reducing the time required for data preprocessing before training. Nevertheless, to ensure consistency across all experiments in this section, the preprocessed dataset was used for every model.
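To illustrate this capability, a minimal example on synthetic data (Catboost treats NaN in numerical features natively; its nan_mode setting defaults to "Min"):

```python
import numpy as np
from catboost import CatBoostClassifier

# Tiny synthetic feature matrix containing missing values.
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X, y)  # trains directly, no manual imputation required
```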
4.5. Experimental Results
The classification performance of the Catboost algorithm depends heavily on parameters such as the maximum number of trees (iterations), tree depth (depth), and learning rate (learning_rate). In this study, the dung beetle optimizer was used to optimize these parameters for Catboost. The iteration count for DBO-Catboost was initially set to 100; after several preliminary experiments, 30 iterations proved sufficient for the model to converge, so the iteration count was set to 30. The number of beetle populations was set to 40, and the parameter search ranges were set as follows: iterations = [500, 1000], depth = [4, 16], learning_rate = [0.01, 0.2].
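A minimal sketch of this setup, showing only the fitness evaluation that scores a candidate (iterations, depth, learning_rate) position by cross-validated AUC; the 3-fold split is an assumption, and the DBO update rules themselves are omitted:

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

# Search bounds from the experimental setup above.
BOUNDS = {"iterations": (500, 1000), "depth": (4, 16), "learning_rate": (0.01, 0.2)}

def fitness(position, X, y):
    """Score one beetle position (its, depth, lr) by cross-validated AUC."""
    its, depth, lr = position
    model = CatBoostClassifier(
        iterations=int(its),
        depth=int(depth),
        learning_rate=float(lr),
        verbose=False,
    )
    # AUC is used as the fitness function, as in the experiments.
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
```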
To verify the optimization effectiveness of the improved DBO algorithm on the model, particle swarm optimization (PSO), grey wolf optimization (GWO), and the original dung beetle optimization (DBO) algorithms were used for comparison. The number of iterations was set to 30, and the AUC value of the model was used as the fitness function. The results are shown in Figure 2.
From the results in Figure 2, it can be observed that the improved DBO algorithm converged fastest, reaching convergence after 16 iterations. In terms of the AUC value, after 30 iterations the IDBO-Catboost model had the highest AUC, indicating better optimization performance. Compared with the unimproved DBO algorithm, the improved DBO algorithm started from a higher initial fitness value, reached a higher final fitness value, and converged in fewer iterations. This indicates that the improved DBO generated a better initial population, closer to the optimal solution, requiring fewer iterations to converge on the vicinity of the best solution. This result validates the effectiveness of our improvement to the generation rules of the initial DBO population.
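The exact initialization rule of the improved DBO is defined earlier in the paper and is not restated in this section; purely as an illustration of chaos-based population initialization, a tent-map variant might look as follows (the map choice and constants are assumptions, not the paper's rule):

```python
import numpy as np

def tent_map_init(pop_size, dim, lb, ub, seed=0):
    """Generate an initial population whose positions follow a tent chaotic map."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.01, 0.99, size=dim)  # avoid the map's fixed points
    pop = np.empty((pop_size, dim))
    for i in range(pop_size):
        # Tent map step: spreads iterates more evenly over (0, 1)
        # than plain uniform sampling tends to in low dimensions.
        x = np.where(x < 0.5, 2.0 * x, 2.0 * (1.0 - x))
        pop[i] = lb + x * (ub - lb)  # scale chaos values into the search bounds
    return pop

# e.g., 40 beetles over (iterations, depth, learning_rate)
init = tent_map_init(40, 3, np.array([500, 4, 0.01]), np.array([1000, 16, 0.2]))
```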
The experimental results on the Botnet dataset, comparing Catboost, PSO-Catboost, GWO-Catboost, and the original DBO-Catboost models, are shown in Table 4.
According to the information in Table 4, the optimization strategy of IDBO-Catboost effectively found the best parameter combination for the model, improving its performance. In a vertical comparison with Catboost, IDBO-Catboost achieved improvements of 2.17% in both accuracy and F1 score. In a horizontal comparison, IDBO-Catboost outperformed PSO-Catboost and GWO-Catboost, with improvements of 1.21% and 1.38% in F1 score and accuracy, respectively. Compared with the original DBO-Catboost, IDBO-Catboost showed a further improvement of 0.35% in accuracy and F1 score; although modest, such a gain is still meaningful in real-world deployments. The main reason for these differences is that the improved DBO algorithm increases the size and diversity of the population by optimizing the oviposition area, thereby making better use of the search space. Additionally, by optimizing the strategy for generating the initial population, the algorithm starts closer to the optimal solution within the search space, enhancing its search capability. As a result, the improved DBO algorithm can explore the search space more comprehensively and achieve better global search capability, obtaining the optimal parameter combination for the Catboost algorithm.
Figure 3 shows the ROC curves of the compared models. As Figure 3 shows, the ROC curve of IDBO-Catboost lies closest to the top-left corner, indicating that IDBO-Catboost achieved the best balance between sensitivity and specificity. Additionally, the AUC value of IDBO-Catboost was the highest among all the comparison models, reaching 0.99. This result demonstrates that the model has good performance and high discriminative ability across different thresholds.
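For reference, ROC curves and AUC values of this kind can be produced as follows; the synthetic data below merely stands in for the Botnet test split:

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Botnet data, just to show the ROC computation.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = CatBoostClassifier(iterations=200, verbose=False).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # positive-class scores
fpr, tpr, _ = roc_curve(y_te, proba)     # points of the ROC curve
print("AUC:", auc(fpr, tpr))
```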
After validating the optimization effect of IDBO on the Catboost algorithm, we compared it with recently proposed models, namely BO-GP-DT (2020), GWO-OCSVM (2020), PSO-ONE-SVM (2021), BO-LGBM (2022), and the original DBO-Catboost. The parameter settings of these models are shown in Table 5, and the experimental results on the Botnet dataset are presented in Table 6.
From the results in Table 6, it can be observed that IDBO-Catboost achieved the highest accuracy, 96.21%, among all the comparative models, outperforming BO-GP-DT, BO-LGBM, GWO-OCSVM, PSO-ONE-SVM, and the original DBO-Catboost by 1.04%, 0.63%, 0.72%, 0.55%, and 0.35%, respectively. Similarly, IDBO-Catboost demonstrated superior precision, recall, and F1 score compared to the other models, indicating the best overall classification performance.
There are three main reasons for these results. Firstly, compared to decision trees and SVMs, gradient-boosting tree-based algorithms such as Catboost and LGBM perform better on large-scale datasets, and the Catboost algorithm is particularly effective in handling noisy data. Secondly, for optimization algorithms, the ability to balance global and local search is crucial: in the dung beetle optimizer, the breeding behavior ensures that new individuals have better fitness, the foraging behavior accelerates local search, the stealing behavior exploits better solutions discovered during the global search, and the rolling behavior increases the algorithm's diversity, helping it escape local optima and further improving its global search capability. Thirdly, the improved DBO algorithm generates an initial population that is closer to the optimal solution, allowing it to find the optimal parameter combination more reliably. Through the combined effect of these behaviors, the algorithm balances global and local search, enhancing its search ability and optimization performance and effectively improving the detection and classification performance of the model.
To evaluate the generalization ability of the proposed method, the Bot-IoT dataset [37] was used for performance assessment. The Bot-IoT dataset is severely class-imbalanced, containing a small amount of normal data and a large amount of attack data, so it was processed with a combination of the SMOTE oversampling technique and an undersampling technique to obtain a balanced dataset. The data preprocessing workflow is shown in Figure 4, and the sample distribution before and after processing is illustrated in Figure 5.
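A minimal sketch of this rebalancing step, combining SMOTE oversampling with random undersampling via imbalanced-learn; the sampling ratios below are illustrative assumptions, not the paper's exact settings:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Synthetic imbalanced stand-in: ~95% attack (majority), ~5% normal (minority).
X, y = make_classification(n_samples=10000, weights=[0.95, 0.05], random_state=0)

resampler = Pipeline([
    # SMOTE synthesizes minority samples up to 50% of the majority count...
    ("smote", SMOTE(sampling_strategy=0.5, random_state=0)),
    # ...then random undersampling trims the majority to a 1:1 ratio.
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=0)),
])
X_bal, y_bal = resampler.fit_resample(X, y)
```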
Table 7 presents the performance of the IDBO-Catboost and the comparison models on the processed Bot-IoT dataset.
From the information in Table 7, it can be observed that IDBO-Catboost achieved higher detection accuracy on the Bot-IoT dataset than the other models, with an accuracy and F1 score of 98.57%. This indicates that the hyperparameters identified by IDBO-Catboost were highly effective and that the model generalizes well. The main reason for this result is the dynamically changing parameter R in the dung beetle optimizer, which allows the algorithm to adapt better to different problems and enhances the model's generalization performance.
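For reference, in the original DBO formulation this parameter is typically defined as R = 1 − t/T_max, where t is the current iteration and T_max is the maximum number of iterations; as R decreases over the run, the oviposition and foraging regions contract around the current best solution, gradually shifting the search from exploration to exploitation.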
In addition to performance metrics such as accuracy and F1 score, the average time (the sum of model training and prediction time) of the proposed model and the other detection models on the Botnet and Bot-IoT datasets is compared in Table 8. It can be observed that the IDBO-Catboost detection model requires less time than the other detection models, because the IDBO-Catboost model can utilize GPU acceleration for training, improving the detection efficiency of the model. Compared with the DBO-Catboost and Catboost models, IDBO-Catboost also requires fewer iterations, resulting in a shorter training time.
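For reference, GPU training in Catboost is enabled through a constructor flag; the hyperparameter values below are placeholders rather than the tuned optimum:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=500,
    depth=8,
    learning_rate=0.1,   # illustrative values, not the optimized parameters
    task_type="GPU",     # train on GPU instead of CPU
    devices="0",         # GPU device index
    verbose=False,
)
```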
In summary, this paper conducted experiments on two IoT botnet-related datasets, Botnet and Bot-IoT. Compared with existing detection models, the proposed improved DBO-Catboost detection model demonstrated the best classification performance in terms of accuracy and detection efficiency. Therefore, it can be effectively applied to IoT botnet detection tasks.