5.2.2. Performance Analysis
In this study, we conducted a feature importance analysis for MQTT traffic using three techniques: PCC, ExtraTreesClassifier, and RandomForestClassifier. From the results of these three techniques, we constructed Figure 3, Figure 4 and Figure 5; our approach relies on these figures to select the essential features. We analyzed the features in four steps. In the first step, we used all 33 features in the dataset to understand their contributions to the decision-making process. In the second step, we kept the 20 highest-ranked features according to the importance rankings of the three techniques. In the third step, we reduced the feature set to the top 15 features. In the final step, we kept the top 10 features with significant relevance across all techniques. These steps prioritize features according to their importance scores from the three techniques, improving the efficiency and interpretability of our models.
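The ranking step behind this procedure can be sketched as follows; the DataFrame, column names, and estimator settings here are illustrative assumptions, not details taken from our experiments:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

def rank_features(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Score every feature with the three techniques used in this study."""
    X, y = df.drop(columns=[target]), df[target]
    # 1) absolute Pearson correlation coefficient of each feature with the label
    pcc = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    # 2) impurity-based importances from ExtraTrees
    et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
    # 3) impurity-based importances from RandomForest
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    return pd.DataFrame({"pcc": pcc,
                         "extra_trees": et.feature_importances_,
                         "random_forest": rf.feature_importances_},
                        index=X.columns)

def top_k(ranks: pd.DataFrame, k: int) -> list:
    # min-max scale each technique's scores, average them, keep the k best
    scaled = (ranks - ranks.min()) / (ranks.max() - ranks.min())
    return scaled.mean(axis=1).nlargest(k).index.tolist()
```

Calling `top_k` with k = 20, 15, and 10 reproduces the successive subsets of the four-step procedure; the averaging rule for combining the three score columns is one reasonable choice among several.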
Based on Figure 3, Figure 4 and Figure 5, we observed that only 17 to 21 of the 33 features had an apparent effect on the decision-making process, while the remaining features had an importance score of zero. We evaluated model training with RF, DT, KNN, AdaBoost, and XGBoost at all four steps, using tuned hyperparameters. We noticed that accuracy improved compared with other studies, and when the features were reduced, overall performance remained nearly unchanged; for some ML techniques, we obtained better accuracy with the top 10 selected features. The main observation is that reducing the feature dimensionality makes the models faster in training and testing, as shown in
Table 6,
Table 7,
Table 8 and
Table 9. It also makes the ML models more effective while enhancing accuracy. Combining feature selection with the hyperparameter optimization strategy efficiently enhances the results and yields accurate cyber-attack detection.
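A minimal sketch of a timed train/test evaluation loop of this kind is shown below; the split ratio, classifier settings, and metric set are assumptions for illustration (XGBoost, an external dependency, would be added to the dictionary the same way):

```python
import time
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative model zoo; hyperparameter values are placeholders.
MODELS = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

def evaluate(models: dict, X, y) -> dict:
    """Fit each model and record metrics plus wall-clock train/test times."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    rows = {}
    for name, model in models.items():
        t0 = time.perf_counter(); model.fit(X_tr, y_tr)
        train_s = time.perf_counter() - t0
        t0 = time.perf_counter(); pred = model.predict(X_te)
        test_s = time.perf_counter() - t0
        rows[name] = {"accuracy": accuracy_score(y_te, pred),
                      "f1": f1_score(y_te, pred),
                      "roc_auc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
                      "train_s": train_s, "test_s": test_s}
    return rows
```

Running `evaluate` once per feature subset (33, 20, 15, 10 columns) yields the per-step comparison of accuracy against training and testing time.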
Table 6,
Table 7,
Table 8 and
Table 9 present the results of applying the four steps based on the PCC, ExtraTreesClassifier, and RandomForestClassifier techniques. These tables also report the evaluation metrics (accuracy, precision, recall, F1-score, and ROC-AUC) of each tuned classifier, showing that the ML methods deliver high accuracy for all models. As observed in all tables, training and testing times decrease as the number of selected features is reduced. Accuracy remains consistently high in all tables, with relatively small variations and the best results in Table 9. RF is the best-performing technique in all tables for accuracy and F1-score. The F1-score, precision, and recall metrics are also stable, indicating the robustness of the models in classifying data points.
In this study, all experiments were run several times with five-fold cross-validation. This helped assess the overall quality and reliability of the models and can reveal issues not clearly visible during the initial training stage.
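The cross-validation check can be sketched as follows; the stratified splitting and the choice of RF as the example classifier are our assumptions:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

def cv_accuracy(X, y, n_splits: int = 5):
    """Mean and spread of accuracy over stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=0),
        X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```

A small standard deviation across folds is the signal of reliability referred to above: it indicates that the accuracy does not depend on one particular train/test split.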
A comprehensive analysis of the results indicates that researchers should focus on the 10 features with the highest importance values to obtain sufficient accuracy, namely [‘mqtt.qos’, ‘mqtt.msgid’, ‘mqtt.len’, ‘tcp.time_delta’, ‘mqtt.msg’, ‘mqtt.hdrflags’, ‘mqtt.dupflag’, ‘tcp.len’, ‘tcp.flags’, ‘mqtt.conack.flags’]. These selected features achieve high performance in detecting cyber attacks on MQTT traffic. They also outperform the full feature set and the 20- and 15-feature subsets in terms of training and testing time while improving overall accuracy. We further observed that each of these ten features influences the model’s accuracy: deleting any one of them decreases accuracy, implying that each feature contributes significantly to the model’s capacity to identify and analyze patterns or attributes within the dataset.
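The leave-one-feature-out observation can be verified with a sketch like the one below; the ten column names come from the list above, while the data and classifier factory are placeholders:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score

# Final feature set identified in this study.
TOP10 = ['mqtt.qos', 'mqtt.msgid', 'mqtt.len', 'tcp.time_delta', 'mqtt.msg',
         'mqtt.hdrflags', 'mqtt.dupflag', 'tcp.len', 'tcp.flags',
         'mqtt.conack.flags']

def ablation(df: pd.DataFrame, target: str, features: list, model_factory):
    """Cross-validated accuracy with all features, and the accuracy drop
    observed when each single feature is removed in turn."""
    base = cross_val_score(model_factory(), df[features], df[target], cv=5).mean()
    drops = {f: base - cross_val_score(model_factory(),
                                       df[[c for c in features if c != f]],
                                       df[target], cv=5).mean()
             for f in features}
    return base, drops  # a positive drop means the removed feature mattered
```

A consistently positive drop for every feature in `TOP10` is what supports the claim that each selected feature contributes to detection performance.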
Table 9 shows the results of the final set of 10 features with various ML techniques. RF achieved the highest accuracy (0.9633) and F1-score (0.9632) among the evaluated models. DT, KNN, and XGBoost trained significantly faster than RF and AdaBoost, which have relatively long training times. For testing time, DT and XGBoost provided the fastest model evaluation. ROC scores are generally high for all methods, with XGBoost achieving the highest (0.9847).
Figure 6 depicts the ROC findings of the developed ML algorithms.
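An ROC curve of this kind can be computed per classifier as sketched below; the fitted model and test split are placeholders:

```python
from sklearn.metrics import roc_curve, auc

def roc_points(model, X_test, y_test):
    """False/true positive rates and AUC for a fitted binary classifier."""
    scores = model.predict_proba(X_test)[:, 1]  # score for the attack class
    fpr, tpr, _ = roc_curve(y_test, scores)
    return fpr, tpr, auc(fpr, tpr)
```

Plotting `tpr` against `fpr` for each of the five classifiers on shared axes produces a comparison figure like Figure 6; the returned AUC is the ROC score reported in the tables.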
ML algorithms significantly enhance the accuracy and effectiveness of a model. Each ML technique has its own strengths, weaknesses, and range of suitable applications; selecting the proper method requires understanding the characteristics of the data. In this study, we examined multiple ML methods to select appropriate models for an effective IDS that detects attacks on MQTT traffic. The DT method was selected for its interpretability, simplicity, and ability to handle non-linear correlations in the data [45]. The KNN algorithm was chosen for its simplicity and ease of implementation [46]. RF and XGBoost were chosen for their popularity and their ensemble learning capabilities, which improve predictions by integrating numerous weak learners [45]. AdaBoost is a popular choice in various ML applications due to its broad applicability and lower susceptibility to overfitting than other algorithms [45]. Tuning the hyperparameters of these models also had a significant impact on performance, and reducing the number of selected features further improves training time, testing time, and accuracy.
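A hyperparameter search of this kind can be sketched with scikit-learn's GridSearchCV; the grid values shown are illustrative assumptions, not the exact search spaces used in our experiments:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def tune_rf(X, y):
    """Exhaustive grid search for RF with five-fold CV, optimizing F1-score."""
    grid = {"n_estimators": [50, 100],       # placeholder values
            "max_depth": [None, 10]}         # placeholder values
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                          cv=5, scoring="f1", n_jobs=-1)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```

The same pattern applies to DT, KNN, AdaBoost, and XGBoost with their respective parameter grids.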
Table 10 shows the results of papers [22,25] and compares them with our proposed method using the following evaluation metrics: accuracy, precision, recall, F1-score, ROC, and runtime for each chosen approach.
As shown in Table 10, paper [22] uses 33 features and paper [25] uses 31, whereas our model uses 10. By focusing on the top 10 features for intrusion detection, our scheme maintains high classification accuracy and achieves superior performance on all evaluation metrics.
The main observation is that our approach significantly reduced training and testing times, while [22,25] report longer times; our method is therefore better suited for applications that require real-time processing. It also optimizes resources better, improves efficiency, responds faster to prevent or minimize the impact of attacks, and scales better for intrusion detection models.
Compared to other studies, our scheme also performs better on various other factors, including dataset preprocessing, data balancing, feature selection, hyperparameter optimization, and cross-validation. These elements make our approach a robust and effective solution for intrusion detection in MQTT IoT traffic. Paper [25] achieved better accuracy than paper [22] across all ML methods. The authors of [22] did not use hyperparameter tuning, while [25] did not report it for some ML methods. Additionally, neither paper employed cross-validation.
Using many feature sets without clear selection criteria raises questions about the significance and effectiveness of the chosen features and can lead to problems such as overfitting and reduced interpretability. Reducing the number of features brings advantages in simplicity and removes irrelevant features, but it may also cause disadvantages such as the loss of valuable information [47]. Various techniques can mitigate this loss when reducing features. In particular, employing several selection techniques rather than relying on a single method helps to explore and compare the essential features within the dataset. In our research, we utilized three techniques (PCC, ExtraTreesClassifier, and RandomForestClassifier) to ensure that the chosen features are reliable and consistent across different methods. According to Table 10, paper [22] utilized a 33-feature set and paper [25] a 31-feature set, indicating that they did not use feature selection. In contrast, we applied the PCC, ExtraTreesClassifier, and RandomForestClassifier feature selection techniques to all 33 features to analyze, compare, and reduce the size of the feature set. Our proposed model then relies on the ten selected features to improve the reliability of the IDS for MQTT traffic, together with other factors such as thorough preprocessing, hyperparameter tuning, optimal selection of ML models, data balancing, and cross-validation. This yields several advantages, such as improved model accuracy, reduced dimensionality, and faster training and testing compared to other studies, as shown in Table 10.
Figure 7 depicts the confusion matrices of all ML models used. The classification of MQTT traffic in the proposed system as either normal or attack messages is evaluated through the confusion-matrix counts TP, FP, TN, and FN. For the ML classifiers: DT correctly identified 46,380 instances as normal (true negatives) and incorrectly classified 3269 normal instances as abnormal; it correctly predicted 46,656 instances as abnormal (true positives) and incorrectly classified 2973 abnormal instances as normal. The KNN method correctly identified 45,593 negatives, incorrectly labeled 4056 instances as abnormal, correctly identified 46,866 positives, and incorrectly labeled 2763 instances as normal. The RF method correctly identified 46,384 normal and 46,656 abnormal instances but misclassified 3265 normal and 2973 abnormal instances. The AdaBoost algorithm produced 45,589 true negatives and 46,878 true positives, along with 4060 false positives and 2751 false negatives. XGBoost showed robust predictive ability, correctly identifying 46,389 true negatives and 46,634 true positives while producing 3260 false positives and 2995 false negatives.
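The four counts reported for each classifier can be extracted from scikit-learn's confusion matrix as sketched below (the model and test data are placeholders; scikit-learn's convention puts true classes on rows and predicted classes on columns):

```python
from sklearn.metrics import confusion_matrix

def confusion_counts(model, X_test, y_test) -> dict:
    """TN/FP/FN/TP counts for a fitted binary classifier."""
    # ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    return {"TN": tn, "FP": fp, "FN": fn, "TP": tp}
```

Applying this to each of the five tuned classifiers on the held-out test set yields the per-model counts summarized above and visualized in Figure 7.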