4.4. Presentation of the Result of Training Models
In the initial evaluation using the open data source dataset, all models achieved moderate performance, with noticeable variation in recall (see
Table 5). Extra Trees yielded the highest accuracy (89.82%), but this was coupled with a relatively low recall (68.54%) and an F1-score of 0.7727, suggesting the model leaned toward conservative classifications with a high number of true negatives. In contrast, XGBoost demonstrated the strongest F1-score at 0.8306, reflecting a more balanced trade-off between precision (0.8954) and recall (0.7745). Random forest followed closely, offering a comparable balance (F1-score: 0.7810). These results indicate that while Extra Trees appeared superior by accuracy alone, XGBoost provided the most reliable performance for practical detection in this domain.
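The reported F1-scores follow directly from the harmonic mean of precision and recall. As a quick sanity check, the XGBoost figures above reproduce the stated value:

```python
# F1 is the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# XGBoost on the open-source dataset: precision 0.8954, recall 0.7745.
print(round(f1_score(0.8954, 0.7745), 4))  # 0.8306
```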
To address this performance gap, the models' hyperparameters (including tree depth) were fine-tuned to improve the detection of terrorism-related terms.
Table 6 summarizes the hyperparameter tuning process conducted on machine learning models using the open data source dataset. For each model, including Decision Tree, Bagging, XGBoost, random forest, and Extra Trees, specific hyperparameters were adjusted across defined value ranges. The best-performing configurations were selected based on improvements in model accuracy, precision, recall, and F1-score. For instance, setting the max_depth to 15 and using entropy as the splitting criterion significantly improved the decision tree model’s sensitivity, while increasing the number of estimators in ensemble methods like Bagging, random forest, and Extra Trees enhanced generalization. The adjustments made to these hyperparameters demonstrated clear performance gains over the default settings and helped achieve a more balanced and reliable detection of terrorism-related terms from open text sources.
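A tuning run of this kind can be sketched with scikit-learn's grid search. The values max_depth=15 and criterion="entropy" are those reported as best for the Decision Tree; the other grid candidates and the synthetic data are illustrative assumptions, not the study's actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF features used in the study.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Illustrative grid: max_depth=15 and criterion="entropy" were reported as
# best for the Decision Tree; the remaining candidates are assumptions.
param_grid = {"max_depth": [5, 10, 15, None], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```

The best configuration found by the search is then compared against the default settings on the held-out metrics.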
Fine-tuning resulted in notable enhancements across all models, especially in accuracy, precision, and recall (refer to
Table 7). Notably, Extra Trees again achieved the highest accuracy (94.31%), but its recall (71.97%) remained lower than that of XGBoost (81.32%). XGBoost reached the highest F1-score (0.8707), outperforming all other models, including random forest (F1-score: 0.8202). This underscores the effectiveness of boosting techniques in capturing nuanced lexical features in unstructured text. Although Extra Trees retained strong precision (92.97%) and accuracy, its relative weakness in recall suggests it may fail to identify all relevant terrorism-related terms, particularly in ambiguous cases.
The accuracy trends among the optimized models were compared (see
Figure 1). The comparison shows how each model's accuracy benefited from hyperparameter tuning, with every model exceeding its initial accuracy after optimization. Cross-validation was used not as part of the final model but as an evaluation mechanism during training, preventing overfitting and ensuring robust parameter selection. This comparison highlights the positive impact of fine-tuning and optimization across the different models.
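The role of cross-validation as an evaluation mechanism, rather than part of the final model, can be sketched as follows; the candidate configuration and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=15, random_state=0)

# Cross-validation only scores a candidate configuration during tuning.
candidate = RandomForestClassifier(n_estimators=200, max_depth=15, random_state=0)
scores = cross_val_score(candidate, X, y, cv=5, scoring="accuracy")
print(scores.mean())

# Once the best configuration is chosen, it is refit once on the full
# training data; the cross-validation folds play no part in the final model.
final_model = candidate.fit(X, y)
```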
To further affirm the performance of the fine-tuned models, ROC curves were examined; these present the complete spectrum of performance across different classification thresholds.
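An ROC analysis of this kind can be reproduced with scikit-learn; the model choice and synthetic data below are illustrative, not the study's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = BaggingClassifier(random_state=1).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# The ROC curve sweeps the decision threshold over the predicted
# probabilities; the AUC summarizes the curve in a single number.
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)
print(round(auc, 3))
```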
In terms of the Decision Tree (see
Figure 2), the ROC curve indicates that the model performs well, with a strong ability to distinguish between positive and negative classes, suggesting that it reliably identifies true cases of interest. The shape of the curve indicates a good balance between the true positive rate and the false positive rate.
The Bootstrap Aggregating (Bagging) model ROC curve (see
Figure 3) represents excellent performance, suggesting the model is very good at distinguishing between the positive and negative classes. The smoothness of the curve, without sharp jumps, implies that Bagging provides a balanced approach that reduces the instability often seen in single Decision Trees. The high AUC score and the shape of the curve suggest that this Bagging model is highly reliable for detecting true positive cases without a significant increase in false positives.
The ROC curve for the XGBoost model indicates that the model is highly effective at distinguishing between the positive and negative classes (see
Figure 4). The sharp rise and flattening of the curve suggest that XGBoost’s boosting mechanism effectively captures relevant features and relationships, providing strong predictive performance.
Similarly, the ROC curve for the random forest model achieves a good balance with the smoothness of the curve (see
Figure 5). This demonstrates that random forest is consistent and handles variance well, capturing relevant patterns without overfitting.
The Extra Trees model ROC curve demonstrates that the model is proficient at distinguishing between positive and negative classes (see
Figure 6). Extra Trees uses a high degree of randomness in both feature selection and split points, which helps reduce overfitting and improve generalization.
The ROC curve for the XGBoost model again indicates that it is highly effective at distinguishing between the positive and negative classes. XGBoost uses boosting techniques that iteratively correct the errors of previous rounds, making it highly effective at capturing complex patterns in data. The smooth, high-reaching curve signifies that XGBoost is stable and provides consistent performance across different threshold values (see
Figure 7).
The analysis of the GTD dataset produced several promising outcomes. This phase represents the second stage of the experimental training, employing the GTD dataset. The training results indicate that all models performed adequately.
When applied to the GTD dataset before tuning, the models followed a similar pattern. Extra Trees again achieved the highest accuracy (91.04%) but with modest recall (69.23%), yielding an F1-score of 0.7725. XGBoost outperformed all other models with an F1-score of 0.8247, based on its strong recall (76.61%) and precision (89.31%). This balance is critical in minimizing false negatives, making XGBoost particularly valuable in applications requiring high sensitivity. Random forest and Bagging provided solid mid-range performance, reinforcing their consistency, but again fell short of XGBoost in both recall and the F1-score.
Table 8 shows the model training performance on the GTD dataset.
A fine-tuning strategy similar to that employed in the initial study was also applied here. Model performance metrics improved significantly after fine-tuning and cross-validation.
Table 9 presents the hyperparameter tuning details for models trained on the Global Terrorism Database (GTD). Similar to the open data analysis, key parameters such as tree depth, the number of estimators, learning rate, and sampling strategies were tuned to optimize model performance. The GTD dataset, being more structured and feature-rich, benefited from slightly different optimal settings; for instance, the Decision Tree performed better using the gini criterion, and ensemble methods like XGBoost and random forest achieved notable improvements with deeper trees and increased estimators. These fine-tuned values led to higher accuracy and recall, essential for identifying complex patterns in terrorism data. Overall, the tuning ensured that each model was better adapted to the GTD’s characteristics, improving predictive reliability compared to default configurations.
Post-tuning, XGBoost maintained its dominance, achieving the highest F1-score of 0.8786, with precision (93.89%) and recall (82.52%) both at outstanding levels (see
Table 10). Extra Trees reached the highest accuracy (94.22%) but lagged in recall (74.11%) and thus produced a slightly lower F1-score (0.8259). Random forest also showed strong balance (F1-score: 0.8255), but not to the same extent as XGBoost. These results reinforce the earlier insight that accuracy alone is not sufficient to assess model utility. XGBoost’s high recall and overall balance across metrics make it the most effective model for this high-stakes application, where missing positive cases is significantly more detrimental than classifying neutral terms as risky.
After fine-tuning, the ROC curves for each model indicate a substantial improvement in their ability to distinguish between positive and negative classes.
The decision tree model now demonstrates strong discriminatory power with a significantly improved AUC of 0.93. This indicates a more reliable classification, although not as high as the ensemble models. It shows the model’s ability to reduce misclassification after fine-tuning (see
Figure 8).
The Bagging approach shows a high AUC of 0.96, reflecting an effective ensemble strategy to stabilize and improve the model’s performance (see
Figure 9). This enhancement indicates that the model can now distinguish positive and negative instances with greater accuracy, making it suitable for applications requiring stable prediction.
XGBoost achieves an AUC of 0.95, indicating robust performance with high discriminatory power. This improvement suggests that XGBoost is effectively leveraging boosted trees to enhance its ability to identify true positive instances, making it a reliable choice for complex predictive tasks (see
Figure 10).
With an AUC of 0.98, the random forest model shows excellent performance, achieving near-optimal classification accuracy. The high AUC signifies the model’s strength in handling complex patterns, making it one of the most reliable models for terrorism prediction in this dataset (see
Figure 11).
Extra Trees also achieves a near-perfect AUC of 0.98, indicating excellent model performance. This high score highlights its ability to accurately distinguish classes, confirming the effectiveness of ensemble methods with randomized trees after fine-tuning (see
Figure 12).
The observed differences in model performance can be more deeply understood by considering the structural characteristics of the algorithms in relation to the properties of the datasets. For instance, XGBoost consistently demonstrated superior F1-scores across both datasets, which aligns with its gradient boosting mechanism that iteratively reduces residual errors and adapts to difficult cases—an advantage when dealing with ambiguous or overlapping lexical patterns common in terrorism discourse. Its ability to model complex interactions between features allows it to capture nuanced co-occurrence patterns that simpler models may overlook. In contrast, the Extra Trees algorithm, while achieving the highest overall accuracy, tended to produce conservative predictions, as evidenced by its relatively lower recall. This is attributable to its use of extreme randomization in feature splits, which, although effective in reducing variance and avoiding overfitting, may fail to capture subtle patterns associated with positive (i.e., terrorism-relevant) terms, especially in the presence of class imbalance.
Random forest and Bagging models displayed more balanced profiles, benefiting from ensemble averaging to mitigate overfitting while still capturing moderately complex relationships. However, these models may lack the iterative error–correction refinement seen in boosting algorithms. The single decision tree model, while interpretable and efficient, suffered from relatively low recall, indicating susceptibility to both overfitting and underfitting depending on tree depth.
The data characteristics further amplify these behaviors. The open-source dataset is lexically rich but semantically noisy, with high-dimensional TF-IDF vectors and considerable synonymy. This setting favors models that can handle sparse and noisy input (e.g., XGBoost), whereas more rigid algorithms may underperform. The GTD dataset, although more structured, still contains categorical and lexical ambiguity that benefits from models capable of fine-grained feature interaction modeling. The results suggest that the algorithmic structure should be matched to the data complexity. Boosting-based models like XGBoost appear better suited to handling heterogeneous, imbalanced, and lexically complex terrorism data, while ensemble bagging methods provide robustness but may require careful tuning to avoid recall deficiencies.
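The "high-dimensional TF-IDF vectors" mentioned above can be illustrated with a minimal sketch; the corpus here is a tiny illustrative stand-in for the study's far larger and noisier open-source data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus, standing in for the open-source dataset.
docs = [
    "explosion reported after the attack near the checkpoint",
    "security checkpoint increases surveillance and monitoring",
    "community meeting discusses local privacy concerns",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(tfidf.shape)  # (number of documents, vocabulary size)
```

Even on real corpora of modest size, the resulting matrix is sparse and high-dimensional, which is the regime where boosting-based models were observed to cope best.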
4.5. Presentation of Analysis on a Practical Application Related to Human–Machine Interaction
This paper’s framework can be adapted for real-time analysis, where machine learning models continuously monitor online content and provide alerts to human analysts when potential threats are detected. This real-time interaction between humans and machines is essential for proactive threat detection and response, which is a key requirement in the Future Internet. A typical scenario is presented in
Table 11. This is a case where a practical application of this research relates to human–machine interaction. Predefined TF-IDF scores, mapped to the study's results, determine whether a given text should be automatically flagged as suspicious. High TF-IDF scores (above 0.06) are flagged because they relate strongly to known threat keywords, and co-occurrence with threat-related terms further increases the likelihood of automatic flagging. Machine learning predictions also contribute: models with higher accuracy (Extra Trees, XGBoost) are more likely to flag content that matches known patterns.
Based on the study’s findings and model predictions, the flagged term “Bombing” has a high TF-IDF score, meaning it appears frequently in terrorism-related content (see
Table 11). It is also strongly associated with words like “Explosion” and “Attack”, which indicate violence or criminal intent. “Security Checkpoint” is not flagged because it relates to defensive measures rather than offensive or terrorist activities; while it co-occurs with words like “Surveillance” and “Monitor”, these are neutral or security-related terms.
Another practical application of this research to human–machine interaction is the association of online user conversations with the study’s results, which determines whether an AI-powered chatbot, or any monitoring interface, should raise a flag. A typical AI-powered chatbot case is presented in
Table 12. Three response flags are defined to establish the chatbot’s response to radicalization indicators: “Provide de-radicalization content (mild warning, education, intervention)”, “Redirect to human counselors (high risk, immediate attention needed)”, and “No intervention (conversation is neutral or non-threatening)”. This system ensures that an AI-powered chatbot can “Detect early signs of radicalization and prevent escalation”, “Offer soft interventions through education and alternative perspectives”, or “Escalate severe cases to human experts before a threat manifests”.
The statement “They will pay for this injustice!” (Redirect to Human Counselors) expresses anger and intent for retribution, indicating a potential escalation toward violence. The co-occurrence of “revenge” and “attack” aligns with high-risk radical speech patterns found in extremist narratives. In the AI chatbot case, this is flagged as serious and the user is redirected to a human counselor for immediate intervention, a decision that draws on the Extra Trees prediction (94.31% accuracy) (see
Table 12). Similarly, the statement “Government surveillance is too much” receives no intervention. It expresses concern about government surveillance, a common civil rights issue, with no direct call to violence, extremism, or radicalization. The words “Privacy” and “Freedom” are frequently used in legitimate political discussion rather than extremist rhetoric, so the chatbot does not intervene; the message falls within normal discourse. This establishes a practical application of human–machine interaction that balances AI automation with human intervention, minimizing false positives while detecting genuine radicalization risks.
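The three-way routing described above can be sketched as follows. The keyword heuristics stand in for the trained model’s prediction and are illustrative assumptions; in the study the decision would come from the classifier’s output, not from string matching:

```python
# Sketch of the three-way chatbot routing. The keyword lists are illustrative
# stand-ins for the trained model's prediction.
HIGH_RISK = {"revenge", "attack", "pay for this"}
MILD_RISK = {"they are all corrupt", "us versus them"}

def route(message: str) -> str:
    text = message.lower()
    if any(k in text for k in HIGH_RISK):
        return "redirect_to_human_counselor"       # high risk, immediate attention
    if any(k in text for k in MILD_RISK):
        return "provide_deradicalization_content"  # mild warning / education
    return "no_intervention"                       # neutral or non-threatening

print(route("They will pay for this injustice!"))    # redirect_to_human_counselor
print(route("Government surveillance is too much"))  # no_intervention
```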