3.1. Experimental Setup
Figure 2 shows a schematic of the experimental setup.
This setup is based on a Hardware-in-the-Loop (HIL) simulation containing the embedded Device Under Test (DUT), in this case a high-voltage battery controller together with real and simulated sensors and actuators. The electrically simulated sensors and actuators in the HIL configuration emulate the behavior of their real counterparts, enabling supervised testing of the DUT. The physical sensors and actuators are part of the real configuration and provide real input and output signals to and from the DUT, enabling realistic test scenarios.
The HIL simulation is controlled from a PC using dedicated software. For the simulation, a virtual representation of the vehicle is created as a physics-based model that simulates the behavior of the real vehicle (its dynamics, control systems and its reactions to various inputs). This makes it possible to assess the functionality and performance of the DUT in a controlled environment.
The setup is equipped with a software logger, a data acquisition system that records and stores the information generated during the testing process in the test environment. It captures signals from the sensors, actuators and the DUT to analyze system behavior and identify potential problems. During the execution of test cases, the test oracle assesses the correctness of the results or behavior of the DUT against the expected results. It serves as a benchmark for evaluating the success or failure of tests, helping to identify deviations from expected behavior. In the case of a negative test result, it is the task of the AVESYS framework to validate that result using a previously trained neural network.
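A minimal sketch of this verdict-validation flow is given below; the function and object names (evaluate_test_case, oracle, avesys_validator) are hypothetical placeholders, since the concrete interfaces of the framework are not specified in this section.

```python
# Minimal sketch of the oracle/AVESYS flow described above; all interfaces are assumed.
def evaluate_test_case(dut_log, expected_results, oracle, avesys_validator):
    """Assess a DUT log against the oracle; on a negative verdict, let the
    previously trained neural network judge whether the failure is genuine."""
    verdict = oracle.compare(dut_log, expected_results)   # pass/fail vs. expected behavior
    if verdict.passed:
        return "PASS"
    # Negative result: AVESYS validates it with the trained anomaly model
    anomaly_confirmed = avesys_validator.predict(dut_log)
    return "FAIL (confirmed)" if anomaly_confirmed else "FAIL (needs manual review)"
```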
The objective of the experiment is to evaluate the performance and quality of the different artificial intelligence algorithms used in the proposed AVESYS framework.
3.3. Experimental Results
The first stage of the study focuses on the analysis of unsupervised algorithms. In the first step, to validate their performance and assess their quality, the algorithms are tested on the ODDS dataset.
In general, the evaluation of multiple anomaly detection algorithms on both training and test datasets reveals varying degrees of effectiveness across different metrics. The results highlight the diversity in performance, emphasizing the importance of selecting an algorithm based on specific priorities and the nature of the data. It is noteworthy that the Isolation Forest (IForest) algorithm stands out for its robust performance, consistently achieving high Roc_Auc_Score and precision values and a low mean squared error (MSE). Other algorithms, such as ECOD, PCA and INNE, also show commendable performance across multiple metrics.
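As an illustration of how such a benchmark could be set up, the following sketch evaluates the four detectors with PyOD and scikit-learn on an ODDS-style dataset; the file name, contamination rate and default hyperparameters are assumptions made for the example, not the settings used in this study.

```python
# Hedged sketch: unsupervised anomaly detectors evaluated on an ODDS-style benchmark.
import numpy as np
from scipy.io import loadmat
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_score, mean_squared_error
from pyod.models.iforest import IForest
from pyod.models.ecod import ECOD
from pyod.models.pca import PCA
from pyod.models.inne import INNE

data = loadmat("cardio.mat")                      # hypothetical ODDS benchmark file
X, y = data["X"], data["y"].ravel()               # y: 0 = normal, 1 = outlier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

detectors = {
    "IForest": IForest(contamination=0.1, random_state=42),
    "ECOD": ECOD(contamination=0.1),
    "PCA": PCA(contamination=0.1),
    "INNE": INNE(contamination=0.1),
}

for name, clf in detectors.items():
    clf.fit(X_train)                              # unsupervised: labels are not used for fitting
    scores = clf.decision_function(X_test)        # higher score = more anomalous
    preds = clf.predict(X_test)                   # binary labels from the contamination threshold
    print(f"{name}: ROC AUC={roc_auc_score(y_test, scores):.3f}, "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"MSE={mean_squared_error(y_test, preds):.3f}")
```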
The next stage of the experimentation is to apply the aforementioned algorithms, with their best-performing hyperparameter configurations, to a real embedded test dataset. Results are shown in Table 3 and Table 4.
The obtained results are unsatisfactory. Modifying the hyperparameters did not improve anomaly detection performance on the logs from the real embedded systems under test.
The study therefore returns to the first stage and repeats it for supervised algorithms. The results of this stage are presented in Table 5 and Table 6.
The excellent ROC AUC results (1.0) with zero mean squared error obtained for the CatB+, LGB, XGB+, ResNet and FTTransformer algorithms demonstrate the high anomaly detection performance of the supervised algorithms on the ODDS dataset. These algorithms are also evaluated on test data obtained from real embedded systems. Results are presented in Table 7 and Table 8.
To summarize the obtained results, some algorithms perform exceptionally well on the training set, but their performance on the test data indicates potential over-fitting (LGB, CatB+). The decrease in performance observed for several models when moving from the training set to the test set suggests that some models have difficulty generalizing to new, unseen data. Mean squared error (MSE) values provide insight into the accuracy of the regression predictions; in general, the MSE values are low, indicating accurate predictions, and they are relatively consistent across algorithms. The SVM algorithm consistently achieves excellent Roc_Auc_Score and precision on both the training and test datasets, implying robust generalization capabilities, and its MSE values are consistently low. Similar results are reported for the MLP algorithm, but its lower performance on the training set suggests potential over-fitting or high sensitivity to the training data.
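A possible way to reproduce such a train-versus-test comparison is sketched below with scikit-learn, XGBoost, LightGBM and CatBoost; the synthetic data stands in for the embedded-log features, the hyperparameters are placeholders, and a large gap between training and test ROC AUC is taken as the over-fitting indicator discussed above.

```python
# Sketch of a train-vs-test ROC AUC comparison for the supervised models.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Synthetic stand-in for the labeled embedded-log features (about 10% anomalies)
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "MLP": MLPClassifier(max_iter=500),
    "SVM": SVC(probability=True),
    "XGB+": XGBClassifier(n_estimators=200),
    "LGB": LGBMClassifier(n_estimators=200),
    "CatB+": CatBoostClassifier(iterations=200, verbose=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # A large train-test gap is the over-fitting indicator mentioned in the text
    print(f"{name}: train AUC={auc_train:.4f}, test AUC={auc_test:.4f}, gap={auc_train - auc_test:+.4f}")
```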
In the next step of this study, the K-fold cross-validation technique is employed, entailing the partitioning of the learning dataset into K equally sized segments. K-1 subsets of the data are utilized for model training (train data), while the remaining subset serves for validation (test data). This technique involves K iterations, each time reserving a different subset for validation. Excluding the training samples from those used to evaluate candidate parameter values reduces the likelihood of overfitting, thereby enhancing the generalization of the classifier [33].
The real embedded test dataset is divided into 10 subsets, and the model training and testing processes are repeated 10 times. Subsequently, the average values of the parameters determining the model’s quality are computed. The dataset contains 231,880 samples.
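A minimal sketch of this 10-fold procedure is shown below; the SVM classifier, the synthetic stand-in data and ROC AUC as the averaged quality indicator are illustrative assumptions, not the exact configuration used in the study.

```python
# Minimal 10-fold sketch: split, train on 9 folds, validate on the held-out fold,
# then average the quality indicator over all folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(X, y):
    model = SVC(probability=True).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[test_idx], scores))

# Average value of the quality indicator over the 10 trials
print(f"mean test ROC AUC over 10 folds: {np.mean(fold_scores):.4f}")
```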
For the examined algorithms, confusion matrices are determined for both training and test data, calculated as the sum of values across all trials. Results are shown in Table 9 and Table 10.
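The summation of per-fold confusion matrices can be sketched as follows; the classifier and the synthetic data are again illustrative.

```python
# Sketch of accumulating per-fold confusion matrices into a single summed matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

total_cm = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    model = SVC().fit(X[train_idx], y[train_idx])
    total_cm += confusion_matrix(y[test_idx], model.predict(X[test_idx]), labels=[0, 1])

tn, fp, fn, tp = total_cm.ravel()   # counts summed over all 10 trials
print(f"Tn={tn}, Fp={fp}, Fn={fn}, Tp={tp}")
```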
The SVM and XGB+ algorithms achieved a high count of true negatives (Tn) and true positives (Tp) and a relatively low count of false positives (Fp) and false negatives (Fn) on both training and test data, indicating effective classification of negative instances and the ability of the models to classify negative instances effectively in unseen data as well. In summary, the algorithms generally show high performance in classifying both negative and positive instances, with some differences in the balance between true and false positives.
The cross-validation results, after computing the average values of the quality indicators, are presented in Table 11 and Table 12. The research results demonstrate the performance of five different algorithms, MLP, SVM, XGB+, LGB, and CatB+, across various evaluation metrics, including Roc_Auc_Score, precision, R2 Score, Matthews Correlation Coefficient (MCC), and Balanced Accuracy (BA).
All algorithms demonstrate excellent discriminatory power, with Roc_Auc_Score close to or equal to 1.0000 for both training and test data. The SVM, XGB+, and CatB+ algorithms show perfect Roc_Auc_Score, indicating optimal performance in distinguishing between classes. For all algorithms, the precision values are consistently high for both training and test data, reflecting the ability of the models to correctly identify positive instances. SVM, XGB+, and CatB+ algorithms stand out with precision values approaching or equal to 1.0000 on the test data. R2 scores, indicating the percentage of variance explained by the models, are generally high for all algorithms on the test data. XGB+ demonstrates perfect R2 scores, suggesting an excellent fit to the data. The MCC values, which measure the quality of binary classifications, are consistently high across algorithms and datasets. The XGB+ and SVM algorithms show particularly high MCC scores, indicating reliable classification performance. Balanced accuracy scores are consistently high, indicating a well-balanced performance between sensitivity and specificity for all algorithms. The XGB+ and SVM algorithms show the highest balanced accuracy scores on the test data.
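For completeness, the quality indicators listed above can be computed with scikit-learn as sketched below; the label vectors are toy values, not results from the tables.

```python
# Computing the quality indicators from Tables 11 and 12 with scikit-learn.
from sklearn.metrics import (roc_auc_score, precision_score, r2_score,
                             matthews_corrcoef, balanced_accuracy_score)

y_true = [0, 0, 0, 1, 1, 0, 1, 0]   # illustrative ground-truth anomaly labels
y_pred = [0, 0, 0, 1, 1, 0, 1, 1]   # illustrative binary predictions

print("Roc_Auc_Score:", roc_auc_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("R2 Score:", r2_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("BA:", balanced_accuracy_score(y_true, y_pred))
```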
The performance of the various algorithms is also assessed through the Mean Squared Error (MSE) and the Mean Squared Logarithmic Error (MSLE), which are presented in Table 13 and Table 14. Each algorithm's ability to minimize these metrics provides insight into its effectiveness in capturing the underlying patterns within the dataset.
The Multilayer Perceptron (MLP) demonstrates competitive performance, with an MSE of 0.0082 and an MSLE of 0.0035. These values indicate reasonable precision of the model, although further optimization may be explored to potentially enhance its performance. The Support Vector Machine (SVM) outperforms the other models in terms of precision, exhibiting the lowest MSE (0.0008) and MSLE (0.0004). This suggests that SVM is highly effective in minimizing errors and capturing variability within the data. The ensemble models, represented by XGB+, LGB, and CatB+, consistently outperform the individual models, presenting MSE and MSLE values that are significantly lower than those of the MLP. This highlights the effectiveness of ensemble methods in improving predictive accuracy. The lower MSLE values across all algorithms compared to the MSE values suggest that the models perform well across different magnitudes of predictions, which is particularly important in scenarios where predictions cover a wide range of values. The varying performance of the different algorithms indicates sensitivity to algorithmic choices; understanding the strengths and weaknesses of each algorithm is crucial for selecting the most suitable model for the specific requirements of the task.
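The two error metrics can be obtained analogously; as before, the label vectors are illustrative only.

```python
# MSE and MSLE as reported in Tables 13 and 14; MSLE requires non-negative values,
# which holds for 0/1 anomaly labels.
from sklearn.metrics import mean_squared_error, mean_squared_log_error

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 0, 1, 1]

print("MSE: ", mean_squared_error(y_true, y_pred))
print("MSLE:", mean_squared_log_error(y_true, y_pred))
```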
In summary, the evaluated algorithms consistently perform well across a range of metrics, showcasing their effectiveness in handling the classification task. The choice of the most suitable algorithm may depend on specific priorities, such as precision, interpretability, or computational efficiency, as indicated by the different strengths observed in the metrics.
3.4. Cross-Validation
After fitting the model to the training data, it is imperative to verify whether the trained model also performs adequately when exposed to real-world data. It is essential to confirm that the model has captured genuine patterns in the data rather than excessive noise. The performance of the machine learning model was assessed through cross-validation, which involved training the model on a subset of the input data and testing it on an unseen dataset [34].
A cross-validation of the investigated algorithms was performed using the entire dataset from the previous studies: the real embedded test dataset serves as the training data, while the test data represent new, unseen data collected as logs from the real test system while executing a test case distinct from the preceding ones. In this case, the training dataset contains 231,880 samples (including 24,081 outliers, i.e., about 10%), while the test dataset contains 146,178 samples (including 12,040 outliers, i.e., about 8%). The results of the algorithm evaluation are presented in the form of quality indicators in Table 15 and Table 16, MSE and MSLE in Table 17 and Table 18, and the confusion matrices in Table 19 and Table 20.
The MLP, SVM, XGB+, LGB and CatB+ algorithms sustained high performance on the test data, achieving high scores in ROC AUC, precision, R2 score, MCC, and balanced accuracy (BA), although with slight decreases compared to the training set. ResNet performance remained non-optimal on the test set, highlighting its difficulty in generalizing and capturing the nuances of the data. FTTransformer showed mixed results, with improvements in some metrics, but still lagged behind the other algorithms.
MLP, SVM, XGB+, LGB, and CatB+ consistently performed well in terms of regression metrics on both training and test datasets, suggesting their reliability for regression tasks. The choice of algorithm may depend on the specific requirements of the application, considering the trade-off between computational efficiency and the magnitude of regression errors.
The MLP, SVM, XGB+, LGB, and CatB+ algorithms demonstrated strong performance on the train data, with high true positive (Tp) counts and relatively low false positive (Fp) and false negative (Fn) counts. SVM and XGB+ stood out with perfect true negative (Tn) counts, indicating precise classification of negative instances. ResNet and FTTransformer showed distinctive characteristics. ResNet had a perfect true positive count but misclassified all negative instances, resulting in a high false positive and a low true negative count. FTTransformer exhibited a higher false positive count, indicating challenges in distinguishing negative instances. The MLP, SVM, XGB+, LGB, and CatB+ algorithms continued to demonstrate good performance on the test data, with high Tp counts and relatively low Fp and Fn counts. ResNet performed poorly on the test set, misclassifying all positive instances, which led to high false negative counts. FTTransformer also faced challenges, particularly with a high false positive count, suggesting difficulties in distinguishing negative instances. The choice of the most suitable algorithm should align with the specific needs and constraints of the classification task.
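For reference, the short example below shows how the confusion counts relate to the precision and balanced accuracy discussed in this section; the counts are illustrative toy values, not taken from Table 19 or Table 20.

```python
# Worked example linking confusion counts (Tn, Fp, Fn, Tp) to precision and balanced accuracy.
tn, fp, fn, tp = 9050, 20, 30, 900            # illustrative counts, not study results

precision = tp / (tp + fp)                    # share of predicted anomalies that are real
sensitivity = tp / (tp + fn)                  # true positive rate (recall)
specificity = tn / (tn + fp)                  # true negative rate
balanced_accuracy = (sensitivity + specificity) / 2

print(f"precision={precision:.4f}, BA={balanced_accuracy:.4f}")
```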
In summary, while certain algorithms demonstrated strong performance across various metrics, their ability to generalize to new data varied. The SVM and MLP algorithms stood out as highly reliable models on both the training and test datasets, making them strong candidates for practical applications.