3.1. Datasets
We used the SWaT [27], WADI [28], SMAP [29], MSL [29], SMD [30], and IMOD datasets, as summarized in Table 1.
SWaT (Secure Water Treatment): SWaT is a test bed dataset designed to simulate a secure water treatment system. It includes data on various stages of water treatment processes and is used for research in cybersecurity and anomaly detection in critical infrastructure. It has 51 features.
WADI (Water Distribution testbed): WADI is a water distribution system that provides data on the distribution and management of water resources. It is used to study anomalies and cybersecurity issues in water infrastructure. The dataset comprises 123 features.
SMAP (Soil Moisture Active Passive): The SMAP dataset consists of satellite data on soil moisture and freeze-thaw states from NASA. It is utilized for monitoring agricultural droughts, improving weather forecasts, and understanding climate dynamics. It includes 25 features.
MSL (Mars Science Laboratory): The MSL dataset contains telemetry data from the Mars Science Laboratory’s Curiosity rover. It includes various measurements related to the rover’s environment and operations on Mars, which are useful for fault detection and diagnostics. This dataset has 55 features.
SMD (Server Machine Dataset): The SMD originates from server machines in a data center and includes metrics such as CPU usage, memory, and network traffic. It is commonly used for studying anomaly detection and predictive maintenance in IT systems. It has 38 features.
IMOD (Industrial Maritime Operational IoT Data): IMOD is collected from engines and associated machinery on ships in actual operation. It includes data related to the internal components of the engine, such as cylinder and exhaust gas temperatures, coolant temperatures, cylinder pressures, oil pressures, engine fuel flow rates, fuel tank levels, and vibration and noise levels. Additionally, it encompasses data on pressure and rotational speed (RPM). A total of 31 key features were extracted and used for data analysis.
Figure 3a shows key data for IMOD. The operating pressure of the cylinder (Pmax) refers to the maximum pressure reached within an engine cylinder during the combustion process; it is a crucial parameter indicating the peak pressure experienced by the cylinder and directly impacts engine performance and efficiency. The cylinder exhaust gas outlet temperature measures the temperature of the exhaust gases as they exit the engine cylinder; abnormal exhaust gas temperatures can signal issues such as incomplete combustion, overloading, or cooling system problems. Bearing temperature refers to the temperature of the bearings in the engine. Bearings support and guide moving parts, and their temperature can indicate the health of the lubrication system and the bearing itself; a high bearing temperature can be a sign of insufficient lubrication, misalignment, or excessive load. Power measures the engine's output in kilowatts (kW) or horsepower (HP) and reflects the engine's ability to convert fuel into mechanical energy. Load refers to the demand placed on the engine, measured as a percentage of the engine's maximum capacity.
3.2. Training Setting
We used an NVIDIA GeForce RTX 3090 GPU for model training.
During the evaluation, the trained model processed the test data windows to predict anomaly scores for each data point. Using a sliding-window strategy, scores were averaged across overlapping intervals, and data points with scores above a chosen threshold were classified as anomalies [23]. We applied the F1 score (F1) over the ground-truth labels and anomaly predictions for evaluation, computing TP (true positives), FP (false positives), and FN (false negatives) to obtain F1 = 2TP/(2TP + FP + FN). The model was designed to classify a window as anomalous if it contained any anomalous data points.
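The evaluation procedure above can be sketched as follows. `evaluate` and its arguments are hypothetical names, and the averaging over overlapping windows is a minimal interpretation of the sliding-window strategy:

```python
import numpy as np

def evaluate(scores_per_window, window_starts, window_size, n_points,
             labels, threshold):
    """Average per-window anomaly scores over overlapping intervals,
    threshold them, and compute the F1 score (illustrative helper)."""
    total = np.zeros(n_points)
    count = np.zeros(n_points)
    for scores, start in zip(scores_per_window, window_starts):
        total[start:start + window_size] += scores
        count[start:start + window_size] += 1
    avg = total / np.maximum(count, 1)   # mean score per data point
    pred = avg > threshold               # points above threshold are anomalies

    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    return 2 * tp / (2 * tp + fp + fn)   # F1 = 2TP / (2TP + FP + FN)
```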
The model architecture uses AnomalyBERT [23] as the backbone: a Transformer encoder with an embedding dimension of 512, a Transformer body with six layers, eight attention heads, and two linear layers. During training, the batch size was 8, and the maximum number of iterations was set to 150,000. Input windows were randomly selected from the training data, and synthetic outliers were generated within them. Three types of outliers were applied in ratios of 6.25:1.875:1.875 for external interval replacement, uniform replacement, and peak noise, respectively. For external interval replacement, the model was trained with one of several techniques: global scaling, local scaling, magnitude warping, arithmetic mean, geometric mean, median, and flip.
As shown in Table 2, different patch sizes were applied to each dataset, ranging from 2 to 14. For the IMOD dataset, a patch size of 4 was chosen, referencing similar benchmark datasets. Window sizes varied from 1024 to 7168, with the IMOD dataset using a size of 2048. To prevent a decline in model performance, the ratio of outliers was kept below a certain threshold; specifically, for the IMOD dataset, the outlier ratio was controlled to not exceed 15% of the total data.
The AdamW optimizer was used with a learning rate of , a warm-up over the first 10% of iterations, and cosine learning-rate decay during the training phase.
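This warm-up-then-cosine schedule can be sketched as below; `base_lr` is a placeholder value, since the paper's learning rate is not reproduced here:

```python
import math

def lr_at(step, max_steps=150_000, base_lr=1e-4, warmup_frac=0.10):
    """Linear warm-up over the first 10% of steps, then cosine decay to 0.
    base_lr is a placeholder, not the paper's actual value."""
    warmup_steps = int(max_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```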
All seven methods for external interval replacement (arithmetic mean, geometric mean, median, global scaling, local scaling, magnitude warping, and flip) were trained individually, and the model with the highest F1 score was selected, as shown in Figure 3b.
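The seven replacement variants might look roughly like this; the scale ranges and the number of warping knots are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def replace_interval(interval, method, rng=None):
    """Seven external-interval-replacement variants (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.asarray(interval, dtype=float)
    if method == "arithmetic_mean":
        return np.full_like(x, x.mean())
    if method == "geometric_mean":
        return np.full_like(x, np.exp(np.log(np.abs(x) + 1e-8).mean()))
    if method == "median":
        return np.full_like(x, np.median(x))
    if method == "global_scale":
        return x * rng.uniform(0.5, 2.0)               # one factor for the interval
    if method == "local_scale":
        return x * rng.uniform(0.5, 2.0, size=len(x))  # per-point factors
    if method == "magnitude_warp":
        # smooth per-point factors interpolated over a few random knots
        knots = rng.uniform(0.5, 1.5, size=4)
        factors = np.interp(np.linspace(0, 1, len(x)),
                            np.linspace(0, 1, 4), knots)
        return x * factors
    if method == "flip":
        return x[::-1].copy()                          # time-reverse the interval
    raise ValueError(method)
```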
3.3. Experimental Results
As shown in Table 3, our approach outperformed AnomalyBERT on all six datasets. We were unable to replicate the original paper's results on the WADI and SMD datasets; however, under the same conditions in our experimental setup, our approach still outperformed the original method. As shown in Figure 7, the bar graph visually confirms that our approach achieves superior performance across all six datasets.
Table 4 shows the F1 scores for external interval replacement using the various outlier generation methods. Flip was the default method for external interval replacement in AnomalyBERT [23]. The performance of the alternative methods (arithmetic mean, geometric mean, median, global scaling, local scaling, and magnitude warping) is summarized in the table.
For the WADI dataset, the global scaling method showed the best performance, resulting in a 5.08% improvement. In the SWaT dataset, local scaling improved performance by 5.69%. Magnitude warping was most effective for the MSL dataset, enhancing performance by 23.69%. The arithmetic mean improved SMAP by 18.38%, while the median was best for SMD with a 5.09% improvement. In the IMOD dataset, the F1 score increased from 0.375 with Flip to 0.54545 with global scaling, marking an improvement of 45.43%.
As seen in Figure 8, the most effective methods vary between different datasets.
Datasets with lower frequency data, like WADI, might benefit more from global scaling adjustments due to fewer fluctuations, whereas datasets with more frequent variations, such as SWaT, might see better performance with local scaling methods.
As seen in Table 5, a sensitivity analysis of the global scale was conducted. The global scale was adjusted using values ranging from mean − std to mean + std, classified as min, 1Q, 2Q, 3Q, and max, and recorded as global scales 0, 1, 2, 3, and 4, respectively.
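One plausible reading of this setup, with the five candidate scale values spaced evenly between mean − std and mean + std:

```python
import numpy as np

def global_scale_values(series):
    """Five candidate global scale factors spanning mean - std to mean + std,
    indexed 0-4 as min, 1Q, 2Q (median), 3Q, max (sketch of the setup)."""
    m, s = np.mean(series), np.std(series)
    return np.linspace(m - s, m + s, num=5)   # [min, 1Q, 2Q, 3Q, max]
```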
The WADI dataset showed the best performance with a global scale of 2 (0.52543); SWaT performed best with a global scale of 4 (0.81739), and MSL with a global scale of 0 (0.54104). SMD performed best with a global scale of 3 (0.25622), as did IMOD with a global scale of 4 (0.54545). WADI, a low-frequency dataset, showed improved performance at the median. The sensitivity analysis did not show a clear pattern for SWaT, SMAP, and MSL. Apart from SMAP, higher global scale values (2Q, 3Q, max) generally improved performance, as seen in Figure 9.
Figure 10 shows the training loss of the model trained with global scaling decreasing over the iterations. Initially, the loss starts near a value of 1, then gradually decreases: after 20,000 iterations, the loss drops below 0.0013115085, and after 150,000 iterations, it decreases to 0.0000551652, indicating that the training progressed well.
Figure 11 shows a graph comparing the original data with the matched anomaly score. The original data, representing Cylinder7 Pmax (peak maximum pressure), usually maintain a value between 50 and 60 bar (for reference, Pmax typically ranges from approximately 80 to 200 bar for a modern marine diesel engine). However, a sudden drop in pressure is observed around indices 10,000 to 11,000, signifying an anomaly; during this interval, the anomaly score from our model spikes close to 1.0. The model effectively detected the anomaly, as indicated by the significant drop in the Pmax value.
The original data in Figure 12 denote the bearing temperature, which typically remains around 75 degrees Celsius. However, an abnormal pattern emerges when the temperature suddenly drops to 45 degrees before rising again. The normal temperature range for general bearings is approximately 60 to 80 degrees Celsius; deviations from this range may indicate issues such as lubrication problems, overload, or misalignment.
The second and third original datasets in Figure 12 represent the cylinder exhaust gas outlet temperature, which typically ranges between 300 and 350 degrees Celsius; here, the temperature suddenly dropped to around 200 degrees and then spiked to over 400 degrees, indicating an abnormal pattern. For diesel engines, excessively high exhaust gas temperatures can indicate combustion issues, overload, or fuel quality problems. Conversely, excessively low exhaust gas temperatures can indicate incomplete combustion, low engine load, or cooling system problems.
The fourth original dataset in Figure 12 represents the cylinder Pmax. The normal range is maintained between 50 and 60 bar, but the value suddenly dropped below 40 and then spiked to around 110, showing an abnormal pattern.
The fifth original dataset in Figure 12 represents the engine load. The normal range is maintained between 10% and 20%, but it suddenly dropped to around 0% and then spiked to over 60%, showing an abnormal pattern. If the engine load is too low, the engine operates inefficiently, and if it is too high, there is an increased risk of overheating and damage.
The sixth original dataset in Figure 12 represents engine power, the amount of output produced by the ship's engine, measured in kilowatts (kW). Normally, it maintains values between 200 and 400 kW, but it suddenly dropped below 100 and then spiked to 1200, showing an abnormal pattern. Engine power is directly related to the propulsion of the ship and is a key factor in determining the ship's speed and maneuverability.
The seventh graph in Figure 12 shows the anomaly score predicted by our model. At the points corresponding to the abnormal patterns observed in the bearing temperature, cylinder exhaust gas outlet temperature, cylinder Pmax, engine load, and engine power, the predicted anomaly score rose significantly.