1. Introduction
The digital twin, a digital replica of a physical asset that can safeguard the asset as well as optimize its operations, is becoming the most promising enabling technology for smart manufacturing in the Industry 4.0 concept [1,2]. Machine learning (ML) models are also widely used in the digital twinning of cyber-physical systems for various purposes [3]. For example, Min et al. [4] integrated ML and industrial big data in their digital twin to optimize petrochemical production. Snijders et al. [5] used a temporal convolutional neural network for the digital twin of a cyber-physical energy system to predict its responsiveness to specific power setpoint instructions. In addition, Xu et al. [6] adapted a generative adversarial network model in their digital twin to detect anomalies.
In the digital twinning process, real-time model updating is essential so that the twin can continuously mimic the dynamic changes of the physical asset. However, this topic is not well addressed in the literature, and only a few studies have investigated model updating of the digital twin so far. Wang et al. [7] proposed a model updating scheme based on parameter sensitivity analysis, which was essentially a direct error correction between their digital twin and the physical system. Wei et al. [8] used a consistency retention method for their computer-numerical-control machine tool to detect the performance attenuation in the digital twin and then update the finite element model. Farhat et al. [9] built a numerical model to simulate the data for ML predictive models in a digital twin, and then demonstrated the importance of the updated parameters for prediction accuracy. Adam et al. [10] further pointed out the importance of model updating to limit error amplification in a healthcare setting. As far as we are aware, however, no study on the real-time updating of ML models in the digital twin has been reported in the literature.
It is well known that the performance of ML models depends largely on their training datasets [11,12]. In our previous study on the anomaly detection framework for the digital twin of water treatment facilities [13], the ML model trained with one dataset performed poorly on another test dataset because the two differed in coverage, defined as the interval between the minimum and maximum target values in the dataset. In other words, if a target value falls outside the range seen in the training dataset, it is normally difficult to predict it with good accuracy in the test dataset [14]. Thus, it is always desirable to carry out real-time ML model updating continuously to expand coverage so that the ML predictions in the digital twin are accurate for ongoing operations. However, this issue has not yet received proper attention and is the focus of the present study.
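The notion of coverage described above can be illustrated with a minimal sketch. The function names `coverage` and `is_out_of_range` and all sample values are hypothetical and chosen only for illustration; the interval [6.99, 8.24] is the original range reported for Dataset 1 in Section 4.

```python
from typing import Sequence, Tuple


def coverage(targets: Sequence[float]) -> Tuple[float, float]:
    """Return the coverage of a dataset: the (min, max) interval of its target values."""
    return min(targets), max(targets)


def is_out_of_range(value: float, cov: Tuple[float, float]) -> bool:
    """True if a target value lies outside the coverage of the training dataset."""
    low, high = cov
    return value < low or value > high


# Hypothetical training targets spanning the range [6.99, 8.24].
train_targets = [6.99, 7.40, 7.85, 8.24]
cov = coverage(train_targets)

# Hypothetical operational targets. Values outside the coverage are the ones
# a model trained on train_targets would be expected to predict poorly.
stream = [7.10, 6.50, 8.30, 8.00]
flags = [is_out_of_range(v, cov) for v in stream]
print(cov, flags)
```

In this sketch, only the flagged samples would be candidates for expanding the training dataset.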
Computational speed is crucial for real-time model updating. If the updating can be completed within one time step, it can be performed continuously, and the synchronicity framework would not be needed in theory. Unfortunately, this is not the situation in most cases; hence, approaches for incremental learning need to be considered to speed up the model updating and incorporate new knowledge from the real-time data while retaining existing knowledge [15]. Importantly, ML models can also suffer from catastrophic forgetting when learning new information incrementally [16,17,18]. For example, eXtreme Gradient Boosting (XGBoost) adopts the exact greedy algorithm to enumerate all possible splits for each tree node and to determine the best splits [19], and it is difficult for XGBoost to achieve the best splits with incremental learning [20]. Other tree-boosting methods opt to integrate new classifiers and discard outdated ones in classification tasks [21], but this approach is infeasible for regression tasks. Therefore, more research on incremental ML is still needed.
In the present study, we propose a novel framework with real-time model updating that serves two simultaneous objectives: providing predictions based on the current ML models while generating updated ML models in the background. As far as we are aware, it is the first framework to deal with ML model forecasts and ML model updates in parallel. New coverage-based updating algorithms are developed to determine the sufficiency of training datasets and to speed up the updating cycle. Histogram selection and area selection are also established to optimize the database for better model-updating performance. This framework can be broadly applied to predictive models in real-time digital twins with out-of-range issues. It is tested and verified in a prototype water treatment facility called the Secure Water Treatment (SWaT) system, hosted at iTrust in Singapore.
In the following, the details of the real-time data-processing framework and coverage-based updating algorithms are first described in Section 2. Subsequently, the prototype facility, datasets and models used in this study are introduced in Section 3. The implementation of the framework in the SWaT system is presented with discussions of training and operational parameters in Section 4. Finally, a conclusion is drawn in Section 5.
4. Results and Discussion
The original performance without the implementation of the data-processing framework is used as the benchmark in all the following discussions. Dataset 1 was chosen as the original training dataset, and Dataset 2 was used to simulate the real-time operational data streams from the digital twin as mentioned in Section 3.1. Dataset 2 had a wider coverage of target values than Dataset 1, with more than four times the number of samples.
Figure 3 shows the target values of both Datasets 1 and 2, with the blue area highlighting the samples within the original range [6.99, 8.24]. In Dataset 2, many samples were lower than the minimum of the original range, and sporadic samples were higher than the maximum.
Figure 3c compares the predictions from the original ML model with the true observations from the physical system. Despite the small size of Dataset 1 for training, all the targets within the original range were well predicted by the ML model. Figure 3c also shows that nearly all out-of-range samples were predicted poorly because the training dataset was insufficient, with a high MAE of 0.53 for the total of 3591 samples in the red area. In the following discussions, only the out-of-range area (i.e., the red area in the figure) is discussed, and samples within the original range were not included during the real-time model updating because they were already well predicted.
As discussed previously, the computational time required is crucial for real-time model updating. Here, the training time of each model during the model-updating process was tracked with frequency density = 4000 and number of bins = 10. The model was trained 139 times during the real-time simulation with operational data from Dataset 2. The training time increased progressively as the training dataset expanded and had nearly doubled by the last training run, as shown in Figure 4. The increase in training time during operation remained acceptable for this pilot water treatment facility, SWaT, owing to the fast computational speed of the NGBoost model on a laptop processor (Intel® Core™ i7-8565U CPU @ 1.80 GHz) and the limited size of the training datasets. However, the general concern about time needs to be highlighted because other ML models can take much longer to train, which would significantly worsen the model-updating performance. For example, deep neural networks, such as long short-term memory networks, can take ten times longer to train. Thus, optimizing the model updating, as well as utilizing faster hardware, is essential to meeting the real-time accuracy requirements.
To further illustrate the impact of the computational training time, an investigation was carried out with extra time added to each training process. The predictions and observations during the period of 7000–12,000 s are shown in Figure 5a, with their differences displayed in Figure 5b. The prediction error can clearly be significant during this period. On the other hand, the predictions obtained with the original training time, or with an extra 60 s, fitted the observations well, and the corresponding red and magenta lines in Figure 5b remain close to 0. Under all other conditions, the prediction lines in Figure 5a are visibly jagged, particularly over the long interval between 9200 s and 11,600 s. This is because the predictions are closest to the observations just after each reloading of a new model; until the next round of model updating completes, the existing model produces erroneous results in the out-of-range areas. In this framework, the model is continually retrained when new out-of-range data appear, and thus the training time determines the width of each jag.
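The link between training time and the jagged prediction error can be reproduced qualitatively with a toy simulation. This is only a sketch under strong assumptions and not the framework used in this study: the stand-in model is assumed to predict perfectly inside its coverage and to saturate at the coverage boundary outside it, retraining is assumed to take a fixed number of time steps, and the function name `simulate_mae` and the synthetic signal are hypothetical.

```python
from typing import List, Optional, Tuple


def simulate_mae(signal: List[float], low: float, high: float, train_time: int) -> float:
    """MAE of a toy forecaster whose background retraining takes train_time steps."""
    cov_low, cov_high = low, high  # coverage of the model currently serving forecasts
    pending: Optional[Tuple[int, float, float]] = None  # (finish step, new low, new high)
    total_abs_err = 0.0
    for t, y in enumerate(signal):
        # Reload the retrained model once its training has finished.
        if pending is not None and t >= pending[0]:
            _, cov_low, cov_high = pending
            pending = None
        # Assumed stand-in model: exact inside coverage, clipped at the boundary outside.
        pred = min(max(y, cov_low), cov_high)
        total_abs_err += abs(y - pred)
        # Start a new background training run when out-of-range data appear.
        if pending is None and not (cov_low <= y <= cov_high):
            seen = signal[: t + 1]  # only data observed so far can be learned
            pending = (t + train_time, min(cov_low, min(seen)), max(cov_high, max(seen)))
    return total_abs_err / len(signal)


# Synthetic target that drifts below the original range and then stays there.
signal = [7.0 - 0.01 * min(t, 100) for t in range(300)]
maes = {steps: simulate_mae(signal, 6.99, 8.24, steps) for steps in (1, 10, 100)}
print(maes)
```

In this toy setting the serving model always lags the data by roughly one to two training durations, so the error accumulated between reloads grows with the training time.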
Table 3 summarizes the MAE between the predictions and observations for different amounts of extra training time. The MAE with real-time model updating was 0.01, compared with 0.53 for the initial predictions, indicating that 97% of the out-of-range error can be eliminated. However, the MAE increased dramatically as more extra time was added to the training process, and the improvement from the model updating became negligible when the extra time reached 1800 s (see Table 3). Therefore, we can conclude that the computational training time is critical and that lengthening it can substantially increase the prediction errors.
To optimize the computational training time, the size of the training dataset should be restricted in terms of its frequency density as discussed in Section 2.2; however, this restriction can also lead to a deterioration in prediction accuracy. Here, Dataset 3 was used to investigate the impact of frequency density on prediction accuracy. Samples from Dataset 3 were extracted at different intervals and then combined into new training datasets. During the evaluation, three models were trained with each training dataset, and their average performance is plotted in Figure 6. As expected, models trained on larger datasets (i.e., smaller extraction intervals) tended to perform better, with a smaller MAE and RMSE, as shown in the figure. At the same time, the variations in MAE/RMSE were small when the extraction interval was smaller than 25, as highlighted in the blue area. This suggests a strategy of slightly sacrificing the accuracy of individual predictions for an overall improvement by restricting the frequency density of the training datasets. In addition, the analysis provides a rough estimate of the suitable density for the training dataset. According to the definition of frequency density, the overall density with an extraction interval of 25 was around 7000, implying that the density parameter in the algorithms can be set lower than 7000 because the samples in Dataset 3 were not evenly distributed.
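The extraction step and the two error metrics can be sketched as follows. It is an assumption here that extracting samples "at an interval" means keeping every k-th sample; the helper names `subsample`, `mae` and `rmse` and the toy data are hypothetical and do not reproduce Dataset 3.

```python
import math
from typing import List, Sequence


def subsample(samples: Sequence[float], interval: int) -> List[float]:
    """Keep every interval-th sample (assumed meaning of the extraction interval)."""
    return list(samples[::interval])


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between observations and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between observations and predictions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# A larger extraction interval yields a smaller (and faster to train) dataset.
samples = [float(i) for i in range(100)]
sizes = {k: len(subsample(samples, k)) for k in (1, 5, 25, 50)}
print(sizes)

# Toy observations and predictions to exercise the metrics.
obs, pred = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(mae(obs, pred), rmse(obs, pred))
```

Because RMSE squares each deviation, it penalizes the single large miss in the toy example more heavily than MAE does, which is why the two metrics are usually reported together.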
Figure 7 shows the histograms of Datasets 1 and 2 with different bin widths. Unprocessed datasets are usually uneven, as shown in Figure 7. If reducing the computational training time is an urgent requirement in a specific case, the samples in the original training dataset can even be screened out according to the minimum frequency density suggested above, and new samples can be added only to the sparse areas during the histogram selection step. In this case, the ML models with the original training dataset already performed well with an acceptable computational training time, and therefore the screening was not necessary. It can also be observed from Figure 7 that histograms with a smaller bin width resolve the distribution more finely, and that the frequency can change dramatically with the bin width. For example, the highest frequency of Dataset 2 with a bin width of 0.1 was ~2400, while that with a bin width of 0.02 was much smaller at ~700. This demonstrates the necessity of choosing frequency density, instead of the frequency itself, as the key parameter for the coverage-based model-updating algorithms.
Figure 8 plots the frequency density of the different histograms. The orange lines with and without markers represent the density distribution of Dataset 1 with a bin width of 0.1 and 0.02, respectively. It can be observed that their density distributions were consistent with each other in general. In addition, the blue lines from Dataset 2 showed similarity as well, and frequency densities of both bin widths were in the same order of magnitude. Thus, the results show that the effects due to frequency density and bin width are relatively independent.
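The difference between frequency and frequency density can be checked with a short calculation. The sketch assumes the conventional histogram definition, frequency density = frequency / bin width; the definition actually used by the algorithms is the one given in Section 2.2. The peak frequencies are the approximate values quoted for Dataset 2, and the helper name `frequency_density` is hypothetical.

```python
def frequency_density(frequency: float, bin_width: float) -> float:
    """Frequency density of a histogram bin (assumed: frequency divided by bin width)."""
    return frequency / bin_width


# Approximate peak frequencies of Dataset 2, keyed by bin width.
peak_frequency = {0.1: 2400, 0.02: 700}
densities = {w: frequency_density(f, w) for w, f in peak_frequency.items()}

# Raw frequencies differ by a factor of about 3.4 between the two bin widths,
# whereas the densities (about 24,000 and 35,000) stay within the same order of magnitude.
frequency_ratio = peak_frequency[0.1] / peak_frequency[0.02]
density_ratio = densities[0.02] / densities[0.1]
print(densities, round(frequency_ratio, 2), round(density_ratio, 2))
```

Under this assumed definition, a density threshold keeps roughly the same meaning when the bin width changes, whereas a raw frequency threshold would have to be retuned for each bin width.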
Experiments with various frequency densities were also conducted, and their prediction accuracies in terms of MAE are listed in Table 4. When the frequency density was higher than 1000, the performance became steady, indicating that sufficient samples were collected for the training datasets. When the density was smaller than 1000, the MAE of the predictions began to grow, increasing by 60% as the density decreased from 1000 to 20. As discussed above, although a higher density can potentially lead to a higher prediction accuracy, the overall performance may deteriorate because redundant samples in the training datasets lengthen the training time. For example, a frequency density of 4000, rather than 15,000, yielded the best performance in this study according to Table 4.
The effect of the number of bins was also analyzed in this study, and the results are summarized in Table 5. Different numbers of bins with a frequency density of 4000 were investigated first, and the results showed no obvious difference in MAE between the predictions and observations. This can be attributed to the fact that a frequency density of 4000 was sufficient in this case regardless of the number of bins. A smaller frequency density of 100 was also examined, and the prediction accuracy was affected by the number of bins in this case. The MAE increased from 0.017 to 0.019, 0.020 and 0.021 when the number of bins decreased to 50, 10 and 2, respectively. Thus, the overall results showed that the sample distribution is more important when the training dataset is limited, and more bins are needed to achieve a balanced dataset with a narrower sparse area.
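The interplay between the frequency density limit and the number of bins can be sketched with a simple density-capped selector. This is one plausible reading of histogram selection and not the algorithm defined in Section 2.2: the class `HistogramSelector`, its parameters and the example values are hypothetical, and frequency density is again assumed to be frequency divided by bin width.

```python
class HistogramSelector:
    """Accept a new sample only while its bin is below a frequency density cap."""

    def __init__(self, low: float, high: float, n_bins: int, max_density: float) -> None:
        self.low = low
        self.n_bins = n_bins
        self.width = (high - low) / n_bins  # equal-width bins over [low, high]
        self.max_density = max_density
        self.counts = [0] * n_bins

    def _bin_index(self, y: float) -> int:
        # Clamp so that values on or beyond the edges fall into the outermost bins.
        idx = int((y - self.low) / self.width)
        return min(max(idx, 0), self.n_bins - 1)

    def offer(self, y: float) -> bool:
        """Return True and record the sample if its bin is still sparse."""
        i = self._bin_index(y)
        if self.counts[i] / self.width >= self.max_density:
            return False  # bin already dense enough; the sample would be redundant
        self.counts[i] += 1
        return True


# Ten bins of width 0.1 with a density cap of 25, i.e. at most three samples per bin.
selector = HistogramSelector(low=6.0, high=7.0, n_bins=10, max_density=25)
accepted = [selector.offer(y) for y in (6.05, 6.05, 6.05, 6.05, 6.55)]
print(accepted, selector.counts)
```

With more bins, each cap applies to a narrower slice of the target range, so a limited sample budget is spread more evenly and any remaining sparse region is narrower.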
In summary, both the frequency density and the number of bins can affect the overall prediction accuracy by affecting the time needed to reload the model in the forecast updating script. The reloading records of the scenarios listed in Table 4 were tracked and are shown in Figure 9a. Reloading was clearly more frequent in the scenarios with a higher frequency density. In addition, with a smaller density, the reloading process tended to stop much earlier because the dataset size had already hit the limit. For example, the reloading records for densities of 40 and 20 only appeared before 11,000 s, while the models continued to be reloaded after 14,000 s for densities higher than 1000. Similarly, Figure 9b shows the records of the scenarios with a density of 100 in Table 5. The performance with different numbers of bins can be explained by the reloading process as well, in that a smaller MAE comes with more model updates. Thus, it can be summarized that a more frequent reloading process will eventually lead to better overall performance in the coverage-based model-updating framework.