Article

A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset

1
Informatization Construction and Management Office, Sichuan University, Chengdu 610065, China
2
Big Data Analysis and Fusion Application Technology Engineering Laboratory of Sichuan Province, Chengdu 610065, China
3
College of Computer Science, Sichuan University, Chengdu 610065, China
*
Author to whom correspondence should be addressed.
Sensors 2023, 23(21), 8678; https://doi.org/10.3390/s23218678
Submission received: 1 September 2023 / Revised: 7 October 2023 / Accepted: 18 October 2023 / Published: 24 October 2023
(This article belongs to the Special Issue Data Engineering in the Internet of Things)

Abstract: Batch process monitoring datasets usually contain missing data, which decreases the performance of data-driven modeling for fault identification and optimal control. Many methods have been proposed to impute missing data; however, they do not fulfill the need for data quality, especially in sensor datasets with different types of missing data. We propose a hybrid missing data imputation method for batch process monitoring datasets with multi-type missing data. In this method, the missing data is first classified into five categories based on the continuous missing duration and the number of variables missing simultaneously. Then, different categories of missing data are step-by-step imputed considering their unique characteristics. A combination of three single-dimensional interpolation models is employed to impute transient isolated missing values. An iterative imputation based on a multivariate regression model is designed for imputing long-term missing variables, and a combination model based on single-dimensional interpolation and multivariate regression is proposed for imputing short-term missing variables. The Long Short-Term Memory (LSTM) model is utilized to impute both short-term and long-term missing samples. Finally, a series of experiments for different categories of missing data were conducted based on a real-world batch process monitoring dataset. The results demonstrate that the proposed method achieves higher imputation accuracy than other comparative methods.

1. Introduction

The batch process is an important production mode in the modern manufacturing industry. As a highly flexible production method, it is essential for producing low-volume, high-value-added products such as chemical and biological materials [1,2]. With the rapid development of the Internet of Things and sensing technology [3], the monitoring data of batch processes is being recorded more frequently. However, batch process monitoring data often contains missing values due to factors such as external environmental conditions, link failures, and sensor equipment degradation. This results in incomplete and unreliable batch process monitoring data, which poses a significant obstacle to the subsequent utilization of the data [4]. In particular, missing data degrades the performance of data-driven modeling for fault identification and optimal control in batch processes. Therefore, it is important to study how to handle missing data so as to enhance the quality of batch process monitoring data.
There are mainly two categories of methods to handle missing data: deletion and imputation [5,6]. The deletion method may not only lose valuable information within the data but also destroy the continuity of the time series, leading to inaccurate results in subsequent data analysis. The imputation method involves replacing missing values with predicted values [7], which is more suitable for improving data quality. However, there are few studies that focus on missing data imputation for batch process monitoring datasets. Nomikos et al. [8] employed the mean method for imputing missing values. Laila et al. [9] and Meng et al. [10] introduced a methodology where the unknown observations are calculated using a weighted combination of scores from the current time point in the new batch and previously computed scores from a calibration dataset. Shi et al. [11] established a linear regression model that uses several historical values adjacent to the current time to predict the missing values. Further research is needed, as the imputation results of these methods have shown limited effectiveness.
Due to the characteristics of batch processes, such as multiple operating conditions, multiple batches, and multiple stages, missing data in batch process monitoring datasets usually presents a complex situation, making it challenging to perform accurate imputation. Furthermore, batch process monitoring datasets contain different types of missing data and directly applying an existing single method cannot achieve favorable imputation results. Consequently, how to combine or improve appropriate imputation models to effectively impute missing data within batch process monitoring datasets is still a significant problem to be solved.
In this paper, we propose a hybrid missing data imputation method for batch process monitoring datasets based on single-dimensional interpolation, a multivariate regression model, and LSTM. The main contributions are as follows:
  • We propose a missing data classification method based on the continuous missing duration for each variable and the number of variables missing simultaneously. Then we classify the missing data into five distinct categories: transient isolated missing values, short-term missing variables, long-term missing variables, short-term missing samples, and long-term missing samples.
  • We design and implement the hybrid missing data imputation method to deal with different categories of missing data step by step, taking into account the characteristics of different categories of missing data. This method employs a combination of three single-dimensional interpolation models that enables the automated detection and imputation of transient isolated missing values. We design an iterative imputation based on a multivariate regression model to automatically complete the imputation of all long-term missing variables. To address short-term missing variables, we propose a combination model based on single-dimensional interpolation and multivariate regression by utilizing system fluctuations. We use the LSTM model to impute both short-term and long-term missing samples.
  • We have carried out extensive experiments on a real-world injection molding process monitoring dataset to demonstrate the effectiveness and accuracy of the proposed hybrid missing data imputation method.
The remainder of this paper is structured as follows. Section 2 presents the related works. Section 3 describes the hybrid missing data imputation method designed. Section 4 verifies the validity of the proposed method by taking a real-world injection molding process monitoring dataset as an example. Section 5 presents the conclusions.

2. Related Works

Many imputation techniques have been proposed for different domain-specific datasets [12], primarily involving two categories: statistical and machine learning-based techniques [13,14].
Statistical imputation techniques rely on statistical models to predict missing values. Simple imputation handles missing values by using methods such as the mode, mean, or median of the available values [15]. Hot-deck imputation handles missing values by replacing them with similar object values [16]. Interpolation methods, which mainly include nearest neighbor interpolation, linear interpolation, and spline interpolation, estimate missing values by establishing interpolation functions [17]. These techniques perform imputation based on temporal continuity and are effective in the case of a handful of missing values. Regression imputation involves estimating relationships among variables using regression modeling [18], which typically includes Linear Regression (LR) and Multivariate Linear Regression (MLR). This approach can effectively utilize the correlations between time series data for imputation. Matrix-based methods recover missing data by treating an entire set of series as a matrix and applying techniques based on matrix completion principles [19]. These techniques leverage temporal continuity for imputation and mainly include Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Matrix Factorization (MF), and Centroid Decomposition (CD)-based methods. PCA-based methods, SPIRIT [20] and ROSL [21], are effective for datasets with a limited number of time series or short time series. SVD-based SoftImpute [22] and MF-based TRMF [23] require data to contain repeating trends, while CD-based CDRec [24] is only effective for correlated time series. Pattern-based methods utilize pattern-matching techniques for imputation by leveraging trend similarity. For instance, STMVL [25] derives statistical models from historical data and requires highly correlated time series. DynaMMo [26] employs Kalman filters and Expectation-Maximization (EM) for imputation and is adaptable to datasets with irregular fluctuations.
Machine learning techniques are widely used in various practical application fields, such as air pollution monitoring [27], industrial process monitoring [2], dam safety monitoring [28,29], medical data processing [30], and stock price prediction [31]. To address the challenges posed by missing data, several machine learning-based methods have gained significant popularity [12]. The K Nearest Neighbor (KNN) algorithm [32] works by classifying the nearest neighbors of missing values and using those neighbors for imputation through a distance measure between instances. The Random Forest (RF) algorithm [33,34] constructs multiple decision trees based on the bootstrapping procedure and gives the final predictions by the averaged values or majority votes of each tree's prediction. The K-means clustering algorithm [35] consists of two steps: clusters are first obtained using K-means clustering, and missing values are then handled using the cluster information. These methods utilize the correlation between time series but do not consider the continuity in the time dimension. More advanced neural networks have also been applied to deal with missing values in time series data. The Extreme Learning Machine (ELM) [36] is an efficient machine learning model based on a single-layer feedforward neural network and is suitable for multi-dimensional time series with multiple features. Long Short-Term Memory (LSTM) [37], which is an improved form of Recurrent Neural Networks (RNNs) [38], can effectively learn long-term dependencies for predicting multi-dimensional time series.
In summary, although several imputation methods have been proposed, most of them are typically designed to estimate a specific type of missing data. And these methods often excel only when handling datasets with specific data characteristics. In practical domains, such as batch process monitoring datasets, missing data usually presents a complex situation. These datasets contain different types of missing data, and different types of missing data exhibit distinct characteristics. Applying a single imputation method directly may not be effective. Therefore, further research is still needed on how to conduct classification analysis of missing data and design a hybrid method by employing suitable imputation techniques tailored to the characteristics of different types of missing data.

3. Methodology

3.1. Data Processing

3.1.1. Data Unfolding

For a typical batch process, the monitoring data is stored in a three-dimensional matrix X_ORG ∈ R^(I × J × T), where I represents the number of batches, J the number of process variables, and T the number of sampling moments in a batch. Since the subsequent research on missing data imputation involves analyzing and processing missing variables at different sampling moments, it is necessary to unfold the original three-dimensional data along the batch dimension to obtain the two-dimensional matrices X_i ∈ R^(J × T) of batches i (i = 1, …, I). As shown in Figure 1, I matrix slices are obtained by unfolding the original three-dimensional data along the batch dimension. Each matrix slice represents a set of values for variable j (j = 1, …, J) at sampling moments t (t = 1, …, T).
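As a minimal sketch, the unfolding can be reproduced with numpy: a three-dimensional array of shape (I, J, T) is sliced along the batch axis into I matrices of shape (J, T). The sizes below are illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes: I batches, J process variables, T sampling moments.
I, J, T = 4, 3, 5
X_org = np.arange(I * J * T, dtype=float).reshape(I, J, T)

# Unfold along the batch dimension: one J x T matrix slice per batch.
slices = [X_org[i] for i in range(I)]

assert len(slices) == I
assert slices[0].shape == (J, T)
# Element (j, t) of slice i equals X_org[i, j, t].
assert slices[2][1, 3] == X_org[2, 1, 3]
```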

3.1.2. Missing Data Classifying

This paper assumes that missing data imputation is performed on the denoised dataset; missing data can arise from data acquisition as well as from data denoising. Regarding missing data caused by data acquisition, the causes of missing data in batch process monitoring can be summarized into the following three cases: (1) Production equipment outages, acquisition system failures, or data link failures lead to long or short periods of continuous missing data for many variables; (2) Acquisition equipment failures lead to long or short periods of continuous missing data for a few variables; (3) The instability or aging of acquisition equipment leads to isolated missing values for a few variables.
Based on this cause analysis, the classification rules for missing data are defined as shown in Table 1. Δt represents the continuous missing duration of a variable, and n_v represents the number of variables missing simultaneously during this period. T_0 represents the data sampling interval, Th_t1 represents the time threshold within which the data trend does not change, and Th_t2 represents the time threshold within which the data trend can be predicted. Th_t1 and Th_t2 are set according to the specific situation of different variables and the practical requirements for data analysis. The variable threshold Th_v represents the critical value for the number of variables missing simultaneously over a certain period (longer than Th_t1), and Th_v is set to n/2, where n represents the number of variables in the batch process monitoring dataset.
By calculating the continuous missing duration Δt for each variable and the corresponding number n_v of variables missing simultaneously, and then comparing the calculated results with the threshold values, the missing data is classified into five categories: transient isolated missing values, short-term missing variables, long-term missing variables, short-term missing samples, and long-term missing samples. Short-term and long-term missing variables are categorized as continuous missing variables, while short-term and long-term missing samples are categorized as continuous missing samples. Variables without any missing values are referred to as complete variables, while variables with missing values are referred to as incomplete variables.
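The classification can be sketched as follows. This is an illustrative reading of Table 1 (which defines the exact rules): the thresholds Th_t1, Th_t2, and Th_v are passed in as parameters, a run of NaNs in one variable is treated as one continuous missing period, and a period is assigned to the "missing samples" categories when the number of simultaneously missing variables exceeds Th_v at any moment of the run.

```python
import numpy as np

def classify_missing(X, th_t1, th_t2, th_v):
    """Classify NaN runs in a (variables x time) segment into the five
    categories. th_t1/th_t2 are duration thresholds (in samples); th_v is
    the simultaneous-missing-variable threshold (n/2 in the paper)."""
    n_vars, n_t = X.shape
    miss = np.isnan(X)
    n_v = miss.sum(axis=0)               # variables missing at each moment
    labels = {}
    for j in range(n_vars):
        t = 0
        while t < n_t:
            if not miss[j, t]:
                t += 1
                continue
            start = t
            while t < n_t and miss[j, t]:
                t += 1
            dur = t - start
            if (n_v[start:t] > th_v).any():
                cat = ('short-term missing samples' if dur <= th_t2
                       else 'long-term missing samples')
            elif dur <= th_t1:
                cat = 'transient isolated missing values'
            elif dur <= th_t2:
                cat = 'short-term missing variables'
            else:
                cat = 'long-term missing variables'
            labels[(j, start, dur)] = cat
    return labels

# Toy segment: 4 variables, 12 moments, thresholds chosen for illustration.
X = np.ones((4, 12))
X[0, 3] = np.nan                 # isolated value
X[1, 5:9] = np.nan               # one variable, 4 consecutive moments
X[0:3, 10:12] = np.nan           # 3 of 4 variables missing together
labels = classify_missing(X, th_t1=2, th_t2=5, th_v=2)
assert labels[(0, 3, 1)] == 'transient isolated missing values'
assert labels[(1, 5, 4)] == 'short-term missing variables'
assert labels[(0, 10, 2)] == 'short-term missing samples'
```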

3.2. Missing Data Imputation

3.2.1. Dataset Splitting

Due to the presence of many incomplete variables within the continuous missing samples, it can be considered that a system outage occurred during this period. The data segment with continuous missing samples can be seen as a missing data segment. Therefore, the unfolded dataset needs to be split into several data segments according to the locations of continuous missing samples and then imputed. Assuming that the dataset X is split K − 1 times, the dataset X contains K data segments and K − 1 missing data segments (data segments with short-term or long-term missing samples):
X = [X_1, X*_1, …, X_k, X*_k*, …, X_(K−1), X*_(K−1), X_K]^T  (1)
where X_k (k = 1, …, K) represents the k-th data segment, each data segment X_k contains only transient isolated missing values and short-term or long-term missing variables, and X*_k* (k* = 1, …, K − 1) represents the missing data segment between the k-th and (k + 1)-th data segments.
Variable Missing Proportion (VMP) and Sample Missing Proportion (SMP) are introduced as measures to describe the extent of missing data within each data segment. Taking data segment X_k ∈ R^(m_k × n) as an example, the sample missing proportion SMP_k of X_k and the variable missing proportion VMP_k_j of variable j in X_k are calculated as follows:
SMP_k = 1 − m_int_k / m_k,  VMP_k_j = 1 − m_int_k_j / m_k  (2)
where m_k is the sample size of X_k, n is the number of variables in X_k, m_int_k represents the number of samples without missing values, and m_int_k_j represents the number of values that are not missing in variable j.
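A small sketch of Formula (2), computing SMP_k and the per-variable VMP_k_j with NaNs marking missing values; the segment below is a made-up example.

```python
import numpy as np

# Hypothetical segment X_k with samples as rows (m_k x n).
X_k = np.array([
    [1.0, 2.0,    3.0],
    [1.0, np.nan, 3.0],
    [np.nan, np.nan, 3.0],
    [1.0, 2.0,    3.0],
])
m_k = X_k.shape[0]

m_int_k = np.sum(~np.isnan(X_k).any(axis=1))      # samples with no missing value
smp_k = 1 - m_int_k / m_k                         # Formula (2), sample part
vmp_k = 1 - np.sum(~np.isnan(X_k), axis=0) / m_k  # Formula (2), per variable

assert smp_k == 0.5       # 2 of 4 samples contain missing values
assert vmp_k[1] == 0.5    # variable 2 misses 2 of 4 values
assert vmp_k[2] == 0.0    # variable 3 is complete
```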

3.2.2. Transient Isolated Missing Values Imputation

For transient isolated missing values, the data trend in the time dimension remains unchanged, so the missing values can be estimated using single-dimensional interpolation models based on temporal continuity. Nearest neighbor interpolation, linear interpolation, and cubic spline interpolation are used. Assume that x_(i,j) (the i-th value of variable j) in data segment X_k is missing, and let x̃_(i,j) denote the estimated value of x_(i,j).
(1) Single-dimensional Interpolation Model
Nearest neighbor interpolation: The interpolation function is established using a valid value adjacent to x_(i,j), as shown in Formula (3). The limitation of this method is the discontinuity at x̃_(i,j).
x̃_(i,j) = x_(i−1,j) or x̃_(i,j) = x_(i+1,j)  (3)
Linear interpolation: The interpolation function is constructed using the two valid values adjacent to x_(i,j), as shown in Formula (4). While linear interpolation ensures continuity at x̃_(i,j), it lacks differentiability at the endpoints.
x̃_(i,j) = (1/2)(x_(i−1,j) + x_(i+1,j))  (4)
Cubic spline interpolation: Cubic spline interpolation requires at least four valid values and constructs the interpolation function using the two adjacent values before x_(i,j) and the two adjacent values after x_(i,j), as shown in Formula (5). The detailed construction process can be found in reference [39].
x̃_(i,j) = f_spline(x_(i−2,j), x_(i−1,j), x_(i+1,j), x_(i+2,j))  (5)
When both values x_(i,j) and x_(i+1,j) are missing simultaneously (Th_t1 is set to 2), the interpolation Formulas (3), (4), and (5) need to be reconstructed, respectively, as shown in Formulas (6)–(8).
x̃_(i,j) = x̃_(i+1,j) = x_(i−1,j) or x_(i+2,j)  (6)
x̃_(i,j) = x_(i−1,j) + (1/3)(x_(i+2,j) − x_(i−1,j)),  x̃_(i+1,j) = x_(i−1,j) + (2/3)(x_(i+2,j) − x_(i−1,j))  (7)
x̃_(i,j) = f_spline^(i)(x_(i−2,j), x_(i−1,j), x_(i+2,j), x_(i+3,j)),  x̃_(i+1,j) = f_spline^(i+1)(x_(i−2,j), x_(i−1,j), x_(i+2,j), x_(i+3,j))  (8)
where x̃_(i+1,j) is the interpolated value of x_(i+1,j), and f_spline^(i) and f_spline^(i+1), respectively, represent the cubic spline interpolation functions for x_(i,j) and x_(i+1,j).
(2) Imputation Process for Transient Isolated Missing Values
To impute the transient isolated missing values x i , j in the data segment X k , a combination of the above three interpolation models is employed. Combining these three methods enables the automated detection and imputation of transient isolated missing values, making it an efficient complementary approach. When four adjacent valid values are available, cubic spline interpolation is utilized for imputation. If the four adjacent values do not consist of two values before x i , j and two values after x i , j , the cubic spline interpolation function needs to be adjusted. Taking one value before x i , j and three values after x i , j as an example, the adjusted cubic spline interpolation function is shown in Formula (9).
x̃_(i,j) = f_spline^(i)(x_(i−1,j), x_(i+1,j), x_(i+2,j), x_(i+3,j))  (9)
When the missing value is located at the endpoint of X k , meaning that only one side (either left or right) has an adjacent value, the nearest neighbor interpolation is utilized for imputation. When two adjacent valid values are available, with one before and one after x i , j , the linear interpolation is used for imputation.
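The selection logic above can be sketched for a single one-dimensional series. This is a simplified illustration: nearest neighbor interpolation (Formula (3)) is used at the series endpoints and linear interpolation (Formula (4)) in the interior, while the cubic spline branch (Formula (5)) is omitted for brevity; in practice it could be supplied by, e.g., scipy.interpolate.CubicSpline when four valid neighbours are available.

```python
import numpy as np

def impute_isolated(x):
    """Impute isolated NaNs in a 1-D series: nearest neighbour at the
    endpoints (Formula 3), linear interpolation in the interior (Formula 4).
    Assumes each NaN is isolated, i.e. its neighbours are valid."""
    x = x.astype(float).copy()
    n = len(x)
    for i in np.flatnonzero(np.isnan(x)):
        if i == 0:
            x[i] = x[i + 1]                      # nearest neighbour (left end)
        elif i == n - 1:
            x[i] = x[i - 1]                      # nearest neighbour (right end)
        else:
            x[i] = 0.5 * (x[i - 1] + x[i + 1])   # linear interpolation
    return x

series = np.array([np.nan, 2.0, np.nan, 6.0, np.nan])
out = impute_isolated(series)
assert out[0] == 2.0 and out[2] == 4.0 and out[4] == 6.0
```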

3.2.3. Continuous Missing Variables Imputation

In the case of a long-term missing variable, significant information in the time dimension is seriously lost. The missing values of the long-term missing variable can only be estimated based on the correlation with other complete variables. The multivariate regression model is suitable for imputing missing values for long-term missing variables. The model constructs a regression function between the long-term missing variable and other complete variables based on their correlations. Then, by utilizing the complete variables as input, the missing values of the long-term missing variable can be predicted. In the case of a short-term missing variable, the missing values can be estimated by considering the correlation with other complete variables, together with the data trend in the time dimension. Therefore, a combination model based on single-dimensional interpolation and multivariate regression is proposed to impute the missing values of short-term missing variables by combining the strengths of both models.
(1) Multivariate Regression Model
Three widely used multivariate regression models are chosen for this study: MLR, RF, and KNN. All three models exhibit robustness and require minimal or no parameters. Assume that X_train ∈ R^(m_t × n) and Y_train ∈ R^(m_t × 1) are the input and output of the training data, respectively, and X_test ∈ R^(m_s × n) and Y_test ∈ R^(m_s × 1) are the input and output of the testing data, respectively, where m_t represents the sample size of the training data, n represents the number of variables, and m_s represents the sample size of the testing data.
MLR establishes a linear regression function by considering the correlation between the incomplete variable and other complete variables. Then, the function is utilized to predict the missing values. An advantage of the MLR model is its lack of reliance on hyperparameters. The missing values imputation process using MLR is as follows:
Step 1: Modeling. Construct the MLR function:
Y_train = X_D θ + ε  (10)
where X_D = [I_train, X_train] is the design matrix for X_train, I_train = [1, …, 1]^T ∈ R^(m_t × 1) is a constant vector, ε = [ε_1, ε_2, …, ε_(m_t)]^T ∈ R^(m_t × 1) is the error vector, and θ = [θ_0, θ_1, …, θ_n]^T ∈ R^((n+1) × 1) is the coefficient vector. θ can be estimated by Formula (11):
θ̃ = (X_D^T X_D)^(−1) X_D^T Y_train  (11)
where θ̃ is the estimated value of θ, X_D^T is the transpose of X_D, and (X_D^T X_D)^(−1) is the inverse of X_D^T X_D.
Step 2: Missing values prediction. Estimate Y_test using X_test:
Y_test = X_P θ̃  (12)
where X_P = [I_test, X_test] is the design matrix for X_test, and I_test = [1, …, 1]^T ∈ R^(m_s × 1) is a constant vector.
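Steps 1 and 2 of the MLR imputation can be sketched with numpy on synthetic, noise-free data; the coefficient values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
theta_true = np.array([1.0, 2.0, -1.0, 0.5])   # [theta_0, theta_1, ..., theta_3]
y_train = theta_true[0] + X_train @ theta_true[1:]

# Formula (10)-(11): design matrix with a constant column, normal equations.
X_D = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
theta_hat = np.linalg.inv(X_D.T @ X_D) @ X_D.T @ y_train

# Formula (12): predict the missing variable on test samples.
X_test = rng.normal(size=(5, 3))
X_P = np.hstack([np.ones((5, 1)), X_test])
y_pred = X_P @ theta_hat

assert np.allclose(theta_hat, theta_true)   # exact recovery (no noise)
assert np.allclose(y_pred, theta_true[0] + X_test @ theta_true[1:])
```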
RF is an ensemble learning model based on the Classification and Regression Tree (CART). The RF model requires two hyperparameters, n_estimators and m_features, which respectively represent the number of trees and the number of selected features. The missing value imputation process using the RF model is as follows:
Step 1: RF model training.
Step 1.1: Utilize the Bootstrap resampling method to select n_estimators samples from the original training dataset with replacement, and remove duplicate samples to create a new training dataset D_t = {X_train^(1), Y_train^(1)}.
Step 1.2: Train CART decision trees using dataset D_t to generate the trained CART model CART_model_1. During the training process, randomly select m_features features from all the features, and then identify the optimal feature within the selected features as the splitting point for partitioning each node into left and right segments.
Step 1.3: Repeat Steps 1.1–1.2 n_estimators times to obtain n_estimators CART decision trees, denoted as the prediction model CART_model.
Step 2: Missing values prediction.
Step 2.1: Select the same m_features features as used in the training process to create a new testing dataset X_test^(1).
Step 2.2: Input X_test^(1) into the trained model CART_model_1 to obtain the first prediction result Y_test^(1).
Step 2.3: Repeat Steps 2.1–2.2 until n_estimators prediction results are obtained.
Step 2.4: Calculate the final prediction result Y_test using the mean method:
Y_test = (1 / n_estimators) × Σ_(i=1)^(n_estimators) Y_test^(i)  (13)
The KNN regression model involves three factors [40]: the number of nearest samples (k), the distance measurement method, and the regression prediction rule. The distance measurement method employs the widely used Euclidean distance, while the regression prediction rule is based on the mean method. The appropriate value for k can be determined through cross-validation based on the sample distribution. The missing value imputation process using KNN is outlined below.
Step 1: Calculate the Euclidean distance between the s-th sample x_test,s in X_test and the t-th sample x_train,t in X_train, as shown in Formula (14). Then, calculate the distance between x_test,s and all the m_t samples in X_train to obtain the distance vector D(x_test,s, ·) = [dist(x_test,s, x_train,1), …, dist(x_test,s, x_train,m_t)]^T.
dist(x_test,s, x_train,t) = sqrt( Σ_(i=1)^(n) (x_test,s^(i) − x_train,t^(i))² )  (14)
where x_test,s (s = 1, …, m_s) is the s-th sample in X_test, x_train,t (t = 1, …, m_t) is the t-th sample in X_train, and n is the number of variables.
Step 2: Choose the k nearest samples [x_train,(1), …, x_train,(k)] in X_train according to the k smallest values in the distance vector D(x_test,s, ·).
Step 3: Calculate the average of the values [y_train,(1), …, y_train,(k)] in Y_train that correspond to these k nearest samples, as shown in Formula (15), and set this average value y_s as the predicted value for the sample x_test,s.
y_s = (1 / k) Σ_(i=1)^(k) y_train,(i)  (15)
Step 4: Repeat Steps 1–3 to calculate predicted values for all samples in X_test; then all values in Y_test are obtained.
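Steps 1–4 can be sketched as a small numpy function; the training points below are illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """KNN regression: Euclidean distance (Formula 14), mean rule (Formula 15)."""
    preds = []
    for x in X_test:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # distance to all samples
        nearest = np.argsort(d)[:k]                    # k smallest distances
        preds.append(y_train[nearest].mean())          # mean prediction rule
    return np.array(preds)

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
y_hat = knn_predict(X_train, y_train, np.array([[0.9]]), k=2)
assert np.isclose(y_hat[0], 0.5)   # nearest neighbours are 1.0 and 0.0
```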
(2) Imputation Process for Long-term Missing Variables
Since the multivariate regression model has the limitation that only one variable can be imputed in each process, an iterative method is designed to overcome this constraint. The iterative imputation based on the multivariate regression model can automatically complete the imputation of all long-term missing variables. One of MLR, RF, or KNN is selected as the multivariate regression model model_j. Assume that X_k^(1) ∈ R^(m_k × n) is the data segment after imputing transient isolated missing values, and n_long_j is the number of long-term missing variables in X_k^(1). The iterative imputation based on a multivariate regression model is presented in Algorithm 1.
Algorithm 1 The iterative imputation based on multivariate regression model
Input: X_k^(1) ∈ R^(m_k × n), n_long_j
Output: The imputed data segment X_k^(2) ∈ R^(m_k × n)
1. Begin
2. Calculate the variable missing proportion VMP_j for each long-term missing variable, and sort these variables in ascending order of VMP_j to get x_(·,1), x_(·,2), …, x_(·,n_long_j);
3. Set X^(0) = X_k^(1);
4. For j = 1 to n_long_j:
5.   Split X^(j−1) into a training dataset D_train^(j−1) including only complete variables and a testing dataset D_test^(j−1) including only incomplete variables;
6.   Train the multivariate regression model model_j by inputting X_train^(j−1), formed by the n − n_long_j + (j − 1) complete variables from D_train^(j−1);
7.   Input X_test^(j−1), formed by the n_long_j − (j − 1) incomplete variables from D_test^(j−1), into model_j, and get the predicted values x̃_(·,j) for variable x_(·,j);
8.   Impute X^(j−1) using x̃_(·,j);
9.   Set X^(j) = X^(j−1);
10. Return X_k^(2) = X^(n_long_j);
11. End
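Algorithm 1 can be sketched in numpy, using a least-squares MLR fit as model_j; this is a minimal illustration on a made-up segment, not the authors' implementation.

```python
import numpy as np

def iterative_impute(X):
    """Impute incomplete columns one by one, from the lowest to the highest
    missing proportion; each imputed column is treated as complete for the
    next iteration (Algorithm 1, with MLR as model_j)."""
    X = X.astype(float).copy()
    vmp = np.isnan(X).mean(axis=0)                    # step 2: sort by VMP
    for j in np.argsort(vmp):
        if vmp[j] == 0:
            continue                                  # already complete
        miss = np.isnan(X[:, j])
        others = [c for c in range(X.shape[1])
                  if c != j and not np.isnan(X[:, c]).any()]
        A = np.hstack([np.ones((X.shape[0], 1)), X[:, others]])
        theta = np.linalg.lstsq(A[~miss], X[~miss, j], rcond=None)[0]  # step 6
        X[miss, j] = A[miss] @ theta                  # steps 7-8
    return X

# Made-up segment: column 3 equals col1 + 2*col2, with two values missing.
X = np.array([[1, 2, 5.0],
              [2, 1, np.nan],
              [3, 0, 3.0],
              [4, 1, 6.0],
              [5, 2, np.nan]], dtype=float)
out = iterative_impute(X)
assert np.isclose(out[1, 2], 4.0)
assert np.isclose(out[4, 2], 9.0)
```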
(3) Imputation Process for Short-term Missing Variables
The combination model based on single-dimensional interpolation and multivariate regression is developed for imputing the missing values of short-term missing variables. This combination model is based on the property that a missing variable experiences system fluctuations due to the influence of its related variables. The model utilizes a multivariate regression model to calculate the system fluctuation and incorporate it into the interpolation value. By considering the continuity in the time dimension and the correlation among different variables, this model significantly enhances imputation accuracy by combining the strengths of both models.
Taking cubic spline interpolation and MLR as examples, the combination model for imputing missing values of short-term missing variables is designed. As shown in Figure 2, variable Y in data segment X_k contains short-term missing values from time s_2 to time e_1. s_1 and e_2, respectively, represent the corresponding times with a valid value on the left side of s_2 and on the right side of e_1. The continuous missing duration Δt = e_1 − s_2, with Th_t1 < Δt ≤ Th_t2. Times t_a, t_b, and t_c represent three sampling times in this period. y_a represents the predicted value at time t_a; the imputation process for y_a based on the combination model is shown in Figure 3.
Step 1: Calculate the predicted value y_a1 at time t_a using cubic spline interpolation by Formula (8).
Step 2: Calculate the correlation between variable X and the short-term missing variable Y using Formula (16). If Cov(X, Y) > Th_c, variable X is a correlated variable of Y. Then identify all the correlated variables of Y, denoted as X_j (j = 1, 2, …, n_c).
Cov(X, Y) = Σ_(i=1)^(m_k) (x_i − x̄)(y_i − ȳ) / sqrt( Σ_(i=1)^(m_k) (x_i − x̄)² · Σ_(i=1)^(m_k) (y_i − ȳ)² )  (16)
where x̄ = (1/m_k) Σ_(i=1)^(m_k) x_i, ȳ = (1/m_k) Σ_(i=1)^(m_k) y_i, m_k is the sample size, Th_c is the correlation threshold, and n_c is the number of correlated variables of Y.
Step 3: The variable Y is influenced by its correlated variables, which leads to system fluctuations. The MLR model and cubic spline interpolation are used to calculate the system fluctuation Δ_1 at the y_a1 data level:
Firstly, use the MLR model for regression fitting to describe the relationship between Y and its correlated variables; the corresponding predicted values y_s1, y_s2, y_e1, y_e2, y_a2 at times s_1, s_2, e_1, e_2, t_a are calculated by Formula (12), where the dataset X_c_y ∈ R^(m_j × n_c) formed by all the correlated variables is used as the testing data, m_j is the sample size of X_c_y, and the sample size of I_test in Formula (12) is set to m_j.
Then construct a cubic spline interpolation function based on the values y_s1, y_s2, y_e1, y_e2 by Formula (8), and get the predicted value y_a3 at time t_a.
Finally, calculate the system fluctuation Δ_2 at the y_a3 data level by Formula (17). Since the system fluctuation is influenced by the data level, the relationship between Δ_1 and Δ_2 satisfies Formula (18), so the system fluctuation Δ_1 is calculated by Formula (19).
Δ_2 = y_a3 − y_a2  (17)
Δ_1 / y_a1 = Δ_2 / y_a3  (18)
Δ_1 = (y_a1 / y_a3) Δ_2  (19)
Step 4: Put the system fluctuation back to the original data level, as shown in Formula (20), and then the final predicted value y_a at time t_a is obtained:
y_a = y_a1 − Δ_1  (20)
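A worked numeric example of Formulas (17)–(20), with hypothetical values for the spline prediction y_a1, the MLR prediction y_a2, and the spline through the MLR values y_a3:

```python
# Hypothetical inputs (illustrative values, not from the paper's dataset).
y_a1 = 10.0   # cubic-spline prediction at the original data level (Step 1)
y_a2 = 4.5    # MLR prediction at time t_a (Step 3)
y_a3 = 4.0    # spline through the MLR values y_s1, y_s2, y_e1, y_e2 at t_a

delta2 = y_a3 - y_a2             # Formula (17): fluctuation at the MLR level
delta1 = (y_a1 / y_a3) * delta2  # Formulas (18)-(19): rescale to original level
y_a = y_a1 - delta1              # Formula (20): final prediction

assert delta2 == -0.5
assert delta1 == -1.25
assert y_a == 11.25
```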

3.2.4. Continuous Missing Samples Imputation

After data splitting, the information between data segments is lost not only in the time dimension but also among different variables. It is difficult to impute short-term and long-term missing samples using a single-dimensional interpolation model or a multivariate regression model. We therefore adopt the LSTM model, which can effectively learn long-term dependencies, to impute continuous missing samples after imputing transient isolated missing values and continuous missing variables.
(1) LSTM Model
The five-layer LSTM network used to predict missing values in continuous missing samples is structured as follows.
Input layer: This layer receives input data, where the number of variables in the input data is consistent with the number of neurons in this layer.
LSTM layer: This layer builds the LSTM model. The LSTM unit structure is shown in Figure 4. The memory unit in LSTM has four gates: the forget gate ( f ), the input gate ( i ), the update gate ( g ), and the output gate ( o ). c ( t ) is the unit state, representing the information learned before time t , which can be seen as long-term memory. h ( t ) is the hidden state, representing the output of the network in the current state, which can be seen as short-term memory. x ( t ) is the network input value at the current time. The forget gate determines how much of the previous cell state c ( t − 1 ) is retained in the current state c ( t ) . The input gate determines how much of the input x ( t ) is retained in the current state c ( t ) . The output gate controls the degree to which c ( t ) is output to h ( t ) in the current state. Each node in the LSTM model is calculated as below:
$$
\begin{aligned}
i_t &= \sigma\left(W_i [h_{t-1}, x_t]^{T} + b_i\right) \\
f_t &= \sigma\left(W_f [h_{t-1}, x_t]^{T} + b_f\right) \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t]^{T} + b_o\right) \\
g_t &= \tanh\left(W_g [h_{t-1}, x_t]^{T} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where f is the forget gate, i is the input gate, g is the update gate, o is the output gate, c is the unit state, h is the hidden state, σ is the Sigmoid activation function, W is the weight matrix, b is the bias term, and ⊙ represents element-wise multiplication.
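The gate equations can be checked with a plain NumPy implementation of a single LSTM step. The dictionary-of-weights layout below is an illustrative choice, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above. W holds the
    weight matrices for gates i, f, o, g applied to the concatenated
    vector [h_{t-1}, x_t]; b holds the corresponding bias terms."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W["i"] @ z + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ z + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])   # output gate
    g_t = np.tanh(W["g"] @ z + b["g"])   # update (candidate) values
    c_t = f_t * c_prev + i_t * g_t       # new cell state
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t

# tiny example: 2 hidden units, 3 inputs, all-zero weights keep the
# zero states at zero (g_t = tanh(0) = 0, so c_t and h_t stay 0)
n_h, n_x = 2, 3
W = {k: np.zeros((n_h, n_h + n_x)) for k in "ifog"}
b = {k: np.zeros(n_h) for k in "ifog"}
h, c = lstm_step(np.ones(n_x), np.zeros(n_h), np.zeros(n_h), W, b)
```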
Lost (dropout) layer: This layer is used to prevent overfitting [41]. During the training process, the loss probability P l o s t is set to 0.5: each input value from the LSTM layer is randomly set to 0 with probability P l o s t , and the remaining values are scaled by the rate 1 / ( 1 − P l o s t ) and then input into the fully connected layer.
Fully connected layer: This layer establishes a full connection between the LSTM layer and the output layer. The number of input neurons in this layer is equal to the number of neurons in the LSTM layer.
Output layer: This layer generates the prediction results. The number of output neurons is equal to the number of variables in the output data.
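The lost layer's zero-and-rescale behavior is what is usually called inverted dropout. A small sketch (the uniform input vector and the seed are arbitrary):

```python
import numpy as np

def lost_layer(x, p_lost=0.5, rng=None):
    """Inverted-dropout sketch of the lost layer described above:
    zero each element with probability p_lost and scale the survivors
    by 1/(1 - p_lost) so the expected activation is unchanged."""
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p_lost
    return x * mask / (1.0 - p_lost)

# with p_lost = 0.5 each surviving unit is doubled, so the mean of a
# large all-ones input stays close to 1
out = lost_layer(np.ones(10_000), p_lost=0.5)
```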
(2) Imputation Process for Continuous Missing Samples
The LSTM model takes all the complete data segments before the current moment as input and predicts the missing values at the current moment. The imputed values are then used as input to predict the missing values at the next moment, so the continuous missing samples (the missing data segments) are imputed by iteratively executing the model. The iterative imputation process for the missing data segment X k * * ∈ R m c × n is as follows, where m c is the sample size, n is the number of variables, and l represents the time steps (the length of the input data) of the LSTM model.
Step 1: LSTM model training.
Step 1.1: Generate the training dataset based on the data segment X k ( 2 ) ∈ R m k × n obtained after imputing all transient isolated missing values and continuous missing variables.
Step 1.2: Starting from i = 1, construct the i-th input sample of X t r a i n as X t r a i n ,   i = [ x t r a i n , i ,   ,   x t r a i n , i + l 1 ] and the i-th output sample of X t e s t as X t e s t ,   i = x t r a i n , i + l .
Step 1.3: Increment i and repeat Step 1.2 until m k l sample pairs are generated.
Step 1.4: Train the LSTM model m o d e l L S T M based on dataset X t r a i n and X t e s t , then get the trained LSTM model m o d e l L S T M _ h ( 0 ) whose output state is h ( 0 ) .
Step 2: Missing data prediction.
Step 2.1: Initialize the LSTM model, and input the training dataset X t r a i n into m o d e l L S T M _ h ( 0 ) to obtain m o d e l L S T M _ h ( m k ) whose output state is h ( m k ) .
Step 2.2: For t = m k + 1 , input the l consecutive samples before time t (i.e., X t r a i n ,   t 1 = [ x t r a i n , m k l + 1 ,   ,   x t r a i n , m k ] ) into m o d e l L S T M _ h ( t 1 ) to obtain the predicted data x ~ t . Then update m o d e l L S T M _ h ( t 1 ) according to Formula (23) to get m o d e l L S T M _ h ( t ) .
Step 2.3: Repeat Step 2.2 until t = m k + m c , then get the predicted data segment X k * * ( 1 ) = [ x ~ m k + 1 , ,   x ~ m k + m c ] .
It should be noted that the input of the LSTM model is a vector, so it is necessary to reconstruct the data matrix into a vector before model training and prediction.
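The rolling prediction of Step 2 can be sketched as follows. `model_predict` is a hypothetical stand-in for the trained LSTM (it maps the flattened l previous samples to the next sample, matching the matrix-to-vector reshaping noted above), and the drift-based stub exists only to make the example runnable:

```python
import numpy as np

def impute_segment(model_predict, X_hist, m_c, l=1):
    """Iteratively fill a missing data segment of m_c samples.
    X_hist is the complete data segment preceding the gap; each
    prediction is fed back into the sliding window of l samples."""
    window = list(X_hist[-l:])
    imputed = []
    for _ in range(m_c):
        x_next = model_predict(np.concatenate(window))
        imputed.append(x_next)
        window = window[1:] + [x_next]  # slide: predictions feed back in
    return np.array(imputed)

# stub "model": predicts the previous sample plus a constant drift
drift = np.array([0.1, -0.1])
seg = impute_segment(lambda v: v[-2:] + drift,
                     X_hist=np.array([[1.0, 1.0]]), m_c=3, l=1)
```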

3.3. The Hybrid Missing Data Imputation Method

Considering the various types and the high proportion of missing data in batch process monitoring datasets, we propose a hybrid missing data imputation method based on the above research. The method classifies missing data according to the predefined classification rules, then combines and improves a single-dimensional interpolation model, a multivariate regression model, and LSTM to impute the different categories of missing data step by step based on their specific characteristics. The pseudocode of this hybrid method is presented in Algorithm 2.
Algorithm 2 The proposed hybrid missing data imputation method
Input: The original dataset X O R G
Output: The imputed complete dataset X I M P
1. Begin
2. Unfold the data along the batch dimension to obtain the 2D dataset X;
3. Classify the missing data into five categories: transient isolated missing values, short-term missing variables, long-term missing variables, short-term missing samples and long-term missing samples;
4. Split dataset X to obtain X = [X_1, X_1**, …, X_k, X_k**, …, X_{K−1}, X_{K−1}**, X_K];
5. Impute transient isolated missing values in each data segment X_k using single-dimensional interpolation models;
6. X_k^(1) (k = 1, …, K) ← the imputed data segments;
7. Standardize each data segment;
8. Impute long-term missing variables in each data segment X_k using the iterative imputation based on the multivariate regression model, and impute short-term missing variables in each data segment X_k using the combination model based on single-dimensional interpolation and multivariate regression;
9. X_k^(2) (k = 1, …, K) ← the imputed data segments;
10. Impute short-term missing samples and long-term missing samples (i.e., the missing data segments X_k**) using the LSTM model;
11. X_k**^(1) (k* = 1, …, K − 1) ← the imputed data segments;
12. Complete dataset X_IMP ← de-standardize, and transform the 2D data back to 3D;
13. End
As shown in Figure 5, the proposed hybrid missing data imputation method consists of the following eight steps:
Step 1: Unfolding data: The original three-dimensional dataset X O R G is unfolded along the batch dimension to obtain two-dimensional dataset X.
Step 2: Classifying missing data: According to the missing data classification method (Section 3.1.2), the continuous missing duration t for each variable and the corresponding number of variables n v missing simultaneously are calculated. By comparing the calculated results with the threshold values, the missing data are classified into five categories: transient isolated missing values, short-term missing variables, long-term missing variables, short-term missing samples, and long-term missing samples.
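A possible NumPy sketch of this classification step: the run-length detection follows the description above, while the threshold `t_long` and the category labels are simplified placeholders for the paper's rules in Section 3.1.2 (each fully missing run is reported once per variable in this sketch):

```python
import numpy as np

def missing_runs(mask_col):
    """Return (start, length) for each run of consecutive True values."""
    runs, start = [], None
    for t, m in enumerate(mask_col):
        if m and start is None:
            start = t
        elif not m and start is not None:
            runs.append((start, t - start)); start = None
    if start is not None:
        runs.append((start, len(mask_col) - start))
    return runs

def classify_missing(X, t_long=10):
    """Label each missing run by continuous duration and by whether all
    variables are missing at the same time (t_long is a placeholder)."""
    mask = np.isnan(X)
    labels = []
    for j in range(X.shape[1]):
        for start, length in missing_runs(mask[:, j]):
            all_missing = mask[start:start + length].all()
            if length <= 1 and not all_missing:
                labels.append((j, start, "transient isolated value"))
            elif all_missing:
                kind = "short-term" if length <= t_long else "long-term"
                labels.append((j, start, f"{kind} missing samples"))
            else:
                kind = "short-term" if length <= t_long else "long-term"
                labels.append((j, start, f"{kind} missing variable"))
    return labels

X = np.array([[1., 2.], [np.nan, 2.], [1., 2.],
              [np.nan, np.nan], [np.nan, np.nan], [1., 2.]])
labels = classify_missing(X)
```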
Step 3: Splitting dataset: The dataset X is split according to the locations of continuous missing samples to obtain X = [X_1, X_1**, …, X_k, X_k**, …, X_{K−1}, X_{K−1}**, X_K], where X_k ( k = 1 ,   ,   K ) represents the k-th data segment (the data segment with transient isolated missing values, short-term or long-term missing variables), and X_k** ( k * = 1 ,   ,   K 1 ) represents the missing data segment (the data segment with short-term or long-term missing samples) between the k-th and (k + 1)-th data segments.
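Step 3's split can be expressed as a scan for rows where every variable is missing (a minimal sketch; the real dataset's gaps are of course not hand-written NaN rows):

```python
import numpy as np

def split_at_missing_samples(X):
    """Split X into alternating complete segments X_k and missing
    segments X_k** at rows where every variable is missing."""
    gap = np.isnan(X).all(axis=1)
    segments, kinds, start = [], [], 0
    for t in range(1, len(X) + 1):
        if t == len(X) or gap[t] != gap[start]:
            segments.append(X[start:t])
            kinds.append("missing" if gap[start] else "complete")
            start = t
    return segments, kinds

X = np.array([[1., 2.], [1., 2.],
              [np.nan, np.nan],
              [3., 4.]])
segs, kinds = split_at_missing_samples(X)
```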
Step 4: Imputing transient isolated missing values: Transient isolated missing values in each data segment X k are imputed using three single-dimensional interpolation models as mentioned in Section 3.2.2, and the corresponding imputed data segments are X k ( 1 ) ( k = 1 ,   ,   K ) .
Step 5: Standardize each data segment X k : Taking the variable j in data segment X k as an example, values are standardized using z-score standardization:
$$x_{i,j}^{z} = \frac{x_{i,j} - \mu_j}{\sigma_j}$$
where x i ,   j z is the standardized value of the i-th sample x i , j ( i = 1 ,   ,   m k ,   j = 1 ,   ,   n ) , μ j is the mean of variable j, σ j is the standard deviation of variable j, m k and n, respectively, represent the sample size and the number of variables in X k .
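The standardization of Step 5, together with the de-standardization needed later in Step 8, in a few lines (the toy segment is invented):

```python
import numpy as np

def standardize(Xk):
    """Column-wise z-score standardization of a data segment,
    returning the parameters needed for de-standardization later."""
    mu = Xk.mean(axis=0)
    sigma = Xk.std(axis=0)
    return (Xk - mu) / sigma, mu, sigma

def destandardize(Z, mu, sigma):
    return Z * sigma + mu

Xk = np.array([[1., 10.], [2., 20.], [3., 30.]])
Z, mu, sigma = standardize(Xk)
```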
Step 6: Imputing long-term missing variables and short-term missing variables: For each data segment X k ( 1 ) , each long-term missing variable is imputed using the iterative imputation based on the multivariate regression model as mentioned in Section 3.2.3 (2), all short-term missing variables are imputed using the combination model based on single-dimensional interpolation and multivariate regression as mentioned in Section 3.2.3 (3), and the corresponding imputed data segments are X k ( 2 ) ( k = 1 ,   ,   K ) .
Step 7: Imputing short-term missing samples and long-term missing samples (i.e., the missing data segments): Taking X k ( 2 ) ( k = 1 ,   ,   K ) as input, all missing data segments X k * * are imputed using the LSTM model as mentioned in Section 3.2.4, and the corresponding imputed data segments are X k * * ( 1 ) ( k * = 1 ,   ,   K 1 ) .
Step 8: De-standardize the imputed data segments and transform two-dimensional data to three-dimensional data, then get the imputed complete dataset X I M P .

4. Illustration and Discussion

4.1. Data Source and Description

Injection molding, which refers to the process of making semi-finished parts of a certain shape from molten raw materials, is a typical batch process. A publicly accessible real-world injection molding dataset [42] is taken as an example, which contains data collected from both mold temperature control machines and mold sensors. Six process variables are selected, as shown in Table 2. Under this operating condition, a total of 100 normal batches with 919 sampling points are obtained, denoted as X O R G ( 100 × 6 × 919 ) . The dataset needs to be unfolded along the batch dimension to obtain two-dimensional dataset X ( 6 × 91,900 ) . It includes six variables, and the length of each variable is 91,900 sampling points. The dataset contains data fluctuations, repeating trends between different batches, and dynamic correlations among different variables.
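The batch-wise unfolding can be reproduced with a transpose and reshape. A (batch, variable, time) array layout is assumed here for X_ORG (100 × 6 × 919), and the unfolded matrix is written with the 91,900 sampling points as rows:

```python
import numpy as np

# stand-in for the real dataset: 100 batches, 6 variables, 919 points
X_org = np.zeros((100, 6, 919))

# stack each batch's 919 sampling points along the time axis,
# keeping the 6 process variables as columns
X = X_org.transpose(0, 2, 1).reshape(-1, X_org.shape[1])
```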

4.2. Performance Evaluation Index

(1) Root Mean Square Error
To measure the missing data imputation accuracy, we adopt the most commonly used measure in this field: Root Mean Square Error (RMSE) [19]. The RMSE index can reflect the deviation between the predicted value and the actual value. The smaller the value of RMSE, the higher the accuracy of the algorithm. Taking variable j as an example, the RMSE value can be calculated as follows:
$$RMSE_j = \sqrt{\frac{1}{n_j}\sum_{i=1}^{n_j}\left(x_{i,j} - \tilde{x}_{i,j}\right)^2}$$
where n j is the number of missing values of variable j in data segment X k , x i , j is the actual value, x ~ i , j is the predicted value of x i , j .
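The RMSE computation in a direct form (the numbers below are toy values, not the paper's results):

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Square Error over the imputed positions of one variable."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```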
(2) Mean Square Error
The performance of KNN, RF, and LSTM models for missing value prediction depends on the selection of hyperparameters. We adopt Mean Square Error (MSE) to construct the loss function and utilize 10-fold cross-validation to determine the optimal hyperparameters. The smaller the value of MSE, the higher the accuracy of the algorithm. The MSE value can be calculated as follows:
$$MSE = \frac{1}{m \times n}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(x_{i,j} - \tilde{x}_{i,j}\right)^2$$
where m is the sample size, n is the number of variables, x i , j is the actual value, x ~ i , j is the predicted value of x i , j .

4.3. Data Processing

Firstly, the original three-dimensional dataset X O R G ( 100 × 6 × 919 ) was unfolded along the batch dimension to obtain a two-dimensional dataset X ( 6 × 91,900 ) . According to the missing data classification rules defined in Section 3.1.2, the categories of missing data were determined. The dataset X contains two data segments with continuous missing samples. Therefore, it was split into three data segments and two missing data segments according to the locations of continuous missing samples, i.e., X = [ X 1 ,   X 1 * , X 2 ,   X 2 * , X 3 ] T . Data segments X 1 ,   X 2 ,   X 3 contain transient isolated missing values and continuous missing variables, while the two missing data segments X 1 * , X 2 * are the data segments with continuous missing samples. In data segment X 1 , the plasticizing pressure variable contains continuous missing values, while the cylinder pressure and SV2 valve opening variables contain transient isolated missing values. In data segment X 2 , all variables contain only transient isolated missing values. In data segment X 3 , the plasticizing pressure variable contains continuous missing values, while the nozzle temperature, cylinder pressure and SV2 valve opening variables contain transient isolated missing values. The data integrity information is presented in Table 3. Considering the missing proportions of the six process variables, we selected data segment X 2 , which has the lowest missing proportion, to evaluate transient isolated missing value imputation, and utilized the plasticizing pressure variable with continuous missing values in data segment X 1 to evaluate continuous missing variable imputation.

4.4. Missing Data Imputation and Results Analysis

4.4.1. Transient Isolated Missing Values Imputation

In order to better compare the performance of different imputation methods, some transient isolated values in data segment X 2 were randomly deleted to obtain four experimental datasets with missing proportions of 5%, 10%, 15%, and 20%. The detailed imputation process for transient isolated missing values is shown in Section 3.2.2. The mean and hot-deck imputation methods were selected as baseline models.
The RMSE values calculated for the predicted values of the six process variables are shown in Table 4. Experimental results show that the single-dimensional interpolation model performs better than the mean and hot-deck imputation methods. This difference becomes more pronounced as the proportion of missing values increases. When the missing proportion reaches 20%, the RMSE value of the single-dimensional interpolation model for the screw speed variable is 1.129, only about one third of that obtained with the mean method.

4.4.2. Continuous Missing Variables Imputation

To evaluate the performance of different imputation methods for imputing the continuous missing variable, the continuous missing variable (plasticizing pressure) in data segment X 1 was imputed. The transient isolated missing values in data segment X 1 were imputed first. Methods based on single-dimensional interpolation and multivariate regression models were then used for imputation. The detailed imputation process for the continuous missing variable is shown in Section 3.2.3.
(1) Hyperparameters Selection
The hyperparameters of the RF and KNN models were selected through 10-fold cross-validation, and the results are presented in Figure 6. Figure 6a shows that the optimal values of the parameters n _ e s t i m a t o r s and m _ f e a t u r e s for the RF model are 500 and 1, respectively, where n _ e s t i m a t o r s is the number of CART decision trees and m _ f e a t u r e s is the number of selected features. Figure 6b shows that the optimal parameter k for the KNN model is 7, where k is the number of nearest samples.
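A self-contained sketch of picking k by 10-fold cross-validation with an MSE loss: the plain KNN regressor, the candidate grid, and the synthetic data below are all simplified stand-ins for the paper's actual models and dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Plain KNN regression: average of the k nearest training targets."""
    preds = []
    for q in X_query:
        d = np.linalg.norm(X_train - q, axis=1)
        preds.append(y_train[np.argsort(d)[:k]].mean())
    return np.array(preds)

def select_k_cv(X, y, candidates, folds=10, seed=0):
    """Pick k by k-fold cross-validation with an MSE loss (toy
    re-implementation of the selection procedure described above)."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(X)), folds)
    scores = {}
    for k in candidates:
        mse = 0.0
        for f in range(folds):
            test = chunks[f]
            train = np.concatenate([chunks[g] for g in range(folds) if g != f])
            pred = knn_predict(X[train], y[train], X[test], k)
            mse += np.mean((y[test] - pred) ** 2)
        scores[k] = mse / folds
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] + 0.05 * rng.normal(size=200)  # smooth target + small noise
best_k, scores = select_k_cv(X, y, candidates=[1, 5, 25])
```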
(2) Imputation Results Analysis
The combination model based on single-dimensional interpolation and multivariate regression was utilized for imputation, while six baseline models were employed for comparison. The RMSE values calculated using different methods are presented in Table 5. Experimental results show that the multivariate regression model performs better than the single-dimensional interpolation model, and the combination of a single-dimensional interpolation model and a multivariate regression model further improves imputation accuracy. In particular, the combination of single-dimensional interpolation and MLR achieves the highest imputation accuracy, with an RMSE value of only 1.976. This further indicates the significance of considering both the continuity in the time dimension and the correlation between variables when dealing with short-term missing variables.

4.4.3. Continuous Missing Samples Imputation

In order to evaluate the imputation accuracy for short-term and long-term missing samples, the missing data segments X 1 * and X 2 * were imputed based on the LSTM model after completing the imputation of all transient isolated missing values and continuous missing variables in data segments X 1 , X 2 and X 3 . The detailed imputation process for continuous missing samples is shown in Section 3.2.4.
(1) Hyperparameters Selection
The parameters L r and l have a significant impact on the performance of LSTM, where L r represents the learning rate and l represents the time steps. They were optimized separately, considering their minimal mutual influence. Initially, the LSTM network was initialized with the following parameters: the number of neurons was set to 120, the number of iterations was set to 400, the Adam optimization algorithm was used as the optimizer, a gradient threshold of 1 was set to prevent gradient explosions, and the dropout rate P l o s t was set to 0.
The parameters L r and l were selected through 10-fold cross-validation. For L r , the early stopping technique was applied to prevent overfitting. The frequency of verification was set to 20, and the tolerance of verification was set to 4. While for l , the dropout rate was set to 0.2 as a replacement for the early stopping technique to prevent overfitting. The results obtained for parameters L r and l through 10-fold cross-validation are shown in Figure 7.
As L r is increased from 0.0001 to 0.1, the MSE curve first declines and then rises. Moreover, when L r exceeds 0.1, training begins to fail. Therefore, L r is set to 0.001. The MSE curve for l shows an almost linearly increasing trend, which indicates that imputation accuracy decreases as more historical input data is used. Therefore, l is set to 1. In addition, when using the lost (dropout) layer instead of the early stopping technique to prevent overfitting, the MSE value decreases from 0.315 to 0.198. This indicates that the lost layer is more effective in preventing overfitting than the early stopping technique.
(2) Imputation Results Analysis
The ARIMA (Autoregressive Integrated Moving Average) [43,44] and ELM [36,45] were selected as baseline models. ARIMA is a classical time series model that combines autoregressive, differencing, and moving average components to predict missing values through data autocorrelation. ELM is an efficient machine learning model based on a single-layer feedforward neural network that uses multiple features to predict missing values. The number of hidden layer neurons of both the ELM and LSTM models is the same. The RMSE values calculated for different models are shown in Table 6. The results show that the LSTM model exhibits higher imputation accuracy compared to the ARIMA and ELM models. This indicates that the LSTM model is more effective in capturing long-term dependencies in time series data.

5. Conclusions

In real-world batch process monitoring datasets, missing data usually occurs in different patterns. Failing to identify the type of missing data, or applying imputation methods regardless of the missing type, may decrease imputation performance. Many imputation methods have been developed to impute missing data; however, most of them still do not fulfill the need for data quality in datasets with different types of missing data. Therefore, this paper proposes a novel hybrid missing data imputation method to deal with different types of missing data in a real-world batch process monitoring dataset. By classifying missing data into five distinct categories, we combine and improve suitable models to impute the different categories of missing data step by step based on their unique characteristics. Experiments on a real-world injection molding process monitoring dataset show that analyzing the missing data pattern and applying an appropriate model to each pattern yields better imputation accuracy. Therefore, the hybrid method proposed in this paper excels at missing data imputation for complex batch process monitoring datasets. In practical applications, this method can be employed to impute missing data in batch process monitoring datasets, and its design concept of first categorizing and then stepwise imputing based on data features can also be extended to other datasets containing different types of missing data.
In future research, we plan to conduct studies on the following aspects: The 10-fold cross-validation method, employed for hyperparameter selection in LSTM models, still needs some degree of manual tuning; Bayesian Optimization or Successive Halving could be introduced for automated optimization. Although we have designed a missing data classification method, automated techniques for missing data classification need to be further explored. Data noise can potentially impact imputation performance, and methods such as data cleaning or outlier detection to preprocess the data for noise elimination can be explored. Furthermore, referring to the benchmark proposed in reference [19], additional metrics besides RMSE, such as MAE and runtime, can be introduced. A comprehensive evaluation of imputation accuracy and efficiency could be conducted by selecting suitable baseline methods and utilizing multiple batch process monitoring datasets, while considering various factors like the missing block size, the number of sequences, etc. Based on the evaluation results, the proposed hybrid method might be further improved by enhancing existing models or introducing new models.

Author Contributions

Conceptualization, Y.J. and X.D.; methodology, L.G. and Q.G.; validation, L.G., Q.G. and X.D.; formal analysis, D.H.; writing—original draft preparation, Q.G.; writing—review and editing, Q.G. and X.D.; supervision, D.H., Y.J. and X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China under Grant No. 2020YFB1707900 and 2020YFB1711800; the National Natural Science Foundation of China under Grant No. 62262074, 62172061 and U2268204; the Science and Technology Project of Sichuan Province under Grant No. 2022YFG0159, 2022YFG0155, 2022YFG0157.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

LSTM: Long Short-Term Memory
LR: Linear Regression
MLR: Multivariate Linear Regression
SVD: Singular Value Decomposition
PCA: Principal Component Analysis
MF: Matrix Factorization
CD: Centroid Decomposition
EM: Expectation Maximization
KNN: K Nearest Neighbor
RF: Random Forest
ELM: Extreme Learning Machine
RNNs: Recurrent Neural Networks
VMP: Variable Missing Proportion
SMP: Sample Missing Proportion
CART: Classification and Regression Tree
RMSE: Root Mean Square Error
MSE: Mean Square Error
ARIMA: Autoregressive Integrated Moving Average

References

  1. Yao, Y.; Dai, Y.; Luo, W. Early fault diagnosis method for batch process based on local time window standardization and trend analysis. Sensors 2021, 21, 8075. [Google Scholar] [CrossRef] [PubMed]
  2. Ge, Z.; Gao, F.; Song, Z. Batch process monitoring based on support vector data description method. J. Process Control 2011, 21, 949–959. [Google Scholar] [CrossRef]
  3. Zhao, L.; Yang, J. Batch process monitoring based on quality-related time-batch 2D evolution information. Sensors 2022, 22, 2235. [Google Scholar] [CrossRef] [PubMed]
  4. Zhao, Z.; Huang, B.; Liu, F. Bayesian method for state estimation of batch process with missing data. Comput. Chem. Eng. 2013, 53, 14–24. [Google Scholar] [CrossRef]
  5. Donders, A.R.; van der Heijden, G.J.; Stijnen, T.; Moons, K.G. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006, 59, 1087–1091. [Google Scholar] [CrossRef]
  6. Zhang, Z. Missing values in big data research: Some basic skills. Ann. Transl. Med. 2015, 3, 323. [Google Scholar] [PubMed]
  7. Aittokallio, T. Dealing with missing values in large-scale studies: Microarray data imputation and beyond. Brief. Bioinform. 2010, 11, 253–264. [Google Scholar] [CrossRef] [PubMed]
  8. Nomikos, P.; MacGregor, J.F. Multivariate SPC charts for monitoring batch processes. Technometrics 1995, 37, 41–59. [Google Scholar] [CrossRef]
  9. Stordrange, L.; Rajalahti, T.; Libnau, F.O. Multiway methods to explore and model NIR data from a batch process. Chemom. Intell. Lab. Syst. 2004, 70, 137–145. [Google Scholar] [CrossRef]
  10. Meng, X.; Morris, A.; Martin, E. On-line monitoring of batch processes using a PARAFAC representation. J. Chemom. 2003, 17, 65–81. [Google Scholar] [CrossRef]
  11. Shi, W.; Zhu, Y.; Huang, T.; Sheng, G.; Lian, Y.; Wang, G.; Chen, Y. An integrated data preprocessing framework based on apache spark for fault diagnosis of power grid equipment. J. Signal Process. Syst. 2017, 86, 221–236. [Google Scholar] [CrossRef]
  12. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 1–37. [Google Scholar] [CrossRef]
  13. García-Laencina, P.J.; Sancho-Gómez, J.-L.; Figueiras-Vidal, A.R. Pattern classification with missing data: A review. Neural Comput. Appl. 2010, 19, 263–282. [Google Scholar] [CrossRef]
  14. Lin, W.-C.; Tsai, C.-F. Missing value imputation: A review and analysis of the literature (2006–2017). Artif. Intell. Rev. 2020, 53, 1487–1509. [Google Scholar] [CrossRef]
  15. Farhangfar, A.; Kurgan, L.A.; Pedrycz, W. A novel framework for imputation of missing values in databases. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2007, 37, 692–709. [Google Scholar] [CrossRef]
  16. Andridge, R.R.; Little, R.J. A review of hot deck imputation for survey non-response. Int. Stat. Rev. 2010, 78, 40–64. [Google Scholar] [CrossRef] [PubMed]
  17. Langkamp, D.L.; Lehman, A.; Lemeshow, S. Techniques for handling missing data in secondary analyses of large surveys. Acad. Pediatr. 2010, 10, 205–210. [Google Scholar] [CrossRef]
  18. Yu, L.; Liu, L.; Peace, K.E. Regression multiple imputation for missing data analysis. Stat. Methods Med. Res. 2020, 29, 2647–2664. [Google Scholar] [CrossRef] [PubMed]
  19. Khayati, M.; Lerner, A.; Tymchenko, Z.; Cudre-Mauroux, P. Mind the gap: An experimental evaluation of imputation of missing values techniques in time series. Proc. Vldb. Endow. 2020, 13, 768–782. [Google Scholar] [CrossRef]
  20. Papadimitriou, S.; Sun, J.; Faloutos, C.; Yu, P.S. Dimensionality reduction and filtering on time series sensor streams. In Managing and Mining Sensor Data; Aggarwal, C.C., Ed.; Springer: Boston, MA, USA, 2013; pp. 103–141. [Google Scholar]
  21. Shu, X.B.; Porikli, F.; Ahuja, N. Robust orthonormal subspace learning: Efficient recovery of corrupted low-rank matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  22. Mazumder, R.; Hastie, T.; Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 2010, 11, 2287–2322. [Google Scholar]
  23. Yu, H.-F.; Rao, N.; Dhillon, I.S. Temporal regularized matrix factorization for high-dimensional time series prediction. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  24. Khayati, M.; Böhlen, M.H.; Mauroux, P.C. Using lowly correlated time series to recover missing values in time series: A comparison between SVD and CD. In Proceedings of the Advances in Spatial and Temporal Databases: 14th International Symposium, Hong Kong, China, 26–28 August 2015. [Google Scholar]
  25. Yi, X.; Zheng, Y.; Zhang, J.; Li, T. ST-MVL: Filling missing values in geo-sensory time series data. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016. [Google Scholar]
  26. Li, L.; McCann, J.; Pollard, N.; Faloutsos, C. DynaMMo: Mining and summarization of coevolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009. [Google Scholar]
  27. Kim, T.; Kim, J.; Yang, W.; Lee, H.; Choo, J. Missing value imputation of time-series air-quality data via deep neural networks. Int. J. Environ. Res. Public Health 2021, 18, 12213. [Google Scholar] [CrossRef] [PubMed]
  28. Chen, Y.; Gu, C.; Shao, C.; Gu, H.; Zheng, D.; Wu, Z.; Fu, X. An approach using adaptive weighted least squares support vector machines coupled with modified ant lion optimizer for dam deformation prediction. Math. Probl. Eng. 2020, 2020, 9434065. [Google Scholar] [CrossRef]
  29. Wei, W.; Gu, C.; Fu, X. Processing method of missing data in dam safety monitoring. Math. Probl. Eng. 2021, 2021, 9950874. [Google Scholar] [CrossRef]
  30. Nadimi-Shahraki, M.H.; Mohammadi, S.; Zamani, H.; Gandomi, M.; Gandomi, A.H. A hybrid imputation method for multi-pattern missing data: A case study on type II diabetes diagnosis. Electronics 2021, 10, 3167. [Google Scholar] [CrossRef]
  31. Liang, X.; Ge, Z.; Sun, L.; He, M.; Chen, H. LSTM with wavelet transform based data preprocessing for stock price prediction. Math. Probl. Eng. 2019, 2019, 1340174. [Google Scholar] [CrossRef]
  32. Maillo, J.; Ramírez, S.; Triguero, I.; Herrera, F. kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data. Knowl.-Based Syst. 2017, 117, 3–15. [Google Scholar] [CrossRef]
  33. Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377. [Google Scholar] [CrossRef]
  34. Hong, S.; Lynn, H.S. Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med. Res. Methodol. 2020, 20, 199. [Google Scholar] [CrossRef]
  35. Raja, P.S.; Thangavel, K. Missing value imputation using unsupervised machine learning techniques. Soft Comput. 2020, 24, 4361–4392. [Google Scholar] [CrossRef]
  36. Huang, G.-B.; Zhu, Q.-Y.; Siew, C.-K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
  37. Song, W.; Gao, C.; Zhao, Y.; Zhao, Y. A time series data filling method based on LSTM-Taking the stem moisture as an example. Sensors 2020, 20, 5045. [Google Scholar] [CrossRef] [PubMed]
  38. Yoon, J.; Zame, W.R.; van der Schaar, M. Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Trans. Biomed. Eng. 2018, 66, 1477–1490. [Google Scholar] [CrossRef] [PubMed]
  39. Dyer, S.A.; Xin, H. Cubic-spline interpolation: Part 2. IEEE Instrum. Meas. Mag. 2001, 4, 34–36. [Google Scholar] [CrossRef]
  40. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
  41. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  42. The Injection Molding Process Monitoring Dataset. Available online: https://github.com/Chow-kk/DATASET_4th_industrial-bigdata_competion_ (accessed on 1 March 2022).
  43. Kohn, R.; Ansley, C.F. Estimation, prediction, and interpolation for ARIMA models with missing data. J. Am. Stat. Assoc. 1986, 81, 751–761. [Google Scholar] [CrossRef]
  44. Sura, T.; Nassir, A.B.K.; Wassan, T.M. Estimation the missing data of meteorological variables in different Iraqi cities by using ARIMA model. Iraqi J. Sci. 2018, 59, 792–801. [Google Scholar]
  45. Sovilj, D.; Eirola, E.; Miche, Y.; Björk, K.-M.; Nian, R.; Akusok, A.; Lendasse, A. Extreme learning machine for missing data using multiple imputations. Neurocomputing 2016, 174, 220–231. [Google Scholar] [CrossRef]
Figure 1. Unfolding data along the batch dimension.
Figure 2. Example of short-term missing variable imputation based on the combination model.
Figure 3. Imputation process for short-term missing variable based on the combination model.
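The combination of a single-dimensional interpolation estimate with a multivariate regression estimate, as depicted in Figure 3, can be illustrated as follows. This is a hypothetical sketch, not the paper's implementation: it simply averages the two estimates with equal weights, uses a random forest as the regression model, and assumes the predictor variables are observed at the missing positions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_combination(df, target, missing_idx):
    """Fill df[target] at missing_idx by combining a single-dimensional
    (time-axis) interpolation estimate with a multivariate regression
    estimate from the other process variables."""
    # Estimate 1: interpolation along the time axis of the target variable.
    est_interp = df[target].interpolate(method="linear").loc[missing_idx]

    # Estimate 2: regression from the remaining (observed) variables.
    predictors = df.drop(columns=[target])
    observed = df[target].notna()
    reg = RandomForestRegressor(n_estimators=100, random_state=0)
    reg.fit(predictors[observed], df.loc[observed, target])
    est_reg = pd.Series(reg.predict(predictors.loc[missing_idx]),
                        index=missing_idx)

    # Combine the two estimates (equal weights here, as an illustration).
    out = df.copy()
    out.loc[missing_idx, target] = 0.5 * (est_interp + est_reg)
    return out
```

In practice the two estimates could be weighted by their validation errors rather than averaged equally.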
Figure 4. LSTM unit: f is the forget gate, i is the input gate, g is the update gate, o is the output gate, c is the cell state, h is the hidden state, σ is the Sigmoid activation function, and W is the weight matrix; Stack, ⊙, ⊕ and ⊗ denote matrix stacking, element-wise multiplication, matrix addition and matrix multiplication, respectively.
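The gate structure in the Figure 4 caption corresponds to the standard LSTM equations, which can be sketched as a single time step in NumPy. This is an illustrative restatement of the textbook LSTM cell, not the authors' implementation; the ordering of the gates inside W is a convention chosen here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.
    x: input vector; h_prev, c_prev: previous hidden and cell states;
    W: weight matrix applied to the stacked [h_prev, x]; b: bias."""
    stacked = np.concatenate([h_prev, x])   # "Stack" in Figure 4
    z = W @ stacked + b                     # joint affine transform (⊗ and ⊕)
    H = h_prev.size
    f = sigmoid(z[0:H])                     # forget gate
    i = sigmoid(z[H:2 * H])                 # input gate
    g = np.tanh(z[2 * H:3 * H])             # update (candidate) gate
    o = sigmoid(z[3 * H:4 * H])             # output gate
    c = f * c_prev + i * g                  # new cell state (⊙ then ⊕)
    h = o * np.tanh(c)                      # new hidden state
    return h, c
```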
Figure 5. The proposed hybrid missing data imputation method.
Figure 6. Hyperparameter selection through 10-fold cross-validation: (a) n_estimators and m_features of the RF model; (b) k of the KNN model.
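A 10-fold cross-validated hyperparameter search of this kind can be sketched with scikit-learn. This is a minimal illustration on synthetic data: the candidate grids and scoring metric are assumptions, not the paper's exact ranges, and scikit-learn names the RF feature-subset parameter max_features rather than m_features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the monitoring data: predict one variable from the others.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

# (a) n_estimators and max_features of the RF model.
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 50, 100], "max_features": [1, 2, 3]},
    cv=10, scoring="neg_root_mean_squared_error",
)
rf_search.fit(X, y)

# (b) k of the KNN model.
knn_search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": list(range(1, 11))},
    cv=10, scoring="neg_root_mean_squared_error",
)
knn_search.fit(X, y)

print(rf_search.best_params_, knn_search.best_params_)
```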
Figure 7. Hyperparameter selection for the LSTM model through 10-fold cross-validation.
Table 1. Classification rules for missing data in batch process monitoring dataset.

| Missing Data Category | Classification Rule |
|---|---|
| Transient isolated missing values | 0 < t ≤ Th_t1 |
| Short-term missing variables | Th_t1 < t ≤ Th_t2 and n_v < Th_v |
| Long-term missing variables | t > Th_t2 and n_v < Th_v |
| Short-term missing samples | Th_t1 < t ≤ Th_t2 and n_v ≥ Th_v |
| Long-term missing samples | t > Th_t2 and n_v ≥ Th_v |
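The rules in Table 1 translate directly into a small classifier over the continuous missing duration t and the number of simultaneously missing variables n_v. A minimal sketch follows; the threshold values Th_t1, Th_t2 and Th_v are dataset-specific and the values used in the test below are placeholders.

```python
def classify_missing(t, n_v, Th_t1, Th_t2, Th_v):
    """Classify a missing-data block by its continuous missing duration t
    and the number of simultaneously missing variables n_v (Table 1)."""
    if t <= Th_t1:
        return "transient isolated missing values"
    if t <= Th_t2:  # Th_t1 < t <= Th_t2
        return ("short-term missing samples" if n_v >= Th_v
                else "short-term missing variables")
    # t > Th_t2
    return ("long-term missing samples" if n_v >= Th_v
            else "long-term missing variables")
```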
Table 2. Process variables of a real-world injection molding process monitoring dataset.

| Variable Type | Variable Description | Unit |
|---|---|---|
| Process | Screw speed | mm/s |
| | Plasticizing pressure | bar |
| | Nozzle temperature | |
| | Cylinder pressure | bar |
| | SV1 valve opening | % |
| | SV2 valve opening | % |
Table 3. Data integrity information.

| Data Segment | X1 | X2 | X3 |
|---|---|---|---|
| SMP(k) | 0.216 | 0.037 | 0.130 |
| Screw speed VMP1(k) | 0 | 0.004 | 0 |
| Plasticizing pressure VMP2(k) | 0.215 | 0.029 | 0.129 |
| Nozzle temperature VMP3(k) | 0 | 0 | 0.002 |
| Cylinder pressure VMP4(k) | 0.002 | 0.002 | 0.003 |
| SV1 valve opening VMP5(k) | 0 | 0 | 0 |
| SV2 valve opening VMP6(k) | 0.029 | 0.017 | 0.003 |
Table 4. RMSE of missing data imputation results for transient isolated missing values.

| Missing Rate of X2 | Imputation Method | Screw Speed | Plasticizing Pressure | Nozzle Temperature | Cylinder Pressure | SV1 Valve Opening | SV2 Valve Opening |
|---|---|---|---|---|---|---|---|
| 5% | Single-dimensional interpolation model | 1.051 | 2.056 | 3.881 | 2.089 | 0.103 | 0.893 |
| 5% | Mean | 2.673 | 3.385 | 4.532 | 2.053 | 0.067 | 1.426 |
| 5% | Hot-deck imputation | 1.105 | 2.734 | 4.364 | 2.047 | 0.032 | 1.940 |
| 10% | Single-dimensional interpolation model | 1.438 | 2.072 | 3.659 | 2.078 | 0.056 | 0.912 |
| 10% | Mean | 2.937 | 3.619 | 4.233 | 2.058 | 0.099 | 1.503 |
| 10% | Hot-deck imputation | 1.935 | 2.802 | 4.674 | 2.049 | 0.042 | 1.784 |
| 15% | Single-dimensional interpolation model | 1.301 | 2.089 | 3.431 | 2.067 | 0.055 | 1.425 |
| 15% | Mean | 3.801 | 3.623 | 4.567 | 2.108 | 0.112 | 1.285 |
| 15% | Hot-deck imputation | 2.572 | 2.723 | 4.347 | 2.087 | 0.045 | 1.731 |
| 20% | Single-dimensional interpolation model | 1.129 | 2.078 | 3.626 | 2.074 | 0.054 | 1.373 |
| 20% | Mean | 3.256 | 3.611 | 4.910 | 2.099 | 0.113 | 1.891 |
| 20% | Hot-deck imputation | 2.533 | 2.805 | 4.221 | 2.072 | 0.051 | 1.992 |
Table 5. RMSE of missing data imputation results for continuous missing variables.

| Imputation Method | | RMSE |
|---|---|---|
| Combination model based on single-dimensional interpolation and multivariate regression | Single-dimensional interpolation + MLR | 1.976 |
| | Single-dimensional interpolation + RF | 2.016 |
| | Single-dimensional interpolation + KNN | 2.159 |
| Single-dimensional interpolation model | Linear interpolation | 5.812 |
| | Mean | 6.031 |
| | Spline interpolation | 5.903 |
| Multivariate regression model | MLR | 4.392 |
| | RF | 4.204 |
| | KNN | 4.450 |
Table 6. RMSE of missing data imputation results for continuous missing samples.

| Imputation Method | Missing Data Segment | Screw Speed | Plasticizing Pressure | Nozzle Temperature | Cylinder Pressure | SV1 Valve Opening | SV2 Valve Opening |
|---|---|---|---|---|---|---|---|
| LSTM | X1* | 0.842 | 1.098 | 2.719 | 1.093 | 0.112 | 0.149 |
| ARIMA | X1* | 1.691 | 1.104 | 2.903 | 1.007 | 0.119 | 0.201 |
| ELM | X1* | 2.715 | 1.124 | 2.812 | 1.132 | 0.105 | 0.218 |
| LSTM | X2* | 0.529 | 1.071 | 2.027 | 1.073 | 0.094 | 0.173 |
| ARIMA | X2* | 1.626 | 1.176 | 2.297 | 1.519 | 0.113 | 0.191 |
| ELM | X2* | 2.371 | 1.193 | 2.151 | 1.168 | 0.151 | 0.264 |

Share and Cite

MDPI and ACS Style

Gan, Q.; Gong, L.; Hu, D.; Jiang, Y.; Ding, X. A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset. Sensors 2023, 23, 8678. https://doi.org/10.3390/s23218678
