A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset
Abstract
1. Introduction
- We propose a method that classifies missing data by two criteria: how long each variable is continuously missing, and how many variables are missing at the same time. It yields five categories: transient isolated missing values, short-term missing variables, long-term missing variables, short-term missing samples, and long-term missing samples.
- We design and implement a hybrid imputation method that handles these categories step by step, according to the characteristics of each. A combination of three single-dimensional interpolation models automatically detects and imputes transient isolated missing values. An iterative procedure built on a multivariate regression model imputes all long-term missing variables. For short-term missing variables, a combined model uses both single-dimensional interpolation and multivariate regression, exploiting system fluctuations. An LSTM model imputes both short-term and long-term missing samples.
- We carry out extensive experiments on a real-world injection molding process monitoring dataset to demonstrate the effectiveness and accuracy of the proposed hybrid imputation method.
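The classification idea in the first contribution can be illustrated with a minimal sketch. It is not the authors' rule set: the thresholds `iso_max` and `short_max`, the function names, and the reading of "missing samples" as "all variables missing at the same time" are assumptions made here for illustration only.

```python
from typing import List, Optional, Tuple


def missing_runs(column: List[Optional[float]]) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of consecutive missing values."""
    runs = []
    start = None
    for i, value in enumerate(column):
        if value is None:
            if start is None:
                start = i  # a new run of missing values begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # the column ends inside a run
        runs.append((start, len(column) - start))
    return runs


def classify_run(duration: int, n_missing_vars: int, n_vars: int,
                 iso_max: int = 1, short_max: int = 10) -> str:
    """Assign one of the five categories to a run of missing data.

    iso_max and short_max are hypothetical thresholds, not values from the paper.
    """
    if n_missing_vars == n_vars:  # assumed meaning of a "missing sample"
        if duration <= short_max:
            return "short-term missing samples"
        return "long-term missing samples"
    if duration <= iso_max:
        return "transient isolated missing values"
    if duration <= short_max:
        return "short-term missing variables"
    return "long-term missing variables"
```

For example, `missing_runs` applied to one variable's time series gives the continuous missing durations, and `classify_run` then maps each duration, together with the number of variables missing at that time, to a category.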
2. Related Works
3. Methodology
3.1. Data Processing
3.1.1. Data Unfolding
3.1.2. Missing Data Classifying
3.2. Missing Data Imputation
3.2.1. Dataset Splitting
3.2.2. Transient Isolated Missing Values Imputation
3.2.3. Continuous Missing Variables Imputation
Algorithm 1 The iterative imputation based on multivariate regression model |
Input: , |
Output: The imputed data segment |
1. Begin |
|
3. Set ; |
4. For to : |
|
|
|
8. Impute using ; |
9. Set ; |
10. Return ; |
11. End |
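Several steps of Algorithm 1 did not survive in this copy, so the following is only a sketch of what an iterative regression-based imputation could look like, not a reconstruction of the authors' algorithm. It assumes that variables are imputed in order of increasing number of gaps, that each imputed variable then joins the predictor set, and that ordinary least squares (via `numpy.linalg.lstsq`) stands in for the multivariate regression model. The function name is hypothetical.

```python
import numpy as np


def iterative_regression_impute(segment: np.ndarray) -> np.ndarray:
    """Impute NaN entries column by column with least-squares regression.

    Assumption: at least one column of the segment is fully observed.
    """
    data = segment.astype(float).copy()
    missing = np.isnan(data)
    n_missing = missing.sum(axis=0)
    n_rows, n_cols = data.shape

    # Predictors start as the fully observed columns.
    predictors = [j for j in range(n_cols) if n_missing[j] == 0]
    if not predictors:
        raise ValueError("at least one fully observed variable is required")

    # Incomplete columns, fewest gaps first (assumed ordering).
    order = sorted((j for j in range(n_cols) if n_missing[j] > 0),
                   key=lambda j: n_missing[j])

    for j in order:
        gaps = missing[:, j]
        if gaps.all():
            raise ValueError(f"column {j} has no observed values to fit on")
        # Design matrix with an intercept column.
        X = np.column_stack([np.ones(n_rows), data[:, predictors]])
        coef, *_ = np.linalg.lstsq(X[~gaps], data[~gaps, j], rcond=None)
        data[gaps, j] = X[gaps] @ coef
        predictors.append(j)  # the imputed column becomes a predictor
    return data
```

Swapping the least-squares fit for another regressor changes only the two lines that compute `coef` and the prediction.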
3.2.4. Continuous Missing Samples Imputation
3.3. The Hybrid Missing Data Imputation Method
Algorithm 2 The proposed hybrid missing data imputation method |
Input: The original dataset |
Output: The imputed complete dataset |
1. Begin |
2. Unfold the data along the batch dimension to obtain the 2D dataset X; |
|
4. Splitting dataset X, get ; |
|
6. ← The imputed data segments; |
7. Standardize each data segment; |
|
9. ← The imputed data segments; |
|
11. ← The imputed data segments; |
12. Complete dataset ← De-standardize, and transform 2D data to 3D data; |
13. End |
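The data-shaping steps of Algorithm 2 (unfolding, standardizing, de-standardizing, and restoring the 3D form) can be sketched as follows. The array layout `(batches, time steps, variables)`, the choice to stack batches row-wise so that columns remain variables, and all function names are assumptions. The imputation steps themselves are omitted.

```python
import numpy as np


def unfold(data_3d: np.ndarray) -> np.ndarray:
    """Stack batches one after another: (I, K, J) -> (I*K, J)."""
    n_batches, n_steps, n_vars = data_3d.shape
    return data_3d.reshape(n_batches * n_steps, n_vars)


def standardize(segment: np.ndarray):
    """Z-score each variable, ignoring NaN entries when computing statistics."""
    mean = np.nanmean(segment, axis=0)
    std = np.nanstd(segment, axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return (segment - mean) / std, mean, std


def destandardize(z: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Undo the z-score transform."""
    return z * std + mean


def refold(data_2d: np.ndarray, n_batches: int, n_steps: int) -> np.ndarray:
    """Restore the 3D layout: (I*K, J) -> (I, K, J)."""
    return data_2d.reshape(n_batches, n_steps, data_2d.shape[1])
```

Keeping `mean` and `std` from the standardization step is what makes the final de-standardization exact.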
4. Illustration and Discussion
4.1. Data Source and Description
4.2. Performance Evaluation Index
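The abbreviations list includes RMSE and MSE, so the standard definitions are given here for reference. The symbols are assumptions: $y_i$ is the true value, $\hat{y}_i$ the imputed value, and $n$ the number of imputed entries.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

RMSE is expressed in the same unit as the variable being imputed, so values are comparable across methods for one variable but not across variables with different units.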
4.3. Data Processing
4.4. Missing Data Imputation and Results Analysis
4.4.1. Transient Isolated Missing Values Imputation
4.4.2. Continuous Missing Variables Imputation
4.4.3. Continuous Missing Samples Imputation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
LSTM | Long Short-Term Memory |
LR | Linear Regression |
MLR | Multivariate Linear Regression |
SVD | Singular Value Decomposition |
PCA | Principal Component Analysis |
MF | Matrix Factorization |
CD | Centroid Decomposition |
EM | Expectation Maximization |
KNN | K Nearest Neighbor |
RF | Random Forest |
ELM | Extreme Learning Machine |
RNNs | Recurrent Neural Networks |
VMP | Variable Missing Proportion |
SMP | Sample Missing Proportion |
CART | Classification and Regression Tree |
RMSE | Root Mean Square Error |
MSE | Mean Square Error |
ARIMA | Autoregressive Integrated Moving Average |
Missing Data Categories | Classification Rules |
---|---|
Transient isolated missing values | |
Short-term missing variables | and |
Long-term missing variables | and |
Short-term missing samples | and |
Long-term missing samples | and |
Variable Type | Variable Description | Unit |
---|---|---|
Process | Screw speed | |
Process | Plasticizing pressure | |
Process | Nozzle temperature | |
Process | Cylinder pressure | |
Process | SV1 value opening | % |
Process | SV2 value opening | % |
Data Segment | |||
---|---|---|---|
0.216 | 0.037 | 0.130 | |
Screw speed | 0 | 0.004 | 0 |
Plasticizing pressure | 0.215 | 0.029 | 0.129 |
Nozzle temperature | 0 | 0 | 0.002 |
Cylinder pressure | 0.002 | 0.002 | 0.003 |
SV1 value opening | 0 | 0 | 0 |
SV2 value opening | 0.029 | 0.017 | 0.003 |
Imputation Method | Missing Proportion | Screw Speed | Plasticizing Pressure | Nozzle Temperature | Cylinder Pressure | SV1 Value Opening | SV2 Value Opening |
---|---|---|---|---|---|---|---|
Single-dimensional interpolation model | 5% | 1.051 | 2.056 | 3.881 | 2.089 | 0.103 | 0.893 |
Mean | 5% | 2.673 | 3.385 | 4.532 | 2.053 | 0.067 | 1.426 |
Hot-deck imputation | 5% | 1.105 | 2.734 | 4.364 | 2.047 | 0.032 | 1.940 |
Single-dimensional interpolation model | 10% | 1.438 | 2.072 | 3.659 | 2.078 | 0.056 | 0.912 |
Mean | 10% | 2.937 | 3.619 | 4.233 | 2.058 | 0.099 | 1.503 |
Hot-deck imputation | 10% | 1.935 | 2.802 | 4.674 | 2.049 | 0.042 | 1.784 |
Single-dimensional interpolation model | 15% | 1.301 | 2.089 | 3.431 | 2.067 | 0.055 | 1.425 |
Mean | 15% | 3.801 | 3.623 | 4.567 | 2.108 | 0.112 | 1.285 |
Hot-deck imputation | 15% | 2.572 | 2.723 | 4.347 | 2.087 | 0.045 | 1.731 |
Single-dimensional interpolation model | 20% | 1.129 | 2.078 | 3.626 | 2.074 | 0.054 | 1.373 |
Mean | 20% | 3.256 | 3.611 | 4.910 | 2.099 | 0.113 | 1.891 |
Hot-deck imputation | 20% | 2.533 | 2.805 | 4.221 | 2.072 | 0.051 | 1.992 |
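Imputation Method | | |
---|---|---|
The combination model based on single-dimensional interpolation and multivariate regression model | Single-dimensional interpolation + MLR | 1.976 |
The combination model based on single-dimensional interpolation and multivariate regression model | Single-dimensional interpolation + RF | 2.016 |
The combination model based on single-dimensional interpolation and multivariate regression model | Single-dimensional interpolation + KNN | 2.159 |
Single-dimensional interpolation model | Linear interpolation | 5.812 |
Single-dimensional interpolation model | Mean | 6.031 |
Single-dimensional interpolation model | Spline interpolation | 5.903 |
Multivariate regression model | MLR | 4.392 |
Multivariate regression model | RF | 4.204 |
Multivariate regression model | KNN | 4.450 |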
Imputation Method | Missing Data Segment | Screw Speed | Plasticizing Pressure | Nozzle Temperature | Cylinder Pressure | SV1 Value Opening | SV2 Value Opening |
---|---|---|---|---|---|---|---|
LSTM | | 0.842 | 1.098 | 2.719 | 1.093 | 0.112 | 0.149 |
ARIMA | | 1.691 | 1.104 | 2.903 | 1.007 | 0.119 | 0.201 |
ELM | | 2.715 | 1.124 | 2.812 | 1.132 | 0.105 | 0.218 |
LSTM | | 0.529 | 1.071 | 2.027 | 1.073 | 0.094 | 0.173 |
ARIMA | | 1.626 | 1.176 | 2.297 | 1.519 | 0.113 | 0.191 |
ELM | | 2.371 | 1.193 | 2.151 | 1.168 | 0.151 | 0.264 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gan, Q.; Gong, L.; Hu, D.; Jiang, Y.; Ding, X. A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset. Sensors 2023, 23, 8678. https://doi.org/10.3390/s23218678