Article

Principal Component Analysis of Process Datasets with Missing Values

by Kristen A. Severson, Mark C. Molaro and Richard D. Braatz *
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
* Author to whom correspondence should be addressed.
Current address: Element Analytics, San Francisco, CA 94107, USA.
Processes 2017, 5(3), 38; https://doi.org/10.3390/pr5030038
Submission received: 31 May 2017 / Revised: 28 June 2017 / Accepted: 30 June 2017 / Published: 6 July 2017
(This article belongs to the Collection Process Data Analytics)

Abstract

Datasets with missing values arising from causes such as sensor failure, inconsistent sampling rates, and merging data from different systems are common in the process industry. Methods for handling missing data typically operate during data pre-processing, but such handling can also occur during model building. This article considers missing data within the context of principal component analysis (PCA), a method originally developed for complete data that has widespread industrial application in multivariate statistical process control. Due to the prevalence of missing data and the success of PCA for complete data, several PCA algorithms that can act on incomplete data have been proposed. Here, algorithms for applying PCA to datasets with missing values are reviewed. A case study is presented to demonstrate the performance of the algorithms, and suggestions are made with respect to choosing which algorithm is most appropriate for particular settings. An alternating algorithm based on the singular value decomposition achieved the best results in the majority of test cases involving process datasets.

1. Introduction

Principal component analysis (PCA) is a widely used tool in industry for process monitoring. PCA and its variants have been proposed for process control [1], identification of faulty sensors [2], data preprocessing [3], data visualization [4], model building [5], and fault detection and identification [6] in continuous as well as batch processing [7,8]. PCA has been applied in a variety of industries including chemicals, polymers, semiconductors, and pharmaceuticals. Classic PCA methods require complete observations; however, online process measurements and laboratory data often have missing observations. Causes of missing data in this context include sensor failure, changes in sensor instrumentation over time, different sampling rates, merging of data from different systems, and samples that are flagged as poor quality and subsequently dropped from storage [9]. The nonlinear iterative partial least squares (NIPALS) algorithm was an early approach for handling missing process data when applying PCA [10,11]. The problem started to gain more attention in the late 1990s [12,13] and, because of the ubiquity of missing data, many PCA algorithms that can handle missing data have been proposed since. This article reviews these approaches and provides guidance to practitioners on which methods to apply.
A framework for analysis in the presence of missing data has been available since the mid-1970s [14]; it introduces categories of missingness and explains when missingness can be ignored. Three categorizations of missingness are (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) not missing at random (NMAR) [15]. These categories can be described using the missing-data indicator matrix M, which is of the same size as the data matrix X and has M_{ij} = 1 if X_{ij} is missing and 0 otherwise. The MCAR assumption applies when the independence statement
f(M \mid X, \phi) = f(M \mid \phi), \quad \forall \, X, \phi,
is true, where f is a probability density, variables to the right of | indicate the conditioning set, and ϕ are unknown parameters. MCAR implies that the missingness is not a function of the data, regardless of whether the data points are observed or missing. The MAR assumption applies when the independence statement
f(M \mid X, \phi) = f(M \mid X_{obs}, \phi), \quad \forall \, X_{mis}, \phi,
is true. MAR implies that the missingness depends on the observed data. NMAR is assumed when neither of these criteria apply [15].
Recently, access to large amounts of process data has been enabled by improved sensor technology, the Industrial Internet of Things, and decreased data storage costs. Due to an increasing number and diversity of measurements [16], data with missing elements will become increasingly common. When working with a dataset, the first step is to identify which data are missing and why. If the missingness mechanism is MCAR or MAR, the mechanism is referred to as ignorable and does not need to be modeled when performing inference. To perform inference, the quantity of interest is the likelihood, which is the probability of the observed data given the distributional parameters. If the MAR assumption holds, the likelihood is proportional to the probability of the observed data given the true parameters, and it is therefore not necessary to model the missingness [15]. However, when data are NMAR and the missingness mechanism is not taken into account, algorithms can lead to systematic bias and poor prediction [15]. Conclusive tests for determining the appropriate missingness categorization do not exist, and so the categorization is selected based on process understanding. The conclusions of missingness categorization depend on the specific scenario, but some typical examples for the process industry are presented here to provide guidance to practitioners. MCAR is applicable to data that are missing due to random sensor failure or mishandling of the data. MAR applies to scenarios where data are acquired sequentially, for example, a quality test that is only performed based on the results of previous testing. NMAR applies to measurements that are not recorded due to censoring, where the value is outside of the limits of detection [9].

2. Methods

2.1. Introduction to PCA

Principal component analysis is a technique for dimensionality reduction. Pearson [17] and Hotelling [18] are typically credited with the first descriptions of the technique [19]. Hotelling described PCA as the set of linear projections that maximizes the variance in a lower dimensional space. For a data matrix X ∈ ℝ^{d×n}, where d is the number of measurements and n is the number of samples, the linear projection described by Hotelling can be found via the singular value decomposition (SVD),
X = U \Sigma V^{\top},
where U ∈ ℝ^{d×d} and V ∈ ℝ^{n×n} are orthogonal matrices and Σ ∈ ℝ^{d×n} is a pseudo-diagonal matrix. The linear projection matrix P ∈ ℝ^{d×a}, also called the matrix of loading vectors, is defined by the columns of U that correspond to the largest a singular values. The principal components, also called the scores, are defined as
T = P^{\top} X
or as the first a rows of \Sigma V^{\top}. Equivalently, P can be found by solving the eigenvalue decomposition of the sample covariance matrix,
S = \frac{1}{n} X X^{\top} = U \Lambda U^{\top},
where the diagonal matrix \Lambda = \frac{1}{n} \Sigma \Sigma^{\top}, with P defined as the columns of U that correspond to the largest a eigenvalues.
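To make the SVD route concrete, the following is a minimal NumPy sketch (not the authors' MATLAB code); the function and variable names are illustrative.

```python
import numpy as np

def pca_svd(X, a):
    """PCA of a complete d x n data matrix X via the SVD, keeping a components."""
    mu = X.mean(axis=1, keepdims=True)       # sample mean of each measurement
    Xc = X - mu                              # mean-center the rows
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = U[:, :a]                             # loadings: columns of U for the largest a singular values
    T = P.T @ Xc                             # scores (principal components)
    explained = s[:a] ** 2 / np.sum(s ** 2)  # fraction of variance captured by each component
    return P, T, explained

# Illustrative usage with random data
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1000))
P, T, explained = pca_svd(X, a=4)
```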
Pearson [17] described PCA as the optimal rank-a approximation of a data matrix X for a < d using the least-squares criterion. Here, the observed data are modeled as
\hat{x}_i = P t_i + \mu,
where \hat{x}_i is the reconstruction of a column of the previously defined data matrix X, P is again an orthogonal matrix, t_i is the score and is equivalent to a column of the previously defined matrix T, and μ is the mean of the observed data, such that the reconstruction error
C = \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2
is minimized.
PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model [20,21]. This formulation is referred to as PPCA. PPCA assumes the data are modeled by a generative latent variable model,
x_i = P t_i + \mu + \epsilon_i,
where the variables are defined as above and ϵ i is the error. The distributional assumptions are
t_i \sim \mathcal{N}(0, I_a)
\epsilon_i \sim \mathcal{N}(0, \sigma^2 I_d)
x_i \mid t_i \sim \mathcal{N}(P t_i + \mu, \sigma^2 I_d)
x_i \sim \mathcal{N}(\mu, P P^{\top} + \sigma^2 I_d)
where I k is the k × k identity matrix, N ( μ , Σ ) indicates a normal distribution with mean μ and covariance Σ , and all other terms are defined as above. Tipping and Bishop [20] and Roweis [21] independently proposed finding the maximum likelihood estimates of the distributional parameters via expectation maximization (EM). EM is a general framework for learning parameters with incomplete data which iteratively updates the expected complete data log-likelihood and the maximum likelihood estimates of the parameters [22]. In PPCA, the data are incomplete because the principal components, t i , are not observed. Typically, t i are referred to as latent variables, as opposed to missing data, because they cannot be observed. Generally, EM is only guaranteed to converge to a local maximum, but Tipping and Bishop [20] showed that EM converges to a global maximum for PPCA. To apply EM to PPCA, first the observed data are mean-centered using the sample mean. Then the algorithm alternates between calculating the conditional expectations of the latent variables,
\langle t_i \rangle = W^{-1} P^{\top} ( x_i - \mu ),
\langle t_i t_i^{\top} \rangle = \sigma^2 W^{-1} + \langle t_i \rangle \langle t_i \rangle^{\top},
where W = P^{\top} P + \sigma^2 I_a, and updating the parameters
P = \left[ \sum_{i=1}^{n} ( x_i - \mu ) \langle t_i \rangle^{\top} \right] \left[ \sum_{i=1}^{n} \langle t_i t_i^{\top} \rangle \right]^{-1}
\sigma^2 = \frac{1}{nd} \sum_{i=1}^{n} \left[ \| x_i - \mu \|^2 - 2 \langle t_i \rangle^{\top} P^{\top} ( x_i - \mu ) + \operatorname{tr}\!\left( \langle t_i t_i^{\top} \rangle P^{\top} P \right) \right]
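As an illustration of this EM iteration for complete data, a short NumPy sketch follows; the random initialization, tolerance, and iteration cap are illustrative choices rather than the paper's settings.

```python
import numpy as np

def ppca_em(X, a, n_iter=200, tol=1e-6):
    """EM for probabilistic PCA on a complete d x n data matrix X."""
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    rng = np.random.default_rng(0)
    P = rng.normal(size=(d, a))              # initial loadings (illustrative)
    sigma2 = 1.0                             # initial noise variance (illustrative)
    for _ in range(n_iter):
        # E-step: conditional moments of the latent scores
        W = P.T @ P + sigma2 * np.eye(a)
        Winv = np.linalg.inv(W)
        T = Winv @ P.T @ Xc                          # <t_i> for all samples (a x n)
        TT = n * sigma2 * Winv + T @ T.T             # sum_i <t_i t_i^T>
        # M-step: update loadings and noise variance
        P_new = (Xc @ T.T) @ np.linalg.inv(TT)
        sigma2_new = (np.sum(Xc ** 2)
                      - 2.0 * np.sum(T * (P_new.T @ Xc))
                      + np.trace(TT @ P_new.T @ P_new)) / (n * d)
        converged = abs(sigma2_new - sigma2) < tol
        P, sigma2 = P_new, sigma2_new
        if converged:
            break
    return P, sigma2, mu
```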
Before application of the PCA algorithm, each measurement (i.e., each row when X ∈ ℝ^{d×n}) in the data matrix is typically mean-centered around zero and rescaled to have standard deviation equal to one. For all PCA implementations, it is necessary to choose the latent dimension a, and several approaches exist. Scree plots [23] visualize the singular values in decreasing order, look for an “elbow” or “gap”, and truncate at that point. The percent-variance-explained approach considers the variance, defined as the square of the corresponding singular value, of each loading vector and truncates at a specified threshold, often 90% or 95%. Cross-validation strategies choose a such that the reconstruction error of a held-out set is minimized. In the PPCA framework, the negative log-likelihood of a validation set can also be used. Parallel analysis [24] compares the scree plot of the data matrix to that of a random matrix of the same size and thresholds at the crossing point. Donoho and Gavish [25] propose an optimal threshold based on the asymptotic mean-squared error.
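Two of the selection rules mentioned above, percent variance explained and a simplified form of parallel analysis, are sketched below in NumPy; the 90% threshold and the number of random draws are illustrative, and the parallel-analysis comparison assumes the data have been z-scored.

```python
import numpy as np

def choose_a_percent_variance(s, threshold=0.90):
    """Smallest a whose cumulative variance (squared singular values) reaches the threshold."""
    var = s ** 2
    cum = np.cumsum(var) / np.sum(var)
    return int(np.searchsorted(cum, threshold) + 1)

def choose_a_parallel_analysis(X, n_draws=20, seed=0):
    """Simplified parallel analysis: keep components whose singular values exceed the
    average singular values of same-size standard-normal matrices. Assumes X (d x n)
    has already been z-scored so the comparison with N(0, 1) entries is meaningful."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    s_data = np.linalg.svd(X, compute_uv=False)
    s_rand = np.mean(
        [np.linalg.svd(rng.standard_normal((d, n)), compute_uv=False) for _ in range(n_draws)],
        axis=0)
    above = s_data > s_rand
    return int(np.argmin(above)) if not above.all() else len(s_data)
```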

2.2. PCA Methods for Missing Data

To apply an algorithm to a dataset with missing data, the simplest approaches are complete case analysis, in which only samples that have all of the measurements are used in the analysis, and mean imputation, in which missing elements are replaced with the sample mean. These techniques can lead to large amounts of data loss or bias and are undesirable. Because complete case analysis and mean imputation first address missing data and then proceed with modeling, these techniques are referred to as two-step procedures. More advanced two-step procedures exist, such as multiple imputation [26], as well as two-step procedures that are designed for certain types of missingness, such as lifting [27], which is applied to multi-rate missingness. Here, the focus is on methods that integrate missing-data handling and model building for PCA. All of the PCA methods in the previous section assume that the data matrix is complete; however, in practice, the data matrix may not be complete, and several approaches have been proposed for finding the principal components in the presence of missing data.
Grung and Manne [13] proposed an alternating least-squares type of approach. Their algorithm is initialized by computing the singular value decomposition where missing values have been filled in using the sample mean. The algorithm then alternates between minimizing
C = \sum_{i,j} ( 1 - M_{ij} ) \left( X_{ij} - \sum_{k} t_{ik} p_{jk} \right)^2
with either fixed scores T or fixed loadings P, where M_{ij} = 1 if X_{ij} is missing and zero otherwise. The first set of update equations is
t_i^{\top} = x_i^{\top} A_i ( A_i^{\top} A_i )^{-1}
where t_i is the ith column of T, x_i is the ith column of X, and A_i is a d × a matrix with elements A_{jk} = (1 − M_{ij}) p_{jk}. The second set of update equations is
p_j = ( B_j^{\top} B_j )^{-1} B_j^{\top} x_j
where p_j is the jth row of P, x_j is the jth row of X, and B_j is an n × a matrix with elements B_{ik} = t_{ik} (1 − M_{ij}). To address the estimation of μ, Grung and Manne [13] suggest augmenting the model with an additional loading vector with a corresponding principal component equal to all ones. This approach leverages the reconstruction-error derivation of the PCA problem and uses the change in the reconstruction error as the convergence criterion.
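A compact NumPy sketch of this alternating least-squares scheme is given below. As a simplification, the mean is estimated once from the observed entries rather than through the extra all-ones component suggested by Grung and Manne, so it should be read as an approximation of the approach.

```python
import numpy as np

def als_pca_missing(X, M, a, n_iter=500, tol=1e-6):
    """Alternating least-squares PCA for a d x n matrix X with mask M (1 = missing)."""
    d, n = X.shape
    obs = (M == 0)
    mu = np.nanmean(np.where(obs, X, np.nan), axis=1, keepdims=True)
    Xc = np.where(obs, X - mu, 0.0)              # centered; missing entries set to zero
    # initialize loadings from an SVD of the zero-filled (mean-imputed) matrix
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    P = U[:, :a]
    T = np.zeros((a, n))
    prev_err = np.inf
    for _ in range(n_iter):
        # scores with loadings fixed: per-sample least squares on the observed rows
        for i in range(n):
            Ai = P[obs[:, i], :]
            T[:, i] = np.linalg.lstsq(Ai, Xc[obs[:, i], i], rcond=None)[0]
        # loadings with scores fixed: per-measurement least squares on the observed samples
        for j in range(d):
            Bj = T[:, obs[j, :]].T
            P[j, :] = np.linalg.lstsq(Bj, Xc[j, obs[j, :]], rcond=None)[0]
        err = np.sum((Xc - P @ T)[obs] ** 2)     # reconstruction error on observed entries
        if abs(prev_err - err) < tol * max(prev_err, 1.0):
            break
        prev_err = err
    return P, T, mu
```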
Another approach is to start from the SVD derivation of PCA. The origin of this method is unclear, with Troyanskaya et al. [28] and Walczak and Massart [29] both studying alternating algorithms utilizing the SVD. The algorithm is initialized as before, using mean imputation. The singular value decomposition is then performed and the data matrix is reconstructed. The missing elements are replaced using the reconstructed elements and the algorithm continues until convergence. Convergence is again based on the reconstruction error of the observed data. This approach is referred to as SVDImpute here.
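A minimal sketch of SVDImpute, under the same conventions as above (d × n data matrix X, mask M with 1 marking missing entries), might look as follows.

```python
import numpy as np

def svd_impute(X, M, a, n_iter=500, tol=1e-6):
    """SVDImpute sketch: alternate a rank-a SVD reconstruction with replacement of the
    missing entries until the observed-data reconstruction error converges."""
    obs = (M == 0)
    mu = np.nanmean(np.where(obs, X, np.nan), axis=1, keepdims=True)
    Xf = np.where(obs, X, mu)                    # initialize missing entries with the row means
    prev_err = np.inf
    for _ in range(n_iter):
        mu = Xf.mean(axis=1, keepdims=True)
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        Xhat = U[:, :a] @ np.diag(s[:a]) @ Vt[:a, :] + mu   # rank-a reconstruction
        Xf = np.where(obs, X, Xhat)              # keep observed values, impute the rest
        err = np.sum((X - Xhat)[obs] ** 2)
        if abs(prev_err - err) < tol * max(prev_err, 1.0):
            break
        prev_err = err
    return Xf, U[:, :a], mu
```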
Imtiaz and Shah [9] alter SVDImpute to account for measurement error by combining the ideas of SVD-based imputation with bootstrap re-sampling, which is referred to as PCA-data augmentation (PCADA). In this approach, when replacing the missing elements with the reconstructions, the estimates are augmented with residuals from the observed data. The residuals are defined as
R_{ij} = X_{ij}^{obs} - \hat{X}_{ij}^{obs}
and the missing data estimates are
\tilde{X}_{ij}^{mis} = \hat{X}_{ij}^{mis} + R_{kj}
where k is a random integer between 1 and n. The reconstruction estimates using \tilde{X}_{ij}^{mis} are then used in the next iteration. To calculate the SVD, K bootstrap datasets are created by randomly drawing samples from the reconstructed data. The loading matrix is then calculated from
\tilde{P} = \frac{1}{K} \sum_{k=1}^{K} P_k
with \tilde{P} then used in the reconstruction step. Convergence is based on the reconstruction error of the observed data, which is not guaranteed to decrease at each iteration due to the stochastic nature of the algorithm.
Another approach to performing PCA in the presence of missing data utilizes the PPCA formulation. The EM framework is amenable to problems with missing data and the framework as applied to PPCA can be extended to account for missing observations [30]. In the E-step, the expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the unobserved variables given the observed variables. Two approaches to this expectation calculation have been proposed in the literature. Ilin and Raiko [31] propose using an element-wise version of PPCA and taking the expectation using T as the unknown variables, i.e., missing data, and P , μ , and σ 2 as the parameters. The resulting update equations are
\langle t_i \rangle = W_i^{-1} \sum_{j \in o_i} p_j ( x_{ij} - \mu_j ),
\langle t_i t_i^{\top} \rangle = \sigma^2 W_i^{-1} + \langle t_i \rangle \langle t_i \rangle^{\top},
where W_i = \sum_{j \in o_i} p_j p_j^{\top} + \sigma^2 I_a,
\mu_j = \frac{1}{\#(o_j)} \sum_{i \in o_j} \left( x_{ij} - p_j^{\top} \langle t_i \rangle \right),
p_j = \left[ \sum_{i \in o_j} \langle t_i t_i^{\top} \rangle \right]^{-1} \sum_{i \in o_j} \langle t_i \rangle ( x_{ij} - \mu_j ),
\sigma^2 = \frac{1}{\#(O)} \sum_{ij \in O} \left[ ( x_{ij} - p_j^{\top} \langle t_i \rangle - \mu_j )^2 + p_j^{\top} \sigma^2 W_i^{-1} p_j \right],
O = 1 − M is the observed-data indicator matrix, and #(·) represents the number of observed elements in the set. Alternatively, the unknown variables can be taken to be T and the missing elements of the data matrix X [32,33]. The resulting update equations are
\langle t_i \rangle = W_i^{-1} \sum_{j \in o_i} p_j ( x_{ij} - \mu_j )
\langle x_{ij} \rangle = \begin{cases} p_j^{\top} \langle t_i \rangle + \mu_j & \text{if } M_{ij} = 1 \\ x_{ij} & \text{if } M_{ij} = 0 \end{cases}
\langle t_i t_i^{\top} \rangle = \sigma^2 W_i^{-1} + \langle t_i \rangle \langle t_i \rangle^{\top}
\langle x_i x_i^{\top} \rangle_{jk} = \begin{cases} \sigma^2 ( p_j^{\top} W_i^{-1} p_k ) + \langle x_{ij} \rangle \langle x_{ik} \rangle & \text{if } M_{ij} = M_{ik} = 1, \; j \neq k \\ \sigma^2 ( 1 + p_j^{\top} W_i^{-1} p_k ) + \langle x_{ij} \rangle \langle x_{ik} \rangle & \text{if } M_{ij} = M_{ik} = 1, \; j = k \\ \langle x_{ij} \rangle x_{ik} & \text{if } M_{ij} = 1, \; M_{ik} = 0 \\ x_{ij} \langle x_{ik} \rangle & \text{if } M_{ij} = 0, \; M_{ik} = 1 \\ x_{ij} x_{ik} & \text{if } M_{ij} = M_{ik} = 0 \end{cases}
\langle x_i t_i^{\top} \rangle_{j} = \begin{cases} \sigma^2 p_j^{\top} W_i^{-1} + \langle x_{ij} \rangle \langle t_i \rangle^{\top} & \text{if } M_{ij} = 1 \\ x_{ij} \langle t_i \rangle^{\top} & \text{if } M_{ij} = 0 \end{cases}
where W_i = \sum_{j \in o_i} p_j p_j^{\top} + \sigma^2 I_a and
\mu = \frac{1}{n} \sum_{i=1}^{n} \left( \langle x_i \rangle - P \langle t_i \rangle \right)
P = \left[ \sum_{i=1}^{n} \left( \langle x_i t_i^{\top} \rangle - \mu \langle t_i \rangle^{\top} \right) \right] \left[ \sum_{i=1}^{n} \langle t_i t_i^{\top} \rangle \right]^{-1}
\sigma^2 = \frac{1}{nd} \sum_{i=1}^{n} \operatorname{tr}\!\left( \langle x_i x_i^{\top} \rangle - 2 \langle x_i t_i^{\top} \rangle P^{\top} - 2 \mu \langle x_i \rangle^{\top} + 2 \mu \langle t_i \rangle^{\top} P^{\top} + P \langle t_i t_i^{\top} \rangle P^{\top} + \mu \mu^{\top} \right).
Performing PPCA using this conditioning set is referred to here as PPCA-M.
Bayesian PCA (BPCA) is a variation on the PPCA approach [34]. A limitation of PPCA is that the method can be prone to overfitting [31], which BPCA attempts to prevent by using a prior distribution on the parameters. Conjugate priors are used for μ and σ 2 and a hierarchical prior is used for P . When the PPCA problem is modified in this way, the E-step no longer has a closed form and variational approaches are preferred [35]. Oba et al. [36] extended the BPCA method to cases with missing data.
The last approaches for PCA in the presence of missing data presented here are from the matrix completion literature. In matrix completion, sometimes also referred to as robust PCA, elements of a matrix are corrupted and the goal is to recover a low-rank reconstruction. If the corrupted elements are treated as missing, this is exactly the same problem as has been discussed; however, the problem is often framed directly as the optimization
\underset{A}{\text{minimize}} \;\; \| A \|_* , \quad \text{subject to} \;\; A_{ij} = X_{ij}, \; (i, j) \in O,
where ‖·‖_* denotes the nuclear norm of a matrix, which is the sum of the singular values of the matrix, X_{ij} are the observed elements in the data matrix, and O is the set of observed indices. An approach for solving this problem is singular value thresholding (SVT) [37], which solves
\underset{A}{\text{minimize}} \;\; \| A \|_* , \quad \text{subject to} \;\; \mathcal{P}_O (A) = \mathcal{P}_O (X),
where \mathcal{P}_O is the orthogonal projector onto the span of matrices vanishing outside of O. Cai et al. [37] propose an alternating algorithm that approximately solves this problem and results in a matrix that is sparse and low rank. A second approach for the matrix completion problem is the inexact augmented Lagrange multiplier method (ALM) [38], which solves
\underset{A}{\text{minimize}} \;\; \| A \|_* , \quad \text{subject to} \;\; A + E = X, \; \mathcal{P}_O (E) = 0,
where \mathcal{P}_O is a linear operator that is zero outside of O. ALM was proposed to solve the more general problem of a corrupted matrix without knowledge of which entries are corrupted, but it can also be applied in this setting.
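To illustrate the flavor of these matrix-completion methods, the following is a rough NumPy sketch of an SVT-style iteration; the threshold tau and step size delta are heuristic assumptions here, not the values recommended by Cai et al. [37].

```python
import numpy as np

def svt_complete(X, M, tau=None, delta=1.2, n_iter=500, tol=1e-6):
    """Singular value thresholding sketch for completing a d x n matrix X whose
    entries with M == 1 are missing."""
    obs = (M == 0)
    d, n = X.shape
    if tau is None:
        tau = 5.0 * np.sqrt(d * n)               # heuristic threshold (assumption)
    PX = np.where(obs, X, 0.0)                   # P_O(X): observed entries, zeros elsewhere
    Y = np.zeros_like(PX)                        # dual variable
    A = np.zeros_like(PX)
    for _ in range(n_iter):
        # shrinkage step: soft-threshold the singular values of Y
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        A = (U * np.maximum(s - tau, 0.0)) @ Vt
        # dual update on the observed entries only
        residual = np.where(obs, X - A, 0.0)
        Y = Y + delta * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(PX):
            break
    return A
```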

3. Case Study

The performance of the different techniques is compared in several case studies. Two types of simulations are considered: one based on distributional assumptions and one based on a chemical process simulation.

3.1. Simulations of Gaussian Data

The design of the distributional-assumption simulations is based on the study by Ilin and Raiko [31] and uses data from multivariate Gaussian distributions. The distributional assumptions follow the development of the PPCA model. While data that exactly follow the model are idealized, the assumptions approximately hold for data that have been pre-processed using standard methods. That is, data that have been pre-processed by sub-sampling and z-scoring approximately follow independent and identically distributed multivariate Gaussian (symmetric) distributions. This type of pre-processing can introduce error in the presence of missing data, particularly if missingness is due to censoring. Therefore, this analysis provides a baseline of best-case results.
The loading matrix P is modeled using a random orthogonal matrix of size d × a where a = 4, and the columns of P are rescaled by 1, …, a. The mean μ is modeled using a standard normal distribution. Two scenarios are considered. In the first, n ≫ d; specifically, the dataset is n = 1000 samples from a 10-dimensional Gaussian distribution described by N(μ, PP^⊤ + σ²I_d) where σ² = 0.25. In the second scenario, the opposite case is considered, d > n, with n = 100 samples from a 200-dimensional Gaussian distribution described by N(μ, PP^⊤ + σ²I_d) where σ² = 0.25. For each of the scenarios, 20 simulations are used, each with four types of missingness, described below.
Ten PCA approaches were tested: mean imputation (MI), alternating least squares (ALS) as implemented by MATLAB’s pca command, alternating least squares (Alternating) as implemented by Ilin and Raiko [31], SVDImpute as implemented by Ilin and Raiko [31], PCADA as implemented by the authors, PPCA as implemented by MATLAB’s ppca command, PPCA-M as implemented by the authors, BPCA as implemented by Oba et al. [36], SVT as implemented by Cai et al. [37], and ALM as implemented by Lin et al. [38]. All approaches were implemented in MATLAB, used a convergence tolerance of 10⁻⁶, and were limited to 1000 iterations. Alternating, SVDImpute, PCADA, BPCA, SVT, and ALM use the relative change in the reconstruction error as the convergence criterion. ALS uses the relative change in the reconstruction error as well as the relative change in the parameters as the convergence criteria. PPCA and PPCA-M use the relative changes in the negative log-likelihood and parameters as the convergence criteria.
To evaluate performance, two metrics were used: the root mean square error (RMSE), and the subspace angle between the true and recovered principal component loadings. The RMSE is defined
\mathrm{RMSE} = \sqrt{ \frac{1}{nd} \sum_{i=1}^{n} \sum_{j=1}^{d} ( x_{ij} - \hat{x}_{ij} )^2 }
and is reported for only the missing data. The full definition of the subspace angle is provided in Appendix A. A subspace angle of 0 implies that the subspaces are dependent, which is the desired result here. The maximum value of the subspace angle is π/2. In all analyses, the subspace angle is calculated using the MATLAB function subspace.
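The two metrics can be computed as in the following sketch; the subspace-angle helper uses SciPy's subspace_angles as a stand-in for the MATLAB subspace function.

```python
import numpy as np
from scipy.linalg import subspace_angles

def rmse_missing(X_true, X_hat, M):
    """RMSE computed over the missing entries only (M == 1)."""
    mask = (M == 1)
    return np.sqrt(np.mean((X_true[mask] - X_hat[mask]) ** 2))

def largest_subspace_angle(P_true, P_hat):
    """Largest principal angle (in radians) between the true and recovered loading
    subspaces, analogous to MATLAB's subspace function."""
    return float(np.max(subspace_angles(P_true, P_hat)))
```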

3.2. Tennessee Eastman Problem

The Tennessee Eastman problem (TEP) is a benchmark dataset that models an industrial chemical process [39]. The benchmark contains datasets both under normal operation and during several process faults. The process consists of five major units: reactor, condenser, compressor, separator, and stripper. There are 8 components, 41 measured variables, and 11 manipulated variables. Several control structures have been proposed for plant-wide control of the TEP. The datasets can be found online [40] and utilize “control structure 2” as described by Lyman and Georgakis [41]. Unlike the Gaussian data simulations, the latent dimension a is unknown. To determine a, parallel analysis was used. Three missingness mechanisms were considered, as described below, and 20 simulations were used for each. The same 10 approaches for PCA as described above were implemented, with a small change to the mean imputation approach: because the data are collected in time, the last measurement before and the first measurement after the missing data point are averaged and used to fill in. The learned model is then used in two tasks: reconstruction of a test dataset and fault detection. For the fault detection problem, the Q statistic, defined as
Q = r^{\top} r, \qquad r = ( I_d - P P^{\top} ) x_i,
was used. The Q statistic, also known as the squared prediction error, has been well studied in the area of fault detection [11,42,43,44]. To determine the detection threshold, the tenth largest value of Q on the nominal test set was used [44].
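A minimal sketch of the Q-statistic computation and the threshold rule described above is given below; it assumes the test data have already been centered and scaled with the training statistics.

```python
import numpy as np

def q_statistic(X, P):
    """Q (squared prediction error) for each column of the d x n matrix X,
    given loadings P; X is assumed to be already centered and scaled."""
    d = X.shape[0]
    R = (np.eye(d) - P @ P.T) @ X      # residuals outside the model subspace
    return np.sum(R ** 2, axis=0)

def detection_threshold(Q_nominal, rank=10):
    """Threshold set to the tenth largest Q value on a nominal test set, as in the text."""
    return np.sort(Q_nominal)[-rank]

# Illustrative use: flag the first test sample whose Q exceeds the threshold
# Q_test = q_statistic(X_test, P); alarms = np.where(Q_test > threshold)[0]
```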
To evaluate the performance, three metrics were used: the RMSE on a held-out test set of nominal data, the detection time, and whether or not a false detection occurred. Two faults are chosen for analysis: Fault 1, which is a step change in the A/C feed ratio in stream 4, and Fault 13, which is a slow drift of the reaction kinetics. In both cases, the testing dataset is used and the faults are introduced at t = 160. The mean detection time is defined as the average detection time for all models in which the detection time is greater than 160, and the number of false detections is defined as the number of models where there is a detection before 160. For a given model, either a detection time or a false detection is recorded.

3.3. Addition of Missing Data

Four types of missingness were considered: random, sensor drop-out, multi-rate, and censoring. The types of missingness were chosen based on the authors’ experience with realizations of missing data in process datasets. Random, sensor drop-out, and multi-rate missingness are all MCAR but have different patterns: random exhibits no pattern, sensor drop-out is correlated in time, and multi-rate has a known frequency of missingness in time. Censored missingness is NMAR. Examples of the patterns are shown in Figure 1. In all cases, a full dataset is generated or obtained and measurements are removed to represent the missing-data mechanism. For instance, in the censoring case, a random set of variables is selected to be censored from above or below. The censoring level for each variable is then iteratively updated until the desired level of missingness is achieved. The location of the code used to introduce missingness can be found in the Supplementary Materials. Missing data are introduced at levels of 1, 5, 10, and 15% for the Gaussian datasets. The multi-rate pattern is not considered for the 1% missingness level for the Gaussian datasets. The TEP is naturally a multi-rate missing-data problem at a level of 21% [44]. TEP is individually combined with random, sensor drop-out, and censored missingness to total 25%.
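For illustration, the following sketch generates random (MCAR) and censoring (NMAR) masks; it is a simplified, one-shot version of the procedure described above (the study's actual code is linked in the Supplementary Materials), and the choice of which variables to censor is arbitrary here.

```python
import numpy as np

def mcar_mask(shape, frac, seed=0):
    """Random (MCAR) missingness: each entry is missing independently with probability frac."""
    rng = np.random.default_rng(seed)
    return (rng.random(shape) < frac).astype(int)

def censor_mask(X, frac, seed=0):
    """Censoring (NMAR) sketch: censor the largest values of a random subset of variables.
    Unlike the study's procedure, the cutoffs are set in one shot rather than iteratively."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    M = np.zeros((d, n), dtype=int)
    censored_vars = rng.choice(d, size=max(1, d // 2), replace=False)   # illustrative choice
    per_var_frac = frac * d / len(censored_vars)                        # spread the budget over chosen rows
    for j in censored_vars:
        cutoff = np.quantile(X[j, :], 1.0 - min(per_var_frac, 1.0))
        M[j, :] = (X[j, :] > cutoff).astype(int)
    return M
```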

3.4. Results

The results of the Gaussian simulations are shown in Figure 2, Figure 3 and Figure 4. SVDImpute and the probabilistic methods (PPCA, PPCA-M, and BPCA) performed the best overall. As the missingness level increased, the probabilistic models performed slightly better, except for SVDImpute performing better for censored data at low levels of missingness. PCADA never outperformed SVDImpute. ALS and Alternating both suffered from finding local optima and performed very poorly, as evidenced by the large standard deviations. ALM failed to converge in many cases, and sometimes in all cases, as in the d > n scenarios. The SVT approach fell in the middle, never outperforming the best approaches. For d > n, most approaches did only slightly better than mean imputation, whereas significant improvements were observed for n ≫ d, especially in the censoring case.
For the TEP, the results of the reconstruction task are shown in Figure 5. For all missingness types, ALS and SVDImpute performed well. ALM failed to converge, and Alternating and BPCA had poor results. PPCA, PPCA-M, and SVT performed moderately well, but were more affected by censoring than ALS and SVDImpute. The minimum, average, and maximum number of PCs used in the models, as determined by parallel analysis, can be found in Table 1. The numbers of PCs chosen by SVDImpute, PPCA, and BPCA were very consistent, whereas Alternating and PCADA had widely varying numbers of PCs. Across all methods, the amount of variability in the number of PCs is larger in the censoring case. The results of the fault detection task are in Table 2 and Table 3. For Fault 1, ALS and SVDImpute had the best performance overall, with low detection times and few false detections. MI performed well in terms of detection time but had many false detections. PCADA and BPCA performed the worst overall. For Fault 13, SVT performed the best in the random and drop-out cases, whereas SVDImpute performed the best for the censoring case. PCADA and BPCA again performed the worst overall. ALM was excluded from the analysis as no model was learned during the training phase.

4. Discussion

Overall, the best technique to apply PCA in the presence of missing data can depend on the scenario. Several criteria should be considered when choosing an approach, such as the amount of missing data, the missingness mechanism, and the available computational resources. The computational complexity per iteration for each of the algorithms can be found in Table 4, which should only be used as a guideline since the exact implementation will affect computational cost. For instance, SVT [37] and ALM [38] recommend using the Lanczos algorithm to compute the singular values. The Lanczos algorithm is iterative and has reported speed-up of 10× vs. traditional calculation of the full SVD. The Lanczos algorithm returns the singular values that are larger than a certain threshold, which works well in the SVT and ALM frameworks. On the other hand, Lin et al. [38] report that the full SVD computation is faster for scenarios where greater than 0.2 d of singular values are required. While experience indicates that a is significantly lower than d in applications, if no bound on a is known a priori, then the full SVD is typically calculated during procedures to select a, which impacts the computational cost. The probabilistic frameworks have the convenient relation that
\sigma^2_{\mathrm{ML}} = \frac{1}{d - a} \sum_{j=a+1}^{d} \lambda_j
which can be used to estimate the percent variance without calculating the full SVD. Another benefit of the probabilistic frameworks is that they are generative and therefore provide parameters for estimation. For all analysis, the test data have been treated as fully observed, which may not be true in practice as new data may be subject to the same type of missingness as the data used in model building. If the data are subject to NMAR missingness, these parameters may not be useful. Note also that the probabilistic approaches can have slow convergence.
The difference in the results of the two ALS approaches also highlights the importance of the exact implementation. Both methods use the same underlying algorithm but differ in the implementation of the update steps and convergence criteria. Empirically, this results in the Alternating algorithm finding local optima more often as the amount of missing data increases for the n ≫ d case and the ALS algorithm finding local optima more often for d > n.
It may be surprising that the robust PCA methods (SVT and ALM) did not perform better, but it is important to recognize that these methods were developed for cases with very low-rank solutions, a large number of missing values, and random missingness. These assumptions are well suited to some applications, such as computer vision and imaging, but do not necessarily fit the assumptions of missing data in process datasets. A benefit of SVT and ALM is that they can be applied to problems where the location of the corrupt (missing) data is unknown. In the event that additional information is known about the measurement error, methods such as maximum likelihood PCA (MLPCA) [45,46] or the heteroscedastic latent variable model (HLV) [47] can be applied to leverage that information. MLPCA is suited to scenarios where the error covariance matrix is known and the errors are correlated or uncorrelated. HLV is suited to scenarios where the measurement error evolves in time. Both algorithms can be applied to scenarios with missing data.
Without additional problem information, we recommend SVDImpute for performing PCA in the presence of missing data for industrial datasets. SVDImpute can be viewed as an implementation of EM [31]. In this view, the missing observations are treated as the unknown variables and P, μ, σ², and T are the model parameters. The corresponding cost function, keeping only the terms involving the parameters, is
C = -\frac{dn}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{ij \in O} ( x_{ij} - \hat{x}_{ij} )^2 - \frac{1}{2\sigma^2} \sum_{ij \in M} \left[ ( \bar{x}_{ij} - \hat{x}_{ij} )^2 + \sigma^2 \right]
where \bar{x}_{ij} are the imputed values from the SVD. This cost function forces the imputed terms to be near the observed terms, which helps to prevent overfitting [31]. A drawback of SVDImpute is that there are many possible reconstructions that will achieve the same result for the observed data, and different results for the missing data, which implies a dependence on the initial guess [31].
In the event that the testing data will also have missing elements, PPCA or PPCA-M is recommended. PPCA-M performs slightly better in the TEP but has higher storage costs during model training. Both result in generative parameters that can be used during the testing phase.
In summary, for missing data problems, the most important step is to determine why some data are missing. If censoring is occurring and not accounted for, the results will be biased. Approaches that incorporate understanding about the underlying mechanisms are likely to perform the best. Expectation maximization frameworks are an important tool in missing data problems and can be applied generally if distributional assumptions are made.

Supplementary Materials

The MATLAB software to add missingness to the datasets can be found at http://web.mit.edu/braatzgroup/links.html.

Acknowledgments

The Edwin R. Gilliland professorship is acknowledged for support.

Author Contributions

K.A.S., M.C.M., and R.D.B. conceived and designed the experiments and wrote the paper. K.A.S. performed the experiments.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

Abbreviations used in this article are:
ALM: Augmented Lagrange multipliers
BPCA: Bayesian PCA
EM: Expectation maximization
HLV: Heteroscedastic latent variable model
MAR: Missing at random
MCAR: Missing completely at random
MLPCA: Maximum likelihood PCA
NMAR: Not missing at random
PCA: Principal component analysis
PCADA: PCA-data augmentation
PPCA: Probabilistic PCA
RMSE: Root mean square error
SVD: Singular value decomposition
SVT: Singular value thresholding
TEP: Tennessee Eastman problem

Appendix A. Definition of the Subspace Angle

To compute the subspace angle between matrices A ∈ ℝ^{n×m} and B ∈ ℝ^{n×p}, where rank(A) ≥ rank(B), compute the orthonormal basis of each matrix using the singular value decomposition. Then compute the projection
P = B - A ( A^{\top} B ).
The subspace angle, θ , is defined by
\sin \theta = \min ( 1, \| P \| )
where ‖·‖ is the 2-norm. See [48] and [49] for additional information on subspace angles.

References

  1. MacGregor, J.F.; Kourti, T. Statistical process control of multivariate processes. Control Eng. Pract. 1995, 3, 403–414. [Google Scholar] [CrossRef]
  2. Dunia, R.; Qin, S.J.; Edgar, T.F.; McAvoy, T.J. Identification of faulty sensors using principal component analysis. AIChE J. 1996, 42, 2797–2812. [Google Scholar] [CrossRef]
  3. Liu, J. On-line soft sensor for polyethylene process with multiple production grades. Control Eng. Pract. 2007, 15, 769–778. [Google Scholar] [CrossRef]
  4. Kirdar, A.O.; Conner, J.S.; Baclaski, J.; Rathore, A.S. Application of multivariate analysis toward biotech processes: Case study of a cell-culture unit operation. Biotechnol. Prog. 2007, 23, 61–67. [Google Scholar] [CrossRef] [PubMed]
  5. Yu, H.; MacGregor, J.F. Multivariate image analysis and regression for prediction of coating content and distribution in the production of snack foods. Chemom. Intell. Lab. 2003, 67, 125–144. [Google Scholar] [CrossRef]
  6. Ku, W.; Storer, R.H.; Georgakis, C. Disturbance detection and isolation by dynamic principal component analysis. Chemom. Intell. Lab. 1995, 30, 179–196. [Google Scholar] [CrossRef]
  7. Nomikos, P.; MacGregor, J.F. Monitoring batch processes using multiway principal component analysis. AIChE J. 1994, 40, 1361–1375. [Google Scholar] [CrossRef]
  8. Nomikos, P.; MacGregor, J.F. Multivariate SPC charts for monitoring batch processes. Technometrics 1995, 37, 41–59. [Google Scholar] [CrossRef]
  9. Imtiaz, S.A.; Shah, S.L. Treatment of missing values in process data analysis. Can. J. Chem. Eng. 2008, 86, 838–858. [Google Scholar] [CrossRef]
  10. Christoffersson, A. The One Component Model with Incomplete Data. Ph.D. Thesis, Uppsala University, Uppsala, Sweden, 1970. [Google Scholar]
  11. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. 1987, 3, 37–52. [Google Scholar] [CrossRef]
  12. Nelson, P.R.C.; Taylor, P.A.; MacGregor, J.F. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemom. Intell. Lab. 1996, 35, 45–65. [Google Scholar] [CrossRef]
  13. Grung, B.; Manne, R. Missing values in principal component analysis. Chemom. Intell. Lab. 1998, 42, 125–139. [Google Scholar] [CrossRef]
  14. Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592. [Google Scholar] [CrossRef]
  15. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2002. [Google Scholar]
  16. Qin, S.J. Process data analytics in the era of big data. AIChE J. 2014, 60, 3092–3100. [Google Scholar] [CrossRef]
  17. Pearson, K. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  18. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441. [Google Scholar] [CrossRef]
  19. Jolliffe, I.T. Principal Component Analysis, 2nd ed.; Springer: New York, NY, USA, 2002. [Google Scholar]
  20. Tipping, M.E.; Bishop, C.M. Probabilistic Principal Component Analysis; Technical Report; Aston University: Birmingham, UK, 1997. [Google Scholar]
  21. Roweis, S. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 10; Jordan, M.I., Kearns, M.J., Solla, S.A., Eds.; MIT Press: Cambridge, MA, USA, 1998; pp. 626–632. [Google Scholar]
  22. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–38. [Google Scholar]
  23. Cattell, R.B. The scree test for the number of factors. Multivar. Behav. Res. 1966, 1, 245–276. [Google Scholar] [CrossRef] [PubMed]
  24. Horn, J.L. A rationale and test for the number of factors in factor analysis. Psychometrika 1965, 30, 179–185. [Google Scholar] [CrossRef] [PubMed]
  25. Donoho, D.L.; Gavish, M. The Optimal Hard Threshold for Singular Values Is 4/√3; Technical Report; Stanford University: Stanford, CA, USA, 2013. [Google Scholar]
  26. Schafer, J.L. Multiple imputation: A primer. Stat. Methods Med. Res. 1999, 8, 3–15. [Google Scholar] [CrossRef] [PubMed]
  27. Lee, J.H.; Dorsey, A.W. Monitoring of batch processes through state-space models. AIChE J. 2004, 50, 1198–1210. [Google Scholar] [CrossRef]
  28. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525. [Google Scholar] [CrossRef] [PubMed]
  29. Walczak, B.; Massart, D. Dealing with missing data: Part I. Chemom. Intell. Lab. 2001, 58, 29–42. [Google Scholar] [CrossRef]
  30. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B 1999, 61, 611–622. [Google Scholar] [CrossRef]
  31. Ilin, A.; Raiko, T. Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 2010, 11, 1957–2000. [Google Scholar]
  32. Marlin, B.M. Missing Data Problems in Machine Learning. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2008. [Google Scholar]
  33. Yu, L.; Snapp, R.R.; Ruiz, T.; Radermacher, M. Probabilistic principal component analysis with expectation maximization (PPCA-EM) facilitates volume classification and estimates the missing data. J. Struct. Biol. 2012, 171, 18–30. [Google Scholar] [CrossRef] [PubMed]
  34. Bishop, C.M. Variational principal components. In Proceedings of the 9th International Conference on Artificial Neural Networks, Edinburgh, UK, 1999; pp. 509–514. [Google Scholar]
  35. Neal, R.M.; Hinton, G.E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models; Jordan, M.I., Ed.; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1998; pp. 355–368. [Google Scholar]
  36. Oba, S.; Sato, M.; Takemasa, I.; Monden, M.; Matsubara, K.; Ishii, S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19, 2088–2096. [Google Scholar] [CrossRef] [PubMed]
  37. Cai, J.F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982. [Google Scholar] [CrossRef]
  38. Lin, Z.; Liu, R.; Su, Z. Linearized alternating direction method with adaptive penalty for low rank representation. In Advances in Neural Information Processing Systems; Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q., Eds.; MIT Press: Cambridge, MA, USA, 2011; pp. 612–620. [Google Scholar]
  39. Downs, J.J.; Vogel, E.F. A plant-wide industrial process control problem. Comput. Chem. Eng. 1993, 17, 245–255. [Google Scholar] [CrossRef]
  40. Russell, E.L.; Chiang, L.H.; Braatz, R.D. Tennessee Eastman Problem Simulation Data. Available online: http://web.mit.edu/braatzgroup/links.html (accessed on 12 April 2017).
  41. Lyman, P.R.; Georgakis, C. Plant-wide control of the Tennessee Eastman problem. Comput. Chem. Eng. 1995, 19, 321–331. [Google Scholar] [CrossRef]
  42. Jackson, J.E.; Mudholkar, G.S. Control procedures for residuals associated with principal component analysis. Technometrics 1979, 21, 341–349. [Google Scholar] [CrossRef]
  43. Kresta, J.V.; MacGregor, J.F.; Marlin, T.E. Multivariate statistical process monitoring of process operating performance. Can. J. Chem. Eng. 1991, 69, 35–47. [Google Scholar] [CrossRef]
  44. Russell, E.L.; Chiang, L.H.; Braatz, R.D. Data-Driven Methods for Fault Detection and Diagnosis in Chemical Processes; Springer: London, UK, 2000. [Google Scholar]
  45. Wentzell, P.D.; Andrews, D.T.; Hamilton, D.C.; Faber, K.; Kowalski, B.R. Maximum likelihood principal component analysis. J. Chemom. 1997, 11, 339–366. [Google Scholar] [CrossRef]
  46. Andrews, D.T.; Wentzell, P.D. Applications of maximum likelihood principal component analysis. Anal. Chim. Acta 1997, 350, 341–352. [Google Scholar] [CrossRef]
  47. Reis, M.S.; Saraiva, P.M. Heteroscedastic latent variable modelling with applications to multivariate statistical process control. Chemom. Intell. Lab. 2006, 80, 57–66. [Google Scholar] [CrossRef]
  48. Björck, A.; Golub, G.H. Numerical methods for computing angles between linear subspaces. Math. Comput. 1973, 27, 579–594. [Google Scholar] [CrossRef]
  49. Wedin, P. On angles between subspaces of a finite dimensional inner product space. In Matrix Pencils; Lecture Notes in Mathematics 973; Kagstrom, B., Ruhe, A., Eds.; Springer: Berlin, Germany, 1983; pp. 263–285. [Google Scholar]
Figure 1. Possible realizations of the investigated missingness mechanisms: (a) shows random missingness; (b) shows sensor failure which results in missingness that is correlated in time; (c) shows multi-rate data, and (d) shows censored data.
Figure 2. Average RMSE of the missing data with standard deviation for the Gaussian cases. In the d > n case, ALM never converged to a solution.
Figure 3. Average RMSE of the missing data with standard deviation for the Gaussian cases. In the d > n case, ALM never converged to a solution.
Figure 4. Average subspace angle of learned vs. true subspace with standard deviation for the Gaussian cases.
Figure 5. Average RMSE and standard deviation of the fully observed TEP test set. In all cases ALM failed to converge.
Table 1. The minimum, average, and maximum number of PCs chosen using parallel analysis for each method over 20 realizations of the missing data. Each missingness type is combined with the naturally arising multi-rate missingness to total 25% missing data. ALM never converged and therefore no results are reported.

            MI    ALS   Alt.  SVD.  PCADA  PPCA  PPCA-M  BPCA  SVT   ALM
Random
  Min       2     3     1     3     1      3     4       3     4     –
  Avg       2.95  3.2   4.15  3     2.55   3     4.3     3     4.95  –
  Max       3     4     7     3     4      3     5       3     5     –
Drop
  Min       1     3     1     3     1      3     3       3     4     –
  Avg       3.15  3.3   4.15  3     2.65   3     4.05    3     4.9   –
  Max       4     4     6     3     5      3     5       3     5     –
Censoring
  Min       1     3     1     2     1      2     1       2     1     –
  Avg       3     3.5   3.65  2.9   2.6    2.85  3.3     2.9   1.65  –
  Max       4     5     7     3     7      3     5       3     4     –
Table 2. The mean detection times for each of the methods and missingness types. Cases are marked by “–” where every trial resulted in a false detection (e.g., a detection prior to t = 160).
MI | ALS | Alt. | SVD. | PCADA | PPCA | PPCA-M | BPCA | SVT
Fault 1
  Random163.1163163163163.8163.1171.0
  Drop163163163163.7163.4170.5
  Censor163.1163.2163163.5163.2163.4
Fault 13
  Random182181.8210182180.3183.2174
  Drop182181.4181.3182.3179.3174.5
  Censor180.3181.9411184.9185189.7
Table 3. The number of false detections for each of the methods and missingness types.
MI | ALS | Alt. | SVD. | PCADA | PPCA | PPCA-M | BPCA | SVT
Fault 1
  Random001902020200
  Drop902002011201
  Censor5319320692020
Fault 13
  Random731912044200
  Drop1142052054200
  Censor1291982019172020
Table 4. The computational costs of each of the methods, where d is the number of measurements, n is the number of samples, a is the latent dimension, and k is the number of bootstrap samples.

ALS/Alternating/PPCA/BPCA: O(a²dn + a³n + a³d)
SVDImpute/SVT/ALM: O(min(nd², n²d))
PCADA: O(min(knd², kn²d))
PPCA-M: O(na³ + nda²)
