Considering the shortcomings of traditional methods, a Bayesian outlier detection scheme based on missing-data probability estimation is adopted here for plant-wide processes with multiple sampling rates. Because both the historical data and the online horizon data are multirate and incomplete, and the computational burden of plant-wide processes is high, the research covers four aspects: (1) to reduce complexity, variables with the same sampling period are grouped into a sub-block, and PCA is performed on each sub-block to generate monitoring evidence; (2) marginalization-based probability estimation of the realizations of the current incomplete evidence is carried out using historical multirate samples and online moving-horizon data; (3) the EM algorithm is used to estimate the likelihoods of the different pieces of evidence from the multirate historical data; and (4) the posterior probability of each process status for the current incomplete data is computed according to Bayes' theorem and the law of total probability.
3.1. Marginalization-Based Realization Estimation
For the multirate system, in order to guarantee a sufficient amount of modeling and monitoring data, this study probabilistically estimates the realizations of missing data by using complete and incomplete historical data, as well as the online moving horizon. Several variables need to be defined before the estimation is made:
(1) Evidence, $e$, which is usually the monitoring variable of a process. For a system with $M$ monitors, its evidence can be expressed as $e = (e^1, e^2, \ldots, e^M)$, where $e^i$ is the $i$th source with $k_i$ discrete values. Therefore, the collection of all possible evidence is $E = \{E_1, E_2, \ldots, E_K\}$, where $K = \prod_{i=1}^{M} k_i$. However, for a plant-wide system, the variables' measured values are continuous, and the number of process variables is large; thus, taking each process variable as a source of evidence may be beyond the capability of a normal computer. Therefore, suitable evidence should be designed first in this study. To reduce its size, the multi-block method is adopted to handle data and obtain suitable evidence. First, variables with the same sampling rate are placed into a sub-block, a PCA model is established for each sub-block, and the $T^2$ and $SPE$ statistics and control limits of each sub-block can be obtained. For PCA details, please refer to [23].
According to the statistics and control limits, the evidence can be generated as
$$e^b = \begin{cases} 0, & T_b^2 \leq T_{b,\mathrm{lim}}^2 \ \text{and} \ SPE_b \leq SPE_{b,\mathrm{lim}} \\ 1, & \text{otherwise} \end{cases} \quad (1)$$
where $b = 1, 2, \ldots, B$ denotes the block number; $e^b$ is the $b$th source in the evidence; $T_b^2$ and $T_{b,\mathrm{lim}}^2$ are the $T^2$ statistic and $T^2$ control limit for the $b$th sub-block; $SPE_b$ and $SPE_{b,\mathrm{lim}}$ are the $SPE$ statistic and $SPE$ control limit for the $b$th sub-block; and $e^b = 1$ indicates that the data of the $b$th sub-block is an outlier.
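The evidence-generation rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the helper name `block_evidence` and the numeric statistics are hypothetical.

```python
def block_evidence(t2, t2_lim, spe, spe_lim):
    """Map a sub-block's monitoring statistics to a binary evidence source:
    0 = normal (both statistics within their control limits), 1 = outlier."""
    return 0 if (t2 <= t2_lim and spe <= spe_lim) else 1

# Evidence vector for a plant divided into three sub-blocks (illustrative numbers).
stats = [(8.2, 12.5, 1.1, 3.0),   # block 1: both statistics in control
         (15.7, 12.5, 2.4, 3.0),  # block 2: T^2 violation
         (9.9, 12.5, 4.8, 3.0)]   # block 3: SPE violation
evidence = tuple(block_evidence(*s) for s in stats)
print(evidence)  # (0, 1, 1)
```

Each sub-block contributes one binary source, so a plant with $B$ sub-blocks yields an evidence vector of length $B$ rather than one source per raw variable.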
Note that the number of principal components, $l$, is selected by the cumulative percent variance (CPV). For the sub-block data, the covariance matrix of the data is calculated and the related eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ are sorted. The ratio between the first $l$ eigenvalues and the sum of all eigenvalues is defined as the CPV, which represents the proportion of the data variance explained by the first $l$ principal components. A reasonable $l$ is very important for PCA modeling. In this research, we choose the smallest $l$ whose CPV is greater than 85% as the number of modeling principal components, i.e., $\mathrm{CPV}(l) = \sum_{i=1}^{l} \lambda_i \big/ \sum_{i=1}^{n} \lambda_i \geq 85\%$.
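The CPV selection rule can be sketched as follows, assuming the eigenvalues of the sub-block covariance matrix have already been computed and sorted in descending order; the function name and the eigenvalue spectrum are illustrative.

```python
def select_components(eigvals, cpv_threshold=0.85):
    """Choose the smallest number of principal components l whose
    cumulative percent variance (CPV) reaches the threshold.
    eigvals must be sorted in descending order."""
    total = sum(eigvals)
    cum = 0.0
    for l, lam in enumerate(eigvals, start=1):
        cum += lam
        if cum / total >= cpv_threshold:
            return l
    return len(eigvals)

# Hypothetical eigenvalue spectrum for one sub-block.
eigvals = [5.0, 2.5, 1.2, 0.2, 0.1]
print(select_components(eigvals))  # 3  (CPV = 8.7/9.0 ~= 96.7%)
```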
(2) Process status, $S$, which is the internal state of the system. A system with $C$ possible internal states is denoted $S \in \{S_1, S_2, \ldots, S_C\}$. For the outlier detection problem, $C = 2$.
(3) History dataset, $D$, which is labeled data. A historical dataset with $N_h$ historical training samples can be expressed as $D = \{d_1, d_2, \ldots, d_{N_h}\}$, where $d_t$ contains the evidence and the process internal status at time $t$: $d_t = (e_t, S_t)$.
(4) Online horizon, $H$, which is defined as $H = \{h_{t-W}, \ldots, h_{t-1}\}$, where $h_{t-j}$ refers to the $j$th sample before the current sample, and $W$ is the length of the horizon. Each datum $h_{t-j}$ consists of an evidence vector $e_{t-j}$ and the posterior probability of $S$ under the evidence $e_{t-j}$.
Next, for a current evidence sample, $e_t$, with missing data, a marginalization-based solution is used for its realization probability estimation through its observed part, $e_t^o$, the historical training data, $D$, and the online moving-horizon data, $H$, which can be expressed as
$$p(E_k \mid e_t^o, D, H) = \int_{\Omega_\Theta} p(E_k \mid e_t^o, \theta)\, p(\theta \mid e_t^o, D, H)\, d\theta, \quad k = 1, 2, \ldots, K_o \quad (2)$$
where $e_t^o$ and $e_t^m$ denote the observed and missing parts of $e_t$, $\theta_k = p(E_k \mid e_t^o, \theta)$, $K_o$ is the number of the possible realizations of $e_t$, and $\Omega_\Theta$ is the space of all possible parameters in $\theta$. For instance, for a three-dimensional system in which each source takes two possible discrete values, assuming the current incomplete evidence is $e_t = (1, *, *)$, where $*$ denotes a missing reading, then its possible evidence set is $\{(1,0,0), (1,0,1), (1,1,0), (1,1,1)\}$, and the number of the possible realizations, $K_o$, is 4.
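The enumeration in this example can be reproduced with a short sketch; `realizations` is an illustrative helper, with `None` standing in for a missing source.

```python
from itertools import product

def realizations(observed):
    """Enumerate all possible realizations of a partially observed binary
    evidence vector; None marks a missing source."""
    slots = [[v] if v is not None else [0, 1] for v in observed]
    return [tuple(r) for r in product(*slots)]

e_t = (1, None, None)   # first source observed as 1, two sources missing
R = realizations(e_t)
print(R)        # [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
print(len(R))   # 4
```

In general, an evidence vector with $q$ missing binary sources has $K_o = 2^q$ possible realizations, which is why the evidence must be kept compact via the multi-block design.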
In Equation (2), $p(\theta \mid e_t^o, D, H)$ can be calculated by Bayes' rule as
$$p(\theta \mid e_t^o, D, H) = \frac{p(H \mid \theta, e_t^o, D)\, p(\theta \mid e_t^o, D)}{\int_{\Omega_\Theta} p(H \mid \theta, e_t^o, D)\, p(\theta \mid e_t^o, D)\, d\theta} \quad (3)$$
where the denominator is obtained by integrating the numerator over the parameter space $\Omega_\Theta$.
The Dirichlet distribution with Dirichlet parameters $a_1, a_2, \ldots, a_{K_o}$ is commonly used to estimate the prior probability of $\theta$:
$$p(\theta \mid e_t^o, D) = \frac{\Gamma\!\left(\sum_{k=1}^{K_o} a_k\right)}{\prod_{k=1}^{K_o} \Gamma(a_k)} \prod_{k=1}^{K_o} \theta_k^{a_k - 1} \quad (4)$$
where $\Gamma(\cdot)$ is the Gamma function, and $a_k$ is the number of prior samples for the possible realization, $E_k$, given the observed part, $e_t^o$, which is calculated through the historical training data, $D$.
The likelihood of the samples that are possible realizations of $e_t$ in the horizon can be written as
$$p(H \mid \theta, e_t^o, D) = \prod_{j=1}^{W} p(h_{t-j} \mid \theta, e_t^o, D) \quad (5)$$
$$p(H \mid \theta, e_t^o, D) \propto \prod_{k=1}^{K_o} \theta_k^{n_k} \quad (6)$$
where $n_k$ is the expected number of samples that are possible realizations of $E_k$ in the horizon, and $W$ is the number of samples in the moving horizon.
Taking Equations (3)–(6) into Equation (2) gives
$$p(E_k \mid e_t^o, D, H) = \int_{\Omega_\Theta} \theta_k \, \mathrm{Dir}\!\left(\theta \mid a_1 + n_1, \ldots, a_{K_o} + n_{K_o}\right) d\theta \quad (7)$$
Using the same derivation procedures as in [20], the realization probabilities can be achieved as [22]
$$p(E_k \mid e_t^o, D, H) = \frac{a_k + n_k}{\sum_{j=1}^{K_o} \left(a_j + n_j\right)} \quad (8)$$
where $a_k$ is the total number of samples that are possible realizations of $E_k$ in the historical data.
Generally, to reflect the prior knowledge, a number of prior samples should be added to the historical data; that is, $a_k = b_k + m_k$, where $b_k$ is the number of prior hypothetical samples that are possible realizations of $E_k$, and $m_k$ is the number of samples that are possible realizations of $E_k$ in the historical data.
Then, the realization probability is re-expressed as
$$p(E_k \mid e_t^o, D, H) = \frac{b_k + m_k + n_k}{\sum_{j=1}^{K_o} \left(b_j + m_j + n_j\right)} \quad (9)$$
where $B = \sum_{k=1}^{K_o} b_k$ is the total number of prior samples over the possible realizations of $e_t$.
To calculate the realization probability in Equation (9), the following should be performed:
$$m_{k,c} = N_{S_c}\, p(E_k \mid S_c) \quad (10)$$
$$m_k = \sum_{c=1}^{C} m_{k,c} \quad (11)$$
where $N_{S_c}$ is the number of samples with process status $S_c$ in the historical dataset $D$, $p(E_k \mid S_c)$ is the likelihood, which will be introduced in the next section, and $m_{k,c}$ is the expected number of samples that are realizations of $E_k$ under status $S_c$.
For the prior information, the uniform prior is employed; therefore,
$$b_k = \frac{B}{K_o} \quad (12)$$
$$a_k = b_k + m_k \quad (13)$$
where $B$ is the total number of prior samples.
Then,
$$p_{t-j}(E_k) = p(E_k \mid e_{t-j}^o, D, H) \quad (14)$$
$$n_k = \sum_{j=1}^{W} p_{t-j}(E_k) \quad (15)$$
$$p_{t-j}(E_k) = 0, \ \text{if } E_k \ \text{is not a possible realization of} \ e_{t-j} \quad (16)$$
where $h_{t-j}$ is the $j$th sample in the online moving horizon, $W$ is the length of the horizon, and $p_{t-j}(E_k)$ is the recorded realization probability of sample $h_{t-j}$.
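The computation behind Equation (9), combining the prior, historical, and horizon counts described by Equations (10)–(16), can be sketched as follows. All numeric values (status counts, likelihoods, horizon soft counts) are hypothetical, and `realization_probability` is an illustrative helper, not the authors' code.

```python
def realization_probability(b, m, n):
    """Eq. (9)-style estimate: combine prior pseudo-counts b_k, expected
    historical counts m_k, and horizon soft counts n_k, then normalize."""
    totals = [bk + mk + nk for bk, mk, nk in zip(b, m, n)]
    s = sum(totals)
    return [t / s for t in totals]

K_o = 4
B = 2.0                       # total number of prior samples
b = [B / K_o] * K_o           # uniform prior: b_k = B / K_o
N_S = [90, 10]                # historical samples per status (normal, outlier)
# hypothetical likelihoods p(E_k | S_c) for the 4 realizations
theta = [[0.4, 0.3, 0.2, 0.1],
         [0.1, 0.2, 0.3, 0.4]]
# expected historical counts m_k = sum_c N_{S_c} * p(E_k | S_c)
m = [sum(N_S[c] * theta[c][k] for c in range(2)) for k in range(K_o)]
n = [1.2, 0.5, 0.2, 0.1]      # soft counts from the moving horizon
p = realization_probability(b, m, n)
print([round(x, 4) for x in p])
```

The normalization guarantees that the estimated realization probabilities sum to one over the $K_o$ candidate realizations.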
In summary, the realization probability is calculated through historical data, online horizon data, and prior knowledge.
3.2. Expectation–Maximization-Based Likelihood Probability Estimation
Here, we introduce the estimation method for the likelihood probability of different evidence under different process statuses. Considering that the historical data is sampled at multiple rates and contains missing entries, the expectation–maximization (EM) method for multiple missing data patterns proposed in [21] is adopted for likelihood estimation here.
First, the EM algorithm with missing data is introduced; it iteratively alternates between the expectation step (E-step) and the maximization step (M-step) to find the maximum likelihood estimate of the parameters of interest.
In the E-step, the expected value of the log-likelihood function (Q-function) is built by using the previously estimated parameter,
$$Q(\Theta \mid \Theta^{(s)}) = E_{X^m \mid X^o, \Theta^{(s)}}\!\left[\ln p(X^o, X^m \mid \Theta)\right] \quad (17)$$
where $\Theta$ is the parameter set to be estimated; $\Theta^{(s)}$ is the estimation result in the previous step; $X^m$ is the unobserved data; and $X^o$ is the observed dataset.
In the M-step, the new estimation of the parameter set is obtained by maximizing the Q-function obtained from the E-step:
$$\Theta^{(s+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(s)}) \quad (18)$$
This iteration continues until some stop criterion is satisfied.
Next, we describe how to use the EM algorithm to solve the outlier identification problem. The likelihood probability is denoted $\theta_{k|c} = p(E_k \mid S_c)$, which is interpreted as the probability of evidence $E_k$ under process status $S_c$, and $\Theta_c = \{\theta_{1|c}, \theta_{2|c}, \ldots, \theta_{K|c}\}$ is the likelihood probability set for process status $S_c$. As for the outlier detection problem, there are two process statuses: normal and outlier. The optimized parameter set of all process statuses is $\Theta = \{\Theta_1, \Theta_2\}$.
The likelihood probability set $\Theta_c$ for $S_c$ is estimated first. Since the process involves multiple sampling rates, the monitoring data subset $D_c$ of $D$ contains the complete part $D_c^o$ and the incomplete part $D_c^m$; i.e., $D_c = D_c^o \cup D_c^m$, and
$$D_c^o = \{e_1, e_2, \ldots, e_{N_c^o}\} \quad (19)$$
$$D_c^m = \{\tilde{e}_1, \tilde{e}_2, \ldots, \tilde{e}_{N_c^m}\} \quad (20)$$
where $e_i$ is the complete evidence, $N_c^o$ is the total number of complete evidence, $\tilde{e}_i$ is the incomplete evidence, and $N_c^m$ is the number of incomplete evidence in $D_c$.
Given that the data are independent, the probabilities of the complete part and the incomplete part under the current likelihood probability parameter set can be expressed, respectively, as
$$p(D_c^o \mid \Theta_c) = \prod_{i=1}^{N_c^o} p(e_i \mid \Theta_c) \quad (21)$$
$$p(D_c^m \mid \Theta_c) = \prod_{i=1}^{N_c^m} p(\tilde{e}_i \mid \Theta_c) \quad (22)$$
The total likelihood function for $D_c$ can be calculated as
$$p(D_c \mid \Theta_c) = p(D_c^o \mid \Theta_c)\, p(D_c^m \mid \Theta_c) \quad (23)$$
Moreover, since the incomplete data entries of $\tilde{e}_i$ can be further partitioned into the monitoring part, $\tilde{e}_i^o$, and the missing part, $\tilde{e}_i^m$, we have
$$p(\tilde{e}_i \mid \Theta_c) = \sum_{\tilde{e}_i^m \in \Omega_i} p(\tilde{e}_i^o, \tilde{e}_i^m \mid \Theta_c) \quad (24)$$
Taking Equation (24) into Equation (17), the Q-function is denoted as
$$Q(\Theta_c \mid \Theta_c^{(s)}) = \sum_{i=1}^{N_c^o} \ln p(e_i \mid \Theta_c) + \sum_{i=1}^{N_c^m} \sum_{\tilde{e}_i^m \in \Omega_i} p(\tilde{e}_i^m \mid \tilde{e}_i^o, \Theta_c^{(s)}) \ln p(\tilde{e}_i^o, \tilde{e}_i^m \mid \Theta_c) \quad (25)$$
where $\Omega_i$ is the space of all possible values of the realization $\tilde{e}_i^m$.
Following the derivations of [21], the Q-function can be expressed as
$$Q(\Theta_c \mid \Theta_c^{(s)}) = \sum_{k=1}^{K} \left(n_k^o + \hat{n}_k^m\right) \ln \theta_{k|c} \quad (26)$$
where $\theta_{k|c} = p(E_k \mid S_c)$, $n_k^o$ is the number of evidence $E_k$ in the complete dataset, and $\hat{n}_k^m$ is the estimated amount of evidence $E_k$ in the incomplete dataset under the likelihood parameter set $\Theta_c^{(s)}$.
Considering that $\sum_{k=1}^{K} \theta_{k|c} = 1$, Equation (26) can be expanded as
$$Q(\Theta_c \mid \Theta_c^{(s)}) = \sum_{k=1}^{K} \left(n_k^o + \hat{n}_k^m\right) \ln \theta_{k|c} + \lambda \left(1 - \sum_{k=1}^{K} \theta_{k|c}\right) \quad (27)$$
where $\lambda$ is a Lagrange multiplier and
$$\hat{n}_k^m = \sum_{i=1}^{N_c^m} p(E_k \mid \tilde{e}_i^o, \Theta_c^{(s)}) \quad (28)$$
By taking the first derivative with respect to $\theta_{k|c}$ and setting it to zero to obtain the maximum value of the Q-function, the estimation of $\theta_{k|c}$ is achieved as
$$\theta_{k|c}^{(s+1)} = \frac{n_k^o + \hat{n}_k^m}{\sum_{j=1}^{K} \left(n_j^o + \hat{n}_j^m\right)} \quad (29)$$
Moreover, the initial conditions are set through the available complete evidence, $\theta_{k|c}^{(0)} = n_k^o \big/ \sum_{j=1}^{K} n_j^o$. Finally, the process is repeated until the parameters converge.
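The EM iteration described by Equations (28) and (29) can be sketched in pure Python for one process status. This is a minimal illustration under simplifying assumptions: binary evidence sources, `None` marking a missing source, and toy samples; the helper name `em_likelihood` is ours, not the authors'.

```python
from itertools import product

def em_likelihood(complete, incomplete, iters=50, tol=1e-8):
    """EM sketch for one process status: estimate theta_k = p(E_k | S_c)
    over all binary evidence realizations, using complete samples and
    incomplete samples (None = missing source)."""
    M = len(complete[0])
    E = [tuple(r) for r in product([0, 1], repeat=M)]   # evidence space
    idx = {e: k for k, e in enumerate(E)}
    n_o = [0.0] * len(E)                                # complete counts n_k^o
    for e in complete:
        n_o[idx[e]] += 1
    theta = [c / len(complete) for c in n_o]            # initial value
    for _ in range(iters):
        # E-step: expected counts of each E_k from the incomplete samples
        n_m = [0.0] * len(E)
        for obs in incomplete:
            match = [k for k, e in enumerate(E)
                     if all(o is None or o == v for o, v in zip(obs, e))]
            z = sum(theta[k] for k in match)
            if z > 0:
                for k in match:
                    n_m[k] += theta[k] / z
        # M-step: renormalize the combined counts
        tot = sum(n_o) + sum(n_m)
        new = [(n_o[k] + n_m[k]) / tot for k in range(len(E))]
        if max(abs(a - b) for a, b in zip(new, theta)) < tol:
            theta = new
            break
        theta = new
    return E, theta

complete = [(0, 0), (0, 0), (0, 1), (1, 1)]     # fully observed evidence
incomplete = [(0, None), (None, 1)]             # partially observed evidence
E, theta = em_likelihood(complete, incomplete)
print([round(t, 4) for t in theta])
```

Each incomplete sample spreads one unit of count over the realizations consistent with its observed part, in proportion to the current likelihood estimate, which is exactly the role of the expected counts in the M-step update.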
3.3. Bayesian and Full Probability-Based Outlier Detection
Based on the likelihood probability of each evidence under the different process statuses, given the current evidence $E_k$ and the historical evidence data $D$, the Bayesian strategy is adopted to infer the posterior probability of each possible process status,
$$p(S_c \mid E_k, D) = \frac{p(E_k \mid S_c)\, p(S_c)}{\sum_{j=1}^{C} p(E_k \mid S_j)\, p(S_j)} \quad (30)$$
where $p(E_k \mid S_c)$ is the likelihood probability, $p(S_c)$ is the prior probability of process status $S_c$, and $p(S_c \mid E_k, D)$ is the posterior probability of $S_c$ under the current evidence $E_k$ and historical database $D$. The process status with a large posterior probability is considered to be the probable internal process status.
Then, according to the realization probability of the unavailable monitors' readings and the posterior probability of each possible process status under each realization, the outlier probability of the incomplete evidence can be calculated by the full probability method as
$$p(S_c \mid e_t^o, D, H) = \sum_{k=1}^{K_o} p(S_c \mid E_k, D)\, p(E_k \mid e_t^o, D, H) \quad (31)$$
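The combination of the Bayesian inference of Equation (30) with the total-probability mixing of Equation (31) can be sketched as follows; the likelihoods, priors, and realization probabilities are hypothetical values, and both helper names are ours.

```python
def posterior(theta, prior):
    """Eq. (30)-style posterior p(S_c | E_k) for every realization E_k,
    from likelihoods theta[c][k] = p(E_k | S_c) and priors p(S_c)."""
    K = len(theta[0])
    post = []
    for k in range(K):
        joint = [theta[c][k] * prior[c] for c in range(len(prior))]
        z = sum(joint)
        post.append([j / z for j in joint])
    return post

def outlier_probability(post, p_real, outlier=1):
    """Eq. (31)-style total-probability mix of the per-realization
    posteriors, weighted by the realization probabilities."""
    return sum(post[k][outlier] * p_real[k] for k in range(len(p_real)))

# hypothetical likelihoods for 4 realizations under (normal, outlier)
theta = [[0.4, 0.3, 0.2, 0.1],
         [0.1, 0.2, 0.3, 0.4]]
prior = [0.9, 0.1]                 # prior status probabilities
p_real = [0.4, 0.3, 0.2, 0.1]      # realization probabilities from Eq. (9)
post = posterior(theta, prior)
print(round(outlier_probability(post, p_real), 4))
```

Note that the posteriors of Equation (30) can be tabulated offline for every complete evidence vector, so the online step reduces to the weighted sum above.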
Overall, the outlier probability estimation for a missing data point in a multirate plant-wide process is executed in four phases.
Phase 1: Likelihood probability estimation, which is performed offline:
(1) Calculate $n_k^o$ using the complete dataset of each process status.
(2) Use the complete historical data of each process status to calculate the initial value $\Theta_c^{(0)}$, which is set as $\theta_{k|c}^{(0)} = n_k^o \big/ \sum_{j=1}^{K} n_j^o$.
(3) According to $\Theta_c^{(s)}$, obtain $\hat{n}_k^m$ through the incomplete dataset based on Equation (28).
(4) Calculate the new likelihood $\Theta_c^{(s+1)}$ by using $n_k^o$ and $\hat{n}_k^m$ based on Equation (29).
(5) Check whether the terminating conditions are satisfied; if so, record the final likelihood probability. Otherwise, set $\Theta_c^{(s+1)}$ as $\Theta_c^{(s)}$ and repeat steps (3)–(5).
Phase 2: Offline posterior probability estimation:
(1) Based on the offline likelihood probability, the posterior probability of each possible process status under each evidence is obtained by Equation (30).
Phase 3: Realization probability estimation, which is performed online:
(1) For the observed part of the current incomplete sample, calculate $m_k$, $b_k$, and $a_k$ according to Equations (10)–(13).
(2) Obtain $p_{t-j}(E_k)$ and $n_k$ according to the online moving horizon and Equations (14)–(16).
(3) Obtain the realization probability based on Equation (9).
Phase 4: Online full probability estimation:
Using the realization probability of Equation (9) and the posterior probability of each possible process status of Equation (30), the outlier probability of an incomplete evidence can be calculated through Equation (31).
The details are also illustrated in Figure 2.