## *2.2. Data Pre-Treatment*

Data pre-treatment comprised the following steps: reading the Excel Workbook files in UTF-8 encoding; pre-treatment of the raw data spreadsheets; and data storage in Hierarchical Data Format version 5 (HDF5).

During real-time monitoring, the data reading step was performed as described in the previous section, as was data storage in HDF5 format. When the file is saved in HDF5 format, reading performance improves considerably, since reading data directly from the Excel spreadsheet can be too slow for real-time applications [28,29].
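As an illustration, the following minimal sketch shows this read-and-store pattern with the Pandas library; the file names are hypothetical, and the HDF5 writer assumes the PyTables backend is installed:

```python
import pandas as pd

# Hypothetical file names; the actual plant files are not named in the text.
df = pd.read_excel("pi_export.xlsx", sheet_name=0)  # slow: full workbook parse

# Persist once to HDF5; later reads are much faster than re-parsing Excel.
df.to_hdf("pi_export.h5", key="raw", mode="w")
df_fast = pd.read_hdf("pi_export.h5", key="raw")    # fast binary read
```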

The pre-treatment stage organized the raw data from the PI into data frames (Pandas library) [26,27]. The treatment comprised the following steps: standardizing the indices (day/hour/minute of the samples); replacing string entries (such as on/off by 0/1 and error notices by NaN, not a number); sorting the data chronologically; and replacing missing data (NaN) with neighboring values (backward and forward fill). This last step is crucial to ensure that the acquired data window does not contain missing data, which can make the calculation of variances in the acquisition window difficult. Unit conversions, normalization of concentrations, and calculation of standard deviations and variances completed this step.
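A minimal sketch of such a pre-treatment routine, assuming the raw PI export is already loaded as a Pandas data frame with a timestamp-like index, could look as follows:

```python
import pandas as pd

def pretreat(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Standardize the index as timestamps (day/hour/minute of the samples).
    df.index = pd.to_datetime(df.index)
    # Replace string entries: on/off flags become 0/1.
    df = df.replace({"on": 1, "off": 0})
    # Remaining strings (e.g., error notices) become NaN.
    df = df.apply(pd.to_numeric, errors="coerce")
    # Sort chronologically, then fill gaps with neighbors (back/forward fill).
    df = df.sort_index().bfill().ffill()
    return df
```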

Data storage must be carried out after data reconciliation and energy balance calculations. The file was saved with the measured, reconciled, estimated, and calculated variables. As already explained, storage was performed during monitoring, to avoid accumulation of data in the computer memory and save the relevant information in real time.

## *2.3. Data Characterization*

During this step, the procedures that must precede the data reconciliation task, such as the visualization and statistical characterization of the data, must be carried out. The characterization of the data was performed in accordance with the following steps:

- analysis of missing data;
- boxplot analysis of the measured variables;
- analysis of the variance spectra;
- analysis of correlations, autocorrelations, and cross-correlations between pairs of variables of interest.
The main objectives pursued during this stage were the proper characterization of the data quality, the analysis of the operation dynamics, and the characterization of the stationarity of the phenomenological process model. The analyses of missing data and boxplot properties can indicate the quality of the data collected during the selected time period and the number of gross errors in the database.

The boxplot is a graph used to assess the empirical distribution of data. The boxplot is a non-parametric analytical technique, which shows the measurement variations within a statistical population without making any assumption about the underlying statistical distribution. The box is usually built with the first (*Q*1) and third (*Q*3) quartiles (containing 50% of the data) and the median (*Q*2). The lower and upper whiskers extend, respectively, from the lower quartile to the lowest value not lower than the lower limit (*LL*), and from the upper quartile to the highest value not higher than the upper limit (*UL*). The limits can be calculated according to Equations (1) and (2):

$$LL = \max\left[\min(data), \quad Q\_1 - 1.5(Q\_3 - Q\_1)\right] \tag{1}$$

$$UL = \min\left[\max(data), \quad Q\_3 + 1.5(Q\_3 - Q\_1)\right] \tag{2}$$

The value 1.5 is tuned to capture 99.7% of the data between the lower and upper limits, assuming the normal distribution [30]. In summary, the boxplot identifies the regions in the variable domain where 50% and 99.7% of the data are located. The points outside these limits are tagged as outliers. Boxplots were plotted using the Seaborn library available for Python [32].
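For reference, a boxplot of a single variable can be produced with Seaborn as sketched below; the column name is hypothetical, and Seaborn's default whisker factor of 1.5 matches Equations (1) and (2):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical column name; any measured flowrate or composition works here.
sns.boxplot(y=df["F"])   # box: Q1-Q3; whiskers: 1.5*(Q3 - Q1) by default
plt.ylabel("Total Feed [kNm3/h]")
plt.show()
```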

The analysis of the variance spectra shows how the process variance depends on the size of the sampling window. This type of spectrum provides information about the various sources that contribute to the signal of a variable, including noise/measurement errors (short window sizes) and intrinsic process variations (large window sizes). The variance spectrum can be defined as a set of variances calculated while some variable related to them evolves [31]. The spectrum of variances for short sampling windows is controlled by the variance of the measuring instrument. This way, the best estimate for the variance of the measuring device can be calculated from the variance spectrum at short sampling windows. However, the use of very short sampling windows may not reveal the actual variability of the data, due to poor measurement quality. The spectrum for sufficiently large sampling windows captures the variability of the entire process, including operational changes. More details about the usefulness of this technique for characterization of process data are provided by Feital and Pinto [31].
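A simple way to compute such a spectrum, under the assumption that the signal is split into consecutive non-overlapping windows of increasing size, is sketched below:

```python
import numpy as np

def variance_spectrum(x: np.ndarray, window_sizes) -> dict:
    """Mean sample variance as a function of the sampling-window size."""
    spectrum = {}
    for w in window_sizes:  # assumed w >= 2 for a sample variance
        # Split the signal into consecutive non-overlapping windows of size w.
        n = (len(x) // w) * w
        blocks = x[:n].reshape(-1, w)
        spectrum[w] = blocks.var(axis=1, ddof=1).mean()
    return spectrum
```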

Analyzing the correlations, autocorrelations, and cross-correlations between pairs of variables of interest allows the characterization of stationarity, seasonality, and regions of operation of the process [33,34]. Observing correlations between input and output flows can indicate process stationarity. The absence of correlation can indicate the occurrence of significant nonlinearity or dynamics in the process response [35]. For dynamic responses, cross-correlation can capture the lags between actions on the input variables and the resulting responses of the output variables. Obviously, the identification of dynamic correlations among the many variables of the system may indicate the necessity to build and implement dynamic models for more accurate representation of the available data. For this reason, proper characterization of stationarity can be important for more successful implementations of monitoring procedures [36].
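A lagged cross-correlation between an input and an output series can be estimated as sketched below; positive lags correlate the input with later values of the output:

```python
import numpy as np

def cross_correlation(u: np.ndarray, y: np.ndarray, max_lag: int) -> dict:
    """Normalized cross-correlation between input u and output y per lag."""
    u = (u - u.mean()) / u.std()
    y = (y - y.mean()) / y.std()
    # For lag k, correlate u[t] with y[t + k].
    return {k: np.corrcoef(u[max(0, -k):len(u) - max(0, k)],
                           y[max(0, k):len(y) - max(0, -k)])[0, 1]
            for k in range(-max_lag, max_lag + 1)}
```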

## *2.4. Data Reconciliation*

The DR procedure consists of solving an optimization problem characterized by an Objective Function (OF) that must be minimized while respecting certain constraints (the process model). The OF of the DR problem is often proposed as the maximum likelihood estimator derived from a statistical distribution of the measurement errors, which is commonly adopted as the normal distribution. After application of the principle of maximum likelihood, the normal distribution results in the Weighted Least Squares (WLS) estimator. This way, the problem originally formulated by Kuehn and Davidson [18] can be written in accordance with Equations (3)–(7) [17]:

$$\underline{\hat{z}} = \arg\min\_{\underline{\hat{z}},\ \underline{u}} \frac{1}{2} \left[ \underline{\hat{z}} - \underline{z} \right]^{T} \underline{\underline{V}}^{-1} \left[ \underline{\hat{z}} - \underline{z} \right] \tag{3}$$

subject to:

$$\underline{h}(\underline{\hat{z}}, \underline{u}) = \underline{0} \tag{4}$$

$$\underline{g}(\underline{\hat{z}}, \underline{u}) \ge \underline{0} \tag{5}$$

$$\underline{\hat{z}}^{L} \le \underline{\hat{z}} \le \underline{\hat{z}}^{U} \tag{6}$$

$$\underline{u}^{L} \le \underline{u} \le \underline{u}^{U} \tag{7}$$

where *z*ˆ is the vector of the reconciled variables; *z* is the vector of the measured variables; *V* is the matrix of variances of the measurement errors; *u* is the vector of the unmeasured (observable) variables; *h*() is the vector of linear or nonlinear algebraic equality constraints; *g*() is the vector of linear or nonlinear algebraic inequality constraints; *z*ˆ*<sup>L</sup>* and *z*ˆ*<sup>U</sup>* are the lower and upper bound vectors of *z*ˆ; and *u<sup>L</sup>* and *u<sup>U</sup>* are the lower and upper bound vectors of *u*.

For DR problems where the model constraints are linear and all variables are measured, the analytical solution of the problem can be obtained with the help of Lagrange Multipliers [37]. However, the set of individual mass balance equations generates a nonlinear system of equations that involves unmeasured variables, as described below. For this reason, a successive linearization procedure was used to solve this problem [38].
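For the linear, fully measured case, the Lagrange Multiplier solution [37] has the closed form sketched below; in the successive linearization procedure, each iteration reduces to a problem of this type around the current estimates:

```python
import numpy as np

def reconcile_linear(z, V, A, b):
    """WLS reconciliation for linear constraints A @ z_hat = b (all measured).

    Closed-form solution via Lagrange Multipliers:
        z_hat = z - V A^T (A V A^T)^-1 (A z - b)
    """
    AVAt = A @ V @ A.T
    lam = np.linalg.solve(AVAt, A @ z - b)
    return z - V @ A.T @ lam
```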

The observability analysis of the system can also be performed during the successive linearization procedure, using QR factorization. The procedure consists of describing the nonlinear model in terms of two matrices: a matrix related to the measured variables and a second matrix related to the unmeasured variables. Therefore, the system is observable if the rank of the matrix of unmeasured variables is equal to the number of unmeasured variables [39].
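A sketch of this rank test using NumPy's QR factorization is given below; the tolerance on the diagonal of *R* is an arbitrary choice, and a pivoted factorization would be more robust for rank detection:

```python
import numpy as np

def is_observable(Au: np.ndarray) -> bool:
    """Check observability from the unmeasured-variable constraint matrix.

    Au: constraint Jacobian columns associated with the unmeasured variables.
    The system is observable when rank(Au) equals the number of unmeasured
    variables, i.e., the R factor of the QR factorization has full column rank.
    """
    _, R = np.linalg.qr(Au)
    rank = int(np.sum(np.abs(np.diag(R)) > 1e-10))
    return rank == Au.shape[1]
```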

The following points were considered during the implementation of the proposed model in the industrial site, illustrated in Figure 3:

	- ◦ Flowrates: Total Feed (*F*), Total Retentate (*R*), and Train Retentates *A*, *B*, and *C* (*RA*, *RB*, and *RC*) [*kNm*3/*h*];
	- ◦ Components: *C*1 (methane), *C*2 (ethane), *C*3 (propane), *C*6 (hexane), *C*7 (heptane), *C*8 (octane), CO2 (carbon dioxide), *iC*4 (*i*-butane), *iC*5 (*i*-pentane), *N*2 (nitrogen), *nC*4 (*n*-butane), and *nC*5 (*n*-pentane) in the three streams;
	- ◦ Flowrate: Total Permeate (*P*) [*kNm*3/*h*].

*Model* (Component Mass Balance—Steady-State):

$$\underline{h}(\underline{\hat{z}}, \underline{u}) = \begin{cases} \underline{0} = F\underline{y}\_{F} - R\underline{y}\_{R} - P\underline{y}\_{P} \\ 0 = R - R\_{A} - R\_{B} - R\_{C} \\ 0 = 1 - \sum\_{i=1}^{nc} y\_{F,i} \\ 0 = 1 - \sum\_{i=1}^{nc} y\_{R,i} \\ 0 = 1 - \sum\_{i=1}^{nc} y\_{P,i} \end{cases} \tag{8}$$
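The constraint residuals of Equation (8) can be coded directly, as in the following sketch, where `yF`, `yR`, and `yP` are the composition vectors of the feed, retentate, and permeate streams:

```python
import numpy as np

def h(F, R, P, RA, RB, RC, yF, yR, yP):
    """Residuals of Equation (8): component balances, retentate split,
    and composition summations for the three streams."""
    return np.concatenate([
        F * yF - R * yR - P * yP,      # nc component mass balances
        [R - RA - RB - RC],            # total retentate split
        [1.0 - yF.sum(),               # mole fractions sum to one
         1.0 - yR.sum(),
         1.0 - yP.sum()],
    ])
```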




**Figure 3.** Individual mass balances in the envelope around the process.

## *2.5. Gross Error Detection*

Gross errors are those originating from non-random events, having little or no connection with the measured value. They may be related to the measurements (such as malfunctioning of instruments) or to the process (such as leaks). Consequently, gross errors invalidate the classic statistical basis of traditional DR methods and undermine the systematic analysis of the data, demanding the implementation of Gross Error Detection (GED) procedures for compensation or elimination of gross errors, which may precede the DR step. Gross errors can be further classified as **bias**, **outliers**, and **drifts** [40].

The metric applied in this step to identify the occurrence of gross errors is related to the standard statistical hypothesis test for GED [12]. This test is widely used and simple, and also forms the basis of all other classic GED procedures. The test is applied to the data set to determine whether the measurements follow a symmetrical distribution of the measurement errors.

Reilly and Carpani [41] were the first to study the detection of gross errors in process engineering. The authors proposed a statistical test of the *χ*2 (chi-square) type based on the residuals of the process model, which was called the Global Test (GT).

In the present work, GED was based on the GT, using a sampling window to observe the measurements and identify the occurrence of gross errors. According to this procedure, the data in the moving window are used to compare the errors of the Weighted Least Squares estimator with the deviations from the median values within the window, using the *χ*2 function to characterize significant deviations between these computed variances [42].
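A sketch of this test is given below; the degrees of freedom are taken here as the number of residuals, which is a simplifying assumption, since the proper value is the number of independent constraints:

```python
import numpy as np
from scipy.stats import chi2

def global_test(z, z_hat, V, alpha=0.05):
    """Chi-square Global Test on the WLS residuals of the moving window."""
    r = z - z_hat
    gamma = r @ np.linalg.solve(V, r)   # WLS objective at the optimum
    dof = len(z)                        # assumption: one dof per residual
    return gamma > chi2.ppf(1.0 - alpha, dof)  # True -> gross error suspected
```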

The outlier effect during DR is removed through compensation. After identification of a possible outlier, the variable value was manipulated to adjust the expected variance. Thus, a moving variance window was implemented to monitor changes in operation, measurement errors, and failures. The statistical test was performed in sampling windows containing at least 20 samples, from which the median and the standard deviation were calculated and compared with the value of the variable at the current point. If the value was more than seven standard deviations apart from the median, the measurement was regarded as an outlier and the variance adjustment was performed.
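The seven-standard-deviation rule described above can be sketched as follows:

```python
import numpy as np

def is_outlier(window: np.ndarray, current: float, k: float = 7.0) -> bool:
    """Flag `current` when it lies more than k standard deviations from the
    median of a window of at least 20 samples, as described in the text."""
    if len(window) < 20:
        return False                   # too few samples for the test
    med = np.median(window)
    sd = window.std(ddof=1)
    return abs(current - med) > k * sd
```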

To monitor the possible occurrence of bias, a dynamic bar graph illustrating the magnitude of errors for each variable was implemented, using the following metrics:

$$Bias\_{i,t} = \frac{med\left(\left|z\_{i,t} - \hat{z}\_{i,t}\right|\right)}{dp\_{i,t}} \tag{9}$$

$$dp\_{i,t} = NMAD\left(\left|z\_{i,t} - \hat{z}\_{i,t}\right|\right) \tag{10}$$

where *med* is the median operator and *NMAD* is the Normalized Median Absolute Deviation.
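Equations (9) and (10) can be computed as sketched below; the constant 1.4826 makes the NMAD a consistent estimator of the standard deviation under the normal assumption:

```python
import numpy as np

def nmad(x: np.ndarray) -> float:
    """Normalized Median Absolute Deviation (consistent with a normal sigma)."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def bias_metric(z: np.ndarray, z_hat: np.ndarray) -> float:
    """Equations (9) and (10): median absolute residual scaled by the NMAD."""
    resid = np.abs(z - z_hat)
    return np.median(resid) / nmad(resid)
```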
