**1. Introduction**

Six Sigma is a strategy for process improvement widely used in sectors such as manufacturing, finance, and healthcare. It is defined by Linderman et al. [1] as "an organized and systematic method for strategic process improvement and new product and service development that relies on statistical methods and the scientific method to make dramatic reductions in customer defined defect rates". Moreover, as a quality tool, Six Sigma has fostered a never-ending improvement culture based on a strong and professionalized organization for improvement, a clear and well-thought-out methodology (DMAIC), and powerful tools and statistical techniques to carry out the improvement projects within the DMAIC framework, which has proved highly effective in a large variety of situations [2].

The DMAIC methodology in Six Sigma is a five-step improvement cycle: Define, Measure, Analyze, Improve, and Control. Reliable data and objective measurements are critical at each step of the method and, hence, statistical techniques are incorporated into the structured method as needed [1]. Traditionally, classical statistical techniques (e.g., multiple linear regression (MLR)) have been used within the DMAIC framework in a data-scarce context, mainly with data from experimental designs. However, with the emergence of Industry 4.0 and the Big Data movement gaining momentum, data abound now more than ever, and the speed at which they accumulate is increasing [3]. Moreover, due to the increasing availability of sensors and data acquisition systems collecting information from process units and streams, univariate or low-dimensional data matrices have evolved into high-dimensional data matrices. In addition, these data often exhibit high correlation, rank deficiency, low signal-to-noise ratio, and missing values [4].

For all these reasons, the Six Sigma statistical toolkit, traditionally focused on classical statistical techniques, must incorporate new approaches capable of handling the complex data characteristics of the current Industry 4.0 context. In this context, latent variable-based multivariate statistical techniques are widely recommended.

In the literature, there are some examples of this integration of multivariate statistical tools into the Six Sigma toolkit. For example, Peruchi et al. [5] integrated principal component analysis (PCA) into a Six Sigma DMAIC project for assessing the measurement system, analyzing process stability and capability, and modeling and optimizing multivariate manufacturing processes in a hardened steel turning case involving two critical-to-quality (CTQ) characteristics. In [6], discriminant analysis and PCA were integrated into the DMAIC Six Sigma framework in order to improve the quality of oil type classification from chromatographic data of oil spills.

In addition, latent variable regression models (LVRMs) have very attractive features, not only for building models when good predictions, process monitoring, fault detection, and diagnosis are desired (passive use), but also for exploiting this kind of Industry 4.0 data for process understanding, troubleshooting, and optimization (active use) [4,7]. Note that an active use requires causal models and, in contrast to machine learning (ML) and MLR models, which can only yield causal models if data come from experimental designs, latent variable regression models (such as partial least squares (PLS) [8,9]) do provide causality in the reduced dimensional space of the latent variables, even when using historical data corresponding to the daily production of the processes (happenstance data) [10].

This paper reinforces conclusions from previous works in the literature on how Six Sigma's DMAIC methodology can be used to achieve competitive advantages, efficient decision-making, and problem-solving capabilities within the Industry 4.0 context, by incorporating latent variable-based techniques such as PCA into the statistical toolkit, leading to Multivariate Six Sigma. An important contribution of this paper with respect to past literature is that we advocate the use of more advanced techniques via LVRMs, such as PLS, and illustrate their successful integration into the DMAIC problem-solving strategy for a batch production process, one of the most iconic Industry 4.0 scenarios. Although this type of process shares many of the characteristics represented by the four V's (volume, variety, velocity, and veracity), it may not really be Big Data in comparison with other sectors such as social networks, sales, marketing, and finance. However, the complexity of the questions we are trying to answer is very high, and the information that we wish to extract is often subtle. This information needs to be analyzed and presented in a way that is easily interpreted and useful to process engineers. Not only do we want to find and interpret patterns in the data and use them for predictive purposes, but we also want to extract meaningful relationships that can be used to improve and optimize a process [11], thus making latent variable-based techniques especially relevant, as they permit making proper use of all the data available. More specifically, this paper addresses a case study based on the batch production of one of the star products at a chemical plant.

#### **2. Methods and Materials**

#### *2.1. Six Sigma's DMAIC Methodology*

In studying Six Sigma's DMAIC methodology, De Mast and Lokkerbol [12] noted that there are essentially two options: to study the method as it is prescribed in courses and textbooks (prescriptive accounts), or to study it as it is actually applied by practitioners in Six Sigma improvement projects (descriptive accounts). This work focuses on the second option. However, it is useful to first prescribe the main functions of each step. A rational reconstruction of the DMAIC methodology is shown below [12]:

- **Define**: problem selection and benefit analysis.
- **Measure**: translation of the problem into a measurable form, and measurement of the current situation.
- **Analyze**: identification of the influence factors and causes that determine the behavior of the CTQ characteristics.
- **Improve**: design and implementation of adjustments to the process to improve the performance of the CTQs.
- **Control**: adjustment of the process management and control system so that improvements are sustained, and empirical verification of the project's results.
As commented above, due to the Industry 4.0 context of the case study, we propose to incorporate latent variable-based techniques within the usual DMAIC framework. To aid the reader's understanding, these techniques are described below.

#### *2.2. Latent Variable Models*

Latent variable models (LVMs) are statistical models specifically designed to analyze massive amounts of correlated data. The basic idea behind LVMs is that the number of underlying factors acting on a process is much smaller than the number of measurements taken on the system. Indeed, the factors that drive the process leave a similar signature on different measurable variables, which therefore appear correlated. By combining the measured variables, LVMs find new variables (called latent variables (LVs)) that optimally describe the variability in the data and can be useful in the identification of the driving forces acting on the system and responsible for the data variability [13].

#### 2.2.1. Principal Component Analysis

Principal component analysis (PCA) [14–16] is a latent variable-based technique that is very useful for analyzing a data matrix *X* (*N* × *K*), where *N* is the number of observations and *K* the number of variables measured. The goal of PCA is to extract the most important information from *X* by compressing the *K* measured variables into *A* new latent variables that summarize this information. This simplifies the description of the data matrix and eases the analysis of the correlation structure of the observations and the variables. Thus, PCA can be used not only to reduce the dimension of the original space but also to identify patterns, trends, clusters, and outliers in the data. In machine learning terminology, PCA is an unsupervised method.

In order to achieve these goals, PCA projects *X* onto orthogonal directions, obtaining new variables of maximum variance (i.e., principal components (PCs), also called LVs), which are computed as linear combinations of the original variables. The decomposition carried out by PCA can be expressed as:

$$X = T \cdot P^T + E \tag{1}$$

where *P* (*K* × *A*) is the orthogonal loadings matrix, *A* being the number of LVs, *T* (*N* × *A*) is the scores matrix composed of the score vectors (columns of *T*), and *E* (*N* × *K*) is the residuals matrix. Score vectors are orthogonal to each other and explain most of the variance of *X*. Besides, due to the orthogonality in *P*, the *A* LVs have independent contributions to the overall explained variation. To calculate the parameters in a sequential manner, the non-linear iterative partial least squares (NIPALS) algorithm can be used [17].
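To make the NIPALS idea concrete, the following is a minimal sketch of the algorithm for PCA (an illustrative implementation with hypothetical names, not the one used in this work), assuming NumPy and a complete data matrix; the missing-data handling that NIPALS also offers is omitted for brevity:

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-8, max_iter=500):
    """Minimal NIPALS sketch for PCA: extracts LVs one at a time by deflation."""
    X = X - X.mean(axis=0)                    # mean-center each column
    N, K = X.shape
    T = np.zeros((N, n_components))           # scores
    P = np.zeros((K, n_components))           # loadings
    for a in range(n_components):
        t = X[:, [np.argmax(X.var(axis=0))]]  # start from the highest-variance column
        for _ in range(max_iter):
            p = X.T @ t / (t.T @ t)           # regress columns of X on the score vector
            p /= np.linalg.norm(p)            # normalize the loading vector
            t_new = X @ p                     # updated score vector
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, [a]], P[:, [a]] = t, p
        X = X - t @ p.T                       # deflate: remove the a-th component
    return T, P                               # X ≈ T @ P.T (Equation (1))
```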

#### 2.2.2. Partial Least Squares Regression

LVMs can also be used to relate data from different datasets: an input data matrix *X* (*N* × *K*) and an output data matrix *Y* (*N* × *L*), where *L* is the number of output variables measured. This is done by means of latent variable regression models (LVRMs), such as partial least squares (PLS) regression. Thus, LVRMs find the main driving forces acting on the input space that are most related to the output space by projecting the input (*X*) and the output variables (*Y*) onto a common latent space. The number of LVs corresponds to the dimension of the latent space and can be interpreted, from a physical point of view, as the number of driving forces acting on the system [18].

In contrast to classical MLR or ML techniques, PLS regression [8,9] not only models the inner relationships between the matrix of inputs *X* and the matrix of output variables *Y*, but also provides a model for both. Thus, both *X* and *Y* are assumed to be simultaneously modelled by the same LVs, providing unique and causal models, which is why PLS yields causal models even with data from daily production (i.e., happenstance data not coming from an experimental design). The PLS regression model structure can be expressed as follows:

$$T = X \cdot W^* \tag{2}$$

$$X = T \cdot P^T + E \tag{3}$$

$$Y = T \cdot Q^T + F \tag{4}$$

where the columns of the matrix *T* (*N* × *A*) are the PLS score vectors, consisting of the first *A* LVs from PLS. These score vectors explain most of the covariance of *X* and *Y*, and each one of them is estimated as a linear combination of the original variables with the corresponding "weight" vector (Equation (2)). These weight vectors are the columns of the weighting matrix *W*∗ (*K* × *A*). Besides, the PLS score vectors are, at the same time, good "summaries" of *X* according to the *X*-loadings (*P*) (Equation (3)) and good predictors of *Y* according to the *Y*-loadings (*Q*) (Equation (4)), where *E* and *F* are residual matrices. In a predictive use, the sum of squares of *F* indicates how good the predictive model is, and the sum of squares of *E* indicates how well the model explains the *X*-space.

To evaluate the model performance when projecting the *n*-th observation *x<sub>n</sub>* onto it, the Hotelling *T*<sup>2</sup> statistic in the latent space, *T*<sup>2</sup><sub>*n*</sub>, and the squared prediction error, *SPE*<sub>*x*<sub>*n*</sub></sub>, are calculated [19]:

$$\tau_n = W^{*T} \cdot x_n \tag{5}$$

$$T_n^2 = \tau_n^T \cdot \Lambda^{-1} \cdot \tau_n \tag{6}$$

$$SPE_{x_n} = \left(x_n - P \cdot \tau_n\right)^T \cdot \left(x_n - P \cdot \tau_n\right) = e_n^T \cdot e_n \tag{7}$$

where *e<sub>n</sub>* is the residual vector associated with the *n*-th observation, **Λ**<sup>−1</sup> is the (*A* × *A*) diagonal matrix containing the inverses of the *A* variances of the scores associated with the LVs, and τ<sub>*n*</sub> is the vector of scores corresponding to the projection of the *n*-th observation *x<sub>n</sub>* onto the latent subspace of the PLS model. *T*<sup>2</sup><sub>*n*</sub> is the estimated squared Mahalanobis distance from the center of the latent subspace to the projection of the *n*-th observation onto this subspace, and the *SPE*<sub>*x*<sub>*n*</sub></sub> statistic gives a measure of how close (in a Euclidean sense) the observation is to the *A*-dimensional latent space.
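As a sketch of how Equations (5)–(7) translate into code (illustrative names; it assumes NumPy, a fitted model given by a weighting matrix `W_star`, loadings `P`, and training scores `T`, and an observation `x_n` already centered and scaled with the training parameters):

```python
import numpy as np

def monitoring_statistics(x_n, W_star, P, T):
    """Hotelling T2 and SPE for one preprocessed observation (Equations (5)-(7))."""
    tau_n = W_star.T @ x_n                             # scores of x_n (Equation (5))
    lambda_inv = np.diag(1.0 / T.var(axis=0, ddof=1))  # inverse variances of the scores
    T2_n = tau_n @ lambda_inv @ tau_n                  # squared Mahalanobis distance (Equation (6))
    e_n = x_n - P @ tau_n                              # residual after back-projection
    SPE_n = e_n @ e_n                                  # squared prediction error (Equation (7))
    return T2_n, SPE_n
```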

The PLS model can also be expressed as a function of the input variables (as in a classical regression model) by substituting Equation (2) into Equation (4):

$$Y = X \cdot W^* \cdot Q^T + F = X \cdot B + F \tag{8}$$

where *B* (*K* × *L*) is the PLS regression coefficient matrix. To calculate the parameters of the model in a sequential manner, the NIPALS algorithm can be used [20]. In both PCA and PLS regression, the NIPALS algorithm has two main advantages: it easily handles missing data and it calculates the LVs sequentially (an important property from a practical point of view).
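For completeness, here is a minimal NIPALS sketch for PLS2 regression (again illustrative, assuming NumPy and complete data) that returns the matrices appearing in Equations (2)–(4) and (8):

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-8, max_iter=500):
    """Minimal NIPALS sketch for PLS2: T = X W*, X = T P' + E, Y = T Q' + F."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    (N, K), L = X.shape, Y.shape[1]
    W = np.zeros((K, n_components)); P = np.zeros((K, n_components))
    Q = np.zeros((L, n_components)); T = np.zeros((N, n_components))
    for a in range(n_components):
        u = Y[:, [0]]                              # initialize with one Y column
        for _ in range(max_iter):
            w = X.T @ u / (u.T @ u)                # X-weights
            w /= np.linalg.norm(w)
            t = X @ w                              # X-scores
            q = Y.T @ t / (t.T @ t)                # Y-loadings
            u_new = Y @ q / (q.T @ q)              # Y-scores
            if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new):
                u = u_new
                break
            u = u_new
        p = X.T @ t / (t.T @ t)                    # X-loadings
        X = X - t @ p.T                            # deflate X
        Y = Y - t @ q.T                            # deflate Y
        W[:, [a]], P[:, [a]], Q[:, [a]], T[:, [a]] = w, p, q, t
    W_star = W @ np.linalg.inv(P.T @ W)            # so that T = X_centered @ W* (Equation (2))
    B = W_star @ Q.T                               # regression coefficients (Equation (8))
    return T, W_star, P, Q, B
```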

Although PLS was not inherently designed for classification or discrimination, it can be used for both purposes in the form of PLS discriminant analysis (PLS-DA) [21]. Thus, by means of PLS-DA one can explain differences between overall class properties and classify new observations. In machine learning terminology, PLS and PLS-DA are supervised methods.
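As a rough sketch of the PLS-DA idea (reusing the hypothetical `nipals_pls()` above; not a full implementation), the classes are encoded as dummy columns of *Y* and a new observation is assigned to the class with the largest predicted value:

```python
import numpy as np

def pls_da(X, labels, X_new, n_components=2):
    """PLS-DA sketch: dummy-coded Y, PLS fit, class = column with max prediction."""
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)  # one dummy column per class
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    _, _, _, _, B = nipals_pls(X, Y, n_components)           # sketched above
    Y_hat = (X_new - mu_x) @ B + mu_y                        # predicted class memberships
    return classes[np.argmax(Y_hat, axis=1)]
```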

#### *2.3. LVMs in Batch Processes*

Batch processes operate for a finite period, with a defined starting and ending point and a time-varying behavior over the operating period. Thus, the data available on batch processes fall into three categories: summary variables (*XSV*), i.e., characteristics of each batch such as initial conditions, charge of ingredients, shift, operator, or features extracted from the trajectories of the process variables throughout batch evolution; time-varying process variables throughout the batch evolution (trajectory variables, *XTV*); and the critical quality characteristics (CQCs) of the final product (*Y*). The nature of these data is represented in Figure 1.

**Figure 1.** Nature of the batch data set.

The presence of very large arrays, missing data, and the fact that there is temporal and contemporaneous correlation among all variables provide a strong motivation for using latent variable-based techniques on such data. However, these techniques are based on bilinear modeling approaches and, therefore, a priori they are not able to handle such three-dimensional (3D) *XTV* arrays. To address this, several approaches for analyzing batch data have been proposed, differing mainly in how they treat *XTV* [22]. For example, the simplest approach is to extract meaningful landmark features from the trajectories and record the values of these features for each batch run in a 2D data matrix.

More sophisticated approaches are based on unfolding the 3D *XTV* in a 2D matrix. One approach is the batchwise unfolding (BWU) [23], that is, to unfold *XTV* such that all the information for each batch is contained in one row. The data are then mean centered and scaled for each column. Mean centering removes the mean trajectories of each variable, eliminating the main nonlinearity due to the dynamic behavior of the process, and focusing on the variation of all the variables about their mean trajectories. This approach allows exploiting the complex auto and lagged cross correlation structure (i.e., dynamics) of the batch process. Performing any subsequent PCA or PLS analysis on this matrix then summarizes the major sources of variation among the different batches and allows efficient batch-to-batch comparison. This approach also allows for incorporating summary variables (*XSV*) and final product quality variables (*Y*) associated with each batch when performing either a PCA or a PLS analysis [19].
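As a minimal sketch of the BWU step (illustrative names, assuming NumPy and a complete, already synchronized 3D array with one batch per slice):

```python
import numpy as np

def batchwise_unfold(X_tv):
    """BWU: (I batches, J variables, T time points) -> (I, J*T), autoscaled by column."""
    I, J, T = X_tv.shape
    X_bwu = X_tv.reshape(I, J * T)     # one row per batch
    mu = X_bwu.mean(axis=0)            # column means = average trajectories
    sd = X_bwu.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                  # guard against constant columns
    return (X_bwu - mu) / sd           # deviations from the mean trajectories
```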

Another approach is to unfold the data observation-wise (OWU) [24], with each row corresponding to an observation at some time in each batch, and each column corresponding to the variables measured. Mean centering by column (variable) then simply centers the origin of each variable about zero, but does not remove the average trajectories. The variation remaining is the total trajectory variation for each variable. Performing PCA or PLS (using local batch time or a maturity variable as **y**) on these OWU data finds a smaller number of components that summarize the major behavior of the complete trajectories of the original variables. It does not initially focus directly on the differences among batches as BWU does, but on summarizing the variable trajectories [22]. The latter can be overcome by building a second model, similar to the single BWU model, except that it is based on the unfolded score matrix *T* of the OWU model (i.e., OWU-TBWU). The main motivation of the OWU-TBWU approach is dimensionality reduction before the BWU stage, which is required when the number of trajectory variables is very large (e.g., when online analytical sensors such as mass or NIR spectrometers are used). However, modeling the OWU data with a single PLS model works well only if there are no important dynamics in the process and the instantaneous correlation structure (i.e., the correlation structure of the variables at each time point) remains stable throughout the whole batch evolution. Such an assumption does not seem realistic for the chemical batch process analyzed in this case study and, hence, the BWU approach is used in this work.
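For contrast, an equally minimal OWU sketch (same assumptions as above); note that centering by column here does not remove the average trajectories:

```python
import numpy as np

def observationwise_unfold(X_tv):
    """OWU: (I batches, J variables, T time points) -> (I*T, J), centered by variable."""
    I, J, T = X_tv.shape
    X_owu = X_tv.transpose(0, 2, 1).reshape(I * T, J)  # one row per batch-time point
    return X_owu - X_owu.mean(axis=0)                  # mean trajectories remain in the data
```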

Moreover, batch trajectories are dependent on time (more specifically on the pace the batch is run) and they are rarely synchronized (i.e., the key process events do not overlap at the same time in all batches) [25]. To compare these batch histories and apply statistical analysis one needs to reconcile the timing differences and the key process events among these trajectories [26]. This can be achieved using the dynamic time warping (DTW) method with only a minimal amount of process knowledge [27]. This method nonlinearly warps all batch trajectories to match as closely as possible that of a reference batch. Figure 2 illustrates the BWU approach followed by DTW synchronization.

**Figure 2.** (**a**) Batchwise unfolding (BWU) approach and (**b**) dynamic time warping (DTW) synchronization of the 3D *XTV*.
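As an illustration of the DTW idea (a textbook dynamic-programming version for a single univariate trajectory, not necessarily the constrained multivariate variant used in practice for batch data), the warping path that best aligns a batch trajectory to a reference can be obtained as:

```python
import numpy as np

def dtw_warping_path(trajectory, reference):
    """Classic DTW: nonlinearly align one trajectory to a reference trajectory.
    Returns the optimal warping path as (trajectory_index, reference_index) pairs."""
    n, m = len(trajectory), len(reference)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(trajectory[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], n, m                 # backtrack from the end of both series
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```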

Note that, once *XTV* is unfolded and synchronized as seen in Figure 2, it is concatenated with *XSV*, resulting in the matrix *X* to be used in the PCA or PLS model.
