**4. Discussion**

Although the success of the Six Sigma methodology has been documented in numerous industrial case studies, the tools typically used (mainly linear regression and simple graphical displays) suit scenarios in which little information is available yet, a relatively limited number of factors are involved or relevant, and/or experimentation can be carried out to a minimal yet significant extent. In the problem addressed in this project, which concerns a batch production process, none of these conditions applies because of the nature of the registered data, typical of Industry 4.0; alternative methods were therefore required.

In particular, latent variable-based methods such as PCA, PLS, and PLS-DA, applied to historical (i.e., not from DOE) data comprising both summary variables and trajectory variables (usually referred to simply as batch data), extracted valuable information that pinpointed the actual causes of the loss of productivity in a real case study. These tools were also implemented for future troubleshooting.

In contrast with these latent variable-based methods, tools such as linear regression or machine learning would have required data from DOE in order to infer causality, which is needed for process understanding and optimization. However, as is typical in Industry 4.0, no data from DOE were available. As an example, when linear regression was applied to the available historical data, the highly correlated regressors (process variables) meant that different models, using different regressors and assigning different weights or coefficients to them, gave nearly identical predictions, similar to those of the PLS model. These models nevertheless failed to properly identify the relationships between *x*1, *x*2, *x*3, and *y*8, as well as the existing interaction effect between the use of Premix 1 (*x*15 = 0) or Premix 2 (*x*15 = 1) and other process variables, such as the ones shown in Figure 8. Had any of these linear regression models been used in the 'Improve' step of the DMAIC cycle for process improvement, a different set of process operating conditions would have been advised, with a high probability of not being feasible in practice because, e.g., the actual relationships between *x*1, *x*2, and *x*3 would have gone unnoticed. Furthermore, a lesser degree of improvement would presumably have been achieved if the interaction between, e.g., *x*2 and *x*15 had not been discovered. This therefore constitutes a clear example of the dangers of resorting to more basic linear regression (and also machine learning) techniques for process optimization in scenarios for which they are not suitable (i.e., where causality cannot be inferred directly from the raw data), as in this case study, which analyzes historical data.
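The behavior described above can be reproduced on synthetic data. The following is a minimal sketch, not the analysis performed in this case study: it assumes three hypothetical regressors driven by a single latent variable, with arbitrarily chosen noise levels and sample size, and uses only NumPy. It fits ordinary least squares models on different subsets of the regressors, compares their predictions and coefficients, and then applies PCA (via SVD) to show the correlation structure that the regression coefficients alone do not reveal.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

# Synthetic, hypothetical data (NOT the case-study data):
# three regressors that are noisy copies of one latent variable t.
rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=n)
X = np.column_stack([t + 0.05 * rng.normal(size=n) for _ in range(3)])
y = 2.0 * t + 0.1 * rng.normal(size=n)

# Competing regression models that use different subsets of the regressors.
subsets = {"x1": [0], "x2+x3": [1, 2], "x1+x2+x3": [0, 1, 2]}
preds, coefs = {}, {}
for name, cols in subsets.items():
    b0, b = fit_ols(X[:, cols], y)
    preds[name] = b0 + X[:, cols] @ b
    coefs[name] = dict(zip(cols, b))  # regressor index -> fitted coefficient

# Pairwise correlation between the predictions of the three models.
pred_corr = np.corrcoef(np.vstack(list(preds.values())))

# Weight given to x1 when used alone versus alongside x2 and x3.
x1_alone = coefs["x1"][0]
x1_full = coefs["x1+x2+x3"][0]

# PCA via SVD of the centered regressors: fraction of variance per component.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)

print("min correlation between model predictions:", round(float(pred_corr.min()), 4))
print("coefficient of x1 (alone):", round(float(x1_alone), 3))
print("coefficient of x1 (with x2, x3):", round(float(x1_full), 3))
print("variance explained per principal component:", np.round(explained, 4))
```

In this sketch the models predict almost identically while the coefficient attached to the first regressor changes markedly with the choice of subset, so no single set of coefficients can be read as the causal effect of a variable. The first principal component, by contrast, captures nearly all the variance of the regressors, which makes their mutual dependence explicit.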

In summary, latent variable-based methods allowed the Six Sigma methodology to be applied efficiently to a batch production process for which the traditional Six Sigma toolkit would not have sufficed. This led to significant short- and long-term savings, in addition to the implementation of a more robust monitoring system.
