*3.3. Analyze*

The main goal of this stage was to identify which process parameters have a significant effect on the product's purity, to evaluate the nature of their effects (antagonistic or synergistic), and to determine how they relate to each other. To this end, a PCA model was fitted with all summary variables and CQCs (Section 3.3.1) to explore the correlation structure among them in the database and to detect clusters of batches that operated in a similar way in the past. Afterwards, a PLS-regression model, which permits predicting the CQCs from the summary variables, was used to determine which of these factors have a significant effect on the purity of the product of interest (*y*8), as seen in Section 3.3.2. Once these variables were identified, a PLS-DA was performed considering the trajectory variables, to assess which of them are responsible for the observed differences between batches with higher and lower performance, as illustrated in Section 3.3.3.

#### 3.3.1. Principal Component Analysis of the Summary Variables and CQCs

This first exploratory analysis was aimed at providing relevant information regarding the existing correlation structure among summary variables and CQCs, and detecting clusters of batches operating in similar conditions and/or providing similar results. Given that outliers were already eliminated from the dataset, a PCA model with five LVs [R2(X) = 76%] could be directly fit. Additional LVs beyond the fifth corresponded to either variation of individual variables independently of others, or variations not related to the CQCs, and were therefore not considered relevant to the goal of the project.
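As a minimal sketch of how a figure such as R2(X) can be obtained, the snippet below fits a PCA on autoscaled data through a singular value decomposition and reports the cumulative fraction of variability explained by the first components. The data are synthetic, and the function name `pca_r2x` and all array names are illustrative assumptions; this is not the software or the dataset used in the study.

```python
import numpy as np

def pca_r2x(X, n_components):
    """Return scores, loadings and cumulative R2(X) for a PCA on autoscaled data."""
    X = np.asarray(X, dtype=float)
    # Autoscale: mean-centre each column and scale it to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The SVD orders the principal directions (rows of Vt) by explained variance.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    # Fraction of the total sum of squares captured by each component.
    explained = s**2 / np.sum(s**2)
    return scores, loadings, np.cumsum(explained[:n_components])

# Synthetic stand-in (assumption): 60 batches, 8 variables driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 2))
X_demo = latent @ rng.normal(size=(2, 8)) + 0.3 * rng.normal(size=(60, 8))

scores, loadings, r2x_cum = pca_r2x(X_demo, n_components=5)
print(np.round(r2x_cum, 3))  # cumulative R2(X) after 1, 2, ..., 5 components
```

The last entry of `r2x_cum` plays the role of the R2(X) value quoted for the five-component model; components whose increment is small, or unrelated to the CQCs, would be candidates to leave out.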

The Hotelling-*T*<sup>2</sup> values for the observations used to fit this model can be seen in Figure 7a. Here, batches were colored by variable *x*13 (black: 0; orange: 1). In Figure 7b, the scores plot of LV2 (explaining 20% of the variability of the data) versus LV1 (explaining 22% of the variability) is shown, such that the left red cluster corresponds to the observations in orange in Figure 7a, and the rest correspond to observations in black in Figure 7a. Figure 7c, where the loadings of the variables in the first two latent variables are represented, allows the interpretation of this clustering. In it, variables *x*6 and *x*10, and *x*13 = 1, can be seen on the left side, with values close to zero in the second component, while variables *x*4 and *x*8, and *x*13 = 0, are found on the opposite side. This provides two valuable pieces of information (which Figure 7d illustrates, too):


**Figure 7.** (**a**) Hotelling-*T*<sup>2</sup> plot of the observations in the dataset with the summary variables and CQCs for a PCA model fitted with five LVs [R2(X) = 76%], *T*<sup>2</sup> 95% (dotted red line) and 99% (continuous red line) confidence limits, colored by *x*13 (black: 0; orange: 1); (**b**) scores plot for the first two LVs (*t*2 vs. *t*1) showing three clusters of observations: red circled orange dots associated to *x*13 = 1, above average values for *x*6 and *x*10 and below average values for *x*4 and *x*8; red circled black dots associated to *x*13 = 0, above average values for *x*6 and *x*10 and below average values for *x*4 and *x*8; and blue circled black dots associated to *x*13 = 0, below average values for *x*6 and *x*10 and above average values for *x*4 and *x*8; (**c**) loadings plot for the first two LVs (*p*2 vs. *p*1) with CQCs in red, continuous process variables in black, and binary process variables in cyan; (**d**) scatterplot for *x*10 vs. *x*8, using the same color code as in Figure 7b.

Note that, more generally, the relationships among all process variables and CQCs in the dataset used to fit the PCA model can also be assessed by looking at the loadings plot. In this plot, if the corresponding LVs explain a relevant percentage of the variability, variables lying close to each other (and far away from the center) will tend to show positive correlation, while variables lying on opposite sides of the plot will tend to show negative correlation. Figure 7c can be used to carry out such an analysis (latent variables three to five do not, in this case, alter this interpretation). This way, in addition to the aforementioned correlations, positive correlations were found between variables *y*4, *y*6, *y*8, and *y*10, and variables *x*1, *x*2, and *x*15 = 0, as well as between *x*3 and variables *y*5, *y*7, and *y*9. On the other hand, negative correlations were found between *y*2 and all other CQCs except for *y*1 and *y*3, as well as between *x*3 and variables *x*1 and *x*2, and between *x*9 and variables *x*7 and *x*11. More importantly, no clear correlation was found between *y*8 and variables *x*8 to *x*11. Bivariate dispersion plots for each pair of variables were used to visualize each of these relationships (or lack thereof), and also allowed detecting that not only was *x*15 = 0 positively correlated with *y*8, but that the intensity of the positive/negative correlations between *y*8 and other process variables and CQCs varied when *x*15 = 1 (Premix 2 fed to the reactor) with respect to *x*15 = 0 (Premix 1 fed to the reactor). As an example, the positive correlation between *x*2 and *y*8, as well as the relationship between *x*15 and *y*8, and the interaction effect between *x*2 and *x*15, are shown in Figure 8.

**Figure 8.** Scatter plot for *y*8 vs. *x*2, for (**a**) *x*15 = 1, and the approximate direction of maximum variability indicated by a green arrow, and (**b**) *x*15 = 0, and the direction of maximum variability indicated by a yellow arrow.

From Figure 8a,b, the positive correlation between *x*2 and *y*8 can be immediately confirmed. Furthermore, the cluster of batches for which *x*15 = 0 presents higher values of *y*8 (on average) than those for which *x*15 = 1. This is consistent with the conclusions extracted from Figure 7c. Additionally, however, the slopes of the green arrow in Figure 8a and the yellow one in Figure 8b differ, pointing to a stronger, more positive correlation between *x*2 and *y*8 when *x*15 = 1, compared to their weaker, but still positive, relationship when *x*15 = 0. Therefore, it can be suspected that an interaction exists between *x*2 and *x*15.
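The reasoning above, where different slopes per level of a binary variable hint at an interaction, can be illustrated with a small sketch. Note that the arrows in Figure 8 mark the direction of maximum variability, whereas the snippet below swaps in an ordinary least-squares slope fitted separately for each level, as a simpler stand-in. All numbers are made up for illustration and the helper names `ols_slope` and `slopes_by_group` are hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope and intercept of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def slopes_by_group(x, y, group):
    """Fit y ~ x separately for each level of a categorical variable."""
    fits = {}
    for level in sorted(set(group)):
        xs = [xi for xi, g in zip(x, group) if g == level]
        ys = [yi for yi, g in zip(y, group) if g == level]
        fits[level] = ols_slope(xs, ys)
    return fits

# Made-up values (assumption): higher y8 on average for x15 = 0, steeper slope for x15 = 1.
x2 = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
x15 = [0, 0, 0, 0, 1, 1, 1, 1]
y8 = [5.0, 5.5, 6.0, 6.5, 2.0, 3.5, 5.0, 6.5]

fits = slopes_by_group(x2, y8, x15)
for level, (slope, intercept) in fits.items():
    print(f"x15 = {level}: slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A clear difference between the two fitted slopes is only a hint of an interaction; whether it is statistically significant would still have to be checked, for instance by including the interaction term in a regression model.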

#### 3.3.2. Partial Least Squares Regression to Predict the CQCs from the Summary Variables

This analysis was performed in order to identify the sources of variability of the process most related to the product's purity (i.e., variables *y*4, *y*6, and *y*8). This required confirming previous results and quantifying the relationship between the summary variables and the CQCs. For the sake of brevity, only the results regarding the predictive model established for *y*8 will be shown in this section, as the models to predict *y*4 and *y*6 lead to the same overall conclusions. The potential effects of time-related variables ('month' and 'year') and interaction effects between the categorical variables *x*12 to *x*15 and other summary variables were also considered initially. However, no statistically significant differences in the CQCs were found between reactors, and the effect of variables 'month' and 'year' was not statistically significant either. Furthermore, all interaction effects were discarded for the same reason, except for the interactions of both levels of *x*15 with variables *x*1 and *x*2. Figure 9 presents the coefficients of the resulting PLS-regression model fitted with two LVs [R2(Y) = 25.86%; Q2(Y) = 25.81%] to predict *y*8.

One important consideration in this analysis is that variable *x*3, which is known to be critical in the process, did not appear to have a significant effect on the most relevant CQCs. This is partially because this is a very strictly controlled variable. On the other hand, variables *x*1 and *x*2, with which *x*3 is negatively correlated, are found to apparently have a significant positive effect on *y*8. From this, and according to the technicians' knowledge of the process, one may conclude that it is *x*3 that has a statistically significant, negative effect on *y*8, but one should be cautious given that *x*1, *x*2, and *x*3 do not vary independently. This is illustrated in Figure 10, where batches with higher values for *x*1 also present higher values of *y*8, on average (i.e., for lower values of *x*1, batches with similar and smaller values of *y*8 are observed), which points to the positive correlation between *x*1 and *y*8. On the other hand, batches with the highest values of *x*1 only operated at values of *x*3 close to its historical minimum, which also illustrates the negative correlation between *x*1 and *x*3. This could also be seen in the loadings plot of the PCA in Figure 7c. To disentangle this potential aliasing, some experimentation should be run in the future, when/if possible.

**Figure 9.** Regression coefficients of the partial least squares (PLS)-regression model fitted with two LVs [R2(Y) = 25.86%; Q2(Y) = 25.81%] to predict *y*8 from the summary variables and their interactions with *x*15.

**Figure 10.** Scatter plot for *y*8 vs. *x*1, with the observations colored according to *x*3.

Nevertheless, the negative relationship between *x*15 = 1 (using Premix 2) and *y*8, already seen in Figure 8, was confirmed once more, as seen in Figure 9, and so finding what is being done differently in such case becomes relevant.

#### 3.3.3. PLS-Discriminant Analysis to Identify Differences in Batches Using Premix 1 and Premix 2

Since *x*15 seems to be one of the most important variables affecting the purity of the product, *y*8, the conclusions obtained in the previous analyses were confirmed by means of a PLS-DA model, which was used to find which variables are responsible for the differences in how batches operated when Premix 2 (*x*15 = 1) was fed into the reactor, compared to when Premix 1 (*x*15 = 0) was used. This analysis was carried out considering both the summary and trajectory variables, but only the results with the trajectory variables (*x*16 to *x*26) are illustrated here, for the sake of both brevity and clarity. Figure 11 shows the separation between both clusters of batches in the latent space, while Figure 12 presents the model coefficients associated to each process variable, including the 'warping profile' that results from aligning the trajectories, for a PLS-DA regression model to predict *x*15 = 1. This model was fitted with eight LVs, as this number of LVs provided the model with the most discriminant power [R2(Y) = 79.80%; Q2(Y) = 71.90%].

**Figure 11.** Scores plot for the first two LVs (*t*2 vs. *t*1) for the PLS-discriminant analysis (DA) regression model fitted with eight LVs [R2(Y) = 79.80%; Q2(Y) = 71.90%], showing the separation between batches with *x*15 = 0 (orange; right side of the red straight line) and *x*15 = 1 (black; left side of the red straight line).

**Figure 12.** Model coefficients (blue bars) and the confidence interval for a 95% confidence level (grey intervals) associated to each variable for the PLS-DA regression model to predict *x*15 = 1 from variables 'warping profile' and *x*16 to *x*26, fitted with eight LVs [R2(Y) = 79.80%; Q2(Y) = 71.90%]. Positive values indicate that higher values for the corresponding variable at that point in the batch are expected, on average, for batches with *x*15 = 1 (Premix 2 used).

From Figure 12 it is concluded that batches where Premix 2 was fed to the reactor (*x*15 = 1):


• Operated at higher values for variable *x*23 (ingredient temperature) at the start of the batch, but lower during the middle part of the batch.

It is worth noting that, although the technicians at the plant were not surprised by the discrepancy in the values observed for variable *x*23 at the start of the batch for the two clusters, the values during the middle part of the batch were, according to them, opposite to their expectations.
