#### 3.2.1. Available Data

Data from a total of 17,147 batches produced in two different reactors over a nine-year period were available, containing information about:


The first three groups of variables will be referred to as 'summary variables', since they provide summarized information about each batch (e.g., average observed values or setpoint values), disregarding their variation during the evolution of the chemical reaction until completion and discharge of the reactor. The fourth group, by contrast, comprises 'trajectory variables', which may reveal differences among batches in the evolution of the corresponding process conditions even when their average or target values coincide.

Although further experimentation may be suggested during this step to enrich the database used in following stages, such experimentation is not possible in this case, as previously stated, and therefore no such approach will be addressed in this section.

#### 3.2.2. Validation of the Data

To detect potential outliers, a PCA model with two latent variables (LVs) was fitted using all available data for the 'summary variables' as provided, resulting in a model that explained 17% of the variability of these data. Adding more LVs provided no additional information useful at this stage, and instead resulted in PCA models with lower explanatory and predictive capabilities and less ability to detect outliers. Figure 4 shows the squared prediction error (SPE) of all observations in the database.

**Figure 4.** SPE for all observations in the dataset with the summary variables and critical-to-quality characteristics (CQCs) for a principal component analysis (PCA) model fitted with two LVs [R2(X) = 17%], SPE 95% (dotted red line) and 99% (continuous red line) confidence limits, and the four observations with the highest SPE values highlighted.
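The outlier screening described above can be sketched as follows. This is a minimal illustration on synthetic data, not the model fitted in this study: the data matrix, the injected anomaly, the function names (`fit_pca`, `squared_prediction_error`) and the use of empirical percentiles as the 95% and 99% limits are all assumptions made for the example (the text does not state how the limits were computed, and theoretical approximations are often used instead).

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Autoscale X and fit a PCA model through the SVD (hypothetical helper)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T  # shape: variables x components
    # Fraction of the total sum of squares explained, i.e. R2(X)
    r2x = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return mean, std, loadings, r2x

def squared_prediction_error(X, mean, std, loadings):
    """SPE of each observation: squared norm of its residual from the PCA model."""
    Z = (X - mean) / std
    residuals = Z - Z @ loadings @ loadings.T
    return (residuals ** 2).sum(axis=1)

# Synthetic stand-in for the 'summary variables': 500 batches, 6 variables
# driven by two underlying factors plus noise (assumed, not the real data).
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.8, 0.6, 0.5, -0.7, 0.3],
                   [0.2, -0.5, 0.7, 0.6, 0.4, -0.9]])
X = scores @ mixing + 0.1 * rng.normal(size=(500, 6))
X[10, 3] += 8.0  # inject an abnormally high value in the fourth variable

mean, std, loadings, r2x = fit_pca(X, n_components=2)
spe = squared_prediction_error(X, mean, std, loadings)

# Empirical limits (assumption): percentiles of the observed SPE values
limit_95, limit_99 = np.percentile(spe, [95, 99])
flagged = np.flatnonzero(spe > limit_99)
top4 = np.argsort(spe)[::-1][:4]  # the four observations with highest SPE

print(f"R2(X) = {r2x:.2f}")
print("observations above the 99% limit:", flagged)
print("four highest SPE values at rows:", top4)
```

Because the limits here are percentiles of the same data, about 1% of the observations exceed the 99% limit by construction; what matters in practice is how far above the limit an observation lies.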

This plot makes it possible to quickly detect observations that do not follow the correlation structure found by the PCA model in the dataset for the 'summary variables'. A contribution plot, such as the one in Figure 5, provides additional information about which variables are responsible for the high SPE value of the corresponding observation. In this case, variable *x*4 presents an abnormally high value for observation 1080, which breaks the correlation structure found in the data by the PCA model.

**Figure 5.** Contribution plot for observation 1080, seen in Figure 4 to have an SPE value significantly above the 99% confidence limit for a PCA model fitted with two LVs [R2(X) = 17%]. Variable *x*4 is seen to be the biggest contributor to the SPE value.
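A contribution plot of this kind can be built by splitting the SPE of one observation into per-variable terms. The sketch below assumes one common definition, in which the contribution of each variable is its squared residual, so that the contributions add up to the SPE; the definition actually used for Figure 5 is not stated in the text. The synthetic data, the injected anomaly and the function names are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components=2):
    """Autoscale X and return the mean, std and PCA loadings (hypothetical helper)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd((X - mean) / std, full_matrices=False)
    return mean, std, Vt[:n_components].T

def spe_contributions(x, mean, std, loadings):
    """Residual of a single observation and its per-variable SPE contributions."""
    z = (x - mean) / std
    residual = z - loadings @ (loadings.T @ z)
    return residual, residual ** 2  # the squared terms sum to the SPE

# Synthetic data with an abnormally high fourth variable in row 10 (assumed)
rng = np.random.default_rng(0)
scores = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.8, 0.6, 0.5, -0.7, 0.3],
                   [0.2, -0.5, 0.7, 0.6, 0.4, -0.9]])
X = scores @ mixing + 0.1 * rng.normal(size=(500, 6))
X[10, 3] += 8.0

mean, std, loadings = fit_pca(X, n_components=2)
residual, contrib = spe_contributions(X[10], mean, std, loadings)
spe_obs = float(contrib.sum())
worst = int(np.argmax(contrib))

for j, (r, c) in enumerate(zip(residual, contrib), start=1):
    print(f"x{j}: residual = {r:+.3f}, contribution = {c:.3f}")
print(f"largest contributor: x{worst + 1}")
```

Keeping the signed residual alongside the squared contribution shows whether the offending variable is abnormally high or abnormally low for that observation.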

An in-depth analysis of the factors contributing to the high SPE values of all observed outliers made it possible to curate the database, either by correcting wrongly registered data or by eliminating outliers, before continuing with the analysis. Consequently, the dataset was reduced from the 17,147 original batches to 16,813.

The same procedure was followed for the dataset containing the 'trajectory variables', but no outliers were found among these data other than the ones already identified with the 'summary variables'. As a consequence, these observations were discarded before continuing.

On the other hand, process variable *x*5 was found to present almost no variability in the dataset and was also discarded before going on. Additionally, variables 'day', 'month', and 'year' were included only as labels with which observations could be colored, in order to identify possible patterns, stationary effects or changes with time without artificially biasing the model to account for these variables. However, no clustering or displacement of the observations was detected this way, and the presence of outliers was not found to be correlated to these time-related variables.

#### 3.2.3. Quantified Initial Situation and Potential Causes of the Observed Problem

Once outliers were eliminated from the dataset, the starting point of the project was determined according to the remaining information. Figure 6 shows the evolution of the purity of the product of interest (*y*8) with time for both reactors. The superimposed dashed blue lines mark the separation between batches produced before and after September 2014.

**Figure 6.** Evolution of the purity of the product of interest (*y*8) with time for the first (black) and second (orange) reactors; the dashed blue lines separate batches produced before (left side) and after (right side) September 2014.

The average value of *y*8 after September 2014 was around 0.1% lower than before, while its standard deviation had increased to 1.008% (1.47 times that of earlier batches). Both changes (in average value and in variability) were found to be statistically significant (*p*-value < 0.05), which corroborated, at least partially, the concerns expressed by the technicians at the start of the project. When asked, they mentioned that several changes had taken place at some point during 2014, such as the addition of an auxiliary refrigerating system to one of the reactors, a change in the way the reactants were fed, and the recovery of some amount of unreacted raw materials after each reaction.
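The kind of comparison reported above can be illustrated with a simple two-sample permutation test. The text does not name the tests that were actually applied, so the choice of a permutation test is an assumption, as are the synthetic purity values below: the standard deviations (1.008% after, 1.008/1.47 before) and the 0.1% shift follow the figures quoted above, while the mean level of 98.0 and the sample sizes are purely hypothetical. Note also that a permutation test on the means is only approximate when the two groups differ in variance.

```python
import numpy as np

def permutation_pvalue(a, b, statistic, n_perm=1000, seed=0):
    """Two-sided permutation p-value for statistic(a, b) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = abs(statistic(a, b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(statistic(perm[:a.size], perm[a.size:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_perm + 1)

def mean_difference(a, b):
    return b.mean() - a.mean()

def log_sd_ratio(a, b):
    return np.log(b.std(ddof=1) / a.std(ddof=1))

# Synthetic purity values (assumed): batches before and after the change date
rng = np.random.default_rng(1)
before = rng.normal(loc=98.0, scale=1.008 / 1.47, size=8000)
after = rng.normal(loc=97.9, scale=1.008, size=4000)

p_mean = permutation_pvalue(before, after, mean_difference)

# For the spread, center each group first so that a shift in the mean
# is not mistaken for a change in variability.
p_sd = permutation_pvalue(before - before.mean(), after - after.mean(),
                          log_sd_ratio)
sd_ratio = after.std(ddof=1) / before.std(ddof=1)

print(f"mean shift = {mean_difference(before, after):+.3f}, p = {p_mean:.4f}")
print(f"std ratio  = {sd_ratio:.2f}, p = {p_sd:.4f}")
```

With 1000 permutations the smallest attainable p-value is about 0.001, which is enough to judge significance at the 0.05 level; a larger `n_perm` would give finer resolution at a higher computational cost.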
