#### 3.3.3. Statistical Methods

Statistical data treatment methods refer to algorithms that identify anomalies in the distributions of data sets. The algorithms are purely mathematical and do not take into account any domain-specific reasons behind anomalies or domain-specific relations between the variables. The advantage of this approach is that it is easy to apply and requires very little, if any, domain-specific knowledge. However, applying outlier detection algorithms to multiple variables can potentially remove most of the data. A simple example is the various scrap types used in production, which are prone to extreme (very high or very low) values since some scrap types are not available at all times. Applying a statistical cleaning algorithm to all scrap types would therefore omit most of the data. Hence, statistical treatment methods must be used with caution and only on well-selected variables that are expected to impact the predictions dramatically, such as total charged scrap weight and TTT.

Tukey's fences [19], which are based on the interquartile range of a distribution, will be used in the numerical experiments. The method removes data points that fall outside the range defined as:

$$q_1 - \epsilon(q_3 - q_1) \le x_j \le q_3 + \epsilon(q_3 - q_1) \tag{5}$$

where *q*<sub>1</sub> and *q*<sub>3</sub> are the first and third quartiles of variable *j*, respectively, and *ε* is a pre-specified constant indicating how far outside the quartiles a data point must lie before it is cleaned. *ε* = 3.0 will be used in the numerical experiments, which removes only extreme outliers [19].
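
As an illustration, a minimal sketch of Eq. (5) in Python is given below, assuming NumPy; the function name `tukey_fences` and the lognormal sample are illustrative and not part of the original study.

```python
import numpy as np

def tukey_fences(x, eps=3.0):
    """Lower and upper fences of Eq. (5) for a 1-D sample."""
    q1, q3 = np.percentile(x, [25, 75])  # first and third quartiles
    iqr = q3 - q1                        # interquartile range
    return q1 - eps * iqr, q3 + eps * iqr

# Keep only the data points inside the fences (illustrative skewed sample).
x = np.random.default_rng(0).lognormal(size=1000)
lo, hi = tukey_fences(x)
x_clean = x[(x >= lo) & (x <= hi)]
```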

Since this method is based on the quartiles of the distribution under consideration, it is less sensitive to skewed distributions compared to cleaning by omitting data points outside of ±3*σ* from the mean. This inherent characteristic of Tukey's fences is advantageous, since most variables governing the EAF process are non-Gaussian. See Figure 4 for examples.
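
Continuing the sketch above, the mean-based filter can be written the same way; the contrast is that the sample mean and standard deviation are pulled by the tail values, whereas the quartiles are not.

```python
# Mean-based counterpart of the fences above, on the same sample `x`.
mu, sigma = x.mean(), x.std()
x_sigma_clean = x[(x >= mu - 3 * sigma) & (x <= mu + 3 * sigma)]
# For the skewed lognormal sample, mu and sigma are inflated by the long
# right tail, so these bounds move with the outliers; q1 and q3 do not.
```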

**Figure 4.** The distributions for four variables governing the EAF under study highlighting the absence of the Gaussian distribution. All values are normalized, and the dashed lines indicate the mean values. (**A**): TTT. (**B**): Charged weight of internal scrap. (**C**): Total charged weight of raw materials. (**D**): Charged weight of shredded scrap.

#### 3.3.4. Applied Data Treatments

Due to the opposing effects of data treatment mentioned earlier, it is impossible to know in advance which approach strikes a good balance between model generalizability and model accuracy. Therefore, four different data treatment approaches will be used in the modeling to investigate their influence.

The first data treatment approach was conducted by a senior engineer at the steel plant. It was done by manually inspecting each row of the data set and flagging rows containing values that were not consistent with the data instance as a whole. This data treatment combines domain expertise with some subjectivity on the part of the senior engineer. However, because the data is inherently coupled to the steel plant it originates from, using on-plant experts is critical to a successful data treatment operation. This data treatment is referred to as *Expert*.

For the following three data treatment approaches, two filter steps were first applied to remove unrealistic heats with respect to events in time. The first filter removed heats where the timestamp of charging the second basket was negative with respect to the start of the heat. The second filter removed heats where the charging of the second basket occurred after the heat ended. Applying these filters was enough to remove all unrealistic heats.
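
A minimal sketch of these two time filters is shown below, assuming pandas; the column names `tcb2_min` and `ttt_min` are hypothetical stand-ins for the plant's actual log fields, and taking the TTT as the end of the heat is likewise an assumption.

```python
import pandas as pd

def filter_unrealistic_heats(df: pd.DataFrame) -> pd.DataFrame:
    """Drop heats whose basket-2 charging time is inconsistent with the heat."""
    # Filter 1: basket 2 must not be charged before the heat starts.
    ok_start = df["tcb2_min"] >= 0            # hypothetical column: TCB2, minutes from heat start
    # Filter 2: basket 2 must not be charged after the heat ends.
    ok_end = df["tcb2_min"] <= df["ttt_min"]  # hypothetical column: TTT in minutes
    return df[ok_start & ok_end]
```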

The second data treatment approach, conducted by the authors of this study, was based solely on domain-specific knowledge. Two cleaning steps were applied. The first removed heats with a TTT at, or above, 180 min, since these heats likely experienced a longer process delay or a scheduled stop; usually, the TTT is aimed at 60–70 min. The second removed heats with a total charged weight at, or above, 141 ton, a limit set by the steel plant for abnormally large charge weights. This data treatment approach is referred to as *Domain-specific*.
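
A sketch of the two domain-specific cleaning steps, under the same hypothetical column names (`ttt_min`, plus `total_weight_ton`):

```python
import pandas as pd

def domain_specific_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop heats breaching the plant's TTT and charge-weight limits."""
    ok_ttt = df["ttt_min"] < 180              # remove TTT >= 180 min (delays / scheduled stops)
    ok_weight = df["total_weight_ton"] < 141  # remove total charged weight >= 141 ton
    return df[ok_ttt & ok_weight]
```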

The third data treatment approach used Tukey's fences to remove clear outliers; see Section 3.3.3. Tukey's fences were calculated from the training data for each of the following input variables: Total Weight, TTT, Time until Charging of Basket 2 (TCB2), Burner *O*<sub>2</sub>, Burner oil, *O*<sub>2</sub>-lance, and Injection carbon. Each fence was then applied to both the training and test data. This data treatment approach is referred to as *Tukey*.
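
One way to realize this fit-on-training, apply-to-both procedure is sketched below; the function and column names are illustrative assumptions, not the study's code.

```python
import pandas as pd

def fit_tukey_fences(train: pd.DataFrame, cols, eps: float = 3.0) -> dict:
    """Compute Tukey's fences per variable from the training data only."""
    fences = {}
    for col in cols:
        q1, q3 = train[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        fences[col] = (q1 - eps * iqr, q3 + eps * iqr)
    return fences

def apply_tukey_fences(df: pd.DataFrame, fences: dict) -> pd.DataFrame:
    """Drop rows outside any fence; applied to training and test data alike."""
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in fences.items():
        mask &= df[col].between(lo, hi)
    return df[mask]

# Usage, with train_df / test_df as the already-split DataFrames:
#   fences = fit_tukey_fences(train_df, ["total_weight_ton", "ttt_min", ...])
#   train_clean = apply_tukey_fences(train_df, fences)
#   test_clean = apply_tukey_fences(test_df, fences)
```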

The fourth data treatment approach applied the second and third data treatment approaches in sequence: domain-specific cleaning followed by Tukey's fences. This data treatment approach is referred to as *Domain-specific Tukey*.
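
Continuing the sketches above (all names remain hypothetical), the combined treatment simply chains the steps, with the fences fitted only after the time filters and domain-specific rules have been applied to the training data:

```python
# Assumes the helper functions and the train_df / test_df DataFrames sketched above.
cols = ["total_weight_ton", "ttt_min", "tcb2_min", "burner_o2",
        "burner_oil", "o2_lance", "injection_carbon"]  # hypothetical column names
train_ds = domain_specific_clean(filter_unrealistic_heats(train_df))
test_ds = domain_specific_clean(filter_unrealistic_heats(test_df))
fences = fit_tukey_fences(train_ds, cols)  # fitted on the cleaned training data only
train_clean = apply_tukey_fences(train_ds, fences)
test_clean = apply_tukey_fences(test_ds, fences)
```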

Each of the four described data treatment approaches will be used in the modeling to investigate which one is best suited for a model applied in practice.
