2.3.1. Inherent Traits

One key difference separates statistical models from physico-chemical models: the connection between the input and output values. Physico-chemical models compute their prediction, i.e., the output value, from pre-determined equations that use the input values. These equations are related to established physical and chemical laws. Statistical models, on the other hand, interpret the values of the input variables in the context of previously observed input and output values. Hence, the output of a statistical model is based purely on probability and does not necessarily adhere to established physical and chemical laws. The connection between input and output depends entirely on the data used to adapt the statistical model to the prediction problem of interest. This leads to three distinct traits unique to statistical models: data quality, data variability, and correlation.

Data quality describes how close the registered value is to the true value that is intended to be measured. Low data quality introduces uncertainty and reduces the performance of a statistical model. Numerous sources can affect the data quality. Two examples are the manual logging of data, which is prone to human error, and the definition of a variable in the logging system, which may differ from what is measured in reality. One common data quality issue in the steel industry is the precision of measurement equipment. For example, scales weighing raw materials can have a precision of ±100 kg and temperature sensors can have a precision of ±5 °C. In this study, data quality refers to the extent to which the data is affected by the aforementioned sources of error.
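The effect of limited measurement precision can be illustrated with a minimal sketch. It uses synthetic values only, and it assumes that the scale error is uniformly distributed within ±100 kg and that a hypothetical output depends exactly linearly on the true weight. The names `true_weight`, `logged_weight`, and `output` are illustrative and do not refer to variables used in this study.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical "true" raw material weights in kg (synthetic, not plant data).
true_weight = [rng.gauss(10_000, 100) for _ in range(2_000)]

# Assumption: the scale adds an error that is uniform within +/-100 kg.
logged_weight = [w + rng.uniform(-100, 100) for w in true_weight]

# Hypothetical output that depends exactly linearly on the true weight.
output = [0.05 * w for w in true_weight]

r_true = pearson(true_weight, output)
r_logged = pearson(logged_weight, output)
print(f"correlation using true weights:   {r_true:.3f}")
print(f"correlation using logged weights: {r_logged:.3f}")
```

Under these assumptions, the logged weights correlate less strongly with the output than the true weights do, which is one way in which low data quality can reduce what a statistical model is able to learn. The size of the effect depends on how large the measurement error is relative to the natural variation of the variable.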

Data variability is a requirement because variations in the values of previously observed data are what the statistical model learns from. An input variable that is constant is therefore useless to a statistical model. A constant in a physico-chemical model, in contrast, is not useless. A straightforward example is the latent heat of melting of steel, which is an important component of a physico-chemical model predicting the temperature of steel in, for example, the EAF.
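A simple screening step for constant input variables could look like the following sketch. The table `heats`, its column names, and all values are hypothetical placeholders chosen for illustration; in particular, the value used for the latent heat column is not a physical property value.

```python
def constant_columns(table):
    """Return the names of columns whose values never vary."""
    return [
        name
        for name, values in table.items()
        if max(values) == min(values)
    ]


# Hypothetical logged data: one list of values per input variable.
heats = {
    "scrap_weight_kg": [98_000, 101_500, 99_200, 100_300],
    "tap_temperature_c": [1_620, 1_635, 1_628, 1_641],
    "latent_heat": [1.0, 1.0, 1.0, 1.0],  # placeholder constant
}

flagged = constant_columns(heats)
print(flagged)  # columns that carry no variation for a statistical model
```

A column flagged in this way contains no variation for the statistical model to learn from, even though the same quantity may be essential as a constant in a physico-chemical model.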

Correlation is a metric that indicates the relation between random variables. Strongly correlated input variables are similar and therefore redundant as input variables to a statistical model. In this case, only one of the input variables should be selected. Weakly correlated variables, on the other hand, may also be redundant. The degree of redundancy for a weakly correlated variable depends on the inter-correlation between that input variable and the other input variables. For example, scrap type A may be redundant if the total scrap weight and scrap types B and C are included in a potential model. Given the total scrap weight and the weights of scrap types B and C, the weight of scrap type A must equal the difference between the total scrap weight and the sum of scrap types B and C. In this case, adding scrap type A as an input variable would likely not increase the accuracy of the statistical model since no new information is gained from this input variable. It is important to note that correlation *does not* imply causation. However, statistical models lack the ability to distinguish between the two. Even though a correlation can shed light on areas where causation may exist, it is the task of the practitioner with domain expertise to separate the causative relations from the non-causative relations. This stresses the importance of possessing knowledge about the domain in which the statistical model is used.
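The scrap example can be made concrete with a short sketch. The weights below are made-up numbers in tonnes, and the sketch assumes that the charge consists of scrap types A, B, and C only, so that the total is exactly their sum.

```python
# Hypothetical scrap weights per heat in tonnes (illustrative values only).
total_scrap = [100, 95, 110, 102]
scrap_b = [40, 30, 50, 45]
scrap_c = [35, 40, 30, 32]
scrap_a = [25, 25, 30, 25]

# Scrap type A follows from the other three variables:
# A = total - B - C, so it adds no new information to the inputs.
reconstructed_a = [
    total - b - c
    for total, b, c in zip(total_scrap, scrap_b, scrap_c)
]

is_redundant = reconstructed_a == scrap_a
print(is_redundant)
```

Because scrap type A can be reconstructed exactly from the other inputs, it is redundant regardless of how strongly or weakly it correlates with any single one of them. With real logged data the relation would typically hold only approximately, for example because of limited measurement precision, so a tolerance would be needed in the comparison.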
