4.3.2. Robust Regression

To decrease the influence of the outliers, a new probabilistic model was built. The response variable was modeled with a Student-t distribution. This distribution has a third parameter, ν (degrees of freedom), which controls the tails [46]. Lower values of ν produce heavier tails, making the regression more robust to outliers. At approximately ν > 30, the Student-t distribution practically coincides with the normal (Figure 10).
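This tail behavior can be checked numerically; the following sketch compares the standard Student-t and normal densities using only the Python standard library (the function names are illustrative, not from any library used in this work):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of N(mu, sigma^2)
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def student_t_pdf(x, nu):
    # Density of the standard Student-t with nu degrees of freedom
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

# Small nu gives much heavier tails: far from the mean the t density
# is orders of magnitude larger than the normal density.
print(student_t_pdf(4.0, nu=1) / normal_pdf(4.0))   # ratio well above 1

# Around nu = 30 the two densities nearly coincide.
print(abs(student_t_pdf(1.0, nu=30) - normal_pdf(1.0)))  # small difference
```

The heavy tails are what make the likelihood tolerant of outlying observations: a large residual is far less "surprising" under a t distribution with small ν than under a normal.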

The parameters of the distribution were treated as random variables, for which appropriate prior distributions should be given. For the regression coefficients, non-informative priors (Normal with high variance) were chosen. The degrees of freedom can either be fixed or, if treated as a random variable, given an exponential prior.

$$\begin{aligned} y &\sim \text{t}\left(X\beta, \sigma^2, \nu\right) \\ \sigma^2 &\sim \text{HalfCauchy}(s_0) \\ \nu &\sim \text{Exp}(\lambda_0) \\ \beta_i &\sim \text{N}\left(\mu_{0i}, \sigma_{0i}^2\right) \end{aligned} \tag{10}$$

where *ν* is the degrees of freedom; *s*<sub>0</sub>, *λ*<sub>0</sub>, *μ*<sub>0*i*</sub>, *σ*<sub>0*i*</sub> are known parameters.

**Figure 10.** Student-t and normal distributions.

For sampling from this Bayesian model, the PyMC3 Python library was used [47]. This probabilistic programming framework uses a clear and easy-to-understand syntax. In particular, the PyMC3 generalized linear model (GLM) module was used to define the model. By default, it sets non-informative priors *β*<sub>*i*</sub> ∼ *N*(0, 10<sup>6</sup>) for the regressors. The model was defined in a single line, with a string for the regression formula, the data, and the family parameter, which sets the likelihood to the Student-t distribution (the default is Normal). The degrees of freedom were fixed at *ν* = 1. To sample from the model, two independent Markov chains with a length of 6000 samples each were generated. The first 1000 samples of each chain were "burned" to avoid correlated samples. By default, PyMC3 uses the gradient-based No-U-Turn Sampler (NUTS).
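To illustrate why the Student-t likelihood resists outliers, here is a dependency-free point-estimate analogue: a maximum-likelihood line fit under a t error model, solved by iteratively reweighted least squares (IRLS). This is not the Bayesian PyMC3 model from the text; the data and variable names are invented for the sketch:

```python
# Fit y = b0 + b1*x by (a) ordinary least squares and
# (b) maximum likelihood under a Student-t error model (nu = 1),
# solved with iteratively reweighted least squares (IRLS).

def weighted_fit(x, y, w):
    # Closed-form weighted least squares for a line y = b0 + b1*x
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

x = list(range(10))
y = [2.0 * xi + 1.0 for xi in x]  # true line: y = 1 + 2x
y[9] = 60.0                        # one gross outlier

# (a) OLS: every point has weight 1; the outlier drags the slope up.
b0_ols, b1_ols = weighted_fit(x, y, [1.0] * len(x))

# (b) Student-t IRLS: the EM weight (nu + 1) / (nu + r_i^2 / s^2)
# shrinks toward zero for points with large residuals.
nu, s2 = 1.0, 1.0
b0, b1 = b0_ols, b1_ols
for _ in range(50):
    r = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    w = [(nu + 1.0) / (nu + ri * ri / s2) for ri in r]
    b0, b1 = weighted_fit(x, y, w)

print(round(b1_ols, 2), round(b1, 2))  # robust slope stays near the true value 2
```

The downweighting step is the frequentist counterpart of what the heavy-tailed likelihood does automatically inside the sampler.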

After sampling, a trace plot of the posterior distributions was available. On the right, the generated Markov chains are presented, and on the left, the posterior density of the model parameters. As can be seen from Figure 11, they converged well: there were no abrupt changes, patterns, or other suspicious observations.

More detailed inferences about the sampling can be made by reviewing the summary statistics (Table 8). First, the r-hat for all parameters was 1.00, indicating that the simulated chains came from the same distribution. The effective sample size (ess\_mean) shows the number of uncorrelated samples in the chains; it should exceed 10% of the draws. For all parameters, the values were in the range of 5000–7000 out of 10,000 draws.
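The r-hat diagnostic can be sketched from first principles; the following is a minimal Gelman-Rubin computation on synthetic chains (not PyMC3's internal implementation, which uses a refined split-chain variant):

```python
import random
import statistics

def r_hat(chains):
    # Gelman-Rubin potential scale reduction factor: compares the
    # between-chain variance B with the within-chain variance W.
    m = len(chains)        # number of chains
    n = len(chains[0])     # samples per chain
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = statistics.fmean(statistics.variance(c) for c in chains)
    var_hat = (n - 1) / n * w + b / n  # pooled posterior variance estimate
    return (var_hat / w) ** 0.5

random.seed(0)
# Two well-mixed chains from the same distribution -> r-hat near 1
good = [[random.gauss(0, 1) for _ in range(5000)] for _ in range(2)]
# Two chains stuck in different regions -> r-hat far above 1
bad = [[random.gauss(0, 1) for _ in range(5000)],
       [random.gauss(5, 1) for _ in range(5000)]]
print(round(r_hat(good), 2), round(r_hat(bad), 2))
```

An r-hat of 1.00, as in Table 8, is exactly the "well-mixed" case: the between-chain spread adds nothing beyond the within-chain variance.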


**Table 8.** Bayesian regression model summary.

**Figure 11.** Trace plots (left—distributions; right—chains).

The intercept's posterior mode was very close to the data mean (15.4). The posteriors of the regression coefficients were significantly away from zero, with the exception of the AD coefficient. Its posterior was centered close to zero, which indicates that it should be excluded from the model, unlike in the previous OLS model, where its mean was positive, with the 95% HDI predominantly on the positive side. To inspect the other coefficients' posteriors in detail, histogram plots with a highlighted 95% HDI and a vertical line at zero as a reference value were plotted (Figure 12).
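The 95% HDI highlighted in these plots can be computed directly from posterior samples as the shortest interval containing 95% of the draws. A minimal sketch with invented draws (not the paper's actual posterior):

```python
import random

def hdi(samples, mass=0.95):
    # Highest density interval: the shortest interval containing `mass`
    # of the posterior samples (assumes a unimodal posterior).
    s = sorted(samples)
    n = len(s)
    k = int(mass * n)                         # number of points inside
    widths = [s[i + k] - s[i] for i in range(n - k)]
    i = widths.index(min(widths))             # shortest candidate window
    return s[i], s[i + k]

random.seed(1)
# Hypothetical posterior draws for one coefficient, centered near 0.8
draws = [random.gauss(0.8, 0.2) for _ in range(20000)]
lo, hi = hdi(draws, 0.95)
print(round(lo, 2), round(hi, 2))  # roughly (0.4, 1.2); zero lies outside
```

If zero falls outside the HDI, as here, the coefficient is clearly distinguishable from the null reference value; the AD coefficient fails this check.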

**Figure 12.** Posterior distributions of the proposed model coefficients. Histogram plot with 95% HDI and null reference value.

The posterior distributions of the regression coefficients AC and BD were not clearly on one side of the reference value (16% and 7.3% of the posterior lay below or above the reference value, respectively). Since these two-factor interactions (AC, BD) were correlated with the three-factor interactions ACD and BCD, they can be excluded from the model to reduce overfitting.

The proposed probabilistic model can be further improved by centering and scaling the data, and by setting weakly informative priors on the regression coefficients and the degrees of freedom. A weakly informative prior is a distribution that covers the plausible data behavior and generates data on a reasonable scale, but is not so strong as to dominate the posterior. In addition, statements such as "weakly informative" depend crucially on what questions are being asked [48].

$$\begin{aligned} y_{centered} &\sim \text{t}\left(X\beta, \sigma^2, \nu\right) \\ \sigma^2 &\sim \text{HalfCauchy}(s_0) \\ \nu &\sim \text{Gamma}(2, 0.1) + 1 \\ \beta_0 &\sim \text{N}\left(0, 1^2\right) \\ \beta_i &\sim \text{N}\left(0, 1^2\right), \text{ for } i = 1, 2, \dots, k \end{aligned} \tag{11}$$

Following some of the recommendations in [49], Normal distributions with variance 1<sup>2</sup> for the regression coefficients and a Gamma(2, 0.1) prior for the degrees of freedom were chosen, Equation (11). By trying different numbers of regression coefficients, the final model was reduced to four regression coefficients: A, D, BD, and ACD. The posterior histograms with a 90% HDI are given in Figure 13.
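The shifted Gamma(2, 0.1) + 1 prior keeps most mass on moderate ν while still allowing near-normal values. A quick standard-library check of its behavior (assuming the rate parameterization, as in PyMC3, so the rate 0.1 becomes scale 1/0.1 = 10 for Python's `gammavariate`):

```python
import random

random.seed(2)
# Draws from the shifted prior nu ~ Gamma(alpha=2, beta=0.1) + 1.
# random.gammavariate takes a *scale* parameter: scale = 1 / rate = 10.
nu_draws = [random.gammavariate(2, 10) + 1 for _ in range(100000)]

mean_nu = sum(nu_draws) / len(nu_draws)
heavy = sum(d < 10 for d in nu_draws) / len(nu_draws)        # heavy-tailed region
normal_like = sum(d > 30 for d in nu_draws) / len(nu_draws)  # near-normal region

print(round(mean_nu, 1))  # analytic prior mean: 2 / 0.1 + 1 = 21
print(round(heavy, 2), round(normal_like, 2))
```

Roughly a fifth of the prior mass sits in each regime, so the data, not the prior, decide whether the errors are heavy-tailed or effectively normal; the +1 shift keeps ν > 1 and the likelihood mean finite.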

**Figure 13.** Posterior distributions of the final model coefficients. Histogram plot with 90% HDI and null reference value in scaled units.
