#### *2.6. Mutual Information Gain*

Feature selection with mutual information gain enables the discrimination of features based on their measured dependence on the target, for both linear and nonlinear models [53]. Mutual information quantifies the reduction in the uncertainty, expressed through the entropy *H*, of one variable when the other is observed. Let *X* be a random variable with values {*x*1, *x*2, *x*3, ..., *xn*}; its entropy is given by Equation (5) [54].

$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2\left[P(x_i)\right] \tag{5}$$
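For illustration, the entropy of Equation (5) can be estimated from the empirical distribution of a sample. The following short Python sketch (ours, using NumPy, not part of the cited works) assumes a discrete variable:

```python
import numpy as np

def entropy(x):
    """Shannon entropy H(X) in bits, per Equation (5),
    estimated from the empirical distribution of the sample x."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()           # P(x_i)
    return -np.sum(p * np.log2(p))      # -sum P(x_i) log2 P(x_i)

# Example: a uniform four-valued variable has H = 2 bits
print(entropy([0, 1, 2, 3]))  # 2.0
```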

Let *Y* be an output variable with values {*y*1, *y*2, *y*3, ..., *yn*} and let *X* be the feature variable defined above; the conditional entropy *H*(*X*|*Y*) is then given by Equation (6) [54].

$$H(X \mid Y) = -\sum_{j=1}^{n} P(y_j) \sum_{i=1}^{n} P\left(x_i \mid y_j\right) \log_2\left[ P\left(x_i \mid y_j\right) \right] \tag{6}$$

The mutual information in Equation (7) measures the reduction in the uncertainty of *X* given *Y* [54,55].

$$MI(X;Y) = H(X) - H(X \mid Y) \tag{7}$$
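As a minimal sketch of Equations (6) and (7), the following Python fragment (ours, not from the cited works; NumPy assumed, function names illustrative) computes the mutual information between two discrete variables from their joint frequencies:

```python
import numpy as np

def h_bits(p):
    """Entropy in bits of a probability vector p (zero entries skipped)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """MI(X; Y) = H(X) - H(X | Y) for two discrete variables,
    following Equations (5)-(7)."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((xs.size, ys.size))
    np.add.at(joint, (x_idx, y_idx), 1.0)   # joint counts of (x_i, y_j)
    joint /= joint.sum()                    # joint probabilities P(x_i, y_j)
    p_x = joint.sum(axis=1)                 # marginal P(x_i)
    p_y = joint.sum(axis=0)                 # marginal P(y_j)
    h_x = h_bits(p_x)                       # H(X), Equation (5)
    # H(X|Y) = sum_j P(y_j) * H(X | Y = y_j), Equation (6)
    h_x_given_y = sum(p_y[j] * h_bits(joint[:, j] / p_y[j])
                      for j in range(ys.size) if p_y[j] > 0)
    return h_x - h_x_given_y                # Equation (7)

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # equals H(X) = 1.0 bit
print(mutual_information(x, [0, 1, 0, 1]))  # independent -> 0.0
```

In practice, scikit-learn provides `mutual_info_classif` and `mutual_info_regression` in `sklearn.feature_selection`, which also handle continuous features.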

#### *2.7. Univariate Linear F-Regression Selection*

This method uses a linear model to measure the degree of linear dependence between each feature and the target; in other words, it measures the significance of a feature in a linear model [56].

The F-regression test compares the null hypothesis *H*0, under which the data are explained by an intercept-only model, against the alternative hypothesis *H*1, under which the data are better explained by a model that also includes the feature. The choice between the two hypotheses relies on the *F*-score given in Equation (8), computed from the explained variance in Equation (9) and the unexplained variance in Equation (10) [56].

$$F = \frac{\text{explained variance}}{\text{unexplained variance}}\tag{8}$$

$$\text{explained variance} = \sum_{i=1}^{K} n_i \frac{\left(\overline{Y}_{i\cdot} - \overline{Y}\right)^2}{K - 1} \tag{9}$$

$$\text{unexplained variance} = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \frac{\left(Y_{ij} - \overline{Y}_{i\cdot}\right)^2}{N - K} \tag{10}$$

where *Yij* is the *j*th observation in the *i*th of the *K* groups, with *K* the number of groups, *N* the overall sample size, and *ni* the number of observations in group *i*; the bars in Equations (9) and (10) denote the mean of group *i* and the overall mean, respectively.

Additionally, following Section 2.4, one can determine a *p*-value for the hypothesis test, and, as with the Pearson correlation, if the *p*-value > 0.05, the conclusion is considered unreliable [56].
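Equations (9) and (10) have the one-way analysis-of-variance form of the F-test. The following Python sketch (ours, assuming NumPy and SciPy; function names are illustrative) computes the *F*-score and its *p*-value for a feature whose values have been split into *K* groups:

```python
import numpy as np
from scipy.stats import f as f_dist

def f_score(groups):
    """F-score and p-value per Equations (8)-(10), where `groups`
    is a list of 1-D arrays holding the K groups of observations."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    K = len(groups)                                   # number of groups
    N = sum(g.size for g in groups)                   # overall sample size
    grand_mean = np.concatenate(groups).mean()
    # Equation (9): between-group (explained) variance
    explained = sum(g.size * (g.mean() - grand_mean) ** 2
                    for g in groups) / (K - 1)
    # Equation (10): within-group (unexplained) variance
    unexplained = sum(((g - g.mean()) ** 2).sum()
                      for g in groups) / (N - K)
    F = explained / unexplained                       # Equation (8)
    p_value = f_dist.sf(F, K - 1, N - K)              # F(K-1, N-K) tail
    return F, p_value

F, p = f_score([[1.1, 0.9, 1.0], [2.0, 2.1, 1.9]])
print(F, p)   # large F with p < 0.05 -> reject H0
```

For the regression variant used in feature selection, scikit-learn's `sklearn.feature_selection.f_regression` returns the *F*-scores and *p*-values for all features at once.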
