*3.2. Gradient-Based Multi-Objective Feature Selection*

Although MOBBO and other gradient-free MOO methods can in principle locate globally optimal solutions [22,23], they are computationally expensive because the classifier must be retrained many times (once per individual in the population, in every generation). To reduce this computational burden, we propose a novel algorithm called gradient-based multi-objective feature selection (GMOFS), in which feature selection and data classification are performed simultaneously.

GMOFS incorporates a regularization penalty term into the optimization problem of its learning algorithm. The penalty term, weighted by a Lagrange multiplier, steers training toward a model that is both parsimonious and accurate. We use an MLP network as the classifier and include an elastic net penalty on the size of the selected feature subset. The first step of GMOFS is to train a constrained MLP network with the cost function

$$J = \frac{1}{2} \sum\_{l=1}^{m} \sum\_{j=1}^{K} \left( t\_j^{(l)} - o\_j^{(l)} \right)^2 + \lambda \sum\_{i=1}^{n} \left( \alpha\beta\_i^2 + (1 - \alpha)|\beta\_i| \right), \tag{4}$$

where *β<sub>i</sub>* is the multiplier applied to the *i*-th input feature before it enters the MLP network; *t<sub>j</sub><sup>(l)</sup>* and *o<sub>j</sub><sup>(l)</sup>* are the target and actual values, respectively, of the *j*-th output neuron for the *l*-th training pattern; *K* is the number of output neurons (classes); *m* is the number of training patterns; and *n* is the number of input features. We use an MLP network with one hidden layer of *p* hidden neurons (including the bias node). *v<sub>ih</sub>* denotes the weight connecting the *i*-th input neuron to the *h*-th hidden neuron, and *w<sub>hj</sub>* denotes the weight connecting the *h*-th hidden neuron to the *j*-th output neuron. The first term of the cost function penalizes classification error, while the second term, the elastic net, penalizes the number of selected features. The elastic net is a convex combination of ridge regression (*α* = 1) and the least absolute shrinkage and selection operator (*α* = 0). *λ* ≥ 0 is a complexity parameter that controls the shrinkage of the input features: large *λ* shrinks *β<sub>i</sub>* toward zero, which implies that the corresponding input feature is not significant. However, as shrinkage increases, classification error tends to increase, so *λ* provides a trade-off between classification error and the number of selected features. In summary, the construction of the MLP network with the elastic net is formulated as the following optimization problem:

$$\min\_{\beta, w, v} J \quad \text{subject to} \quad \begin{cases} 0 \le \beta\_i \le 1, \\ |w\_{hj}| \le a \text{ and } |v\_{ih}| \le b, \end{cases} \tag{5}$$

for all *i*, *j*, *h*, where *β<sub>i</sub>* = 0 or 1 implies that the associated feature is the least or most significant input variable, respectively. *a* and *b* are the bounds on the neuron weights *w<sub>hj</sub>* and *v<sub>ih</sub>*, respectively. Because *β<sub>i</sub>* interacts directly with the neuron weights *v<sub>ih</sub>* and *w<sub>hj</sub>*, a small *β<sub>i</sub>* can be compensated by large weights, so a feature with small *β<sub>i</sub>* and large neuron weights cannot be concluded to be insignificant; to rule out such solutions, the neuron weights are bounded. Backpropagation is used to update *β<sub>i</sub>*, *v<sub>ih</sub>*, and *w<sub>hj</sub>*. The derivatives of *J* with respect to the output weights *w<sub>hj</sub>*, hidden weights *v<sub>ih</sub>*, and input weights *β<sub>i</sub>* are obtained by the chain rule as

$$\begin{aligned} \frac{\partial J}{\partial w\_{hj}} &= \sum\_{l=1}^{m} \delta\_{j}^{(l)} y\_{h}^{(l)}, \\ \frac{\partial J}{\partial v\_{ih}} &= \sum\_{l=1}^{m} \left[ \sum\_{k \in D\_{2}(h)} \left[ \delta\_{k}^{(l)} w\_{hk} \right] y\_{h}^{(l)} \left(1 - y\_{h}^{(l)}\right) \beta\_{i} x\_{i}^{(l)} \right], \\ \frac{\partial J}{\partial \beta\_{i}} &= \sum\_{l=1}^{m} \left\{ \sum\_{s \in D\_{1}(i)} \left[ \sum\_{k \in D\_{2}(s)} \left[ \delta\_{k}^{(l)} w\_{sk} \right] y\_{s}^{(l)} \left(1 - y\_{s}^{(l)}\right) v\_{is} \right] x\_{i}^{(l)} \right\} + \lambda \left[ 2\alpha \beta\_{i} + (1 - \alpha) \frac{\beta\_{i}}{|\beta\_{i}|} \right], \end{aligned} \tag{6}$$

where *D*<sub>1</sub>(*i*) is the set of hidden layer neurons whose inputs come from the *i*-th input neuron, *D*<sub>2</sub>(*h*) is the set of output neurons whose inputs come from the *h*-th hidden layer neuron, and *δ<sub>j</sub><sup>(l)</sup>* = (*o<sub>j</sub><sup>(l)</sup>* − *t<sub>j</sub><sup>(l)</sup>*)(1 − *o<sub>j</sub><sup>(l)</sup>*)*o<sub>j</sub><sup>(l)</sup>*. A detailed derivation of the derivatives of *J* with respect to *w<sub>hj</sub>* and *v<sub>ih</sub>* is available in [45]; the derivative with respect to the input weights *β<sub>i</sub>* follows from the chain rule in the same way. We use the derivatives in Equation (6) and the constraints in Equation (5), along with the trust region reflective algorithm, to train the constrained MLP network.
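To make Equations (4) and (6) concrete, the sketch below implements the cost *J* and the derivative ∂*J*/∂*β<sub>i</sub>* in NumPy and verifies the latter against a central finite difference. This is a minimal illustration under stated assumptions, not the paper's implementation: all names are ours, bias terms are omitted for brevity, and sigmoid activations are assumed at both layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_J(beta, V, W, X, T, lam, alpha):
    """Cost of Equation (4): squared error plus elastic-net penalty on beta.

    beta : (n,) feature multipliers, V : (n, p) input-to-hidden weights,
    W : (p, K) hidden-to-output weights, X : (m, n) patterns, T : (m, K) targets.
    """
    Y = sigmoid((X * beta) @ V)          # hidden outputs y_h
    O = sigmoid(Y @ W)                   # network outputs o_j
    sse = 0.5 * np.sum((T - O) ** 2)
    return sse + lam * np.sum(alpha * beta**2 + (1 - alpha) * np.abs(beta))

def grad_beta(beta, V, W, X, T, lam, alpha):
    """Analytic dJ/dbeta_i from Equation (6), vectorized over patterns."""
    Y = sigmoid((X * beta) @ V)
    O = sigmoid(Y @ W)
    delta = (O - T) * O * (1 - O)             # output error signals delta_j
    back = (delta @ W.T) * Y * (1 - Y)        # back-propagated hidden signals
    g = np.einsum('mp,ip,mi->i', back, V, X)  # sum over patterns l, hidden s
    return g + lam * (2 * alpha * beta + (1 - alpha) * np.sign(beta))

# Central finite-difference check of the analytic gradient
rng = np.random.default_rng(0)
n, p, K, m = 4, 3, 2, 6
X = rng.normal(size=(m, n))
T = np.eye(K)[rng.integers(0, K, m)]
V = rng.normal(size=(n, p))
W = rng.normal(size=(p, K))
beta = rng.random(n) + 0.1               # keep beta > 0 so |beta| is smooth
lam, alpha = 0.3, 0.6

g = grad_beta(beta, V, W, X, T, lam, alpha)
eps = 1e-6
g_num = np.empty(n)
for i in range(n):
    e = np.zeros(n); e[i] = eps
    g_num[i] = (cost_J(beta + e, V, W, X, T, lam, alpha)
                - cost_J(beta - e, V, W, X, T, lam, alpha)) / (2 * eps)
assert np.allclose(g, g_num, atol=1e-5)
```

In practice the constrained minimization itself would be handed to a bound-constrained solver; the check above only confirms that the analytic derivatives agree with the cost.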

Once the MLP training phase is complete, the input weights *β<sub>i</sub>* are sorted in descending order; the input variable with the largest *β<sub>i</sub>* is the most significant feature. The second step of GMOFS is to select the *r* most significant features, which are associated with the *r* largest input weights *β<sub>i</sub>* and which satisfy

$$\frac{\sum\_{i=1}^{r}\beta\_{i}}{\sum\_{j=1}^{n}\beta\_{j}} \ge 95\%, \qquad \beta\_{1} \ge \beta\_{2} \ge \cdots \ge \beta\_{n}. \tag{7}$$
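In code, the selection rule of Equation (7) amounts to keeping the smallest prefix of the descending-sorted *β* values that accounts for at least 95% of their total mass. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def select_features(beta, threshold=0.95):
    """Return indices of the r most significant features per Equation (7):
    the smallest r whose largest beta values sum to at least `threshold`
    of the total sum of beta."""
    order = np.argsort(beta)[::-1]                 # sort beta descending
    cum = np.cumsum(beta[order]) / beta.sum()      # cumulative mass ratio
    r = int(np.searchsorted(cum, threshold) + 1)   # smallest r meeting the bound
    return order[:r]
```

For example, with *β* = (10, 9, 0.5, 0.5) the first two features already carry 95% of the total weight, so only those two are selected.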

The 95% threshold value in Equation (7) determines the trade-off between the number of selected features and the accuracy of the model. A low threshold value decreases the number of selected features but is more likely to remove informative features that contribute significantly to the accuracy of the UIR model; a high threshold value is more likely to retain irrelevant features that contribute nothing to the accuracy of the model. To tune the threshold, we gradually increase it from zero and, for each value, compute the accuracy of the trained MLP twice: once with the original *β<sub>i</sub>* values and once with *β<sub>i</sub>* = 0 for all unselected input features. We increase the threshold until there is no significant difference between the performances of the MLP in these two cases; in our experiments, this point was reached at a threshold of 95%.

We then repeat the first two steps of GMOFS for different *λ* in the range [*λ<sub>l</sub>*, *λ<sub>u</sub>*] with a predefined increment Δ*λ*. The selected subsets associated with each *λ* comprise a population, whose size depends on Δ*λ*. To assess the performance of each selected feature subset, we train a classifier with it and compute the classification error. In this population, the subset associated with *λ* → ∞ has minimum size and maximum classification error, whereas the subset with *λ* = 0 has maximum size and probably the lowest classification error. Thus, the size of the selected subset and the classification error, defined in Equation (3), are two conflicting objectives. To find the GMOFS Pareto front, we first obtain the Pareto set as

$$P\_s = \left\{ \mathbf{x}^\* : \nexists\, \mathbf{x} \text{ such that } f\_i(\mathbf{x}) \le f\_i(\mathbf{x}^\*) \text{ for all } i \in \{1, 2\} \text{ and } f\_j(\mathbf{x}) < f\_j(\mathbf{x}^\*) \text{ for some } j \in \{1, 2\} \right\}. \tag{8}$$

**x**<sup>∗</sup> denotes a non-dominated solution in the population and *f<sub>i</sub>*(**x**) is the *i*-th objective function. The Pareto front *P<sub>f</sub>* is obtained from the objective function vectors **f**(**x**<sup>∗</sup>) that correspond to the Pareto set:

$$P\_f = \left\{ f(\mathbf{x}^\*) : \mathbf{x}^\* \in P\_s \right\}. \tag{9}$$
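The non-dominated filter of Equations (8) and (9) can be sketched directly from the definition as a plain O(N²) scan over the population's objective vectors (function name ours; both objectives are minimized):

```python
def pareto_set(points):
    """Return the non-dominated objective vectors per Equation (8).

    points : list of (f1, f2) tuples, e.g. (classification error, subset size).
    A point is dominated if some other point is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for p in points:
        dominated = any(
            all(qi <= pi for qi, pi in zip(q, p)) and
            any(qi < pi for qi, pi in zip(q, p))
            for q in points if q is not p
        )
        if not dominated:
            front.append(p)
    return front
```

For instance, `pareto_set([(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)])` keeps (1, 5), (2, 3), and (4, 1), while the dominated points (3, 4) and (5, 5) are dropped.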

Note that all Pareto points are equally preferable apart from subjective prioritization. The outline of GMOFS is given in Algorithm 1.

**Algorithm 1:** The outline of gradient-based multi-objective feature selection (GMOFS), where *x<sub>i</sub>* is the *i*-th feature in the training set *X*, and *Y* is the corresponding set of output classes.

**Initialization:** *λ* = *λ<sub>l</sub>* ≤ *λ<sub>u</sub>*, Population = ∅, *k* = 1
**While** *λ* ≤ *λ<sub>u</sub>*
  **Step 1:** Use the training data {*X*, *Y*} to train the constrained MLP network in Equation (4) by solving Equation (5)
  **Step 2:** Sort the input weights {*β<sub>i</sub>*} in descending order; use Equation (7) to select the subset *S<sub>k</sub>* ⊂ *X*, where size(*S<sub>k</sub>*) ≤ size(*X*)
  **Step 3:** Population ← Population + *S<sub>k</sub>*; *k* ← *k* + 1
  *λ* ← *λ* + Δ*λ*
**Next** *λ*
**Step 4:** **For** each subset *S<sub>k</sub>* in Population
  Use cross-validation to train and test a classifier with dataset {*S<sub>k</sub>*, *Y*}
  Calculate the objective functions *f<sub>1</sub><sup>k</sup>* and *f<sub>2</sub><sup>k</sup>* using Equation (3)
**Next** subset *S<sub>k</sub>*
**Step 5:** Find the Pareto set using Equation (8)
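The *λ* sweep of Algorithm 1 can be sketched as a short driver loop. `train_constrained_mlp` and `cv_error` below are hypothetical stand-ins for Steps 1 and 4 (constrained MLP training and cross-validation, which the paper handles as described above); only the control flow and the Equation (7) selection are spelled out:

```python
import numpy as np

def gmofs_sweep(X, Y, lam_grid, train_constrained_mlp, cv_error, threshold=0.95):
    """Outline of Algorithm 1: one candidate feature subset per lambda value."""
    population = []
    for lam in lam_grid:                              # While lam <= lam_u
        beta = train_constrained_mlp(X, Y, lam)       # Step 1: train, obtain beta
        order = np.argsort(beta)[::-1]                # Step 2: sort beta descending
        cum = np.cumsum(beta[order]) / beta.sum()
        r = int(np.searchsorted(cum, threshold) + 1)  # Equation (7)
        population.append(order[:r])                  # Step 3: grow the population
    # Step 4: evaluate each subset by cross-validation
    objectives = [(cv_error(X[:, S], Y), len(S)) for S in population]
    return population, objectives
```

Step 5 then extracts the Pareto set from `objectives` via Equation (8).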
