*4.2. Evaluation Criteria*

Each algorithm carried out 20 independent runs with random initial positioning of the search agents. The repeated runs were used to test convergence capability. Eight well-known measures are recorded in order to compare the algorithms' performance. These metrics are listed as follows:

• Best: The minimum (i.e., best, for a minimization problem) fitness function value obtained over the independent runs, as defined in Equation (14).

$$Best = Min\_{i=1}^{M} g\_{\*}^{i} \tag{14}$$

• Worst: The maximum (i.e., worst, for a minimization problem) fitness function value obtained over the independent runs, as shown in Equation (15).

$$Worst = Max\_{i=1}^{M} g\_{\*}^{i} \tag{15}$$

• Mean: The average fitness value over the *M* runs of the optimization algorithm, as shown in Equation (16).

$$Mean = \frac{1}{M} \sum\_{i=1}^{M} g\_{\*}^{i} \tag{16}$$

where *g<sup>i</sup>*∗ is the optimal solution obtained in the *i*-th run;

• Standard deviation (Std): Measures the dispersion of the fitness values over the runs and is calculated from Equation (17).

$$Std = \sqrt{\frac{1}{M} \sum\_{i=1}^{M} \left( g\_{\*}^{i} - Mean \right)^2} \tag{17}$$

• Average classification accuracy: Measures the accuracy of the classifier and is calculated by Equation (18).

$$AveragePerformance = \frac{1}{M} \sum\_{j=1}^{M} \frac{1}{N} \sum\_{i=1}^{N} Match(C\_i, L\_i) \tag{18}$$

where *C<sub>i</sub>* refers to the classifier output for instance *i*; *N* refers to the number of instances in the test set; and *L<sub>i</sub>* refers to the reference class corresponding to instance *i*;

• Average selection size (Avg-Selection): Measures the average ratio of selected features to the full feature set over all runs and is calculated by Equation (19).

$$AverageSelectionSize = \frac{1}{M} \sum\_{i=1}^{M} \frac{size(g\_{\*}^{i})}{N\_{t}} \tag{19}$$

where *Nt* is the total number of features in the original dataset;

• Average execution time (Avg-Time): Measures the average execution time in milliseconds of each compared optimization algorithm over the different runs and is calculated by Equation (20).

$$Avg\text{-}Time = \frac{1}{M} \sum\_{i=1}^{M} RunT\_{a,i} \tag{20}$$

where *M* refers to the number of runs for optimizer *a*, and *RunT<sub>a,i</sub>* is the computational time in milliseconds for optimizer *a* at run number *i*;

• Wilcoxon rank-sum test (WRS): A non-parametric test [67]. The test assigns ranks to all scores across the two groups and then sums the ranks within each group. The rank-sum test is often described as the non-parametric counterpart of the *t*-test for two independent groups.
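The measures in Equations (14)–(20) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `summarize_runs` and its arguments are hypothetical, and Std is computed in the population form (1/*M*) to match Equation (17). A Wilcoxon rank-sum comparison between two algorithms' per-run samples is noted in a comment.

```python
import numpy as np

def summarize_runs(fitness, n_selected, n_total, run_times_ms, accuracies):
    """Evaluation measures of Eqs. (14)-(20) over M independent runs.

    fitness      : M best fitness values g*_i, one per run
    n_selected   : M selected-feature counts size(g*_i)
    n_total      : total number of features N_t in the original dataset
    run_times_ms : M execution times in milliseconds
    accuracies   : M per-run classification accuracies on the test set
    """
    fitness = np.asarray(fitness, dtype=float)
    return {
        "Best":  fitness.min(),                # Eq. (14): minimum over runs
        "Worst": fitness.max(),                # Eq. (15): maximum over runs
        "Mean":  fitness.mean(),               # Eq. (16): average over runs
        "Std":   fitness.std(),                # Eq. (17): population std (1/M)
        "Avg-Accuracy":  float(np.mean(accuracies)),                    # Eq. (18)
        "Avg-Selection": float(np.mean(np.asarray(n_selected) / n_total)),  # Eq. (19)
        "Avg-Time":      float(np.mean(run_times_ms)),                  # Eq. (20)
    }

# For the Wilcoxon rank-sum test between two algorithms' fitness samples,
# scipy.stats.ranksums(sample_a, sample_b) returns the statistic and p-value.
```

In practice, each of the compared algorithms would produce one such summary per dataset, and the per-run fitness samples would additionally feed the pairwise Wilcoxon tests.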

The two proposed versions of the whale optimization algorithm (bWOA-S and bWOA-V) are compared with three algorithms that are well established in this domain. Four different initialization methods are used to verify the two proposed algorithms' ability to converge from different initial positions. These methods are: (1) large initialization, which evaluates the local search capability of a given algorithm, since the search agents' positions start close to the optimal solution; (2) small initialization, which evaluates the global search ability of a given algorithm from the initial positions; (3) mixed initialization, in which some search agents start close to the optimal solution while the others start far away, which maintains population diversity since the agents are spread apart from each other; and (4) random initialization.
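The four initialization schemes above can be sketched for binary search agents as follows. This is an illustrative assumption, not the paper's exact setup: the function name `initialize_agents`, the 0.9 bit-agreement probability defining "close to" the optimum, and the use of a known reference optimum are all hypothetical choices made for the sketch.

```python
import numpy as np

def initialize_agents(n_agents, n_dims, method="random", optimum=None, rng=None):
    """Generate an (n_agents x n_dims) binary population under one of four schemes.

    'large'  : agents start near the reference optimum  -> probes local search
    'small'  : agents start far from the optimum        -> probes global search
    'mixed'  : half near, half far                      -> maintains diversity
    'random' : uniform random 0/1 positions
    """
    rng = np.random.default_rng(rng)
    if method == "random" or optimum is None:
        return rng.integers(0, 2, size=(n_agents, n_dims))
    optimum = np.asarray(optimum, dtype=int)
    # "Near" agents agree with the optimum on ~90% of bits; "far" agents disagree.
    near = np.where(rng.random((n_agents, n_dims)) < 0.9, optimum, 1 - optimum)
    far = np.where(rng.random((n_agents, n_dims)) < 0.9, 1 - optimum, optimum)
    if method == "large":
        return near
    if method == "small":
        return far
    if method == "mixed":
        half = n_agents // 2
        return np.vstack([near[:half], far[half:]])
    raise ValueError(f"unknown initialization method: {method}")
```

Distance to the optimum is measured here in Hamming terms, which is the natural choice for binary feature-selection vectors.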

#### *4.3. Performance on Small Initialization*

The statistical average fitness values on the different datasets obtained by the compared algorithms using the small initialization method are shown in Table 3. Table 4 shows the average classification accuracy on the test data of the compared algorithms using the small initialization method. From these tables, we can conclude that both bWOA-S and bWOA-V achieve better results than the other algorithms.

#### *4.4. Performance on Large Initialization*

The statistical average fitness values on the different datasets obtained by the compared algorithms using the large initialization method are shown in Table 5. Table 6 shows the average classification accuracy on the test data of the compared algorithms using the large initialization method. From these tables, we can conclude that, when using large initialization, both bWOA-S and bWOA-V achieve better results than the other algorithms.

#### *4.5. Performance on Mixed Initialization*

The statistical average fitness values on the different datasets obtained by the compared algorithms using the mixed initialization method are shown in Table 7. Table 8 shows the average classification accuracy on the test data of the compared algorithms using the mixed initialization method. From these tables, we can conclude that both bWOA-S and bWOA-V achieve better results than the other algorithms.

**Table 3.** Statistical mean fitness measure on the different datasets calculated for the compared algorithms using small initialization.


**Table 4.** Average classification accuracy for the compared algorithms on the different datasets using small initialization.

