*3.5. Evaluation Metrics*

The validation of the proposed algorithm was performed using inter-patient classification, following a leave-one-patient-out cross-validation. Overall accuracy (OA), sensitivity, and specificity were calculated to measure the performance of the different approaches. OA is defined by Equation (5), where TP is true positives, TN is true negatives, P is positives, and N is negatives. Sensitivity and specificity are defined in Equations (6) and (7), respectively, where FN is false negatives and FP is false positives. In addition, the Matthews correlation coefficient (MCC) was employed to evaluate the different approaches (Equation (8)). This metric, which computes the correlation coefficient between the observed and the predicted values [58], is mainly used to analyze classifiers that work with unbalanced data. MCC ranges between [−1, 1], where −1 represents a completely wrong prediction and 1 indicates a completely correct prediction. For comparison with the other metrics presented in this work, the MCC metric has been normalized to the [0, 1] range by applying Equation (9).

$$OA = \frac{TP + TN}{P + N} \tag{5}$$

$$Sensitivity = \frac{TP}{TP + FN} \tag{6}$$

$$Specificity = \frac{TN}{TN + FP} \tag{7}$$

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}} \tag{8}$$

$$\text{MCC}\_{\text{norm}} = \frac{\text{MCC} + 1}{2} \tag{9}$$
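As an illustration, the metrics in Equations (5)–(9) can be computed directly from the confusion-matrix counts of a cross-validation fold. The following sketch uses our own function and variable names, which are not part of the original work:

```python
import math

def evaluation_metrics(TP, TN, FP, FN):
    """Compute OA, sensitivity, specificity, and normalized MCC (Eqs. 5-9)."""
    P, N = TP + FN, TN + FP               # positives and negatives
    oa = (TP + TN) / (P + N)              # overall accuracy, Eq. (5)
    sensitivity = TP / (TP + FN)          # Eq. (6)
    specificity = TN / (TN + FP)          # Eq. (7)
    denom = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    mcc = (TP * TN - FP * FN) / denom if denom else 0.0   # Eq. (8)
    mcc_norm = (mcc + 1) / 2              # normalized to [0, 1], Eq. (9)
    return oa, sensitivity, specificity, mcc_norm
```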

Classification maps are another evaluation tool commonly used in HSI. They allow users to visually identify where each of the different classes is located, and are employed to visually evaluate the classification results obtained when the entire HS cube is processed, including labeled and non-labeled pixels. After the classification of the HS cube is performed, a certain color is assigned to each class. This mainly allows evaluating the results obtained in the prediction of non-labeled pixels. The colors represented in the classification maps are the following: green was assigned to the first class (healthy tissue); red was assigned to the second class (tumor tissue); blue was assigned to the third class (hypervascularized tissue); and black was assigned to the fourth class (background).
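The color-coding step described above can be sketched as a simple label-to-RGB mapping. This is a minimal illustration, not the authors' implementation; the class indices (0–3) and exact RGB triplets are assumptions:

```python
import numpy as np

# Assumed class indices and the color code described in the text
CLASS_COLORS = {
    0: (0, 255, 0),    # healthy tissue           -> green
    1: (255, 0, 0),    # tumor tissue             -> red
    2: (0, 0, 255),    # hypervascularized tissue -> blue
    3: (0, 0, 0),      # background               -> black
}

def classification_map(labels):
    """labels: 2-D array of predicted class indices, one per pixel."""
    rgb = np.zeros(labels.shape + (3,), dtype=np.uint8)
    for cls, color in CLASS_COLORS.items():
        rgb[labels == cls] = color   # paint every pixel of this class
    return rgb
```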

In addition to the standard OA, an additional metric has been proposed for identifying the best results obtained with the optimization algorithms while also taking into account the number of selected bands. This *OAPenalized* is based on the OA presented in Equation (5), but includes a penalty when a high number of bands is used. Equation (10) presents the mathematical expression used to compute this *OAPenalized*, where λ is the number of bands selected by the algorithm and λ*max* is the total number of bands in the original dataset.

The specific Figure of Merit (*FoM*) employed to obtain the most relevant bands with the optimization algorithms in the *PF2* has the goal of finding the most balanced accuracy results across the classes, as observed in Equation (12), where *n* is the number of classes and *i* and *j* are the indexes of the classes being compared. The mathematical expression of the *ACCperClass* in a multiclass classification is obtained by dividing the total number of successful results (*TP*) for a particular class by the total population of this class (*TP* + *FN*). This expression is equal to the sensitivity of a certain class in a multiclass classification problem. Equation (11) shows the mathematical expression of the *ACCperClass*.

In addition to the previously presented *FoM*, another metric has been proposed for identifying the best results obtained with the optimization algorithms while also taking into account the number of selected bands. This *FoMPenalized* is based on the *FoM* presented in Equation (12), but includes a penalty when a high number of bands is used. Equation (13) presents the mathematical expression used to compute this *FoMPenalized*, where λ is the number of bands selected by the algorithm and λ*max* is the total number of bands in the original dataset.

$$OA\_{\text{Penalized}} = 1 - \frac{OA}{1 + \frac{\lambda}{\lambda\_{\text{max}}}} \tag{10}$$

$$\text{ACC}\_{\text{perClass}} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{11}$$

$$FoM = \frac{1}{2} \cdot \left( \sum\_{\substack{i,j\\i<j}}^{n} \frac{ACC\_{perClass\_i} + ACC\_{perClass\_j} - \left| ACC\_{perClass\_i} - ACC\_{perClass\_j} \right|}{\binom{n}{2}} \right) \tag{12}$$

$$FoM\_{\text{Penalized}} = 1 - \frac{FoM}{1 + \frac{\lambda}{\lambda\_{\text{max}}}} \tag{13}$$
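A minimal sketch of the band-selection fitness functions in Equations (10)–(13). The penalized expressions follow the text directly; Equation (12) was partially garbled in the source, so here it is taken as the average pairwise minimum of the per-class accuracies (a form that rewards balanced classes), which is our reconstruction, not a confirmed formula:

```python
from itertools import combinations

def oa_penalized(oa, n_bands, n_bands_max):
    # Eq. (10): lower is better; using more bands shrinks the reward term
    return 1 - oa / (1 + n_bands / n_bands_max)

def fom(acc_per_class):
    # Reconstructed Eq. (12): mean of min(ACC_i, ACC_j) over class pairs,
    # since (a + b - |a - b|) / 2 == min(a, b)
    pairs = list(combinations(acc_per_class, 2))
    return sum(min(a, b) for a, b in pairs) / len(pairs)

def fom_penalized(fom_value, n_bands, n_bands_max):
    # Eq. (13): same penalty scheme as Eq. (10), applied to the FoM
    return 1 - fom_value / (1 + n_bands / n_bands_max)
```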

#### **4. Experimental Results and Discussion**

This section presents the results obtained in the three proposed processing frameworks, as well as an overall discussion of the results. Table A1 in Appendix A lists the acronyms of the proposed methods named in the next sections to help the reader follow the explanation of the experimental results.

#### *4.1. Sampling Interval Analysis (PF1)*

The *PF1* has the goal of comparing the use of different numbers of spectral bands in the HS database, modifying the sampling interval of the spectral data in order to simulate the use of different HS cameras and to reduce the size of the database. This leads to a reduction of the execution time of the processing algorithm. In addition, as shown in Figure 3a, the *PF1* was evaluated using two different training datasets for the SVM model generation: the original and the reduced training dataset.
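Simulating a coarser HS camera amounts to subsampling the spectral axis at a fixed interval. The sketch below is illustrative only; the decimation steps are assumptions, and the paper's actual band counts (e.g., 214 or 128 bands from 826) need not correspond to integer steps:

```python
import numpy as np

def subsample_bands(cube, step):
    """cube: HS cube of shape (rows, cols, bands); keep every step-th band."""
    return cube[:, :, ::step]

# Toy cube with the dataset's original 826 bands (spatial size is arbitrary)
cube = np.zeros((4, 4, 826), dtype=np.float32)
reduced = subsample_bands(cube, 13)   # decimation by 13 leaves 64 bands
```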

Figure 5a shows the classification results obtained for each sampling interval (different number of bands) when training the SVM algorithm with the original dataset, while Figure 5b shows the results using the reduced training dataset. It can be observed in both figures that the overall accuracy of the classifier decreases as the number of bands is reduced. However, the sensitivity values for each class are quite similar from 826 down to 64 bands. When the number of bands is lower than 64, the sensitivity values drop drastically. With respect to the standard deviation, the behavior obtained with both datasets is similar. For the OA, normal sensitivity, and hypervascularized sensitivity, the results remain stable as the number of bands is reduced. In the case of tumor sensitivity, the standard deviation is higher than in the other cases. This behavior is caused by one of the HS test images (*P020-01*), which presents problems in the classification, with none of its tumor pixels being correctly identified. In the background class, the standard deviation increases as the number of bands decreases. On the other hand, as can be seen in Figure S1 of the supplementary material, the specificity results are very similar in both cases, being generally higher than 80%. In addition, Figure S2 of the supplementary material shows the results of the normalized MCC metric for the original and reduced training datasets, which takes into account the imbalance of the labeled test dataset. As can be seen, the results obtained in all the classes are similar except for the tumor tissue class, which improves by an average of ~5% when the reduced training dataset is employed.

**Figure 5.** Average and standard deviation results of the leave-one-patient-out cross-validation for each band reduction. (**a**) Using the original training dataset. (**b**) Using the reduced training dataset. (**c**) Difference of the results computed using the reduced dataset with respect to the original dataset.

Figure 5c shows the differences in the results between the reduced and the original training dataset. As can be seen, the reduced dataset provides better accuracy results in the tumor class with respect to the original dataset, reaching an average increment of ~20%. Since the main goal of this work is to accurately identify the tumor pixels, this increment represents a significant improvement toward this goal. It is worth noting that only the training dataset was reduced in the number of samples; the test dataset was not.

On the other hand, the image size is directly related to the execution time of the processing algorithm. Figure 6a shows the execution time of the SVM training and classification processes computed using the MATLAB® programming environment together with the LIBSVM package. This figure presents the execution time results (expressed in minutes) for both training schemes (original training dataset and reduced training dataset). The times depicted in this figure include both the time required to train the model for one leave-one-patient-out cross-validation iteration and the time needed to classify the corresponding patient data. In order to compare the results, a logarithmic scale was used. On the one hand, as the number of bands decreases, the execution time also decreases, remaining practically stable from 128 down to 8 bands in both cases. On the other hand, it is clear that the use of the reduced training dataset offers a significant execution time reduction. For example, using 826 bands, the execution time for the original training dataset is ~778 min, while for the reduced dataset it is ~16 min, yielding a speedup factor of ~48×. Taking this into account, the reduced training dataset was selected for the next experiments.

**Figure 6.** (**a**) Band reduction execution time for original and reduced training datasets (representation in logarithmic scale). (**b**) Overall accuracy and inverse normalized execution time achieved using the reduced training dataset with respect to the number of bands.

In order to select a sampling interval with a good trade-off, providing a reduction in the execution time of the algorithm while keeping high discrimination, a relation between these two metrics was analyzed. Figure 6b shows the relation between the accuracy and the inverse normalized execution time depending on the number of bands employed in the HS dataset, ranging from 826 to 8 bands. The analysis shows that when all the bands are used, the overall accuracy is high, but the execution time is also very high. However, when only a few bands are used, the overall accuracy decreases by more than 20%, but the execution time is quite low. In this sense, the range that provides a good compromise between the execution time and the overall accuracy is found between 214 and 128 bands (dashed red lines in Figure 6b). In this range, the accuracy results stabilize at a value of 80% and the execution times are acceptable. For this reason, the number of bands selected to form the HS dataset in the next experiments was 128 (the lowest number of bands with the same overall accuracy), involving a sampling interval of 3.61 nm.
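The "inverse normalized execution time" plotted in Figure 6b can be sketched as a min–max normalization of the reciprocal times, so that the slowest configuration maps to 0 and the fastest to 1. The timing values below are illustrative placeholders, not the paper's measurements:

```python
def inverse_normalized_time(times):
    """Map execution times to [0, 1]: slowest -> 0, fastest -> 1."""
    inv = [1.0 / t for t in times]                  # reciprocal: faster is larger
    lo, hi = min(inv), max(inv)
    return [(v - lo) / (hi - lo) for v in inv]      # min-max normalization

# Hypothetical per-configuration times in minutes (826 bands down to 8 bands)
times_min = [778, 60, 16, 15, 15]
norm = inverse_normalized_time(times_min)
```

Plotting this curve against the OA for each band count highlights the region where both quantities are simultaneously high, which is how the 214–128 band compromise range can be read off the figure.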

The use of the reduced training dataset together with the selection of 128 spectral bands ensures a reduction in the execution time (mainly in the SVM training process), allowing us to perform the band selection using the optimization algorithms in the next processing framework (PF2). This avoids large processing times during the huge number of iterations required by the optimization algorithms to reach the optimal solution.
