## *3.1. Validation Results*

We trained a different CNN on the data from each fold and, based on the validation data, selected the aforementioned CNN architecture (Table 2) as the best candidate for classifying the samples. As can be observed in Table 1, the tumor and non-tumor classes are not balanced: the number of non-tumor samples is twice the number of tumor samples. For this reason, we performed data augmentation on the tumor data to balance the classes during training, doubling the number of tumor patches cited in Table 1. This augmentation consisted of a single spatial rotation of each tumor patch.
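As a rough illustration, this balancing step could be implemented as follows. This is a minimal sketch assuming the tumor patches are stored as a NumPy array with the spectral axis last and that the single rotation is by 90° (the exact rotation angle is not specified above):

```python
import numpy as np

def balance_with_rotation(tumor_patches: np.ndarray) -> np.ndarray:
    """Double the tumor set by adding one rotated copy of each patch.

    `tumor_patches` is assumed to have shape (N, H, W, bands), i.e.,
    N square HS patches with the spectral axis last.
    """
    # A single 90-degree spatial rotation per patch doubles the class size
    # without altering the spectral signature of any pixel.
    rotated = np.rot90(tumor_patches, k=1, axes=(1, 2))
    return np.concatenate([tumor_patches, rotated], axis=0)
```

The balanced tumor set can then be mixed with the (unaugmented) non-tumor patches before training, yielding the roughly 1:1 class ratio described above.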

At the beginning of the validation phase, some of the folds presented problems during training, showing poor performance metrics on the validation set. For this reason, we carefully examined the tumor HS images from each patient and found that some necrosis areas had accidentally been included in the dataset as tumor samples. These necrosis areas (found in P8) were excluded from the dataset. After excluding them, we obtained competitive results for all folds on the validation set; these results are shown in Table 5. The model for each fold was selected because it presented a high AUC (above 0.92) and balanced accuracy, sensitivity, and specificity, indicating that it correctly identified both non-tumor and tumor tissue.
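For concreteness, the fold-wise selection rule just described could be expressed as in the sketch below; each candidate model is assumed to be summarized by a dictionary of validation metrics, and "balanced" is interpreted as the smallest sensitivity–specificity gap (the exact tie-breaking rule is not stated above):

```python
def select_fold_model(candidates):
    """Select a fold's model: require validation AUC above 0.92, then
    prefer the most balanced sensitivity/specificity pair.

    `candidates` is assumed to be a list of dicts such as
    {"auc": 0.94, "accuracy": 0.90, "sensitivity": 0.91, "specificity": 0.89}.
    """
    eligible = [c for c in candidates if c["auc"] > 0.92]
    # "Balanced" is interpreted here as the smallest gap between
    # sensitivity and specificity.
    return min(eligible, key=lambda c: abs(c["sensitivity"] - c["specificity"]))
```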

**Table 5.** Classification results on the validation dataset, across all four folds (F). AUC: area under the curve.


To compare the performance of HSI against RGB imagery, we classified synthetic RGB images using the same CNN. These RGB images were extracted from the HS data: each color channel was generated by weighting the spectral information to match the spectral response of the human eye [30]. After separately training the CNN with RGB patches, the models selected after validation were also competitive. Nevertheless, the validation performance with HSI data was more accurate in every fold and presented more balanced sensitivity and specificity values (Table 5).
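A minimal sketch of how such synthetic RGB images might be derived from an HS cube is given below. The Gaussian response curves, their center wavelengths, and their bandwidth are illustrative assumptions; the exact spectral-response model used in the paper is the one defined in [30]:

```python
import numpy as np

def hs_to_rgb(cube: np.ndarray, wavelengths: np.ndarray) -> np.ndarray:
    """Collapse an HS cube of shape (H, W, bands) into a synthetic RGB image.

    Each channel is a weighted average over the bands, with weights given by
    Gaussian approximations of the eye's R/G/B spectral response centered
    near 600, 550, and 450 nm (illustrative values).
    """
    centers = {"R": 600.0, "G": 550.0, "B": 450.0}
    sigma = 30.0  # assumed bandwidth of each response curve, in nm
    channels = []
    for c in ("R", "G", "B"):
        w = np.exp(-0.5 * ((wavelengths - centers[c]) / sigma) ** 2)
        w /= w.sum()  # normalize so each channel is an average, not a sum
        channels.append(np.tensordot(cube, w, axes=([2], [0])))
    rgb = np.stack(channels, axis=-1)
    # Rescale to [0, 1] for display or CNN input.
    return (rgb - rgb.min()) / (rgb.max() - rgb.min() + 1e-12)
```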

## *3.2. Test Results*

After model selection in the validation phase, we applied the selected models to the independent patients of the test set. These results are shown in Table 6. Some results show good discrimination between non-tumor and tumor tissues, i.e., patients P1, P3, and P8: for these patients, the AUC, sensitivity, and specificity are comparable to the values obtained during validation. Tumor detection in patients P9 to P13 was also highly accurate. However, for some patients the classification performance was poor. Although the sensitivity is high for patients P2 and P5, the specificity is low, which indicates a problem classifying non-tumor patches. Some patients also show poor accuracy, namely P4 and P7, whose results are only slightly better than random guessing. Finally, the results obtained for patient P6 are suspicious, being substantially worse than random guessing.


**Table 6.** Initial classification results on the test dataset.

Since model selection in the validation phase was performed using independent patients, such inaccuracies on the test data were unexpected. To determine the reasons for the misclassifications, we used the CNN models to generate heat maps for all the patients and examined them carefully. This analysis revealed that some HS images presented problems that degraded the results. As mentioned before, we had carefully inspected the tumor HS images in the validation set; however, after inspecting the test outcomes, we discovered that there were also problems in non-tumor samples. There were four main sources of error in the images: (1) some HS images were contaminated with the ink used by pathologists to delimit the diagnosed regions (*n* = 15); (2) some images were out of focus (*n* = 13); (3) some samples presented artifacts from histopathological processing (*n* = 2); and (4) other images were composed mainly of red blood cells (*n* = 2). Examples of these images can be observed in Figure 5.
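As an illustration, the patch-wise heat maps used for this inspection could be generated along the following lines. This is a sketch assuming a Keras-style classifier whose `predict` method maps a batch of patches to tumor probabilities; the patch size and stride are illustrative, not the values actually used:

```python
import numpy as np

def tumor_heat_map(hs_image, model, patch=64, stride=32):
    """Slide the patch classifier over a whole HS image and collect the
    predicted tumor probability at each position (a coarse heat map).
    """
    h, w, _ = hs_image.shape
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    heat = np.zeros((len(rows), len(cols)))
    for i, r in enumerate(rows):
        # Classify one row of patches at a time to keep memory bounded.
        batch = np.stack([hs_image[r:r + patch, c:c + patch] for c in cols])
        heat[i] = model.predict(batch, verbose=0).ravel()
    return heat  # visualize with, e.g., matplotlib's imshow
```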

**Figure 5.** Example of image defects detected in the test dataset. (**a**) Ink contamination; (**b**) unfocused images; (**c**) artifacts in the specimens; (**d**) samples mainly composed of red blood cells.

Furthermore, due to the suspicious results obtained for patient P6, the specimen was re-examined by a pathologist to reassess the initial diagnosis. After this examination, the pathologist identified a problem with the selection of the non-tumor ROIs used in the HS acquisition. Figure 6a shows the initial evaluation of the sample, where the tumor area was annotated with a red marker contour and the rest of the sample was considered non-tumor. Figure 6b corresponds to the second evaluation of the sample. The original tumor annotation was technically correct, but the yellow markers indicate the location of the highly invasive malignant tumor, i.e., GB. Although the other annotated areas do correspond to tumor, their cells are atypical and cannot be considered high-grade GB. In both Figure 6a,b, the ROIs selected for HS acquisition are highlighted with squared boxes, where red and blue boxes indicate tumor and non-tumor ROIs, respectively. As can be observed in Figure 6b, the non-tumor areas selected for our experiments were located too close to areas where the infiltrating GB was identified; thus, they contain extensive lymphocytic infiltration and cannot be considered strictly non-tumor samples. Furthermore, the GB of this patient was found to be atypical, presenting low cellular density in the tumor areas. Finally, the ROI selected from the tumor area was located where the diagnosis is tumor but cannot be considered high-grade glioma, i.e., GB. These reasons explain the seemingly inaccurate classification results; nevertheless, these poor results helped us find an abnormality in the sample.

**Figure 6.** Evaluation assessment for the samples of Patient P6. Red pen markers indicate the initial evaluation of tumor regions. Regions without pen contour were considered as non-tumor. Red squares indicate the ROIs of tumor samples. Blue squares indicate the ROIs of non-tumor samples. (**a**) Initial evaluation of the sample; (**b**) second evaluation of the sample, where a yellow marker is used for the updated tumor areas; (**c**) example of HSI from tumor ROI; (**d**) example of HSI from non-tumor ROI.

To quantify the influence of including incorrect HS images in the classification, we re-evaluated the classifiers after excluding the corrupted HS images from the dataset. These HS images were removed from the test set only; the CNN was not retrained, to avoid introducing bias into our experiments. The results are shown in Table 7. Patients for whom data exclusion was performed are indicated with an asterisk (\*), and the results of patient P6 were removed for the diagnostic reasons explained above. After data exclusion, the classification results improved significantly for patients P2 and P7, while the results of the other patients remained essentially constant. This data removal also improves the overall metrics across patients, due both to the improved classification for some patients and to the removal of patient P6 for justifiable clinical reasons.
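This re-evaluation protocol amounts to filtering the stored test predictions and recomputing the metrics, roughly as sketched below. Function and variable names are illustrative: per-patch labels and scores are assumed available, together with a boolean mask marking patches from non-flagged images:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def patient_metrics(y_true, y_score, keep_mask, threshold=0.5):
    """Recompute per-patient metrics after excluding flagged HS images.

    The trained model is untouched; only its test predictions are
    filtered, mirroring the exclusion procedure described above.
    """
    y_true = np.asarray(y_true)[keep_mask]
    y_score = np.asarray(y_score)[keep_mask]
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        # AUC is undefined for patients with a single class (e.g., tumor-only).
        "auc": roc_auc_score(y_true, y_score) if len(set(y_true)) > 1 else float("nan"),
    }
```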


**Table 7.** Final classification results on the test set after excluding incorrect HS images.

\* Data exclusion; † Data removed.

Regarding the classification performance of HSI compared to RGB, the results suggest the superiority of HSI (see Tables 6 and 7). The average metrics over the whole dataset are worse for RGB images, especially in terms of specificity and sensitivity. We consider classification performance good when all the metrics are high, with balanced specificity and sensitivity. For example, P4 presents a better AUC for RGB but very poor specificity (7%); for this reason, the HSI classification performs better for this patient. Only for P2, P3, and P8 is the performance of RGB approximately equivalent to HSI, and P11 is the only patient where RGB substantially outperforms HSI. For the patients with the most promising performance (e.g., P1, P2, P3, and P8), RGB classification is also accurate; however, the sensitivity and specificity are not as balanced as with HSI. Furthermore, the standard deviations of specificity and sensitivity are higher for RGB classification, which shows a wider spread of the classification results compared to HSI. The performance drop of RGB relative to HSI is most evident for patients with only tumor samples, where HSI classification was shown to be highly accurate (e.g., P9, P10, P12, and P13). Finally, for patients where HSI classification was poor (e.g., P4, P5, and P7), HSI performance is still more competitive than its RGB counterpart. On average, classification accuracy improves by 5% when using HSI instead of RGB imaging; in particular, sensitivity and specificity increase by 7% and 9%, respectively (Table 7).
