*3.4. Comparison of Stepwise Discriminant Analysis (SDA) with non-SDA Feature Selectors*

Stepwise discriminant analysis is a filter method that selects a subset of features by attempting to minimise within-class variation while simultaneously maximising between-class variation [90]. Although a number of metrics are available to determine class separability, Wilks' lambda is by far the most frequently used criterion for entering and removing variables from the selection in a stepwise manner. Some studies reported Wilks' lambda approaching zero and becoming asymptotic, indicating near-perfect separation of classes [48]. Features selected after this point can be safely removed from the model, as they will not substantially increase classification accuracy. This normally resulted in the selection of 10–20 wavebands [5,38,47,48,51].
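The entry step of this procedure can be sketched as follows: a minimal, illustrative forward-selection loop using Wilks' lambda, computed as the ratio of the determinants of the pooled within-class and total scatter matrices. The function names and the stopping tolerance are hypothetical, not taken from the reviewed studies, and a full SDA implementation would also test previously entered variables for removal, which is omitted here.

```python
import numpy as np

def wilks_lambda(X, y):
    """Wilks' lambda = det(W) / det(T): within-class over total scatter."""
    Xc = X - X.mean(axis=0)
    T = Xc.T @ Xc                       # total scatter matrix
    W = np.zeros_like(T)
    for c in np.unique(y):
        Xg = X[y == c] - X[y == c].mean(axis=0)
        W += Xg.T @ Xg                  # pooled within-class scatter
    return np.linalg.det(W) / np.linalg.det(T)

def forward_sda(X, y, max_features=10, tol=1e-3):
    """Greedy forward selection: add the band that most reduces lambda,
    stopping once lambda becomes asymptotic (improvement below tol)."""
    selected, remaining = [], list(range(X.shape[1]))
    last = 1.0
    while remaining and len(selected) < max_features:
        lam, best = min(
            (wilks_lambda(X[:, selected + [j]], y), j) for j in remaining
        )
        if last - lam < tol:            # no substantive gain in separability
            break
        selected.append(best)
        remaining.remove(best)
        last = lam
    return selected
```

A lambda near zero indicates almost all variation is between classes; bands entered after the curve flattens add little to classification accuracy, matching the stopping behaviour described above.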

SDA generally selects wavebands more uniformly across the spectrum than other methods, though the greatest number of selected bands is still found in the VIS (Figure 4). The most significant difference in selection rates is the increased importance of the NIR beyond the red edge. The NIR shows substantial selection under SDA in all cases except a first-derivative dataset from [51] and the dataset of [38], with the author of the latter suggesting that high intraspecific variance, caused by differences in leaf maturity, explains why no bands were selected in this region.

Comparing the selection rates of SDA studies with those of non-SDA studies reveals a clear difference in the selection of NIR bands. As with the difference between canopy and leaf scale spectra, the increased selection is focused around the NIR water absorption features (Figure 4). Additionally, in the VIS, there is significantly higher selection of the blue, green, and red regions in SDA studies. To determine whether the spectral acquisition scale or the feature selection technique had a greater influence on band selection, the selection rates were further subset into canopy studies using SDA and non-SDA feature selection, and leaf scale studies using SDA and non-SDA selection (Figure 5). It is apparent that the feature selection method has the greater impact on band selection rates, with SDA selecting from the NIR at far greater rates than the non-SDA methods in both canopy and leaf scale studies. The non-SDA methods demonstrated minimal selection in the NIR beyond the red edge for leaf scale spectra, with only a slight increase in selection for canopy spectra focused around the water absorption wavelengths from 1150–1250 nm. The studies that did select from the NIR with leaf scale samples via non-SDA methods stated that the selected bands represented differences in internal reflectance for leaf scale spectra [50]. The blue and red shifts around the green peak for canopy and leaf scale spectra are still evident once the data have been subset into SDA/non-SDA, although it becomes apparent that the high rates of selection in many parts of the VIS are driven by the SDA studies. However, the use of SDA does not explain the selection rates in the VIS for the reduced spectral domain VIS/NIR studies, as only a single such study used SDA for feature selection, perhaps indicating an alternate driving force. The red edge demonstrates its robustness to variations in measurement scale and band selection technique, as it was frequently selected in all study subsets, although slightly less frequently for leaf scale spectra with non-SDA feature selection.

**Figure 4.** Feature selection rates for 350–2500 nm studies that used SDA feature selection, and the selection rate of all other feature selection methods combined.

**Figure 5.** Feature selection rates for 350–2500 nm studies that used SDA feature selection subset by canopy and leaf scale spectra, and the selection rate of all other feature selection methods combined.

According to [91], "Stepwise analytic methods may be among the most popular research practices employed in both substantive and validity research". Despite this statement being made in the late 1980s, the use of SDA in approximately a third of the studies included in this review demonstrates its continued popularity, being by far the most used method encountered. However, the widespread use of stepwise methods has prompted strong arguments against their use [90,92–94], particularly in predictive discriminant analysis applications such as feature selection for classification [95]. The studies that utilised SDA in this review made no mention of these criticisms and therefore made no direct attempt to mitigate them. Despite this, [25] did validate their model with 20 repetitions of 1000 random samples, with the final feature subset being based on the selection rates of features across the repetitions, the consideration of important features identified in the literature by [6] and [47], as well as the results from principal component analysis (PCA). PCA is a mathematical transformation used to produce uncorrelated features from the spectral features, reducing dimensionality whilst retaining the most informative spectral data. Additionally, [47] and [5] included SDA as part of an ensemble of feature selection methods, again determining the final feature subset based on the selection rates of features across all methods within the ensemble. Although one of these ensemble methods (Lambda–Lambda plots) allows for the identification and removal of correlated features, in both cases it was run in parallel to SDA, with the removal of correlated features occurring after features had been selected. The remaining studies reported no efforts to mitigate the concerns of using SDA for feature selection [38,41–43,48,51].
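The PCA transformation described above can be sketched in a few lines, assuming spectra are stored as a samples × wavebands matrix (names here are illustrative, not from the reviewed studies): mean-centred spectra are projected onto the leading principal components via the SVD, yielding uncorrelated features and the fraction of variance each retains.

```python
import numpy as np

def pca_transform(X, n_components=3):
    """Project spectra onto the top principal components.
    Rows of X are samples, columns are wavebands; the returned
    component scores are mutually uncorrelated."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data gives the covariance eigenvectors (rows of Vt)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T        # uncorrelated component scores
    explained = (s ** 2) / (s ** 2).sum()    # variance fraction per component
    return scores, explained[:n_components]
```

Keeping only the components with the largest explained variance reduces dimensionality while retaining most of the informative spectral variation, which is the role PCA played in the study discussed above.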

It must be acknowledged that subsetting the reviewed studies into canopy and leaf scale, and then into SDA and non-SDA, meant each class was represented by only a small number of studies (~8 per class); the leaf-SDA class was represented by only five studies extracted from two papers. As a result, a few outliers are evident, such as the 100% selection of the 1700–1749 nm bin and the 100% selection of the 500–549 nm bin, both associated with the low leaf-SDA sample size. Additionally, the comparison of SDA to non-SDA may disguise selection biases of the individual non-SDA methods, as each is often represented by only one or two studies, with any bias it may exhibit being masked by the selection rates of the other methods.

#### **4. Study Design Influence**

All aspects of a study design influence waveband selection. However, many of these aspects, such as target classes, number of samples, and collection method, may be outside the researcher's control or heavily constrained, whereas the researcher often has control over data pre-processing, feature selection, and classification methods. Because of this, and given the apparent influence of feature selectors described above, we focus on how the choice of feature selection method affects waveband selection.

To ascertain any influence feature selection may have over waveband selection, some of the most common feature selection methods were applied to a synthesised dataset. A key requirement for these experiments is a dataset with many species, each with a large number of samples, something generally lacking in vegetation hyperspectral data. To accomplish this, a hyperspectral synthesis method was created [20] to allow the creation of any number of samples from 22 species of New Zealand plants. The synthesised dataset consisted of 500 samples per class with 540 wavebands from 350–2450 nm at 3 nm bandwidths, excluding regions of high noise.

Hyperparameters for the feature selectors were tuned via a holdout dataset, with the parameters whose selected features resulted in the highest classification accuracy being used for all experiments (Table 3). This is crucial to ensure that the only variables that could affect waveband selection were constrained to either the feature selector (svm\_\*, sda\_\*, sffs\_\*, rf\_\*) or the dataset (\*\_0 ... \*\_9).
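A minimal sketch of this holdout-tuning protocol, with a hypothetical selector parameterised by the number of bands to keep and a nearest-centroid classifier standing in for the actual classifiers (all names are illustrative, not taken from the study):

```python
import numpy as np

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    """Accuracy of a nearest-centroid classifier (a simple stand-in model)."""
    classes = np.unique(ytr)
    centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = ((Xte[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    return (classes[d.argmin(axis=1)] == yte).mean()

def topk_by_class_mean_spread(X, y, k):
    """Toy selector: keep the k bands whose class means differ the most."""
    means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
    return np.argsort(means.var(axis=0))[::-1][:k]

def tune_on_holdout(X, y, param_grid, select_fn, holdout_frac=0.3, seed=0):
    """Return the hyperparameter whose selected bands score best on a holdout."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_te = int(len(y) * holdout_frac)
    te, tr = idx[:n_te], idx[n_te:]
    best_p, best_acc = None, -1.0
    for p in param_grid:
        bands = select_fn(X[tr], y[tr], p)        # select on training portion
        acc = nearest_centroid_acc(X[tr][:, bands], y[tr],
                                   X[te][:, bands], y[te])
        if acc > best_acc:                        # keep best holdout accuracy
            best_p, best_acc = p, acc
    return best_p, best_acc
```

The studies reviewed tuned the actual SVM, RF, SFFS, and SDA hyperparameters (Table 3); the point of the sketch is only the protocol: select and fit on the training portion, score on the untouched holdout, and fix the best-scoring setting for all subsequent experiments.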


**Table 3.** Software packages and hyperparameters for each feature selection method.

Three experiments were devised. First, each feature selection method was performed on the same dataset, cross-validated 10 times (e.g., rf\_0, rf\_1 ... rf\_9), selecting the top 30 discriminative wavebands, thus revealing any possible biases in waveband selection resulting from the choice of feature selection method (Figures 6 and 7). Secondly, feature selection was performed on datasets consisting of different classes and samples to simulate many different studies, giving an indication of whether attributes of the samples affect the wavebands being selected, which would impact generalisability and transferability. Variants of this experiment were performed in which the classes and the number of samples remained the same, though the actual samples were randomly selected; an additional variant used the same classes but differing numbers of samples. Results for these variants did not significantly differ and are therefore not shown here.
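The bookkeeping for these experiments can be outlined as follows, with `make_dataset` and the selector functions as placeholders for the actual synthesis method [20] and the four tuned selectors: each repetition's top-30 bands are binned at 50 nm and accumulated into a per-selector histogram, the form of result the figures below visualise.

```python
import numpy as np

def selection_histogram(selected_nm, lo=350, hi=2450, bin_width=50):
    """Bin selected waveband centres (nm) into fixed-width bins."""
    edges = np.arange(lo, hi + bin_width, bin_width)
    counts, _ = np.histogram(selected_nm, bins=edges)
    return edges[:-1], counts

def run_selection_experiment(selectors, make_dataset, wavelengths_nm,
                             n_reps=10, top_k=30):
    """Accumulate, per selector, how often each 50 nm bin enters the top-k."""
    hists = {}
    for name, select_fn in selectors.items():
        total = None
        for rep in range(n_reps):               # e.g. rf_0 ... rf_9
            X, y = make_dataset(rep)            # fresh samples per repetition
            idx = select_fn(X, y, top_k)        # indices of the top-k bands
            _, counts = selection_histogram(wavelengths_nm[idx])
            total = counts if total is None else total + counts
        hists[name] = total
    return hists
```

Comparing the resulting histograms across selectors on the same data isolates selector-driven bias (the first experiment), while repeating with different class and sample draws isolates dataset-driven variation (the second).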

Each dataset produced significantly different waveband selections. This is especially evident in Figure 6b, where the histogram is ordered by feature selector, placing each repetition with a new dataset next to the others. Here, it is clear that RF favours the red edge and NIR bands, essentially ignoring the SWIR. SFFS demonstrated higher selection in the VIS, especially at shorter wavelengths, minimal selection in the NIR, and moderate selection in the late SWIR. SDA and SVM are the most similar, as both select broadly and relatively evenly along the entire spectrum. Dimensionality reduction techniques offer a way to visualise the relationships between selection histograms (Figure 7). Due to their broad general selection, SDA and SVM are grouped close to each other, with SFFS and RF adjacent though separate. Further, the histograms are clearly grouped by feature selection method rather than by dataset, indicating that the feature selection method is a dominant factor affecting the selection of wavebands.
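This comparison can be sketched as follows (illustrative names; the actual figure also uses t-SNE and UMAP, whereas only a plain PCA projection is shown here): each selector run's histogram becomes a row vector, the rows are embedded in 2-D, and grouping by method can be checked by comparing mean within-method against between-method distances.

```python
import numpy as np

def embed_histograms(hists):
    """2-D PCA embedding of selection histograms (one row per selector run)."""
    H = np.asarray(hists, dtype=float)
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:2].T               # coordinates for a Figure 7-style scatter

def grouped_by_method(emb, labels):
    """True if same-method runs sit closer together than different-method runs."""
    labels = np.asarray(labels)
    d = np.linalg.norm(emb[:, None] - emb[None], axis=-1)
    eye = np.eye(len(labels), dtype=bool)
    same = (labels[:, None] == labels[None]) & ~eye
    diff = labels[:, None] != labels[None]
    return d[same].mean() < d[diff].mean()
```

When the embedded points cluster by selector rather than by dataset, as in Figure 7, the within-method distances are the smaller ones, supporting the conclusion that the selector is the dominant factor.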


**Figure 6.** (**a**): Histogram of band feature selection binned at 50 nm, ordered by dataset. Four feature selectors were run with 10 cross-validations (a new dataset consisting of 10 classes and 200 samples for each cross-validation). (**b**): Results of (**a**) ordered by feature selection method. (RF = random forest, SDA = stepwise discriminant analysis, SFFS = sequential floating feature selection, SVM = support vector machine).

**Figure 7.** (**a**) PCA dimensional reduction of histogram waveband feature selection. (**b**) t-distributed Stochastic Neighbor Embedding (t-SNE) dimensional reduction of histogram waveband feature selection. (**c**) Uniform Manifold Approximation and Projection (UMAP) dimensional reduction of histogram waveband feature selection.
