1. Introduction
Hyperspectral anomaly detection (HAD) is a key area of research in remote sensing with a history spanning multiple decades [1]. Over this time span, many algorithms have emerged, each designed to address challenges inherent in identifying anomalies within hyperspectral images. This undertaking is far from simple, given the dynamic nature of the datasets, which profoundly influence the optimal modeling of hyperspectral data. Factors such as the quantity, range in size, and spatial distribution of targets wield significant influence over complexity. Furthermore, nuances like spatial resolution, spectral band count, number of anomalous spectral signatures, and external factors such as time of day or instrument calibration exert additional influence on both target and background complexity. The multifaceted nature of hyperspectral data underscores the array of challenges encountered, driving the continued proliferation of HAD algorithms.
While many algorithms have been developed to address the unique challenges posed by different datasets, there is a growing recognition of the need for algorithms that exhibit consistent performance across a wide range of instances, i.e., adaptable algorithms [2]. Impartial studies, which have evaluated large numbers of algorithms across diverse sets of hyperspectral data, reveal that no single algorithm typically outperforms others consistently in identifying anomalies across a significant number of datasets. For instance, in a comprehensive study [3], none of the 12 assessed diverse HAD algorithms demonstrated superior anomaly detection performance across multiple datasets. Moreover, in all cases except one, the optimal performance varied depending on the metric used. Similarly, Ref. [4] compared 22 HAD algorithms and found that while one algorithm showed superior performance on six out of 17 datasets, double that of the next best-performing algorithm, a different algorithm exhibited the best mean and median performance across the datasets, highlighting a dichotomy between performance and generalizability in HAD algorithms. These findings underscore the persistent challenge of accurately modeling the diverse distributions present in hyperspectral data, where algorithms tend to excel on specific datasets or demonstrate good generalization across multiple datasets but not both.
The approaches from which the above-mentioned algorithms originate provide some context for this problem. Surveys of HAD algorithms consistently highlight that most of the algorithms stem from what can be collectively identified as statistical, machine learning, linear algebra, clustering, or reconstruction-based methods [2,3,5,6]. As such, many of these algorithms share biases (e.g., Gaussian, linear, hyperspherical) about the nature of the data. However, due to the complex nature of hyperspectral data, biases that define one dataset well are unlikely to equally define another dataset, and biases that would work well on either image are likely to compromise performance on both, hence identifying a potential root of the dichotomous performance between algorithms. This implies that the complexity of the biases needs to align with the complexity of the data to prevent underfitting or overfitting. Hence, any one bias is unlikely to perform well universally across a sufficiently diverse set of datasets.
A potential research direction for addressing the aforementioned issues, namely evaluating the impact of different biases on HAD performance, is ensemble learning. Ensemble learning provides a means by which to employ multiple biases at once to enhance the benefits and attenuate the limitations of each. Despite the development of some forms of ensemble learning for HAD, only one study [7] incorporates more than a singular type of bias. Notably absent from existing research are examinations of prominent ensemble learning techniques like stacking and soft voting-based methods within the context of HAD. Consequently, a systematic evaluation of major ensemble learning methods applicable to HAD remains absent from the current literature. Given the potential implications of biases on the adaptability of algorithms for HAD, this represents an important area warranting further investigation.
To this end, we propose stacking-based ensemble learning for hyperspectral anomaly detection (SELHAD). SELHAD introduces the integration of HAD algorithms with diverse biases (e.g., Gaussian, density, partition) into a singular ensemble learning model and learns the factor to which each bias should contribute so that anomaly detection performance is optimized. In this paper, this methodology is rigorously assessed across 14 distinct datasets, contrasting its efficacy against alternative ensemble learning approaches such as soft voting, hard voting, joint probability, and bagging-based [7,8] methods in the context of HAD. This comprehensive evaluation furnishes a systematic comparison of SELHAD’s performance against both individual HAD algorithms and other prominent ensemble learning techniques applicable to HAD scenarios.
The primary contribution of this work is the integration of algorithmic adaptability into HAD. The proposed algorithm, SELHAD, learns the unique magnitude of contribution each base learner should optimally apply toward scoring spectral signatures in each dataset. This adaptability allows for the dynamic utilization of multiple diverse algorithms simultaneously, which is novel among existing HAD algorithms. SELHAD amplifies the strengths and mitigates the weaknesses of existing HAD algorithms. This addresses the dichotomy between performance and generalizability in existing algorithms, as SELHAD enhances both through the integration strategies it dynamically learns.
The rest of this paper is organized as follows.
Section 2 introduces background information and related works.
Section 3 lays out the architecture for the methods utilized in this paper.
Section 4 provides details for the experimental design.
Section 5 presents the results obtained from the experiments.
Section 6 provides analyses and explanations of the results obtained by the experiments.
Section 7 closes the paper with a summary and concluding remarks about the study.
4. Experiments
4.1. Datasets
The experiments in this study use datasets composed of hyperspectral images that encompass a wide array of characteristics aimed at thoroughly assessing the algorithmic performance and generalizability of SELHAD [Table 1]. These datasets exhibit considerable diversity, both in terms of image content and attributes, ensuring the introduction of multiple sources of heterogeneity. Such diversity is important in evaluating the generalizability of models, mitigating the risk of any single algorithm gaining undue advantage through the exploitation of incidental correlations that align with any one model’s biases. Uniformity across dataset contents or characteristics could inadvertently favor certain algorithms, rendering evaluations skewed.
The deliberate variation in dataset characteristics—including spatial resolutions, spectral band counts, target numbers, the mean and standard deviation of target pixels, and the percentage of anomalous pixels—plays an important role in shaping algorithmic robustness. Anomalies themselves present in varying sizes, shapes, locations, and quantities across datasets. Some datasets feature moderately sized singular anomalies, while others host as many as 60 targets with sizes between 1 and 50 pixels. Fluctuations in spatial resolutions and spectral bands additionally introduce disparities in the representation of similar objects and impact pixel counts and spectral signature lengths. These disparities present unique challenges for anomaly detection algorithms, as they must contend with distinct characteristics and quantities of anomalous pixels in each hyperspectral image, posing obstacles to achieving consistent performance across all datasets.
The content of the datasets—encompassing diverse scenes, backgrounds, and anomalies—further influences algorithmic robustness. The datasets include urban, beach, and airport-based scenes that are drawn from the HYDICE Urban, ABU-Airport, ABU-Beach, and ABU-Urban dataset repositories [28,29]. Therefore, backgrounds include airport terminals, airport runways, large bodies of water, coastlines, residential areas, and industrial areas, to name a few. Target objects span planes, cars, boats, and buildings, contingent upon the specific scene and background of a given dataset. These inherent disparities induce shifts in spectral signature distributions, rendering each dataset unique in its spectral characteristics.
4.2. Experimental Settings
The experiments have two crucial components: the sets of augmentations, denoted as A (Equation (2)), and the base learner hyperparameters, denoted as G (Equation (3)). These elements are foundational, as they yield multiple similarly performing yet subtly perturbed ρ outcomes for each base learner, thereby facilitating the implementation of common image and parameter augmentation practices prevalent in computer vision. The image augmentation strategies encompass both rotation-based and flipping-based operations. Rotation operations involve rotating the dataset by 0, 90, 180, or 270 degrees around its center, while flipping-based operations entail flipping the dataset about its vertical axis. Collectively, these operations generate eight distinct sets of augmentations to be applied across the datasets. For hyperparameters, the approach varies based on the underlying base model. Principal Component Analysis (PCA) is employed in the Gaussian-, hypersphere-, and density-biased base learners, wherein 15 to 20, 6 to 11, and 16 to 21 principal components, respectively, are selected from the original data. In contrast, for the partition- and reconstruction-biased base learners, algorithmic stochasticity induces slight variations in outcomes. This occurs through random variable selection in the partition-biased algorithm and random initialization of model weights in the reconstruction-biased algorithm. The combined application of augmentation techniques and hyperparameters yields multiple subtly perturbed ρ outcomes (Equation (5)) for each dataset, simulating bootstrapping strategies.
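As a concrete illustration, the eight augmentations described above (four rotations, each with and without a flip about the vertical axis) can be generated with a few lines of NumPy. This is a minimal sketch rather than the paper's implementation; the (H, W, B) cube layout and the function name `augmentations` are assumptions.

```python
import numpy as np

def augmentations(cube):
    """Generate the eight augmented views of a hyperspectral cube.

    `cube` is assumed to be an (H, W, B) array of H x W pixels with B
    spectral bands. Each of the four rotations (0, 90, 180, 270 degrees)
    is paired with an optional flip about the vertical axis.
    """
    views = []
    for k in range(4):  # number of 90-degree rotations to apply
        rotated = np.rot90(cube, k=k, axes=(0, 1))
        views.append(rotated)                   # rotation only
        views.append(np.flip(rotated, axis=1))  # rotation + vertical flip
    return views

views = augmentations(np.zeros((20, 30, 50)))  # 8 views of a toy cube
```

In a full pipeline, each view would be scored by a base learner, and the resulting anomaly maps would presumably need to be mapped back to the original orientation before aggregation.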
For the proposed ensemble learning algorithm (SELHAD), there are two important components to establish: the threshold for loss convergence (Equation (9)) and the update factor λ (Equation (10)). Given that all inputs and outcomes for the algorithm fall within the zero to one range, it follows that a small loss convergence threshold is needed. In this study, a threshold of 1 × 10⁻¹⁰ was employed to ensure peak loss convergence was reached, aligning with the demands of precision inherent in HAD. Concurrently, an update factor of λ = 0.1 was applied in this work. This provided the optimal influence of the gradient of the loss upon β (Equation (10)) to facilitate model adaptation and stability. Larger and smaller values of λ were evaluated but consistently caused divergence or overly slow convergence across all datasets. Together, the loss convergence threshold and update factor provide the key criteria needed to maximize performance and generalizability.
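The role of the two settings can be sketched as a plain gradient descent loop over β. The squared-error loss, the array shapes, and every variable value below are illustrative assumptions; the actual loss is defined by Equation (9) in Section 3.

```python
import numpy as np

# Hypothetical stand-ins: rho holds base learner anomaly probabilities
# (n_learners x n_signatures) and y is a 0/1 ground-truth vector.
rng = np.random.default_rng(0)
rho = rng.random((5, 100))
y = np.zeros(100)
y[::20] = 1.0                # five "anomalous" signatures

beta = np.zeros(5)           # contribution factors to learn
lam = 0.1                    # update factor lambda (Equation (10))
threshold = 1e-10            # loss convergence threshold (Equation (9))

prev_loss = np.inf
for _ in range(100_000):
    pred = beta @ rho                       # weighted combination of learners
    loss = np.mean((pred - y) ** 2)         # assumed squared-error loss
    if abs(prev_loss - loss) < threshold:   # stop once loss has converged
        break
    grad = 2.0 * rho @ (pred - y) / y.size  # gradient of the loss w.r.t. beta
    beta -= lam * grad                      # update scaled by lambda
    prev_loss = loss
```

With λ too large, the iterates oscillate or diverge; too small, and the 1 × 10⁻¹⁰ threshold takes far longer to reach, mirroring the behavior reported above.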
4.3. Comparative Ensemble Learners
To evaluate the proposed algorithm, its performance is compared against the performances of four prominent ensemble learning methods. The bagging [8] and hard voting [7] algorithms from Section 2.2 are included for comparison. Soft voting and joint probability algorithms are also implemented to cover the major ensemble learning frameworks that can be used in HAD. The soft voting algorithm computes the average of the standardized predictions the base learners assign to each spectral signature. The joint probability algorithm computes the product of the standardized predictions the base learners assign to each spectral signature. This comprehensive approach allows us to ascertain the optimal performance of various ensemble learning techniques and to gauge the frequency with which ensemble methods enhance anomaly detection beyond individual HAD algorithms. Like our proposed algorithm, these additional ensemble methods leverage HAD algorithms as their base learners and derive their results from the anomaly probabilities generated by these base learners.
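The soft voting and joint probability rules just described reduce to an elementwise mean and product over the base learners' standardized predictions. A minimal sketch, assuming an (n_learners, n_signatures) array of probabilities:

```python
import numpy as np

def soft_vote(probs):
    """Average of the standardized base learner predictions per signature."""
    return np.mean(probs, axis=0)

def joint_probability(probs):
    """Product of the standardized base learner predictions per signature."""
    return np.prod(probs, axis=0)

probs = np.array([[0.9, 0.1, 0.5],   # base learner 1
                  [0.8, 0.2, 0.5]])  # base learner 2
soft_vote(probs)          # → [0.85, 0.15, 0.5]
joint_probability(probs)  # → [0.72, 0.02, 0.25]
```

Note that the product rule penalizes disagreement between learners far more sharply than the mean does, which foreshadows the sparsity of the joint probability heatmaps discussed in the results.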
4.4. Base Learners
In this work, all ensemble learning methods operate by utilizing the predictive power of multiple HAD algorithms in concert. These HAD algorithms constitute the base learners for the ensemble learning methods. By amalgamating the outputs of diverse models, ensemble learning can enhance the strengths and mitigate the limitations of individual algorithms. This approach effectively addresses the varying degrees of bias complexity that can afflict standalone models, thereby bolstering overall performance and generalizability. The selection of a heterogeneous set of HAD algorithms that cover a spectrum of biases is critical to ensure that we extract as meaningful and broad of representation of the data as possible. This is the mechanism through which, ultimately, the performance and generalizability of individual HAD algorithms are improved.
To this end, we assembled a collection of HAD algorithms, each embodying one of the five bias types delineated in the background section and employing a unique implementation strategy. Algorithms were included to represent the reconstruction [30], density [31], partitioning [18], hypersphere [32], and Gaussian [10] biases. The selected base algorithms will be referred to by their bias type throughout this work to reinforce the importance of their bias diversity within the SELHAD algorithm. Each algorithm assigns a score to each spectral signature within a dataset to indicate the degree to which that spectral signature deviates from the norm relative to the other spectral signatures. A higher score indicates a greater degree of deviation and a higher likelihood of being anomalous. Once all scores for a given algorithm applied to a given dataset are calculated, those scores are standardized to a range between zero and one. This is accomplished by subtracting the minimal score in the set from each score and then dividing the result by the difference between the maximal and minimal scores in the set. The outcomes of this calculation provide an anomaly probability, i.e., a probability of each spectral signature in a given dataset being anomalous.
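The min-max standardization just described can be expressed directly; the function name below is illustrative:

```python
import numpy as np

def scores_to_probabilities(scores):
    """Standardize raw anomaly scores to [0, 1], as described above:
    subtract the minimal score in the set, then divide by the max-min range."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo)

scores_to_probabilities([2.0, 4.0, 10.0])  # → [0.0, 0.25, 1.0]
```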
5. Results
Table 2 presents the AUROC scores for all the base algorithms and ensemble learning algorithms across all datasets. Base algorithms are listed on the left, while ensemble learning algorithms are listed on the right. These results highlight multiple important points regarding the performances of the base, ensemble, and SELHAD algorithms.
If the base learner scores are analyzed in isolation from the ensemble learning scores, a few considerations about their performances come into focus. The base learner with the reconstruction bias performed best on more datasets than any of the other base learners, with four top performances. However, it also claimed the lowest performance among the base learners on 7 of the 14 datasets. The hypersphere-biased base learner encountered similar issues. It had top performances on 3 of the datasets and claimed the highest AUROC scores of any of the algorithms in this study (including the ensemble learning algorithms). It additionally had the lowest performance more often than any of the other base learners except the reconstruction-biased base learner. Conversely, while the Gaussian- and partition-biased base learners had 3 and 1 top performances, respectively, their top scores were lower overall compared to the top scores of the reconstruction- and hypersphere-biased base learners. However, neither the Gaussian- nor the partition-biased base learner had any cases where it was the lowest-performing algorithm. These differences in the quantity of highest and lowest performances provide evidence supporting the dichotomy between the base algorithms.
When the ensemble learning algorithms are included in the analysis, their impact on the dichotomy becomes evident. The ensemble algorithms assume all the top performances except for the instances where the hypersphere-biased base learner was a top performer, hence demonstrating the importance of the hypersphere-biased base learner’s top scores among all the base learning algorithms. The SELHAD algorithm was the top performer on 8 of the 14 datasets across all algorithms, double the quantity of top performances of any of the base learners. More importantly, the SELHAD, joint probability, bagging, and soft vote ensemble algorithms outperformed all the base learning algorithms 9, 5, 3, and 3 times, respectively. The ensemble algorithms (with the exception of the hard voting algorithm) additionally never had the lowest scores on any of the datasets among all the algorithms. Hence, using ensemble learning, specifically the SELHAD algorithm, improved the quantity of top performances while never worsening performance in any case. This provides empirical evidence that ensembling helps alleviate the dichotomy.
Table 3 displays the mean, standard deviation, minimum, median, and maximum of the AUROC scores attained by each algorithm across all datasets. This multifaceted analysis facilitates a thorough evaluation of the algorithms’ generalizability. Algorithms that have a higher mean or median AUROC, or lower standard deviation in AUROC scores, are better generalizers. Their performance is better across a broader range of inputs. Positioned at the top are the base learners, with the ensemble learners at the bottom.
Like the results in Table 2, it is important to consider the results in Table 3 both without and with the ensemble learning algorithms. When the base learners are evaluated in isolation, it is noteworthy that the partition-biased base learner assumes the best mean and standard deviation in AUROC scores, and the Gaussian-biased base learner assumes the best median AUROC score. Comparatively, the hypersphere- and reconstruction-biased base learners had the worst mean, median, and standard deviation in AUROC scores. Therefore, the dichotomy is highlighted in Table 2 and Table 3, which directly measure performance and generalizability.
When the ensemble learning algorithms are included in the analysis, Table 3 provides further support for how ensemble learning can help mitigate the dichotomy between performance and generalizability. The SELHAD, joint probability, and bagging ensemble learning algorithms all achieve higher mean and median AUROC scores than any of the base learners. Most notably, the SELHAD algorithm had the best mean, median, and standard deviation in AUROC scores across all the algorithms. This, in addition to the top performances SELHAD achieved on eight of the 14 datasets in Table 2, provides strong empirical evidence of how SELHAD can alleviate the dichotomy between performance and generalizability over the performance of the base learners in isolation.
Figure 2 and Figure 3 visualize AUROC performance at a generalized and individual dataset level, respectively. Figure 2 presents the distribution of AUROC scores each algorithm achieved across all datasets. Algorithms are listed along the x-axis approximately in order of increasing generalizability, as determined by their median and third-quartile AUROC scores. Figure 3 displays the receiver operating characteristic (ROC) curves of the top-performing algorithms exclusively (to provide better visibility).
Figure 2 further demonstrates how ensemble learning approaches strengthen generalizability. The medians and interquartile ranges of the ensemble learning algorithms (except for the hard voting algorithm) are visually higher than those of the base learning algorithms, which agrees with the observations in Table 2 and Table 3. The results from the hard voting algorithm were expected, as they agree with the results obtained in prior studies [7]. The medians and interquartile ranges of the Gaussian- and partition-biased base learners are also greater than those of all the other base learners, which also agrees with Table 2 and Table 3, showcasing their enhanced generalizability over the other base learners. The ROC curves in Figure 3 demonstrate multiple instances where the top performance of SELHAD is visually evident in addition to the quantifiable differences evident in Table 2. The airport 1, airport 4, beach 1, beach 2, beach 3, and HYDICE datasets all display ROC curves for SELHAD with noticeably higher true positive rates and lower false positive rates at lower thresholds. Similar distinctions, albeit to a lesser extent, can be observed for the soft voting algorithm on the airport 2 and airport 3 datasets and for the hypersphere-biased base learner on the beach 4 and urban 4 datasets. All other datasets demonstrate closer performance between algorithms, as shown by the higher degree of overlap in ROC curves, which does not make any one algorithm visually distinguishable. Note that this does not mean that one algorithm is not discernibly better than the others; it simply means that the difference cannot be observed in the ROC plots. As anomalies are rare, most of the scores produced by the algorithms were high. The median top AUROC score among the base learners in Table 2 was 0.9803, so small changes in performance that cannot be observed in an ROC plot may still be substantial given the overall small margin available for improvement.
Figure 4 provides ground truth information for each dataset in the study alongside the anomaly probabilities produced by algorithms in this study. The ground truth for each dataset takes the form of a mask where white pixels indicate locations of anomalous pixels and black pixels indicate locations of background pixels. The heatmaps following the ground truth masks indicate the likelihood each algorithm assigned to each spectral signature in each dataset. Brighter colors indicate a higher predicted likelihood of a spectral signature being anomalous. Heatmaps include the hypersphere-biased and Gaussian-biased base learners alongside all the ensemble learning algorithms. The hypersphere-biased base learner is included, as it is the only base learner to outperform any of the ensemble learners. The Gaussian-biased base learner is included, as it is the best generalizer among all the base learners.
Distinctions between the algorithms can be observed across the heatmaps they produced. The hypersphere-biased base learner employs what may be considered an “all or nothing” approach compared with the other algorithms. It has the hottest and most frequent hot spots of all the algorithms. Therefore, when these predictions are correct, it excels in performance, hence its ability to outperform the ensemble learning algorithms on 3 of the datasets. However, when those predictions are incorrect, its performance suffers, which is why it had the second-highest quantity of lowest AUROC scores. Comparatively, the Gaussian-biased base learner heatmaps are defined by their low amount of noise and the low number of background spectral signatures assigned any heat compared with the other algorithms. The Gaussian-biased base learner additionally assigns lower heats in general compared with the other algorithms and is more selective in doing so. This results in the Gaussian-biased base learner having more conservative predictions that provide good broad delineation between target and background spectral signatures but miss some of the nuances that would enable it to excel on any one dataset. This provides further visual evidence as to what behaviors delineate top-performing and highly generalizing HAD algorithms.
The characteristics of the heatmaps for the comparative ensemble learning algorithms differ from those of the base algorithms, as they are composed of information aggregated across the base learners. Both the soft vote and bagging ensemble learning algorithms produce similar results. This is expected, as they represent the central tendency of the base learners and the central tendency of multiple bootstraps of the base learners, respectively. They maintain the stronger predictions made by the top-performing base learner, but these predictions are tempered by the top-generalizing algorithms, mitigating the “all or nothing” pitfalls. Small nuances of higher heat and less noise can be observed in the bagging algorithm as a result of the robustness that bootstrapping provides to that algorithm. As the hard vote algorithm relies on the majority prediction, and all the algorithms produce some noise, this noise carries over into the results for the hard vote algorithm and is detrimental except in the cases where all the algorithms tend to perform equally well on the datasets. The heatmaps for the joint probability algorithm are the sparsest, as they reflect the products of multiple fractional values; recall that all the outputs of the base learners are standardized between zero and one in this work. This makes the predictions for cases of disagreement between base learners much lower than for any other ensemble learning method, eliminating much of the noise visible in heatmaps from the base learners. This noise removal helps in many cases but causes the joint probability algorithm to assign scores that are too low to some anomalies when disagreements between base learners arise, hindering performance instead. Therefore, while these algorithms show improvement over the base algorithms, allowing all the base algorithms to increase or diminish results equally has limitations.
Larger discrepancies between base algorithms or poor performance of a base algorithm on a specific dataset can cause some base learners to have an outsized impact on the aggregated results in cases where the base learners’ results may not ultimately be as informative.
To alleviate the outsized impact of specific base learners, the β vectors learned by SELHAD can be employed.
Table 4 provides the factors learned for the β vector in Equation (10) of the SELHAD algorithm.
Table 4 demonstrates how SELHAD adapts to each dataset by applying the base learners in varying degrees to derive the anomaly probability predictions, thereby alleviating the issues with equal base learner contribution. The impact of these adaptations can be visualized in the heatmaps SELHAD produces. For most datasets (11 of 14), the Gaussian-biased learner had the highest factor and consequently had the most influence on the anomaly probability predictions produced by SELHAD. Therefore, a top-generalizing algorithm proved the most beneficial in most cases. The primary contribution of the Gaussian-biased base learner is evident in the lower degree of noise in the SELHAD heatmaps in comparison to the soft vote and bagging algorithms. The secondary contributions of the other base algorithms then act to boost the predictions of anomalous pixels above the predictions provided by the Gaussian-biased base learner, creating greater delineation between probability predictions and improving performance. This creates hotter hot spots in the heatmaps for the SELHAD algorithm than the Gaussian-biased base learner, but not so hot that it falls into the “all or nothing” pitfall, hence balancing performance and generalizability.
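The adaptive combination described above amounts to a weighted sum of the base learner probabilities under the learned β, with negative entries acting as the inverse contributions discussed below. In this sketch, the combined scores are restandardized to [0, 1], which is an assumption of the illustration, and all values shown are hypothetical.

```python
import numpy as np

def selhad_combine(probs, beta):
    """Combine base learner probabilities with learned factors beta.

    A negative factor is an inverse contribution: signatures that learner
    scored highly are attenuated in the combined prediction.
    """
    combined = beta @ probs
    lo, hi = combined.min(), combined.max()
    return (combined - lo) / (hi - lo)  # restandardize to [0, 1]

probs = np.array([[0.9, 0.2, 0.6],    # e.g. a Gaussian-biased learner
                  [0.1, 0.8, 0.6]])   # e.g. a hypersphere-biased learner
beta = np.array([0.7, -0.2])          # hypothetical learned factors
combined = selhad_combine(probs, beta)
```

Here, the negative second factor suppresses the signature that only the second learner scored highly, illustrating how an inverse contribution can attenuate spurious hot spots.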
The remaining three datasets from Table 4, where the Gaussian-biased base learner was not the top contributor for SELHAD, are unique. In the airport 1 dataset, a slightly inverse contribution of the Gaussian-biased base learner was applied. The heatmap of the airport 1 dataset for the Gaussian-biased base learner shows that it assigned higher probabilities predominantly to background spectral signatures. Therefore, by assigning an inverse contribution to the Gaussian-biased base learner in this case, the probabilities produced by the SELHAD algorithm for these background spectral signatures could be attenuated. A stronger example of this exists in the beach 2 dataset, where, by applying an inverse contribution, a roadway identified with high probabilities of being anomalous could be completely removed from the resulting SELHAD probability predictions. Lastly, in the urban 5 dataset, the primary contributor was the hypersphere-biased base learner, with the Gaussian-biased base learner contributing closely behind. As the hypersphere-biased base learner performed better on this dataset, the Gaussian-biased base learner probability predictions could instead be used to attenuate some of the more aggressive scores observed in the hypersphere-biased base learner and improve performance in a similar fashion to the soft vote and bagging ensemble algorithms.
6. Discussion
The results discussed in the previous section can be used cumulatively to provide empirical evidence of the dichotomy in performance introduced in this study between HAD algorithm biases. While some biases produced more top results [Table 2] and top results that outperformed even the ensemble learning algorithms, they were not without their limitations. These top individual performances came at the cost of an even greater number of bottom performances [Table 2]. They produced probability predictions that were higher more often than those of the other algorithms, enabling them to excel when they were correct but struggle when they were not, creating “all or nothing” scenarios. The overarching implications of this instability could additionally be seen in how it impacted the generalizability [Table 3] of these top-performing algorithms. They had lower mean and median AUROCs and greater variance in AUROCs, showcasing their overall volatility. The volatility and “all or nothing” trends were visually apparent in the exceptionally high heats the algorithms produced in their heatmaps [Figure 4], especially in the high heats assigned to background spectral signatures.
Conversely, the other biases produced results that enabled them to perform in a more stable fashion, generalizing well across a diverse range of datasets [Table 3, Figure 2]. A hallmark of this was the lower degree of noise and lower heat generally observed in their heatmaps [Figure 4]. However, this stability came at the cost of these biases having lower and fewer top performances and never achieving any top performances when the ensemble learning algorithms were included [Table 2]. As such, these differences in the results between performance and generalizability support the concept of a dichotomy between HAD algorithm biases.
While these differing biases highlight a pressing issue with HAD algorithms, they also provide the catalyst for alleviating it. By utilizing the diverse biases in concert, this study was able to mitigate the dichotomous nature of their individual results. The dataset-specific trends modeled by the top-performing algorithms could be balanced against the more abstract trends modeled by the top-generalizing algorithms. This relaxed the high-heat predictions of the top-performing algorithms while simultaneously increasing selectivity over the top-generalizing algorithms, all while reducing background noise [Figure 4]. More succinctly, utilizing the biases in concert increased the separation between the probability predictions for background and target spectral signatures, which resulted in higher AUROC scores. By employing bagging, soft vote, and joint probability ensemble algorithms to aggregate the information of the base learners, the number of top performances matched or exceeded that of the top-performing base learners while suffering no lowest performances [Table 2]. The ensemble algorithms also increased the mean performance across all datasets, outperforming the generalizing ability of the top-generalizing base learners [Table 3]. By employing ensemble algorithms that model the central or joint tendencies of multiple HAD algorithms in concert, the advantages of the base learners could be enhanced while their weaknesses were attenuated, easing the dichotomy between the performances of the base learners in isolation.
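The aggregation strategies described above can be sketched in NumPy: the soft vote averages the base learners' probability matrices, while the joint probability multiplies them, which suppresses background noise more aggressively. The array shapes and values below are illustrative assumptions, not the study's data.

```python
import numpy as np

def soft_vote(probs):
    """Soft vote: per-pixel mean of the base learners' anomaly probabilities."""
    return np.mean(probs, axis=0)

def joint_probability(probs):
    """Joint probability: per-pixel product of the base learners' probabilities."""
    return np.prod(probs, axis=0)

# Hypothetical probabilities from three base learners for four pixels,
# where only the last pixel is anomalous.
probs = np.array([
    [0.1, 0.2, 0.1, 0.9],
    [0.3, 0.1, 0.2, 0.8],
    [0.2, 0.3, 0.1, 0.7],
])

print(soft_vote(probs))          # anomaly stays high, background stays low
print(joint_probability(probs))  # background probabilities collapse toward zero
```

Either way, the gap between background and target probabilities widens relative to a single noisy base learner, which is what drives the higher AUROC scores.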
The results in this study indicate that while the central and joint tendencies of diverse HAD biases can improve performance and generalizability [Table 2 and Table 3], not all biases are equally informative for every dataset [Table 4] and, as such, should not be weighted equally. In this study, the contribution factor of each bias was learned [Equation (10)] using the SELHAD algorithm for each dataset [Table 4] so that more informative biases could exert greater influence on the central tendency modeling. This provided greater fidelity to retain high heats (instead of attenuating them) when they were informative [Figure 4]. It additionally enabled a finer balance in noise removal that was more inclusive than the bagging and soft voting algorithms while not being as aggressive as the joint probability algorithm [Figure 4]. Inverse contributions could be applied to make these effects even more pronounced. This yielded a still better separation of probability predictions from SELHAD, further mitigating the dichotomy between performance and generalizability.
A key component of the enhanced performance observed with SELHAD is the selection of base learners originating from a diverse set of HAD algorithm biases. Each base learner models the data differently and consequently assigns different probability predictions to each spectral signature. These variations enable SELHAD to increase the probabilities of anomalous pixels and decrease the probabilities of background pixels accordingly. More diverse biases provide more information to SELHAD and result in more robust models of the data. The same level of diversity would not occur with base learners that employ similar biases; the probability predictions they produce would be more uniform, leaving less information available for improvement. The aid provided by this diversity is evidenced by the factors SELHAD assigned to the base learners [Table 4] and the heatmaps that resulted from them [Figure 4]. Optimal performance was achieved on most datasets using information from multiple biases, and those optimal performances reflected the central tendencies of the biases SELHAD learned to incorporate into its probability predictions.
While SELHAD demonstrated marked enhancements in performance and generalizability, these advancements were not universally observed. Datasets exhibiting similarly high levels of performance and agreement across all base learners [Figure 3] posed challenges for SELHAD in learning appropriate factors. For instance, the airport 3, beach 4, urban 2, and urban 4 datasets ranked among the five datasets with the lowest standard deviation in AUROC scores [Figure 2] across all base learners (Table 2). Consequently, all base learners for these datasets generally performed well and exhibited substantial agreement, making it more difficult for SELHAD to discern appropriate factors. We hypothesize that introducing trivial or zero factors for some base learners could enhance performance; however, a current limitation of SELHAD is its inability to learn such factors. Hence, a future direction for this work could involve incorporating L1 and L2 regularization terms into the loss function (Equation (9)) to encourage select factors to converge to trivial or zero values. Such regularization could promote sparsity across the β vector, which in turn could enhance not only metric-based performance but also runtime-based performance by eliminating the need to compute anomaly probabilities for some base learners on a given dataset.
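The proposed regularization could take a form like the following sketch, where L1 and L2 penalties on the β vector are added to the ensemble loss of Equation (9); the base loss value and the penalty weights are placeholder assumptions.

```python
import numpy as np

def regularized_loss(base_loss, beta, lam_l1=0.01, lam_l2=0.01):
    """Augment the ensemble's base loss with L1 and L2 penalties on the
    factor vector beta, nudging uninformative factors toward zero."""
    beta = np.asarray(beta, dtype=float)
    return (base_loss
            + lam_l1 * np.abs(beta).sum()       # L1: promotes sparsity
            + lam_l2 * np.square(beta).sum())   # L2: shrinks large factors

print(regularized_loss(0.5, [0.9, 0.05, 0.0]))
```

The L1 term supplies a constant gradient that can drive small factors exactly to zero, which is what would allow the corresponding base learners to be skipped entirely at inference time.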
An important consideration for SELHAD, like most ensemble learning methods, is its computational runtime. Since ensemble methods aggregate the outputs of multiple base learners, their runtime reflects both the cumulative execution time of each learner and the time required for result integration. While parallel computing can mitigate runtime by allowing independent learners to run concurrently, this does not eliminate the increased demand for computational resources. In practical applications, the trade-off between the additional computational cost and the potential gains in performance and generalizability must be carefully weighed to determine if SELHAD is the optimal approach.
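Since the base learners are independent, they can execute concurrently, as in this sketch; the learner function here is a placeholder, and note that only wall-clock time improves while the total compute cost is unchanged.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def run_learner(seed):
    """Placeholder for one base learner producing an anomaly probability map."""
    rng = np.random.default_rng(seed)
    return rng.random(4)  # probabilities for four pixels

def run_ensemble(seeds):
    # Run the independent base learners concurrently, then aggregate.
    with ThreadPoolExecutor() as pool:
        probs = list(pool.map(run_learner, seeds))
    return np.mean(probs, axis=0)  # e.g., soft-vote aggregation

out = run_ensemble([0, 1, 2])
print(out)
```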
7. Conclusions
This study introduced SELHAD, an algorithm designed to learn the factor in which to employ predictions from diverse hyperspectral anomaly detection algorithm biases in concert to mitigate the dichotomy between performance and generalizability in their individual use. SELHAD employs a two-part stacking-based ensemble learning approach comprising a base-learning phase and a meta-learning phase. In the base-learning phase, the algorithm produces multiple slightly perturbed permutations of the data through a set of HAD algorithms (i.e., the base learners), generating anomaly probability matrices. This integrates bootstrapping-based strategies into HAD, thereby bolstering the robustness of the subsequent meta-learning phase. In the meta-learning phase, the factor to which the anomaly probability matrices from each base learner contribute to the outcome is learned by SELHAD. The learned factors can then be used to regulate the influence each base learner has on the outcome via a broadcasting operation, yielding the final probability of each spectral signature’s anomaly status within the dataset. This defines a method of using the biases of multiple hyperspectral anomaly detection algorithms in concert to mitigate the dichotomy between performance and generalizability in their individual use.
To assess the effectiveness of the proposed algorithm, a systematic set of experiments was conducted, comparing SELHAD against the base learners and four ensemble learning algorithms: bagging, soft voting, hard voting, and joint probability. Evaluation encompassed a diverse set of 14 hyperspectral datasets. The datasets were assessed to ensure heterogeneity in spatial resolution, spectral band count, and the characteristics of anomalous targets, including their quantity, average size, and variance. Perturbation of the anomaly probability matrices generated by the base learners was achieved through rotational and flipping-based operations applied to the datasets, along with minute hyperparameter adjustments to the base learners. Subsequently, the AUROC was computed for all ten algorithms using the original data. The frequency with which each algorithm attained the best AUROC across the 14 datasets was tabulated to identify the top-performing algorithm. Furthermore, statistical metrics, including the mean, standard deviation, minimum, median, and maximum AUROCs, were calculated for each algorithm across all datasets to assess their generalizability. Aggregating these metrics provided a comprehensive evaluation of the algorithms' performances.
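The per-dataset AUROC used in this evaluation can be computed with a rank-based (Mann-Whitney) estimator; the following is an illustrative implementation that assumes tie-free scores, not the study's evaluation code.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: the probability that a randomly chosen anomalous
    pixel receives a higher score than a randomly chosen background pixel.
    Assumes no tied scores for simplicity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Perfect separation of two anomalous pixels from two background pixels.
print(auroc([0.1, 0.2, 0.9, 0.8], [0, 0, 1, 1]))
```

A score of 1.0 indicates that every anomalous pixel outranks every background pixel, which is precisely the separation that the ensemble aggregation aims to increase.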
The findings from the experiments demonstrate that SELHAD successfully learned the contributing factors for using multiple HAD algorithms in concert, which largely mitigated the dichotomy between performance and generalizability over their individual use. By regulating each base learner's influence on the final outcome, SELHAD capitalized on the strengths of each bias while tempering their inherent limitations. This resulted in nearly doubling the number of datasets for which top performance was achieved compared with the other nine algorithms and in claiming four of the five statistical measures for generalizability. Consequently, SELHAD departs from the conventional inverse trend between performance and generalizability observed in traditional HAD algorithms, where enhancing one aspect often compromises the other, and it represents an advance toward adaptability. However, there remains room for improvement. Future work could explore integrating regularization terms into the loss function to induce sparsity across the learned factors, thereby further attenuating or eliminating the influence of non-informative base learners. Such refinement not only holds promise for enhancing performance-based metrics but also offers potential gains in runtime efficiency by eliminating the need to compute anomaly probability matrices for some base learners.