1. Introduction
Hyperspectral images (HSIs) are widely applied in land cover mapping, agriculture, urban planning, and other applications [1]. The high spectral resolution of HSIs enables the detection of a wide range of small and specific targets through spectral analysis or visual inspection [2]. Hyperspectral imaging offers numerous advantages for target detection, especially in high-precision and sensitive applications. By capturing detailed spectral information through hundreds or thousands of narrow bands, it distinguishes subtle differences between materials such as minerals, vegetation, and chemicals. Even with mixed pixels, spectral unmixing can identify small or blended targets. Compared to traditional imaging methods, hyperspectral imaging provides greater accuracy and robustness, especially against complex backgrounds, while reducing false detections. It is highly useful for applications such as environmental monitoring, military surveillance, agriculture, and food quality control. The technology allows for remote, real-time data collection, making it versatile and valuable across multiple fields.
In HSIs, the definition of small targets typically depends on the specific application scenario and task requirements. For instance, if the task is to detect traffic signs in HSIs, small targets may be defined as objects with traffic sign characteristics, with their size determined by the actual dimensions of traffic signs. These small targets usually have dimensions ranging from the pixel level to a few dozen pixels. Although their size may be minuscule compared to the entire image, they hold significant importance in specific contexts. Additionally, the semantic content of small targets may carry particular significance; for example, HSIs may include various objects such as vehicles, buildings, and people.
In this topical area, the most challenging targets in HSIs have the following characteristics. (1) Weak contrast: HSIs, with their high spectral resolution but lower spatial resolution, depict spatially and spectrally complex scenes affected by sensor noise and material mixing; the targets are immersed in this context, leading to usually weak contrast and low signal-to-noise ratio (SNR) values. (2) Limited numbers: Due to the coarse spatial resolution, a target often occupies only one or a few pixels in a scene, which provides limited training samples. (3) Spectral complexity: HSIs contain abundant spectral information. This leads to a very large feature-space dimension, strong correlation between bands, extensive data redundancy, and eventually long computational times.
In hyperspectral target detection (HSTD), the background typically refers to the parts of the image that do not contain the target. Compared to the target, the background usually has a higher similarity in color, texture, or spectral characteristics with its surrounding area, resulting in lower contrast within the image; it does not display distinct characteristics or significant differences from the surrounding regions. Background areas generally exhibit a certain level of consistency, with neighboring pixels having high similarity in their attributes, such as a green grass field, a blue sky, or a gray building wall. This consistency means that the background may show relatively uniform spectral characteristics. Additionally, the background usually occupies a larger portion of the image and may have characteristics related to the environment, such as terrain, vegetation types, and building structures. These characteristics help distinguish the background from the target area. Accurately defining and modeling the background is crucial for improving the accuracy and robustness of target detection.
Based on these characteristics, algorithms for HSI target extraction may be roughly subdivided into two broad families: more traditional signal detection methods, and more recent data-driven pattern recognition and machine learning methods.
Traditionally, detection involves transforming the spectral characteristics of target and background pixels into a specific feature space based on predefined criteria [3]. Targets and backgrounds occupy distinct positions within this space, allowing targets to be extracted using threshold or clustering techniques. In this research domain, diverse descriptions of background models have led to the development of various mathematical models [4,5,6,7] for characterizing spectral pixel changes.
The first category of algorithms in this family comprises the spectral-information-based models, in which the target and background spectral signatures are assumed to be generated by a linear combination of endmember spectra. The Orthogonal Subspace Projection (OSP) [8] and the Adaptive Subspace Detector (ASD) [9] are two representative subspace-based target detection algorithms. OSP employs a signal detection method to remove background features by projecting each pixel's spectral vector onto a subspace. However, spectral variability caused by the atmosphere, sensor noise, and the mixing of multiple spectra means that the same object may exhibit distinct spectra and the same spectrum may appear in different objects, which, given the limitations of imaging technology, makes target identification more challenging in practice [10]. To tackle this issue, Chen et al. proposed an adaptive target pixel selection approach based on spectral similarity and spatial relationship characteristics [11], which addresses the pixel selection problem.
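As a sketch of the subspace idea, assuming a known target signature d and a matrix U whose columns are background endmember spectra (all variable names here are illustrative, not taken from the cited works), OSP annihilates the background subspace before matching against the target:

```python
import numpy as np

def osp_detector(x, d, U):
    """Orthogonal Subspace Projection score for one pixel spectrum.

    x : (B,) pixel spectrum
    d : (B,) target signature
    U : (B, k) background endmember spectra as columns
    """
    # Projector onto the orthogonal complement of the background subspace
    P = np.eye(U.shape[0]) - U @ np.linalg.pinv(U)
    # Match the background-suppressed pixel against the target signature
    return float(d @ P @ x)

# Toy example: a pixel lying in the background subspace scores ~0
rng = np.random.default_rng(0)
d = rng.normal(size=50)           # hypothetical target signature
U = rng.normal(size=(50, 3))      # hypothetical background endmembers
print(osp_detector(U[:, 0], d, U))  # close to 0: background is annihilated
```

A target-like pixel, by contrast, keeps a large residual after the projection and yields a high score.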
A second category consists of statistical methods. These approaches assume that the background follows a specified distribution and then establish whether or not the target exists by looking for outliers with respect to this distribution. The Adaptive Coherence/Cosine Estimator (ACE) [12] and the Adaptive Matched Filter (AMF) [13] are among the techniques in this group, both of which are based on the Generalized Likelihood Ratio Test (GLRT) method [14].
Both ACE and AMF are spectral detectors that measure the distance between target features and data samples. ACE can be considered a special case of a spectral-angle-based detector, while AMF was designed based on hypothesis testing for Gaussian distributions. The third category comprises representation-based methods that do not assume any data distribution, e.g., Constrained Energy Minimization (CEM) [15,16], hierarchical CEM (hCEM) [17], ensemble-based CEM (eCEM) [18], sCEM [19], the target-constrained interference-minimized filter (TCIMF), and sparse representation (SR)-based methods [20]. Among these approaches, the classic and foundational CEM method constrains targets while minimizing the variance of data samples, and the TCIMF method combines CEM and OSP.
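For concreteness, the closed forms of these classical detectors can be sketched with NumPy. Here d is the target signature, CEM uses the sample correlation matrix R, and ACE/AMF use the background mean and covariance; the small diagonal loading term is our own addition for numerical stability, not part of the original formulations:

```python
import numpy as np

def cem_scores(X, d):
    """CEM: minimize output energy subject to the constraint w @ d == 1.

    X : (N, B) pixel spectra as rows;  d : (B,) target signature.
    """
    R = X.T @ X / X.shape[0]          # sample correlation matrix
    Rinv_d = np.linalg.solve(R, d)
    w = Rinv_d / (d @ Rinv_d)         # constrained minimum-energy filter
    return X @ w                      # detector output for every pixel

def ace_amf_scores(X, d):
    """ACE and AMF scores built from the background mean and covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / X.shape[0]
    Sinv = np.linalg.inv(Sigma + 1e-6 * np.eye(X.shape[1]))  # loaded inverse
    dc = d - mu
    num = (Xc @ Sinv @ dc) ** 2
    amf = num / (dc @ Sinv @ dc)
    # ACE additionally normalizes by each pixel's own Mahalanobis energy,
    # which makes it a cosine (angle) statistic bounded by 1
    ace = num / ((dc @ Sinv @ dc) * np.einsum('ij,jk,ik->i', Xc, Sinv, Xc))
    return ace, amf
```

Note the CEM constraint: a pixel exactly equal to d always receives an output of 1, while the filter suppresses the energy of everything else.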
The more recent learning-based methods for target detection are mainly built on data representations involving kernels, sparse representations, manifold learning, and unsupervised learning. Specifically, methods based on sparse representation take into account the connections between samples in the sparse representation space. The Combined Sparse and Collaborative Representation (CSCR) [21] and the Dual Sparsity Constrained (DSC) [22] methods are examples of sparse representation. However, the requirement for exact pixel-wise labeling makes achieving good performance expensive. To address the challenge of obtaining pixel-level accurate labels, Jiao et al. proposed a semantic multiple-instance neural network with contrastive and sparse attention fusion [23]. Kernel-based transformations [24] are employed to address the linear inseparability between targets and background in the original feature space. Gaussian radial basis function kernels are commonly used, but there are still no established rules for choosing the best-performing kernel. Furthermore, in [25], manifold learning is employed to learn a subspace that encodes discriminative information. Finally, among unsupervised learning methods, an effective feature extraction method based on unsupervised networks has been proposed to mine the intrinsic properties underlying HSIs: spectral regularization is imposed on an autoencoder (AE) and a variational AE (VAE) to emphasize spectral consistency [26]. Another novel network block, with a region-of-interest feature transformation and a multi-scale spectral-attention module, has also been proposed to reduce spatial and spectral redundancies simultaneously and provide strong discrimination [27].
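The spectral-consistency idea behind such regularization is commonly expressed through a spectral angle term between an input spectrum and its reconstruction; the following is a minimal illustrative formulation, not the exact loss of the cited works:

```python
import numpy as np

def spectral_angle(a, b, eps=1e-12):
    """Angle (radians) between two pixel spectra; 0 means identical shape.

    Scale-invariant: a spectrum and any positive multiple of it score 0,
    which is why it is preferred over MSE for spectral consistency.
    """
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def spectral_consistency_loss(X, X_rec):
    """Mean spectral angle between input pixels and their reconstructions."""
    return float(np.mean([spectral_angle(x, y) for x, y in zip(X, X_rec)]))
```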
While recent advancements have shown increased effectiveness, challenges remain in efficiently tuning a large number of hyperparameters and obtaining accurate labels [28,29]. Moreover, the statistical features extracted by generative adversarial network (GAN)-based methods often overlook the potential topological structure information. This greatly limits the ability to capture the non-local topological relationships that would better represent the underlying data structure of HSIs. Thus, representative features are not fully exploited and utilized to preserve the most valuable information across different networks. Detecting the location and shape of small targets with weak contrast against the background remains a significant challenge. Therefore, this paper proposes a deep representative model of the graph and generative learning fusion network with frequency representation. The goal is to develop a stable and robust model based on an effective feature representation. The primary contributions of this study are summarized as follows:
(1) We explore a collaborative framework for HSTD with lower computational cost and high accuracy. Within this framework, feature extraction from graph learning and generative learning compensate for each other. To our knowledge, this is the first work to explore the collaborative relationship between graph and generative learning in HSTD.
(2) The graph learning module is established for HSTD. The GCN module aims to compensate for the loss of detail information caused by the encoder and decoder by aggregating features from multiple adjacent levels. As a result, the detailed features of small targets can be propagated to the deeper layers of the network.
(3) The primary mini-batch GCN branch for HSTD is designed by following an explicit design principle derived from the graph method to reduce high computational costs. It enables the graph to enhance features and suppress noise, effectively dealing with background interference and retaining target details.
(4) A spectral-constrained filter is used to retain the different frequency components. Frequency learning is introduced into data preparation for coarse candidate sample selection, favoring strong relevance among pixels of the same object.
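As a generic illustration of the frequency-representation idea (a sketch under our own assumptions, not the paper's exact spectral-constrained filter), a 1-D pixel spectrum can be split into low- and high-frequency components with an FFT mask, where the cutoff k is a free parameter:

```python
import numpy as np

def split_spectrum(s, k=8):
    """Split a 1-D spectrum into low/high-frequency parts via an FFT mask."""
    F = np.fft.rfft(s)
    low = F.copy()
    low[k:] = 0                      # keep only the k lowest frequencies
    high = F - low                   # the remainder is the high-pass part
    return np.fft.irfft(low, n=len(s)), np.fft.irfft(high, n=len(s))

# Toy spectrum: a smooth trend plus noise; the decomposition is lossless
s = np.sin(np.linspace(0, 4 * np.pi, 128)) \
    + 0.1 * np.random.default_rng(2).normal(size=128)
low, high = split_spectrum(s)
# low + high reconstructs the original spectrum exactly
```

Pixels of the same object tend to share the smooth low-frequency component, which is what makes such a decomposition useful for coarse candidate selection.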
The remainder of this article is structured as follows. Section 2 provides a brief overview of graph learning. Section 3 elaborates on the proposed GCN and introduces the fusion module. Extensive experiments and analyses are given in Section 4. Section 5 provides some conclusions.
2. Graph Learning
A graph describes one-to-many relations in a non-Euclidean space [30] and includes directed and undirected patterns. It can be combined with a neural network to design Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), Graph Autoencoders (GAEs), Graph Generative Networks (GGNs), and Graph Spatial-Temporal Networks (GSTNs) [31].
Among the graph-based networks mentioned above, this work focuses on GCNs, a family of neural networks that generalize convolution from regular grid data to graph data. Specifically, the convolution operation can be applied within a graph structure, where the key operation involves learning a mapping function. In this work, an undirected graph models the relationship between spectral features and performs feature extraction [32]. In matrix form, the layer-wise propagation rule of the multi-layer GCN is defined as
H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right),
where H^{(l)} denotes the output at the l-th layer, \sigma(\cdot) is the activation function with respect to the weights W^{(l)} to be learned and the biases of all layers, \tilde{D} represents the diagonal matrix of node degrees of \tilde{A}, and \tilde{A} = A + I represents the adjacency matrix A of the undirected graph G added to the identity matrix I of proper size throughout this article.
Deep-learning-based approaches to HSIs based on graph learning are increasingly common, especially for detection and classification. For instance, the Weighted Feature fusion of Convolutional neural network and Graph attention network (WFCG) exploits the complementary characteristics of superpixel-based GATs and pixel-based CNNs [33,34]. Additionally, a Robust Self-Ensembling Network (RSEN) has been proposed, comprising a base network and an ensemble network, which achieves satisfactory results even with limited sample sizes [35].