**Detection and Classification of Rice Infestation with Rice Leaf Folder (Cnaphalocrocis medinalis) Using Hyperspectral Imaging Techniques**

**Gui-Chou Liang 1, Yen-Chieh Ouyang <sup>2</sup> and Shu-Mei Dai 1,\***


This paper developed a hyperspectral imaging technique that combines constrained energy minimization (CEM) and deep neural networks to detect defects in the spectral images of infested rice leaves, and compared the performance of each in the full spectral band, selected bands, and a band expansion process (BEP) that compresses spectral information into the selected bands. A total of 339 hyperspectral images were collected in this study; the results showed that six bands were sufficient for detecting early infestations of rice leaf folder (RLF), with a detection accuracy of 98% and a Dice similarity coefficient of 0.8, which offers advantages for commercialization in this field.


### **Detection of Insect Damage in Green Coffee Beans Using VIS-NIR Hyperspectral Imaging**

**Shih-Yu Chen 1,2,\*, Chuan-Yu Chang 1,2, Cheng-Syue Ou <sup>1</sup> and Chou-Tien Lien <sup>1</sup>**


This paper developed a hyperspectral insect damage-detection algorithm (HIDDA) that can automatically detect insect-damaged beans using only a few bands and one spectral signature. It used a push-broom visible-near infrared (VIS-NIR) hyperspectral sensor to obtain images of coffee beans. It takes advantage of recently developed constrained energy minimization (CEM)-based band selection methods coupled with two classifiers, support vector machines (SVM) and convolutional neural networks (CNN), to select bands. The experiments showed that 850–950 nm is an important wavelength range for accurately identifying insect-damaged beans, and HIDDA can indeed detect insect-damaged beans with only one spectral signature, which will provide an advantage in terms of practical applications and commercialization in the future.

#### **3. Conclusions**

The success of this Special Issue is owed to the many researchers who were willing to share their original ideas and findings. All guest editors would like to express their sincere gratitude to the researchers for the time and effort they devoted to making this Special Issue a reality. Lastly, we extend a special thanks to the anonymous reviewers for their hard work in providing valuable and insightful comments that helped the authors improve the presentation and quality of their papers. Without their contributions, this Special Issue could not have been completed.

**Acknowledgments:** The work of M. Song was supported by the National Natural Science Foundation of China (61971082, 61890964). The work of H. Yu was supported by the National Natural Science Foundation of China under Grant 42101350, and the China Postdoctoral Science Foundation under Grant 2022T150080 and Grant 2020M680925. The work of H. Li was supported by the National Science and Technology Council, Taiwan under Grant No. MOST 109-2221-E-027-124-MY3.

**Conflicts of Interest:** The guest editors declare no conflict of interest.

### *Article* **Deep Relation Network for Hyperspectral Image Few-Shot Classification**

**Kuiliang Gao 1,\*, Bing Liu 1, Xuchu Yu 1, Jinchun Qin 2, Pengqiang Zhang <sup>1</sup> and Xiong Tan <sup>1</sup>**


Received: 20 February 2020; Accepted: 10 March 2020; Published: 13 March 2020

**Abstract:** Deep learning has achieved great success in hyperspectral image classification. However, when processing new hyperspectral images, existing deep learning models must be retrained from scratch with sufficient samples, which is inefficient and undesirable in practical tasks. This paper aims to explore how to accurately classify new hyperspectral images with only a few labeled samples, i.e., hyperspectral image few-shot classification. Specifically, we design a new deep classification model based on a relation network and train it with the idea of meta-learning. Firstly, the feature learning module and the relation learning module of the model can make full use of the spatial–spectral information in hyperspectral images and carry out relation learning by comparing the similarity between samples. Secondly, the task-based learning strategy enables the model to continuously enhance its ability to learn how to learn with a large number of tasks randomly generated from different data sets. Benefitting from the above two points, the proposed method has excellent generalization ability and can obtain satisfactory classification results with only a few labeled samples. In order to verify the performance of the proposed method, experiments were carried out on three public data sets. The results indicate that the proposed method can achieve better classification results than the traditional semisupervised support vector machine and semisupervised deep learning models.

**Keywords:** hyperspectral image few-shot classification; deep learning; meta-learning; relation network; convolutional neural network

#### **1. Introduction**

Hyperspectral remote sensing, as an important means of Earth observation, is one of the most important technological advances in the field of remote sensing. Utilizing imaging spectrometers with very high spectral resolution, hyperspectral remote sensing can obtain abundant spectral information on the observed area so as to produce hyperspectral images (HSI) with a three-dimensional data structure. As HSI have the unique advantage of "spatial–spectral unity" (HSI contain both abundant spectral and spatial information), hyperspectral remote sensing has been widely used in precision agriculture, land-use planning, target detection, and many other fields.

HSI classification is one of the most important steps in HSI analysis and application; its basic task is to determine a unique category for each pixel. In early research, the working mode of feature extraction combined with classifiers such as support vector machines (SVM) [1] and random forest (RF) [2] was dominant. Initially, in order to alleviate the Hughes phenomenon caused by band redundancy, researchers introduced a series of feature extraction methods to extract spectral features conducive to classification from the abundant spectral information. Common spectral feature extraction methods include principal component analysis (PCA) [3], independent component analysis (ICA) [4], linear discriminant analysis (LDA) [5], and other linear methods, as well as kernel principal component analysis (KPCA) [6], locally linear embedding (LLE) [7], t-distributed stochastic neighbor embedding (t-SNE) [8], and other nonlinear methods. Admittedly, these feature extraction methods can achieve some results, but ignoring the spatial structure information in HSI still seriously hinders improvement in classification accuracy. To this end, a series of spatial information utilization methods were introduced, such as extended morphological profiles (EMP) [9], local binary patterns (LBP) [10], 3D Gabor features [11], Markov random fields (MRF) [12], spatial filtering [13], variants of non-negative matrix underapproximation (NMU) [14], and so on. The extraction and utilization of spatial features can effectively improve classification accuracy. However, because feature extraction and classification are performed separately in the traditional classification mode, the adaptability between them cannot be fully considered [15]. In addition, the classification results of traditional methods largely depend on accumulated experience and parameter settings, and thus lack stability and robustness.

In recent years, with the development of artificial intelligence, deep learning has been applied to the field of remote sensing [16]. Compared to traditional methods, deep learning can automatically learn the required features from the data by establishing a hierarchical framework. Moreover, these features are often more discriminative and more conducive to classification. Stacked autoencoders (SAE) [17], recurrent neural networks (RNN) [18,19], and deep belief networks (DBN) [20,21] were the first to be applied to HSI classification. These methods can achieve higher classification accuracy than traditional methods under certain conditions. Nevertheless, some necessary preprocessing must be performed to transform HSI into one-dimensional vectors for feature extraction, which destroys the spatial structure information in HSI. Compared with the above deep learning models, convolutional neural networks (CNNs) are more suitable for HSI processing and feature extraction. At present, 2D-CNN and 3D-CNN are the two basic models widely used in HSI classification [22]. By means of two-dimensional and three-dimensional convolution operations, 2D-CNN and 3D-CNN can both fully extract and utilize the spatial–spectral features in HSI. Yue et al. took the lead in exploring the effect of 2D-CNN in HSI classification. Subsequently, many improved models based on 2D-CNN have been proposed, constantly refreshing classification accuracy, such as DR-CNN [23], contextual deep CNN [24], DCNN [25], DC-CNN [26], and so on. Most 2D-CNN-based methods use PCA to reduce the dimension of HSI in order to reduce the number of channels in the convolution operation. However, this practice inevitably loses important detail information in HSI. The advantage of 3D-CNN is that it can directly perform three-dimensional convolution operations on HSI without any preprocessing and can make full use of spatial–spectral information to further improve classification accuracy. Chen et al. took the lead in utilizing 3D-CNN for HSI classification and conducted detailed studies on the number of network layers, the number of convolution kernels, the size of the neighborhood, and other hyperparameters [27]. On this basis, methods such as residual learning [28], attention mechanisms [29], dense networks [30], and multiscale convolution [31] have been combined with 3D-CNN, resulting in higher classification accuracy. In addition, CNN has been combined with other methods such as active learning [32], capsule networks [33], superpixel segmentation [34], and so on, which can achieve promising classification results when the training samples are sufficient.

Indeed, deep learning has seen great success in HSI classification. However, there is still a serious contradiction between the huge parameter space of the deep learning model and the limited labeled samples of HSI. In other words, the deep learning model must have enough labeled samples as a guarantee in order to give full play to its classification performance. Nevertheless, it is difficult to obtain enough labeled samples in practice, because the acquisition of labeled samples is time-consuming and laborious. In order to improve classification accuracy under the condition of limited labeled samples, semisupervised learning and data augmentation are widely applied. In [35,36], CNN was combined with semisupervised classification. In [37], Kang et al. first extracted PCA, EMP, and edge-preserving features (EPF), then carried out classification by combining a semisupervised method and a decision fusion strategy. In [27], Chen et al. generated virtual training samples by adding noise to the original labeled samples, while in [38,39], the number of training samples was increased by constructing training sample pairs. In recent years, with the emergence of generative adversarial networks (GANs), many researchers have utilized synthetic samples generated by GANs to assist in training networks [40–42]. It is true that the above methods can improve classification accuracy under the condition of limited labeled samples, but they either further explore the features of the insufficient labeled samples or utilize the information of unlabeled samples in the HSI being classified to further train the model. In other words, the HSI used to train the model are exactly identical to the target HSI used to test the model. This means that when processing a new HSI, the model must be retrained from scratch. However, it is impractical to train a classifier for each HSI, as this would incur significant overhead.

Few-shot learning refers to the setting in which a model can effectively distinguish the categories in a new data set with only a very few labeled samples [43]. The availability of very few samples challenges the standard training practice in deep learning [44]. Different from existing deep learning models, however, humans are very good at few-shot learning, because they can effectively utilize previous learning experience and have the ability to learn how to learn, which is the concept of meta-learning [45,46]. Therefore, we should effectively utilize transferable knowledge in previously collected HSI to classify other new HSI, so as to reduce cost as much as possible. Different HSI contain different types and quantities of ground objects, so it is difficult for general transfer learning [47,48] to obtain satisfactory classification accuracy with a few labeled samples. According to the idea of meta-learning, the model not only needs to learn transferable knowledge that is conducive to classification but also needs to learn the ability to learn.

The purpose of this paper is to explore how to accurately classify, with only a few labeled samples (e.g., five labeled samples per class), new HSI that are completely different from the HSI used for training. More specifically, this paper designs a new model based on a relation network [49] for HSI few-shot classification (RN-FSC) and trains it with the idea of meta-learning. The designed model is an end-to-end framework including two modules, a feature learning module and a relation learning module, which can effectively simplify the classification process. The feature learning module is responsible for extracting deep features from samples in HSI, while the relation learning module carries out relation learning by comparing the similarity between different samples; that is, the relation score between samples belonging to the same class is high, and the relation score between samples belonging to different classes is low. From the perspective of workflow, the proposed RN-FSC method consists of three steps. In the first step, we use the designed network model to carry out meta-learning on the source HSI data set, so that the model can fully learn transferable feature knowledge and relation comparison ability, i.e., the ability to learn how to learn. In the second step, the network model is fine-tuned with only a few labeled samples in the target HSI data set so that the model can quickly adapt to new classification scenarios. In the third step, the target HSI data sets are used to test the classification performance of the proposed method. It is important to note that the target HSI data set for classification and the source HSI data set for meta-learning are completely different.

The main contributions of this paper are as follows:


The remainder of this paper is structured as follows. In Section 2, HSI few-shot classification is introduced. In Section 3, the designed relation network model is described in detail. In Section 4, experimental results and analysis on three publicly available HSI data sets are presented. Finally, conclusions are provided in Section 5.

#### **2. HSI Few-Shot Classification**

In this section, we first explain the definition of few-shot classification, then describe the task-based learning strategy in detail, and finally give the complete process of HSI few-shot classification.

#### *2.1. Definition of Few-Shot Classification*

In order to explain the definition of few-shot classification, we must first distinguish several concepts: the source data set, target data set, fine-tuning data set, and testing data set. Both the fine-tuning data set and the testing data set are subsets of the target data set, sharing the same label space, while the source data set and the target data set are totally different. Following most existing deep learning models, we could utilize only the fine-tuning data set to train a classifier. However, the classification performance of such a classifier is very poor because the fine-tuning data set is very small. Therefore, we need to use the idea of meta-learning to carry out the classification task (as shown in Figure 1). The model first performs meta-learning on the source data set to extract transferable feature knowledge and cultivate the ability of learning to learn. After meta-learning, the model has acquired enough generalization knowledge. Then, the model is fine-tuned on the fine-tuning data set to extract individual knowledge, so as to adapt to the new classification scenario quickly. The fine-tuning data set is very small compared to the testing data set, so the process of fine-tuning can be called few-shot learning. If the fine-tuning data set contains *C* unique classes and each class includes *K* labeled samples, the classification problem is called *C*-way *K*-shot. Finally, the model is utilized to classify the testing data set.

**Figure 1.** Definition of few-shot classification.

#### *2.2. Task-Based Learning Strategy*

At present, the batch-based training strategy is widely used in deep learning models, as shown in Figure 2a. In the training process, each batch contains a certain number of samples with specific labels. The training process of the model is actually based on samples to calculate the loss and update the network parameters. General transfer learning also uses this strategy for model training.

**Figure 2.** Different training and learning strategies, where color represents class. (**a**) Batch-based training strategy widely used in deep learning. (**b**) Task-based learning strategy used in meta-learning.

Meta-learning can also be regarded as a learning process of transferring feature knowledge. The key that allows meta-learning to endow the model with a stronger learning ability than general transfer learning is the task-based learning strategy. In meta-learning, tasks are treated as the basic unit for training [45,49]. As shown in Figure 2b, a task contains a support set and a query set. The support set and the query set are sampled randomly from the same data set and share the same label space. The samples *x* in the support set are explicitly labeled by *y*, while the labels of samples in the query set are regarded as unknown. The model predicts the labels of samples in the query set under the supervision of the support set and calculates the loss by comparing the predicted labels with the real labels, thus realizing the update of parameters.

The model runs on the basis of the task-based learning strategy, whether in the meta-learning phase, the few-shot learning phase, or the classification phase. One task is actually one training iteration. Take meta-learning on a source data set containing *Csrc* classes as an example. During each iteration, a task is generated by randomly selecting *C* classes and *K* samples per class from the source data set. Thus, the support set can be denoted as $S = \{(x_i, y_i)\}_{i=1}^{C \times K}$. Similarly, *C* × *N* samples are randomly sampled from the same *C* classes to form a query set $Q = \{(x_j, y_j)\}_{j=1}^{C \times N}$. It is important to note that there is no intersection between $S$ and $Q$. In practice, we usually set *C* < *Csrc*, which guarantees the richness of tasks and thus improves the robustness of the model. In theory, *N* tends to be much larger than *K*, so as to mimic the actual few-shot classification scenario. In summary, through the above description, a *C*-way *K*-shot *N*-query learning task has been built on the source data set.
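To make the task construction concrete, the sketch below shows how a *C*-way *K*-shot *N*-query task could be sampled. The function name and the dictionary-based label index are illustrative assumptions, not taken from the paper's code.

```python
import random

def sample_task(labels_by_class, C, K, N):
    """Build one C-way K-shot N-query task (episode).

    labels_by_class: dict mapping class id -> list of sample indices.
    Returns (support, query) as lists of (sample index, class id) pairs,
    drawn from C randomly chosen classes with no overlap between S and Q.
    """
    classes = random.sample(sorted(labels_by_class), C)    # C of the C_src classes
    support, query = [], []
    for c in classes:
        picked = random.sample(labels_by_class[c], K + N)  # disjoint S and Q draws
        support += [(i, c) for i in picked[:K]]
        query += [(i, c) for i in picked[K:]]
    return support, query

# Example: the 20-way 1-shot 19-query task used in the meta-learning phase.
# support, query = sample_task(labels_by_class, C=20, K=1, N=19)
```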

#### *2.3. HSI Few-Shot Classification*

In the previous sections, we explained in detail the few-shot classification and its learning strategy. It is not difficult to apply it to HSI classification. We only need to utilize the collected HSI as the source data set, e.g., the Botswana and Houston data sets, and utilize other HSI as the target data set, e.g., the Pavia Center data set. The complete HSI few-shot classification process based on the task-based learning strategy can be summarized as follows.


#### **3. The Designed Relation Network Model**

This section introduces the designed relation network model for HSI few-shot classification. The designed model consists of two core modules, the feature learning module and the relation learning module, which are introduced in detail. In addition, we explain how the model acquires the ability to learn how to learn from three different perspectives.

#### *3.1. Model Overview*

The designed relation network model for HSI few-shot classification consists of three parts: feature learning, feature concatenation, and relation learning, as illustrated in Figure 3. The model is an end-to-end framework, with tasks as inputs and predicted labels as outputs.

**Figure 3.** Visual representation of the designed relation network model for HSI few-shot classification.

Specifically, we select the data cube belonging to each pixel in the HSI as a sample in the task. As defined in Section 2.2, a sample in the support set is denoted as *xi*, and a sample in the query set is denoted as *xj*. The feature learning module is equivalent to a nonlinear embedding function *f*, which maps samples *xi* and *xj* in the data space to abstract features *f*(*xi*) and *f*(*xj*) in the feature space. Then, features *f*(*xi*) and *f*(*xj*) are concatenated in the depth dimension, which can be denoted as C(*f*(*xi*), *f*(*xj*)). Of course, there is more than one way to perform concatenation. It should be noted, however, that each sample feature in the query set should be concatenated to each feature generated by the support set. In addition, in order to simplify the following calculation and improve the robustness of the model, the sample features belonging to the same class in the support set are averaged. Consequently, the number of features generated from the support set is always equal to *C*. This means that, for the support set $S = \{(x_i, y_i)\}_{i=1}^{C \times K}$ and the query set $Q = \{(x_j, y_j)\}_{j=1}^{C \times N}$, *C* × *C* × *N* concatenations would be generated. The relation learning module can also be regarded as a nonlinear function *g*, which maps each concatenation to a relation score $r_{i,j} = g[C(f(x_i), f(x_j))]$ representing the similarity between *xi* and *xj*. If samples *xi* and *xj* belong to the same class, the relation score will be close to 1; otherwise, it will be close to 0. Finally, the maximum score is taken from the relation score set $R = \{r_{l,j}\}\,(l = 1, \dots, C)$ of sample *xj*, so as to decide the predicted label.

The model is trained with mean square error (MSE) as the loss function (Equation (1)). MSE is easy to calculate and sufficient for training. If *yi* and *yj* belong to the same class, (*yi* == *yj*) is 1, otherwise 0, which can effectively achieve relation learning.

$$L_{MSE} = \sum_{i=1}^{C \times K} \sum_{j=1}^{C \times N} \left( r_{i,j} - \mathbf{1}(y_i == y_j) \right)^2. \tag{1}$$
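As a minimal sketch of how the relation scores and the loss of Equation (1) interact, the snippet below computes the 0/1 target from class labels. The tensor shapes and names are assumptions for illustration; with class-averaged support features, the first index runs over the *C* prototypes rather than all *C* × *K* support samples.

```python
import torch

def relation_mse_loss(scores, support_classes, query_classes):
    """MSE relation loss in the spirit of Equation (1).

    scores:          (C, C*N) tensor of relation scores r_{l,j}, one per
                     (class prototype l, query sample j) pair, in [0, 1].
    support_classes: (C,) tensor of class ids for the averaged prototypes.
    query_classes:   (C*N,) tensor of class ids for the query samples.
    """
    # Target is 1 where the prototype class matches the query class, else 0.
    target = (support_classes.unsqueeze(1) == query_classes.unsqueeze(0)).float()
    return ((scores - target) ** 2).sum()

# Prediction: each query sample takes the class of its highest-scoring prototype.
# pred = support_classes[scores.argmax(dim=0)]
```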

#### *3.2. The Feature Learning Module*

The goal of the feature learning module is to extract more discriminative features from the input data cubes. Theoretically, any network structure can be built in this module for feature learning. A large number of studies have shown that 3D convolution is more suitable for spatial–spectral feature extraction because of the close correlation between the spatial domain and the spectral domain in HSI. Therefore, we take the 3D convolutional layer as the core and construct the feature learning network as shown in Figure 4.

**Figure 4.** Visual representation of the feature learning module.

The feature learning module consists of 3D convolutional layers, batch normalization layers, ReLU activation functions, maximum pooling layers, and a concatenation operation. 3D convolution can process the input data cubes directly without any preprocessing. Compared with general 2D convolution, 3D convolution can extract more discriminative spatial–spectral features by cascading the spectral information of adjacent bands. Specifically, the 3D convolution kernel is set to 3 × 3 × 3, and the number of convolution kernels doubles from 8 up to 32, which is consistent with experience in the field of computer vision. Batch normalization layers are added after each 3D convolutional layer, which can effectively alleviate the problem of vanishing gradients and enhance the generalization ability of the model. The ReLU activation function, one of the most widely used activation functions in deep learning, can increase the nonlinearity of the model and speed up convergence. A 3D convolutional layer, batch normalization layer, and ReLU layer can be considered as a basic unit. The units are connected via maximum pooling layers. Considering the characteristics of HSI, the maximum pooling kernel is set to 2 × 2 × 4 to deal with spectral redundancy.

After three convolution operations, the input samples become data cubes with 32 channels. To facilitate the subsequent operation in the feature concatenation phase, we first concatenate the 32 data cubes in the channel dimension. Given that the dimension of the data cubes is (32, *H*, *W*, *D*), it becomes (*H*, *W*, *D* × 32) after channel concatenation.
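A PyTorch sketch of the feature learning module follows. The three conv–BN–ReLU units, 3 × 3 × 3 kernels, 8→16→32 channels, 2 × 2 × 4 max pooling, and final channel concatenation come from the text above; padding, strides, and placing a pooling layer after every unit are assumptions not fixed by the paper.

```python
import torch
import torch.nn as nn

class FeatureLearningModule(nn.Module):
    """3D-CNN feature extractor, a sketch of Section 3.2."""

    def __init__(self):
        super().__init__()
        chans = [1, 8, 16, 32]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [
                nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm3d(c_out),
                nn.ReLU(inplace=True),
                # 2x2x4 pooling: the larger spectral window tackles redundancy.
                nn.MaxPool3d(kernel_size=(2, 2, 4), ceil_mode=True),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (B, 1, H, W, D) data cubes, D spectral
        f = self.net(x)            # (B, 32, H', W', D')
        b, c, h, w, d = f.shape
        # Concatenate the 32 feature cubes in the channel dimension:
        # (32, H', W', D') -> (H', W', D' * 32), as described in the text.
        return f.permute(0, 2, 3, 4, 1).reshape(b, h, w, d * c)
```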

#### *3.3. The Relation Learning Module*

Under the combined action of the first two phases, the data cubes are transformed into different concatenations which are the input of the relation learning module (Figure 5). The purpose of the relation learning module is to map each concatenation to a relation score measuring the similarity between the two samples, i.e., the relationship.

In order to speed up computation, 2D convolution is used as the core to build the relation learning module. Accordingly, the dimension of the concatenations can be regarded as (*H*, *W*, *C*), where *C* stands for the channel dimension. Considering that the channel dimension is much larger than the spatial dimension, a 1 × 1 2D convolution [50] is first adopted, which can extract cascaded features across channels while reducing the dimension effectively. After the 1 × 1 convolution, 128 convolution kernels of 3 × 3 are utilized to ensure the diversity of features. In order to fully train the network, a batch normalization layer and ReLU activation function are also applied after each convolution. Finally, two fully connected layers with 128 and 1 units are added, so as to transform the feature maps into relation scores. Dropout is introduced between the fully connected layers to further enhance the generalization capability. In addition, a sigmoid activation function is used to limit the output to the interval [0, 1].
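The relation learning module could be sketched as below. The 1 × 1 convolution, 128 3 × 3 kernels, the FC(128)→FC(1) head with dropout, and the sigmoid follow the text; the 1 × 1 output width (`mid_channels`) and the flattened spatial size (`spatial`) are assumptions for illustration.

```python
import torch.nn as nn

class RelationLearningModule(nn.Module):
    """Relation scorer, a sketch of Section 3.3."""

    def __init__(self, in_channels, mid_channels=64, spatial=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),  # cross-channel squeeze
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * spatial * spatial, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),            # 50% dropout, per Section 4.2
            nn.Linear(128, 1),
            nn.Sigmoid(),                 # relation score in [0, 1]
        )

    def forward(self, x):                 # x: (B, C_in, H, W) concatenations
        return self.fc(self.conv(x)).squeeze(-1)
```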

The relation score is not only the final result of relation learning but also a kind of similarity measure. If two samples belong to the same class, the relation score is close to 1, otherwise 0. Therefore, the classes of samples in the query set are determined according to the relation scores.

**Figure 5.** Visual representation of the relation learning module.

#### *3.4. The Ability of Learning to Learn*

Our proposed method, RN-FSC, is essentially a meta-learning-based method for HSI few-shot classification. The core idea of meta-learning is to cultivate the ability of learning to learn. In this section, we expound on this ability of RN-FSC from the following three aspects:

(1) Learning process

General deep learning models are trained based on the unique correspondence between data and labels and can only be trained in one specific task space at a time. However, the proposed method performs task-based learning in every phase. The model focuses not on a specific classification task but on the learning ability gained from many different tasks;

(2) Scalability

The proposed method performs meta-learning on the source data set to extract transferable feature knowledge and cultivate the ability of learning to learn. From the perspective of knowledge transfer, the richer the categories in the source data set, the stronger the acquired learning ability, which is consistent with the human learning experience. Therefore, we can appropriately extend the source data set to enhance the generalization ability of the model;

(3) Core mechanism

The proposed method does not learn how to classify a specific data set; rather, it learns a deep metric space with the help of many tasks from different data sets, in which relation learning is performed by comparison. Learned in a data-driven way, this metric space is nonlinear and transferable. By comparing the similarity between the support samples and the query samples in the deep metric space, classification is realized indirectly.

#### **4. Experiments and Discussion**

All experiments were carried out on a laptop with an Intel Core i7-9750H CPU at 2.60 GHz, an Nvidia GeForce RTX 2070 GPU, and 16 GB of memory. All programs were developed and implemented with the PyTorch library.

#### *4.1. Experimental Data Sets*

#### 4.1.1. Source Data Sets

Four publicly available HSI data sets were collected to build the source data sets: Houston, Botswana, Kennedy Space Center (KSC), and Chikusei. The four data sets were acquired by different imaging spectrometers over different regions, with different ground sample distances and spectral ranges (as shown in Table 1). This ensures the diversity and richness of samples, which is conducive to meta-learning. There are 76 different ground objects contained in the four data sets, and the distribution of their respective labeled samples can be seen in Figures 6–9. We excluded the classes with fewer samples and selected only the 54 classes with more than 200 samples to build the source data set. In addition, 100 bands were selected on each data set via graph-representation-based band selection (GRBS) [51] instead of using all bands, so as to reduce spectral redundancy and guarantee a uniform number of bands (Table 2). GRBS, an unsupervised band selection method based on graph representation, performs well in both accuracy and efficiency. The spatial neighborhood of each pixel is set to 9 × 9 with reference to [25,39,48]. After the above processing, each HSI is transformed into a number of 9 × 9 × 100 data cubes, so as to standardize the data dimensions and optimize the learning process.
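The patch-extraction step can be sketched as follows. Reflect padding for border pixels is an assumption; the text does not specify how borders are handled.

```python
import numpy as np

def extract_cubes(hsi, coords, window=9):
    """Cut a (window x window x bands) data cube around each labeled pixel,
    a sketch of the preprocessing in Section 4.1 (9x9 neighborhoods on
    100 GRBS-selected bands).

    hsi:    (H, W, B) array, e.g., B = 100 after band selection.
    coords: iterable of (row, col) pixel positions.
    """
    r = window // 2
    padded = np.pad(hsi, ((r, r), (r, r), (0, 0)), mode="reflect")
    return np.stack([padded[i:i + window, j:j + window, :]
                     for i, j in coords])   # (N, 9, 9, 100)
```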

**Table 1.** Details of the source data sets. Kennedy Space Center (KSC), ground sample distance (GSD) (m), spatial size (pixels), spectral range (nm), airborne visible infrared imaging spectrometer (AVIRIS).



**Figure 6.** Houston data set. (**a**) Pseudocolor image. (**b**) Ground-truth map.

**Figure 7.** Botswana data set. (**a**) Pseudocolor image. (**b**) Ground-truth map.

**Figure 8.** Kennedy Space Center (KSC) data set. (**a**) Pseudocolor image. (**b**) Ground-truth map.

**Figure 9.** Chikusei data set. (**a**) Pseudocolor image. (**b**) Ground-truth map.

**Table 2.** The selected bands on the source data sets via graph-representation-based band selection (GRBS). Kennedy Space Center (KSC).


#### 4.1.2. Target Data Sets

Three well-known HSI data sets, i.e., the University of Pavia (UP), the Pavia Center (PC), and Salinas, were selected to build the target data sets. Table 3 shows the detailed information. In order to standardize data dimensions, we again used the GRBS method to select 100 bands for each HSI (Table 4) and set the spatial neighborhood to 9 × 9. Furthermore, five labeled samples per class were selected to build the fine-tuning data set, and the remaining samples were used as the testing data set. Consequently, we used three different HSI to build three different target data sets. The proposed method performs few-shot classification on each of the three target data sets, so as to verify its effectiveness.

In summary, Houston, Botswana, KSC, and Chikusei were used to build the source data sets, and UP, PC, and Salinas were used to build the target data sets. Therefore, the source data set and the target data set are completely different. In the target data sets, only a few labeled samples (five samples per class) were used to build the fine-tuning data sets to fine-tune the designed model. In order to make a fair comparison with other classification methods, fine-tuning data sets were also used for supervised training in comparison experiments (Section 4.3).
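A sketch of the fine-tuning/testing split described above is shown below. The label encoding (0 for unlabeled pixels) and the fixed seed are illustrative assumptions.

```python
import numpy as np

def split_fine_tune(labels, per_class=5, seed=0):
    """Split labeled pixels of a target HSI into a fine-tuning set
    (per_class samples per class) and a testing set, mirroring the
    protocol in Section 4.1.2. `labels` is a 1-D array of class ids
    aligned with the extracted data cubes; 0 marks unlabeled pixels.
    """
    rng = np.random.default_rng(seed)
    fine_tune, test = [], []
    for c in np.unique(labels[labels > 0]):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fine_tune += idx[:per_class].tolist()
        test += idx[per_class:].tolist()
    return np.array(fine_tune), np.array(test)
```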

**Table 3.** Details of three target data sets. University of Pavia (UP), Pavia Center (PC), ground sample distance (GSD) (m), spatial size (pixel), spectral range (nm), reflective optics system imaging spectrometer (ROSIS), airborne visible infrared imaging spectrometer (AVIRIS).


**Table 4.** The selected bands on the target data sets via GRBS. University of Pavia (UP), Pavia Center (PC).


#### *4.2. Experimental Setup*

Meta-learning is a very important phase for the proposed method. The main hyperparameters in meta-learning include the number of classes in each task, *C*; the number of support samples per class, *K*; and the number of query samples per class, *N*, which are directly related to building the learning task. Therefore, we first carried out experiments to explore the influence of *C*, *K*, and *N* on classification results.

The hyperparameter *C* determines the number of classes in each learning task, i.e., the complexity of the task. As described in Section 4.1, the source data set consists of 54 classes, so we explored the influence of *C* at 10, 20, 30, and 40. Figure 10 shows the experimental results. It can be seen that on the three different target data sets, the model always obtains the highest classification accuracy when *C* is 20. This indicates that when the number of classes in a task is too small, the model cannot carry out sufficient learning: for any given class in the source data sets, if *C* is too small, that class will appear less often in the tasks, which reduces the model's chances of learning from it. Conversely, when *C* is equal to 30 or 40, the complexity of the task exceeds the representation ability of the model, resulting in a significant decrease in classification accuracy.

**Figure 10.** Overall accuracy under different *C*.

*K* and *N* together determine the diversity and richness of samples in the task and directly affect the size of the task. With reference to [49], we fixed the task size at 20 samples per class (*K* + *N* = 20) and explored the influence of *K* and *N* on the classification results by trying different combinations. Table 5 shows the experimental results. It can be found that as *K* increases, the classification accuracy decreases gradually. When *K* is 1, the highest classification accuracy is obtained on all three data sets. This experimental result verifies the theory described in Section 2.2, i.e., setting *K* < *N* in the meta-learning phase imitates the subsequent few-shot classification process, so as to obtain better classification results.

**Table 5.** Overall accuracy (%) with different combinations of *K* and *N*.


Through the above experimental exploration, the optimal task setting in the meta-learning phase has been found, i.e., the 20-way 1-shot 19-query learning task. In order to further optimize the meta-learning process, the appropriate learning rate was analyzed. With reference to relevant experience, we analyzed the influence of learning rates of 0.01 and 0.001 on the loss function value, as shown in Figure 11. It can be seen that the loss value fluctuates noticeably, due to the diversity of the source data set and the randomness of tasks. Nevertheless, after approximately 2000 episodes, the 0.001 learning rate acquires a lower loss value, indicating that it enables the model to learn fully.

**Figure 11.** Loss value under different learning rates.

In addition, we utilized UP as the target data set to explore the influence of different network structure settings on classification results. Table 6 lists the specific structures of the feature learning module and the relation learning module and their corresponding classification accuracy. It should be noted that only the changed structure settings are listed in Table 6, while other basic settings, such as the batch normalization layers and ReLU activation functions, follow Section 3. The exploration of network settings can be divided into two parts: settings *NO*.1 to *NO*.4 change the feature learning module, and settings *NO*.5 to *NO*.7 change the relation learning module. It can be found that the *NO*.2 network settings achieve the best classification effect; their specific structure is consistent with the description in Sections 3.2 and 3.3. According to the experimental results in the table, the following three observations can be obtained:


In addition to the hyperparameters explored above, other basic experimental settings are given directly by referring to existing deep learning models. We used *Adam* as the optimization algorithm and set the number of episodes in the meta-learning phase to 10,000 and the number of episodes in the few-shot learning phase to 1000. In the Dropout layer, the probability of random discard is 50%. All convolution kernels are initialized by Xavier initialization [52].
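These settings translate into the following sketch; the helper name is illustrative.

```python
import torch
import torch.nn as nn

def configure(model, lr=0.001):
    """Apply the basic settings of Section 4.2: Xavier initialization for
    all convolution kernels and the Adam optimizer at learning rate 0.001.
    The caller then runs 10,000 meta-learning episodes and 1000 few-shot
    learning episodes, one sampled task per episode."""
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Conv3d)):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    return torch.optim.Adam(model.parameters(), lr=lr)
```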


**Table 6.** Overall classification accuracy (OA, %) on the UP data set under different network structure settings. The feature learning module (FLM), the relation learning module (RLM), max pooling (MP), fully connected layer (FC).

#### *4.3. Comparison and Analysis*

In order to verify the effectiveness of the proposed method in HSI few-shot classification, we compared the experimental results of RN-FSC with the widely used SVM, two classical semisupervised methods LapSVM and TSVM provided in [53], the deep learning model Res-3D-CNN [54], two semisupervised deep models SS-CNN [35] and DCGAN+SEMI [55], and the graph convolutional network (GCN) [56] model. SVM can map nonlinear data to linearly separable high-dimensional feature spaces utilizing the kernel method, so it can obtain a better classification effect than other traditional classifiers when processing high-dimensional HSI. LapSVM and TSVM are both classical semisupervised support vector machines. Res-3D-CNN constructs a deep classification model with the 3D convolutional layer and residual structure, which can make full use of the spatial–spectral information in HSI. By combining CNN and DCGAN with semisupervised learning, respectively, SS-CNN and DCGAN+SEMI can use the information of unlabeled samples for classification. GCN is also an advanced semisupervised classification model.

In order to quantitatively compare the classification performance of the above different methods, the overall accuracy (OA), classification accuracy per class, average accuracy (AA), and *Kappa* coefficient are used as evaluation indicators. The overall accuracy is the percentage of samples classified correctly in all samples, and the average accuracy is the average of classification accuracy per class. It should be noted that for RN-FSC, five labeled samples per class in the target data set were used for fine-tuning, and for other methods, five labeled samples per class were used for training. Tables 7–9 summarize the experimental results on the three different target data sets, from which the following five observations can be obtained:


features from labeled and unlabeled samples by building an end-to-end hierarchical framework, so they can obtain better classification results;


**Table 7.** Classification results of the different methods on the UP data set (5 samples per class in the fine-tuning data set for RN-FSC; 5 samples per class are used for training for other methods; bold values represent the best results among these methods).


**Table 8.** Classification results of the different methods on the PC data set (5 samples per class in the fine-tuning data set for RN-FSC; 5 samples per class are used for training for other methods; bold values represent the best results among these methods).



**Table 9.** Classification results of the different methods on the Salinas data set (5 samples per class in the fine-tuning data set for RN-FSC; 5 samples per class are used for training for other methods; bold values represent the best results among these methods).

In order to better compare and analyze the classification results of the above methods, Figures 12–14 respectively show their classification maps on the three target data sets. With the continuous improvement of classification accuracy, the noise and misclassification phenomena gradually decrease, and the classification maps gradually approach the ground-truth map. In fact, the results in Figures 12–14 and Tables 7–9 are consistent, and both prove the effectiveness of the proposed method.

**Figure 12.** Classification maps resulting from different methods on the UP data set.

**Figure 13.** Classification maps resulting from different methods on the PC data set.


**Figure 14.** Classification maps resulting from different methods on the SA data set.

In order to further verify that the observed increase in classification accuracy is statistically significant, we repeated the experiment 20 times for the different methods and carried out the paired *t*-test on OA. The paired *t*-test is a widely used statistical method to verify whether there is a significant difference between two groups of related samples [17,39]. In our test, if the result *t* is greater than 3.57, it indicates that there is a significant difference between the two groups of samples at the 99.9% confidence level. As seen from Table 10, all the results are greater than 3.57, indicating that the increase in classification accuracy is statistically significant.
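The test itself is standard; a sketch with illustrative numbers (not the paper's measured OAs) is shown below.

```python
import numpy as np
from scipy import stats

# Paired t-test on OA, a sketch of the Section 4.3 protocol: each method
# is run 20 times and the paired OA values are compared. The synthetic
# numbers below are for illustration only.
rng = np.random.default_rng(0)
oa_rn_fsc = 80 + rng.normal(0, 1.0, size=20)    # 20 repeated-run OAs (%)
oa_baseline = 75 + rng.normal(0, 1.5, size=20)  # paired baseline OAs (%)

t, p = stats.ttest_rel(oa_rn_fsc, oa_baseline)
# With 19 degrees of freedom, t > 3.57 corresponds to a significant
# difference at the 99.9% confidence level (the Table 10 criterion).
print(f"t = {t:.2f}, p = {p:.4f}, significant: {t > 3.57}")
```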


**Table 10.** Results of the paired *t*-test on three target data sets.

#### *4.4. Influence of the Number of Labeled Samples*

The objective of the experiments is to verify the classification effect of the proposed method on new HSI with only a few labeled samples. Therefore, it is necessary to explore the classification effect of the proposed method under different numbers of labeled samples. To this end, we randomly selected 5, 10, 15, 20, and 25 labeled samples per class to build the fine-tuning data set. Accordingly, we explored the classification results of other methods with 5, 10, 15, 20, and 25 labeled samples per class for training. Figure 15 shows the experimental results. It can be seen that the OA of all methods presents an increasing trend with the increase in the number of labeled samples. RN-FSC always has the highest classification accuracy, which indicates that it has the best adaptability to the number of labeled samples.

Experimental results from Tables 7–9 and Figure 15 have shown that the proposed method can achieve better classification results when classifying new HSI with only a few labeled samples. In order to further explore the influence of the number of labeled samples on the classification effect of RN-FSC, we conducted comparative experiments on the Salinas and Indian Pines data sets with reference to [57–59]. The Indian Pines data set, containing 16 classes from the Indian Pine test site in Northwestern Indiana, was collected by AVIRIS. Salinas and Indian Pines both contain 16 classes, and Indian Pines contains 4 small classes with fewer than 100 labeled samples, which can further verify the effectiveness of the classification method. In the experiments, 10% and 2% of the labeled samples were randomly selected to build the fine-tuning data sets (1083 labeled samples for Salinas and 1025 labeled samples for Indian Pines), which is far more than in the previous experiments. It should be noted that the selection of labeled samples per class is exactly the same as in [57–59]. EPF-B-g, EPF-B-c, EPF-G-g, EPF-G-c, and IEPF-G-g provided in [57–59] were selected for comparison with the proposed method. Table 11 shows the experimental results. On the Salinas data set, the OA and AA of RN-FSC are higher than those of the other methods. On the Indian Pines data set, the classification results of IEPF-G-g are the best, followed by those of RN-FSC. Overall, when the labeled samples are further increased (approximately 1000–1100 labeled samples for each data set), the proposed method can still obtain satisfactory results.

**Figure 15.** Classification accuracy under different numbers of labeled samples on the three target data sets.


**Table 11.** Classification results of the different methods on Salinas and Indian Pines.

#### *4.5. Exploration on the Effectiveness of Meta-Learning*

The learning process of the proposed method, RN-FSC, can be divided into two phases: meta-learning on the source data set and few-shot learning on the fine-tuning data set. As mentioned in the previous sections, the reason RN-FSC has a better classification effect in HSI few-shot classification is that it has acquired a large amount of feature knowledge and mastered the ability to learn how to learn through meta-learning. To verify this point, we carried out experiments to explore the influence of the meta-learning phase on the final classification results. Table 12 lists the overall accuracy with and without meta-learning under different numbers of labeled samples. The model without meta-learning can only perform supervised training with a few labeled samples in the fine-tuning data set, so its classification results are poor. On the UP, PC, and Salinas data sets, the meta-learning phase increases the classification accuracy by 20.20%, 10.73%, and 15.91%, respectively, when *L* = 5, which fully proves the effectiveness of meta-learning in HSI few-shot classification. In addition, it can be found that as the number of labeled samples increases, the difference between the results with and without meta-learning shows a decreasing trend. For example, on the UP data set, the difference is 20.20% when *L* = 5 and 10.83% when *L* = 25.


**Table 12.** Influence of the meta-learning phase on classification accuracy (OA, %). L is the number of labeled samples per class.

#### *4.6. Execution Time Analysis*

The execution time of general deep learning models usually consists of training time and testing time. As described in Section 2.3, the proposed method consists of three phases: meta-learning, few-shot learning, and classification. The biggest difference between RN-FSC and other general deep models for HSI classification is that it first performs meta-learning on previously collected source data sets and then classifies new HSI data sets that are completely different from the source data sets. In other words, after performing meta-learning in advance only once, RN-FSC can quickly classify all other new data sets, which is of great significance in practical applications. In our experiment, it takes approximately 12.83 h for the model to perform meta-learning. In practice, the model used to perform the classification task will already have completed meta-learning; therefore, the model only needs to perform few-shot learning and classification when processing the target HSI. Table 13 lists the execution times of DCGAN+SEMI, GCN, and RN-FSC on the three different target data sets; these methods are compared because they present better classification results than the others. The DCGAN+SEMI and GCN times include training and testing time, while the RN-FSC times include few-shot learning time and classification time. DCGAN+SEMI needs to train the generator and the discriminator separately, while GCN utilizes all the labeled samples for graph construction, so their training times are longer. RN-FSC only utilizes a few labeled samples for fine-tuning, so the few-shot learning time is shorter. However, since RN-FSC needs to calculate relation scores through comparison, its classification time is longer. Generally speaking, the execution time of RN-FSC is shorter than that of DCGAN+SEMI and GCN, which indicates that RN-FSC has better work efficiency.

**Table 13.** Execution times on three target data sets (5 samples per class are used as labeled samples).


#### *4.7. Discussion*

It is difficult for deep learning models to be fully trained and achieve promising classification results with only a few labeled samples. At the same time, for complex and diverse HSI, the working mode in which general deep learning models need to be trained from scratch every time is very inefficient and not desirable in practice. However, our method can obtain better classification results with only a few labeled samples (five samples per class) when processing new HSI. The root cause is the implementation of meta-learning, the core of which is the ability to learn how to learn. In our method, this ability is demonstrated in the form of comparison. The model maps the data space to a deep metric space, where it performs relation learning by comparing the similarity of sample features, i.e., the similarity between samples belonging to the same class is high and the relation score is high, whereas the similarity between samples belonging to different classes is low and the relation score is low. In fact, the form this ability takes is not unique in the field of meta-learning; it largely depends on the specific network structure and loss function.

The task-based learning strategy is key to performing meta-learning. Large numbers of randomly generated tasks from different HSI can effectively enhance the generalization ability of the model, because the model learns how to compare across different tasks instead of how to classify a specific data set. To acquire the best learning effect, we explored the optimal task setting, including the number of classes, the number of support samples, and the number of query samples in a task. Experiments showed that the support samples should be much fewer than the query samples, so as to fully simulate the situation of HSI few-shot classification. In addition, experiments were conducted to explore the influence of the learning rate, to further optimize the meta-learning process. At the same time, the network structure can directly affect the classification results. A new deep model based on a relation network was designed for HSI few-shot classification. In the feature learning module, the 3D convolutional layers can effectively utilize the spatial–spectral information to extract highly discriminative features. In addition, we found that the convolutional layers are necessary in the relation learning module, as they guarantee the comparison ability of the model to some extent.

Through detailed comparison and analysis, it can be demonstrated that the proposed method outperforms SVM, semisupervised SVM, and several supervised and semisupervised deep learning models with a few labeled samples. Moreover, the proposed method has better adaptability to the number of samples. The paired *t*-test shows that the increase in classification accuracy is statistically significant and not accidental. In addition, by comparing the results of the model with and without meta-learning, the importance of the meta-learning phase is directly proved again. Finally, the efficiency of different methods was compared, indicating the potential value of the proposed method in practical application.

#### **5. Conclusions**

Although the deep learning model has achieved great success in HSI classification, it still faces great difficulties in classifying new HSI with a few labeled samples. To this end, this paper proposes a new classification method based on a relation network for HSI few-shot classification. Meta-learning is the core of this method, and the network settings realize the ability to learn how to learn in the form of comparison in deep metric space, that is, the relation score between samples belonging to the same class is high, while the relation score between samples belonging to different classes is low. Benefitting from a large number of tasks generated from different data sets, the generalization ability of the model is constantly enhanced. Experiments on three different target data sets show that the proposed method outperforms traditional semisupervised SVM and semisupervised deep learning methods when only a few labeled samples are available.

**Author Contributions:** Methodology, K.G. and B.L.; investigation, X.Y., J.Q., and P.Z.; resources, X.Y., P.Z., and X.T.; writing—original draft preparation, K.G.; writing—review and editing, K.G. and B.L.; visualization, J.Q. and X.T.; supervision, X.Y., P.Z., and X.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China under Grant 41801388.

**Acknowledgments:** The authors would like to thank Yokoya for providing the data used in this study. The authors would also like to thank all the professionals for kindly providing the codes associated with the experiments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Hyperspectral Image Classification Based on a Shuffled Group Convolutional Neural Network with Transfer Learning**

#### **Yao Liu 1, Lianru Gao 2,\*, Chenchao Xiao 1, Ying Qu 3, Ke Zheng <sup>2</sup> and Andrea Marinoni <sup>4</sup>**


Received: 4 May 2020; Accepted: 27 May 2020; Published: 1 June 2020

**Abstract:** Convolutional neural networks (CNNs) have been widely applied in hyperspectral imagery (HSI) classification. However, their classification performance might be limited by the scarcity of labeled data to be used for training and validation. In this paper, we propose a novel lightweight shuffled group convolutional neural network (abbreviated as SG-CNN) to achieve efficient training with a limited training dataset in HSI classification. SG-CNN consists of SG conv units that employ conventional and atrous convolution in different groups, followed by a channel shuffle operation and shortcut connection. In this way, SG-CNNs have fewer trainable parameters, whilst they can still be accurately and efficiently trained with fewer labeled samples. Transfer learning between different HSI datasets is also applied on the SG-CNN to further improve the classification accuracy. To evaluate the effectiveness of SG-CNNs for HSI classification, experiments have been conducted on three public HSI datasets, with pretraining on HSIs from different sensors. SG-CNNs with different levels of complexity were tested, and their classification results were compared with fine-tuned ShuffleNet2, ResNeXt, and their original counterparts. The experimental results demonstrate that SG-CNNs can achieve competitive classification performance when the amount of labeled data for training is limited, as well as efficiently providing satisfying classification results.

**Keywords:** lightweight convolutional neural networks; deep learning; hyperspectral imagery classification; transfer learning

#### **1. Introduction**

Hyperspectral sensors are able to grasp detailed information on objects and phenomena on Earth's surface by sensing their spectral characteristics in a large number of channels (bands) over a wide portion of the electromagnetic spectrum. Such rich spectral information allows hyperspectral imagery (HSI) to be used for interpretation and analysis of surface materials in a more thorough way. Accordingly, hyperspectral remote sensing has been widely used in several research fields, such as environmental monitoring [1–3], land management [4–6], and agriculture [7–9].

Land cover classification is an important HSI analysis task that aims to label every pixel in the HSI with its unique type [10]. In the past several decades, various classification methods have been developed based on spectral features [11,12] or spatial-spectral features [13–15]. Recently, deep-learning (DL)-based methods have attracted increasing attention for HSI classification [16]. Compared to traditional methods that require sophisticated feature extraction methods [17], DL methods allow models to automatically extract hidden features and learn parameters from labeled samples. Existing DL methods include fully connected feedforward neural networks [18–20], convolutional neural networks (CNNs) [21–23], recurrent neural networks (RNNs) [24,25], and so on. Among these networks, CNN has become the major deep learning framework applied to hyperspectral image classification, as it can maintain the local invariance of the image and has a relatively small number of coefficients to be tuned [26].

For HSI classification, the scarcity of labeled data to be used for training is a common problem [27]. Nonetheless, supervised DL methods require large training datasets to achieve accurate classification results [28]. Since data labeling is time-consuming and costly, many techniques have been developed to deal with HSI classification on small datasets, such as data augmentation [29–31] and transfer learning [32–38]. Data augmentation is an effective technique that artificially enlarges the size of a training dataset by creating modified versions of its samples, e.g., by flipping and rotating the original sample image [30]. On the other hand, transfer learning reuses a trained model and adapts it to a related new task, alleviating the requirement for large-scale labeled samples for effective training. In [32,33], transfer learning was employed between HSI records acquired by the same sensor. Recently, HSI classification based on cross-sensor transfer learning has become a hot topic within the scientific community, since it allows high accuracy to be achieved by combining the information retrieved from multiple hyperspectral images [34–38]. In these studies, efficient network architectures were proposed with units that have only a few parameters to be tuned (e.g., separable convolutions [34], the bottleneck unit [36]) and deeper layers that can accurately extract complex features (e.g., VGGNet in [35], ResNet in [36]). However, with tens of layers in these CNNs, the number of parameters can easily reach several hundred thousand, or even millions, and hyperparameters need to be carefully tuned for these networks to avoid overfitting. When labeled samples are scarce (in terms of quality, reliability, or size), a simpler structure is suitable to reduce the risk of overfitting. Accordingly, we propose a new CNN called the shuffled group convolutional neural network (SG-CNN). The SG-CNN has efficient building blocks called SG conv units and does not contain a large number of parameters. In addition, we applied the SG-CNN with transfer learning between HSIs of different sensors to improve the classification performance with limited samples.

The main contributions of this study are summarized as follows.

(1) We propose a DL-based method that brings improvement to HSI classification with limited samples through transfer learning on the new proposed SG-CNN. The SG-CNN reduces the number of parameters and computation time whilst guaranteeing high classification accuracy.

(2) To conduct transfer learning, a simple dimensionality reduction strategy is put forward to keep the dimensions of the input data consistent. This strategy is easy and fast to perform and requires no labeled samples from the HSIs. The bands of the original HSI datasets are selected according to this strategy to ensure that both the source data and target data have the same number of bands as SG-CNN inputs.

The remainder of this paper is organized as follows. Section 2 gives a detailed illustration of the proposed framework for classification, including the structure of the network and the new proposed SG conv unit. Datasets, experimental setup, as well as classification results and analysis are given in Section 3. Finally, conclusions are presented in Section 4.

#### **2. Proposed Method**

As previously mentioned, DL models have been applied to HSI classification with satisfying performance. However, as a lack of sufficient samples is typical for HSI, there is still room for improvement in DL-based classification methods. Inspired by lightweight networks [39,40] and the effects of atrous convolution in semantic segmentation tasks [41–43], we developed a new lightweight CNN for HSI classification. This section presents the structure of the proposed network as well as how it is applied to transfer learning.

#### *2.1. A SG-CNN-Based Classification Framework*

The framework of the proposed classification is shown in Figure 1. It consists of three parts: (1) dimensionality reduction (DR), (2) sample generation, and (3) SG-CNN for feature extraction and classification.

First, DR is conducted to ensure that the SG-CNN input data from both the source and target HSIs have the same dimensions. Considering that typical HSIs have 100–200 bands and generally require fewer than 20 bands to summarize the most informative spectral features [44], a simple band reduction strategy is implemented, and the number of bands is fixed to 64 for the CNN input data. These 64 bands are selected at approximately equal intervals from the original HSI. Specifically, given HSI data with *Nb* bands, the intervals and the number of bands per interval are determined as follows.

(1) Two intervals are used, set to ⌊*Nb*/64⌋ and ⌊*Nb*/64⌋ + 1, respectively, where ⌊·⌋ represents the floor operation of its input.

(2) Assume *x* and *y* are the numbers of bands selected at these two intervals, respectively. Then we have the following equations:

$$\begin{cases} x + y = 64\\ \lfloor N\_b/64 \rfloor \, x + \left( \lfloor N\_b/64 \rfloor + 1 \right) y = N\_b \end{cases} \tag{1}$$

where *x* and *y* are solved using these linear equations. The 64 selected bands of both source and target data are thus determined. Compared with band selection methods, this DR strategy retains more bands but is very easy and fast to implement.
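The following sketch illustrates this strategy in Python (the language used for the experiments in this paper); the ordering of the two step sizes is our assumption, as the text does not specify how they are interleaved:

```python
import numpy as np

def select_64_bands(hsi, n_out=64):
    """Equal-interval band selection per Eq. (1); assumes N_b >= n_out.
    Solving x + y = n_out and base*x + (base+1)*y = N_b gives
    y = N_b - base*n_out and x = n_out - y."""
    n_b = hsi.shape[-1]
    base = n_b // n_out                # floor(N_b / 64)
    y = n_b - base * n_out             # bands taken at interval base + 1
    x = n_out - y                      # bands taken at interval base
    idx, pos = [], 0
    for i in range(n_out):
        idx.append(pos)
        pos += base if i < x else base + 1
    return hsi[..., idx]
```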

Second, an *S* × *S* × 64-sized cube is extracted as a sample from a window centered around a labeled pixel, where *S* is the window size and 64 is the number of bands. The label of the center pixel in the cube is used as the sample's label. In addition, we used the mirroring preprocessing in [23] to ensure sample generation for pixels belonging to image borders.
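As a minimal sketch (assuming an odd window size *S*, as used in the experiments), the mirror-padded sample extraction can be written as:

```python
import numpy as np

def extract_sample(hsi, row, col, S=19):
    """Cut an S x S x 64 cube centered on the labeled pixel (row, col),
    mirror-padding the image so border pixels also yield full cubes."""
    half = S // 2
    padded = np.pad(hsi, ((half, half), (half, half), (0, 0)), mode="reflect")
    return padded[row:row + S, col:col + S, :]  # label = label of (row, col)
```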

Finally, samples are fed to the SG-CNN that mainly consists of two parts to achieve classification: (1) the input data are put through SG conv units for feature extraction; (2) the output of the last SG conv unit is subject to global average pooling and then fed to a fully connected (FC) layer, further predicting the sample class using the softmax activation function.

**Figure 1.** Shuffled group convolutional neural network (SG-CNN)-based hyperspectral imagery (HSI) classification framework.

#### *2.2. SG Conv Unit*

Networks with a large number of training parameters can be prone to overfitting. To tackle this issue, we designed a lightweight SG conv unit inspired by the structure in ResNeXt [45]. In the SG conv units, group convolution is used to decrease the number of parameters. We used not only conventional convolution, but also introduced atrous convolution into the group convolution, followed by a channel shuffle operation; this is a major difference with respect to the ResNeXt structure. To further boost the training efficiency, batch normalization [46] and a shortcut connection [47] were also included in this unit.

The details of this unit are displayed in Figure 2. From top to bottom, this unit mainly contains a 1 × 1 convolution, group convolution layers followed by channel shuffle, and another 1 × 1 convolution, whose output is added to the input of this unit and then fed to the next SG conv unit or the global average pooling layer. Specifically, in the group convolution, half the groups perform conventional convolutions, while the other half employ stacked convolutional layers with different dilation rates. The inclusion of atrous convolution is motivated by its ability to enlarge the receptive field without increasing the number of parameters. Moreover, atrous convolution has shown outstanding performance in semantic segmentation [41–43], whose task is similar to HSI classification, i.e., to label every pixel with a category. In addition, since stacked group convolutions only connect to a small fraction of input channels, channel shuffle (Figure 2b) is performed to make the group convolution layers more powerful through connections between different groups [39,40].

**Figure 2.** SG conv unit: (**a**) A SG conv unit has a 1 × 1 convolution, group convolution layers followed by channel shuffle, another 1 × 1 convolution, and a shortcut connection. (**b**) Channel shuffle operation in the SG conv unit mixes groups that have conventional convolution and atrous convolution.
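A Keras sketch of this unit is given below. It follows the description above (grouped 3 × 3 convolutions, half conventional and half atrous, channel shuffle, 1 × 1 convolutions, batch normalization, and a shortcut connection), but the exact placement of normalization and activations inside the authors' unit is our assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_shuffle(t, groups):
    # Interleave channels across groups (Figure 2b); assumes static shapes.
    h, w, ch = t.shape[1], t.shape[2], t.shape[3]
    t = tf.reshape(t, [-1, h, w, groups, ch // groups])
    t = tf.transpose(t, [0, 1, 2, 4, 3])
    return tf.reshape(t, [-1, h, w, ch])

def sg_conv_unit(x, channels, groups=8, dilations=(1, 3, 5)):
    # x must already have `channels` channels for the shortcut addition.
    shortcut = x
    y = layers.Conv2D(channels, 1, padding="same", activation="relu")(x)
    per_group = channels // groups
    branches = []
    for g in range(groups):
        z = y[..., g * per_group:(g + 1) * per_group]
        if g < groups // 2:              # conventional-convolution groups
            z = layers.Conv2D(per_group, 3, padding="same")(z)
        else:                            # atrous groups with rising dilation
            for r in dilations:
                z = layers.Conv2D(per_group, 3, padding="same",
                                  dilation_rate=r)(z)
        branches.append(z)
    y = layers.Concatenate()(branches)
    y = layers.Lambda(lambda t: channel_shuffle(t, groups))(y)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(channels, 1, padding="same")(y)
    return layers.Add()([shortcut, y])
```

With dilation rates 1, 3, and 5, the stacked 3 × 3 atrous convolutions reach a receptive field of 3 + 6 + 10 = 19, matching the sample size used in Section 3.2.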

#### *2.3. Transfer Learning between HSIs of Different Sensors*

In order to improve the classification results for HSI data with limited samples, transfer learning was applied to the SG-CNN. As shown in Figure 3, this process consisted of two stages: pretraining and fine-tuning. Specifically, the SG-CNN was first trained on the source data that had a large number of samples, and then it was fine-tuned on the target data with fewer samples. In the fine-tuning stage, apart from parameters in the FC layer, all other parameters from the pretrained network were used in the initialization to train the SG-CNN; parameters in the FC layer were randomly initialized.
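In Keras terms, this parameter reuse can be sketched as follows (assuming, for illustration, that the pretrained model ends with a global average pooling layer followed by a single FC layer):

```python
from tensorflow.keras import layers, models

def build_finetune_model(pretrained, n_target_classes):
    """Reuse all pretrained layers except the final FC layer, which is
    replaced and randomly initialized for the target classes."""
    features = pretrained.layers[-2].output   # global average pooling output
    logits = layers.Dense(n_target_classes, activation="softmax")(features)
    return models.Model(inputs=pretrained.input, outputs=logits)
```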

#### **3. Experimental Results**

Extensive experiments were conducted on public hyperspectral data to evaluate the classification performance of our proposed transfer learning method.


**Figure 3.** Transfer learning process: (**a**) pretrain the SG-CNN with samples from source HSI data, (**b**) fine-tune the SG-CNN for target HSI data classification.

#### *3.1. Datasets*

Six widely known hyperspectral datasets were used in this experiment. These hyperspectral scenes included Indian Pines, Botswana, Salinas, DC Mall, Pavia University (i.e., PaviaU), and Houston from the 2013 IEEE Data Fusion Contest (referred to as Houston 2013 hereafter). Indian Pines and Salinas were collected by the 224-band Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). Botswana was acquired by the Hyperion sensor onboard the EO-1 satellite, which acquires 242 bands covering the 0.4–2.5 μm range. DC Mall was gathered by the Hyperspectral Digital Imagery Collection Experiment (HYDICE). PaviaU and Houston 2013 were acquired by the ROSIS and CASI sensors, respectively. Detailed information about these data is listed in Table 1; uncalibrated or noisy bands covering the water absorption regions have been removed from these datasets.

Three pairs of transfer learning experiments were designed using these six datasets: (1) pretrain on the Indian Pines scene, and fine-tune on the Botswana scene; (2) pretrain on the PaviaU scene, and fine-tune on the Houston 2013 scene; (3) pretrain on the Salinas scene, and fine-tune on the DC Mall scene. The experiments were designed in this way for two reasons: (1) the source data and target data were collected by different sensors, but they were similar in terms of spatial resolution and spectral range; (2) the source data have more labeled samples in each class than the target data. Although slight differences in band wavelengths may exist between the source and target data, SG-CNNs automatically adapt their parameters to extract spectral features for the target data in the fine-tuning process.


**Table 1.** Hyperspectral datasets used in the experiment.

\* Only nine classes having most labeled samples were used from the Indian Pines data. Other classes with fewer training samples were excluded from the experiment.

#### *3.2. Experimental Setup*

To evaluate the performance of the proposed classification framework, classification results on three target datasets were compared with those predicted by two baseline models, i.e., ShuffleNet V2 (abbreviated as ShuffleNet2) [40] and ResNeXt [45]. ShuffleNet2 is well-known for its speed–accuracy tradeoff. ResNeXt consists of building blocks with group convolution and shortcut connections, which are also used in the SG-CNN. It is worth noting that we used ShuffleNet2 and ResNeXt with fewer building blocks than their original models, considering the limited samples of HSIs. Specifically, the convolution layers in Stages 3 and 4 of ShuffleNet2 were removed, and the number of output channels was set to 48 for the Stage 2 layers; for the ResNeXt model, only one building block was retained. For further details on the ShuffleNet2 and ResNeXt architectures, the reader is referred to [40,45]. In addition, the simplified ShuffleNet2 and ResNeXt were both trained on the original target HSI data as well as fine-tuned on the 64-band target data using a corresponding network pretrained on the 64-band source data. Classification results obtained from transfer learning of the baseline models are referred to as ShuffleNet2\_T and ResNeXt\_T, respectively. In addition, we performed transfer learning with SG-CNNs throughout the experiment.

Three SG-CNNs with different levels of complexity were tested for evaluation (see Table 2). SG-CNN-X denotes the SG-CNN with X layers of convolution. It is worth noting that ResNeXt and SG-CNN-8 have the same number of layers, and the only difference between their structures is the introduction of atrous convolution for half the groups and the shuffle operation in the SG-CNN-8 model. The number of groups was fixed to eight for both the SG-CNNs and ResNeXt, and the sample size was set to 19 × 19. In the SG conv unit, the dilation rates of the three atrous convolutions were set to 1, 3, and 5 to obtain a receptive field of 19 (i.e., the full size of a sample).


**Table 2.** Overall SG-CNN architecture with different levels of complexity.

Groups that have conventional convolution in SG conv units are omitted in the table, as this operation is the same as the first layer of subsequent atrous convolution layers with a dilation rate of 1 (i.e., r = 1).

Before network training, the original data were normalized to guarantee input values within 0 to 1. Data augmentation techniques (including horizontal and vertical flips) were used to increase the training samples. All classification methods were implemented in Python with the high-level APIs TensorFlow [48] and Keras. To further alleviate possible overfitting, the sum of the multi-class cross entropy and an L2 regularization term was taken as the loss function, and we set the weight decay to 5 × 10⁻⁴ in the L2 regularizer. The Adam optimizer [49] was adopted with an initial learning rate of 0.001, and the learning rate was reduced to one-fifth of its value if the validation loss did not decrease for 10 epochs. We used a mini-batch size of 32 on an NVIDIA GEFORCE RTX 2080Ti GPU. The number of epochs was set to 150–250 for different datasets, determined based on the number of training samples.
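This training configuration can be sketched in Keras as follows (with `model`, `train_ds`, and `val_ds` assumed to exist; the L2 term is attached per layer via `kernel_regularizer=tf.keras.regularizers.l2(5e-4)`):

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",   # plus per-layer L2 regularization
    metrics=["accuracy"],
)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.2, patience=10  # one-fifth after 10 idle epochs
)
model.fit(train_ds, validation_data=val_ds,
          epochs=150, batch_size=32, callbacks=[reduce_lr])
```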

#### *3.3. Experiments on Indian Pines and Botswana Scenes*

The false-color composites of the Indian Pines and Botswana scenes are displayed in Figures 4 and 5, together with their corresponding ground truth. Table 3 gives the number of labeled pixels randomly selected for training in the pretraining and fine-tuning stages; the remaining labeled samples were used for testing.

**Figure 4.** The Indian Pines scene: (**a**) false-color composite image; (**b**) ground truth.

**Figure 5.** The Botswana scene: (**a**) false-color composite image; (**b**) ground truth.


**Table 3.** The number of training and test samples used in the Indian Pines and Botswana datasets.

The loss function of the SG-CNNs converged within the 150 epochs of training, indicating no overfitting during the fine-tuning process (see Figure 6). Classification results obtained by the SG-CNNs were then compared with those of the other methods in Table 4 for the Botswana scene. A range of criteria, including overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (K), was reported, together with the classification accuracy of each class and the training time. OA and AA are defined as below:

$$OA = \frac{\sum\_{i=1}^{n} C\_{i}}{\sum\_{i=1}^{n} S\_{i}} \tag{2}$$

$$AA = \frac{1}{n} \sum\_{i=1}^{n} \frac{C\_i}{S\_i} \tag{3}$$

where *Ci* is the number of correctly predicted samples out of *Si* samples in class *i*, and *n* is the number of classes.
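For reference, a small Python sketch of these two criteria (names are illustrative):

```python
import numpy as np

def oa_aa(y_true, y_pred, n_classes):
    """Overall accuracy (Eq. 2) and average accuracy (Eq. 3):
    C_i correctly predicted samples out of S_i samples in class i."""
    correct = np.zeros(n_classes)
    total = np.zeros(n_classes)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    oa = correct.sum() / total.sum()
    aa = float(np.mean(correct / total))
    return oa, aa
```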

**Figure 6.** Convergence curves during the fine-tuning process of the Botswana scene: (**a**) SG-CNN-7, (**b**) SG-CNN-8, (**c**) SG-CNN-12.

Based on the results in Table 4, several preliminary conclusions can be drawn as follows.

(1) Compared with the baseline models, SG-CNNs typically achieve better classification performance, providing higher accuracy while spending relatively less training time. Specifically, the overall accuracy of the SG-CNNs was 98.97–99.65%, which was, on average, ∼1% and ∼3.5% higher than the ResNeXt and ShuffleNet2 models, respectively. In addition, SG-CNN-7 and SG-CNN-8 were shown to be quite efficient, as the execution time of their fine-tuning process was comparable to that of ShuffleNet2\_T and ResNeXt\_T. As an effect of its more complicated structure with more trainable parameters, SG-CNN-12 required a longer time to fine-tune.

(2) As mentioned in Section 3.2, SG-CNN-8 can be seen as the baseline ResNeXt model that introduces atrous convolution and channel shuffle into its group convolution. Comparing the classification results of these two models, we can appreciate that the inclusion of atrous convolution and channel shuffle improved the classification.

(3) For the baseline models, both ShuffleNet2\_T and ResNeXt\_T, which were fine-tuned on the 64-band target data, obtained similar accuracy with much lower execution time, compared with their counterparts that were directly trained from original HSIs. This indicates that the simple band selection strategy applied in transfer learning can generally help to enhance the training efficiency.


**Table 4.** Classification accuracy (%) and computation time of the Botswana scene. A total of 420 labeled samples (30 per class) were used for fine-tuning. The No. column refers to the corresponding class in Table 3. The best results are in **bold**.

For the SG-CNNs, all classification results are obtained with fine-tuning on the target data based on a pretrained model using the source data.

Our second test with the Botswana scene evaluated the classification performance of transfer learning with SG-CNNs using varying sample sizes. Specifically, 15, 30, 45, 60, and 75 samples per class from the Botswana scene were used, respectively, to fine-tune the pretrained SG-CNNs, and their classification performance was evaluated using the OAs of the corresponding remaining samples (i.e., the test samples). Meanwhile, the same samples used for fine-tuning the SG-CNNs were utilized to train ShuffleNet2 and ResNeXt and to fine-tune ShuffleNet2\_T and ResNeXt\_T. These models were also assessed with the OA of the test samples. Figure 7 displays the OAs on the test dataset for the different classification methods with different numbers of training samples. Several conclusions can be drawn:

(1) Compared with ShuffleNet2, ShuffleNet2\_T, and ResNeXt, SG-CNNs showed a remarkable improvement in classification by providing higher accuracy, especially when labeled samples were relatively few (i.e., 15–60 samples per class).

(2) Compared with ResNeXt\_T, SG-CNNs generally yielded better classification results when the training samples were limited (i.e., 15–45 per class). As the number of samples increased to 60–75 for each class, ResNeXt\_T provided comparable accuracy.

(3) Although SG-CNN-12 generally achieved the best performance, its classification accuracy was merely 0.1–0.7% higher than that of SG-CNN-7 and SG-CNN-8. However, the latter two required less execution time for fine-tuning than the former. In other words, SG-CNN-7 and SG-CNN-8 had better tradeoffs between classification accuracy and efficiency.

**Figure 7.** Overall classification accuracies of the test samples based on various methods trained/fine-tuned with 15–75 labeled samples for the Botswana scene.

#### *3.4. Experiments on PaviaU and Houston 2013 Scenes*

The PaviaU and Houston 2013 datasets are displayed with their labeled sample distributions in Figures 8 and 9. Figure 8 shows that the PaviaU scene contained five manmade types, two types of vegetation, and one type each for soil and shadow. As shown in Figure 9, the Houston 2013 scene had nine manmade types, four types of vegetation, and one type each for soil and water. The surface type distributions were similar in these two scenes. ShuffleNet2, ResNeXt, and SG-CNNs were fine-tuned on the Houston 2013 scene, with pretrained models acquired from training on the PaviaU dataset. Table 5 displays the number of samples used in the experiment. Six hundred labeled samples per class in the PaviaU scene were utilized to pretrain the models, whereas 100 randomly selected samples per class in the Houston scene were used for fine-tuning.

**Figure 8.** The PaviaU scene: (**a**) false-color composite image; (**b**) ground truth.

**Figure 9.** Houston 2013 scene: (**a**) true-color composite image; (**b**) ground truth.


**Table 5.** The number of training and test samples for PaviaU and Houston 2013 datasets.

Convergence curves of the loss function are shown in Figure 10 for the fine-tuning of SG-CNNs applied to the Houston 2013 scene. Classification results acquired from SG-CNNs and baseline models are detailed in Table 6. As shown in Table 6, SG-CNNs with different levels of complexity achieved higher classification accuracies than those of ShuffleNet2, ShuffleNet2\_T, ResNeXt, and ResNeXt\_T. Specifically, SG-CNN-12 provided the best classification results with the highest OA (99.45%), AA (99.40%), and Kappa coefficient (99.35%), and it also achieved the highest classification accuracy for eight classes in the test samples. Comparing the results from SG-CNN-8 and ResNeXt\_T, the former obtained a slightly higher OA than the latter but spent less than half the training time, indicating the SG conv unit's effectiveness for classification improvement. In addition, fine-tuned ResNeXt\_T and ShuffleNet2\_T yielded better results than the original ResNeXt and ShuffleNet2. Hence, this confirms the previous conclusion that our band selection strategy applied in transfer learning boosts the classification performance.

**Figure 10.** Convergence curves during the fine-tuning process of the Houston 2013 scene: (**a**) SG-CNN-7, (**b**) SG-CNN-8, and (**c**) SG-CNN-12.

**Table 6.** Classification accuracy (%) and computation time of the Houston 2013 scene. A total of 1500 labeled samples (100 per class) were used for fine-tuning. The No. column refers to the corresponding class in Table 5. The best results are in **bold**.


Classification experiments with varying numbers of training samples were also conducted. Specifically, 50–250 samples per class in the Houston scene were used for fine-tuning the SG-CNNs, as well as for training or fine-tuning the baseline networks. OAs of the remaining test samples are shown in Figure 11 for all the methods. Some conclusions can be reached from making comparisons between these results:

(1) As training samples varied from 50 to 250 per class, SG-CNNs outperformed ShuffleNet2, ShuffleNet2\_T, and ResNeXt for the Houston 2013 scene classification. The accuracies of the fine-tuned SG-CNNs were ∼1.3–7.4% higher than those of the other three baseline networks, indicating that SG-CNNs greatly improved the classification performance with both limited and sufficient samples.

(2) Compared with ResNeXt\_T, SG-CNNs obtained better results when few samples were provided (i.e., 50–100 per class). As the number of samples increased to 150–250 per class, ResNeXt\_T and the SG-CNNs achieved comparable accuracy. This suggests that SG-CNNs perform better with limited samples.

(3) In general, SG-CNN-12 provided the highest classification accuracy among the three SG-CNNs. However, as the number of training samples increased, the performance of SG-CNN-12 showed no obvious improvement compared to SG-CNN-7 and SG-CNN-8, which are more efficient and require less computing time.

**Figure 11.** Overall classification accuracies of the test samples based on various methods trained/fine-tuned with 50–250 labeled samples for the Houston 2013 scene.

#### *3.5. Experiments on Salinas and DC Mall Scenes*

Salinas and DC Mall images and their labeled samples are shown in Figures 12 and 13, respectively. It is important to note that surface types were quite different between these two scenes. The Salinas scene mainly consisted of natural materials (i.e., vegetation and three types of fallow), whereas the DC Mall scene included grass, trees, shadows, and three manmade materials. Table 7 provides the number of samples used as training and test datasets. Five hundred samples of each class in the Salinas scene were randomly selected for base network training, whereas 100 samples of each class in the DC Mall scene were used for fine-tuning.

**Figure 12.** The Salinas scene: (**a**) false-color composite image; (**b**) ground truth.

The loss function of SG-CNNs converged during the fine-tuning for the DC Mall scene (see Figure 14). The classification results of both baseline models and SG-CNNs are listed in Table 8 with their corresponding training time. As shown in Table 8, similar conclusions can be reached from the DC Mall experiment. First, SG-CNNs outperformed the baseline models in terms of classification results. Moreover, SG-CNN-8 had an OA nearly 10% higher than that of ResNeXt\_T, indicating the improvement brought by the proposed SG conv unit. Furthermore, although the target data and source data had different surface types, transfer learning on the SG-CNNs led to major improvement in the classification accuracy.

**Figure 13.** The DC Mall scene: (**a**) false-color composite image; (**b**) ground truth.

**Table 7.** The number of training and test samples for Salinas and DC Mall datasets.


**Figure 14.** Convergence curves during the fine-tuning process for the DC Mall scene: (**a**) SG-CNN-7, (**b**) SG-CNN-8, and (**c**) SG-CNN-12.

Analogously, our second test on the DC Mall scene evaluated the classification performance of the proposed method with varying sizes of labeled samples. We used 50–250 samples per class, at intervals of 50, to train ShuffleNet2 and ResNeXt and to fine-tune the SG-CNNs, ShuffleNet2\_T, and ResNeXt\_T. Figure 15 shows the OAs for the test samples from all methods. In the DC Mall experiment, SG-CNNs outperformed all baseline models, including ResNeXt\_T, even when a large number of training samples (e.g., 250 samples per class) was provided. Specifically, the OA of the SG-CNNs was higher than that of the other methods by 5.3–18.2%, which confirms the superiority of our proposed method. For the DC Mall dataset, SG-CNN-12 achieved better results when samples were relatively limited (i.e., 50–150 samples per class). With 200–250 training samples in each category, SG-CNN-7 and SG-CNN-8 required less time to obtain accuracy comparable to that of SG-CNN-12.

**Table 8.** Classification accuracy (%) and computation time of the DC Mall scene. A total of 600 labeled samples (100 per class) were used for fine-tuning. The No. column refers to the corresponding class in Table 7. The best results are in **bold**.


**Figure 15.** Overall classification accuracies of the test samples based on various methods trained/fine-tuned with 50–250 labeled samples for the DC Mall scene.

#### **4. Conclusions**

Typically, only limited labeled samples are available for HSI classification. To improve HSI classification under such conditions, we proposed a new CNN-based classification method that performs transfer learning between different HSI datasets on a new lightweight CNN. This network, named SG-CNN, consists of SG conv units, which combine group convolution, atrous convolution, and a channel shuffle operation. In the SG conv unit, group convolution is utilized to reduce the number of parameters, while channel shuffle is employed to connect information in different groups. Also, atrous convolution is introduced in addition to conventional convolution in the groups so that the receptive field is enlarged. To further improve the classification performance with limited samples, transfer learning was applied to the SG-CNNs, with a simple dimensionality reduction implemented to keep the dimensions of the input data consistent for both the source and target data.

To evaluate the classification performance of the proposed method, transfer learning experiments were performed with SG-CNNs on three pairs of public HSI scenes. Specifically, three SG-CNNs with different levels of complexity were tested. Compared with ShuffleNet V2, ResNeXt, and their fine-tuned models, the proposed method considerably improved classification results when the training samples were limited, and it also enhanced model efficiency by reducing the computing cost of the training process. This suggests that the combination of atrous convolution with group convolution is effective for training with limited samples, and that the band selection method can be helpful for transfer learning.

**Author Contributions:** Conceptualization, Y.L.; Funding acquisition, Y.L. and A.M.; Resources, C.X.; Supervision, L.G.; Writing—original draft, Y.L.; Writing—review & editing, Y.Q., K.Z. and A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China under Grant No. 41901304, No. 41722108, and also funded in part by the Centre for Integrated Remote Sensing and Forecasting for Arctic Operations (CIRFA) and the Research Council of Norway (RCN Grant no. 237906), and by the Fram Center under the Automised Large-scale Sea Ice Mapping (ALSIM) "Polhavet" flagship project.

**Acknowledgments:** The authors would like to thank http://www.ehu.eus/ for providing the original remote sensing images.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**



### *Article* **Hyperspectral Imagery Classification Based on Multiscale Superpixel-Level Constraint Representation**

#### **Haoyang Yu 1, Xiao Zhang 1, Meiping Song 1,\*, Jiaochan Hu 2, Qiandong Guo <sup>3</sup> and Lianru Gao <sup>4</sup>**


Received: 2 September 2020; Accepted: 10 October 2020; Published: 13 October 2020

**Abstract:** Sparse representation (SR)-based models have been widely applied for hyperspectral image classification. In our previously established constraint representation (CR) model, we exploited the underlying significance of the sparse coefficient and proposed the participation degree (PD) to represent the contribution of the training samples in representing the testing pixel. However, the spatial variants of the original residual error-driven frameworks often face obstacles to optimization due to their strong constraints. In this paper, based on the object-based image classification (OBIC) framework, we first propose a spectral–spatial classification method called superpixel-level constraint representation (SPCR). It uses the PD derived from the sparse coefficient of the CR model, and then transforms the individual PD into a united activity degree (UAD)-driven mechanism via a spatial constraint generated by the superpixel segmentation algorithm. The final classification is determined based on the UAD-driven mechanism. Considering that SPCR is susceptible to the segmentation scale, an improved multiscale superpixel-level constraint representation (MSPCR) is further proposed through the decision fusion of SPCR results at different scales. The SPCR method is first performed at each scale, and the final category of the testing pixel is determined by the maximum number of predicted labels among the classification results at each scale. Experimental results on four real hyperspectral datasets, including a GF-5 satellite dataset, verified the efficiency and practicability of the two proposed methods.

**Keywords:** hyperspectral remote sensing; image classification; constraint representation; superpixel segmentation; multiscale decision fusion

#### **1. Introduction**

Hyperspectral remote sensing is a leading technology developed from remote sensing (RS) in the field of Earth observation, which accesses multidimensional information by combining imaging technology and spectral technology [1,2]. A hyperspectral image (HSI) can be viewed as a data cube with a diagnostic continuous spectrum, providing abundant spectral–spatial information, and different substances usually exhibit diverse spectral curves [3,4]. Because of its ability to characterize and discriminate ground objects, HSI has become an indispensable technology in a wide range of applications in civil construction and military fields [5,6]. As one of the popular applications in remote sensing, HSI classification (HSIC) uses a mapping function to assign each pixel a class label via its spectral characteristics and spatial information [7–9]. At present, a large number of HSIC methods have been proposed, mainly covering two aspects: one is classification based on spectral information, which mainly focuses on the study of spectral features and spectral classifiers, such as support vector machines (SVM) and the maximum likelihood classifier (MLC); the other is realized by extracting spatial features to assist the discrimination, for example, the SVM-based Markov Random Field (SVM-MRF) and some segmentation-based classification frameworks [10–13]. Moreover, because of the high dimensionality of HSI and the high correlation and redundancy discovered in both the spectral and spatial domains, it can be inferred that HSI is essentially low-rank and can be represented sparsely, even though the original HSI is not sparse [14,15].

In this context, sparse representation (SR)-based methods have been widely applied to HSIC with state-of-the-art performance [16]. The classic SR-based classifier (SRC) uses as few samples as possible to represent the testing pixel as well as possible [17]. Concretely, SRC first constructs a dictionary from labeled samples of different classes, and then represents the testing pixel by means of a linear combination of the dictionary and a weight coefficient under a sparse constraint. After obtaining the approximation of the testing pixel, classification can be realized by analyzing which class yields the least reconstruction error [18]. However, this residual error-driven mechanism ignores, to a certain extent, the underlying significance and properties of the sparse coefficient. The sparse coefficient plays a decisive role in the constraint representation (CR) model, where the category of the testing pixel is determined by the maximum participation degree (PD), the PD being the contribution of labeled samples from different classes in representing the testing pixel. The CR model makes full use of the sparsity principle in handling the sparse coefficient, and achieves an effect equivalent to, and simpler than, the classic SRC. As powerful pattern recognition tools, both the SRC model and the CR model are effective representation-based models, and generate rather accurate results compared with SVM and some other spectral classifiers [19].

However, because the sparse coefficients are susceptible to spectral variability, some joint representation (JR)-based frameworks have appeared that take the local spatial consistency into consideration, such as the joint sparse representation-based classifier (JSRC) and the joint collaborative representation-based classifier (JCRC) [20,21]. Similarly, based on the concept of PD and the PD-driven decision mechanism, adjacent CR (ACR) utilizes the PD of adjacent pixels as class-dependent constraints to classify the testing pixel. The adjacent pixels are defined within a fixed window in ACR, which lacks consideration of the correlation of ground objects, although there is no strong constraint in comparison with the JSRC model. Therefore, in order to better characterize the image for classification, it is reasonable to utilize various features from the spectral and spatial domains of the image [22,23].

Object-based image classification (OBIC) is a widely adopted classification framework with spatial discriminant characteristics. OBIC usually performs classification after segmentation [24]. Segmentation technology divides an image into several non-overlapping homogeneous regions according to agreed similarity criteria. Some segmentation algorithms, such as partitioned clustering and watershed segmentation, have shown effective results on HSI [25–28]. In particular, the combination of vector quantization clustering methods and representation-based models has shown good classification performance in the related literature [29]. Therefore, OBIC is a well-established framework that can be widely applied to HSIC tasks.

In this paper, a superpixel-level constraint representation (SPCR) model is proposed, which adds a spatial constraint, generated by simple linear iterative clustering (SLIC) superpixel segmentation, to the CR model [30]. Differing from the ACR model, the proposed SPCR method extracts the spectral–spatial information of pixels inside the superpixel block, preserves most of the edge information in the image, and better estimates the real distribution of ground objects [31]. In general, the SPCR model utilizes the spectral features of adjacent pixels and transforms the individual PD into a united activity degree (UAD) through a relaxed and adaptive constraint. As shown on the right side of Figure 1, the decision mechanism of the SPCR model is to classify the testing pixel into the category with the maximum UAD. However, like most OBIC-based methods, constrained representation classification with a single fixed scale needs to find the optimal scale. To address this issue, it is necessary to propose a multiscale OBIC framework to comprehensively utilize image information [32]. As illustrated in Figure 1, we propose an improved version of the above SPCR model, called the multiscale superpixel-level constraint representation (MSPCR) method.

MSPCR merges the classification maps generated by SPCR at different superpixel segmentation scales, and is implemented in three steps: (1) a segmentation step, in which the processed hyperspectral image is segmented into superpixel images at different scales by the SLIC algorithm; (2) a classification step, in which the PD of the pixels inside the superpixel is utilized to shape the class-dependent constraint of the testing pixel; and (3) a decision fusion step, in which the final classification map of MSPCR is obtained through a decision fusion process based on the classification result of SPCR at each scale.

As mentioned above, the CR model classifies the testing pixel based on the PD-driven decision mechanism and obtains reliable performance with relatively low computational time. Considering the influence of spectral variability, the ACR model adopts the PD of adjacent pixels to obtain the category of the testing pixel. However, ACR only regards the pixels within a fixed window as adjacent pixels, lacking consideration of the correlation of ground objects. To address this issue, the SPCR model is first established by joining the CR model with the SLIC superpixel segmentation algorithm. Then, the MSPCR approach is proposed to alleviate the impact of the segmentation scale on the classification result of the SPCR method, and obtains high accuracies. Experimental results on four real hyperspectral datasets, including a GF-5 satellite dataset, are used to evaluate the classification performance of the proposed SPCR and MSPCR methods.

The rest of this paper is organized as follows. Section 2 reviews the related models, including classic representation-based classification methods and the superpixel segmentation algorithm used in this paper, i.e., SLIC. Section 3 presents our proposed methods: it first introduces the CR method and the ACR classifier, and then presents the SPCR model and the MSPCR method proposed in this paper. Section 4 evaluates the classification performance of our proposed methods and other related methods via experimental results on three real hyperspectral datasets. Section 5 provides a practical application and analysis of our proposed methods via an experiment on GF-5 satellite data. Section 6 concludes this paper with some remarks.

#### **2. Related Methods**

In this section, we introduce several related methods of our framework. The classic sparse representation (SR)-based model and the joint representation (JR)-based framework are firstly reviewed in Section 2.1. Then the simple linear iterative clustering (SLIC) is presented in Section 2.2.

#### *2.1. Representation-Based Classification Methods*

Define a testing pixel **x***i*,*<sup>j</sup>* ∈ **X** at location (*i*, *j*) of the HSI **X**, which contains *B* spectral bands and *N* = *r* × *c* pixels (*r* and *c* index the rows and columns of the scene). The dictionary can be denoted as **D** = (**D**1, ... , **D***K*) ∈ **X**, in which each column of **D***<sup>k</sup>* is a sample selected from class *k* ∈ [1, *K*] (*K* is the number of classes).

#### 2.1.1. Sparse Representation-Based Model

Since pixels in an HSI can be represented sparsely, representation-based methods have been widely applied to process HSI, as they make no assumption about the data density distribution [33]. The SRC is a classic SR-based model, implementing classification in several steps as follows. First, it constructs a dictionary from the available labeled samples, then represents the testing pixel by a sparse linear combination of the dictionary. Moreover, in order to use as few labeled samples as possible to represent the testing pixel, the weight coefficients used in the representation are sparsely constrained. Finally, classification is conducted by a residual error-driven decision mechanism, which assigns the testing pixel to the class with the minimum class-dependent residual error using the following formula:

$$\begin{cases} \hat{\boldsymbol{\alpha}}\_{i,j} = \operatorname\*{argmin}\_{\boldsymbol{\alpha}\_{i,j}} \Big\{ \|\mathbf{x}\_{i,j} - \mathbf{D}\boldsymbol{\alpha}\_{i,j}\|\_{2}^{2} + \lambda \|\boldsymbol{\alpha}\_{i,j}\|\_{1} \Big\} \\ \text{class}(\mathbf{x}\_{i,j}) = \operatorname\*{argmin}\_{k} \Big\{ \|\mathbf{x}\_{i,j} - \mathbf{D}\delta\_{k}(\hat{\boldsymbol{\alpha}}\_{i,j})\|\_{2}^{2} \Big\} \end{cases} \tag{1}$$

where $\|\boldsymbol{\alpha}\_{i,j}\|\_1 = \sum\_{m=1}^{n} |\alpha\_m|$ denotes the *l*1-norm and $\|\cdot\|\_2$ is the *l*2-norm. Since the optimization of the *l*0-norm is a combinatorial NP-hard problem, the sparse constraint on the weight coefficients α*i*,*<sup>j</sup>* adopts the *l*1-norm as a substitute for the *l*0-norm, the *l*1-norm being the closest convex function to the *l*0-norm. Moreover, λ is a scalar regularization parameter. As an indicator function, δ*k*(αˆ *<sup>i</sup>*,*j*) assigns zero to the elements that do not belong to class *k*. The weight vector αˆ *<sup>i</sup>*,*<sup>j</sup>* can be optimized by the basis pursuit (BP) or basis pursuit denoising (BPDN) algorithm.
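As a minimal Python sketch of Equation (1), using scikit-learn's Lasso as the *l*1 solver (whose internal scaling of λ differs from the formula; all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(x, D, labels, lam=1e-3):
    """SRC sketch: sparse-code the testing pixel x over dictionary D
    (columns = training spectra), then choose the class with the smallest
    class-wise reconstruction residual. labels[j] is the class of column j."""
    alpha = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(D, x).coef_
    residuals = {}
    for k in np.unique(labels):
        a_k = np.where(labels == k, alpha, 0.0)  # delta_k: keep class-k entries
        residuals[k] = np.linalg.norm(x - D @ a_k) ** 2
    return min(residuals, key=residuals.get)
```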

#### 2.1.2. Joint Representation-Based Framework

HSIC initially focused on the spectral information because of its data characteristic, while the spatial information can be further exploited to reduce classification errors, according to the similar spectral characteristic among neighborhood pixels. As the second generation of SRC, the joint SRC (JSRC) is introduced under the JR-based framework, which has a solid classification performance after integrating spectral information with the local spatial coherence.

Based on the local spatial consistency, the fundamental assumption of JSRC is that the sparse vectors related with the adjacent pixels could share a common sparsity support [34]. In the JSRC, both the testing pixel and its neighboring pixels are stacked into the joint signal matrix, and sparsely represented using the dictionary and a row-sparse coefficient matrix [35]. The final classification result of JSRC is obtained by calculating the minimum total residual error as follows:

$$\begin{cases} \hat{\mathbf{A}}\_{i,j} = \operatorname\*{argmin}\_{\mathbf{A}\_{i,j}} \{ \| \mathbf{X}\_{i,j} - \mathbf{D} \mathbf{A}\_{i,j} \|\_{F}^{2} + \lambda \| \mathbf{A}\_{i,j} \|\_{1,2} \} \\\\ \operatorname\*{class}(\mathbf{x}\_{i,j}) = \operatorname\*{argmin}\_{k} \left\{ \| \mathbf{X}\_{i,j} - \mathbf{D} \delta\_{k}(\mathbf{A}\_{i,j}) \|\_{F} \right\} \end{cases} \tag{2}$$

where $\mathbf{X}\_{i,j} = (\mathbf{x}\_{i-w,j-w}, \ldots, \mathbf{x}\_{i,j}, \ldots, \mathbf{x}\_{i+w,j+w})$ is a *w* × *w* pixel-sized square neighborhood centered on **x***i*,*j*, and **A***i*,*<sup>j</sup>* is the corresponding coefficient matrix. $\|\cdot\|\_F$ is the Frobenius norm, and $\|\mathbf{A}\_{i,j}\|\_{1,2} = \sum\_{s=1}^{n} \|\mathbf{a}\_{s}\|\_{2}$ is the *l*1,2-norm, in which **a***<sup>s</sup>* is the *s*-th row of **A***i*,*j*.

#### *2.2. Simple Linear Iterative Clustering*

OBIC is a widely used spectral–spatial classification framework that utilizes spatial information after a segmentation procedure [36]. As one of the widely used segmentation methods, the SLIC algorithm identifies superpixels by an over-segmentation approach. The idea of SLIC is to apply the K-means algorithm locally to obtain an effective cluster segmentation result. Specifically, it measures the distance from each cluster center to the pixels within a 2*S* × 2*S* block, where $S = \sqrt{N/P}$. Here, *N* is the number of pixels, and *P* is the number of clustering centers, which equals the total number of superpixels [37].

In general, the SLIC algorithm can be implemented in several steps as follows: the first step is to select *P* initial clustering centers from the original image; each pixel is then assigned to the nearest clustering center, constructing the respective clusters. The iterative clustering process is performed until the positions of the cluster centers become stable. As stated above, the original K-means algorithm calculates distances over the whole map, while the searching area of SLIC is the local area of each superpixel, so the SLIC algorithm reduces the computational complexity to a great extent. The distance in SLIC is defined as follows:

$$D\_{\rm SLIC} = D\_{\rm spectral} + \frac{m}{\rho} D\_{\rm spatial} \tag{3}$$

where *Dspectral* is a spectral distance, which is used to ensure the homogeneity inside the superpixel, and the spectral distance between pixel *i* and pixel *j* is described as follows:

$$D\_{\text{spectral}} = \sqrt{\sum\_{d=1}^{D} \left(\mathbf{x}\_{i,d} - \mathbf{x}\_{j,d}\right)^2},\tag{4}$$

where *xi*,*<sup>d</sup>* is the value of pixel *i* in band *d*, and *Dspatial* represents the spatial distance, which is used to control the compactness and regularity of the superpixels; the spatial distance between pixel *i* and pixel *j* is defined as follows:

$$D\_{\text{spatial}} = \sqrt{(a\_i - a\_j)^2 + (b\_i - b\_j)^2},\tag{5}$$

where (*ai*, *bi*) is the location of pixel *i*; *m* and ρ in Equation (3) are the scale parameters of the superpixels.
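A direct Python transcription of Equations (3)–(5) (function and variable names are illustrative):

```python
import numpy as np

def slic_distance(x_i, x_j, pos_i, pos_j, m, rho):
    """Combined SLIC distance: spectral term (Eq. 4) plus a weighted
    spatial term (Eq. 5), combined as in Eq. (3).
    x_* are band vectors; pos_* are (row, col) locations."""
    d_spectral = np.sqrt(np.sum((x_i - x_j) ** 2))           # Eq. (4)
    d_spatial = np.sqrt((pos_i[0] - pos_j[0]) ** 2
                        + (pos_i[1] - pos_j[1]) ** 2)        # Eq. (5)
    return d_spectral + (m / rho) * d_spatial                # Eq. (3)
```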

#### **3. Proposed Approach**

As introduced in Section 2.1, both the classic SR-based model and the variant JR-based method conduct classification using the class-dependent minimum residual error between the original observation and its approximate representation. However, the residual error-based decision mechanisms in the SR-based and JR-based frameworks ignore the importance of the sparse coefficients. Section 3.1 introduces the CR method and the ACR classifier, which exploit the characteristics of the sparse coefficient. After that, we present the details of SPCR and MSPCR in Section 3.2. Both methods are generally based on spatial correlation. Specifically, SPCR utilizes the spectral consistency among adjacent pixels, as ACR does, and MSPCR further achieves comprehensive utilization of various regional distributions.

#### *3.1. Constraint Representation (CR) and Adjacent CR (ACR)*

#### 3.1.1. CR Model

According to the principle of the representation-based model, classification can be regarded as representing the testing pixel via a sparse linear combination of the labeled samples. For the sake of understanding, a simple case can be assumed as in Equation (6). The testing pixel is represented by single elements with nonzero coefficients (α*p*, α*q*, α*m*, ... , α*n*) from certain classes (*k*, *k* + 1, ... , *k*<sup>∗</sup> ∈ [1, *K*]) as follows [38]:

$$\mathbf{x}\_{i,j} \approx \alpha\_{p} \mathbf{x}\_{t}^{k} + \alpha\_{q} \mathbf{x}\_{t\_{1}}^{k+1} + \alpha\_{m} \mathbf{x}\_{t\_{2}}^{k+2} + \dots + \alpha\_{n} \mathbf{x}\_{t\_{n}}^{k^{\*}} \tag{6}$$

Since αˆ *<sup>i</sup>*,*<sup>j</sup>* is sparsely constrained, the labeled samples that contribute to representing the testing pixel are the ones whose coefficients are not zero. In the process of representation, the larger the coefficient value, the higher the contribution to representing the testing pixel, and thus the more likely the testing pixel belongs to the corresponding category. Therefore, CR directly exploits the sparse coefficient to conduct the classification, which is concise and equivalent to the residual error-driven decision mechanism. Specifically, it defines the participation degree (PD) from the perspective of the sparse coefficient, which estimates the contribution of labeled samples from different classes in representing the testing pixel **x***i*,*j*. The PD of each class is calculated from the corresponding weight vector with an *l<sup>d</sup>*-norm measurement (*d* = 1 or *d* = 2) as follows:

$$\text{PD}\_{k} = \left\| \boldsymbol{\alpha}\_{i,j}^{k} \right\|\_{d} \tag{7}$$

The PD-driven decision mechanism of CR is to determine the category with the maximum PD, which can be expressed in Equation (8):

$$\text{class}(\mathbf{x}\_{i,j}) = \max\_{k} \{ \text{PD}\_{i,j}^{k} \Big| k \in [1, K] \}. \tag{8}$$
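A minimal Python sketch of the PD-driven decision rule in Equations (7) and (8) (names are illustrative):

```python
import numpy as np

def cr_classify(alpha, labels, d=1):
    """CR sketch: per-class participation degree PD_k = ||alpha_k||_d,
    then class = argmax_k PD_k. `alpha` is the sparse coefficient vector;
    labels[j] is the class of dictionary atom j."""
    classes = np.unique(labels)
    pd = [np.linalg.norm(alpha[labels == k], ord=d) for k in classes]
    return classes[int(np.argmax(pd))]
```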

#### 3.1.2. ACR Model

Based on the PD-driven mechanism, an improved version, ACR, has been proposed to correct spectral variation by imposing spatial constraints during classification. According to the spectral similarity among adjacent pixels, adjacent pixels are more likely to belong to the same class [39]. In this context, ACR achieves better classification performance than the CR model by augmenting the PD-driven mechanism with the spatial consistency of the adjacent pixels. The main principle of ACR is to use the PD of adjacent pixels as a constraint to determine the category of the testing pixel. Specifically, ACR first defines the adjacent pixels within a *w* × *w* pixel-sized window centered on the testing pixel, then constructs a *K*-dimensional PD image, in which each dimension of the PD image shows the PD values of the pixels for one class. The class-dependent activity degree (CAD) of each element is obtained after normalizing the PD image at each dimension, which can be expressed as follows:

$$\text{CAD}^k\_{i,j} = \text{PD}^k\_{i,j} \Big/ \sum\_{k^{\*}=1}^{K} \text{PD}^{k^{\*}}\_{i,j}, \tag{9}$$

where *k* ∈ [1, *K*] denotes the class index, and (*i*, *j*) is the location of the testing pixel. With consideration of the spatial constraint of the adjacent pixels, the relative activity degree (RAD) is generated by combining the CAD of the testing pixel with the inactivity degree of its adjacent pixels through a scale compensation parameter τ, where the index of the adjacent pixels is *v* ∈ [1, *w*²]. ACR uses the RAD as the final contribution degree in representing the testing pixel **x***i*,*j*, and the class of **x***i*,*<sup>j</sup>* is determined by the maximum RAD as follows:

$$\begin{cases} \text{RAD}^k\_{i,j} = \text{CAD}^k\_{i,j} - \tau \sum\_{v=1}^{w^2} \left( 1 - \text{CAD}^k\_{v} \right) \\ \text{class}(\mathbf{x}\_{i,j}) = \max\_{k} \{ \text{RAD}^k\_{i,j} \Big| k \in [1, K] \} \end{cases} \tag{10}$$

#### *3.2. Superpixel-Level CR (SPCR) and Multiscale SPCR (MSPCR)*

#### 3.2.1. SPCR Model

As mentioned above, the ACR model defines the adjacent pixels as the pixels within a fixed pixel-sized window centered on the testing pixel. However, it does not consider the real distribution of ground objects. The superpixel block obtained by the superpixel segmentation algorithm is made up of neighboring pixels with similar spatial characteristics. By incorporating the superpixel segmentation algorithm, we establish the SPCR model to further utilize the spectral consistency within the set of adjacent pixels. In this way, the SPCR model conducts class-dependent constrained representation according to the PD of the pixels inside the superpixel block containing the testing pixel, which preserves most of the edge information in the image compared with the fixed-window sample selection in ACR, and gives a more objective consideration of the spatial distribution around the testing pixel. As illustrated in Figure 1, the schematic diagram of the SPCR model is equivalent to MSPCR at a single segmentation scale, and it can be implemented in several steps as follows.

First, we obtain superpixel blocks using the SLIC algorithm. Since SLIC can only process an image in the CIELAB color space, it is necessary to convert the HSI to a three-band image before it is processed by the SLIC algorithm. Therefore, the principal component analysis (PCA) method is adopted to reduce the spectral dimensionality in the SPCR method, and the first three components are selected as the input of SLIC to generate a stable superpixel segmentation result [40]. Then, the category of the testing pixel can be measured by calculating the PD values of the pixels inside the superpixel where the testing pixel is located. Specifically, using the PD values of the pixels at the corresponding positions of the superpixel, we build an SPD image surrounding **x***i*,*<sup>j</sup>* with *K* dimensions, and each dimension of the SPD image shows the PD values of the pixels for one class. Similar to ACR, the normalized value of each pixel in the *k*-th SPD image is defined as the class-dependent activity degree (CAD) with regard to class *k*.
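This preprocessing step can be sketched in Python with scikit-learn and scikit-image (parameter values are illustrative, not the authors' settings):

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.segmentation import slic

def superpixel_segment(hsi, n_segments=200, compactness=10.0):
    """Reduce the HSI to its first three principal components, rescale to
    [0, 1], and run SLIC on the resulting three-band image."""
    r, c, b = hsi.shape
    pcs = PCA(n_components=3).fit_transform(hsi.reshape(-1, b)).reshape(r, c, 3)
    pcs = (pcs - pcs.min()) / (pcs.max() - pcs.min())
    return slic(pcs, n_segments=n_segments, compactness=compactness)
```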

**Figure 1.** The workflow of multiscale superpixel-level constraint representation (MSPCR).

In order to further utilize the correlation of ground objects, SPCR combines the CAD of **x***i*,*<sup>j</sup>* with the CADs of the other pixels inside the superpixel through a scale compensation parameter, such that the other pixels impose a proper constraint when classifying the testing pixel **x***i*,*j*. Compared with the constraint based on local spatial information in the RAD shown in Equation (10), the united activity degree (UAD) utilizes the correlation of ground objects via a similar combination, represented as follows:

$$\text{UAD}\_{i,j}^k = \text{CAD}\_{i,j}^k + \gamma \sum\_{\varepsilon=1}^l \text{CAD}\_{\varepsilon}^k \tag{11}$$

where ε ∈ [1, *l*] indicates the element index in the superpixel block, and γ represents a scale compensation parameter. Moreover, the SPCR model classifies **x***i*,*<sup>j</sup>* by analyzing which class leads to the maximum UAD*i*,*<sup>j</sup>* as follows:

$$\text{class}(\mathbf{x}\_{i,j}) = \max\_{k} \{ \text{UAD}^k\_{i,j} \Big| k \in [1, K] \}\tag{12}$$
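A minimal Python sketch of Equations (11) and (12) (array names are illustrative):

```python
import numpy as np

def spcr_classify(cad_center, cad_neighbors, gamma):
    """SPCR decision sketch: combine the testing pixel's CAD (shape (K,))
    with the CADs of the other pixels in its superpixel (shape (l, K)),
    then take the class with the maximum UAD."""
    uad = cad_center + gamma * cad_neighbors.sum(axis=0)   # Eq. (11)
    return int(np.argmax(uad))                             # Eq. (12)
```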

#### 3.2.2. MSPCR Model

As shown by the aforementioned algorithm, the proposed SPCR method, based on the OBIC framework, generates solid performance by exploiting spatial contextual information. However, as the classification results of SPCR at different segmentation scales are not the same, superpixel segmentation-based HSI classification may not generate a comprehensive and stable result under a fixed segmentation scale; in particular, the performance of SPCR is highly affected by the scale level [41]. To solve these problems, it is reasonable to propose a multiscale OBIC framework to comprehensively utilize image information. In this paper, MSPCR is proposed by means of decision fusion of the classification maps obtained by the SPCR method at different segmentation scales. Compared with SPCR, the improved MSPCR not only uses multiple scales to balance the different sizes and distributions of ground objects, but also avoids the problem of selecting the optimal segmentation scale.

Specifically, Figure 1 and Algorithm 1 illustrate the general schematic diagram and pseudo-procedure of the MSPCR method, respectively. Firstly, following the workflow of the SPCR method, we obtain the classification results of the testing pixel at the different superpixel segmentation scales. In this process, the superpixel blocks are generated by feeding the PCA result into the SLIC algorithm, and the testing pixel is then classified under a relaxed, adaptive constraint inside its superpixel. After performing the SPCR method at each scale, a decision fusion step produces the classification result of MSPCR, in which the category of the testing pixel $\mathbf{y}_i$ is the most frequent label assigned to $\mathbf{x}_{i,j}$ across the per-scale classification results. The decision fusion is expressed as follows:

$$\mathrm{class}(\mathbf{y}_i) = \mathrm{mode}\left\{ \mathrm{class}(\mathbf{y}_i^{q_1}), \ldots, \mathrm{class}(\mathbf{y}_i^{q_Q}) \right\} \tag{13}$$

where $\mathbf{y}_i$ denotes the final class label of $\mathbf{x}_{i,j}$, $\mathbf{y}_i^q$ represents the classification result of $\mathbf{x}_{i,j}$ at segmentation scale $q$, and $\mathrm{mode}$ assigns to $\mathbf{y}_i$ the most frequent category in $[\mathbf{y}_i^{q_1}, \ldots, \mathbf{y}_i^{q_Q}]$.
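In code, this fusion is a per-pixel majority vote over the per-scale label maps; the sketch below (our own illustration, assuming integer class labels starting at 0) mirrors Equation (13):

```python
import numpy as np

def decision_fusion(scale_maps):
    """Per-pixel majority vote across scales, mirroring Equation (13).

    scale_maps: (Q, H, W) integer label maps, one per segmentation scale.
    """
    q, h, w = scale_maps.shape
    flat = scale_maps.reshape(q, -1)
    # bincount + argmax returns the most frequent label at each pixel.
    fused = np.array([np.bincount(flat[:, p]).argmax()
                      for p in range(flat.shape[1])])
    return fused.reshape(h, w)
```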

#### **Algorithm 1.** The proposed MSPCR method

**Input:** A hyperspectral image (HSI) $\mathbf{X}$, dictionary $\mathbf{D}$, the testing pixel $\mathbf{x}_{i,j}$, regularization parameter $\lambda$, scale compensation parameter $\gamma$.

*Step 1*: Reshape $\mathbf{X}$ into a color image by compositing the first three principal component analysis (PCA) bands.

*Step 2*: Obtain multiscale superpixel segmentation images $S^Q = \{S_q\}_{q=1}^{Q}$ of $\mathbf{X}$ according to SLIC in Equations (3) to (5).

*Step 3*: Obtain the participation degree (PD) image of **X** according to Equation (7).

*Step 4*: Extract the superpixel containing the testing pixel $\mathbf{x}_{i,j}$ from the PD image of $\mathbf{X}$ to obtain the multiscale SPD images.

*Step 5*: Apply class-dependent normalization at each scale according to Equation (9).

*Step 6*: Calculate the united activity degree (UAD) according to Equation (11).

*Step 7*: Assign the class of $\mathbf{x}_{i,j}$ at each scale according to Equation (12).

*Step 8*: Determine the final class label by the decision fusion according to Equation (13).

**Output:** The class labels **y**.
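Tying the steps together, a schematic driver for Algorithm 1 might read as follows; this is a sketch only, reusing the `superpixel_segmentation`, `classify_pixel`, and `decision_fusion` helpers sketched earlier and stubbing out both the PD computation of Equation (7) and the class-dependent normalization of Equation (9), whose exact forms are given earlier in the paper:

```python
import numpy as np

def compute_pd(hsi_cube, k_classes):
    """Placeholder for Equation (7): per-class participation degrees."""
    return np.random.rand(k_classes, *hsi_cube.shape[:2])

def mspcr(hsi_cube, scales, k_classes=8, gamma=0.1):
    """Schematic MSPCR driver following Algorithm 1."""
    h, w, _ = hsi_cube.shape
    pd = compute_pd(hsi_cube, k_classes)                  # Step 3, Eq. (7)
    # Stand-in for Eq. (9)'s class-dependent normalization.
    cad = pd / (pd.sum(axis=0, keepdims=True) + 1e-12)    # Step 5
    per_scale = []
    for s in scales:                                      # Steps 1-2 per scale
        seg = superpixel_segmentation(hsi_cube, n_segments=s)
        labels = np.empty((h, w), dtype=int)
        for i in range(h):
            for j in range(w):                            # Steps 6-7, Eqs. (11)-(12)
                labels[i, j] = classify_pixel(cad, seg, i, j, gamma)
        per_scale.append(labels)
    return decision_fusion(np.stack(per_scale))           # Step 8, Eq. (13)
```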

#### **4. Experimental Results and Analysis**

In this section, we investigate the effectiveness of the proposed SPCR and MSPCR models on three hyperspectral datasets. The applied datasets are described in Section 4.1. The parameter tuning for the proposed models and the compared methods is presented in Section 4.2. We evaluate the two proposed methods against methods operating in the spectral domain and in the spectral–spatial domain. In the spectral domain, the classic SR-based methods, SRC and its simplified model CR, together with the classic SVM, are selected for the comparative experiments. In the spectral–spatial domain, we further test the classic JR-based model JSRC, SVM-MRF as a typical model with spatial post-processing, and the previously proposed ACR. In each experiment we randomly select the training samples 20 times and report the overall accuracy (OA) and class-dependent accuracy (CA). The experimental results of the two proposed methods and the related methods are analyzed in Sections 4.3–4.5.

#### *4.1. Experimental Data Description*

#### 4.1.1. Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Indian Pines Scene

The first dataset is the Indian Pines scene acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwestern Indiana, with a spatial resolution of 20 m. The scene covers 220 spectral bands ranging from 0.4 to 2.5 μm, and the size of the image is 145 × 145 pixels. In order to satisfy the sparsity assumption, eight ground-truth classes with a total of 8624 labeled samples are extracted from the original sixteen categories of the reference data. Figure 2a,b shows the false-color composite image and the reference map of this scene, respectively.

4.1.2. Reflective Optics Spectrographic Imaging System (ROSIS) University of Pavia Scene

The second dataset is the University of Pavia scene collected by the Reflective Optics Spectrographic Imaging System (ROSIS) over a downtown area near the University of Pavia in Italy, with a spatial resolution of 1.3 m. After removing 12 bands with high noise and water absorption, the scene has 103 spectral bands ranging from 0.43 to 0.86 μm, with 610 × 340 pixels. Nine ground-truth classes with a total of 42,776 labeled samples are contained in the reference data. Figure 3a,b shows the false-color composite image and the reference map of this scene, respectively.

4.1.3. Hyperspectral Digital Image Collection Experiment (HYDICE) Washington, DC, National Mall Scene

The third dataset is the Washington, DC, National Mall scene captured by the Hyperspectral Digital Image Collection Experiment (HYDICE) sensor over Washington, DC, USA, with a spatial resolution of 3 m. The original scene contains 210 spectral bands ranging from 0.4 to 2.5 μm, with 280 × 307 pixels. After removing the atmospheric absorption bands from 0.9 to 1.4 μm, 191 bands remained. Six ground-truth classes with a total of 10,190 labeled samples are included in the reference data. Figure 4a,b shows the false-color composite image and the reference map of this scene, respectively.

**Figure 2.** The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Indian Pines scene: (**a**) false-color composite image and (**b**) reference map.

**Figure 3.** The Reflective Optics Spectrographic Imaging System (ROSIS) University of Pavia scene: (**a**) false-color composite image and (**b**) reference map.

**Figure 4.** The Hyperspectral Digital Image Collection Experiment (HYDICE) Washington, DC, National Mall scene: (**a**) false-color composite image; (**b**) reference map.

#### *4.2. Parameter Tuning*

In the experiments of this paper, the regularization parameter $\lambda$ for all SR-based models was selected from $10^{-3}$ to $10^{-1}$. The scale compensation parameters $\tau$ and $\gamma$ in the ACR and SPCR-based methods were set within a proper range according to the window size $w$ and the number of superpixels $P$. As $w$ varies, the distribution of ground objects within the $w \times w$ pixel window centered on the testing pixel also varies, which matters because these methods assume that the adjacent pixels inside the window belong to the same class. Following the definition of $w$ in [22], each scene usually has a proper $w$ that respects spatial consistency, and an excessive window size can degrade the result. Therefore, in order to obtain a high classification accuracy, we optimized the window size $w$ for each experimental scene.

In addition, the number of superpixels $P$ in the SPCR and MSPCR classifiers is determined by the segmentation scale $S$ and the number of pixels $N$ via $P = \sqrt{N}/S$. The corresponding experimental analysis of $P$ versus classification accuracy is illustrated in Figures 5 and 6; the relationship between the segmentation scale $S$ and the classification accuracy can be inferred from that between $P$ and the classification results. Firstly, Figure 5 shows the impact of the number of superpixels on the classification accuracy (50 samples per class). We selected five classes from the AVIRIS Indian Pines dataset and four classes from the HYDICE Washington, DC, National Mall dataset for display. As illustrated in Figure 5a, the optimal segmentation scale varies across classes; for example, the optimal segmentation scale of class 2 is distinct from that of the other three classes in Figure 5b. In addition, the relationship among the number of superpixels, the overall accuracy, and the number of labeled samples is shown in Figure 6. Generally, the overall accuracy increases with the number of labeled samples at each segmentation scale. Notably, the segmentation scale that achieves the highest classification accuracy varies with the number of labeled samples. Like most OBIC frameworks, the proposed SPCR method requires an optimal segmentation scale to be set, whereas the improved MSPCR method overcomes this drawback by fusing the spatial–spectral characteristics of the HSI at different segmentation scales.

**Figure 5.** The sensitivity analysis of the number of superpixels on classification accuracy (50 samples per class). (**a**) the AVIRIS Indian Pines dataset. (**b**) the HYDICE Washington, DC, National Mall dataset.

**Figure 6.** The sensitivity analysis of the number of superpixels versus training size. (**a**) the AVIRIS Indian Pines dataset. (**b**) the HYDICE Washington, DC, National Mall dataset.

#### *4.3. Experiments with the AVIRIS Indian Pines Scene*

In the first experiment with the AVIRIS Indian Pines hyperspectral scene, we randomly selected 90 labeled samples per class, a total of 720 samples, to construct the dictionary and train the model. The selected training samples constitute approximately 8.35% of the labeled samples in the reference map, and the remaining samples are used for validation. As illustrated in Table 1, the OAs and CAs of the different methods are calculated, and the corresponding classification maps are presented in Figure 7. We analyzed the classification results as follows:


improvement on the spectral classifiers. Similarly, since JSRC conducts the classification by sharing a common sparsity support among all neighborhood pixels, an improvement in overall accuracy also appears in JSRC compared to SRC. Compared with the CR model, the ACR classifier obtains a significant improvement: it solves the spectral variability problem in CR by imposing a spatial constraint, and shows that moving the decision mechanism from PD-driven to RAD-driven is effective for HSIC tasks. As mentioned above, the improvements of the SVM-MRF, JSRC, and ACR models over their original counterparts SVM, SRC, and CR confirm the effectiveness of introducing spatial information into the spectral-domain classifiers.

(3) From Figure 7, JSRC has better classification performance than SVM-MRF in the AVIRIS Indian Pines scene. As illustrated in Table 1, the ACR classifier achieves a better classification result than JSRC and SVM-MRF, with an overall accuracy 2.38% higher than that of JSRC and 6.11% higher than that of SVM-MRF. On the one hand, the RAD-driven mechanism in ACR is more effective than the hybrid norm constraint in JSRC. On the other hand, the spatial post-processing in SVM-MRF places more emphasis on adjusting the initial classification result generated from spectral features and lacks an effective strategy for integrating spatial information with spectral information.

**Figure 7.** Classification maps obtained by the different tested methods with 90 samples per class for the AVIRIS Indian Pines dataset (overall accuracy (OA) in parentheses). SVM = support vector machine; MRF = Markov random field; SRC = sparse-representation-based classifier; CR = constraint representation; ACR = adjacent constraint representation; JSRC = joint sparse-representation-based classifier; SPCR = superpixel-level constraint representation.


segmentation scale on the classification results. It also indicates that the decision fusion comprehensively accounts for the different spatial features and distributions of the various categories of objects, which raises the final classification accuracy.

**Table 1.** Overall and classification accuracies (in percent) obtained by the different tested methods for the AVIRIS Indian Pines scene. In all cases, 720 labeled samples in total (90 samples per class) were used for training.


In general, the proposed MSPCR obtains an overall accuracy of 95.30%, which is 2.40% and 4.06% higher than SPCR and ACR, respectively, and 12.11% higher than CR. It also provides strong individual class accuracies, especially for classes 2, 6, and 7. The classification maps in Figure 7 confirm the improvement achieved by the MSPCR.

In the second test with the AVIRIS Indian Pines scene, we randomly selected 10 to 90 samples per class as training samples to evaluate the proposed SPCR and MSPCR. Figure 8 shows the overall classification accuracies obtained by the different methods with different numbers of labeled samples. The results can be summarized as follows:

**Figure 8.** Overall classification accuracy obtained by different tested methods with different numbers of labeled samples for the AVIRIS Indian Pines scene.


and SVM-MRF, as well as CR toward SVM. Moreover, the proposed MSPCR achieved the best performance among these classifiers.

#### *4.4. Experiments with the ROSIS University of Pavia Scene*

In the first test of the experiment with the ROSIS University of Pavia scene, we selected 90 labeled samples per class, a total of 810 samples (approximately 1.89% of the available labeled samples in the reference map), and the remaining labeled samples were used for validation. Table 2 and Figure 9 show the OAs and CAs of the classifiers and the corresponding classification maps. The experimental results parallel those obtained on the AVIRIS Indian Pines dataset. First, SRC and CR achieved similar classification results, comparable to the SVM in the spectral domain. In the spatial domain, SVM-MRF, JSRC, and ACR bring significant improvements over the SVM, SRC, and CR models by integrating spatial information. Moreover, SVM-MRF attains a better classification accuracy than JSRC, unlike the behavior of these two methods on the AVIRIS Indian Pines dataset. Compared with ACR, the introduction of the superpixel segmentation algorithm yields a higher accuracy for SPCR. Last but not least, the proposed MSPCR achieves the best classification result, with an overall accuracy of 96.90%, which is 3.64% and 4.71% higher than SPCR and ACR, respectively, and 16.7% higher than CR. Additionally, it brings considerable improvements in individual class accuracy, especially for classes 2 and 4, as shown by the classification maps in Figure 9.

**Figure 9.** Classification maps obtained by the different tested methods with 90 samples per class for the ROSIS University of Pavia dataset (OAs are in parentheses).


**Table 2.** Overall and classification accuracies (in percent) obtained by the different tested methods for the ROSIS University of Pavia scene. In all cases, 810 labeled samples in total (90 samples per class) were used for training.

Our second test on the ROSIS University of Pavia scene evaluated the proposed SPCR and MSPCR with various numbers of labeled samples (from 10 to 90 samples per class). Figure 10 shows the overall classification accuracies obtained by the different tested methods under different numbers of training samples. As the number of labeled samples increases, most of the evaluated methods show an increasing trend in accuracy. Relative to the overall classification accuracy of SVM, SRC and CR at first perform better, then fall behind as the number of labeled samples increases. By considering the correlation of ground objects, ACR and SVM-MRF achieve significant improvements as the number of samples grows, with higher classification accuracy than JSRC in most cases. In addition, the combination of the PD-decision mechanism and the superpixel segmentation algorithm brings a reliable and stable improvement, confirmed by the overall classification accuracies of the SPCR method in all cases. From Figure 10, the MSPCR method achieves the best classification result among the compared methods, as its decision fusion alleviates the difficulty of adapting a single fixed segmentation scale to the spatial characteristics of all categories in the image.

**Figure 10.** Overall classification accuracy obtained by the different tested methods with different numbers of labeled samples for the ROSIS University of Pavia scene.

#### *4.5. Experiments with the HYDICE Washington, DC, National Mall Scene*

In our first test with the HYDICE Washington, DC, National Mall scene, we randomly selected 50 labeled samples per class, a total of 300 samples (approximately 2.94% of the available labeled samples), for training and dictionary construction; the remaining samples were used for validation. Table 3 shows the OAs and CAs obtained by the different tested methods, and Figure 11 shows the corresponding classification maps. In the spectral domain, the traditional SRC provides an approximately equivalent result to CR, and both outperform the traditional SVM method, again indicating that sparse coefficients are powerful at representing spectral characteristics. In the spectral–spatial domain, SVM-MRF, ACR, and SPCR perform well relative to their original counterparts, i.e., SVM and CR. In addition, the overall accuracies of the SRC method and the JSRC model show that an improper spatial constraint may have a negative impact on classification performance. Unlike the classification results on the above two datasets, ACR attains a better classification performance than the proposed SPCR method on the HYDICE Washington, DC, National Mall scene, indicating that the SPCR model is susceptible to the superpixel segmentation scale. This is the original motivation for proposing the MSPCR method, which eliminates the impact of the number of superpixels on classification by fusing the classification results at different segmentation scales. Furthermore, the proposed MSPCR method achieves the highest accuracy of 98.32%, consistent with its results on the AVIRIS Indian Pines and ROSIS University of Pavia scenes. In addition, the proposed MSPCR provides reliable individual classification accuracy for each class, especially classes 1 and 2, as can be seen in the classification maps in Figure 11.

**Figure 11.** Classification maps obtained by the different tested methods with 50 samples per class for the HYDICE Washington, DC, National Mall dataset (OAs are in parentheses).

**Table 3.** Overall and classification accuracies (in percent) obtained by the different tested methods for the HYDICE Washington, DC, National Mall scene. In all cases, 300 labeled samples in total (50 samples per class) were used for training.


In our second test with the HYDICE Washington, DC, National Mall scene, we evaluated the classification performance of our proposed methods in the spectral–spatial domain with different numbers of training samples. As shown in Figure 12, the classification accuracy rises with the number of training samples, and the curves flatten once the number of training samples reaches a certain amount. Firstly, in the spectral domain, SRC and CR achieve better classification results than SVM as the number of labeled samples increases. Although JSRC obtains poorer results than SRC, SVM-MRF, ACR, and SPCR still provide competitive classification performance relative to SVM and CR as the number of training samples increases, which shows that integrating spectral feature discrimination with spatial coherence is a reliable processing framework for HSIC in most cases. Moreover, the combination of the PD-driven mechanism and the spatial constraint brings further improvement, as indicated by the performance of the ACR and SPCR-based methods versus SVM-MRF and JSRC. In the spectral–spatial domain, the proposed MSPCR yields the best overall accuracy in all cases in comparison with the other related methods and improves significantly on the proposed SPCR.

**Figure 12.** Overall classification accuracy obtained by the different tested methods with different numbers of labeled samples for the HYDICE Washington, DC, National Mall scene.

In addition, we compared the computational cost of several spectral–spatial methods on the above three hyperspectral datasets, with the labeled samples set as in Tables 1–3. As shown in Figure 13, JSRC is the fastest but has the lowest classification accuracy on all three datasets. The proposed MSPCR achieves the best classification accuracy but takes roughly five times longer than SPCR, owing to the decision fusion process. On the ROSIS University of Pavia and AVIRIS Indian Pines datasets, SPCR is the second best, with a running time approximately equivalent to ACR. On the HYDICE Washington, DC, National Mall dataset, ACR achieves the second-highest classification accuracy with a speed similar to SPCR.

**Figure 13.** Computation time comparison of the different tested methods for (**a**) the AVIRIS Indian Pines dataset, (**b**) the ROSIS University of Pavia dataset, and (**c**) the HYDICE Washington, DC, National Mall dataset. The experiments were carried out in MATLAB on an Intel(R) Core(TM) i7-6700K CPU machine with 16 GB of RAM.

Synthesizing the above experimental results and analysis, the proposed SPCR method obtains considerable overall and individual classification accuracy, and the improved MSPCR performs better still. Moreover, the experimental results on the different datasets show that MSPCR outperforms several other related methods. Furthermore, the classification results under different numbers of training samples also indicate the superiority and practicability of the proposed SPCR and MSPCR methods.

It should also be noted that the computational cost of the proposed MSPCR is relatively high, which we plan to optimize in future work. Moreover, there are further potential directions; for instance, a sample selection mechanism related to the adaptive capability of the method could be a follow-up research line.

#### **5. Practical Application and Analysis**

Distinct from the above three experimental datasets, we adopt hyperspectral image data collected by the GF-5 satellite to assess the practicability of the proposed SPCR and MSPCR methods. GF-5 is the first hyperspectral comprehensive observation satellite of China, with a spatial resolution of 30 m. There are six payloads on GF-5, including two land imagers and four atmospheric sounders. In this paper, we select a scene from the hyperspectral image data obtained by the visible short-wave infrared hyperspectral camera.

First, we selected the visible-to-near-infrared range of the original data. After atmospheric correction and radiometric correction, the scene covers 150 spectral bands ranging from 0.4 to 2.5 μm, and the size of the image is 200 × 200 pixels. Six ground-truth classes with a total of 2216 labeled samples are contained in the reference data. Figure 14 shows the false-color composite image and the reference map of this scene.

**Figure 14.** Classification maps obtained by the different tested methods with 5 samples per class for the GF-5 satellite dataset (OAs are in parentheses).

In the experiment with the GF-5 satellite dataset, we randomly selected five labeled samples per class, a total of 30 samples, to construct the dictionary and train the model. The selected training samples constitute approximately 1.35% of the labeled samples in the reference map, and the remaining samples are used for validation. Figure 14 displays the classification maps of the different methods. We analyzed the classification results as follows:

Compared with SVM-MRF and JSRC, ACR has better classification performance, with an overall accuracy 6.20% higher than that of JSRC and 2.00% higher than that of SVM-MRF, confirming that the PD-driven decision mechanism plays an important role in classification. Compared with ACR, the SPCR method obtains a better classification result, which verifies the effectiveness of integrating the PD-driven mechanism with the superpixel segmentation algorithm. MSPCR outperforms SPCR and yields the best accuracy among the related methods, which not only shows that MSPCR alleviates the impact of the superpixel segmentation scale on the classification result, but also indicates that the decision fusion plays a decisive role in adapting to the different spatial characteristics of the various categories of objects.

#### **6. Conclusions**

In this paper, a novel classification framework based on sparse representation, called superpixel-level constraint representation (SPCR), was proposed for hyperspectral imagery classification. SPCR uses the spectral consistency of the pixels inside a superpixel to determine the category of the testing pixel. In addition, we proposed an improved multiscale superpixel-level constraint representation (MSPCR) method, which obtains the final classification result by fusing the classification maps of SPCR at different segmentation scales. The proposed SPCR method exploits the latent properties of the sparse coefficients and improves the contextual constraint by taking spatial characterization into account. Moreover, the proposed MSPCR comprehensively utilizes the varied regional distributions, resulting in strong classification performance. The experimental results on four real hyperspectral datasets, including a GF-5 satellite scene, demonstrated that SPCR outperforms several other classification methods and that MSPCR yields a better classification accuracy than SPCR.

**Author Contributions:** Conceptualization, H.Y. and J.H.; formal analysis, M.S.; methodology, H.Y. and X.Z.; writing—original draft preparation, H.Y. and X.Z.; writing—review and editing, Q.G. and L.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China, grant numbers 61971082 and 61890964, and the Fundamental Research Funds for the Central Universities, grant numbers 3132020218 and 3132019341.

**Acknowledgments:** The authors would like to thank the Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences for generously providing the GF-5 satellite data.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Underwater Hyperspectral Target Detection with Band Selection**

**Xianping Fu 1,2, Xiaodi Shang 1, Xudong Sun 1,2, Haoyang Yu 1, Meiping Song 1,\* and Chein-I Chang 1,3,4**


Received: 24 January 2020; Accepted: 20 March 2020; Published: 25 March 2020

**Abstract:** Compared to multi-spectral imagery, hyperspectral imagery has very high spectral resolution with abundant spectral information. In underwater target detection, hyperspectral technology can be advantageous under a poor underwater imaging environment, a complex background, or the protective coloration of aquatic organisms. Due to the high data redundancy, slow imaging speed, and long processing times of hyperspectral imagery, directly using hyperspectral images to detect targets cannot meet the needs of rapid underwater target detection. To resolve this issue, a fast hyperspectral underwater target detection approach using band selection (BS) is proposed. It first develops a constrained-target optimal index factor (OIF) band selection (CTOIFBS) to select a band subset with spectral wavelengths specifically responding to the targets of interest. Then, an underwater spectral imaging system integrated with the best-selected band subset is constructed for underwater target image acquisition. Finally, a constrained energy minimization (CEM) target detection algorithm is used to detect the desired underwater targets. Experimental results demonstrate that the band subset selected by CTOIFBS is more effective in detecting underwater targets than three existing BS methods: uniform band selection (UBS), minimum variance band priority (MinV-BP), and minimum variance band priority with OIF (MinV-BP-OIF). In addition, the results show that the acquisition and detection speed of the designed underwater spectral acquisition system using CTOIFBS is significantly improved over the original underwater hyperspectral image system without BS.

**Keywords:** constrained-target optimal index factor band selection (CTOIFBS); hyperspectral image; underwater spectral imaging system; underwater hyperspectral target detection; band selection (BS); constrained energy minimization (CEM)

#### **1. Introduction**

Underwater target detection using images acquired by traditional red-green-blue (RGB) cameras has become increasingly mature: traditional image processing methods [1,2] and deep learning-based target detection algorithms, such as Faster Region-based Convolutional Neural Networks (Faster R-CNN) [3] and You Only Look Once (YOLO) [4], have been widely applied to underwater target detection. In an ideal underwater imaging environment, the detection speed and accuracy of these algorithms can reach a high level of performance. However, traditional RGB image detection technology suffers from a series of problems. When the underwater imaging environment is poor and marine animals deploy their protective coloration, it is difficult to detect and identify targets of interest effectively against the complex background [5,6].

Hyperspectral imaging technology provides a higher spectral resolution than RGB imaging; its band coverage can range from the ultraviolet, visible, and near-infrared to the mid-infrared, providing rich spectral information. Hyperspectral data are generally acquired over hundreds of contiguous narrow spectral bands, which resolves the problems encountered in traditional RGB image detection and gives it a good ability to identify targets and distinguish similar ones. Classical hyperspectral target detection algorithms include the anomaly detector developed by Reed and Xiaoli, known as the RXD algorithm [7], the kernel RXD (KRXD) algorithm [8], the orthogonal subspace projection (OSP) algorithm [9], and the constrained energy minimization (CEM) algorithm [10]. Among them, CEM is a subpixel target detection algorithm that has been shown to be effective and promising when only the target spectrum of interest is known and the background spectrum is unknown, so it is well suited to target detection in a complicated underwater background and environment with insufficient prior knowledge.

At present, only a few studies on hyperspectral underwater target detection are available in the literature, and most of them focus on three aspects. First, the key technologies designed to ensure that a hyperspectral imager can accurately extract key information in the complex marine environment when collecting underwater images differ from system to system. Second, since a hyperspectral image has a large number of spectral bands with very high spectral resolution, it has good target recognition ability; however, this is traded for slow imaging speed, enormous data volume, long transmission cycles, and slow computation, none of which suits the remotely operated vehicle (ROV) platform or real-time underwater target detection [11,12]. Third, hyperspectral underwater target detection technology tends to have strategic significance in both military and economic terms, so the degree of technological openness is extremely limited.

Because of the low imaging speed and long processing time of underwater hyperspectral images, current research on underwater spectral imaging and detection mainly focuses on the detection of underwater pipelines and on the distribution and species detection of underwater plants and microorganisms [13–17], with little real-time detection capability. Some researchers have compiled spectral libraries for the recognition and detection of different underwater targets. Kuniaki Uto et al. [18] classified objects of interest by measuring the average spectral curves of cauliflower and sand and computing the resulting correlation coefficients. Tegdan et al. [19] used a spectral library of known objects of interest to achieve automatic recognition of other objects. An underwater hyperspectral imaging (UHI) system, jointly developed by the Norwegian company Ecotone and Norwegian underwater applied robotics, is an optimized underwater hyperspectral imaging system that can be used for underwater hyperspectral remote sensing and is capable of collecting information across the full color spectrum (370–800 nm).

In this paper, sea cucumbers are selected as our primary underwater targets owing to their contribution to gross domestic product (GDP) growth in Dalian, China. Marine aquaculture is one of Dalian's pillar industries, with an annual output value of more than 3.5 billion US dollars, and sea cucumbers are the seafood products that account for the most revenue. At present, the main methods for harvesting sea cucumbers, abalone, and other sea treasures rely on diver operation and submarine trawl operation. However, diver operation is inefficient, and the deep-sea environment is extremely harmful to the health of divers, while submarine trawl operation generally causes severe damage to the underwater ecological environment. Therefore, autonomous harvesting of seafood using underwater robots has become the most effective solution, and the rapid detection of underwater objects is a key issue that urgently needs to be solved.

Compared to other targets, sea cucumber detection is more difficult and challenging because sea cucumbers have a strong protective coloration mechanism: they are hard to observe from color and texture characteristics when ordinary RGB cameras are used for underwater observation. However, the sea cucumber exhibits relatively obvious reflectance characteristics in some special bands, which is exactly why we use hyperspectral technology to solve this problem.

The methods described above can effectively apply hyperspectral imaging technology to underwater biological classification and detection but cannot achieve real-time detection of underwater targets [20]. If the sensitive bands of the target to be detected can be selected in advance, the image processing speed can be increased to satisfy real-time requirements. Gleason [21] found that the bands at 546, 568, and 589 nm could more easily separate corals and algae from other background objects, so a multi-spectral camera could be constructed from six bands for fast acquisition of images for target detection. Experiments showed that, compared to traditional RGB cameras, the six-band multi-spectral camera performed better at detecting submarine corals. However, the bands used for coral detection in those experiments were obtained as a by-product of other experiments; they are not applicable to other underwater targets and are not universal. Therefore, a reliable BS method needs to be designed that can select representative band subsets for different targets.

Researchers have put forward several effective BS methods. For example, information divergence (ID) selects bands according to the difference between the probability distribution of a measured band and its corresponding Gaussian probability distribution. The maximum-variance principal component analysis (MVPCA) developed in [22] first performs a PCA transformation on the original data and then constructs the loading factor matrix from the obtained eigenvectors and eigenvalues; the priority of a band is determined by the variance of its corresponding loading factor. However, the bands selected by such band prioritization methods are usually highly correlated. By factoring band correlation into consideration, the optimal index factor (OIF) [23] method was developed to find the largest OIF index. Yang et al. [24] proposed a BS method based on linear prediction, which uses linear prediction as a similarity measure to find the next least similar band by sequential forward selection. All of these methods select band subsets according to the characteristics of the data itself and are not designed to select an optimal band subset for a specific target.

For target detection, Yuan et al. [25] proposed a multigraph determinantal point process (MDPP) model to effectively search for discriminative band sets. Wang [26] proposed the multi-band selection (MBS) method, which did not require prioritizing the bands but relied on a specific application to select desired bands. Based on the concept of CEM, Geng [27] proposed a sparse constrained band selection (SCBS), which is convenient for solving the global optimal solution and avoids the complicated subset search process. Wang et al. [28] proposed a new multi-target detection BS method, MinV-BP, which minimized the variance generated by the target of interest to measure the priority of the band.

This paper proposes a real-time detection method for hyperspectral underwater targets based on BS. First of all, to address the large amount of redundant data and the slow acquisition and processing of hyperspectral image data, a BS method called constrained-target OIF band selection (CTOIFBS) is designed by combining MinV-BP [28] and OIF [23] to select an optimal band subset with a strong ability to characterize specific targets. Then, an underwater multi-spectral sensor composed of the selected bands is designed to collect images, overcoming the long transmission time of a complete hyperspectral image. Finally, CEM is used to detect the underwater targets. The proposed CTOIFBS not only extracts a set of bands better suited to specific targets, improving detection performance, but also meets the real-time requirements of underwater image acquisition.

#### **2. Materials and Methods**

#### *2.1. MinV-BP*

The idea of minimum variance band prioritization (MinV-BP) is based on CEM, which was derived from the linearly constrained minimum variance beamformer in digital signal processing. CEM detects signals in a specific direction and minimizes signal interference from other directions, thereby achieving target detection in the image while suppressing the background [10].

Suppose $\{\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N\}$ is a hyperspectral image with $N$ pixels, where $N$ is the total number of pixels in the image. Each pixel $\mathbf{r}_i = (r_{i1}, r_{i2}, \ldots, r_{iL})^T$ is an $L$-dimensional column vector, where $L$ is the number of bands. Define $\mathbf{d}$ as the target spectral signature to be detected, which is known a priori. The purpose of CEM is to design a linear FIR filter $\mathbf{w} = (w_1, w_2, \ldots, w_L)^T$ whose output energy is minimized under the constraint in Equation (1):

$$\mathbf{d}^T \mathbf{w} = \sum_{l=1}^{L} d_l w_l = 1 \tag{1}$$

where $\mathbf{w}$ is the $L$-dimensional column vector of filter coefficients. Let $y_i$ be the output of the FIR filter for the input pixel $\mathbf{r}_i$, defined in Equation (2):

$$y_i = \sum_{l=1}^{L} w_l r_{il} = \mathbf{w}^T \mathbf{r}_i = \mathbf{r}_i^T \mathbf{w} \tag{2}$$

Then, for all inputs $\{\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N\}$, the average energy of the filter output is:

$$E = \frac{1}{N} \sum_{i=1}^{N} y_i^2 = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{r}_i^T \mathbf{w})^T (\mathbf{r}_i^T \mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{w}^T \mathbf{r}_i \mathbf{r}_i^T \mathbf{w} = \mathbf{w}^T \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{r}_i \mathbf{r}_i^T \right) \mathbf{w} = \mathbf{w}^T \mathbf{R} \mathbf{w} \tag{3}$$

where $\mathbf{R} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{r}_i \mathbf{r}_i^T$ is the $L \times L$ sample autocorrelation matrix. CEM can then be expressed as the following linearly constrained optimization problem:

$$\min_{\mathbf{w}} E = \min_{\mathbf{w}} \mathbf{w}^T \mathbf{R} \mathbf{w} \quad \text{s.t.} \quad \mathbf{d}^T \mathbf{w} = 1 \tag{4}$$

By using the Lagrange multiplier method, the optimal solution and CEM error of Equation (4) are obtained as follows:

$$\mathbf{w}_{\mathrm{CEM}} = \frac{\mathbf{R}^{-1} \mathbf{d}}{\mathbf{d}^T \mathbf{R}^{-1} \mathbf{d}} \tag{5}$$

and:

$$\min_{\mathbf{w}} \mathbf{w}^T \mathbf{R} \mathbf{w} = \mathbf{w}_{\mathrm{CEM}}^T \mathbf{R} \, \mathbf{w}_{\mathrm{CEM}} = \left( \mathbf{d}^T \mathbf{R}^{-1} \mathbf{d} \right)^{-1} \tag{6}$$

The CEM filter is obtained from Equation (5):

$$\delta_{\mathrm{CEM}}(\mathbf{r}) = \mathbf{w}_{\mathrm{CEM}}^T \mathbf{r} = \left( \frac{\mathbf{R}^{-1} \mathbf{d}}{\mathbf{d}^T \mathbf{R}^{-1} \mathbf{d}} \right)^T \mathbf{r} = \frac{\mathbf{d}^T \mathbf{R}^{-1} \mathbf{r}}{\mathbf{d}^T \mathbf{R}^{-1} \mathbf{d}} \tag{7}$$

The CEM operator is applied to every pixel in the image, minimizing the output energy caused by unknown signals so that the target $\mathbf{d}$ of interest can be detected.
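Equations (3)–(7) reduce to a few lines of linear algebra. The sketch below is our own vectorized rendition (the small diagonal loading for numerical stability is an addition not present in the text):

```python
import numpy as np

def cem_detector(hsi, d, eps=1e-6):
    """Constrained energy minimization, Equations (3)-(7).

    hsi: (N, L) matrix of pixel spectra; d: (L,) target signature.
    Returns the (N,) detection map d^T R^{-1} r / (d^T R^{-1} d).
    """
    n, l = hsi.shape
    # Sample autocorrelation matrix of Eq. (3); eps is a small diagonal load
    # for numerical stability (our addition, not part of the text).
    R = hsi.T @ hsi / n + eps * np.eye(l)
    rinv_d = np.linalg.solve(R, d)          # R^{-1} d without an explicit inverse
    return (hsi @ rinv_d) / (d @ rinv_d)    # Eq. (7) applied to every pixel
```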

Building on the CEM algorithm, single-band minimum variance band prioritization (MinV-BP) uses the variance generated by the target of interest to measure the priority of each band, thereby finding the bands that best characterize the specific target. Suppose $\{\mathbf{b}_l\}_{l=1}^{L}$ is the band set of a hyperspectral image, where $\mathbf{b}_l = (b_{l1}, b_{l2}, \ldots, b_{lN})^T$ is a column vector representing the image of the $l$-th band, and $\{b_{li}\}_{i=1}^{N}$ is the set of all $N$ pixels in $\mathbf{b}_l$. According to the CEM error derived in Equation (6), MinV-BP is defined as:

$$V(\mathbf{b}_l) = \left( \mathbf{d}_{\mathbf{b}_l}^T \mathbf{R}_{\mathbf{b}_l}^{-1} \mathbf{d}_{\mathbf{b}_l} \right)^{-1} \tag{8}$$

Using Equation (8), MinV-BP obtains a band priority sequence for the target of interest: the smaller the variance, the higher the priority, and the larger the variance, the lower the priority. In short, the advantage of MinV-BP is that the minimum variance criterion gives higher priority to bands with strong target characterization ability. However, when MinV-BP prioritizes the bands, it only considers their ability to represent the target vector and ignores the strong correlation and redundancy between bands. As a result, the high-priority bands in the resulting sequence are largely adjacent bands with strong correlation. How to de-correlate the priority bands and obtain a band set with weak correlation and stronger discrimination ability is therefore the next problem to be solved.
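For a single band, $\mathbf{d}_{\mathbf{b}_l}$ and $\mathbf{R}_{\mathbf{b}_l}$ are scalars, so Equation (8) collapses to a simple per-band ratio; a hedged sketch of the resulting ranking (variable names our own) is:

```python
import numpy as np

def minv_bp_ranking(hsi, d):
    """Rank bands by the single-band CEM error of Equation (8).

    hsi: (N, L) pixel spectra; d: (L,) target signature.
    Returns band indices from highest priority (smallest variance) down.
    """
    r_l = (hsi ** 2).mean(axis=0)        # scalar autocorrelation of each band
    variance = r_l / (d ** 2 + 1e-12)    # (d_l R_l^{-1} d_l)^{-1} per band
    return np.argsort(variance)          # ascending variance = descending priority
```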

#### *2.2. OIF*

Chavez et al. [23] proposed the optimum index factor (OIF) defined as:

$$\mathrm{OIF} = \sum_{i=1}^{L} S_i \Big/ \sum_{i=1}^{L} \sum_{j=i+1}^{L} \left| R_{ij} \right| \tag{9}$$

to evaluate the amount of information in a dataset, where $S_i$ is the standard deviation of the $i$-th band, $R_{ij}$ is the correlation coefficient between bands $i$ and $j$, and $L$ is the total number of bands. The standard deviation represents the amount of image information. OIF is thus the ratio of the amount of information in the band set to the correlation between its bands, with the correlation coefficient defined by:

$$R_{ij} = \frac{S_{ij}^2}{S_i \times S_j} \tag{10}$$

A band subset with a large amount of information and small inter-band correlation can thus be selected. In Equation (10), $S_{ij}^2$ is the covariance of bands $i$ and $j$:

$$S_{ij}^2 = \mathrm{Cov}(i, j) = \frac{1}{N} \sum_{w=1}^{N} (x_{iw} - \bar{x}_i)(y_{jw} - \bar{y}_j) \tag{11}$$

where $x_{iw}$ is the gray value of the $w$-th pixel in the $i$-th band, $\bar{x}_i$ is the mean gray value of the $i$-th band, $y_{jw}$ is the gray value of the $w$-th pixel in the $j$-th band, $\bar{y}_j$ is the mean gray value of the $j$-th band, and $N$ is the number of pixels in a single band.

In other words, for a hyperspectral image containing $L$ bands, the standard deviation of each single-band image and the correlation coefficient matrix of the bands are calculated first, the OIF index of every candidate band subset is then computed, and the optimal band subset is finally selected according to the index value.
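A direct transcription of Equations (9)–(11) is straightforward; the exhaustive search in the second function is only meant for the small candidate sets considered here (function names are our own):

```python
import numpy as np
from itertools import combinations

def oif(bands):
    """Optimum index factor of a band subset, Equations (9)-(11).

    bands: (N, m) matrix whose columns are the m candidate band images.
    """
    s = bands.std(axis=0)                      # per-band standard deviations S_i
    corr = np.corrcoef(bands, rowvar=False)    # correlation coefficients R_ij
    iu = np.triu_indices(bands.shape[1], k=1)  # each pair (i, j), i < j, once
    return s.sum() / np.abs(corr[iu]).sum()

def best_subset_by_oif(image, size=3):
    """Exhaustively score all subsets of a given size; keep the largest OIF."""
    return max(combinations(range(image.shape[1]), size),
               key=lambda idx: oif(image[:, list(idx)]))
```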

#### *2.3. Constrained-Target OIF Band Selection*

Hyperspectral data generally have very high band correlation and data redundancy. To mitigate this problem, a BS method with target constraints, called constrained-target optimum index factor BS (CTOIFBS), is developed in this paper. It first prioritizes all bands by MinV-BP to obtain a band priority sequence; the smaller the variance, the higher the priority of the band and the stronger its ability to represent the target. It then estimates the virtual dimensionality (VD) [10,29–31] to determine the required number of bands, $n_{\mathrm{BS}}$, where VD is defined as the number of spectrally distinct signal sources present in the data that can effectively characterize the hyperspectral data from the perspective of target detection and classification. The first $n$ bands with the highest priorities in the sequence are then clustered into $n_{\mathrm{BS}}$ clusters by the K-means method to remove band correlation, so that the correlation within a cluster is high while the correlation between clusters is low. Finally, one band is selected from each cluster to form a band subset, and the OIF value of that subset is calculated. The band subset with the largest OIF value is selected as the best band subset. The CTOIFBS process is as follows.

#### Algorithm CTOIFBS

Input: Hyperspectral image data **Ω**

Output: The optimal band set $\mathbf{\Omega}^{*}_{n_{\mathrm{BS}}}$


A flowchart implementing CTOIFBS is shown in Figure 1.

**Figure 1.** A flowchart of implementing CTOIFBS (constrained-target optimal index factor band selection).

Using the MinV-BP criterion, a band priority sequence for the target of interest can be obtained, and bands with strong characterization of the target can then be selected from the full sequence. However, a problem remains: the inter-band correlation within this sequence is high. OIF takes two factors into account, the variance and the correlation coefficient, so in theory an optimal band subset with large information content and small inter-band correlation can be obtained by refining the priority sequence with OIF. However, experiments have shown that using OIF alone to process band priority sequences is not effective, since a band subset with high correlation may still be selected. This is because OIF tries to make the standard deviations of the selected bands as large as possible while keeping the correlation coefficients between the bands as small as possible, and it is difficult to optimize both measures at once [15]. Therefore, instead of selecting the first $n$ bands of the priority sequence directly by the OIF index, CTOIFBS performs cluster-based band de-correlation prior to using OIF; that is, the selected candidate bands are divided into several subsets to further reduce band correlation and redundancy. Such cluster-based band de-correlation has two advantages. First, the pre-grouping step reduces the total number of band subsets to be compared, greatly reducing the computational complexity. Second, clustering with K-means in advance effectively removes band redundancy, improving subsequent detection performance.
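Under the assumption that the ranking and OIF helpers sketched in Sections 2.1 and 2.2 are available, the whole CTOIFBS chain (prioritize, cluster, then score one-band-per-cluster combinations by OIF) might be sketched as:

```python
import numpy as np
from itertools import product
from sklearn.cluster import KMeans

def ctoifbs(hsi, d, n_candidates=30, n_bs=5):
    """Sketch of CTOIFBS: rank bands, cluster them, pick one per cluster by OIF.

    hsi: (N, L) pixel spectra; d: (L,) target signature;
    n_bs: number of bands to select (set by virtual dimensionality in the paper).
    """
    top = minv_bp_ranking(hsi, d)[:n_candidates]       # MinV-BP priority bands
    # Cluster the candidate band images so that correlated bands share a cluster.
    km = KMeans(n_clusters=n_bs, n_init=10).fit(hsi[:, top].T)
    clusters = [top[km.labels_ == c] for c in range(n_bs)]
    # Score every one-band-per-cluster combination; keep the largest OIF.
    best = max(product(*clusters), key=lambda idx: oif(hsi[:, list(idx)]))
    return sorted(best)
```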

#### *2.4. Underwater Spectral Imaging System*

Using an underwater spectral camera composed of a best-selected band subset to collect the target image can greatly reduce data redundancy and avoid the long transmission time of a complete hyperspectral image. However, because of the complicated underwater imaging environment on the one hand and the difficulty of finding a proper carrier vehicle on the other, the development of underwater spectral imaging technology still lags far behind that of atmospheric spectral imaging. Therefore, designing a suitable underwater spectral imaging (USI) system is key to realizing the rapid detection of hyperspectral underwater targets.

The core of a spectral imaging system is the optical splitting system. The spectroscopic techniques currently in use are based on dispersion, filtering, and interferometry, and commonly used optical splitting components include gratings, prisms, and various filters. This paper develops a filter-wheel spectral camera to collect spectral images, for several reasons. First, it has a wheel with multiple single band-pass filters to collect spectral information in different bands, which suits the case where only a few bands are needed. Second, a narrow-band filter has a high transmittance, making it suitable for the special lighting conditions under water. Third, the filter combinations can be changed to match different objects. Fourth, this type of camera is much cheaper than the commonly used liquid crystal tunable filter (LCTF) spectral camera.

Therefore, this paper builds an underwater spectral imaging system based on a filter-wheel spectral camera, as shown in Figure 2. Its main components are a FLIR Blackfly S USB3 CCD camera with its corresponding lens, an electric filter wheel, single band-pass filters with wavelengths between 400 and 830 nm at intervals of 10 nm (each with a bandwidth of 14 nm and a cut-off depth of OD3), and a single-chip microcomputer for controlling the camera and filter wheel. All of the above parts are packed in a watertight enclosure. The system uses the electric filter wheel to collect single-band images in different bands and synthesize the target's spectral image, and spectral images of different band subsets can be obtained by replacing the filter combinations on the wheel. It is important to note that the designed spectral filter wheel is not limited to the USI system; it can also be applied with various beam splitters, such as an LCTF, an acousto-optic tunable filter (AOTF), or a spectral filter array (SFA), according to the application scenario and cost.

**Figure 2.** Diagram of the underwater spectral imaging system.

#### **3. Results and Discussion**

The experiments conducted in this section are divided into three parts. The first part validates the performance of CTOIFBS on a real hyperspectral image, i.e., hyperspectral digital imagery collection experiment (HYDICE) data. The second part applies CTOIFBS to real underwater hyperspectral data, using the calibrated image to select a band subset that is then validated on the test image. The third part designs an underwater spectral imaging system to collect band images of underwater targets according to the bands selected by CTOIFBS, verifying the feasibility of the USI system for rapid detection of underwater targets and the superiority of CTOIFBS over other BS methods. For comparison, three BS methods, UBS, MinV-BP, and MinV-BP-OIF, along with the full bands, are evaluated in the experiments, where MinV-BP-OIF uses OIF to directly select the optimal band subset from the first *n* bands selected by MinV-BP. The main difference between CTOIFBS and MinV-BP-OIF is that, prior to calculating the OIF value, CTOIFBS uses the K-means method to divide the first *n* bands selected by MinV-BP into $n_{\mathrm{BS}}$ clusters of low spectral correlation; CTOIFBS then combines one band from each cluster to form a band subset and selects the subset with the largest OIF value. Consequently, the correlation among the bands selected by CTOIFBS is lower than that of MinV-BP-OIF. In addition, the required numbers of bands for the HYDICE data and the real underwater hyperspectral data of sea cucumbers were determined by virtual dimensionality (VD) [10,29] to be six and five, respectively. Finally, visual inspection and quantitative analysis are used to analyze and compare the performance of the various BS methods.

Specifically, a quantitative analysis based on 3D receiver operating characteristic (ROC) analysis, developed in [32,33], was conducted by calculating the area under the curve (AUC) for the 2D ROC curves of (PD, PF), (PD, τ), and (PF, τ) widely used in target detection, where PD and PF denote the detection probability and the false alarm probability defined in [34], respectively, produced by varying the threshold τ from 0 to 1 to binarize the normalized detection result. The AUC values of (PD, PF), (PD, τ), and (PF, τ) measure the overall detection performance, the target detection capability, and the background suppression ability of a detector, respectively. Note that the higher the AUC values of (PD, PF) and (PD, τ), the better the detection performance; conversely, the smaller the AUC value of (PF, τ), the better the background suppression.
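As a rough sketch of how these three AUC values can be computed from a normalized detection map and a binary ground truth (the threshold grid and trapezoidal integration are our own implementation choices):

```python
import numpy as np

def roc_auc_triplet(detection_map, ground_truth, n_thresholds=200):
    """AUCs of the three 2D ROC curves: (PD, PF), (PD, tau), (PF, tau).

    detection_map: detector output normalized to [0, 1];
    ground_truth:  binary target mask of the same shape.
    """
    scores = detection_map.ravel()
    truth = ground_truth.ravel().astype(bool)
    taus = np.linspace(0.0, 1.0, n_thresholds)
    pd = np.array([(scores[truth] >= t).mean() for t in taus])   # detection prob.
    pf = np.array([(scores[~truth] >= t).mean() for t in taus])  # false alarms
    auc_pd_pf = abs(np.trapz(pd, pf))    # overall detection performance
    auc_pd_tau = np.trapz(pd, taus)      # target detectability
    auc_pf_tau = np.trapz(pf, taus)      # background suppression (lower is better)
    return auc_pd_pf, auc_pd_tau, auc_pf_tau
```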

#### *3.1. Real HYDICE Image*

This real HYDICE scene has been widely used in target detection. It has a spatial resolution of 1.56 m and contains 169 spectral bands with a size of 64 × 64 pixels. There are 15 panels divided into five types of targets, **p**1, **p**2, **p**3, **p**4, and **p**5, distributed in rows of three different sizes, 3 × 3 m, 2 × 2 m, and 1 × 1 m, as shown in Figure 3a. Figure 3b shows their precise spatial locations, with pixels in yellow (Y pixels) indicating panel pixels mixed with the background (BKG). In addition, a total of 19 panel pixels, highlighted in red, are the target pixels of interest.

**Figure 3.** (**a**) Hyperspectral digital imagery collection experiment (HYDICE) scene. (**b**) Ground truth map of the 15 panels.

Table 1 shows the band subsets selected by the four BS methods, along with the full bands, for targets **p**1, **p**2, **p**3, **p**4, and **p**5 in the HYDICE image. Unlike UBS, which is independent of the target, the three target-driven BS methods, MinV-BP, MinV-BP-OIF, and CTOIFBS, select different bands for different desired targets. Figure 4 shows the detection results of each target under the different band sets using CEM. Visual inspection shows that detection is best when using the full bands, with the background well suppressed. When using the band sets selected by MinV-BP and UBS, undesired targets respond strongly and are clearly detected; moreover, the UBS results show that the bands selected by UBS suppress the background only weakly. Finally, comparing MinV-BP-OIF and CTOIFBS, CTOIFBS shows a better ability to detect targets along with good background suppression.


**Table 1.** Optimal band subsets selected by four BS (band selection) methods along with full bands.

UBS: uniform band selection; MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection.

**Figure 4.** CEM (constrained energy minimization) detection map results using different band subsets selected by four BS (band selection) methods along with full bands: (**a**) Full bands; (**b**) MinV-BP: minimum variance band priority; (**c**) MinV-BP-OIF: minimum variance band priority with OIF; (**d**) CTOIFBS: constrained-target optimal index factor (OIF) band selection; (**e**) UBS: uniform band selection.

In addition to analyzing the performance of the various BS methods by visual inspection, the experiment also performed quantitative analysis. Table 2 tabulates the AUC values of the four BS methods along with full bands, where the best and worst results are highlighted in red and green, respectively. The higher the AUC value, the better the detection, that is, the better the selected band subset represents the target. As expected, the results using full bands were the best. However, among the four BS methods, CTOIFBS generally outperformed the other three in terms of (PD, PF). To further demonstrate the effectiveness of CTOIFBS, Table 3 ranks the AUC values of (PD, PF) of the various methods. The last row of Table 3 ranks the total target detection capability of the BS methods, where the smaller the value, the better the detection capability of the selected band subset. Among them, full bands scores five, ranking first with the best detection capability. CTOIFBS scores 13, second only to full bands. Although CTOIFBS is slightly inferior to the full bands in detection performance, its transmission and processing times are much lower due to the reduced data dimensionality. In addition, CTOIFBS performed better than MinV-BP, MinV-BP-OIF, and UBS with the same number of selected bands.



**Table 2.** AUC (area under the curve) values of four BS (band selection) methods along with full bands.

MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

**Table 3.** Order of the AUC (area under the curve) values of (PD, PF) of four BS (band selection) methods along with full bands.


MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

#### *3.2. Underwater Hyperspectral Image*

In this section, real hyperspectral data were collected for sea cucumber detection to validate the performance of CTOIFBS. To demonstrate its effectiveness, full bands and three state-of-the-art BS methods, UBS, MinV-BP, and MinV-BP-OIF, are compared in the experiments, where the required number of bands, determined by VD, is five. Finally, detection results and quantitative analysis were used to analyze and compare the performance of the various BS methods. Specifically, quantitative analysis was conducted by the area under the curve (AUC), which is widely used in target detection.

The data used in our experiments were underwater sea cucumber images collected by a hyperspectral imager, covering 256 bands over a spectral range of 0.4 to 1.05 μm. Due to the fast attenuation of infrared wavelengths in water, the sensor could not collect enough information from the infrared bands. Therefore, part of the infrared bands (171–256) were removed, and only bands 1–170, with a spectral coverage of 0.4–0.825 μm, were analyzed in the experiments. Shown in Figure 5a,b are the RGB image of the calibrated data and its corresponding ground truth map, respectively.

**Figure 5.** Sea cucumber data for experiments: (**a**) RGB image of calibrated data; (**b**) ground truth map of calibrated data; (**c**) RGB image of validated data; (**d**) ground truth map of validated data.

We plotted the spectra of five types of ground features, including sea cucumber, sand, pebble, clam, and scallop, from the calibrated data shown in Figure 5a, where the sea cucumber was selected as the target of interest and the other four features as the background. The obtained spectra were used to mark the locations of the spectral bands selected by the four BS methods in Table 4, shown in Figure 6 as red vertical dashed lines, for visual inspection and comparison of the correlation among the selected band sets.



**Table 4.** Band subsets selected by four BS (band selection) methods.

MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

**Figure 6.** Bands selected by four BS (band selection) methods: (**a**) MinV-BP: minimum variance band priority; (**b**) MinV-BP-OIF: minimum variance band priority with OIF; (**c**) CTOIFBS: constrained-target optimal index factor (OIF) band selection; (**d**) UBS: uniform band selection.

On the one hand, compared to MinV-BP and MinV-BP-OIF, CTOIFBS takes the correlation among bands into consideration. As a result, the bands selected by CTOIFBS were more dispersed and contained more spectral information. On the other hand, although the distribution of the bands selected by UBS was more dispersed than those of the other three methods, the detection results were not satisfactory. This is because UBS does not consider the relationship between the target and the selected bands. Consequently, it was unable to select bands pertaining to target information, in contrast to the band set selected by CTOIFBS, which can effectively avoid high correlation between bands and can further characterize the targets of interest.

Table 5 shows the correlation coefficient matrix of the band subset selected by each BS method, where the greater the value between two bands, the higher the correlation between them. Accordingly, a better band subset should have less correlation among its bands. Furthermore, Table 6 shows the mean correlation coefficients among the bands selected by the different BS methods.
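
For reference, the quantities in Tables 5 and 6 can be computed as follows; this is a minimal sketch with assumed array names.

```python
import numpy as np

def band_correlations(cube, band_idx):
    """Correlation coefficient matrix (as in Table 5) and its mean absolute
    off-diagonal value (as in Table 6) for a selected band subset of an
    (H, W, B) image cube."""
    X = cube[:, :, band_idx].reshape(-1, len(band_idx)).T  # (k, N) band matrix
    corr = np.corrcoef(X)
    iu = np.triu_indices(len(band_idx), k=1)               # off-diagonal entries
    return corr, np.abs(corr[iu]).mean()
```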


**Table 5.** Correlation coefficient matrices of the band subset selected by four BS (band selection) methods: (**a**) correlation coefficient matrix of the band subset selected by MinV-BP; (**b**) correlation coefficient matrix of the band subset selected by MinV-BP-OIF; (**c**) correlation coefficient matrix of the band subset selected by CTOIFBS; (**d**) correlation coefficient matrix of the band subset selected by UBS.

**Table 6.** Mean correlation coefficients of four BS (band selection) methods.


MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

From Table 6, it can be seen that, compared to the other two target-constrained BS methods, the mean correlation coefficient of the bands selected by CTOIFBS is the smallest, which validates the advantage of CTOIFBS in reducing the correlation between bands during BS. It is worth noting that although the mean correlation coefficient of the bands selected by UBS is the smallest overall, its detection results were poor due to its inability to select effective bands to characterize the target.

According to the band subsets selected by the different BS methods in Table 4, the corresponding band images of the calibrated data shown in Figure 5a were synthesized. CEM was then used to detect sea cucumbers, and the detection results using full bands and the band subsets selected by the four BS methods are shown in Figure 7. The brighter a pixel in the image, the more likely the detector considers it to be a target. It can also be observed that the target pixels detected with the band set selected by UBS were not obvious and were buried in the background. Furthermore, the AUC values in Table 7 were used to quantitatively analyze the effect of the different BS methods on detection performance, where the best and worst results are highlighted in red and green, respectively. According to the AUC values of (PD, PF), full bands performed best, followed by CTOIFBS, MinV-BP-OIF, MinV-BP, and finally UBS.

**Figure 7.** Detection results of the calibrated data of the RGB image and ground truth map shown in Figure 5a,b by full bands and four BS methods: (**a**) Full bands; (**b**) MinV-BP: minimum variance band priority; (**c**) MinV-BP-OIF: minimum variance band priority with OIF; (**d**) CTOIFBS: constrained-target optimal index factor (OIF) band selection; (**e**) UBS: uniform band selection.


**Table 7.** AUC (area under the curve) values of four BS (band selection) methods along with full bands.

To further validate the effectiveness of CTOIFBS in detecting underwater targets, an additional experimental image was selected for testing the performance of the various BS methods. Figure 8 shows the detection results of sea cucumbers on the test image using the sets of bands selected in Table 4. Table 8 tabulates their AUC values, where the best and worst results are highlighted in red and green, respectively. According to the AUC values of (PD, PF) in Table 8, the detection result of CTOIFBS was higher than those of the other BS methods, MinV-BP, MinV-BP-OIF, and UBS, using the same number of bands. As expected, the CTOIFBS result was second only to that of using full bands. This demonstrates that it is feasible to use the band subset selected by CTOIFBS for underwater target detection.

**Figure 8.** Detection results of the validated data of the RGB image and ground truth map shown in Figure 5c,d by full bands and four BS methods: (**a**) Full bands; (**b**) MinV-BP: minimum variance band priority; (**c**) MinV-BP-OIF: minimum variance band priority with OIF; (**d**) CTOIFBS: constrained-target optimal index factor (OIF) band selection; (**e**) UBS: uniform band selection.



**Table 8.** AUC (area under the curve) values of four BS (band selection) methods along with full bands.

MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

The above real sea cucumber image experiments also proved that it is feasible to use the band subset selected by CTOIFBS for underwater target detection. Although the detection result of CTOIFBS is slightly worse than that of using full bands, the acquisition and transmission speeds are considerably faster because fewer bands are used and a smaller amount of image data is processed. Table 9 shows the detection speeds of using full bands and CTOIFBS under the same experimental environment.

**Table 9.** Comparison of the average speed of two methods for detecting a single image.


CTOIFBS: constrained-target optimal index factor (OIF) band selection.

From Table 9, the process of using full bands consumed a great deal of time in imaging, transmission, and processing. Under the effect of water flow, target movement, and other factors, a USI system needs to detect the target quickly; obviously, a USI system using full bands cannot meet the requirement for rapid detection of an underwater target. In addition, studies have found that using full bands may incur the Hughes phenomenon [35], that is, high dimensionality may decrease the detection accuracy. Furthermore, the experiments demonstrated that the detection results of CTOIFBS can be very close to those obtained using the full bands. All things considered, a USI system with full bands is not suitable for rapid underwater target detection.

#### *3.3. Underwater Spectral Imaging System*

To verify that the target spectral data collected by the constructed underwater spectral imaging (USI) system can accurately detect underwater targets, two experiments were set up in this section. The first experiment compared the hyperspectral data using the selected band subset to the multi-spectral data collected by the USI system using the same band subset under similar scenes, to prove that the multi-spectral data collected by the USI system have feature expression capability consistent with the hyperspectral images. The second experiment, conducted under the same scenes, compared the detection performance of the data collected by the USI system using different BS methods to verify the detection capability of CTOIFBS.

#### 3.3.1. First Experiment: Compatibility of USI to HSI

To show that the multi-spectral data collected by the USI system have the same feature expression ability as the hyperspectral images, this experiment collected hyperspectral data and the filter-band images corresponding to the band subset selected by CTOIFBS in similar scenes. Because the bands selected by CTOIFBS are 470, 480, 500, 540, and 830 nm, the corresponding band images were extracted from the hyperspectral data to form a band subset for subsequent target detection. Figure 9 shows the images collected by the two methods and their corresponding sea cucumber detection results in similar scenes.

**Figure 9.** Images collected by two methods and corresponding detection map results in similar scenes. HSI-01 (**a**) and HSI-02 (**c**) are hyperspectral images; USI-01 (**b**) and USI-02 (**d**) are images collected by the USI system; (**e**–**h**) are the detection results corresponding to (**a**–**d**), respectively.

According to the detection results, both methods are capable of detecting sea cucumbers. Regarding the suppression of non-target pixels, although the image extracted from the HSI data can suppress the main background, which is sand, it has a high response to interference targets, such as stones and clams. By contrast, the data collected by the USI system can suppress non-target pixels more effectively. From the AUC values of (PD, PF) in Table 10, the AUC value obtained using the data collected by the USI system is higher than that using the HSI data, indicating a higher target detection ability. Of course, due to the difference in the performance of the sensors used by the two methods, this experiment may not provide sufficient evidence to conclude that the detection results based on the data collected by the USI system must be better than those using the corresponding bands of the HSI. Nevertheless, it does demonstrate that the data collected by the USI system have the same feature expression ability as the hyperspectral images and can be used for underwater spectral data collection and target detection.


**Table 10.** AUC (area under the curve) values for CEM (constrained energy minimization) detection map results using four images shown in Figure 9.

#### 3.3.2. Second Experiment: USI System using CTOIFBS

This section uses the data collected by the USI system to compare the performance of CTOIFBS with the other three BS methods, MinV-BP, MinV-BP-OIF, and UBS, with their corresponding band subsets tabulated in Table 11. The single-band images were then collected by the USI system, as shown in Figure 10. Finally, the collected single-band images were integrated into multi-spectral image cubes for target detection. It should be noted that the multi-spectral image data constructed from single-band images have a spectral resolution of approximately 10 nm, and thus the filters actually used were rounded to the nearest 10 nm.


**Table 11.** Band subsets selected by four BS (band selection) methods.

MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

**Figure 10.** Image acquisition with different bands: (**a**) 470; (**b**) 490; (**c**) 510; (**d**) 540; (**e**) 570; (**f**) 650; (**g**) 740; (**h**) 820 nm.

CEM was used to detect the sea cucumbers in the composite image of each band subset. The detection results corresponding to each method are shown in Figure 11.

**Figure 11.** Detection map results using four BS (band selection) methods: (**a**) original image; (**b**) MinV-BP: minimum variance band priority; (**c**) MinV-BP-OIF: minimum variance band priority with OIF; (**d**) CTOIFBS: constrained-target optimal index factor (OIF) band selection; (**e**) UBS: uniform band selection.

The detection results shown in Figure 11 illustrate that when the set of bands selected by CTOIFBS was used to detect sea cucumbers, non-target pixels could be removed more effectively than with the other BS methods. On the contrary, MinV-BP and MinV-BP-OIF had a poor ability to distinguish the targets from the background, and the response to non-target pixels was also high when the target was detected. Table 12 shows the AUC values of the detection, where the best and worst results are again highlighted in red and green. According to the AUC values of (PD, PF) in Table 12, UBS has the worst performance on all four test images, which shows that target-constrained BS methods are more conducive to target detection. Furthermore, except for image USI-06, the AUC value of CTOIFBS is the highest. This proves that, compared to the other target-constrained BS methods, MinV-BP and MinV-BP-OIF, CTOIFBS has a better ability to characterize targets.


**Table 12.** AUC (area under the curve) values of detection using four BS (band selection) methods.

MinV-BP: minimum variance band priority; MinV-BP-OIF: minimum variance band priority with OIF; CTOIFBS: constrained-target optimal index factor (OIF) band selection; UBS: uniform band selection.

#### **4. Conclusions**

Hyperspectral imaging technology has the advantages of high spectral resolution and abundant spectral information. Its application to underwater object detection can help overcome the problems of a poor underwater imaging environment and a complex background. Fast detection of underwater hyperspectral targets can be achieved by CTOIFBS while retaining crucial spectral information; in the meantime, CTOIFBS also overcomes the imaging and processing speed problems. Experiments show that the detection performance of the band subset selected by CTOIFBS is better than that of other BS methods.

**Author Contributions:** Conceptualization, X.F. and M.S.; formal analysis, H.Y.; methodology, X.S. (Xiaodi Shang) and X.S. (Xudong Sun); writing—original draft, X.S. (Xiaodi Shang) and X.S. (Xudong Sun); writing—review and editing, C.-I.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Nature Science Foundation of China, grant number 61601077, 61971082, 61890964; Fundamental Research Funds for the Central Universities, grant number 3132019341; State Administration of Foreign Experts Affairs, grant number ZD20180073.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Attention-Based Spatial and Spectral Network with PCA-Guided Self-Supervised Feature Extraction for Change Detection in Hyperspectral Images**

**Zhao Wang 1, Fenlong Jiang 1, Tongfei Liu 1, Fei Xie 2,\* and Peng Li <sup>1</sup>**

Received: 13 October 2021; Accepted: 30 November 2021; Published: 4 December 2021


**Abstract:** Joint analysis of spatial and spectral features has always been an important method for change detection in hyperspectral images. However, many existing methods cannot extract effective spatial features from the data itself. Moreover, when combining spatial and spectral features, a rough uniform global combination ratio is usually required. To address these problems, in this paper, we propose a novel attention-based spatial and spectral network with PCA-guided self-supervised feature extraction mechanism to detect changes in hyperspectral images. The whole framework is divided into two steps. First, a self-supervised mapping from each patch of the difference map to the principal components of the central pixel of each patch is established. By using the multilayer convolutional neural network, the main spatial features of differences can be extracted. In the second step, the attention mechanism is introduced. Specifically, the weighting factor between the spatial and spectral features of each pixel is adaptively calculated from the concatenated spatial and spectral features. Then, the calculated factor is applied proportionally to the corresponding features. Finally, by the joint analysis of the weighted spatial and spectral features, the change status of pixels in different positions can be obtained. Experimental results on several real hyperspectral change detection data sets show the effectiveness and advancement of the proposed method.

**Keywords:** hyperspectral images; change detection; self-supervised learning; attention mechanism

#### **1. Introduction**

Change detection (CD) has been a popular research topic and application in the field of remote sensing in recent years; it aims to acquire change information from multitemporal images of the same geographical area. The change information is vital in many applications, such as disaster detection and assessment [1], environmental governance [2], ecosystem monitoring [3], urban sustainable development [4,5], etc.

With the advances in sensing and imaging technology, hyperspectral images (HSIs) have attracted increasing attention and been widely utilized in earth observation applications [4,6]. Some characteristics of HSIs should be noticed: unlike multispectral images and SAR images, HSIs typically have hundreds of spectral bands, and this rich spectral information helps detect finer changes for CD. Although HSIs bring some key advantages, redundant spectral bands may introduce interference information, as adjacent bands, which are continuously measured by the hyperspectral sensor, have similar spectral values [4]. Moreover, the high-dimensional spectral bands also lead to a significant increase in the storage and computational complexity of HSI processing and analysis [7]. In addition, for HSIs, spatial feature extraction is more challenging than for multispectral images because of the serious mixed-pixel phenomenon caused by low spatial resolution [8]. Furthermore, it is very difficult to obtain enough labeled training samples in HSI analysis.


In view of the characteristics of HSIs, many approaches have been proposed for CD in HSIs. These methods can be mainly summarized into two categories:

(1) The first is to directly use spectral features to obtain change information from multitemporal HSIs. For example, Liu et al. promoted a sequential spectral change vector analysis to detect multiple changes for HSIs [9], which employs an adaptive spectral change vector representation to identify changes. Liu et al. employed spectral change information to detect change classes, achieving unsupervised HSI change detection [10]. Different from the common approach of reducing or selecting bands to reduce band redundancy for CD in HSIs, in [11], the change information of each band is utilized to construct hyperspectral change vectors for detecting multiple types of change. Recently, a general end-to-end convolutional neural network (CNN), named GETNET, was proposed for HSI CD in [6], which introduces an unmixing-based subpixel representation to fuse multisource information. The performance of these methods is often hindered as they usually utilize change vector analysis of spectral features to directly generate the change magnitude between multitemporal HSIs.

(2) However, using only spectral features is bound to ignore spatial contextual information [12]; joint spatial-spectral analysis is therefore a common technical means in HSI-based tasks [13–17]. Thus, the second is to obtain changes and improve detection accuracy through joint analysis of the spectral and spatial features of HSIs. For instance, Wu et al. first stacked multitemporal HSIs, and then the local spatial information around each pixel was represented through joint sparse representation for hyperspectral anomalous CD [18]. Recently, a CD approach based on multiple morphological profiles was proposed for HSIs [19]. This approach employed multiple morphological profiles to extract spatial information, and then a spectral angle weighted-based local absolute distance and an absolute distance were used to obtain changes. In addition, some deep learning-based techniques can help improve the performance of CD due to their ability to effectively capture and fuse spectral and spatial features. A recurrent 3D fully convolutional network was designed to capture spectral-spatial features of HSIs simultaneously for CD in [12]. Zhan et al. promoted a three-directions spectral-spatial convolution neural network (TDSSC) in [20], which can capture representative spectral-spatial features by concatenating the features of the spectral direction and two spatial directions, thus improving detection performance. Such methods usually weight spatial and spectral features equally to conduct joint analysis and classification and have achieved good performance, but they usually share the following common problems: (1) many of them cannot extract effective spatial features from the data itself; (2) when combining spatial and spectral features, a rough uniform global combination ratio is usually required for all pixels.


To address the two problems mentioned above, in this paper, we propose an attention-based spatial and spectral network with PCA-guided self-supervised feature extraction for CD in HSIs. The whole framework consists of two parts. In the first part, a PCA-guided self-supervised spatial feature extraction network is devised to extract spatial differential features. Concretely, the two HSIs are compared to generate a difference map (DM) first. Then, principal component analysis is utilized to obtain the transferred image that contains only several principal components. Afterwards, a mapping is established from the image patch, i.e., a neighborhood of a certain size for each pixel in the DM, to the corresponding principal component vector in the transferred image, from which the targeted spatial differential features can be extracted. Finally, the extracted spatial features can be used in the subsequent joint analysis combined with the spectral features. In the whole process, no additional supervisory information is involved, and the training data comes only from the processing of the data itself, which has recently been categorized as a self-supervised learning task [21–23]. These methods mine useful supervisory information from the data itself and can obtain performance no weaker than external supervised learning. Besides, the designed mapping relationship can make the extracted spatial features more distinctive. In the second part, we propose an attention-based spatial and spectral CD network. Different from the above-mentioned methods, the attention mechanism [24–26] is introduced to balance the spatial and spectral features adaptively. Specifically, the spatial and spectral features are first combined directly to calculate a weight factor for the corresponding pixel via several fully-connected layers. After that, the calculated factor is applied to weight the two features. Finally, by combining the weighted spatial and spectral features, the final change status of each pixel can be inferred. The introduction of the attention mechanism enables the network to calculate its own weight factor for the spatial and spectral features of each pixel, which avoids multiple trials to select the optimal factor and allows for more detailed detection of changes. To improve the network performance and the detection effect, a few ground truth labels are used for semi-supervised training of the detection network. Experiments on several real data sets show the effectiveness and advancement of our algorithm. The main contributions of our work are summarized below:


The rest of this paper is organized as follows. Related works are presented in Section 2. Section 3 describes the proposed ASSCDN in detail. In Section 4, experiments and analysis based on three pairs of HSI datasets are presented and discussed. Finally, the conclusion is provided in Section 5.

#### **2. Related Works**

#### *2.1. Traditional CD Methods*

During the past few decades, many CD methods have been proposed and applied in practical applications [27,28]. In the early development of CD, two main steps were usually required to realize CD: measuring the difference image (DI) and obtaining the change detection map (CDM). Many techniques are commonly used to measure the DI, such as image difference [29], image log-ratio [30], change vector analysis (CVA) [29,31], etc. Generally, these approaches calculate the change magnitude of bi-temporal images by the distance between two pixels. Afterwards, the methods widely used to generate the CDM are threshold segmentation techniques (OTSU [32], expectation maximization [33]) or clustering and classification algorithms (k-means [34], fuzzy c-means [35], k-nearest neighbors (KNN) [36], and support vector machines (SVM) [37]). With the development of CD technology, some methods have been further promoted to improve the detection performance. For example, Zhuang et al. combined the spectral angle mapper and change vector analysis for CD of multispectral images [38]. Thonfeld et al. proposed a robust change vector analysis (RCVA) [39] approach for multi-sensor satellite image CD. In addition to the above methods, some techniques are also helpful to improve the performance of CD, such as principal component analysis (PCA) [34,40], level set [41,42], Markov field [43,44], etc. However, these approaches rely significantly on the quality of hand-crafted features to measure the similarity between bi-temporal images.

#### *2.2. Deep Learning-Based CD Methods*

In recent years, with the booming development and wide application of deep learning technology in the field of computer vision, many scholars have extended this technology to remote sensing image CD. According to different manners of supervision, we place these deep learning-based CD approaches into three groups [28,45]: supervised CD, unsupervised CD, and semi-supervised CD.

(1) Supervised CD. This kind of method is commonly used in CD, which refers to the method of using artificially labeled samples in model training to realize supervised learning. For instance, in the early stage, Gong et al. designed a deep neural network for synthetic aperture radar (SAR) images CD, which can perform feature learning and generate CDM by supervised learning [46]. Zhang et al. recently promoted a deeply supervised image fusion network for CD, which devises a difference discrimination network to obtain CDM of bi-temporal images through deeply supervised learning [47]. Other methods are available in [48,49]. Although these supervised CD approaches can achieve acceptable performance for CD, manually labeled data is expensive and time consuming, and the quality of the manually labeled data has a significant impact on the performance of the model.

(2) Unsupervised CD. In addition to supervised learning-based CD approaches, unsupervised CD approaches have received much attention, as they can acquire the CDM directly without the need for manually labeled data. In recent years, many studies have been proposed for unsupervised CD. For example, Saha et al. designed an unsupervised deep change vector analysis (DCVA) method based on a pretrained CNN for multiple CD [50]; an unsupervised deep slow feature analysis (DSFA) was proposed based on two symmetric deep networks for multitemporal remote sensing images in [51], which can effectively enhance the separability of changed and unchanged pixels by slow feature analysis. Moreover, other unsupervised change detection methods are available in [52–55]. However, at present, the unsupervised CD method is difficult to promote for practical application because unsupervised CD approaches rely heavily on migrating features from data sources with different distributions, resulting in poor robustness and unreliable results.

(3) Semi-supervised CD. To overcome the limitations of supervised and unsupervised CD methods to a certain extent, semi-supervised learning approaches have been further developed for CD. In semi-supervised CD, in addition to a small amount of labeled data, unlabeled data are also effectively used to achieve semi-supervised learning and thus obtain the CDM. For example, Jiang et al. proposed a semi-supervised CD method, which extracts discriminative features by using unlabeled data and limited labeled samples [56]. In [57], a semi-supervised CNN based on a generative adversarial network was proposed, which can employ two discriminators to enhance the feature distribution consistency between the labeled and unlabeled data for CD. These semi-supervised CD methods significantly reduce the dependence on a large amount of labeled data while maintaining the performance of the model to a certain extent. However, unlabeled data may cause some interference to network training due to their unreliability, so developing reliable methods to apply unlabeled data is a crucial procedure in semi-supervised learning.

#### **3. Proposed Method**

In order to effectively detect changes based on the joint spatial and spectral features of HSIs, in this paper, we propose a novel self-supervised feature extraction and attention-based CD framework, as shown in Figure 1. From the figure, it can be seen that the entire framework is divided into two steps. In the first step, the PCA-guided self-supervised spatial feature extraction network is designed, which can extract the most important change feature representation in each difference patch. In the second step, in order to effectively combine the extracted spatial and spectral features, the attention mechanism is introduced into the spatial and spectral CD network, which can adaptively learn a matching ratio for the spatial and spectral features of each patch, highlighting the features most conducive to detecting changes. Below, we introduce the proposed framework in detail.

**Figure 1.** Framework of the proposed ASSCDN. The first step is the PCA-guided self-supervised spatial feature extraction network. The second step combines the spectral and spatial features by introducing an attention mechanism and obtains the final class.

#### *3.1. Data Preparation*

#### 3.1.1. Data Preprocessing

Before comparing and analyzing the target HSIs, as the original HSIs usually contain noise and interference channels caused by atmospheric and water vapor scattering, it is often necessary to perform preprocessing on the original images, such as dead pixel repair, strip removal, and atmospheric correction. In addition, as change detection requires joint analysis of the two images, unaligned pixels will cause more false detections, so joint registration of the two images is also essential.

#### 3.1.2. Training Data Generation

Directly analyzing the difference image to obtain the final change map is a common method, since it allows the differences to be analyzed more directly and specifically. In addition, considering the lack of labeled data for HSIs, analysis based on a certain-sized neighborhood of each pixel, i.e., a small patch, can often improve the reliability of change detection. After comprehensive consideration, we select the small patch centered on each pixel in the difference map of the two HSIs as the processing unit. Formally, let *I*<sup>1</sup> and *I*<sup>2</sup> represent the two HSIs of size *H* × *W* × *C* to be detected, where *H*, *W*, and *C* represent the height, width, and number of spectral bands of the images, respectively. First, by comparing the two images, a difference map DM can be generated, i.e.,

$$\text{DM} = |I\_1 - I\_2|. \tag{1}$$

Then, by cutting the pixel-by-pixel neighborhoods of the DM, a total of *H* × *W* patches of size *P* × *P* × *C* can be obtained as the input for CD, where *P* is the patch size.
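
A minimal sketch of Equation (1) and the patch generation is given below; edge padding at the image borders is our assumption, since the paper does not state its boundary handling.

```python
import numpy as np

def make_patches(I1, I2, P):
    """Difference map DM = |I1 - I2| (Eq. 1) and the H*W patches of size
    P x P x C centered on each pixel."""
    DM = np.abs(I1 - I2)                               # (H, W, C)
    r = P // 2
    # Edge-pad so every pixel, including border pixels, has a full neighborhood.
    padded = np.pad(DM, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W, C = DM.shape
    patches = np.empty((H * W, P, P, C), dtype=DM.dtype)
    for i in range(H):
        for j in range(W):
            patches[i * W + j] = padded[i:i + P, j:j + P]
    return DM, patches
```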

#### 3.1.3. Principal Component Analysis (PCA) for DM

Principal component analysis (PCA) is a popular dimensionality reduction machine learning technique, which has been widely used in change detection due to its simplicity, robustness, and effectiveness. For the DM, the PCA technique can transform the image into an orthogonal space with larger data variance, where the data can be represented by fewer dimensional features with almost no information loss, consequently finding the most expressive difference representation. Formally, for the DM data matrix **D**, which has *H* × *W* samples of *M*-dimensional features (with *M* = *C*), the transformed data can be calculated by

$$\mathbf{D}' = \mathbf{P} \mathbf{D},\tag{2}$$

where **P** is the transposed eigenvector matrix sorted according to the eigenvalues of the covariance matrix **C** of **D**. That is, **P** satisfies the following equation:

$$\mathbf{P}^{\top}\mathbf{C}\mathbf{P} = \begin{bmatrix} \lambda\_1 \\ & \lambda\_2 \\ & & \ddots \\ & & & \lambda\_M \end{bmatrix},\tag{3}$$

where {*λ*1, *λ*2, ··· , *λM*} are *M* eigenvalues of **C**, which satisfies *λ*<sup>1</sup> ≥ *λ*<sup>2</sup> ≥···≥ *λM*.

In this way, the original data can be transformed into a new feature space, and the former *K*-dimension features can contain most of the information. The data after dimensionality reduction can be expressed as

$$\tilde{\mathbf{D}} = \mathbf{T}\mathbf{D},\tag{4}$$

where **T** is the matrix consisting of the first *K* rows of **P**, i.e., the eigenbasis vectors associated with the *K* largest eigenvalues. Then, the obtained **D˜** can be reshaped as the dimension-reduced difference map DM*PCA*.
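
The PCA step of Equations (2)–(4) can be sketched as follows; mean-centering is added as standard PCA practice, although the paper does not mention it explicitly.

```python
import numpy as np

def pca_dm(DM, K):
    """Project each C-dimensional pixel of the difference map onto its
    first K principal components (Eqs. 2-4), returning DM_PCA."""
    H, W, C = DM.shape
    D = DM.reshape(-1, C)                   # (H*W, C) samples
    Dc = D - D.mean(axis=0)                 # center before the covariance estimate
    cov = np.cov(Dc, rowvar=False)          # (C, C) covariance matrix C
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    T = vecs[:, ::-1][:, :K].T              # first K rows of the sorted matrix P
    return (T @ Dc.T).T.reshape(H, W, K)    # reshape back to the image grid
```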

#### *3.2. PCA-Guided Self-Supervised Spatial Feature Extraction*

When the data are ready, they can be fed into the designed framework for change detection. We first extract spatial features based on these patches. As DM*PCA* contains several major differential features, we expect to establish a mapping relationship from each patch to the several principal components of its central pixel. To this end, we propose a PCA-guided spatial feature extraction network (PCASFEN), which is supposed to learn, from the neighborhood information, the spatial features that express the most dominant characteristics of the central pixel. No artificially labeled samples are involved in the whole learning process; the supervisory information is obtained entirely by the transformation of the data itself, which is in fact a self-supervised task. Specifically, given a patch of size *P* × *P* × *C*, several convolutional layers are used to extract deep spatial features. In this process, no pooling layer is used, mainly considering that the patch size is usually small and pooling may lose more spatial details. In addition, batch normalization is adopted to prevent distribution drift and thus ensure the stability of training. After the feature extraction, in order to ensure the same spatial and spectral dimensions in the joint spatial and spectral analysis, the processed features are flattened and transformed into a *C*-dimensional vector with the same feature dimension as the input via a fully-connected layer. Finally, after several fully-connected layers of processing, the output is a vector of *K* dimensions, which is regression-fitted to the principal component features of the central pixel of the patch.
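
A possible PyTorch sketch of PCASFEN is given below; the number of convolutional layers and their widths are our assumptions, since the paper does not list the exact configuration.

```python
import torch
import torch.nn as nn

class PCASFEN(nn.Module):
    """Sketch of the PCA-guided spatial feature extraction network:
    convolutions without pooling, batch normalization, a C-dimensional
    feature layer, and a K-dimensional regression head. Layer counts and
    widths are assumptions, not the authors' exact configuration."""
    def __init__(self, C, K, P):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(C, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        )
        self.to_feat = nn.Linear(32 * P * P, C)   # C-dim spatial feature F_spa
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(C, K))

    def forward(self, patch):                     # patch: (N, C, P, P)
        x = self.conv(patch).flatten(1)
        feat = self.to_feat(x)                    # reused later for joint analysis
        return feat, self.head(feat)              # K-dim PCA regression output
```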

#### *3.3. Attention-Based Spatial and Spectral Network*

At this point, we have obtained the spatial and spectral features representing each pixel in the DM. Joint analysis of spatial and spectral features is a common method in change detection tasks because it can comprehensively analyze data from spatial and spectral perspectives, thus reducing isolated noise points and improving detection robustness. Generally speaking, to better balance these two features, a weighting factor *γ* ∈ [0, 1] is often used. The fused feature *F* of a pixel can be represented as

$$F = \left[ \gamma F\_{spa}, (1 - \gamma) F\_{spe} \right]. \tag{5}$$

It can be seen that *γ* is a very important parameter, which determines which of the spatial and spectral features contributes more to the final CD result. In most methods, a suitable *γ* usually requires multiple experiments to obtain, which undoubtedly greatly increases the actual cost of use. In addition, *γ* is eventually set globally for all pixels in the image, but in fact, the spatial and spectral features of different pixels contribute differently to their change status. Inspired by the attention mechanism, we propose an attention-based spatial and spectral change detection network (ASSCDN). Concretely, given the spatial feature *Fspa* <sup>∈</sup> <sup>R</sup>*<sup>C</sup>* and the spectral feature *Fspe* <sup>∈</sup> <sup>R</sup>*<sup>C</sup>* of the *n*-th pixel in the DM, they are first concatenated as *Fn* <sup>∈</sup> <sup>R</sup>2*C*, where *n* = 1, 2, ··· , *H* × *W*. Then, *Fn* is fed into a fully-connected layer to calculate the *γ<sup>n</sup>* for the corresponding pixel only, which can be expressed as

$$\gamma\_n = \sigma(wF\_n + b) = \frac{1}{1 + e^{-(wF\_n + b)}},\tag{6}$$

where *σ* is the *Sigmoid* activation function which can ensure that *γ<sup>n</sup>* is between 0 and 1, and *w* and *b* represent the weight and bias of the fully-connected layer, respectively. Then, *Fspa* and *Fspe* are weighted by multiplying *γ<sup>n</sup>* and 1 − *γn*, respectively. At this time, the weighted *Fspa* and *Fspe* can be concatenated into a new feature, represented as

$$F\_n' = \left[ \gamma\_n F\_{spa}, (1 - \gamma\_n) F\_{spe} \right]. \tag{7}$$

Finally, the obtained features can be input into several fully-connected layers for classification to obtain the final change status.
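
The attention weighting of Equations (5)–(7) can be sketched in PyTorch as follows; the classifier head is an assumed stand-in for the "several fully-connected layers" mentioned above.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of the per-pixel attention weighting of Eqs. (5)-(7): a
    fully-connected layer plus sigmoid maps [F_spa, F_spe] to gamma_n,
    and the weighted features are re-concatenated for classification."""
    def __init__(self, C, n_classes=2):
        super().__init__()
        self.gate = nn.Linear(2 * C, 1)           # w, b of Eq. (6)
        self.classifier = nn.Sequential(          # assumed classifier head
            nn.Linear(2 * C, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, f_spa, f_spe):              # each (N, C)
        gamma = torch.sigmoid(self.gate(torch.cat([f_spa, f_spe], dim=1)))
        fused = torch.cat([gamma * f_spa, (1 - gamma) * f_spe], dim=1)
        return self.classifier(fused)             # change / no-change logits
```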

#### *3.4. Training and Testing Process*

#### 3.4.1. Training and Testing PCASFEN

As PCASFEN establishes a regression mapping from the patch to the principal component features of its central pixel, the mean square error (*MSE*) function is adopted as the loss for training PCASFEN. Given the input patch and feature pairs, training PCASFEN can be seen as minimizing the *MSE* loss *LMSE* between the output *K*-dimensional vector *v*ˆ and the target principal component features *v*. *LMSE* can be represented as

$$L\_{MSE} = \frac{1}{N} \sum\_{i=1}^{N} \left( v\_i - \hat{v}\_i \right)^2,\tag{8}$$

where *N* is the mini-batch size. Here, the stochastic gradient descent (SGD) optimizer is adopted to reduce the loss and update the network parameters. After training for several epochs, *LMSE* converges, and the *C*-dimensional spatial features of each pixel neighborhood extracted by the network can then be used for the subsequent joint spatial and spectral analysis.
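
A minimal training sketch for this step, reusing the PCASFEN class sketched above: the momentum (0.5), weight decay (0.001), and batch size (32) follow the implementation details in Section 4.2.3, while the learning rate, epoch count, and stand-in data are our assumptions.

```python
import torch

# Stand-in training data (assumed shapes): difference patches and the
# K principal components of each patch's central pixel.
patches = torch.randn(1000, 198, 15, 15)   # (N, C, P, P); C=198 as in the River dataset
v = torch.randn(1000, 4)                   # (N, K); K=4 is an assumption

model = PCASFEN(C=198, K=4, P=15)          # the PCASFEN sketch from Section 3.2
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5, weight_decay=1e-3)
loss_fn = torch.nn.MSELoss()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(patches, v), batch_size=32, shuffle=True)

for epoch in range(50):                    # epoch count is an assumption
    for xb, vb in loader:
        opt.zero_grad()
        _, v_hat = model(xb)               # K-dim regression output
        loss_fn(v_hat, vb).backward()      # minimize L_MSE of Eq. (8)
        opt.step()
```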

#### 3.4.2. Training and Testing ASSCDN

ASSCDN establishes the mapping from the spatial features combined with the spectral features of each pixel to the final change status, which is a classification task. Therefore, the cross-entropy loss function *LCE* is employed to guide parameter updating. *LCE* can be represented as

$$L\_{CE} = -\sum y \log(\hat{y}),\tag{9}$$

where *y* and *y*ˆ are the ground truth label to be fitted and the output of the network, respectively. Similarly, the SGD optimizer is used to optimize ASSCDN. Due to the effectiveness of the extracted features, only a very small number of labeled samples is enough for training. Here, we use random selection from the reference CD map to simulate this process. The number of samples selected is discussed in detail in the next section. After several rounds of training, the spectral features and the spatial features extracted by PCASFEN for each pixel can be directly input to the well-trained ASSCDN to obtain the change category of that pixel, thus generating the final change map.

#### **4. Experiments and Analysis**

In this section, the experimental datasets are firstly described. Then, the experimental settings, including comparative methods and evaluation metrics are illustrated. Subsequently, the effects of different components in the proposed ASSCDN method on the detection performance are studied and analyzed. Finally, experimental results are presented and discussed in detail.

#### *4.1. Dataset Descriptions*

To evaluate the effectiveness of the proposed ASSCDN approach, three groups of HSIs are used in the experiments. These datasets are described as follows.

The first and second datasets are the Santa Barbara dataset and the Bay Area dataset, which were released in [58]. As shown in Figures 2 and 3, these datasets were captured by the AVIRIS sensor and both have 224 spectral bands. In the Santa Barbara dataset, Figure 2a,b was acquired over the Santa Barbara region, California, in 2013 and 2015, respectively. The images have a spatial resolution of 30 m/pixel and a size of 984 × 740 pixels. As presented in Figure 3a,b, in the Bay Area dataset, the two HSIs were collected over the city of Patterson, California, in 2007 and 2015, respectively. These images have a size of 600 × 500 pixels and a spatial resolution of 30 m/pixel. Besides, the reference images of the two datasets, obtained by manual interpretation, are shown in Figures 2c and 3c, respectively.

**Figure 2.** Barbara dataset: (**a**) *T*1-time image, (**b**) *T*2-time image, and (**c**) reference image. (Notation: gray color, white color, and black color denote unchanged pixels, changed pixels, and uninteresting pixels, respectively).

**Figure 3.** Bay dataset: (**a**) *T*1-time image, (**b**) *T*2-time image, and (**c**) reference image. (Notation: gray color, white color, and black color denote unchanged pixels, changed pixels, and uninteresting pixels, respectively).

The third dataset is the River dataset, which was published in [6], as shown in Figure 4. Figure 4a,b was acquired by the Earth Observing-1 (EO-1) Hyperion sensor on 3 May 2013 and 31 December 2013, respectively; the images contain a total of 242 spectral bands and depict a river area in Jiangsu Province, China. In the River dataset, 198 bands are employed, and the images have a size of 463 × 241 pixels and a spatial resolution of 30 m/pixel. In addition, Figure 4c provides a reference image, which was obtained by manual interpretation.

**Figure 4.** River dataset: (**a**) *T*1-time image, (**b**) *T*2-time image, and (**c**) reference image. (Notation: white color and black color denote changed pixels and unchanged pixels, respectively).

#### *4.2. Experimental Settings*

#### 4.2.1. Evaluation Metrics

To quantitatively evaluate the accuracy of the proposed ASSCDN approach, three commonly used comprehensive evaluation metrics are selected [56,59,60], including overall accuracy (OA), F1-score (F1), and the kappa coefficient (KC). Here, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) are first counted from the confusion matrix of the detection results, where TP indicates the number of pixels correctly detected as the changed class; TN indicates the number of pixels correctly detected as the unchanged class; and FP and FN indicate the numbers of pixels falsely detected as the changed and unchanged classes, respectively. On this basis, these evaluation metrics can be computed as follows:

$$\text{OA} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{10}$$

$$\text{KC} = \frac{\text{OA} - p\_e}{1 - p\_e} \tag{11}$$

$$p\_e = \frac{\left(\text{TP} + \text{FP}\right) \times \text{RC} + \left(\text{TN} + \text{FN}\right) \times \text{RU}}{\left(\text{TP} + \text{TN} + \text{FP} + \text{FN}\right)^2} \tag{12}$$

$$\text{PRE} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{13}$$

$$\text{REC} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{14}$$

$$\text{F}\_1 = \frac{2 \times \text{PRE} \times \text{REC}}{\text{PRE} + \text{REC}} \tag{15}$$

where RC and RU represent the numbers of pixels belonging to the changed and unchanged classes in the reference image, respectively. Larger values of these evaluation metrics indicate better detection performance.
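
The metrics of Equations (10)–(15) can be computed directly from binary prediction and reference maps; below is a minimal sketch with assumed array names.

```python
import numpy as np

def cd_metrics(pred, ref):
    """OA, KC, and F1 (Eqs. 10-15) from binary prediction and reference maps."""
    tp = np.sum((pred == 1) & (ref == 1))
    tn = np.sum((pred == 0) & (ref == 0))
    fp = np.sum((pred == 1) & (ref == 0))
    fn = np.sum((pred == 0) & (ref == 1))
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    rc, ru = tp + fn, tn + fp                 # changed / unchanged pixels in reference
    pe = ((tp + fp) * rc + (tn + fn) * ru) / n**2
    kc = (oa - pe) / (1 - pe)                 # kappa coefficient, Eq. (11)
    pre, rec = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)          # harmonic mean, Eq. (15)
    return oa, kc, f1
```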

#### 4.2.2. Comparative Methods

In the experiments, eight widely used or state-of-the-art methods are selected to validate the superiority of the proposed ASSCDN approach: four widely used methods, CVA [61], KNN, SVM, and RCVA [39], and four deep learning-based methods, DCVA [50], DSFA [51], GETNET [6], and TDSSC [20].


#### 4.2.3. Implementation Details

In the experiments, the proposed ASSCDN approach and the other comparative methods were deployed on the PyCharm platform with the PyTorch or TensorFlow framework, using a single NVIDIA RTX 3090 or NVIDIA Tesla P40 GPU. During the training stage, the parameters of the model were optimized by an SGD optimizer with a momentum of 0.5 and a weight decay of 0.001. In all the experiments, the batch size was set to 32.

#### *4.3. Ablation Study and Parameter Analysis on River Dataset*

In this section, to investigate the effectiveness of the proposed ASSCDN, we conduct a series of ablation studies on the River dataset. These ablation studies mainly contain three aspects as follows: (1) In the proposed ASSCDN, we devise a novel PCA-guided self-supervised feature extraction network (PCASFEN) and attention-based CD framework to combine effectively the spatial and spectral features. Therefore, we first test the influence of different components on the performance of CD in the proposed ASSCDN. (2) As the patch size is an inevitable parameter in the proposed self-supervised spatial feature extraction framework, the sensitivity of patch size for network performance is investigated subsequently. (3) In addition, the relationship between the number of training samples and performance is also analyzed to validate the effectiveness of the proposed ASSCDN when only a small number of training samples are available.

#### 4.3.1. Ablation Study for Different Components

In the ablation study, to investigate the contribution of different components in the proposed ASSCDN, three comprehensive evaluation metrics, including OA, KC, and F1, are selected to quantitatively evaluate the results. Besides, to ensure the fairness of the experiment, we set the same parameters for each experiment, that is, the patch size was set to 15, the number of training samples of each class was 250, and the other hyperparameter settings were the same.

In this ablation study, four major configurations are adopted for our ASSCDN, i.e., "*spe*", "*spa*", "*spe* + *spa*", and "*spe* + *spa* + Attention", where "*spe*" denotes that only spectral features are used, "*spa*" denotes that only spatial features are exploited, "*spe* + *spa*" indicates that spectral and spatial features are combined in equal proportions, and "*spe* + *spa* + Attention" indicates that spectral and spatial features are combined through the proposed attention mechanism. According to the aforementioned settings, the results obtained on the River dataset are shown in Table 1 and Figure 5. From the quantitative results, compared with "*spe*", "*spa*" can improve the detection performance to a certain extent, which indicates that the most important change feature representation is extracted by our proposed self-supervised spatial feature extraction framework. In addition, "*spe* + *spa*" can achieve better accuracy due to the improved discriminable feature expression from fusing spectral and spatial features, thus ameliorating the detection performance. Note that "*spe* + *spa* + Attention" reached the best accuracy (95.82%, 0.7609, and 78.37%) in terms of OA, KC, and F1. Compared with "*spe* + *spa*", "*spe* + *spa* + Attention" improved significantly on all three evaluation criteria (by 1.21%, 0.0575, and 5.10%). From the visual results, the same conclusion can be drawn. Besides, as shown in Figure 6, we also tested the performance of the different components with different patch sizes, and the results further verified the contribution of the components of our proposed ASSCDN.

In summary, two conclusions can be drawn from the comparison results of the above ablation study: (1) The most useful change feature representation can be captured by our proposed PCASFEN, which helps enhance the separability between the changed and unchanged classes. (2) As it is unreasonable to combine spectral and spatial features in equal proportions for different patches, a novel attention mechanism is designed to adaptively adjust the proportion of spectral and spatial features for each patch, achieving an effective and reasonable fusion of the two and thus significantly improving the accuracy of CD. Therefore, the effectiveness of each component of the proposed ASSCDN is validated: it effectively joins spectral and spatial features through our proposed self-supervised spatial feature extraction network and attention mechanism, thereby elevating the performance of CD for HSIs.


**Table 1.** Quantitative comparison for ablation study of the combination of different features on the River dataset.

**Figure 5.** Visual results for ablation study of the combination of different features on the River dataset: (**a**) *spe*, (**b**) *spa*, (**c**) *spe* + *spa*, (**d**) *spe* + *spa* + Attention.

#### 4.3.2. Sensitivity Analysis of Patch Size

In the proposed ASSCDN framework, the patch size is an inevitable parameter in the PCASFEN step, as it determines the spatial neighborhood information of a central pixel. Therefore, to comprehensively investigate the relationship between the patch size and the accuracy, each component of our proposed ASSCDN, including "*spe*", "*spa*", "*spe* + *spa*", and "*spe* + *spa* + Attention", is employed in this experiment. Here, KC is selected to evaluate the results for each component. In addition, to ensure the fairness of the comparison, in all experiments, the number of training samples of each class was fixed at 250, and the other hyperparameter settings were the same.

Based on the above settings, the results for patch sizes ranging from 7 to 17 for each component were acquired, as presented in Figure 6. Notably, "*spe*" does not actually involve the patch size, as it uses only spectral features to obtain detection results. Therefore, to facilitate comparison with the results of the other components, the results of "*spe*" are the same for each patch size, as the red line in Figure 6 shows. By observing Figure 6, we can find that the results of "*spa*" fluctuate unstably across patch sizes. That is because different patch sizes may contain difference information at various scales. Small patch sizes are more suitable for small-scale difference information, but their extraction of large-scale difference information is insufficient, which limits the accuracy. Similarly, a larger patch size is more suitable for large-scale difference information, but for small-scale difference information, noise may be introduced and the performance may in turn be damaged. Moreover, the relationship between the patch size and the results of "*spe* + *spa*" and "*spe* + *spa* + Attention" is similar to that of "*spa*". Overall, compared with "*spa*" and "*spe* + *spa*", the performance of "*spe* + *spa* + Attention" is relatively stable and achieves good performance at every patch size.

#### 4.3.3. Analysis of the Relationship between the Number of Training Samples and Accuracy

In this subsection, to further promote the proposed ASSCDN (i.e., "*spe* + *spa* + Attention") in practical applications, we conducted an experiment to explore the relationship between the number of training samples and the accuracy. Here, when testing the performance with different numbers of training samples, we set the same hyperparameters, and the patch size was fixed at 11. Additionally, KC is employed to evaluate the accuracy of all the results. On this basis, the results were acquired with the number of training samples ranging from 10 to 1000 (see Figure 7). As can be seen in Figure 7, as the number of training samples increases, the value of KC increases gradually, and when the number reaches around 200, the value of KC tends to be stable. Figure 7 also reveals that the proposed ASSCDN can achieve convincing performance even with a small number of training samples.

**Figure 6.** Sensitivity analysis of patch size for each component of the proposed ASSCDN on the River dataset.

**Figure 7.** Relationship between the number of training samples and accuracy for the proposed ASSCDN on the River dataset.

#### *4.4. Comparison Results and Analysis*

In this section, we tested the performance of the proposed ASSCDN on three real, publicly available HSI datasets. To verify the superiority of the proposed ASSCDN, eight approaches were selected for comparison, including four widely used methods (CVA [61], KNN, SVM, and RCVA [39]) and four deep learning-based methods (DCVA [50], DSFA [51], GETNET [6], and TDSSC [20]). Furthermore, five metrics (OA, KC, F1, PRE, and REC) are used to evaluate the accuracy of the proposed ASSCDN and the compared methods. We employed a patch size of 15 and 250 training samples per class when running the proposed ASSCDN on these three datasets. In addition, to ensure a fair comparison, GETNET [6] and TDSSC [20] are deployed under the same semi-supervised learning framework as the proposed ASSCDN.

#### 4.4.1. Results and Comparison on Barbara and Bay Datasets

The CD results acquired by the different approaches on the Barbara and Bay datasets are shown in Figures 8 and 9, and the quantitative evaluation results are listed in Tables 2 and 3. From Figures 8a and 9a, the traditional CVA method produces more false positive pixels due to its lack of effective use of spatial features. Unlike CVA, as shown in Figures 8d and 9d, RCVA introduces neighborhood information, but it is unreliable because changed targets of various scales are inevitable. Besides, KNN and SVM present fewer false positive and false negative pixels on both the Barbara and Bay datasets; in particular, SVM achieved the highest PRE (93.01%), as listed in Table 2. Notably, the unsupervised deep learning methods, i.e., DCVA and DSFA, did not reach satisfactory performance on the Barbara and Bay datasets, respectively. DCVA aims to acquire CD results by comparing differences between transferred deep features, but the generalization ability of the transfer model is unreliable, while DSFA may be limited by the results of its pre-detection. GETNET [6] obtains the second-best performance on the Barbara dataset but cannot reach satisfactory accuracy on the Bay dataset. By contrast, TDSSC [20] achieves relatively stable accuracy on these two datasets, as it captures a more robust feature representation by fusing features along the spectral direction and two spatial directions. In the proposed ASSCDN, spectral and spatial features are fused adaptively for each patch, which helps obtain more reliable detection results. As listed in Tables 2 and 3, compared with the above methods, the proposed ASSCDN achieves the best accuracy on both the Barbara and Bay datasets in terms of OA, KC, and F1. From the visual results on the Barbara and Bay datasets (see Figures 8i and 9i), the proposed ASSCDN yields very few false positive and false negative pixels and obtains the results closest to the reference image.


**Table 2.** Quantitative comparison results of various methods applied on the Barbara dataset.

**Figure 8.** The visual results of different methods on the Barbara dataset: (**a**) CVA [61], (**b**) KNN, (**c**) SVM, (**d**) RCVA [39], (**e**) DCVA [50], (**f**) DSFA [51], (**g**) GETNET [6], (**h**) TDSSC [20], (**i**) our ASSCDN, and (**j**) Reference image.

**Figure 9.** The visual results of different methods on the Bay dataset: (**a**) CVA [61], (**b**) KNN, (**c**) SVM, (**d**) RCVA [39], (**e**) DCVA [50], (**f**) DSFA [51], (**g**) GETNET [6], (**h**) TDSSC [20], (**i**) our ASSCDN, and (**j**) Reference image.


**Table 3.** Quantitative comparison results of various methods applied on the Bay dataset.

#### 4.4.2. Results and Comparison on River Dataset

For the River dataset, as presented in Figure 4, more fine changed ground targets exist, which increases the difficulty of obtaining fine CD results. The CD results obtained by the various approaches on the River dataset are shown in Figure 10. From Figure 10a–c, although the typical CVA, KNN, and SVM display a few false negative pixels, many unchanged pixels are misclassified as changed because spatial information is not considered. Compared with CVA, KNN, and SVM, the result of RCVA (see Figure 10d) shows less noise by introducing spatial contextual information for each pixel. By contrast, DCVA performs poorly, as presented in Figure 10e, because it depends heavily on transferred deep features. DSFA generates a CD result with relatively few false positive pixels but many missed detections. Both GETNET [6] and TDSSC [20] exhibit fewer false negative pixels, and GETNET [6] reaches fewer false positive pixels than TDSSC [20]. From the visual observations, compared with the other methods, the proposed ASSCDN presents the fewest false positive pixels, thus achieving the best visual performance. Although the proposed ASSCDN shows relatively more false negative pixels than GETNET [6] and TDSSC [20], it obtains a good trade-off between false positive and false negative pixels. In addition to the visual comparison, the quantitative results further demonstrate that the proposed ASSCDN achieves improvements of 0.4% in OA, 0.0113 in KC, 0.92% in F1, and 3.47% in PRE, as listed in Table 4.

In summary, the aforementioned comparative experiments on three real HSIs demonstrate that the proposed ASSCDN outperforms both traditional methods and state-of-the-art methods. The comparison results further verify that effective spatial features can be captured for CD by introducing the novel PCASFEN, which presents the most significant difference representation. Furthermore, spectral and spatial features are fused in adaptive proportions by the attention mechanism, which enhances the feature representation and thus improves the separability of difference features.


**Table 4.** Quantitative comparison results of various methods applied on the River dataset.

**Figure 10.** The visual results of different methods on the River dataset: (**a**) CVA [61], (**b**) KNN, (**c**) SVM, (**d**) RCVA [39], (**e**) DCVA [50], (**f**) DSFA [51], (**g**) GETNET [6], (**h**) TDSSC [20], (**i**) our ASSCDN, and (**j**) Reference image.

#### **5. Discussion**

In this paper, ablation studies and comparison experiments were conducted on three popular benchmark HSI CD datasets. From the ablation studies, three observations can be made. First, the analysis of the different components of the proposed ASSCDN shows that the PCA-guided self-supervised feature extraction network and the attention-based CD framework can capture and fuse spatial and spectral features to further improve the performance of HSI CD. Second, although the sensitivity analysis reveals that the patch size is likely to affect network accuracy (see Figure 6), the proposed ASSCDN clearly improves accuracy at every patch size. Third, the relationship between the number of training samples and accuracy was explored: the results show that accuracy increases gradually with the number of training samples, and the proposed ASSCDN can achieve relatively satisfactory performance even when few training samples are employed. In addition, in the comparison experiments, eight cognate approaches, including four traditional methods (CVA [61], KNN, SVM, and RCVA [39]) and four state-of-the-art methods (DCVA [50], DSFA [51], GETNET [6], and TDSSC [20]), were selected to investigate the performance of the proposed ASSCDN. The quantitative comparison shows that the proposed ASSCDN is superior to the other eight methods in OA, KC, and F1 on all three datasets. Meanwhile, the visual comparison shows that the change detection maps acquired by our ASSCDN achieve a good trade-off between false detections and missed detections. Although the proposed ASSCDN provides better results for HSI CD, its complexity is relatively high, because the training process must be divided into two stages (i.e., first train the proposed self-supervised spatial feature extraction network, and then train the semi-supervised attention-based spatial and spectral network). Besides, the computational cost of the ASSCDN framework, evaluated in multiply-accumulate operations (MACs), is 0.81 G MACs for the PCA-guided self-supervised spatial feature extraction network step and 0.0051 G MACs for the semi-supervised attention-based spatial and spectral network step.

#### **6. Conclusions**

In this paper, we propose an attention-based spectral and spatial change detection network (ASSCDN) for hyperspectral images, which mainly consists of the following steps. First, the main spatial difference features are extracted by the proposed PCASFEN. Second, an attention mechanism is introduced to adaptively allocate the ratio of spectral and spatial features in the fused features. Finally, by jointly analyzing the weighted spatial and spectral features, the change status of each pixel is obtained. We conducted ablation studies and parameter analysis experiments to validate the effectiveness of each component of the proposed ASSCDN. In addition, experimental comparisons on three groups of publicly available hyperspectral images demonstrate that the proposed ASSCDN outperforms the eight compared methods. In future work, other HSIs will be collected to further investigate the robustness of this method, and we will focus on weakly supervised and unsupervised HSI CD.

**Author Contributions:** Conceptualization, Z.W. and F.J.; methodology, Z.W.; validation, Z.W., F.J. and T.L.; investigation, F.J. and T.L.; writing—original draft preparation, Z.W., F.J. and F.X.; writing—review and editing, F.X. and P.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of Shaanxi Province under Grant 2021JQ-210, the Fundamental Research Funds for the Central Universities under Grant XJS200216, and the Fundamental Research Funds for the Central Universities and the Innovation Fund of Xidian University.

**Acknowledgments:** We are grateful to Wang Qi and Javier López-Fandiño who provided the data for this research.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


### *Article* **A Constrained Sparse-Representation-Based Spatio-Temporal Anomaly Detector for Moving Targets in Hyperspectral Imagery Sequences**

#### **Zhaoxu Li †, Qiang Ling †, Jing Wu, Zhengyan Wang and Zaiping Lin \***

College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China; lizhaoxu@nudt.edu.cn (Z.L.); lingqiang16@nudt.edu.cn (Q.L.); jingwu@nudt.edu.cn (J.W.); wangzhengyan@nudt.edu.cn (Z.W.)

**\*** Correspondence: linzaiping@nudt.edu.cn

† These authors contributed equally to this work.

Received: 5 August 2020; Accepted: 25 August 2020; Published: 27 August 2020

**Abstract:** At present, small dim moving target detection in hyperspectral imagery sequences is mainly based on anomaly detection (AD). However, most conventional detection algorithms only utilize the spatial spectral information and rarely employ the temporal spectral information. Besides, multiple targets in complex motion situations, such as multiple targets at different velocities and dense targets on the same trajectory, are still challenges for moving target detection. To address these problems, we propose a novel constrained sparse representation-based spatio-temporal anomaly detection algorithm that extends AD from the spatial domain to the spatio-temporal domain. Our algorithm includes a spatial detector and a temporal detector, which play different roles in moving target detection. The former can suppress moving background regions, and the latter can suppress non-homogeneous background and stationary objects. Two temporal background purification procedures maintain the effectiveness of the temporal detector for multiple targets in complex motion situations. Moreover, the smoothing and fusion of the spatial and temporal detection maps can adequately suppress background clutter and false alarms on the maps. Experiments conducted on a real dataset and a synthetic dataset show that the proposed algorithm can accurately detect multiple targets with different velocities and dense targets with the same trajectory and outperforms other state-of-the-art algorithms in high-noise scenarios.

**Keywords:** anomaly detection; constrained sparse representation; hyperspectral imagery; moving target detection; spatio-temporal processing

#### **1. Introduction**

With the development of optical sensor technology, hyperspectral imagery (HSI) has improved dramatically in recent years, and HSI sequences are increasingly available in the real world. Because of the adequate spectral information in dozens or hundreds of spectral bands, HSI detection techniques can find and distinguish dim targets that are unobservable in visible or infrared images, and they have promising prospects in military, security, satellite surveillance, disaster monitoring, and other applications [1]. According to whether prior target spectral information is utilized, HSI detection techniques can be mainly divided into target detection [2–4] and anomaly detection. Due to factors such as camera angle, illumination, atmosphere, and sensor spatial resolution, it is common in HSI that the same object has different spectra. Besides, no prior target spectrum is available for most moving target detection scenes. Therefore, current hyperspectral moving target detection technologies [5–12] are mainly based on anomaly detection.

Traditional single-frame anomaly detection is usually accomplished by detecting irregular deviations between the test pixel and background pixels in a hyperspectral image. Designed to detect the presence of a dim target in a multi-band image, the Reed–Xiaoli (RX) algorithm [13] assumes that the global background spectra obey a multivariate Gaussian distribution and applies the Mahalanobis distance to identify anomaly spectra. To solve the problem that the Gaussian distribution is not applicable to a non-stationary global background, the local version of RX [14] divides the local neighborhood of the test pixel into potential regions and background regions using dual windows and replaces global statistics with local statistics. The Quasi-local-RX (QLRX) algorithm [15] improves point-target detection by utilizing local and global statistics simultaneously. The kernel RX (KRX) algorithm [16], a nonlinear version of RX, maps spectra into a higher-dimensional feature space through a kernel function and outperforms the original RX detector in military target and mine detection. The cluster KRX (CKRX) algorithm [17] improves the performance of KRX by replacing background pixels with cluster centers. Support vector data description (SVDD) algorithms [18,19] also determine anomalies in a high-dimensional feature space by building a minimal enclosing hypersphere around local background pixels. Sparse representation (SR)-based algorithms [20–27] have made significant progress in anomaly detection in recent years. These algorithms usually assume that background pixels can be represented as linear combinations of the surrounding background, whereas anomaly pixels cannot. The collaborative representation (CR)-based algorithm [22] adopts $l_2$-norm minimization to reinforce the collaboration of background representation and is superior to RX and its improved algorithms. To realize the detection of dense small targets, the constrained sparse representation (CSR)-based algorithm [23] imposes two constraints on the abundance vectors and can remove anomalous atoms from the local background dictionary. Because background pixels and target pixels are considered low-rank and sparse, respectively, low-rank and sparse matrix decomposition-based algorithms [28–30] have also received widespread attention in anomaly detection.

When a hyperspectral staring camera is continuously imaging at short intervals, anomaly detectors can output detection maps in succession. Usually, anomaly detection maps of a hyperspectral imagery sequence can be regarded as an infrared image sequence. Therefore, multi-frame infrared detection or tracking algorithms can be used to detect or track dim moving targets on these maps. Rotman et al. combined hyperspectral target detection and infrared target tracking for the first time [5–7]. They transformed each HSI into a two-dimensional anomaly detection map and then utilized a variance filter (VF) [31] to detect targets moving at subpixel velocity. Besides, Duran et al. focused on tracking small dense objects, such as pedestrians or vehicles, from airborne platforms [8–10]. They adopted endmember techniques to detect subpixel targets and estimated the motion parameters of targets under the framework of the Bayesian filter. Wang et al. proposed a novel temporal anomaly detector in dim moving target detection, which extracts the local spatial background in the previous frame to mine the singularity of the test pixel [11]. Combining the traditional single-frame detection with their proposed temporal detection can effectively reduce temporal noise clutter. Then, Wang et al. introduced a simplified VF to calculate a trajectory history map in the literature [12]. The fusion of the spatial detection map, the temporal detection map, and the trajectory history map (STH) is superior to previous moving target detection algorithms in hyperspectral imagery sequences.

In summary, current anomaly detection algorithms for moving targets still only utilize the spatial neighborhood background of the current frame or the previous frame. However, static or non-moving objects whose spectra differ from their neighborhoods can be regarded as anomaly targets by these detection algorithms. Temporal profile filtering algorithms can detect moving targets, but require prior information about target speed. Besides, detecting targets in complex motion, such as multiple targets at different velocities and dense targets on the same trajectory, is still a challenge for temporal profile filtering-based algorithms [5–7,11]. To solve these problems, we propose a CSR-based spatio-temporal anomaly detector (CSR-ST), which sufficiently employs the temporal spectral information in HSI sequences. Unlike hyperspectral change detection (CD) [32,33], which detects anomaly regions under diurnal and seasonal changes, moving target detection requires a very short interval between frames. This means that camera angle, illumination, weather, and other imaging conditions are almost unchanged in adjacent frames. After frame registration, the spectrum of the same pixel can be regarded as a mixture of spectra in a small local region, affected only by the temporal clutter in different frames. Based on this assumption, we propose a novel temporal anomaly detection framework that calculates the anomaly score of the test pixel employing its former spectra. In our previous work [23], the CSR detector was based on the assumption that a background pixel can be linearly represented by the endmembers present in its spatial neighborhood while an anomaly pixel cannot. Compared with background spectra in the spatial neighborhood, the former spectra of the test pixel in previous frames can provide purer background endmembers to represent the current spectrum. Therefore, the CSR-based temporal detector has a better ability to recover the test background pixel than the CSR-based spatial detector. Besides, the temporal detector has two safeguards for constructing a pure temporal background dictionary for the test pixel. The first safeguard is to remove potential target spectra from the candidate set of the temporal background dictionary based on the spatial detection results. The second safeguard is to automatically remove anomaly atoms from the background dictionary when the corresponding abundances are higher than a given upper bound and then solve the model with the new background dictionary. Non-homogeneous background pixels or stationary objects can turn into false alarms in single-frame detection, while the temporal detector is mainly sensitive to moving targets. However, when some background regions move in the imaging scene, the temporal detector can regard them as targets and be inferior to the spatial detector. The fusion of the spatial detection map and the temporal detection map combines the advantages of the two detectors and can suppress the background and stationary objects. The main contributions of this article are summarized as follows.


The rest of this article is organized as follows. The CSR detector and its kernel version are introduced in Section 2. The proposed CSR-ST algorithm is described in Section 3. The experiments conducted on a real dataset and a synthetic dataset are presented in Section 4, followed by the conclusions in Section 5.

#### **2. Related Work**

SR-based anomaly detection algorithms usually assume that a background pixel can exist in a low-dimensional subspace spanned by surrounding background pixels. Meanwhile, anomaly pixels cannot be represented as a sparse linear mixture of background spectra. Suppose *y* is the test pixel, which has *N* spectral bands, and *A* is the background dictionary, which has *M* atoms; the competing hypotheses for the SR-based algorithms are:

$$\begin{aligned} H_0: \ y &= A\alpha + n, \ \text{background pixel} \\ H_1: \ y &\neq A\alpha + n, \ \text{anomaly pixel} \end{aligned} \tag{1}$$

where $A \in \mathbb{R}^{N \times M}$, *α* is a sparse vector whose entries are the abundances of the corresponding atoms in *A*, and *n* is a random noise term.

Usually, the sparse vector *α* has a sparsity constraint $\|\alpha\|_0 \le K$ imposed in SR-based detection, where *K* is a sparsity parameter. However, if there is no constraint on each abundance item in *α*, anomaly pixels can also be linear mixtures of the background dictionary on account of abundance items less than zero. The linear spectral mixture model (LMM) [34] supposes that the abundance vector *α* of a mixed pixel should satisfy a sum-to-one constraint:

$$\sum\_{l=1}^{M} \alpha\_l = 1 \tag{2}$$

and a non-negativity constraint:

$$\alpha_l \ge 0, \ l = 1, \dots, M. \tag{3}$$

The CSR algorithm introduces Equations (2) and (3) into the SR model, and the minimizing problem of CSR can be expressed as:

$$\min_{\alpha} \ \|y - A\alpha\|_2 \qquad \text{s.t.} \quad \|\alpha\|_0 \le K, \quad e^T \alpha = 1, \quad \alpha_l \ge 0, \ l = 1, \dots, M \tag{4}$$

where *e* represents an *M* × 1 vector for which each item is one. The objective function can be converted to:

$$\|y - A\alpha\|_2 = \sqrt{\alpha^T A^T A \alpha - 2 y^T A \alpha + y^T y} \tag{5}$$

Note that $y^T y$ is a constant and can be removed. If the test pixel is anomalous and the background dictionary contains a few anomaly pixels, the corresponding entries of *α* can become large, resulting in a small reconstruction residual. To avoid missed detections, a sufficiently small constant *C* is introduced as an upper bound on each entry of *α*, and Equation (4) can be transformed into:

$$\min_{\alpha} \ \alpha^T A^T A \alpha - 2 y^T A \alpha \qquad \text{s.t.} \quad e^T \alpha = 1, \quad 0 \le \alpha_l \le C, \ l = 1, \dots, M \tag{6}$$

where $C \in [1/M, 1]$. According to the Karush–Kuhn–Tucker conditions [35], the constraint $\|\alpha\|_0 \le K$ in Equation (4) can be removed in Equation (6).

When abnormal pixels are tested, the abundances correlated with similar anomalous atoms can reach the maximum. Accordingly, the atoms whose abundances equal *C* have a significant possibility of being anomalies and should be eliminated from the background dictionary. A purified dictionary $\tilde{A}$ can be built from the remaining atoms. With the constraint $0 \le \tilde{\alpha}_l \le 1$ and $\tilde{A}$, the reconstruction residuals of anomalous test pixels will be significantly higher than those in the first reconstruction and can be regarded as anomaly scores:

$$\tau = \sqrt{\tilde{\alpha}^{*T} \tilde{A}^T \tilde{A} \tilde{\alpha}^* - 2 y^T \tilde{A} \tilde{\alpha}^* + y^T y} \tag{7}$$

where $\tilde{\alpha}^*$ is the sparse vector computed approximately with the purified background dictionary $\tilde{A}$, i.e., without anomalous atoms.
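To make the two-stage CSR procedure of Equations (6) and (7) concrete, the following minimal Python sketch is one way to implement it; this is not the authors' code and relies on SciPy's generic SLSQP solver, with an illustrative numerical tolerance for detecting abundances that hit the bound *C*:

```python
import numpy as np
from scipy.optimize import minimize

def csr_anomaly_score(y, A, C):
    """Two-stage CSR detector: solve Eq. (6), purge atoms whose abundance
    hits the bound C, re-solve, and return the residual of Eq. (7).

    y : (N,) test pixel; A : (N, M) background dictionary; C in [1/M, 1].
    """
    def solve(D, ub):
        m = D.shape[1]
        G, g = D.T @ D, D.T @ y
        obj = lambda a: a @ G @ a - 2.0 * g @ a              # objective of Eq. (6)
        cons = {"type": "eq", "fun": lambda a: a.sum() - 1.0}  # e^T alpha = 1
        res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                       bounds=[(0.0, ub)] * m, constraints=cons)
        return res.x

    alpha = solve(A, C)
    keep = alpha < C - 1e-6           # drop atoms whose abundance equals C
    A_t = A[:, keep] if keep.any() else A
    a_t = solve(A_t, 1.0)             # re-solve with 0 <= alpha_l <= 1
    resid = a_t @ (A_t.T @ A_t) @ a_t - 2.0 * (A_t.T @ y) @ a_t + y @ y
    return np.sqrt(max(resid, 0.0))   # Eq. (7)
```

In practice a dedicated quadratic programming solver would be faster, but the structure of the two reconstructions is the same.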

Given secondary or multiple scattering in the atmosphere, spectral mixing is usually a nonlinear process [36]. Kernel methods map the original data into a higher-dimensional feature space via nonlinear functions and then achieve a linear partition of the linearly inseparable data [37]. Conveniently, the inner product in the feature space can be replaced by:

$$\langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j) \tag{8}$$

where *φ* is a nonlinear function, *x<sup>i</sup>* and *x<sup>j</sup>* are the original data, and *k* is the kernel function. The kernel CSR (KCSR) algorithm introduces the kernel method and adopts the Gaussian radial basis function kernel:

$$k\left(\mathbf{x}\_{i},\mathbf{x}\_{j}\right) = e^{-\gamma \|\mathbf{x}\_{i} - \mathbf{x}\_{j}\|\_{2}^{2}}.\tag{9}$$

The optimization problem is then replaced by:

$$\min_{\alpha} \ \alpha^T K \alpha - 2 K_y \alpha \qquad \text{s.t.} \quad e^T \alpha = 1, \quad 0 \le \alpha_l \le C, \ l = 1, \dots, M \tag{10}$$

where $K$ is the $M \times M$ Gram matrix with entries $K_{i,j} = k(a_i, a_j)$, and $K_y = \phi(y)^T \phi(A)$ can also be written as:

$$K_y = k(A, y) = \begin{bmatrix} k(a_1, y) & k(a_2, y) & \cdots & k(a_M, y) \end{bmatrix} \tag{11}$$

Likewise, the atoms whose abundances equal *C* are removed, and then a purified background dictionary $\tilde{A}$ is used to solve Equation (10). The anomaly score is then given by:

$$r = \sqrt{\tilde{\alpha}^{*T} \tilde{K} \tilde{\alpha}^* - 2 \tilde{K}_y^T \tilde{\alpha}^* + k(y, y)} \tag{12}$$

where $r$ is the approximation error and $\tilde{K}$ and $\tilde{K}_y$ are both computed from $\tilde{A}$.
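As a small illustration of how the kernel quantities in Equations (9) and (11) can be computed, the following NumPy sketch builds the Gram matrix $K$ and the vector $K_y$; it assumes atoms are stored as columns, which is a convention of this sketch rather than anything stated in the paper:

```python
import numpy as np

def rbf_gram(A, gamma):
    """Gram matrix K with K[i, j] = exp(-gamma * ||a_i - a_j||^2), Eq. (9).
    A : (N, M) dictionary with atoms as columns."""
    sq = np.sum(A * A, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def rbf_vector(A, y, gamma):
    """K_y = [k(a_1, y), ..., k(a_M, y)] from Eq. (11)."""
    d2 = np.sum((A - y[:, None]) ** 2, axis=0)
    return np.exp(-gamma * d2)
```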

#### **3. Spatio-Temporal Anomaly Detection for Moving Targets**

In this section, a novel CSR-based spatio-temporal anomaly detection algorithm is proposed to accurately detect dim moving targets in HSI sequences. Our algorithm is divided into four steps, namely spatial anomaly detection, iterative smoothing filtering, temporal anomaly detection, and spatio-temporal fusion. The spatial anomaly detection finds abnormal targets by utilizing the spectral information of the current frame. The iterative smoothing filter reduces noise and false alarms in the time and space domains. Unlike AD, CD, and the temporal detection in [12], which use information between two adjacent frames, our proposed temporal anomaly detection constructs background dictionaries from the historical spectral curves of the test pixels. It explores anomaly characteristics in the time dimension and provides anomaly information complementary to that of the spatial detection. The fusion of spatial and temporal anomaly detection explores the target information more comprehensively. The framework of the proposed CSR-ST algorithm is displayed in Figure 1.


**Figure 1.** The framework of the proposed CSR-ST algorithm. (**a**) The schematic diagram of the CSR-based spatial and temporal detectors. (**b**) The program flowchart of the smoothing filter and fusion on the spatial and temporal detection maps.

#### *3.1. Spatial Anomaly Detection*

Let $X_i = \{x_i^1, x_i^2, \cdots, x_i^{d_1 \times d_2}\} \in \mathbb{R}^N$ denote the hyperspectral cube collected in the current frame, where $i$ is the current sequence number, $d_1$ and $d_2$ are the spatial sizes of the cube, and $N$ is the number of spectral bands. Dual concentric windows [38] are used to extract a spatial background dictionary for each pixel $x_i^j$. The dual windows are centered at each test pixel and divide its neighborhood into a potential target region and a background region. Pixels in the background region are selected as atoms to form a background dictionary $A_i^j$. Then, the spatial anomaly score $s_i^j$ of the test pixel $x_i^j$ is computed by the CSR detector with the corresponding background dictionary $A_i^j$. After all pixels of $X_i$ are detected in sequence, a two-dimensional spatial detection map $S_i$ is obtained:

$$S_i = \begin{bmatrix} s_i^1 & \cdots & s_i^{d_2} \\ \vdots & \ddots & \vdots \\ s_i^{d_1 \times (d_2 - 1) + 1} & \cdots & s_i^{d_1 \times d_2} \end{bmatrix} \tag{13}$$
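As an illustration of the dual-window strategy described above, the following Python sketch collects the background atoms in the ring between the inner and outer windows; the function name and array layout are assumptions for this sketch, not the authors' code:

```python
import numpy as np

def dual_window_dictionary(cube, r, c, w_in, w_out):
    """Collect atoms from the ring between the inner and outer windows
    centered at pixel (r, c); cube : (d1, d2, N). Window sizes w_in < w_out
    are assumed odd, as is conventional for dual-window detectors."""
    hi, ho = w_in // 2, w_out // 2
    atoms = []
    for dr in range(-ho, ho + 1):
        for dc in range(-ho, ho + 1):
            if max(abs(dr), abs(dc)) <= hi:
                continue                  # inside the potential target region
            rr, cc = r + dr, c + dc
            if 0 <= rr < cube.shape[0] and 0 <= cc < cube.shape[1]:
                atoms.append(cube[rr, cc])
    return np.stack(atoms, axis=1)        # (N, M) background dictionary A_i^j
```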

#### *3.2. Iterative Smoothing Filter*

The spectra change with time due to measurement noise, resulting in temporal fluctuation of anomaly scores. Meanwhile, this fluctuation also generates spatial background clutter in the detection maps. The literature [17] used a simple smoothing filter as a post-processing procedure to decrease false alarms and noise in detection maps. Inspired by [17], an iterative smoothing filter is adopted to reduce noise in both the spatial and temporal domains simultaneously.

To avoid the overall drift of anomaly scores on the spatial detection map $S_i$ caused by sudden changes in imaging conditions, Z-score normalization should first be performed:

$$\bar{S}_i = \frac{S_i - \mu}{\sigma} \tag{14}$$

In typical image preprocessing, *μ* and *σ* are the mean value and standard deviation of the pixels in the whole image, respectively. However, because the anomaly scores of anomalous pixels are much higher than those of background pixels on $S_i$, it is more accurate to describe the distribution of $s_i^j$ by a truncated normal distribution or a half-normal distribution [39] rather than a normal distribution. Therefore, it is more reasonable to set *μ* and *σ* to the mean value and standard deviation of the union of $S_i$ and its symmetric set about zero.

Then, an iterative smoothing operation is performed on $\bar{S}_i$ to reduce spatial and temporal clutter:

$$\tilde{s}_i^j = (1 - \rho)\, \tilde{s}_{i-1}^j + \rho \sum_{l \in L(j)} \varepsilon_l \bar{s}_i^l \tag{15}$$

where $\bar{s}_i^l$ is the normalized spatial anomaly score of $x_i^l$; $\tilde{s}_i^j$ and $\tilde{s}_{i-1}^j$ are the smoothed spatial anomaly scores of $x_i^j$ and $x_{i-1}^j$, respectively; $L$ denotes the spatial neighborhood used for smoothing; and $\rho$ and $\varepsilon_l$ denote filter weights. When the first spatial detection map is smoothed, let $\rho = 1$. The latter part of Equation (15) is essentially a spatial smoothing filter such as the mean filter or the Gaussian filter. Furthermore, one-dimensional denoising algorithms can also replace the temporal iterative smoothing part of Equation (15) to reduce temporal clutter. Compared with the original spatial detection map $S_i$, background clutter and noise on $\tilde{S}_i$ are suppressed, and detection performance can be improved.
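A minimal sketch of one smoothing step, under the half-normal Z-score described above and a mean filter over $L(j)$ with equal weights $\varepsilon_l$, could look as follows (the values of `rho` and the window `size` are illustrative choices, not the paper's tuned settings):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_step(S_i, prev, rho=0.3, size=3):
    """One step of Eqs. (14)-(15). With the symmetric set about zero, the
    Z-score mean is 0 and sigma = sqrt(mean(S_i**2))."""
    S_bar = S_i / np.sqrt(np.mean(S_i ** 2))    # Eq. (14), mu = 0
    spatial = uniform_filter(S_bar, size=size)  # sum over L(j), equal eps_l
    if prev is None:                            # first frame: rho = 1
        return spatial
    return (1.0 - rho) * prev + rho * spatial   # Eq. (15)
```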

#### *3.3. Temporal Anomaly Detection*

Note that using the dual-window strategy to select a background dictionary has several disadvantages. Firstly, the selection of an inappropriate dual-window size can cause the local background to be contaminated by target pixels in spatial anomaly detection. If the inner window is too small, the chosen local background of a test target pixel can contain some target pixels. Moreover, the contamination problem can also occur when multiple targets are densely distributed. Secondly, the spatial distributions of moving targets are usually unknown and change in the real world. Therefore, it is difficult to determine the optimal dual-window size for detecting moving targets in advance. Thirdly, the performance of these algorithms still varies with the dual-window size, and the best performance of dual-window-based AD algorithms is a local optimum. For instance, detection results can be further improved after combining with a weight matrix obtained by segmentation or clustering [40,41], where background pixels are assigned lower weight values. An interesting phenomenon is that the best local background of some detection algorithms for subpixel targets is the eight-neighborhood [42], and large dual windows are harmful to these algorithms. Fourthly, dual-window-based spatial detection cannot eliminate motionless objects, whose spectra also differ from the background spectra.

To accurately detect moving targets in HSI sequences, we propose a new approach for constructing the background dictionaries of test pixels. Compared with hyperspectral CD, the interval between two contiguous frames in moving target detection is short; thus, the camera angle, illumination, weather, and other imaging conditions are almost unchanged. In this case, the spectrum of the same object in short HSI sequences is only affected by measurement noise. Moreover, due to camera shake and frame registration error, the imaging space corresponding to the same pixel in the HSI moves back and forth within a local background region. Therefore, it can be assumed that the spectra of the same pixel in adjacent frames, $x_i^j, x_{i-1}^j, x_{i-2}^j, \ldots, x_{i-P}^j$, are linear combinations of the same set of endmembers. According to the LMM, the current pixel $x_i^j$ can be expressed as a linear combination of its former spectra $x_{i-1}^j, x_{i-2}^j, \ldots, x_{i-P}^j$:

$$x_i^j = \sum_{l=1}^{P} x_{i-l}^j \beta_l + n = B_i^j \beta + n \qquad \text{s.t.} \quad \sum_{l=1}^{P} \beta_l = 1, \quad \beta_l \ge 0, \ l = 1, \dots, P \tag{16}$$

where $B_i^j$ is the former spectra matrix, $\beta$ is the abundance vector, $P$ is the number of former spectra, and $n$ is the noise item.

Equation (16) means that the CSR detector can also be applied to the test pixel $x_i^j$ and its former spectra $x_{i-1}^j, x_{i-2}^j, \ldots, x_{i-P}^j$. $B_i^j$ and $x_i^j$ can be considered to consist of the same set of background endmembers. In the spatial anomaly detection, the background dictionary $A_i^j$ constructed by the dual-window strategy contains some endmembers independent of $x_i^j$. Compared with $A_i^j$, $B_i^j$ is therefore more suitable as a background dictionary for the CSR and KCSR detectors. In this subsection, temporal anomaly detection is defined as calculating the anomaly score of the test pixel $x_i^j$ in the current frame using its former spectra $B_i^j$. Because the positions of non-homogeneous background pixels or motionless objects are almost unchanged in the HSI after inter-frame registration, the temporal anomaly detection can avoid false alarms caused by these pixels.

However, $B_i^j$ is not always a pure background dictionary. When a target is moving slowly, it takes more than one frame to pass through a pixel. In this case, if $x_i^j$ is a target pixel, its former spectra may also be target spectra. Besides, if the trajectories of moving targets intersect, the former spectra of pixels at the intersection can also be contaminated by targets. Therefore, we delete the abnormal atoms in $B_i^j$ based on the spatial anomaly detection results. $N_D$ and $N_C$ are defined as the numbers of atoms in the background dictionary and its candidate set, respectively. Specifically, for the test pixel $x_i^j$ in the current frame, the smoothed spatial anomaly scores $\tilde{s}_{i-1}^j, \tilde{s}_{i-2}^j, \ldots, \tilde{s}_{i-N_C}^j$ of its former spectra are sorted first. In order from smallest to largest, the sorted result is $\tilde{s}_{m_1}^j, \tilde{s}_{m_2}^j, \ldots, \tilde{s}_{m_{N_C}}^j$, where the subscripts $m_1, m_2, \ldots, m_{N_C}$ are the sequence numbers. The smaller the spatial anomaly score, the higher the probability that the corresponding former spectrum belongs to the background. Therefore, the $N_D$ former spectra $x_{m_1}^j, x_{m_2}^j, \ldots, x_{m_{N_D}}^j$ are selected to construct a purified background dictionary $\bar{B}_i^j$ for the test pixel $x_i^j$. Then, the minimization problem in the CSR algorithm can be transformed into:

$$\min_{\alpha} \ \alpha^T \bar{B}_i^{jT} \bar{B}_i^j \alpha - 2 x_i^{jT} \bar{B}_i^j \alpha \qquad \text{s.t.} \quad e^T \alpha = 1, \quad 0 \le \alpha_l \le C, \ l = 1, \dots, N_D \tag{17}$$

where $C \in [1/N_D, 1]$. The background dictionary $\bar{B}_i^j$ can be further purified by removing the atoms with $\alpha_l = C$. The temporal anomaly detection result $t_i^j$ of $x_i^j$ is then:

$$t_i^j = \sqrt{\tilde{\alpha}^{*T} \tilde{B}_i^{jT} \tilde{B}_i^j \tilde{\alpha}^* - 2 x_i^{jT} \tilde{B}_i^j \tilde{\alpha}^* + x_i^{jT} x_i^j} \tag{18}$$

where $\tilde{\alpha}^*$ is the sparse vector computed approximately with the purified background dictionary $\tilde{B}_i^j$, and $t_i^j$ is the $l_2$-norm of the approximation error. Similarly, the KCSR algorithm can also be applied to the temporal anomaly detection. After all pixels of $X_i$ are detected in sequence, a two-dimensional temporal detection map $T_i$ is obtained.
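The dictionary-construction step described above reduces to sorting the former spectra by their smoothed spatial scores, as in this minimal sketch (array layout and function name are assumptions of this sketch):

```python
import numpy as np

def temporal_dictionary(history, scores, N_D):
    """Build the purified temporal dictionary of Section 3.3: keep the N_D
    former spectra with the smallest smoothed spatial anomaly scores.

    history : (N, N_C) former spectra of one pixel;
    scores  : (N_C,) their smoothed spatial anomaly scores.
    """
    keep = np.argsort(scores)[:N_D]  # smallest score = most background-like
    return history[:, keep]
```

The returned matrix then plays the role of $\bar{B}_i^j$ in Equation (17), solved with $C = 1/(\nu N_D)$ by the same CSR routine sketched in Section 2.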

The lower limit of the constraint parameter *C* is connected with the number of anomalous atoms in the background dictionary. To obtain a convenient setting of *C* for both the spatial and temporal anomaly detection, *C* can be represented as:

$$C = \frac{1}{\nu N_D} \tag{19}$$

where $\nu \in [1/N_D, 1]$. If $\nu < 1/N_D$, then $C > 1$ and the inequality constraint $\alpha_l \le C$ becomes inactive. To further explore the meaning of $\nu$, two definitions are given as follows:

$$\eta_1 = \frac{N_a}{N_D} \tag{20}$$

$$\eta_2 = \frac{\sum_{l=1}^{N_a} \alpha_a^l}{N_D} \tag{21}$$

where $N_a$ is the number of anomalous atoms and $\alpha_a^l$ is the abundance of the anomaly endmember in the LMM of the $l$-th anomalous atom. In hyperspectral AD, $0 \le \eta_2 \le \eta_1 \ll 1$. We proved the following proposition for the parameter $\nu$ in [23]:

**Proposition 1.** *To delete all anomalous atoms from the background dictionary, ν must satisfy:*

$$\nu \ge \max\left(\eta_1, \, \eta_2 / \alpha_a\right) \tag{22}$$

*where $\alpha_a$ is the abundance of the anomaly endmember in the LMM of the test pixel.*

The proposition gives an intuitive interpretation of $\nu$. When $\nu$ is larger than $\max(\eta_1, \eta_2/\alpha_a)$, all anomalous atoms can be deleted. Regardless of spatial or temporal detection, $\alpha_a$ of the same test pixel is constant. Therefore, it is practicable to set $\nu$ to the same value in both detections. $\eta_1$ and $\eta_2/\alpha_a$ in temporal detection can be made smaller than those in spatial detection by reducing the proportion of anomalous atoms in $\bar{B}_i^j$. One method is to enlarge $N_D$, the size of $\bar{B}_i^j$. Another is to decrease $N_a$, the number of anomalous atoms, by enlarging the size of the candidate set $B_i^j$ or by sampling the former spectra at intervals before constructing $B_i^j$. Through the above operations, the lower limit of $\nu$ in temporal detection is less than that in spatial detection. When $\nu$ is set to an excessively large value, numerous background atoms are needlessly deleted, slightly degrading the ability of the CSR and KCSR algorithms to represent test background pixels. Therefore, $\nu$ should be a trade-off between the inadequate deletion of anomalous atoms and unnecessary deletion in spatial detection. The same $\nu$ can cause excessive deletion of atoms in temporal detection, but a large $N_D$ can avoid this situation.

#### *3.4. Spatio-Temporal Fusion*

Compared to the spatial anomaly detection, the temporal anomaly detection can suppress spatially non-homogeneous background pixels and stationary objects. Furthermore, compared to the temporal profile filtering algorithms, the proposed temporal anomaly detection can identify moving targets with different speeds simultaneously and is robust to the situation where multiple targets pass through the same trajectory one after the other. However, the temporal detection is inferior to the spatial detection in some situations. If there are some moving background pixels in the scene, such as clouds, temporal anomaly detection can judge them as targets. Besides, if the frame registration error is too large, the temporal background dictionary cannot describe the background accurately. To improve the stability and robustness of the detection algorithm, it is necessary to combine spatial and temporal detection results.

Before fusion, the filtering operation in Section 3.2 can also be performed on the temporal detection map $T_i$. First, perform Z-score normalization on $T_i$:

$$\bar{T}_i = \frac{T_i - \mu}{\sigma} \tag{23}$$

where *μ* and *σ* are set to the mean value and standard deviation of the union of $T_i$ and its symmetric set about zero. Then, the same iterative smoothing operation as in Equation (15) is performed on $\bar{T}_i$ to reduce temporal clutter:

$$\tilde{t}_i^j = (1 - \rho)\, \tilde{t}_{i-1}^j + \rho \sum_{l \in L(j)} \varepsilon_l \bar{t}_i^l \tag{24}$$

where $\bar{t}_i^l$ is the normalized temporal anomaly score of $x_i^l$, and $\tilde{t}_i^j$ and $\tilde{t}_{i-1}^j$ are the smoothed temporal anomaly scores of $x_i^j$ and $x_{i-1}^j$, respectively. The smoothed detection maps can then be combined by the multiplicative fusion strategy:

$$ST_i = \frac{\tilde{S}_i - \min(\tilde{S}_i)}{\max(\tilde{S}_i) - \min(\tilde{S}_i)} \circ \frac{\tilde{T}_i - \min(\tilde{T}_i)}{\max(\tilde{T}_i) - \min(\tilde{T}_i)} \tag{25}$$

where $\max(\tilde{S}_i)$ and $\max(\tilde{T}_i)$ are the maximum values in $\tilde{S}_i$ and $\tilde{T}_i$; $\min(\tilde{S}_i)$ and $\min(\tilde{T}_i)$ are the minimum values in $\tilde{S}_i$ and $\tilde{T}_i$; the symbol $\circ$ denotes the Hadamard product; and $ST_i$ is the fused spatio-temporal detection map. The overall description of the proposed spatio-temporal anomaly detection is presented in Algorithm 1.
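Equation (25) amounts to min–max normalizing the two smoothed maps and taking their element-wise product, as in this short sketch:

```python
import numpy as np

def fuse_maps(S_smooth, T_smooth):
    """Multiplicative spatio-temporal fusion of Eq. (25)."""
    norm = lambda M: (M - M.min()) / (M.max() - M.min())
    return norm(S_smooth) * norm(T_smooth)  # element-wise (Hadamard) product
```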

**Algorithm 1** CSR-based spatio-temporal anomaly detection for moving targets

**Input:** Hyperspectral sequence, dual-window size $(w_{in}, w_{out})$, temporal background dictionary size $N_D$, candidate set size $N_C$, parameter $\nu$, and kernel parameter $\gamma$ for KCSR.

**for** each frame $X_i$ in the hyperspectral sequence **do**

1. **for** each pixel $x_i^j$ in $X_i$ **do**
	- (a) Collect the spatial background dictionary based on the hollow window;
	- (b) Calculate the spatial anomaly score $s_i^j$ with the CSR or KCSR detector;

	**end for**
2. Smooth the spatial detection map $S_i$ by Equations (14) and (15);
3. **for** each pixel $x_i^j$ in $X_i$ **do**
	- (a) According to the sorted smoothed spatial detection results $\tilde{s}_{i-1}^j, \tilde{s}_{i-2}^j, \ldots, \tilde{s}_{i-N_C}^j$, select $N_D$ dictionary atoms from the former spectra $x_{i-1}^j, x_{i-2}^j, \ldots, x_{i-N_C}^j$ to construct the temporal background dictionary $\bar{B}_i^j$;
	- (b) Calculate the temporal anomaly score $t_i^j$ with the CSR or KCSR detector;

	**end for**
4. Smooth the temporal detection map $T_i$ by Equations (23) and (24);
5. Calculate the spatio-temporal fusion map $ST_i$ by Equation (25);

**end for**

**Output:** Spatio-temporal anomaly detection map $ST_i$ when $i > N_C$.

#### **4. Experimental Results and Discussion**

At the beginning of this section, a real HSI sequence dataset and a synthetic dataset are introduced. Subsequently, the capability of the proposed temporal anomaly detection with different background dictionary sizes and different spatial detection results is demonstrated in detail. Additionally, the proposed spatio-temporal anomaly detection is compared with several existing algorithms in terms of detection performance.

#### *4.1. Datasets and Evaluation Metrics*

The Cloud dataset is an HSI sequence with a complex cloudy background and was collected by the Interuniversity Microelectronics Centre of Beihang University with the xiSpec snapshot mosaic hyperspectral camera [12]. The dataset has a spatial size of 409 × 216 pixels and 25 spectral bands covering the 682–957 nm spectral region. The HSI sequence consists of 500 frames, in which an aircraft (Target A) rises from the bottom of the imagery. Since the distance between the camera and the aircraft increases over the frames, the size of the aircraft decreases over time, from 53 pixels in the 1st frame to 21 pixels in the 500th frame, resulting in a decreasing spectral difference from the background. However, because the aircraft's apparent speed in the HSIs also decreases, the number of frames the aircraft needs to pass through a pixel increases. Three small flying targets (Target B, Target C, and Target D) with no more than 10 pixels each exist in the 250th–393rd, 256th–363rd, and 417th–466th frames, respectively, and their velocities are all greater than 5 pixels per frame. As shown in Figure 2, there is noise clutter in the cloudy background.

**Figure 2.** False color local image around targets in the Cloud dataset. (**a**) Target A in the 50th frame. (**b**) Target A in the 500th frame. (**c**) Target B and Target C. (**d**) Target D.

The synthetic dataset is based on the Terrain dataset acquired by the Hyperspectral Digital Image Collection Experiment sensor. The dataset has a spatial size of 180 × 180 pixels and 210 spectral bands covering the 400–2500 nm spectral region, as shown in Figure 3a. The spatial resolution is 1 m, and the spectral resolution is 10 nm. The water absorption and high-noise bands are deleted, leaving 162 spectral bands usable in the experiments. According to the LMM, synthetic targets can be added to the Terrain dataset by:

$$\tilde{a} = (1 - \lambda)\, b + \lambda\, a + n \tag{26}$$

where $a$ is a pure target spectrum, $b$ is an original background spectrum, $\tilde{a}$ is the mixed target spectrum, $n$ is the added zero-mean Gaussian noise vector, and $\lambda$ is the target abundance to be set. Considering that the radiation response interval of the background varies across bands, Gaussian noise with a different variance is added to each band of the hyperspectral cube. The noise intensity is adjusted by the signal-to-noise ratio (SNR), expressed for this dataset as:

$$\text{SNR}\_{\text{dB}} = 10 \log\_{10} \left( \frac{\sigma\_{b,l}^2}{\sigma\_{n,l}^2} \right) \tag{27}$$

where $\sigma_{b,l}^2$ and $\sigma_{n,l}^2$ are the variances of the background and noise in the $l$-th band. Three targets with a size of 5 × 5 pixels and a speed of 2 pixels per frame are added to the Terrain dataset and move for 100 frames. The plane trajectories of the targets are the same, and the spacing between two adjacent targets is ten frames. Considering that the boundaries between neighboring objects are often accompanied by severe spectral mixing in real data, $\lambda$ of the 16 pixels on the periphery of each target is set to 10%, while that of the 9 pixels in the center is set to 40%. To explore the noise immunity of CSR-ST, the SNR is set to 20 dB, 10 dB, 5 dB, and 0 dB in turn. Figure 3b–f shows background spectra and mixed target spectra in different noise environments. As the SNR decreases, the discriminability between background and mixed targets also decreases. When the SNR is 0 dB, background spectra and mixed target spectra are almost indistinguishable.
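A minimal sketch of the target implantation in Equations (26) and (27), assuming the per-band background standard deviations are supplied, could look as follows (the function name and interface are hypothetical):

```python
import numpy as np

def implant_target(b, a, lam, snr_db, sigma_b, rng=None):
    """Mix a pure target spectrum into a background pixel, Eqs. (26)-(27).

    b, a : (N,) background and pure target spectra; lam : target abundance;
    sigma_b : (N,) per-band background standard deviations (assumed given).
    """
    rng = rng or np.random.default_rng(0)
    sigma_n = np.asarray(sigma_b) / np.sqrt(10.0 ** (snr_db / 10.0))  # Eq. (27)
    return (1.0 - lam) * b + lam * a + rng.normal(0.0, sigma_n)       # Eq. (26)
```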

**Figure 3.** The synthetic Terrain dataset. (**a**) Original false color image. (**b**–**f**) Spectral curves of background and target pixels with different noise. The blue curve is a pure target pixel; the four red curves are background pixels; orange curves are mixed target pixels with a target abundance of 40%; and green curves are mixed target pixels with a target abundance of 10%. (**b**) No noise. (**c**) SNR = 20 dB. (**d**) SNR = 10 dB. (**e**) SNR = 5 dB. (**f**) SNR = 0 dB.

To evaluate anomaly detection performance, this article adopts the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The detection probability ($P_d$) and false alarm rate ($P_f$) are computed on a segmentation map, which is obtained by thresholding the detection map. After iterating over the threshold, the resulting set of ($P_f$, $P_d$) pairs is used to plot the ROC curve. An excellent detector has an upper-left ROC curve [43]. However, the ROC curve can only qualitatively analyze detection performance. AUC [44] gives an intuitive and quantitative description and is calculated as a sum of trapezoids:

$$\text{AUC} = \frac{1}{2} \sum\_{l=1}^{n-1} (P\_f^{l+1} - P\_f^l)(P\_d^{l+1} + P\_d^l) \tag{28}$$

where $(P_f^l, P_d^l)$ is the $l$-th coordinate point and $n$ is the number of coordinate points constituting the curve. The closer the AUC value is to 1, the better the detection algorithm. For an anomaly detector on an HSI sequence, the mean ROC over all frames describes the overall performance.
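Equation (28) is the standard trapezoid rule over the ROC points, e.g.:

```python
import numpy as np

def auc_trapezoid(pf, pd):
    """AUC by the trapezoid rule of Eq. (28); pf, pd sorted by increasing pf."""
    pf, pd = np.asarray(pf), np.asarray(pd)
    return 0.5 * np.sum((pf[1:] - pf[:-1]) * (pd[1:] + pd[:-1]))
```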

Considering that kernel space can represent hyperspectral data better, the proposed spatio-temporal anomaly detection algorithm is based on the KCSR model in the following experiments. KCSR-S, KCSR-SF, KCSR-T, and KCSR-ST denote spatial detection, smoothed spatial detection, temporal detection, and spatio-temporal fusion detection, respectively. All the experiments were implemented on a machine that was equipped with an Intel Core i9-9980XE CPU and 128-GB RAM, and the programs were written in Python.

#### *4.2. Temporal Detection Performance under Different Settings of the Temporal Background Dictionary*

For the KCSR-based temporal detection, the parameter $\nu$ can be set to the same value as in spatial detection, as analyzed in Section 3.3. Moreover, because the spatial and temporal detection share the same background spectra, the kernel space of the spatial detection is also suitable for the temporal detection. Therefore, after the parameters of the spatial detection are adjusted, the settings remaining to be tuned in the temporal detection are $N_D$ and $N_C$, the sizes of the temporal background dictionary and its candidate set, respectively. To further explain $N_D$ and $N_C$, we define the number of removed atoms as:

$$N\_{\mathbb{R}} = N\_{\mathbb{C}} - N\_{\mathbb{D}}.\tag{29}$$

The purpose of the candidate set is to prevent the background dictionary from target contamination, and $N_R$ should ensure that most of the abnormal spectra can be removed from the candidate set.

#### 4.2.1. Experiments on the Cloud Dataset

Traditional temporal profile filtering algorithms require strong prior information about the target velocity. We counted the number of frames that targets take to pass through a single pixel in the Cloud dataset and drew the histogram. As shown in Figure 4, 3241 pixels are passed through by targets within 20 frames, while only 130 pixels take more than 20 frames. The latter occur mainly in the second half of the sequence, because the aircraft is far away from the camera and becomes slower in the imagery.

**Figure 4.** Histogram of the number of frames that targets take to pass through a single pixel in the Cloud dataset.

To explore the impact of the temporal background dictionary on the temporal detection performance, we set $N_C$ to 20, 30, 50, 80, and 100, and $N_R$ to 10, 20, 30, and 40, respectively. Because the first 100 frames of the Cloud dataset are selected as the temporal background candidate set, the temporal anomaly detection starts at the 101st frame. The parameters $\nu$, $\gamma$, and the dual-window size of KCSR-S are empirically tuned to acquire the best detection capability in the first frame. The mean AUCs of KCSR-T on the Cloud dataset are shown in Table 1. When $N_C$ is set to 20, the mean AUC of KCSR-T is the worst value in the table, 0.970966. This is because, if the dictionary candidate set is too small, the temporal background dictionaries of some target pixels can consist mainly of target spectra. When $N_C$ is set to 50, 80, and 100, the mean AUCs of KCSR-T with $N_R = 20$ are better than those with $N_R = 10$, because the former removes more target spectra from the dictionary candidate set. When $N_C$ is set to 30 and 50, the mean AUCs of KCSR-T with $N_D = 10$ are worse than those with $N_D = 20$, which indicates that a small temporal background dictionary is not conducive to representing spectral features. Moreover, the best mean AUC in Table 1 is 0.980302, achieved when $N_C$ is 50 and $N_R$ is 20.


**Table 1.** The mean AUC performance achieved by KCSR-T on the Cloud dataset with different settings of the temporal background dictionary.

#### 4.2.2. Experiments on the Synthetic Terrain Dataset

To explore how to set the temporal background dictionary on the synthetic Terrain dataset, we set $N_C$ to 20, 30, 40, and 50 in turn, and $N_R$ to 10, 20, 30, and 40, respectively. The first 50 frames of the synthetic Terrain dataset are selected as the temporal background candidate set, and KCSR-T is performed on the last 50 frames. The parameters $\nu$, $\gamma$, and the dual-window size of KCSR-S are empirically tuned to acquire the best detection capability in the first frame. As shown in Table 2, when the background dictionary size $N_D$ is fixed to 10, the worst mean AUC is achieved with $N_R = 10$. This is because the former spectra of target pixels contain at most 8 target spectra, and a small $N_R$ is not conducive to removing them from the background dictionary candidate set. As the SNR decreases, the distinction between background and target spectra decreases, and the gaps between the mean AUCs of $N_R = 10$ and the other settings become larger. In addition, the best mean AUCs in the four noise conditions are all achieved by $N_C = 50$ and $N_R = 20$, and are 0.999258, 0.932968, 0.819948, and 0.685078, respectively.

**Table 2.** The mean AUC performance achieved by KCSR-T on the synthetic Terrain dataset in four noise conditions with different settings of the temporal background dictionary. (**a**) SNR = 20 dB. (**b**) SNR = 10 dB. (**c**) SNR = 5 dB. (**d**) SNR = 0 dB.


#### *4.3. Detection Performance under Different Settings of the Dual-Window*

As mentioned in Section 3.3, it is difficult for moving target detection to set the optimal dual-window size in advance in the spatial anomaly detection. One important reason for this is that the sizes of moving targets can change. For the Cloud dataset, as the airplane moves away from the camera, the aircraft size in the HSI becomes smaller. In Section 4.2.1, the dual-window size (*win*, *wout*) of KCSR-S was set to (29, 31), which is the optimal size in the first frame. However, (29, 31) is too large for the aircraft in the 500th frame, which occupies only 21 pixels. To explore the impact of different settings of the dual-window on KCSR-S and KCSR-T, we set the dual-window size to (3, 5), (9, 11), (13, 15), (19, 21), (23, 25), and (29, 31), respectively. Although the iterative smoothing filter can improve the spatial detection map of KCSR-S, in this subsection, KCSR-T uses the original spatial detection results rather than the smoothed results to select the temporal background dictionary.
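For reference, a minimal sketch of the dual-window idea follows (the helper name and array layout are ours; the actual KCSR-S detector builds a constrained sparse representation on top of these background atoms): the local background for a test pixel is collected from the ring of pixels lying between the inner window *win* and the outer window *wout*.

```python
import numpy as np

def dual_window_background(cube, row, col, w_in, w_out):
    """Collect the spectra lying between the inner and outer windows centred
    on (row, col), i.e., the usual dual-window local background.

    cube : (H, W, L) hyperspectral frame; w_in < w_out are odd window sizes.
    """
    h, w, _ = cube.shape
    r_in, r_out = w_in // 2, w_out // 2
    spectra = [cube[r, c]
               for r in range(max(0, row - r_out), min(h, row + r_out + 1))
               for c in range(max(0, col - r_out), min(w, col + r_out + 1))
               if max(abs(r - row), abs(c - col)) > r_in]  # exclude the inner window
    return np.stack(spectra)  # rows are the local background atoms
```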

As shown in Table 3, the dual-window size has a significant influence on the detection capability of KCSR-S. The best mean AUC of KCSR-S in the 101st–500th frames is 0.970251, while the worst mean AUC is 0.961412. However, the mean AUC of KCSR-T is better than that of KCSR-S under all dual-window sizes and fluctuates only in a small range, from 0.979962 to 0.980175. To give a more intuitive representation, we fit the variation curves of AUC with time in the 101st–500th frames by a polynomial of degree 15. As shown in Figure 5a, with the change of the aircraft size, the optimal dual-window size also changes over time: it is (23, 25) around the 200th frame, (19, 21) around the 300th frame, (13, 15) around the 450th frame, and (9, 11) around the 480th frame. Although the fitted curve with a dual-window size of (29, 31) performs well at the beginning of the sequence, the gap between this curve and the best AUC increases over time. However, the AUC of KCSR-T is almost impervious to the dual-window size of KCSR-S. Compared to Figure 5a, the curves with different dual-window sizes in Figure 5b are almost the same. There are two reasons why KCSR-T is robust to the dual-window size of KCSR-S. On the one hand, different dual-window sizes result in different anomaly scores of target pixels in the spatial detection, and an unsuitable size can lead to lower anomaly scores. However, for the same pixel, even under an unsuitable dual-window size, the gap between the anomaly scores of target and non-target pixels is still large enough to remove anomalous spectra from the candidate set of the temporal background dictionary. On the other hand, KCSR-T can also automatically remove anomalous atoms from the background dictionary during the temporal detection process. In conclusion, the proposed temporal anomaly detection is remarkably robust to the dual-window size used in the spatial detection, and the combination of the spatial and temporal detection can overcome the disadvantages of the dual-window strategy.

**Figure 5.** The fitted variation curves of AUC with time in the 101st–500th frames under different settings of the dual-window. (**a**) The curves of KCSR-S. (**b**) The curves of KCSR-T.

**Table 3.** The mean AUC performance achieved by KCSR-S and KCSR-T on the Cloud dataset with different settings of the dual-window.


#### *4.4. Comparison to the State-of-the-Art*

In this subsection, the KCSR-ST algorithm is compared with several single-frame HSI anomaly detection algorithms, including RX [13], QLRX [15], KSVDD [19], KRX [16], CR [22], KCR [22], and CSR. Meanwhile, the proposed algorithm is also compared with two moving-target detection algorithms, VF [5] and STH [12]. For fairness, both VF and STH are based on KCSR in the following experiments and are denoted by KCSR-VF and KCSR-STH, respectively. All parameters of these algorithms are empirically tuned to acquire the best detection capability at the beginning of the sequences. The dual-window sizes are set to (29, 31) and (9, 15) for the Cloud and Terrain datasets, respectively. The *NC* and *ND* on the two datasets are set to the optimal values obtained in Section 4.2. The temporal filter weight *ρ* is set to 0.5, and the spatial smoothing step adopts a simple 3 × 3 mean filter. The AUC performance and detection maps of KCSR-S, KCSR-SF, and KCSR-T are also shown to explore the role of each step in the proposed KCSR-ST algorithm.
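All comparisons below are reported in terms of ROC curves and AUC values. For reference, the per-frame and per-sequence AUC figures can be computed as in the following sketch (an illustration using scikit-learn; the variable names are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_auc(detection_map, ground_truth):
    """AUC of a single frame: detection scores against the binary target mask."""
    return roc_auc_score(ground_truth.ravel().astype(int), detection_map.ravel())

def mean_auc(detection_maps, ground_truths):
    """The per-sequence figure reported in the tables: the mean over all frames."""
    return float(np.mean([frame_auc(d, g)
                          for d, g in zip(detection_maps, ground_truths)]))
```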

#### 4.4.1. Experiments on the Cloud Dataset

The ROC curves obtained on the Cloud dataset are shown in Figure 6; the AUC values are shown in Table 4; and the color detection maps are shown in Figure 7. These all illustrate that KCSR-ST is superior to all single-frame and multiple-frame anomaly detection algorithms.

**Table 4.** The mean AUC performance obtained on the Cloud dataset.

**Figure 6.** ROC curves obtained on the Cloud dataset. (**a**) Logarithmic abscissa; (**b**) linear abscissa.

As shown in Table 4, the best AUC value among the single-frame anomaly detection algorithms is 0.9649, achieved by KCSR. Taking advantage of temporal information, the AUC values of KCSR-VF, KCSR-STH, and KCSR-T are all higher than those of the single-frame algorithms. The reason for this phenomenon can be explained with Figure 7. As shown in Figure 7a, obvious vignetting exists at the edges of the false color images. Vignetting is a common phenomenon in photography, but it turns the edges of HSIs into heterogeneous background pixels. Therefore, there is always a relatively large number of false alarms at the edges of the detection maps obtained by single-frame algorithms, as shown in Figure 7c–j. Because KCSR-VF and KCSR-T make use of the historical spatial detection results and the former spectra of test pixels, respectively, the heterogeneous background pixels rarely lead to false alarms in the corresponding detection maps, as shown in Figure 7k,n.

**Figure 7.** Color detection maps obtained in the 400th frame of the Cloud dataset. (**a**) False color image; (**b**) ground-truth map; (**c**) RX; (**d**) QLRX; (**e**) KRX; (**f**) KSVDD; (**g**) CR; (**h**) KCR; (**i**) CSR; (**j**) KCSR-S; (**k**) KCSR-VF; (**l**) KCSR-STH; (**m**) KCSR-SF; (**n**) KCSR-T; (**o**) KCSR-ST.

However, the historical trajectory of Target B turns into false alarms in the detection map of KCSR-VF in the 400th frame. That is because the VF algorithm is mainly designed to detect slow targets, and the parameter setting of VF depends on the speed of the targets. Because the velocities of Targets B, C, and D are all greater than 5 pixels per frame, these targets pass through a pixel within a single frame, and the temporal variance-calculation window suitable for Target A is too long for them. As long as the temporal variance-calculation window contains the trajectory of Targets B, C, or D, the detection results can take high values and become false alarms in the VF detection map. Moreover, KCSR-STH combines KCSR-VF with other spatial detection maps and is only slightly affected by these false alarms, as shown in Figure 7l.

As shown in Figure 7j,n, there is much background clutter on the detection maps of KCSR-S and KCSR-T. Compared to KCSR-S, the spatial detection map after the iterative smoothing filter, KCSR-SF, suppresses the background clutter and enhances the target. However, the false alarms resulting from the heterogeneous background are also enhanced in Figure 7m. KCSR-ST combines the smoothed spatial detection map (KCSR-SF) with the smoothed temporal detection map, and the heterogeneous background and the background clutter are entirely suppressed in Figure 7o. As shown in Figure 6a, the ROC curve of KCSR-ST lies to the upper left of those of the other algorithms, which indicates that KCSR-ST is superior to the single-frame and multi-frame anomaly detection algorithms. However, when *Pf* is limited to an extremely low value range, the detection performance of KCSR-ST is inferior to that of KCSR-T. As shown in Figure 6b, when *Pf* is 10<sup>−5</sup>, the *Pd* of KCSR-T is about 0.75, while that of KCSR-ST is only about 0.35. Furthermore, when *Pf* is smaller than 10<sup>−5</sup>, the ROC curve of KCSR-S outperforms that of KCSR-SF. This is because the iterative smoothing filter enhances target pixels and the pixels around targets. Compared to KCSR-S and KCSR-T, KCSR-SF and KCSR-ST blur the boundary between target and background in the detection maps. However, the iterative smoothing filter can still be regarded as a useful strategy. Although it reduces *Pd* when *Pf* is low, the enhancement improves the ability to detect slow targets and the robustness to different moving speeds of the targets. For most hyperspectral anomaly detection scenarios, the focus is on whether the target exists rather than on the shape of the target, and the false alarms that result from the enhancement of pixels around the target have little influence on this judgment. Besides, the enhancement from the iterative smoothing filter can be optimized by adjusting the filter weights or changing the smoothing strategy.

#### 4.4.2. Experiments on the Terrain Dataset

The ROC curves achieved on the synthetic Terrain dataset under different noise environments are shown in Figure 8; the color detection maps are shown in Figures 9–11; and the AUC results are shown in Table 5. Our proposed KCSR-ST algorithm is considerably robust to noise and superior to all single-frame anomaly detection algorithms. When the SNR is set to 20 dB, 10 dB, 5 dB, and 0 dB, the corresponding mean AUC of KCSR-ST is 0.9996, 0.9959, 0.9461, and 0.7516, respectively, whereas the best mean AUC among the single-frame algorithms is 0.8402, 0.7438, 0.7057, and 0.6205, respectively. As shown in Figure 9c–j, Figure 10c–j, and Figure 11c–j, there are a large number of false alarms on the detection maps of the single-frame algorithms because some trees are sparsely distributed in the scene.


**Table 5.** The mean AUC performance obtained on the synthetic Terrain dataset.

**Figure 8.** ROC curves obtained on the synthetic Terrain dataset. (**a**,**b**) adopt logarithmic abscissas and (**c**,**d**) adopt linear abscissas. (**a**) SNR = 20 dB; (**b**) SNR = 10 dB; (**c**) SNR = 5 dB; (**d**) SNR = 0 dB.

Although KCSR-VF and KCSR-STH are also superior to single-frame algorithms, their detection performance is far inferior to that of KCSR-ST on the Terrain dataset. As shown in Figure 9k, the trajectory of targets results in false alarms on the detection map of KCSR-VF. That is because the targets share the same trajectory, and the baseline background of VF cannot be estimated accurately. KCSR-STH combines KCSR-VF, KCSR-S, and its temporal detection and suppresses the background and false alarms. However, because the temporal detection of STH extracts the background dictionary of the test pixel in the forward frame by the same dual-window as KCSR-S, the false alarms resulting from sparse trees are still on the temporal detection map of KCSR-STH and then appear in the final fusion map, i.e., Figure 9l.

As shown in Figure 9n, when the SNR is 20 dB, KCSR-T has an excellent ability to detect moving targets. Although the mean AUC of KCSR-T is slightly lower than that of KCSR-ST, the ROC performance of KCSR-T outperforms KCSR-ST when *Pf* is smaller than 10<sup>−3</sup>. As shown in Figure 8a, when *Pf* is 10<sup>−5</sup>, the *Pd* of KCSR-T is about 0.98, while the *Pd* values of all the other algorithms are below 0.25. When the SNR is 10 dB, the mean AUC of KCSR-T is much lower than that of KCSR-ST, and the ROC performance of KCSR-T is also inferior to that of KCSR-ST. That is because the target abundances of the pixels around targets are lower, and these pixels cannot be detected by KCSR-T, as shown in Figure 10n. Employing the iterative smoothing filter, KCSR-ST enhances the anomaly scores of pixels around targets and therefore performs prominently under the ROC and AUC evaluation metrics. When the SNR drops to 5 dB, there is much noise clutter on the detection map of KCSR-T, i.e., Figure 11n, and the mean AUC of KCSR-T drops to 0.8199. As shown in Figure 8c, the ROC curves of the single-frame algorithms are close to the diagonal, which means that their ability to detect moving targets is severely degraded. KCSR-ST can effectively suppress the noise clutter and false alarms on the detection map, as shown in Figure 11o, and its ROC performance is much better than that of the other curves in Figure 8c. Even when the SNR drops to 0 dB, KCSR-ST can still detect targets. As shown in Figure 8d, the ROC curves of the other algorithms are around the diagonal, while the *Pd* of KCSR-ST reaches 0.6 when *Pf* is 0.1.

**Figure 9.** Color detection maps obtained in the 60th frame of the synthetic Terrain dataset when SNR = 20 dB. (**a**) False color image; (**b**) ground-truth map; (**c**) RX; (**d**) QLRX; (**e**) KRX; (**f**) KSVDD; (**g**) CR; (**h**) KCR; (**i**) CSR; (**j**) KCSR-S; (**k**) KCSR-VF; (**l**) KCSR-STH; (**m**) KCSR-SF; (**n**) KCSR-T; (**o**) KCSR-ST.

**Figure 10.** Color detection maps obtained in the 70th frame of the synthetic Terrain dataset when SNR = 10 dB. (**a**) False color image; (**b**) ground-truth map; (**c**) RX; (**d**) QLRX; (**e**) KRX; (**f**) KSVDD; (**g**) CR; (**h**) KCR; (**i**) CSR; (**j**) KCSR-S; (**k**) KCSR-VF; (**l**) KCSR-STH; (**m**) KCSR-SF; (**n**) KCSR-T; (**o**) KCSR-ST.

**Figure 11.** Color detection maps obtained in the 80th frame of the synthetic Terrain dataset when SNR = 5 dB. (**a**) False color image; (**b**) ground-truth map; (**c**) RX; (**d**) QLRX; (**e**) KRX; (**f**) KSVDD; (**g**) CR; (**h**) KCR; (**i**) CSR; (**j**) KCSR-S; (**k**) KCSR-VF; (**l**) KCSR-STH; (**m**) KCSR-SF; (**n**) KCSR-T; (**o**) KCSR-ST.

#### **5. Conclusions**

In traditional single-frame anomaly detection, false alarms on stationary targets and non-homogeneous backgrounds are unavoidable. Besides, detecting targets in complex motion is still a challenge for multi-frame algorithms. In this article, a constrained sparse representation-based spatio-temporal AD algorithm is proposed to identify small and dim moving targets in hyperspectral sequences and overcome the aforementioned drawbacks. Our algorithm includes a spatial detector and a temporal detector. The former can suppress moving background regions, and the latter can suppress non-homogeneous background and stationary objects. Moreover, two temporal background purification procedures ensure the effectiveness of the temporal detector for targets in complex motion. Experiments on the Cloud dataset and the synthetic Terrain dataset indicate that our algorithm is superior to other classic detection algorithms. Even when the noise clutter is extreme, our algorithm can suppress the clutter and effectively detect small and dim moving targets.

Our algorithm provides a novel spatio-temporal anomaly detection framework for hyperspectral remote sensing. In addition, adaptive anomaly elimination in the temporal background is a promising idea for detecting targets in complex motion. However, the proposed algorithm requires accurate frame registration and places heavy demands on data storage. Besides, the iterative smoothing filter can effectively suppress background clutter but blurs the boundary between the target and the background. In future work, we will focus on reducing the algorithm's need for inter-frame matching and data storage and on improving the iterative smoothing filter by introducing edge-preserving filters. Furthermore, the proposed algorithm can be combined with target tracking, state estimation, and trajectory prediction to provide motion information about targets.

**Author Contributions:** Conceptualization, methodology, and software, Q.L. and Z.L. (Zhaoxu Li); writing, original draft preparation, Z.L. (Zhaoxu Li) and Z.W.; writing, review and editing, Q.L., Z.L. (Zaiping Lin), and J.W.; visualization, Z.W.; project administration, Z.L. (Zaiping Lin). All authors read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China under Grant 61605242, Grant 61602499, and Grant 61471371.

**Acknowledgments:** Thanks to Wang of Beihang University for providing the Cloud dataset.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**



### *Article* **Toward Super-Resolution Image Construction Based on Joint Tensor Decomposition**

**Xiaoxu Ren 1, Liangfu Lu 2,\* and Jocelyn Chanussot <sup>3</sup>**


Received: 30 June 2020; Accepted: 3 August 2020; Published: 6 August 2020

**Abstract:** In recent years, fusing hyperspectral images (HSIs) and multispectral images (MSIs) to acquire super-resolution images (SRIs) has been in the spotlight and has gained tremendous attention. However, some current methods, such as those based on low rank matrix decomposition, have their share of challenges. These algorithms matricize the original image tensor, which loses the structural information of the original image. In addition, owing to the non-uniqueness of matrix decomposition, there is no corresponding theory to guarantee the accurate restoration of the fused image. Moreover, the degenerate operators are usually unknown or difficult to estimate in practical applications. In this paper, an image fusion method based on joint tensor decomposition (JTF) is proposed, which is more effective and more applicable to the circumstance in which the degenerate operators are unknown or hard to estimate. Specifically, in the proposed JTF method, we consider the SRI as a three-dimensional tensor and recast the fusion problem as a joint tensor decomposition problem. We then formulate the JTF algorithm, and the experimental results certify the superior performance of the proposed method in comparison to current popular schemes.

**Keywords:** hyperspectral image; multispectral image; image fusion; joint tensor decomposition

#### **1. Introduction**

With the flourishing of both artificial intelligence (AI) and mathematical theory, image fusion has long been a focus in neuroscience, metabonomics, remote sensing, and many other fields [1–3]. Generally, image fusion refers to synthesizing image data obtained from diverse image acquisition equipment. It aims to combine complementary information from different information sources to acquire clearer, more informative, and higher quality reconstructed images. In 1994, van Genderen and Pohl proposed a simple and intuitive definition of image fusion [4]: image fusion is merging two or more images into a new image using some algorithm. Therein, hyperspectral images (HSIs), playing a catalytic role in image fusion, have been widely leveraged in geophysical exploration [5], agricultural remote sensing [6], marine remote sensing [7], environmental monitoring [8], and other fields [9–13] because of their rich spectral information. However, the spatial resolution of HSIs is still relatively low, limited by the imaging equipment and the complex imaging environment, and therefore cannot meet the application requirements of unmixing, classification, detection, etc., which further limits the prospects of HSIs. Therefore, how to improve the resolution of hyperspectral images has become a key issue in the field of image processing.

Specifically, panchromatic fusion and hyperspectral–multispectral fusion are the two main forms of hyperspectral super-resolution image reconstruction. Panchromatic fusion refers to fusing multispectral images (or hyperspectral images) with panchromatic images to obtain images with more spatial information. In [14], the authors classified the pansharpening techniques into component substitution (CS) [15] and multi-resolution analysis (MRA) [16]. Meanwhile, they proposed a hybrid method, combining the better spatial information of CS with the more accurate spectral information of MRA techniques, to improve the spatial resolution while preserving the original spectral information as much as possible.

Concretely, a multispectral image (MSI) generally consists of dozens of bands, and most of them are in the range of the visible region. Given a low-spatial resolution HSI, the operation of spatial resolution enhancement using MSI under the same scene is termed hyperspectral image fusion or super-resolution image (SRI) reconstruction. Generally, the MSI has a higher spatial resolution than the HSI, which is complementary to the HSI. In this paper, we mainly study the method of SRI reconstruction by combining the spectral information of HSI with the spatial information of MSI under the same scene.

Nevertheless, the existing technologies can avoid neither the distortion of the spectral characteristics of the image nor the complex and time-consuming frequency decomposition and reconstruction. Therefore, Yokoya proposed a simple spectral-preservation fusion technique, the smoothing filter based intensity modulation (**SFIM**) [17]; a related approach is the generalized Laplacian pyramid (**GLP**) [18–21] based on the modulation transfer function (**MTF**), which has been used to fuse the HSI and MSI by hypersharpening [22]. Utilizing the ratio between the high-resolution image and its low-pass filtered (smoothed) version, spatial details can be modulated into the co-registered low-resolution MSIs without changing their spectral characteristics and contrast. Compared with the Brovey transform [23], **SFIM** is an advanced fusion technique that improves the spatial detail of MSIs while reliably preserving their spectral characteristics.

Beyond that, Eismann introduced a maximum a posteriori estimation method [24]. It combined a stochastic mixing model of the spectral scene content and developed a cost function that optimizes the estimation of the hyperspectral scene with respect to the observed hyperspectral and auxiliary images. Moreover, this method can generally reconstruct sub-pixel information of several main components in the SRI estimation. Furthermore, sparse representation has often been employed to deal with various types of image processing problems, especially inverse problems. In 2006, Elad denoised images with the sparse representation method [25]; this not only achieved state-of-the-art results but also introduced the K-SVD dictionary training method [26]. In 2010, the authors of [27] proposed a single-frame super-resolution image reconstruction method based on sparse representation.

In contrast, traditional methods, such as principal component substitution and enhancement of least squares estimation, are primarily limited to the first principal component. For remote sensing images, the spectral characteristics of pixels are denoted as endmembers, including mixed endmembers and pure endmembers. Since each pixel has mixed endmembers, unmixing is a technique for estimating the number of pure endmembers in each pixel, the spectral characteristics, and the abundance of the endmembers [28]. The SRI reconstruction method based on unmixing usually decomposes the HSI and the MSI of the same scene. The endmember matrix of the decomposed HSI and the abundance matrix of the MSI are combined to obtain the reconstructed HSI with high spatial resolution.

In the effort of [29], the authors proposed a method of enhancing the spatial resolution of the HSI based on unmixing technology: coupled nonnegative matrix factorization (**CNMF**). They exploited nonnegative matrix factorization to unmix the HSI and MSI sequentially and iteratively obtained the endmember matrix and the abundance matrix; ultimately, the SRI was obtained by combining the two matrices. Although **CNMF** can achieve good reconstruction results, nonnegative matrix factorization usually cannot guarantee a unique solution. To address this matter, Wycoff presented a nonnegative sparse enhancement model for SRI reconstruction [30], which showed that the solution in the **CNMF** method is not unique and that the method has high computational complexity and demanding CPU requirements. To further boost the effect of super-resolution reconstruction, the authors of [31] came up with a method that resolves the problems of super-resolution and hyperspectral unmixing simultaneously. Unlike the measures in [29], they took advantage of the proximal alternating linearized minimization (PALM) [32] to update both simultaneously, while the endmember matrix was initialized using SISAL [33] for endmember extraction.

Simultaneously, some researchers have studied the fusion of the MSI and HSI based on tensors [34–40], mainly considering the natural tensor structure of spectral images, so as to reduce the information lost in matricization and increase the performance. Generally, multi-channel images and similar data have their own natural tensor structure. In addition, since tensors have good expressive ability and computational characteristics, it is very meaningful to study the tensor analysis of images. Moreover, tensor decomposition can preserve the structural characteristics of the original image data. For HSIs, tensor decomposition makes full use of the spatial and spectral redundancy between images and compresses and extracts the relevant feature information with high quality. Based on HSIs, a nonnegative tensor canonical polyadic (CP) decomposition algorithm was proposed and applied to blind source separation [41]. Shashua utilized CP decomposition for image compression and classification [42], while Bauckhage introduced discriminant analysis for high-order data such as color images for classification [43]. Xiao Fu proposed a coupled tensor decomposition framework [44], which can guarantee the identifiability of SRIs under mild and realistic conditions. Meanwhile, Shutao Li and Renwei Dian put forward a coupled sparse tensor factorization (**CSTF**) [45]; they regarded the SRI as a three-dimensional tensor and redefined the fusion problem as the estimation of a core tensor and dictionaries of three modes. The high spatial-spectral correlation in the SRI was modeled by a regularizer that promotes sparse core tensors. However, **CSTF** is an optimization model based on tensor Tucker decomposition, which is not unique. Moreover, most existing methods assume that known (or easily estimated) degenerate operators are applied to the SRI to form the corresponding HSI and MSI, which is rarely the case in practice. In this paper, we deal with the super-resolution problem under the condition that the degenerate operators are unknown and contain noise. A joint tensor decomposition model is proposed by taking advantage of the multi-dimensional tensor structure of the HSI and MSI.

The main content of this paper is to utilize the joint tensor decomposition (**JTF**) algorithm for the fusion of HSI and MSI, so as to explore the problem of SRI reconstruction. The contributions are listed as follows:


The outline of this paper is organized as follows. Section 2 introduces the basic notation and definitions for tensors and gives an overview of tensor decompositions. The fusion problem is formulated in Section 3, and the proposed joint tensor decomposition algorithm is introduced in Section 4. Experimental results are presented in Section 5, and conclusions and future research directions are given in Section 6.

#### **2. Preliminaries on Tensors**

#### *2.1. Definition and Notations*

In this section, we first briefly introduce some necessary notions and preliminaries. A general tensor is denoted as X, while the element (*i*, *j*, *k*) of a third-order tensor X is denoted by *xijk*. A matrix is denoted as **X**, and a scalar (or vector) is represented by *x*. A fiber is defined by fixing every index but one. Third-order tensors have column, row, and tube fibers, denoted by *x*:*jk*, *xi*:*k*, and *xij*:, respectively; see Figure 1. Fibers are always assumed to be column vectors. The mode-*n* matricization of a tensor X ∈ R*I*1×*I*2×···×*IN* is denoted by **X**(*n*); it arranges the mode-*n* fibers as the columns of the matrix and reduces the tensor to a matrix.

**Figure 1.** Fibers of a third-order tensor X.

**Definition 1.** *The Kronecker product of matrices* **<sup>A</sup>** <sup>∈</sup> <sup>R</sup>*I*×*<sup>J</sup> and* **<sup>B</sup>** <sup>∈</sup> <sup>R</sup>*K*×*<sup>L</sup> is defined by Equation (1), which is denoted as* **A** ⊗ **B***, and the calculation result is a matrix of size IK* × *JL, i.e.,*

$$\mathbf{A}\otimes\mathbf{B}=\begin{bmatrix}a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1J}\mathbf{B}\\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2J}\mathbf{B}\\ \vdots & \vdots & \ddots & \vdots\\ a_{I1}\mathbf{B} & a_{I2}\mathbf{B} & \cdots & a_{IJ}\mathbf{B}\end{bmatrix}\tag{1}$$

$$=\begin{bmatrix}a_1\otimes b_1 & a_1\otimes b_2 & a_1\otimes b_3 & \cdots & a_J\otimes b_{L-1} & a_J\otimes b_L\end{bmatrix}.$$

Then, we have a new matrix-matrix production termed as the Khatri–Rao product.

**Definition 2.** *Let* **<sup>A</sup>** <sup>∈</sup> <sup>R</sup>*I*×*<sup>K</sup> and* **<sup>B</sup>** <sup>∈</sup> <sup>R</sup>*J*×*K. Then, the Khatri–Rao product is a matrix of size I J* <sup>×</sup> *<sup>K</sup> defined as:*

$$\mathbf{A}\odot\mathbf{B}=\begin{bmatrix}a_1\otimes b_1 & a_2\otimes b_2 & \cdots & a_K\otimes b_K\end{bmatrix},\tag{2}$$

*where* ⊗ *is the Kronecker product.*

Next, we discuss some properties of the Khatri–Rao product, which will be useful in our later discussion.

$$\begin{aligned}\mathbf{A}\odot\mathbf{B}\odot\mathbf{C} &= (\mathbf{A}\odot\mathbf{B})\odot\mathbf{C}=\mathbf{A}\odot(\mathbf{B}\odot\mathbf{C}),\\ (\mathbf{A}\odot\mathbf{B})^{\mathrm{T}}(\mathbf{A}\odot\mathbf{B}) &= \mathbf{A}^{\mathrm{T}}\mathbf{A}\ast\mathbf{B}^{\mathrm{T}}\mathbf{B},\\ (\mathbf{A}\odot\mathbf{B})^{\dagger} &= \big((\mathbf{A}^{\mathrm{T}}\mathbf{A})\ast(\mathbf{B}^{\mathrm{T}}\mathbf{B})\big)^{\dagger}(\mathbf{A}\odot\mathbf{B})^{\mathrm{T}},\end{aligned}\tag{3}$$

where ∗ denotes the Hadamard (elementwise) product.
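These identities are easy to check numerically. The sketch below verifies the second and third lines of Equation (3) with NumPy/SciPy, where scipy.linalg.khatri_rao implements the column-wise Kronecker product of Definition 2 (the toy sizes are our own choice):

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 3)), rng.standard_normal((5, 3))

KR = khatri_rao(A, B)                      # Equation (2): k-th column is a_k (x) b_k
assert np.allclose(KR[:, 0], np.kron(A[:, 0], B[:, 0]))

# second identity: (A . B)^T (A . B) = (A^T A) * (B^T B)
G = (A.T @ A) * (B.T @ B)                  # * is the elementwise (Hadamard) product
assert np.allclose(KR.T @ KR, G)

# third identity: the pseudo-inverse needs only an R x R inversion
assert np.allclose(np.linalg.pinv(G) @ KR.T, np.linalg.pinv(KR))
```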

**Definition 3.** *The n-mode (matrix) product of a tensor* X ∈ <sup>R</sup>*I*1×*I*2×···×*IN with a matrix* **<sup>M</sup>** <sup>∈</sup> <sup>R</sup>*J*×*In is represented as:*

$$(\mathcal{X}\times_n\mathbf{M})_{i_1\cdots i_{n-1}\,j\,i_{n+1}\cdots i_N}=\sum_{i_n=1}^{I_n}x_{i_1 i_2\cdots i_N}\,m_{j i_n},\tag{4}$$

*which can be denoted by* X ×*<sup>n</sup>* **M** *and is a tensor with a size of I*<sup>1</sup> ×···× *In*−<sup>1</sup> × *J* × *In*+<sup>1</sup> ×···× *IN.*
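In code, the n-mode product is a matrix product against the mode-n matricization; a small NumPy sketch of Definition 3 follows (we adopt a Fortran-order memory layout so that the CP unfolding identities used later hold; the helper names are ours):

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: the mode-n fibers become columns (Fortran order)."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def fold(Xn, n, shape):
    """Inverse of unfold() for a target tensor shape."""
    full = [shape[n]] + [s for a, s in enumerate(shape) if a != n]
    return np.moveaxis(np.reshape(Xn, full, order='F'), 0, n)

def mode_n_product(X, M, n):
    """Equation (4): X x_n M replaces dimension I_n by J, with M in R^{J x I_n}."""
    shape = list(X.shape)
    shape[n] = M.shape[0]
    return fold(M @ unfold(X, n), n, shape)

X, M = np.random.rand(4, 5, 6), np.random.rand(7, 5)
assert mode_n_product(X, M, 1).shape == (4, 7, 6)
```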

**Definition 4.** *The Frobenius norm of a tensor* X ∈ <sup>R</sup>*I*1×*I*2×···×*IN is represented as:*

$$\|\mathcal{X}\|=\sqrt{\sum_{i_1=1}^{I_1}\sum_{i_2=1}^{I_2}\cdots\sum_{i_N=1}^{I_N}x_{i_1,i_2,\cdots,i_N}^2}.\tag{5}$$

#### *2.2. Tensor Decomposition*

The general tensor decomposition models involve CP decomposition and Tucker decomposition. Specifically, the CP decomposition is a special case of the Tucker decomposition. Due to the special structure of tensors, these tensor decomposition methods are leveraged in hyperspectral image processing. For a tensor X ∈ <sup>R</sup>*I*1×*I*2×···×*IN* , the CP decomposition could be expressed as:

$$\mathcal{X}\approx\sum_{r=1}^{R}\lambda_r\,a_r^{(1)}\circ a_r^{(2)}\circ\cdots\circ a_r^{(N)}=\left[\lambda;\mathbf{A}^{(1)},\mathbf{A}^{(2)},\cdots,\mathbf{A}^{(N)}\right],\tag{6}$$

where "◦" is the outer product of the vectors, *<sup>R</sup>* is a positive integer, and **<sup>A</sup>**(*n*) is the factor matrix. For *<sup>n</sup>* <sup>=</sup> 1, 2, ··· , *<sup>N</sup>*, *<sup>λ</sup>* <sup>∈</sup> <sup>R</sup>*R*, *<sup>a</sup>* (*n*) *<sup>r</sup>* <sup>∈</sup> <sup>R</sup>*In* , **<sup>A</sup>**(*n*) <sup>∈</sup> <sup>R</sup>*In*×*R*, the factor matrix is a combination of the rank one vector *a* (*n*) *<sup>r</sup>* and denoted as:

$$\mathbf{A}^{(n)}=[a_1^{(n)},a_2^{(n)},\dots,a_R^{(n)}].\tag{7}$$

Let the third-order tensor X ∈ R*I*×*J*×*K* be a hyperspectral image; then, the CP decomposition can be formulated as:

$$\mathcal{X}\approx\sum_{r=1}^{R}\lambda_r\,a_r\circ b_r\circ c_r=[\lambda;\mathbf{A},\mathbf{B},\mathbf{C}],\tag{8}$$

where *I*, *J*, and *K* are the numbers of rows, columns, and spectral bands, respectively, while *r* = 1, 2, ··· , *R*, *λ* ∈ R*R*, *ar* ∈ R*I*, *br* ∈ R*J*, and *cr* ∈ R*K*.

Each column of the above factor matrices **A**, **B**, and **C** is normalized, and *λ<sup>r</sup>* is the weight. If there is no requirement to standardize the factor matrix, the CP decomposition can also be reformulated as:

$$\mathcal{X}\approx(\mathbf{A}',\mathbf{B}',\mathbf{C}'),\tag{9}$$

where **A**′, **B**′, and **C**′ denote the general factor matrices, obtained by absorbing the weights *λr* into the factor matrices **A**, **B**, and **C**.

The schematic diagram of the CP decomposition is shown in Figure 2. If *R* denotes the minimum number of outer products needed to express X, then the tensor rank is *R*, i.e., *rank*(X) = *R*, and the decomposition is known as a rank decomposition, which is a particular case of CP decomposition. At present, there is no general method to directly compute the rank of an arbitrary tensor; the problem has been proven to be NP-hard. Through the factor matrices, the CP decomposition of a third-order tensor can be written in unfolded form:

$$\begin{aligned} \mathbf{X}\_{(1)} &= \mathbf{A}^{'} (\mathbf{C}^{'} \odot \mathbf{B}^{'})^{\mathrm{T}}, \\ \mathbf{X}\_{(2)} &= \mathbf{B}^{'} (\mathbf{C}^{'} \odot \mathbf{A}^{'})^{\mathrm{T}}, \\ \mathbf{X}\_{(3)} &= \mathbf{C}^{'} (\mathbf{B}^{'} \odot \mathbf{A}^{'})^{\mathrm{T}}. \end{aligned} \tag{10}$$
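The first identity in Equation (10) can be checked directly: build a rank-*R* tensor from its factors and compare its one-mode unfolding (Fortran order, as in the sketch above) with **A**′(**C**′⊙**B**′)<sup>T</sup>; the toy sizes below are our own choice:

```python
import numpy as np
from scipy.linalg import khatri_rao

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(1)
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# a rank-R tensor as a sum of R outer products (Equation (8), weights absorbed)
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Equation (10): the one-mode unfolding equals A (C . B)^T
X1 = X.reshape(I, -1, order='F')           # column index j + k*J
assert np.allclose(X1, A @ khatri_rao(C, B).T)
```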

A third-order tensor can be denoted as follows by applying mode-n products:

$$\boldsymbol{\mathcal{X}}' = \boldsymbol{\mathcal{X}} \times\_1 \mathbf{D}\_1 \times\_2 \mathbf{D}\_2 \times\_3 \mathbf{D}\_3. \tag{11}$$

The above formula can be expressed in the form of factor matrices:

$$\mathcal{X}'\approx(\mathbf{D}_1\mathbf{A}',\mathbf{D}_2\mathbf{B}',\mathbf{D}_3\mathbf{C}').\tag{12}$$

**Figure 2.** Canonical polyadic (CP) decomposition of a third-order tensor.

**Theorem 1.** *Correspondingly, we consider how many rank-one tensors (components) of the CP model should be added to minimize the error. The usual practice is to start with R* = 1 *and increase R until a "good" result is encountered. Of course, with a strong application background and prior information, R can also be specified in advance. For a given number of components, there is still no universal solution for CP decomposition. Specifically, the alternating least squares (ALS) algorithm is a popular method when the number of components is pre-given [46]. For the CP decomposition of tensors, even if R is much larger than max*{*I*, *J*, *K*}*, the CP decomposition model is essentially unique. The low-rank decomposition of matrices and the Tucker decomposition of tensors are generally not unique, which is a significant difference between them. The most famous result about the uniqueness of tensor decomposition is due to Kruskal [47]. One consequence of the Kruskal criterion, which applies to generic tensors, is the following statement, providing a uniqueness guarantee for the CP decomposition model.*

*Suppose* X = [[**A**, **B**, **C**]]*, where the tensor* X ∈ R*I*×*J*×*K has rank R. Then, the decomposition is unique if:*

$$R\le\frac{1}{2}\left[\min(I,R)+\min(J,R)+\min(K,R)-2\right],$$

*where* **A** ∈ R*I*×*R*, **B** ∈ R*J*×*R*, and **C** ∈ R*K*×*R*.

The uniqueness condition of the tensor CP decomposition model is relatively relaxed compared with that of matrix decomposition, since the rank of a matrix decomposition must be lower than the matrix dimensions and uniqueness further requires nonnegativity, sparsity, or geometric conditions. Certainly, we can also judge whether the CP decomposition model is unique according to the rank of the given tensor.

#### **3. Problem Formulation**

The purpose of HSI and MSI fusion is to estimate the unobservable SRI (S ∈ R*I*×*J*×*K*) from the observable low-spatial resolution HSI (H ∈ R*i*×*j*×*K*) and the high-spatial resolution MSI (M ∈ R*I*×*J*×*k*), where *I*(*i*) and *J*(*j*) denote the spatial dimensions and *K*(*k*) denotes the number of spectral bands. The tensor H is spatially downsampled with respect to (w.r.t.) S, that is, *I* > *i* and *J* > *j*, while the tensor M is spectrally downsampled w.r.t. S, that is, *K* > *k*. We assume that the two observed data sets are obtained under the same atmospheric and illumination conditions and are geometrically co-registered and radiometrically corrected.

#### *3.1. Image Fusion Based on Matrix Decomposition*

The fusion method based on matrix factorization assumes that each spectral vector of the target SRI can be written as a linear combination of a small number of distinct spectral signatures [48], which can be represented as:

$$\mathbf{S}_{(3)}=\mathbf{W}\mathbf{H},\tag{13}$$

where **S**(3) ∈ R*K*×*IJ* is the three-mode unfolding matrix of the tensor S, and the matrices **W** ∈ R*K*×*R* and **H** ∈ R*R*×*IJ* represent the spectral basis and the corresponding coefficient matrix, respectively, where *R* ≪ min{*IJ*, *K*}.

The spatial domain of the low-spatial resolution hyperspectral data is a degraded version of the spatial domain of the multispectral data. On the other hand, the multispectral data are a spectrally degraded form of the high-spatial resolution hyperspectral data. Therefore, H and M are modeled as:

$$\mathbf{H}_{(3)}=\mathbf{W}\mathbf{H}_h,\qquad\mathbf{M}_{(3)}=\mathbf{W}_m\mathbf{H},\tag{14}$$

where **H***h* = **HDH**, **W***m* = **DMW**, and **H**(3) ∈ R*K*×*ij* and **M**(3) ∈ R*k*×*IJ* are the three-mode unfolding matrices of the HSI (tensor H) and the MSI (tensor M), respectively. **DH** ∈ R*IJ*×*ij* is a matrix modeling the point spread function (PSF) and the spatial subsampling process of the hyperspectral sensor. **DM** ∈ R*k*×*K* is a matrix modeling the spectral downsampling of the multispectral sensor, whose rows contain the spectral responses of the multispectral bands. Therefore, the matricized HSI and MSI are modeled as:

$$\mathbf{H}_{(3)}=\mathbf{W}\mathbf{H}\mathbf{D}_{\mathbf{H}},\qquad\mathbf{M}_{(3)}=\mathbf{D}_{\mathbf{M}}\mathbf{W}\mathbf{H}.\tag{15}$$

In the matrix decomposition based fusion approaches, if the spectral basis **W** and the coefficient matrix **H** can be estimated by jointly factorizing **H**(3) and **M**(3), the SRI can be restored according to Equation (13); this is the main idea of the matrix decomposition based methods.
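To make this idea concrete, the following schematic sketch estimates a spectral basis from the HSI, fits the coefficients from the MSI, and recombines them (our own simplification: a truncated SVD stands in for the nonnegative factorizations used by methods such as CNMF, and the spectral response **DM** is assumed known with *R* ≤ *k*):

```python
import numpy as np

def matrix_fusion(H3, M3, Dm, R):
    """Schematic of Equations (13)-(15).
    H3: K x ij HSI unfolding, M3: k x IJ MSI unfolding,
    Dm: k x K spectral response, R: number of endmembers."""
    U, _, _ = np.linalg.svd(H3, full_matrices=False)
    W = U[:, :R]                                    # spectral basis, K x R
    Wm = Dm @ W                                     # degraded basis, k x R
    Hcoef = np.linalg.lstsq(Wm, M3, rcond=None)[0]  # coefficients, R x IJ
    return W @ Hcoef                                # reconstructed S_(3), K x IJ
```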

#### *3.2. Image Fusion Based on Tensor Decomposition*

Matrix based methods usually assume that the degradation operators are known or easily estimated, but in practice, they are difficult to determine. By comparing the spectral properties of the hyperspectral and multispectral sensors, the spectral degradation operator **DM** can be modeled and estimated relatively easily. However, estimating the spatial operator is more difficult. A common model assumption for the conversion from the SRI to the HSI is a combination of blurring by a Gaussian kernel and a downsampling process. Of course, this is a rough approximation and may be far from accurate; even if this assumption is approximately correct, many uncertainties remain.

In order to address the non-uniqueness of matrix decomposition, and under the condition of little knowledge of the degenerate operators and noise, we propose in this section a method based on joint tensor decomposition to fuse the HSI and MSI. Tensor based models have many advantages; for example, it is efficient to represent image data as tensors and feed them directly into the fusion model, and the output can be saved in whatever format is convenient. Formally, we represent the SRI via CP decomposition as:

$$\mathcal{S} = \left[ \mathbf{A}, \mathbf{B}, \mathbf{C} \right] \tag{16}$$

where S ∈ R*I*×*J*×*K*, **A** ∈ R*I*×*R*, **B** ∈ R*J*×*R*, **C** ∈ R*K*×*R*, and *R* is the number of components.

The HSI is a spatially downsampled version of the SRI. Assuming that the point spread function (PSF) of the hyperspectral sensor is separable into the downsampling matrices of the width mode and the height mode, we have:

$$\mathcal{H} = \mathcal{S} \times\_1 \mathbf{D}\_1 \times\_2 \mathbf{D}\_2 \tag{17}$$

where **D**1 ∈ R*i*×*I* and **D**2 ∈ R*j*×*J* are the spatial degradation operators along the width and height modes, respectively. For subsampling, the separability hypothesis implies that the spatial subsampling matrix **DH** decouples across the two spatial modes of S, so that **DH** = **D**2 ⊗ **D**1 in the matricized form. Under the separability assumption, the HSI (H) can be represented as:

$$\mathcal{H} = \left[ \mathbf{A}', \mathbf{B}', \mathbf{C} \right] \tag{18}$$

where **A**′ = **D**1**A** ∈ R*i*×*R*, **B**′ = **D**2**B** ∈ R*j*×*R*, and **C** ∈ R*K*×*R*. In this paper, we assume that the spectral response **DM** is contaminated by noise, i.e., the sampling in the conversion from the SRI to the MSI is imprecise. Formally, we represent it as:

$$\mathbf{D}'_{\mathbf{M}}=\mathbf{D}_{\mathbf{M}}+\Gamma,\tag{19}$$

where Γ is Gaussian random noise. Analogously, the MSI (M) can be represented as:

$$\mathcal{M}=\mathcal{S}\times_3\mathbf{D}'_{\mathbf{M}},\tag{20}$$

where **D**′**M** ∈ R*k*×*K* is the downsampling matrix of the spectral mode. Substituting Formula (16) into (20), we obtain:

$$\mathcal{M} = \left[ \mathbf{A}, \mathbf{B}, \mathbf{C}^{'} \right] \tag{21}$$

where **A** ∈ R*I*×*R*, **B** ∈ R*J*×*R*, and **C**′ = **D**′**MC** ∈ R*k*×*R*. In order to reconstruct the SRI, we need to estimate the factor matrices **A**, **B**, and **C**.
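Under these assumptions, the observation model of Equations (17)–(20) can be simulated directly with mode-n products, here written as einsum contractions (toy sizes and random operators, for illustration only):

```python
import numpy as np

S = np.random.rand(16, 16, 10)                         # toy SRI, I x J x K
D1, D2 = np.random.rand(4, 16), np.random.rand(4, 16)  # i x I and j x J spatial operators
Dm = np.random.rand(3, 10)                             # k x K spectral response
Dm_noisy = Dm + 0.05 * np.random.randn(3, 10)          # Equation (19): D'_M = D_M + Gamma

# Equation (17): H = S x_1 D1 x_2 D2 (contract modes 1 and 2)
H = np.einsum('ai,bj,ijk->abk', D1, D2, S)             # HSI, i x j x K
# Equation (20): M = S x_3 D'_M (contract the spectral mode)
M = np.einsum('ck,ijk->ijc', Dm_noisy, S)              # MSI, I x J x k
```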

#### **4. The Joint Tensor Decomposition Method**

#### *The Joint Tensor Decomposition Method*

In this section, we consider the case where **DM** contains noise and the spatial degradation operator **DH** = **D2** ⊗ **D1** is completely unknown: even though this type of operation is commonly modeled as a combination of blurring and downsampling, in practice, hyperparameters such as the blurring kernel type, kernel size, and downsampling offset are rarely known. Therefore, the joint tensor decomposition model can be generalized to the following model:

$$\min\_{\mathbf{A},\mathbf{B},\mathbf{C}} \parallel \mathcal{H} - (\mathbf{A}^{'}, \mathbf{B}^{'}, \mathbf{C}) \parallel\_{\mathbf{F}}^{2} + \parallel \mathcal{M} - (\mathbf{A}, \mathbf{B}, \mathbf{C}^{'}) \parallel\_{\mathbf{F}}^{2} + \beta \parallel \mathbf{C}^{'} - \mathbf{D}\_{\mathbf{M}}^{'} \mathbf{C} \parallel\_{\mathbf{F}}^{2} \tag{22}$$

We use the above optimization model to obtain the factor matrices **A**, **B**, and **C**, where *β* is the regularization parameter. The optimization problem is non-convex, and the solutions for the factor matrices **A**, **B**, and **C** are not unique. However, the objective function in (22) is convex in each block of variables when the other variables are held fixed. Therefore, we adopt the proximal alternating optimization (PAO) scheme to solve the problem, which guarantees convergence to a critical point under certain conditions. Each iterative update of a factor matrix then reduces to solving a tractable Sylvester equation obtained by matricizing the HSI and MSI tensors. Specifically, the **A**, **B**, and **C** iterations are updated as follows:

• Optimization with respect to **C**:

When **A**, **B**, **A**′, **B**′, and **C**′ are fixed, the optimization w.r.t. **C** in (22) can be written as:

$$\min_{\mathbf{C}}\;\|\mathcal{H}-(\mathbf{A}',\mathbf{B}',\mathbf{C})\|_{\mathrm{F}}^{2}+\|\mathcal{M}-(\mathbf{A},\mathbf{B},\mathbf{C}')\|_{\mathrm{F}}^{2}+\beta\,\|\mathbf{C}'-\mathbf{D}'_{\mathbf{M}}\mathbf{C}\|_{\mathrm{F}}^{2}.$$

The above optimization problem can be transformed into the following one by using the properties of n-mode matrix unfolding.

$$\min_{\mathbf{C}}\;\|\mathbf{H}_{(3)}-\mathbf{C}(\mathbf{B}'\odot\mathbf{A}')^{\mathrm{T}}\|_{\mathrm{F}}^{2}+\beta\,\|\mathbf{C}'-\mathbf{D}'_{\mathbf{M}}\mathbf{C}\|_{\mathrm{F}}^{2},\tag{23}$$

where **H**(3) is the three-mode unfolding matrix of the tensor H. The optimization problem (23) is quadratic, and its unique solution is given by the solution of a general Sylvester matrix equation:

$$\beta\mathbf{D}'^{\mathrm{T}}_{\mathbf{M}}\mathbf{D}'_{\mathbf{M}}\mathbf{C}+\mathbf{C}\mathbf{E}-\beta\mathbf{D}'^{\mathrm{T}}_{\mathbf{M}}\mathbf{C}'=\mathbf{H}_{(3)}(\mathbf{B}'\odot\mathbf{A}'),\tag{24}$$

where **E** = (**B**′<sup>T</sup>**B**′) ∗ (**A**′<sup>T</sup>**A**′).

We use the Sylvester function in the MATLAB toolbox to solve the above equation.
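In Python, the same step can be carried out with scipy.linalg.solve_sylvester, which solves AX + XB = Q; the sketch below follows Equation (24) with the Fortran-order unfolding convention assumed earlier (function and variable names are ours):

```python
import numpy as np
from scipy.linalg import khatri_rao, solve_sylvester

def update_C(H3, Ap, Bp, Cp, Dm_noisy, beta=1.0):
    """C-step of Equation (24):
    beta * D'^T D' C + C E = H3 (B' . A') + beta * D'^T C',
    where H3 is the K x ij three-mode unfolding of the HSI."""
    E = (Bp.T @ Bp) * (Ap.T @ Ap)           # R x R Gram matrix (Hadamard product)
    lhs = beta * Dm_noisy.T @ Dm_noisy      # K x K
    rhs = H3 @ khatri_rao(Bp, Ap) + beta * Dm_noisy.T @ Cp
    return solve_sylvester(lhs, E, rhs)     # solves lhs @ C + C @ E = rhs
```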

• Optimization with respect to **A**′: When **A**, **B**, **C**, **B**′, and **C**′ are fixed, the optimization w.r.t. **A**′ in (22) can be written as:

$$\min_{\mathbf{A}'}\;\|\mathcal{H}-(\mathbf{A}',\mathbf{B}',\mathbf{C})\|_{\mathrm{F}}^{2}.\tag{25}$$

The above optimization problem can be transformed into the following one by using the properties of n-mode matrix unfolding.

$$\min_{\mathbf{A}'}\;\|\mathbf{H}_{(1)}-\mathbf{A}'(\mathbf{C}\odot\mathbf{B}')^{\mathrm{T}}\|_{\mathrm{F}}^{2},\tag{26}$$

where **H**(1) is the one-mode unfolding matrix of the tensor H. The optimization problem (26) is convex, and the optimal solution is given by:

$$\mathbf{A}^{'} = \mathbf{H}\_{(1)}[(\mathbf{C} \odot \mathbf{B}^{'})^{T}]^{\dagger},\tag{27}$$

According to the property of the Khatri–Rao product pseudo-inverse, we can rewrite the solution as:

$$\mathbf{A}^{'} = \mathbf{H}\_{(1)}(\mathbf{C}\odot\mathbf{B}^{'})(\mathbf{C}^{T}\mathbf{C}\*\mathbf{B}^{'T}\mathbf{B}^{'})^{\dagger},\tag{28}$$

The advantage of this form is that we only need to compute the pseudo-inverse of an *R* × *R* matrix rather than that of a *jK* × *R* matrix. The factor matrix **B**′ is solved in the same way as **A**′, and we can write the solution as:

$$\mathbf{B}^{'} = \mathbf{H}\_{(2)}(\mathbf{C}\odot\mathbf{A}^{'})(\mathbf{C}^{T}\mathbf{C}\*\mathbf{A}^{'T}\mathbf{A}^{'})^{\dagger}.\tag{29}$$
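Once the *R* × *R* Gram matrix is formed, the closed-form updates of Equations (28) and (29) are one-liners; a sketch (the helper name is ours):

```python
import numpy as np
from scipy.linalg import khatri_rao

def ls_factor(Xn, P, Q):
    """Equations (28)/(29): Xn (P . Q)((P^T P) * (Q^T Q))^+, needing only an
    R x R pseudo-inverse; Xn is the relevant mode unfolding."""
    G = (P.T @ P) * (Q.T @ Q)               # R x R Gram matrix
    return Xn @ khatri_rao(P, Q) @ np.linalg.pinv(G)

# A' = ls_factor(H_(1), C, B'), and B' = ls_factor(H_(2), C, A')
```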

• Optimization with respect to **C**′:

When **A**, **B**, **C**, **A**′, and **B**′ are fixed, the optimization w.r.t. **C**′ in (22) can be written as follows by using the properties of n-mode matrix unfolding.

$$\min_{\mathbf{C}'}\;\|\mathbf{M}_{(3)}-\mathbf{C}'(\mathbf{B}\odot\mathbf{A})^{\mathrm{T}}\|_{\mathrm{F}}^{2}+\beta\,\|\mathbf{C}'-\mathbf{D}'_{\mathbf{M}}\mathbf{C}\|_{\mathrm{F}}^{2},\tag{30}$$

where **M**(3) is the three-mode unfolding matrix of the tensor M. The optimization problem (30) is quadratic, and its unique solution is given by the solution of a general Sylvester matrix equation:

$$\beta\mathbf{I}\mathbf{C}'+\mathbf{C}'\mathbf{F}-\beta\mathbf{D}'_{\mathbf{M}}\mathbf{C}=\mathbf{M}_{(3)}(\mathbf{B}\odot\mathbf{A}),\tag{31}$$

where **F** = (**B**<sup>T</sup>**B**) ∗ (**A**<sup>T</sup>**A**), and **I** is the *k* × *k* identity matrix (here 4 × 4).

We use the Sylvester function in the MATLAB toolbox to solve the above equation.

• Optimization with respect to **A**:

When **B**, **C**, **A**′, **B**′, and **C**′ are fixed, the optimization w.r.t. **A** in (22) can be written as:

$$\min_{\mathbf{A}}\;\|\mathcal{M}-(\mathbf{A},\mathbf{B},\mathbf{C}')\|_{\mathrm{F}}^{2}.\tag{32}$$

The above optimization problem can be transformed into the following one by using the properties of n-mode matrix unfolding.

$$\min_{\mathbf{A}}\;\|\mathbf{M}_{(1)}-\mathbf{A}(\mathbf{C}'\odot\mathbf{B})^{\mathrm{T}}\|_{\mathrm{F}}^{2},\tag{33}$$

where **M**(1) is the one-mode unfolding matrix of the tensor M. The optimization problem (33) is convex, and the optimal solution is given by:

$$\mathbf{A} = \mathbf{M}\_{\text{(1)}} [(\mathbf{C}^{'} \odot \mathbf{B})^{T}]^{\dagger},\tag{34}$$

According to the property of the Khatri–Rao product pseudo-inverse, we can rewrite the solution as:

$$\mathbf{A}=\mathbf{M}_{(1)}(\mathbf{C}'\odot\mathbf{B})(\mathbf{C}'^{\mathrm{T}}\mathbf{C}'\ast\mathbf{B}^{\mathrm{T}}\mathbf{B})^{\dagger},\tag{35}$$

Similarly, we only need to compute the pseudo-inverse of an *R* × *R* matrix rather than that of a *Jk* × *R* matrix. The factor matrix **B** is solved in the same way as **A**, and we can write the solution as:

$$\mathbf{B} = \mathbf{M}\_{\text{(2)}} (\mathbf{C}^{\prime} \odot \mathbf{A}) (\mathbf{C}^{\prime \top} \mathbf{C}^{\prime} \ast \mathbf{A}^{\text{T}} \mathbf{A})^{\dagger},\tag{36}$$

Each iteration update has been discussed in detail above, and the overall procedure is summarized in Algorithm 1. After obtaining the estimates of **A**, **B**, and **C**, the super-resolution tensor is reconstructed from the following formula:

$$\mathcal{S} \approx (\mathbf{A}, \mathbf{B}, \mathbf{C}). \tag{37}$$

The detailed steps of the proposed method are given in Algorithm 1.
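As a compact illustration of Algorithm 1's alternating structure, the following self-contained sketch chains the updates of Equations (24)–(36) into one loop; the random initialization, the fixed iteration count, and the Fortran-order unfoldings are our own assumptions rather than the authors' exact procedure:

```python
import numpy as np
from scipy.linalg import khatri_rao, solve_sylvester

def jtf_sketch(H, M, Dm_noisy, R, beta=1.0, iters=5, seed=0):
    """H: i x j x K HSI, M: I x J x k MSI, Dm_noisy: noisy k x K response D'_M."""
    unf = lambda X, n: np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')
    ls = lambda Xn, P, Q: Xn @ khatri_rao(P, Q) @ np.linalg.pinv((P.T @ P) * (Q.T @ Q))
    rng = np.random.default_rng(seed)
    (i, j, K), (I, J, k) = H.shape, M.shape
    A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
    Ap, Bp, Cp = rng.random((i, R)), rng.random((j, R)), rng.random((k, R))
    for _ in range(iters):
        E = (Bp.T @ Bp) * (Ap.T @ Ap)
        C = solve_sylvester(beta * Dm_noisy.T @ Dm_noisy, E,               # Eq. (24)
                            unf(H, 2) @ khatri_rao(Bp, Ap) + beta * Dm_noisy.T @ Cp)
        Ap = ls(unf(H, 0), C, Bp)                                          # Eq. (28)
        Bp = ls(unf(H, 1), C, Ap)                                          # Eq. (29)
        F = (B.T @ B) * (A.T @ A)
        Cp = solve_sylvester(beta * np.eye(k), F,                          # Eq. (31)
                             unf(M, 2) @ khatri_rao(B, A) + beta * Dm_noisy @ C)
        A = ls(unf(M, 0), Cp, B)                                           # Eq. (35)
        B = ls(unf(M, 1), Cp, A)                                           # Eq. (36)
    return np.einsum('ir,jr,kr->ijk', A, B, C)                             # Eq. (37)
```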

#### **5. Experiments and Results**

#### *5.1. Experimental Data*

To obtain an MSI from an SRI, we used the spectral specifications of a multispectral sensor, taken from the QuickBird sensor in our experiments [49]. The QuickBird sensor produces a four-band MSI in the following spectral bands: blue (430–545 nm), green (466–620 nm), red (590–710 nm), and near-infrared (715–918 nm). The spectral response matrix **DM** is formed by comparing the 400–2500 nm range of the experimental SRI with the multispectral sensor bands, and the MSI is obtained by assuming random Gaussian noise in **DM**. More precisely, **DM** is a selective averaging matrix acting on the common wavelengths of the SRI and MSI; the experimental data came from [44]. The data selected in this paper were taken from Pavia University in Italy and were captured by the ROSIS sensor. The SRI, HSI, and MSI have sizes of 608 × 336 × 103, 152 × 84 × 103, and 608 × 336 × 4, respectively. Specifically, the MSI is generated for the Pavia University image by QuickBird simulation according to the QuickBird specification, while the HSI is generated from the SRI by a 9 × 9 Gaussian blur and downsampling. The degradation process from the SRI to the HSI is thus a combination of spatial blurring with a 9 × 9 Gaussian kernel and downsampling by a factor of *D* = 4 along both spatial directions.

The simulations were carried out in MATLAB on a Windows server with a 3.6 GHz CPU and 8 GB RAM. In the **JTF** algorithm, the factors **A**, **B**, and **C** are obtained through the joint tensor decomposition of the MSI and HSI, which mainly solves least squares problems and preliminarily estimates the latent factors; the CP decomposition is computed by TensorLab [50]. The maximum number of iterations for the tensor decomposition was set to 25 in the initialization, while the number of iteration updates for the factor matrices was determined by continued numerical simulation. In this paper, we fixed *β* = 1, mainly following [44], which showed that the performance of super-resolution tensor reconstruction is best when *β* equals one; since the proposed model adds a noise term to the objective function of [44], we selected a similar parameter.

To further demonstrate the performance of our proposed algorithm, this method is compared with the following five HSI-MSI fusion methods: **Blind Stereo** [44], **CNMF** (coupled nonnegative matrix factorization) [29], **SFIM** (smoothing filter based intensity modulation) [51], **MTF-GLP** (modulation transfer function based generalized Laplacian pyramid) [19], and **MAPSMM** (maximum a posterior estimation with a stochastic mixing model) [24].

#### *5.2. Evaluation Criteria*

In order to evaluate the quality of reconstructed high-spatial resolution HSIs, we introduce several intuitive evaluation indicators. The first index is the reconstruction signal-to-noise ratio (R-SNR) criterion defined as:

$$\text{R-SNR}=10\log_{10}\left(\frac{\sum_{k=1}^{K}\|\mathbf{S}_k\|_F^2}{\sum_{k=1}^{K}\|\mathbf{S}'_k-\mathbf{S}_k\|_F^2}\right).\tag{38}$$

where **S**′*k* and **S***k* are the *k*th frontal slices of the reconstructed SRI and the ground-truth SRI, respectively. The higher the R-SNR, the better the reconstruction quality.

The second index is the root mean squared error (RMSE), i.e.,

$$\text{RMSE}=\sqrt{\frac{\|\mathcal{S}'-\mathcal{S}\|_F^2}{WHB}}.\tag{39}$$

where S′ and S are the reconstructed SRI and the ground-truth SRI, *B* is the number of bands of the hyperspectral image, and *W* and *H* are the spatial dimensions of the spectral images. Low RMSE values indicate good reconstruction performance.

The third index is the spectral angle mapper (SAM), which is defined as:

$$\text{SAM}=\frac{1}{IJ}\sum_{n=1}^{IJ}\arccos\left(\frac{\mathbf{S}_{(3)}(:,n)^{\mathrm{T}}\,\mathbf{S}'_{(3)}(:,n)}{\|\mathbf{S}_{(3)}(:,n)\|_2\,\|\mathbf{S}'_{(3)}(:,n)\|_2}\right).\tag{40}$$

where **S**′(3)(:, *n*) and **S**(3)(:, *n*) are the spectral fibers (columns of the three-mode unfolding) of the reconstructed and the ground-truth SRI, respectively. SAM measures the angles between the reconstructed and the ground-truth spectra of the SRI, and a small SAM is equivalent to good performance.

The fourth index is the relative dimensionless global error in synthesis (ERGAS), which is represented as:

$$\text{ERGAS} = 100c \sqrt{\frac{1}{IJK} \sum\_{k=1}^{K} \frac{||\mathbf{S}\_{\mathbf{k}}^{\prime} - \mathbf{S}\_{\mathbf{k}}||\_F^2}{\mu\_k^2}}. \tag{41}$$

where *c* = *I*/*i* = *J*/*j* is the spatial downsampling factor and *μk* is the mean value of the elements of **Sk**. After image reconstruction, we hope to obtain a small ERGAS.

The fifth index is the universal image quality index (UIQI), which is defined as:

$$\text{UIQI}=\frac{1}{B}\sum_{i=1}^{B}\text{UIQI}(\mathbf{S}'^{i},\mathbf{S}^{i}).\tag{42}$$

where:

$$\text{UIQI}(\mathbf{S}'^{i}, \mathbf{S}^{i}) = \frac{1}{P} \sum_{j=1}^{P} \frac{\sigma_{\mathbf{s}'^{i}_{j} \mathbf{s}^{i}_{j}}}{\sigma_{\mathbf{s}'^{i}_{j}} \sigma_{\mathbf{s}^{i}_{j}}} \cdot \frac{2 \mu_{\mathbf{s}'^{i}_{j}} \mu_{\mathbf{s}^{i}_{j}}}{\mu_{\mathbf{s}'^{i}_{j}}^2 + \mu_{\mathbf{s}^{i}_{j}}^2} \cdot \frac{2 \sigma_{\mathbf{s}'^{i}_{j}} \sigma_{\mathbf{s}^{i}_{j}}}{\sigma_{\mathbf{s}'^{i}_{j}}^2 + \sigma_{\mathbf{s}^{i}_{j}}^2}, \tag{43}$$

where $\mathbf{s}^{i}_{j}$ and $\mathbf{s}'^{i}_{j}$ denote the $j$th window of the $i$th band of the ground-truth image and the reconstructed image, respectively, and $P$ represents the number of window positions. $\sigma_{\mathbf{s}'^{i}_{j} \mathbf{s}^{i}_{j}}$ is the sample covariance between $\mathbf{s}^{i}_{j}$ and $\mathbf{s}'^{i}_{j}$, and $\mu$ and $\sigma$ denote the mean value and standard deviation of the corresponding window. The range of the index is [−1, 1]; the larger the value of UIQI, the better the fusion effect.

The sixth index is the normalized mean squared error (NMSE), which is represented as:

$$\text{NMSE} = \frac{\|\mathbf{S}'_{(3)} - \mathbf{S}_{(3)}\|_F}{\|\mathbf{S}_{(3)}\|_F}, \tag{44}$$

where $\mathbf{S}'_{(3)}$ and $\mathbf{S}_{(3)}$ are the mode-3 unfolding matrices of the reconstructed and ground-truth SRI. The smaller the NMSE, the closer the fused image is to the ground truth.

In addition to the above indicators, the simplest performance indicator is the running time of the algorithm. In this paper, the efficiency of the reconstruction algorithms is compared through their computational time.
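To make these definitions concrete, the following is a minimal NumPy sketch of four of the indicators, assuming the reconstructed and ground-truth SRIs are stored as arrays of shape (I, J, K) with the spectral bands along the last axis; the function names and array layout are our own illustrative conventions, not the authors' MATLAB code.

```python
import numpy as np

def r_snr(S_rec, S_ref):
    # Eq. (38): ratio of signal energy to reconstruction-error energy, in dB.
    return 10.0 * np.log10(np.sum(S_ref ** 2) / np.sum((S_rec - S_ref) ** 2))

def rmse(S_rec, S_ref):
    # Eq. (39): Frobenius error normalized by the total number of entries W*H*B.
    return np.sqrt(np.sum((S_rec - S_ref) ** 2) / S_ref.size)

def sam(S_rec, S_ref):
    # Eq. (40): mean angle between corresponding pixel spectra of the
    # mode-3 unfoldings (one row per pixel, one column per band).
    A = S_ref.reshape(-1, S_ref.shape[-1])
    B = S_rec.reshape(-1, S_rec.shape[-1])
    cos = np.sum(A * B, axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))

def ergas(S_rec, S_ref, c):
    # Eq. (41): per-band MSE relative to the squared band mean, averaged
    # over bands; c is the spatial downsampling factor.
    band_mse = np.mean((S_rec - S_ref) ** 2, axis=(0, 1))
    band_mu2 = np.mean(S_ref, axis=(0, 1)) ** 2
    return 100.0 * c * np.sqrt(np.mean(band_mse / band_mu2))
```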

#### *5.3. Selection of Parameters*

In this section, we vary the number of iterations and the number of components of the CP decomposition and run experiments under these settings to find the best parameter values and to evaluate the sensitivity of the **JTF** algorithm to the important parameters of the model. Because the algorithm in this paper assumes that **DH** is unknown, we first experiment under the condition that the algorithm incorrectly assumes a 7 × 7 Gaussian blur kernel instead of the correct 9 × 9 Gaussian kernel. We have also made experimental comparisons under the correct Gaussian kernel.

We consider the **JTF** algorithm under different signal-to-noise ratios and assume that the SNR of the HSI and the MSI is the same; here, the SNR is set to 20 dB. To evaluate the effect of the number of iterations **Iter** on image fusion, we run the **JTF** algorithm for varying **Iter**. Because the **JTF** algorithm is a modification of the **Blind STEREO** algorithm [44], we compare the performance of the algorithms with and without noise in the degenerate operator. Figure 3 shows the evaluation metrics of the SRI after the reconstruction of Pavia University as the iteration number **Iter** changes. To reduce the running time of the algorithm, without loss of generality, we set *R* = 100 and *β* = 1. The black line represents the performance of the **Blind STEREO** algorithm with noise in the matrix **DM**, the red line shows the reconstruction results of the **JTF** algorithm under the same condition, and the blue line indicates the fusion performance of the **Blind STEREO** algorithm when **DM** does not contain noise.

As can be seen from Figure 3, as **Iter** changes from one to 20, the R-SNR of Pavia University decreases, while the values of NMSE, ERGAS, and SAM increase: R-SNR declines sharply at first and then rises slightly, while the other evaluation metrics show the opposite trend. When the number of iterations is less than five, the reconstruction effect is better. Therefore, the maximum number of iterations of the **JTF** algorithm is set between one and five. In addition, the reconstruction effect of this algorithm is always superior to that of the **Blind STEREO** fusion algorithm with noise.

Then, we change the number of components from *R* = 50 to *R* = 600 to observe the effect of the number of tensor decomposition components on image fusion; Figure 4 depicts the different evaluation metrics of the recovered HSIs for Pavia University.

**Figure 3.** The results of the evaluation criterion as functions of the number of iterations **Iter** for the proposed joint tensor decomposition (JTF) method. SAM, spectral angle mapper; ERGAS, relative dimensionless global error in synthesis; R-SNR, reconstruction signal-to-noise ratio.

Figure 4 shows the effect of image fusion in three cases with different numbers of components, where **DH** is unknown and **DM** contains noise. From the six evaluation metrics, it can be seen that when the number of components is less than 100, the performance of the proposed algorithm is almost the same as in the case where **DM** is clean; it fully achieves the denoising effect and is always better than the **Blind STEREO** algorithm in the same situation. As the number of components increases, all three cases show good performance. However, when the number of components grows beyond 300, the reconstruction effect shows a downward trend. According to Theorem 1 [47], this is because when the number of components does not satisfy Theorem 1, the algorithm cannot guarantee the uniqueness of the CP decomposition, which affects the initialization of the factor matrices and results in poor performance.

Next, consider the selection range of the number of components under the condition that the tensor decomposition is unique. As the **JTF** algorithm only applies the CP decomposition to the MSI, we set *I* = 608, *J* = 336, and *K* = 4 in Theorem 1. According to the dimensions of the MSI, the selection of the parameter *R* can be divided into the following four cases:

(1) When *R* < *K* = 4, substituting *R* into Proposition 1 gives

$$R \le \frac{1}{2}(R + R + R - 2) \Rightarrow R \ge 2. \tag{45}$$

Combining this bound with the case condition gives the range of *R* in the first case: 2 ≤ *R* < 4.

(2) When 4 = *K* ≤ *R* < *J* = 336, substituting *R* into Proposition 1 gives

$$R \le \frac{1}{2}(R + R + 4 - 2) \Rightarrow R \le R + 1. \tag{46}$$

This inequality always holds, so only the case condition constrains *R*; the range of *R* in the second case is 4 ≤ *R* < 336.

(3) When *J* = 336 ≤ *R* < *I* = 608, substituting *R* into Proposition 1 gives

$$R \le \frac{1}{2}(R + 336 + 4 - 2) \Rightarrow R \le 338. \tag{47}$$

Combining this bound with the case condition gives the range of *R* in the third case: 336 ≤ *R* ≤ 338.

(4) When *R* ≥ *I* = 608, substituting *R* into Proposition 1 gives

$$R \le \frac{1}{2}(608 + 336 + 4 - 2) \Rightarrow R \le 473. \tag{48}$$

In this case, the deduced bound *R* ≤ 473 contradicts the case condition *R* ≥ 608, so this case yields no admissible *R*.

To sum up, combining the above four cases, in order to guarantee the uniqueness of the CP decomposition, the range of the number of components is 2 ≤ *R* ≤ 338. In this paper, according to the fusion effect of the algorithm while ensuring the uniqueness of the CP decomposition, we fixed *R* = 275.
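Under the assumption that Proposition 1 is the Kruskal-type sufficient condition $2R \le \min(I,R) + \min(J,R) + \min(K,R) - 2$, the four-case analysis above can be checked numerically with a few lines of Python:

```python
def cp_uniqueness_range(I, J, K, r_max=1000):
    # Kruskal-type sufficient condition for CP uniqueness (Proposition 1):
    # 2R <= min(I, R) + min(J, R) + min(K, R) - 2.
    ok = [R for R in range(1, r_max + 1)
          if 2 * R <= min(I, R) + min(J, R) + min(K, R) - 2]
    return min(ok), max(ok)

# For the 608 x 336 x 4 MSI of Pavia University:
print(cp_uniqueness_range(608, 336, 4))  # (2, 338)
```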

**Figure 4.** The results of evaluation metrics as functions of the number of components *R* for the proposed JTF method. UIQI, universal image quality index.

#### *5.4. Experimental Results*

To further investigate the performance of the method, we conduct experiments under incorrect Gaussian kernels (3 × 3, 5 × 5, 7 × 7) and the correct Gaussian kernel (9 × 9) and show the fusion effect of the six test methods on Pavia University. Table 1 shows the R-SNR, NMSE, RMSE, ERGAS, SAM, and UIQI of the HSI recovered from Pavia University; the best result among the six algorithms is shown in bold. As can be seen from the table, when the Gaussian kernel is estimated incorrectly, the fusion effect of all algorithms except **JTF** is worse than under the correct Gaussian kernel, and the closer the assumed kernel is to the correct one, the better these five algorithms perform. Nevertheless, the **JTF** method performs best among the compared methods in terms of reconstruction accuracy whether the Gaussian kernel is correctly estimated or not. Overall, the **JTF** and **CNMF** methods are very effective in the reconstruction of Pavia University. Moreover, the proposed **JTF** algorithm does not suffer a degraded reconstruction effect due to incorrect estimation of the Gaussian kernel; its performance remains strong under the hypothetical Gaussian kernel, which demonstrates that the **JTF** algorithm generalizes better and has wider application prospects.


**Table 1.** Quantitative results of the test methods on Pavia University under the different Gaussian kernels. CNMF, coupled nonnegative matrix factorization; SFIM, smoothing filter based intensity modulation; MTF-GLP, modulation transfer function based generalized Laplacian pyramid; MAPSMM, maximum a posterior estimation with a stochastic mixing model.

Figure 5 shows the fusion results for Pavia University under the incorrect Gaussian kernel (7 × 7), containing the fused images of the 50th and 100th bands and the corresponding error images reconstructed by the six algorithms. The first and second rows in Figure 5 show the fused HSIs of the 50th band and the corresponding error images of each method, respectively; Figure 5g shows the reference HSIs, while the third and fourth rows show the reconstructed images of the 100th band and the corresponding error images. Except for the last column, each column in Figure 5 shows the results of one method. The error image reflects the difference between the fusion result and the ground truth. As depicted in Figure 5, red boxes mark the most visible areas so that the error images of the different algorithms can be compared clearly. Comparing the fused HSIs with the reference HSIs visually, the result of the **Blind STEREO** method shows slight spectral distortion on the top of the building, the **MAPSMM** method generates fuzzy spatial details in some areas, and the spatial information of the fused image is well enhanced by the **CNMF** method. A closer inspection reveals that the spectral and spatial differences among the fused HSIs obtained by the six methods are not obvious. Therefore, in order to further compare the fusion methods, the second and fourth rows of Figure 5 show the error images of the six methods for the two spectral bands. The error image is the absolute difference between the fused HSI and the reference HSI pixel values; we magnify the values in the error images by a factor of 10 for closer inspection. It can be seen that the **Blind STEREO**, **SFIM**, **MTF-GLP**, and **MAPSMM** methods show large differences, the **CNMF** method generates relatively smaller differences, and the **JTF** method has the smallest differences in most regions, indicating that this method has good fusion ability and provides clearer spatial details than the other five algorithms.

**Figure 5.** Reconstructed images and corresponding error images of Pavia University for the 50th and 100th bands with unknown **DH** and noisy **DM**: (**a**) JTF; (**b**) Blind STEREO; (**c**) CNMF; (**d**) SFIM; (**e**) MTF-GLP; (**f**) MAPSMM; (**g**) ground truth.

Similar to the previous experiments, Figure 6 shows the fusion results for Pavia University under the correct Gaussian kernel (9 × 9), containing the fused images of the 50th and 100th bands and the corresponding error images reconstructed by the six algorithms. The first and second rows of Figure 6 show the fused HSIs of the 50th band and the corresponding error images of each method; Figure 6g shows the reference HSIs, while the third and fourth rows show the reconstructed images of the 100th band and the corresponding error images, respectively. Similarly, in order to compare the error images of the different algorithms clearly, the values in the error images are magnified 10 times, and red boxes mark the regions with obvious errors. The spectral distortion caused by the **Blind STEREO** method is very obvious and is affected by the Gaussian kernel changes, as shown in Figure 6b. Compared with the **Blind STEREO** method, the other methods can effectively improve the spatial performance while maintaining the spectral information, and the differences between their fused images are not significant. Therefore, in order to further verify the fusion performance of the proposed method, the second and fourth rows of Figure 6 show the error images corresponding to the different methods. It can be seen that the error obtained by the **JTF** method is the lowest in most regions, and the fusion effect is not affected by the Gaussian kernel, which indicates that the **JTF** method has a superior image reconstruction effect and is more robust. Overall, the **JTF** method has better reconstruction performance and clearer fusion effects than the other five algorithms.

**Figure 6.** Reconstructed images and corresponding error images of Pavia University for the 20th and 60th bands with unknown **DH** and noisy **DM**: (**a**) JTF; (**b**) Blind STEREO; (**c**) CNMF; (**d**) SFIM; (**e**) MTF-GLP; (**f**) MAPSMM; (**g**) ground truth.

#### *5.5. Experimental Results of the Noisy Case*

In practice, there exists additive noise in the hyperspectral and multispectral imaging processes. Therefore, to test the robustness of the proposed **JTF** method to noise, we first simulate the tensor images M and H in the same way as in the previous experiments for Pavia University and then add Gaussian noise to the HSI and MSI. Because the noise level in the HSI is often higher than that of the MSI, we fix the SNR of the noise added to the HSI at 20 dB and compare the evaluation indicators with the five traditional models as the noise added to the MSI changes.
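As an illustration of this simulation protocol, zero-mean Gaussian noise at a prescribed SNR can be generated as in the following Python sketch; the exact noise-generation code is not given in the paper, so the function below is only an assumed rendering of the setup.

```python
import numpy as np

def add_gaussian_noise(X, snr_db, seed=0):
    # Scale the noise variance so that 10*log10(signal power / noise power)
    # equals snr_db; the HSI is fixed at 20 dB in these experiments.
    rng = np.random.default_rng(seed)
    noise_power = np.mean(X ** 2) / (10.0 ** (snr_db / 10.0))
    return X + rng.normal(0.0, np.sqrt(noise_power), X.shape)
```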

Figure 7 presents the quality metric values for the noisy cases on Pavia University. It can be seen that the reconstruction performance of the six fusion algorithms improves as the SNR of the MSI increases. Although the fusion effect of **CNMF** approaches that of the **JTF** algorithm when the noise is high, the **JTF** method remains better than the other test methods in the noisy case as a whole.

**Figure 7.** The results of evaluation metrics under different noises. MSI, multispectral image.

#### *5.6. Analysis of Computational Costs*

In this section, experiments are carried out on the six methods to demonstrate the computational efficiency of the proposed method; all experiments are run in MATLAB R2016b on a PC with an Intel Core i7-7500 CPU and 8 GB RAM. The mean running time (in seconds) of all comparison methods is shown in Table 2.

As can be seen from Table 2, the filter-based fusion method (**SFIM**) does not need to compute the optimal factor matrix of each mode, and its running time is shorter than that of the tensor-based methods. For the tensor-based methods, since an iterative strategy is used to obtain the optimal solution for each unknown factor, the running times of the two methods (**JTF**, **Blind STEREO**) are almost the same. **MTF-GLP** runs between these two classes of methods, while **CNMF** and **MAPSMM** have long running times. Given its excellent performance, the running time of **JTF** is acceptable. Analyses were conducted under various noise levels and different Gaussian kernels to further observe the behavior of the algorithms. The results indicate that the running time of most algorithms is shorter when the Gaussian kernel is estimated correctly. However, the running time of the **JTF** algorithm differs little under unknown Gaussian kernels, and its running time shows little dependence on the magnitude of the additive noise, which indirectly demonstrates that our algorithm is more robust.



#### **6. Conclusions**

In this paper, a joint tensor decomposition method was proposed to fuse hyperspectral and multispectral images and address the hyperspectral super-resolution problem. The JTF algorithm treats the fusion problem as a joint tensor decomposition, which not only ensures the uniqueness of the decomposition, but is also applicable when the degenerate operators are unknown or hard to gauge. In order to assess the reconstruction effect of this method, we compared the performance of the proposed algorithm with that of five other algorithms. Experiments show that the proposed algorithm has clear performance advantages and promising application prospects. In future work, we will concentrate on adding non-negativity constraints to the joint tensor decomposition of super-resolution images.

**Author Contributions:** Methodology, X.R. and L.L.; resources, J.C.; validation, X.R.; writing, original draft, X.R. and L.L.; writing, review and editing, L.L. All authors read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by the National Natural Science Foundation of China under Grant No. 51877144.

**Acknowledgments:** The authors would like to thank the Editors and Reviewers of the Remote Sensing journal for their constructive comments and suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Hyperspectral Image Super-Resolution with Self-Supervised Spectral-Spatial Residual Network**

**Wenjing Chen 1,2, Xiangtao Zheng 1,\* and Xiaoqiang Lu <sup>1</sup>**


**Abstract:** Recently, many convolutional networks have been built to fuse a low spatial resolution (LR) hyperspectral image (HSI) and a high spatial resolution (HR) multispectral image (MSI) to obtain HR HSIs. However, most deep learning-based methods are supervised methods, which require sufficient HR HSIs for supervised training. Collecting plenty of HR HSIs is laborious and time-consuming. In this paper, a self-supervised spectral-spatial residual network (SSRN) is proposed to alleviate dependence on a mass of HR HSIs. In SSRN, the fusion of HR MSIs and LR HSIs is considered a pixel-wise spectral mapping problem. Firstly, this paper assumes that the spectral mapping between HR MSIs and HR HSIs can be approximated by the spectral mapping between LR MSIs (derived from HR MSIs) and LR HSIs. Secondly, the spectral mapping between LR MSIs and LR HSIs is explored by SSRN. Finally, a self-supervised fine-tuning strategy is proposed to transfer the learned spectral mapping to generate HR HSIs. SSRN does not require HR HSIs as the supervised information in training. Simulated and real hyperspectral databases are utilized to verify the performance of SSRN.

**Keywords:** hyperspectral image super-resolution; data fusion; spectral-spatial residual network; multispectral image; self-supervised training

#### **1. Introduction**

Hyperspectral imaging sensors collect hyperspectral images (HSIs) across many narrow spectral wavelengths, which contain rich physical properties of observed scenes [1]. HSIs with high spectral resolution are beneficial for various tasks, e.g., classification [2] and detection [3]. However, as the amount of incident energy is limited, observed HSIs usually have low spatial resolution (LR) [4]. Contrary to HSIs, observed multispectral images (MSIs) have high spatial resolution (HR) but low spectral resolution [5,6]. Exploring both MSIs and HSIs captured in the same scene is a feasible and effective way for improving the spatial resolution of HSIs [7].

Over decades, many methods [8,9] have been proposed to reconstruct the desired HR HSI by fusing HR MSIs and LR HSIs, including sparse representation-based methods [10,11], Bayesian-based methods [12,13], spectral unmixing-based methods [1,14], and tensor factorization-based methods [15,16]. Sparse representation-based, Bayesian-based, and spectral unmixing-based methods usually first learn spectral bases (or endmembers) from the LR HSI [9,10]. Then, the learned spectral bases are transformed to extract the sparse codes (or abundances) from the HR MSI. Finally, the desired HR HSI is reconstructed using the learned spectral bases and sparse codes. These methods usually treat the HR MSI and LR HSI as 2-D matrices, with the result that the spatial structure information of HR MSIs and LR HSIs is not effectively exploited [15]. Tensor factorization-based methods [15,16] consider HR MSIs and LR HSIs as 3-D tensors to fully explore the spatial structure information of HR MSIs and LR HSIs.

**Citation:** Chen, W.; Zheng, X.; Lu, X. Hyperspectral Image Super-Resolution with Self-Supervised Spectral-Spatial Residual Network. *Remote Sens.* **2021**, *13*, 1260. https://doi.org/10.3390/ rs13071260

Academic Editor: Chein-I Chang

Received: 16 February 2021 Accepted: 23 March 2021 Published: 26 March 2021


In general, previous methods mainly focus on exploiting various handcrafted priors (e.g., sparsity and low-rankness) to improve the quality of the reconstructed HR HSI [9]. However, sparsity and low-rankness priors may not hold in real complicated scenarios [17], which can result in unsatisfactory super-resolved results [18].

Recent works [19–22] usually build various deep learning (DL) architectures to learn deep priors for fusing HR MSIs and LR HSIs. Due to powerful feature learning capabilities, DL-based methods have shown superior performance. In most DL-based methods, deep networks are usually utilized to learn the deep priors between LR HSIs and HR HSIs [22–24]. For example, Li et al. [25] employed a Laplacian pyramid network instead of the bicubic interpolation to upsample HSIs for the guided filtering-based MSI and HSI fusion. Dian et al. [21] proposed to utilize a residual network to learn deep priors of HR HSIs. However, these methods are supervised methods, which require plentiful HR HSIs as the supervised information to optimize weight parameters of deep networks. It is an intractable problem to collect a mass of HR HSIs for supervised training [26].

To mitigate the dependence on HR HSIs as the supervised information, several works [26,27] have designed unsupervised deep networks. Yuan et al. [26] transferred the deep priors between LR and HR nature images to HSIs. Sidorov et al. [27] utilized a fully convolutional encoder–decoder network to explore deep hyperspectral priors. However, these methods [26,27] cannot exploit HR MSIs for reconstructing HR HSIs. To leverage both LR HSIs and the corresponding HR MSI, several works [17,28] attempted to build two-branch deep networks. Qu et al. [28] designed two sparse Dirichlet autoencoder networks: one for extracting spectral bases from LR HSIs and the other for extracting spatial representations from HR MSIs. Ma et al. [17] proposed a generative adversarial network with two discriminators to reconstruct HR HSIs. One discriminator is utilized to keep the spectral information of HR HSIs consistent with that of LR HSIs, and the other discriminator is designed to keep the spatial structures of HR HSIs consistent with those of HR MSIs. However, these methods [17,28] ignore the potential spectral mapping relationship between the observed MSI and HSI.

In this paper, the fusion problem of HR MSIs and LR HSIs is considered a problem of learning the pixel-wise spectral mapping from MSIs to HSIs. The pixel-wise spectral mapping can be utilized to reconstruct hyperspectral pixels directly from multispectral pixels. Since the LR HSI and the reconstructed HR HSI contain the same observed scene, the spectral mapping between the HR MSI and HR HSI is assumed to be approximately equal to that between the corresponding LR MSI and LR HSI. In this paper, as shown in Figure 1, a self-supervised spectral-spatial residual network (SSRN) is proposed to learn the pixel-wise spectral mapping between LR MSIs and LR HSIs. Then, the learned spectral mapping is transferred to reconstruct the desired HR HSI from HR MSIs. In the proposed SSRN, the LR MSI utilized for training is the spatial degradation of the HR MSI. Additionally, SSRN takes the observed LR HSI instead of the HR HSI as supervised information in training. There are two advantages to considering the fusion problem of HR MSIs and LR HSIs as the problem of learning the pixel-wise spectral mapping. The first advantage is that reconstructing HR HSIs directly from HR MSIs, which contain the desired HR spatial structure information, can mitigate the distortion of spatial structures in HR HSIs. The second advantage is that there are naturally plentiful multispectral and hyperspectral pixel pairs between MSIs and HSIs, which are sufficient for training deep networks without the need to introduce other supervised information.

**Figure 1.** Illustration of the proposed spectral-spatial residual network (SSRN) framework. Firstly, a low spatial resolution (LR) multispectral image (MSI) and an LR hyperspectral image (HSI) are utilized to train the proposed SSRN to learn the pixel-wise spectral mapping. Then, the learned pixel-wise spectral mapping is exploited to estimate high spatial resolution (HR) HSIs from HR MSIs.

The proposed SSRN includes two modules: the spectral module and the spatial module. First, the spectral module is proposed to extract spectral features from MSIs. In the spectral module, the concatenation operation is employed to explore the complementarity among multi-layer features. Second, the spatial module is added following the spectral module to capture spectral-spatial features that facilitate learning of the spectral mapping. In particular, an attention mechanism is employed in the spatial module so that SSRN extracts spectral-spatial features from homogeneous adjacent pixels, since homogeneous adjacent pixels in HSIs usually share similar spectral signatures. Finally, a self-supervised fine-tuning strategy is employed to further improve the performance of SSRN. In fact, the spatial degradation from the HR image to the LR image usually interferes with the spectral signatures of the LR image, which makes the spectral mapping between LR MSIs and LR HSIs slightly different from the spectral mapping between HR MSIs and HR HSIs. The self-supervised fine-tuning strategy is utilized to obtain the spectral mapping between HR MSIs and HR HSIs from the spectral mapping between LR MSIs and LR HSIs. The experimental results demonstrate that SSRN performs better than the state-of-the-art methods.

The major contributions of this paper are as follows:


The remaining sections are as follows. In Section 2, recent HSI super-resolution methods are reviewed. In Section 3, the proposed SSRN is introduced. The experimental results of SSRN and the compared methods are reported in Section 4. The performance of SSRN is discussed in Section 5. Finally, Section 6 concludes this paper.

#### **2. Related Work**

Many methods have been proposed to reconstruct HR HSIs by fusing the observed LR HSI and HR MSI. In light of whether deep networks are utilized, the existing methods are roughly categorized into traditional methods and DL-based methods.

#### *2.1. Traditional Methods*

According to different technique frameworks, traditional methods can be further divided into sparse representation-based methods, Bayesian-based methods, spectral unmixing-based methods, and tensor factorization-based methods.

Sparse representation-based methods [29] learn a dictionary from the observed LR HSI. The dictionary represents the reflectance spectrum of the scene and is then employed to learn the sparse code of HR MSIs. Akhtar et al. [30] proposed a generalization of the simultaneous orthogonal matching pursuit (GSOMP) method. Wei et al. [31] proposed a variational-based fusion method and designed a sparse regularization term.

Bayesian-based methods [32] intuitively interpret the process of fusion through the posterior distribution. Eismann et al. [33] proposed a maximum a posteriori probability (MAP) estimation method. Wei et al. [34] proposed a hierarchical Bayesian fusion method to fuse spectral images. Irmak et al. [35] proposed a MAP-based energy function to enhance the spatial resolution of HSI.

Spectral unmixing-based methods usually employ nonnegative matrix factorization to decompose HR MSIs and LR HSIs [36,37]. A classic method is coupled nonnegative matrix factorization (CNMF) [36]. In CNMF, HR MSIs and LR HSIs are alternately decomposed. Then, the estimated endmember matrix of the LR HSI and the estimated abundance matrix of the HR MSI are multiplied to reconstruct the HR HSI. Borsoi et al. [1] embedded an explicit parameter into a spectral unmixing-based method to model the spectral variability between the HR MSI and LR HSI.

Tensor factorization-based methods treat HSIs as a 3-D tensor to estimate a core tensor and the dictionaries of the width, height, and spectral modes [15,16]. Dian et al. [16] introduced the sparsity prior into tensor factorization to extract non-local spatial information from HR MSIs and spectral information from LR HSIs, respectively. Li et al. [38] proposed a coupled sparse tensor factorization to estimate the core tensor.

Traditional methods have achieved favorable performances by exploiting the priors (e.g., sparsity and low-rankness), but such priors may not hold in some complicated scenarios [9,17,18].

#### *2.2. Deep Learning-Based Methods*

Recently, many works have designed various deep networks for fusing HR MSIs and LR HSIs, which can be divided into supervised DL-based methods [39] and unsupervised DL-based methods [40].

Supervised DL-based methods usually exploit massive HR HSIs as training images to learn potential HSI priors or the mapping relationship between LR and HR HSIs [20,25]. Xie et al. [20] exploited the low-rankness prior of HSIs to construct an MSI and HSI fusion model, which can be optimized iteratively with the proximal gradient. Subsequently, the iterative optimization is unfolded into a convolutional network structure for end-to-end training. Wei et al. [23] proposed a residual convolutional network to learn the mapping relationship between LR MSIs and HR MSIs. To mitigate dependence on the point spread function and spectral response function, Wang et al. [24] proposed a blind iterative fusion network to iteratively optimize the observation model. Li et al. [39] proposed a two-stream network to reconstruct HR HSIs, where one is a 1-D convolutional stream to extract spectral features and the other is a 2-D convolutional stream to extract spatial features. However, in practice, collecting plenty of HR HSIs as supervised information for training is time-consuming and laborious [26,27].

Unsupervised DL-based methods are dedicated to leveraging spectral and spatial ingredients from the given HR MSI and LR HSI to reconstruct the desired HR HSI [17,28,41]. Huang et al. [42] utilized a sparse denoising autoencoder to learn the spatial mapping relationship between LR and HR panchromatic images, where LR panchromatic images are obtained from the spectral degradation of LR MSIs. Then, the learned spatial mapping relationship was exploited to improve the spatial resolution of each spectral band of LR MSIs. Fu et al. [40] proposed a plain network simply composed of five convolution layers to fuse HR MSIs and LR HSIs. The HR MSI was concatenated with the feature maps of every convolution layer to guide the spatial structure reconstruction of HR HSIs. Although recent methods have achieved superior performance [17,28], designing deep networks suitable for HSI super-resolution that do not require additional supervision information for training is still an open problem.

#### **3. Materials and Methods**

#### *3.1. Proposed Method*

#### 3.1.1. Problem Formulation

The goal of the proposed SSRN is to estimate the HR HSI by fusing the observed HR MSI and LR HSI of the same scene. Let the HR HSI be $\mathbf{X}_H \in \mathbb{R}^{B \times W \times H}$, the observed LR HSI be $\mathbf{X}_L \in \mathbb{R}^{B \times w \times h}$, and the observed HR MSI be $\mathbf{Y}_H \in \mathbb{R}^{b \times W \times H}$, where $B$ and $b$ represent spectral band numbers, $W$ and $w$ represent the width, and $H$ and $h$ represent the height. The observed LR HSI has a higher spectral resolution and a lower spatial resolution than the observed HR MSI, i.e., $W = D \times w$, $H = D \times h$, and $B \gg b$ ($D$ is the scaling factor). In fact, one pixel $\mathbf{Y}_H(i,j) \in \mathbb{R}^b$ of $\mathbf{Y}_H$ uniquely corresponds to one pixel $\mathbf{X}_H(i,j) \in \mathbb{R}^B$ of $\mathbf{X}_H$, where $(i,j)$ represents the spatial location in the $i$th row and the $j$th column. This paper exploits a convolutional network to learn a nonlinear pixel-wise spectral mapping $F : \mathbb{R}^b \to \mathbb{R}^B$ that maps $\mathbf{Y}_H(i,j)$ to $\mathbf{X}_H(i,j)$. The pixel-wise spectral mapping can be formulated as

$$\mathbf{X}\_{H} = F(\mathbf{Y}\_{H}).\tag{1}$$

Since HR HSIs are difficult to obtain in practice [26,28], the proposed SSRN does not use HR HSIs as supervised information. In this paper, the spectral signatures of the LR HSI $\mathbf{X}_L$ are first used as the supervised information to learn the spectral mapping $\hat{F} : \mathbb{R}^b \to \mathbb{R}^B$ between LR MSIs and LR HSIs. During the training phase, the input of SSRN is the LR MSI $\mathbf{Y}_L \in \mathbb{R}^{b \times w \times h}$ and the output is the LR HSI $\mathbf{X}_L$. $\mathbf{Y}_L$ is obtained by spatially blurring and then downsampling $\mathbf{Y}_H$:

$$\mathbf{Y}\_{L} = E(\mathbf{Y}\_{H}),\tag{2}$$

where $E(\cdot)$ represents the spatial blurring and downsampling operations [12]. Then, the learned spectral mapping $\hat{F}$ between the LR MSI $\mathbf{Y}_L$ and LR HSI $\mathbf{X}_L$ is transformed into the spectral mapping $F$ with a self-supervised fine-tuning strategy, which can reconstruct the HR HSI $\mathbf{X}_H$ from the HR MSI $\mathbf{Y}_H$.
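A minimal sketch of the degradation operator $E(\cdot)$ is given below, assuming a per-band Gaussian blur followed by regular subsampling; the blur width `sigma` is an illustrative choice, since the paper only specifies blurring and downsampling.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_degrade(Y_H, scale=4, sigma=1.0):
    # E(.) of Eq. (2): blur each spectral band, then downsample by `scale`.
    # Y_H has shape (bands, W, H); the result has shape (bands, W/scale, H/scale).
    blurred = np.stack([gaussian_filter(band, sigma) for band in Y_H])
    return blurred[:, ::scale, ::scale]
```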

Previous methods [10,28,30,36] usually focus on extracting spectral ingredients (spectral bases or endmembers) from the LR HSI $\mathbf{X}_L$ and extracting spatial ingredients (sparse codes or abundances) from the HR MSI $\mathbf{Y}_H$. Then, the spectral ingredients of $\mathbf{X}_L$ and the spatial ingredients of $\mathbf{Y}_H$ are utilized to reconstruct the HR HSI $\mathbf{X}_H$. However, the observed scene in the HR MSI $\mathbf{Y}_H$ usually contains complex spatial distributions of land covers; hence, there are still many challenges in accurately extracting spatial ingredients from the HR MSI $\mathbf{Y}_H$ [8,9]. In previous methods, inaccurate spatial ingredients extracted from the HR MSI $\mathbf{Y}_H$ can cause spatial distortion in the reconstructed HR HSI. Different from previous methods [10,28,30,36], the proposed SSRN avoids the process of spatial ingredient extraction from the HR MSI $\mathbf{Y}_H$: it considers the fusion problem of the HR MSI and LR HSI as a problem of spectral mapping learning. Based on the learned spectral mapping $F$, the HR HSI $\mathbf{X}_H$ is directly reconstructed from the HR MSI $\mathbf{Y}_H$, so all the spatial ingredients of $\mathbf{Y}_H$ can be used to reconstruct $\mathbf{X}_H$. Therefore, compared with previous methods, the proposed SSRN can better preserve the spatial structures of the reconstructed HR HSI.

The proposed method is similar to recent spectral resolution enhancement methods [43,44] that focus on learning the spectral mapping between MSIs and HSIs. However, the methods for spectral resolution enhancement are usually supervised training methods [45,46], which learn the spectral mapping from plentiful MSI and HSI pairs collected in other observed scenes. In contrast, in the HR MSI and LR HSI fusion task, the HR MSI and LR HSI are captured in the same observed scene. Our proposed method is a self-supervised training method specially designed for the HR MSI and LR HSI fusion task. The details of SSRN are introduced in the following subsections.

#### 3.1.2. Architecture of SSRN

A detailed architecture of SSRN is shown in Figure 2. SSRN consists of two modules: the spectral module and the spatial module. In SSRN, the input is an MSI patch $\hat{\mathbf{Y}} \in \mathbb{R}^{b \times K \times K}$ and the output is an HSI patch $\hat{\mathbf{X}} \in \mathbb{R}^{B \times K \times K}$, where $b$ and $B$ represent spectral band numbers and $K \times K$ represents the spatial size. First, a 1 × 1 convolution layer is used to generate initial shallow spectral features from the MSI patch $\hat{\mathbf{Y}}$. Then, the spectral module is utilized to extract spectral features from the initial shallow spectral features, and the spatial module is added following the spectral module to extract spectral-spatial features that facilitate learning of the spectral mapping.

**Figure 2.** Architecture of the proposed SSRN. A spectral module and a spatial module are utilized to learn the pixel-wise spectral mapping between MSIs and HSIs. ⊕ represents the residual connection.

In this paper, spectral features refer to the features of multispectral pixels in the spectral dimension, which do not involve any information of the spatially adjacent pixels. The spectral module mainly consists of several residual blocks and a multi-layer feature aggregation (MLFA) component. As shown in Figure 3, the setting of the residual blocks is similar to that in the literature [47], where the residual connection can facilitate the convergence of SSRN. In residual blocks, the kernel size of all convolution layers is set to 1 × 1 to ensure that spectral feature extraction is only performed in the spectral dimension of MSI. The different residual blocks can extract different spectral features, which are beneficial for learning spectral mapping [48,49]. To explore the complementarity among the features of different residual blocks, an MLFA component is employed to integrate these features into the final spectral feature. The MLFA component is composed of a concatenation layer and 1 × 1 convolution, which do not introduce any information from the spatially adjacent multispectral pixels.

**Figure 3.** Structures of the residual block and the self-attention module. ⊕ represents the residual connection. ReLU represents the rectified linear unit.

The spatial module aims to extract complementary spatial information from adjacent pixels to learn the spectral mapping. In this paper, the spatial information from adjacent pixels refers to the spatial structure information and spectra contained in adjacent pixels. In practice, because adjacent pixels in real MSIs or HSIs potentially correspond to the same object, adjacent pixels may have similar spectral signatures [50–52], which can be used as a prior to refine the reconstructed HR HSI. Adjacent pixels with similar spectral signatures are called homogeneous adjacent pixels. The spatial information of homogeneous adjacent pixels in the HR MSI is beneficial for learning the pixel-wise spectral mapping between MSIs and HSIs [53]. Previous methods [43,44] usually use 3 × 3 convolutions to extract spatial information. However, a 3 × 3 convolution can introduce interference from inhomogeneous adjacent pixels [2]. In this paper, the spatial module employs a self-attention module [54] to extract spectral-spatial features from the homogeneous adjacent pixels. The self-attention module can capture homogeneous adjacent pixels based on the correlation between different pixels [54] and then aggregate the information from these homogeneous adjacent pixels to generate spectral-spatial features. The details of the self-attention module are shown in Figure 3. The self-attention module takes the final spectral feature from the spectral module as input and outputs the spectral-spatial features. The final spectral feature is denoted as $\mathbf{S} \in \mathbb{R}^{C \times K \times K}$, where $C$ is the channel number and $K \times K$ is the spatial size. First, $\mathbf{S}$ is fed into three 1 × 1 convolution layers to generate abstract features $f(\mathbf{S})$, $g(\mathbf{S})$, and $n(\mathbf{S}) \in \mathbb{R}^{C \times K \times K}$, respectively. Second, $f(\mathbf{S})$, $g(\mathbf{S})$, and $n(\mathbf{S})$ are reshaped to $\bar{f}(\mathbf{S})$, $\bar{g}(\mathbf{S})$, and $\bar{n}(\mathbf{S}) \in \mathbb{R}^{C \times M}$, where $M = K \times K$. Each column of the reshaped $\bar{f}(\mathbf{S})$, $\bar{g}(\mathbf{S})$, and $\bar{n}(\mathbf{S})$ represents the spectral feature of a certain pixel. The correlation among pixels in the spectral features is calculated as follows

$$\mathbf{N} = \bar{f}(\mathbf{S})^T \bar{g}(\mathbf{S}), \tag{3}$$

where $(\cdot)^T$ denotes the transpose and $\mathbf{N} \in \mathbb{R}^{M \times M}$. A softmax function is employed to normalize the values of all elements in $\mathbf{N}$ to the range [0, 1]. Then, the spectral-spatial features are generated by multiplying $\bar{n}(\mathbf{S})$ with the normalized correlation $\mathbf{N}$. With the normalized correlation $\mathbf{N}$, the homogeneous pixels from adjacent regions can be aggregated to facilitate the learning of the spectral mapping. Finally, the spectral-spatial features are reshaped back to $\mathbb{R}^{C \times K \times K}$ for the following operations. After the self-attention module, a 1 × 1 convolution layer with $B$ kernels is utilized to reconstruct the desired HR HSI from the spectral-spatial features.

In the proposed SSRN, the kernel size of all convolution layers is set to 1 × 1, which can mitigate the difficulty of training SSRN caused by too many weight parameters.
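To make the shapes in the self-attention computation above concrete, here is a NumPy sketch of a forward pass through the attention step. The weight matrices stand in for the three 1 × 1 convolutions (a 1 × 1 convolution on a $C \times K \times K$ map is a per-pixel linear map on the channels), and the orientation of the final multiplication is one plausible reading of Equation (3), not the authors' exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(S, Wf, Wg, Wn):
    # S: spectral feature of shape (C, K, K); Wf, Wg, Wn: (C, C) matrices
    # standing in for the three 1x1 convolutions.
    C, K, _ = S.shape
    M = K * K
    S2 = S.reshape(C, M)                  # one column per pixel
    f, g, n = Wf @ S2, Wg @ S2, Wn @ S2   # abstract features feeding Eq. (3)
    N = softmax(f.T @ g, axis=-1)         # (M, M) pixel-to-pixel correlation
    out = n @ N                           # aggregate homogeneous adjacent pixels
    return out.reshape(C, K, K)
```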

#### 3.1.3. Loss Function

In the proposed SSRN, a reconstruction loss $L_{rec}$ and a cosine similarity loss $L_{cos}$ are employed as the loss functions. Let $\mathbf{U}$ represent the generated HSI and $\mathbf{V}$ represent the ground truth. For convenience, $\mathbf{U}$ and $\mathbf{V}$ are reshaped to $\mathbb{R}^{P \times Q}$, where $P$ is the number of spectral bands and $Q$ is the number of pixels. Each column of $\mathbf{U}$ and $\mathbf{V}$ represents the spectral vector of a hyperspectral pixel. The reconstruction loss $L_{rec}$ is a classic metric function that measures the numerical differences between two HSIs. $L_{rec}$ is defined as

$$L_{rec}(\mathbf{U}, \mathbf{V}) = \|\mathbf{U} - \mathbf{V}\|_F^2, \tag{4}$$

where $\|\cdot\|_F$ represents the Frobenius norm. The cosine similarity loss $L_{cos}$ measures the spectral distortion based on the angle between two spectral signatures. $L_{cos}$ is defined as

$$L_{cos}(\mathbf{U}, \mathbf{V}) = 1 - \frac{1}{Q} \sum_{i=1}^{Q} \frac{\mathbf{U}^{(i)} \cdot \mathbf{V}^{(i)}}{\|\mathbf{U}^{(i)}\|_2 \, \|\mathbf{V}^{(i)}\|_2}, \tag{5}$$

where $\mathbf{U}^{(i)}$ is the $i$th column of $\mathbf{U}$, denoting the spectral vector of the $i$th pixel of $\mathbf{U}$, and $\mathbf{V}^{(i)}$ is the $i$th column of $\mathbf{V}$, denoting the spectral vector of the $i$th pixel of $\mathbf{V}$ ($1 \le i \le Q$).

In the training phase, the LR MSI $\mathbf{Y}_L$ and the LR HSI $\mathbf{X}_L$ are cropped into small patches for training. Let $\hat{\mathbf{Y}}_L \in \mathbb{R}^{b \times K \times K}$ be the LR MSI patch cropped from $\mathbf{Y}_L$, $\hat{\mathbf{X}}_L \in \mathbb{R}^{B \times K \times K}$ be the corresponding LR HSI patch cropped from $\mathbf{X}_L$, and $\bar{\mathbf{X}}_L = \hat{F}(\hat{\mathbf{Y}}_L) \in \mathbb{R}^{B \times K \times K}$ be the reconstructed LR HSI patch. To facilitate calculation of the loss, $\hat{\mathbf{Y}}_L$ is reshaped to $\mathbb{R}^{b \times M}$, and $\hat{\mathbf{X}}_L$ and $\bar{\mathbf{X}}_L$ are reshaped to $\mathbb{R}^{B \times M}$, where $M = K \times K$. The details of the loss function in SSRN are as follows.

First, the loss $Loss_{HSI}$ between the reconstructed $\bar{\mathbf{X}}_L$ and the ground truth $\hat{\mathbf{X}}_L$ is measured by the reconstruction loss $L_{rec}$ and the cosine similarity loss $L_{cos}$:

$$Loss_{HSI}(\bar{\mathbf{X}}_L, \hat{\mathbf{X}}_L) = L_{rec}(\bar{\mathbf{X}}_L, \hat{\mathbf{X}}_L) + \lambda L_{cos}(\bar{\mathbf{X}}_L, \hat{\mathbf{X}}_L), \tag{6}$$

where $\lambda$ is the balancing parameter controlling the tradeoff between $L_{rec}$ and $L_{cos}$.

Second, according to the observation model, the LR MSI patch $\hat{\mathbf{Y}}_L$ is the spectral degradation of the LR HSI patch $\hat{\mathbf{X}}_L$ [14], which can be formulated as

$$\hat{\mathbf{Y}}_L = R(\hat{\mathbf{X}}_L), \tag{7}$$

where $R(\cdot)$ represents the spectral degradation. This means that the spectral degradation of the reconstructed HSI patch $\bar{\mathbf{X}}_L$ should also be consistent with the input MSI patch $\hat{\mathbf{Y}}_L$. To maintain the consistency between $R(\bar{\mathbf{X}}_L)$ and $\hat{\mathbf{Y}}_L$, another loss function $Loss_{MSI}$ is established in this paper. Similar to $Loss_{HSI}$, $Loss_{MSI}$ is formulated as

$$Loss_{MSI}(R(\bar{\mathbf{X}}_L), \hat{\mathbf{Y}}_L) = L_{rec}(R(\bar{\mathbf{X}}_L), \hat{\mathbf{Y}}_L) + \beta L_{cos}(R(\bar{\mathbf{X}}_L), \hat{\mathbf{Y}}_L), \tag{8}$$

where $\beta$ is simply set to the same value as $\lambda$ in Equation (6), since the second terms in Equations (6) and (8) are both the cosine similarity loss. Overall, the loss function of SSRN is set as

$$Loss_{train} = Loss_{HSI}(\bar{\mathbf{X}}_L, \hat{\mathbf{X}}_L) + \phi \, Loss_{MSI}(R(\bar{\mathbf{X}}_L), \hat{\mathbf{Y}}_L). \tag{9}$$

In the proposed SSRN, $Loss_{HSI}$ and $Loss_{MSI}$ are equally important for reconstructing the HR HSI. Therefore, $\phi$ is simply set to 1 in the following experiments.
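The training loss of Equations (4)–(9) is straightforward to express. The following NumPy sketch assumes patches already reshaped to matrices with one column per pixel and treats the spectral degradation $R$ as a callable; the paper implements training in TensorFlow (see Section 3.2), so this is only an illustration.

```python
import numpy as np

def l_rec(U, V):
    # Eq. (4): squared Frobenius norm of the difference.
    return np.sum((U - V) ** 2)

def l_cos(U, V, eps=1e-12):
    # Eq. (5): one minus the mean cosine similarity of per-pixel spectra;
    # U and V have shape (P, Q) with one column per pixel.
    num = np.sum(U * V, axis=0)
    den = np.linalg.norm(U, axis=0) * np.linalg.norm(V, axis=0) + eps
    return 1.0 - np.mean(num / den)

def loss_train(X_rec, X_gt, Y_in, R, lam=0.1, phi=1.0):
    # Eq. (9): Loss_HSI on the HSI patch plus Loss_MSI on its spectral
    # degradation; beta is tied to lambda as stated after Eq. (8).
    loss_hsi = l_rec(X_rec, X_gt) + lam * l_cos(X_rec, X_gt)
    loss_msi = l_rec(R(X_rec), Y_in) + lam * l_cos(R(X_rec), Y_in)
    return loss_hsi + phi * loss_msi
```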

#### 3.1.4. Self-Supervised Fine-Tuning

This paper assumes that the pixel-wise spectral mapping $F$ between the HR MSI and HR HSI can be estimated on the basis of the pixel-wise spectral mapping $\hat{F}$ between the LR MSI and LR HSI. The training process of the proposed SSRN includes two stages: the pretraining stage and the fine-tuning stage. In the pretraining stage, the pixel-wise spectral mapping $\hat{F}$ can be easily learned from the paired LR MSI patches and LR HSI patches using the proposed SSRN. In this stage, the proposed SSRN is supervised by LR MSIs and LR HSIs simultaneously. In fact, the spectral signatures of LR MSIs and LR HSIs are usually influenced by the spatial degradation, so the spectral mapping $\hat{F}$ is not exactly equal to the spectral mapping $F$. Hence, in this paper, a fine-tuning strategy is proposed to further estimate the spectral mapping $F$ from the spectral mapping $\hat{F}$. The SSRN trained with LR MSIs and LR HSIs serves as a pretrained network. Then, in the fine-tuning stage, the pretrained SSRN is further fine-tuned with the HR MSI. Since the HR HSI is hard to obtain in practice, SSRN does not utilize the HR HSI as supervised information in training. Therefore, Equation (6) cannot be employed as the loss function in the fine-tuning stage. Equation (8) is employed as the fine-tuning loss $Loss_{FT}$ to maintain the consistency between $R(\bar{\mathbf{X}}_H)$ and $\hat{\mathbf{Y}}_H$, where $R(\cdot)$ has the same definition as in Equation (7), $\bar{\mathbf{X}}_H$ is the reconstructed HR HSI patch, and $\hat{\mathbf{Y}}_H$ is the input HR MSI patch. $Loss_{FT}$ can be expressed as

$$Loss_{FT} = Loss_{MSI}(R(\bar{\mathbf{X}}_H), \hat{\mathbf{Y}}_H). \tag{10}$$

In the fine-tuning stage, the proposed SSRN is supervised only by the HR MSI; the fine-tuning stage is therefore self-supervised. After fine-tuning, the spectral mapping $F$ between the HR MSI and HR HSI is obtained, and the desired HR HSI can be reconstructed with Equation (1).
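Reusing `l_rec` and `l_cos` from the sketch after Section 3.1.3, the self-supervised fine-tuning objective of Equation (10) reduces to the MSI-consistency term alone; `beta` mirrors the balancing parameter of Equation (8) and is our illustrative default.

```python
def loss_ft(X_rec_H, Y_H, R, beta=0.1):
    # Eq. (10): fine-tuning is supervised only by the HR MSI, through the
    # consistency between the spectrally degraded reconstruction and Y_H.
    return l_rec(R(X_rec_H), Y_H) + beta * l_cos(R(X_rec_H), Y_H)
```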

#### *3.2. Software and Package*

The proposed SSRN is implemented on a computer workstation configured with the Ubuntu 14.04 system, 64 GB RAM, an Intel Core i7-5930K CPU, and an NVIDIA TITAN X GPU. The software used in the experiments is PyCharm. The packages used in the experiments include Python, TensorFlow, NumPy, and SciPy.

#### *3.3. Databases*

To evaluate the performance of SSRN, the experiments are conducted on simulated databases and real databases, respectively. First, the Pavia University (PU) database (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes#Pavia_University_scene, accessed on 16 December 2020) and the Washington DC Mall (WDCM) database (https://engineering.purdue.edu/~biehl/MultiSpec/hyperspectral.html, accessed on 16 December 2020) are utilized to simulate MSIs for the experiments. Then, the Paris database (https://github.com/alfaiate/HySure/tree/master/data, accessed on 17 December 2020) [12] and the Ivanpah Playa database (https://github.com/ricardoborsoi/FuVar_Release/tree/master/DATA, accessed on 17 December 2020) [1], which contain both real MSIs and HSIs, are employed to conduct experiments. Finally, the CAVE database (https://www.cs.columbia.edu/CAVE/databases/multispectral/, accessed on 19 December 2020) [55] is employed to further explore the performance of SSRN.

The PU database is captured by the ROSIS sensor over Pavia University. The PU database contains an HSI with 103 spectral bands and 610 × 340 pixels. The WDCM database is collected by the HYDICE sensor over the National Mall. The WDCM database consists of an HSI with 191 spectral bands and 1280 × 307 pixels. Similar to the literature [37], a 200 × 200 subimage of the PU database and a 240 × 240 subimage of the WDCM database are utilized for experiments. The original HSIs in the PU and WDCM databases are regarded as the ground truth. The ground truth is blurred and then spatially downsampled with the scaling factor of 4 to simulate the observed LR HSI. The observed HR MSI is obtained by spectrally downsampling the ground truth. The setting of the spectral response function is the same as that in the literature [37].

The Paris database contains an HSI captured by the Hyperion instrument and an MSI collected by the ALI instrument [12]. The HSI contains 128 spectral bands. The MSI contains 9 spectral bands. Both the HSI and the MSI have 72 × 72 pixels. The Ivanpah Playa database consists of an HSI with 173 spectral bands and an MSI with 10 spectral bands. The HSI and the MSI in the Ivanpah Playa database contain 80 × 128 pixels. According to the literature [1], the HSIs in the Paris and Ivanpah Playa databases are treated as the ground truth, which is blurred and spatially downsampled with a scaling factor of 4 to generate the observed LR HSI. The MSIs in the Paris and Ivanpah Playa databases are treated as the observed HR MSI.

The CAVE database contains 32 HSIs, which are captured by a cooled charge-coupled device (CCD) camera on the ground [55]. In the CAVE database, each HSI consists of 512 × 512 pixels, where each pixel is composed of 31 spectral bands ranging from 400 nm to 700 nm. Following the literature [56], the original HSIs in the CAVE database are treated as the ground truth. Then, the ground truth is blurred and spatially downsampled with a scaling factor of 4 to obtain the observed LR HSI. The ground truth is spectrally downsampled by the spectral response function of a Nikon D700 (https://maxmax.com/spectral_response.htm, accessed on 19 December 2020) to obtain the observed HR MSI.

#### *3.4. Evaluation Metrics*

Five quantitative quality metrics are employed for performance evaluation, including peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM), universal image quality index (UIQI), erreur relative globale adimensionnelle de synthèse (ERGAS), and root mean squared error (RMSE). PSNR measures the spatial reconstruction quality of each spectral band in the reconstructed HR HSI. SAM measures the spectral distortions of each hyperspectral pixel in the reconstructed HR HSI. UIQI measures the spatial structural similarity between the reconstructed HR HSI and the ground truth based on the combination of luminance, contrast, and correlation comparisons. ERGAS takes into account the ratio of ground sample distances between HR MSI and LR HSI to measure the global statistical quality of the reconstructed HR HSI. RMSE measures the global statistical error between the reconstructed HR HSI and the ground truth. The larger values of PSNR and UIQI indicate the better quality of the reconstructed HR HSI. When the values of SAM, ERGAS, and RMSE are smaller, the quality of the reconstructed HR HSI is better. The best value of PSNR is +∞. The best value of SAM is 0. The best value of UIQI is 1. The best values of ERGAS and RMSE are 0.

In this paper, the ground truth $\tilde{\mathbf{X}}_H \in \mathbb{R}^{B \times W \times H}$ and the reconstructed HR HSI $\mathbf{X}_H \in \mathbb{R}^{B \times W \times H}$ are converted into 8-bit images to calculate quantitative performance, where $B$, $W$, and $H$ are the number of bands, the width, and the height, respectively. The formulations of the above quality metrics for the ground truth $\tilde{\mathbf{X}}_H$ and the reconstructed HR HSI $\mathbf{X}_H$ are given below.

PSNR is formulated as

$$\text{PSNR} = \frac{1}{B} \sum_{i=1}^{B} 10 \log_{10} \left( \frac{\max \left( \tilde{\mathbf{X}}_{H_i} \right)^2}{\frac{1}{W \times H} \sum_{j=1}^{W \times H} \left( \tilde{\mathbf{X}}_{H_{ij}} - \mathbf{X}_{H_{ij}} \right)^2} \right), \tag{11}$$

where $\max(\tilde{\mathbf{X}}_{H_i})$ represents the maximum pixel value in the $i$th band of $\tilde{\mathbf{X}}_H$. $\tilde{\mathbf{X}}_{H_{ij}}$ and $\mathbf{X}_{H_{ij}}$ ($1 \le i \le B$, $1 \le j \le W \times H$) represent the $j$th pixel in the $i$th band of $\tilde{\mathbf{X}}_H$ and $\mathbf{X}_H$, respectively.

SAM is formulated as

$$\text{SAM} = \frac{1}{W \times H} \sum_{j=1}^{W \times H} \arccos \left( \frac{\left( \tilde{\mathbf{X}}_{H}[j] \right)^T \mathbf{X}_{H}[j]}{\|\tilde{\mathbf{X}}_{H}[j]\|_2 \, \|\mathbf{X}_{H}[j]\|_2} \right), \tag{12}$$

where $\tilde{\mathbf{X}}_H[j] \in \mathbb{R}^{B \times 1}$ and $\mathbf{X}_H[j] \in \mathbb{R}^{B \times 1}$ ($1 \le j \le W \times H$) denote the spectra of the $j$th pixel of $\tilde{\mathbf{X}}_H$ and $\mathbf{X}_H$, respectively. $(\cdot)^T$ denotes the transpose, and $\|\cdot\|_2$ denotes the $\ell_2$ vector norm.

UIQI is formulated as

$$\text{UIQI} = \frac{1}{B} \sum_{i=1}^{B} \left( \frac{1}{Z} \sum_{q=1}^{Z} \frac{4 \mu_{\tilde{z}_{iq}} \mu_{z_{iq}} \sigma_{\tilde{z}_{iq} z_{iq}}}{\left( \mu_{\tilde{z}_{iq}}^2 + \mu_{z_{iq}}^2 \right) \left( \sigma_{\tilde{z}_{iq}}^2 + \sigma_{z_{iq}}^2 \right)} \right), \tag{13}$$

where a sliding window moving pixel by pixel is used to divide the $i$th band of $\tilde{\mathbf{X}}_H$ and $\mathbf{X}_H$ at the same position into $Z$ image patch pairs $\tilde{z}_{iq}$ and $z_{iq}$ ($1 \le i \le B$, $1 \le q \le Z$), respectively. $Z$ is the number of image patch pairs. $\mu_{\tilde{z}_{iq}}$ and $\mu_{z_{iq}}$ are the mean pixel values of the image patches $\tilde{z}_{iq}$ and $z_{iq}$, respectively; $\sigma_{\tilde{z}_{iq}}$ and $\sigma_{z_{iq}}$ are the corresponding variances, and $\sigma_{\tilde{z}_{iq} z_{iq}}$ is the covariance.

ERGAS is formulated as

$$\text{ERGAS} = 100 d \sqrt{\frac{1}{B} \sum_{i=1}^{B} \frac{\frac{1}{W \times H} \sum_{j=1}^{W \times H} \left(\tilde{\mathbf{X}}_{H_{ij}} - \mathbf{X}_{H_{ij}}\right)^2}{\left(\mu_{\tilde{\mathbf{X}}_{H_i}}\right)^2}}, \tag{14}$$

where $d$ is the ratio of ground sample distances between the HR MSI and the LR HSI, and $\mu_{\tilde{\mathbf{X}}_{H_i}}$ ($1 \le i \le B$) denotes the mean pixel value in the $i$th band of the ground truth HSI $\tilde{\mathbf{X}}_H$.
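The corresponding sketch for Equation (14); note that `d` is the ratio of ground sample distances between the HR MSI and the LR HSI (e.g., 1/4 for a scaling factor of 4, under our reading of the definition).

```python
import numpy as np

def ergas(ref, est, d):
    """ERGAS of Eq. (14); ref/est are (B, W, H) cubes, d the GSD ratio."""
    r = ref.reshape(ref.shape[0], -1).astype(np.float64)
    e = est.reshape(est.shape[0], -1).astype(np.float64)
    mse = np.mean((r - e) ** 2, axis=1)      # per-band mean squared error
    mu = np.mean(r, axis=1)                  # per-band mean of the ground truth
    return float(100 * d * np.sqrt(np.mean(mse / mu ** 2)))
```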

RMSE is formulated as

$$\text{RMSE} = \sqrt{\frac{1}{B} \sum_{i=1}^{B} \left( \frac{1}{W \times H} \sum_{j=1}^{W \times H} \left(\tilde{\mathbf{X}}_{H_{ij}} - \mathbf{X}_{H_{ij}}\right)^2 \right)}, \tag{15}$$

where $\tilde{\mathbf{X}}_{H_{ij}}$ and $\mathbf{X}_{H_{ij}}$ ($1 \le i \le B$, $1 \le j \le W \times H$) represent the $j$th pixel in the $i$th band of $\tilde{\mathbf{X}}_H$ and $\mathbf{X}_H$, respectively.
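Finally, a one-liner for Equation (15); since the band-wise means are averaged uniformly, it reduces to the RMSE over all bands and pixels.

```python
import numpy as np

def rmse(ref, est):
    """Global RMSE of Eq. (15) over all bands and pixels."""
    diff = ref.astype(np.float64) - est.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))
```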

#### **4. Results**

#### *4.1. Parameter Settings of SSRN*

This subsection explores the parameter settings of SSRN. The WDCM database has plenty of spectral bands and contains complicated land-cover distributions, making the fusion task challenging [16]. Therefore, the WDCM database is utilized for parameter setting experiments. For convenience, this subsection directly uses PSNR and SAM to measure the quality of the reconstructed HR HSI. Moreover, the fine-tuning strategy is not employed in the parameter setting experiments.

#### 4.1.1. Number of Convolutional Kernels

In the experiments, the spatial size of input image patches is set as 4 × 4. The number of training epochs is set as 200. The learning rate is initially set as 0.01, which then drops by a factor of 10 after 100 epochs. The balancing parameter *λ* in the loss function is initially set as 0.1. The number of residual blocks is set as 3. For convenience, all convolution layers of SSRN (except the last convolution layer) are configured with the same number of convolutional kernels, which is set as 16, 32, 64, 128, 256, and 512 for the experiments. The PSNR and SAM of SSRN with different numbers of convolutional kernels on the WDCM database are shown in Table 1. As the kernel number increases from 16 to 256, the performance of SSRN increases. As the kernel number increases from 256 to 512, the performance of SSRN decreases because too many weight parameters make SSRN difficult to train. Based on the results in Table 1, the number of convolutional kernels in all convolution layers of SSRN except the last one is set as 256 in the following experiments.

**Table 1.** Peak signal-to-noise ratio (PSNR) and spectral angle mapper (SAM) of SSRN with different numbers of convolutional kernels on the Washington DC Mall (WDCM) database.


#### 4.1.2. Number of Residual Blocks

SSRN utilizes several residual blocks to extract spectral features from the MSI. To explore the effects of different numbers of residual blocks on the performance of SSRN, the number of residual blocks is set as 1, 2, 3, 4, 5, and 6 for the experiments. The PSNR and SAM of SSRN on the WDCM database are shown in Table 2. SSRN with 4 residual blocks achieves the best performance, where the PSNR and SAM are 33.167 and 1.213, respectively. In the following experiments, the residual block number of SSRN is set as 4.

**Table 2.** PSNR and SAM of SSRN with different numbers of residual blocks on the WDCM database.


#### 4.1.3. Balancing Parameter *λ*

The balancing parameter *λ* is a key parameter that controls the tradeoff between the reconstruction loss and the cosine similarity loss in SSRN. If the value of *λ* is too small, the cosine similarity loss in SSRN is invalidated, resulting in a large SAM value of the reconstructed HR HSI. If the value of *λ* is too large, the reconstruction loss is invalidated, resulting in a decrease in the quality of the reconstructed HR HSI. To explore the impact of the balancing parameter *λ* on the performance of SSRN, *λ* is set as 0.001, 0.01, 0.1, 1, 5, and 10 for the experiments. The PSNR and SAM of SSRN with different values of *λ* are shown in Table 3. As *λ* increases from 0.001 to 0.1, the performance of SSRN increases. However, as *λ* increases from 0.1 to 10, the performance of SSRN decreases. The balancing parameter *λ* of SSRN is set as 0.1 in the following experiments.

**Table 3.** PSNR and SAM of SSRN with different *λ* on the WDCM database.
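To make the role of *λ* concrete, the following is an illustrative NumPy sketch of a composite loss of this form, combining a reconstruction term with a *λ*-weighted cosine-similarity term; it is a sketch under our assumptions, not the paper's exact Equation (9), which is not reproduced here.

```python
import numpy as np

def composite_loss(pred, target, lam=0.1, eps=1e-12):
    """Reconstruction loss plus lam * spectral (cosine-similarity) loss.
    pred/target: (B, N) arrays holding N predicted/reference spectra."""
    recon = np.mean((pred - target) ** 2)          # spatial/reconstruction term
    cos = np.sum(pred * target, axis=0) / (
        np.linalg.norm(pred, axis=0) * np.linalg.norm(target, axis=0) + eps)
    spectral = np.mean(1.0 - cos)                  # small when spectra align
    return recon + lam * spectral
```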


#### *4.2. Ablation Study*

The proposed SSRN can be decomposed into five components: the basic network, the MLFA component, the spatial module, the cosine similarity loss, and the fine-tuning. The basic network refers to the proposed spectral module without the MLFA, which can be utilized to coarsely learn the pixel-wise spectral mapping. The loss function of the basic network is a reconstruction loss. The other four components are utilized to improve the performance of this basic network. In this subsection, ablation experiments for these four components are conducted on the WDCM database. The experimental results are shown in Table 4. The basic network achieves the worst performance, which indicates that spatial features are not adequately exploited by the basic network. The MLFA component is added to the basic network to demonstrate that aggregating features of different convolution layers can improve the performance of the basic network. After further introducing the spatial module into the basic network with the MLFA, the PSNR of the estimated HSI improves. Although the spatial module can improve the spatial quality of the estimated HSI, it cannot significantly reduce the spectral distortion. The cosine similarity loss is then further added to the basic network combined with the MLFA and the spatial module. As shown in Table 4, the cosine similarity loss can effectively alleviate the problem of spectral distortion in the estimated HSI. Finally, the fine-tuning strategy is added to the basic network combined with the other three components. The resulting SSRN shows superior performance, which demonstrates the effectiveness of the fine-tuning strategy. Therefore, the MLFA, the spatial module, the cosine similarity loss, and the fine-tuning are all crucial components of the proposed SSRN.

**Table 4.** Ablation experiments of SSRN on the WDCM databases.


1. <sup>√</sup> represents that the basic network is configured with the component. 2. <sup>×</sup> represents that the basic network is not configured with the component.

#### *4.3. Comparisons with Other Methods on Simulated Databases*

In this subsection, the PU and WDCM databases are employed to simulate HR MSIs to evaluate the proposed SSRN. The proposed SSRN is compared with several state-of-the-art HSI super-resolution methods, including coupled nonnegative matrix factorization (CNMF) (http://naotoyokoya.com/assets/zip/CNMF\_MATLAB.zip, accessed on 12 October 2020) [36], generalization of simultaneous orthogonal matching pursuit (GSOMP) (http://www.csse.uwa.edu.au/~ajmal/code/HSISuperRes.zip, accessed on 12 October 2020) [30], hyperspectral image super-resolution via subspace-based regularization (HySure) (https://github.com/alfaiate/HySure, accessed on 12 October 2020) [12], transfer learning-based super-resolution (TLSR) [26], unsupervised sparse Dirichlet-net (USDN) (https://github.com/aicip/uSDN, accessed on 20 October 2020) [28], and deep hyperspectral prior (DHSP) (https://github.com/acecreamu/deep-hs-prior, accessed on 20 October 2020) [27]. CNMF, GSOMP, and HySure are traditional methods. TLSR, USDN, and DHSP are recent unsupervised DL-based methods. On the PU and WDCM databases, the number of training epochs is set as 200 for SSRN. The learning rate of SSRN is initially set as 0.01, which then drops by a factor of 10 after every 100 epochs. The compared methods use the parameter settings from their original literature. All experiments are run 5 times, and the average results are reported.

#### 4.3.1. PU Database

The quantitative results of SSRN and the compared methods on the PU database are reported in Table 5. TLSR and DHSP perform worse than traditional methods, since TLSR and DHSP only employ a single hyperspectral image to reconstruct HR HSIs. TLSR and DHSP cannot utilize the spatial information of the MSI to estimate the HR HSI. USDN utilizes two autoencoder networks to extract spatial information from HR MSIs and spectral information from LR HSIs, respectively. USDN shows superior performance to the traditional methods. Different from CNMF, GSOMP, HySure, and USDN, the proposed SSRN learns a pixel-wise spectral mapping between MSIs and HSIs. In SSRN, the desired HSI is directly estimated from MSIs with the desired high spatial resolution, which can preserve the spatial structures. In addition, the proposed SSRN employs cosine similarity loss for training, which can reduce the distortion of spectral signatures. As shown in Table 5, the proposed SSRN outperforms other methods on the PU database.


**Table 5.** Quantitative experimental results on the Pavia University (PU) database.

To visualize the experimental results, the visual images and error maps of SSRN and the compared methods are displayed in Figure 4. The HSIs estimated by TLSR and DHSP are blurry, since TLSR and DHSP cannot utilize the spatial information of the MSI. In the estimated HSIs of TLSR and DHSP, the small targets that only cover one or two pixels are missing. As shown in the error maps, the proposed SSRN effectively preserves the spatial structures of the estimated HSI.

**Figure 4.** Visual images (R: 60, G: 30, and B: 10) and error maps of SSRN and the compared methods on the PU database. The error maps are the sum of absolute differences in all spectral bands between the estimated HSI and the ground truth.

#### 4.3.2. WDCM Database

The quantitative results of SSRN and the compared methods on the WDCM database are reported in Table 6. CNMF, GSOMP, and HySure show better performance than TLSR and DHSP, since the spatial information of the MSI is utilized. The performance of USDN can compete with CNMF, GSOMP, and HySure. As shown in Table 6, the performance of the proposed SSRN is better than the compared methods in terms of PSNR, RMSE, and SAM. In terms of UIQI and ERGAS, the proposed SSRN shows favorable performance, which is close to the results of GSOMP.

Visual images and error maps of SSRN and the compared methods on the WDCM database are shown in Figure 5. The visual images of CNMF, GSOMP, HySure, USDN, and the proposed SSRN have good visualization results, owing to the reliable spatial information provided by the HR MSI. As shown in the error maps, the errors of TLSR and DHSP are mainly concentrated on the edges of complicated land-covers. The proposed SSRN shows superior performance in complicated land-covers.


**Table 6.** Quantitative experimental results on the WDCM database.

**Figure 5.** Visual images (R: 50, G: 30, and B: 20) and error maps of SSRN and the compared methods on the WDCM database.

#### *4.4. Comparisons with Other Methods on Real Databases*

In this subsection, SSRN and the compared methods are evaluated on two real databases. On the Paris and Ivanpah Playa databases, the number of training epochs is set as 400 for SSRN. The learning rate of SSRN is initially set as 0.01, which then drops by a factor of 10 after 200 epochs.

#### 4.4.1. Paris Database

On the Paris database, HR MSIs and LR HSIs are captured at the same time instant. On this database, the LR HSI is generated from the original HSI for training. After spatial downsampling with a scaling factor of 4, the LR HSI contains only 18 × 18 pixels, which is far fewer than the number of pixels on the simulated databases. Insufficient pixels in the LR HSI make the proposed SSRN difficult to train. To alleviate this problem, the training samples are flipped left and right and are additionally rotated by 90, 180, and 270 degrees (a minimal sketch of this augmentation is given below). On the Paris database, the spectral response function is estimated with the method proposed in the literature [12]. The performance of SSRN and the compared methods on the Paris database is shown in Table 7. In comparison with the PU and WDCM databases, the performance of SSRN and the compared methods decreases due to the highly complicated land-cover distributions on the Paris database. As shown in Table 7, the proposed SSRN still performs better than the compared methods.
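A minimal sketch of the augmentation described above (left-right flips plus 90/180/270-degree rotations); the helper name and the (B, w, h) patch layout are our assumptions.

```python
import numpy as np

def augment(patches):
    """Return original, flipped, and rotated versions of each (B, w, h) patch."""
    out = []
    for p in patches:
        for q in (p, p[:, :, ::-1]):           # original and left-right flip
            for k in range(4):                 # rotations of 0/90/180/270 degrees
                out.append(np.rot90(q, k, axes=(1, 2)))
    return out
```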


**Table 7.** Quantitative experimental results on the Paris database.

Visual images and error maps of SSRN and the compared methods are shown in Figure 6. Since the proposed SSRN estimates the HSI directly from the MSI, the spatial information of the MSI can be fully utilized. As shown in Figure 6, compared to the error maps of other methods, the proposed SSRN effectively mitigates the spatial distortion.

#### 4.4.2. Ivanpah Playa Database

The Ivanpah Playa database is a real database that consists of an LR HSI collected on 26 October 2015 and an HR MSI captured on 17 December 2017. On the Ivanpah Playa database, the HR MSI and LR HSI are thus collected during different seasons. In practice, seasonal changes may result in the same land-cover material having different intrinsic spectral signatures [1]. Therefore, the intrinsic spectral signatures of the same land-cover may differ between the HR MSI and the LR HSI on the Ivanpah Playa database, which makes it challenging to perform HR MSI and LR HSI fusion on this database. On this database, similar to the literature [1], the spectral response function from calibration measurements (https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/documentlibrary/-/asset\_publisher/Wk0TKajiISaR/content/sentinel-2aspectral-responses, accessed on 17 December 2020) is employed for the compared methods.

Experimental results of SSRN and the compared methods on the Ivanpah Playa database are reported in Table 8. Different from that on the PU, WDCM, and Paris databases, TLSR and DHSP perform better than traditional methods and USDN on the Ivanpah Playa database. CNMF, GSOMP, HySure, and USDN usually rely on the assumption that the intrinsic spectral signatures of the same land-cover in HR MSIs and LR HSIs are the same [28,36]. In these methods, the spectral response function is usually directly used to obtain the spectral ingredients of HR MSIs from the spectral ingredients of LR HSIs. However, this assumption is not satisfied on the Ivanpah Playa database, which results in the performance degradations of CNMF, GSOMP, HySure, and USDN.

**Figure 6.** Visual images (R: 24, G: 14, and B: 3) and error maps of the proposed SSRN and the compared methods on the Paris database.

Different from CNMF, GSOMP, HySure, and USDN, the proposed SSRN does not rely on the assumption that the intrinsic spectral signatures of the same land-cover in HR MSIs and LR HSIs are the same. In the proposed SSRN, the fusion problem of the HR MSI and the LR HSI is considered a problem of spectral mapping learning. The proposed SSRN is utilized to directly learn the spectral mapping from the multispectral pixels to the hyperspectral pixels. Owing to the powerful nonlinear representation ability of deep convolutional networks, SSRN can model the spectral variability introduced by the seasonal changes between multispectral and hyperspectral pixels. In addition, because the HR MSI and LR HSI on the Ivanpah Playa database are collected at different time instants, the imaging environments (e.g., illumination, atmosphere, and weather) of the HR MSI and LR HSI are different. Different imaging environments may make it difficult to accurately obtain the real spectral response function [12]. In the proposed SSRN, only the loss function requires a spectral response function. To reduce the errors caused by the estimated spectral response function, the second term in the loss function Equation (9) and the fine-tuning strategy of SSRN are removed in the experiments on the Ivanpah Playa database. The same data augmentation as on the Paris database is also employed on the Ivanpah Playa database to increase the number of training samples. As shown in Table 8, in terms of PSNR, UIQI, RMSE, and ERGAS, the proposed SSRN shows superior performance.


**Table 8.** Quantitative experimental results on the Ivanpah Playa database.

Visual images and error maps are shown in Figure 7. The spatial structures on the Ivanpah Playa database are relatively smooth. Although TLSR and DHSP blur the high-frequency information of the reconstructed image, their experimental results in the smooth land-cover regions are favorable. According to the visual image and the error map of SSRN, the proposed SSRN effectively preserves the spatial structures of the HR MSI in the estimated HSI.

**Figure 7.** Visual images (R: 32, G: 20, and B: 8) and error maps of the proposed SSRN and the compared methods on the Ivanpah Playa database.

#### *4.5. Time Cost*

The total time cost of SSRN and the compared methods on the PU, WDCM, Paris, and Ivanpah Playa databases is shown in Table 9. In this paper, all experiments are conducted on the Ubuntu 14.04 system with 64 GB RAM, an Intel Core i7-5930K, and an NVIDIA TITAN X. CNMF, GSOMP, HySure, and TLSR are implemented with MATLAB. DHSP is implemented with PyTorch. USDN and the proposed SSRN are implemented with TensorFlow. The codes of the traditional methods (CNMF, GSOMP, and HySure) run on the CPU. Among the compared methods, the DL-based methods are TLSR, USDN, and DHSP. However, the code of TLSR provided by the original literature [26] runs on the CPU rather than the GPU. The codes of the other deep learning-based methods (USDN, DHSP, and the proposed SSRN) run on the GPU. As shown in Table 9, CNMF has superior computational efficiency. In general, DL-based methods take more time than traditional methods due to their plentiful weight parameters. In the training process, the inputs of the proposed SSRN are image patches, while the inputs of TLSR, USDN, and DHSP are entire images. Therefore, the proposed SSRN has a lower time cost than TLSR, USDN, and DHSP.


**Table 9.** Time cost of different methods on different databases (seconds).

#### **5. Discussion**

The performance of the proposed SSRN heavily depends on the learning of the spectral mapping. When the spectral information contained in the MSI is too limited, it becomes difficult to learn an effective spectral mapping, which may weaken the performance of the proposed SSRN. For instance, RGB images (special MSIs), containing only three spectral bands, have little spectral information. Similar colors in RGB images may represent different objects. In other words, similar RGB image pixels may correspond to different HSI pixels, which makes it challenging to learn the spectral mapping between MSIs and HSIs. In this subsection, to explore the performance of SSRN when the MSI contains little spectral information, the CAVE database [55] is employed to conduct experiments. The average quantitative results are reported in Table 10. On the CAVE database, the MSI contains only three spectral bands, making it challenging to learn the spectral mapping between MSIs and HSIs. In terms of PSNR, UIQI, RMSE, and ERGAS, the performance of SSRN is weaker than that of CNMF and HySure. In terms of SAM, the proposed SSRN outperforms the compared methods.

**Table 10.** Quantitative experimental results on the CAVE database.


#### **6. Conclusions**

In this paper, a spectral-spatial residual network is proposed to estimate HR HSI based on the observed HR MSI and LR HSI. Different from previous methods that focus on extracting spectral ingredients from LR HSI and extracting spatial ingredients from HR MSI, the proposed SSRN directly learns pixel-wise spectral mapping between MSIs and HSIs. In SSRN, a spectral module is proposed to extract spectral features from MSIs and a spatial module is proposed to explore the complementarity of homogeneous adjacent pixels to facilitate learning of spectral mapping. Finally, a self-supervised fine-tuning strategy is proposed to estimate the spectral mapping between HR MSIs and HR HSIs on the basis of the learned pixel-wise spectral mapping between LR MSIs and LR HSIs. Experiments on simulated and real databases show that SSRN can effectively reduce spatial and spectral distortions and can achieve superior performance. In the future, we will study more efficient deep networks for learning spectral mapping between MSIs and HSIs.

**Author Contributions:** Conceptualization, W.C. and X.Z.; methodology, W.C.; software, W.C.; validation, W.C., X.Z., and X.L.; formal analysis, X.Z.; investigation, W.C.; resources, X.Z.; data curation, W.C.; writing—original draft preparation, W.C.; writing—review and editing, X.Z.; visualization, X.Z.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded in part by the National Science Fund for Distinguished Young Scholars under grant 61925112, in part by the National Natural Science Foundation of China under grant 61806193 and grant 61772510, in part by the Innovation Capability Support Program of Shaanxi under grant 2020KJXX-091 and grant 2020TD-015, and in part by the Natural Science Basic Research Program of Shaanxi under grants 2019JQ-340 and 2019JC-23.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors would like to thank the editors and reviewers for their insightful suggestions.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


### *Article* **A Spatial-Enhanced LSE-SFIM Algorithm for Hyperspectral and Multispectral Images Fusion**

**Yulei Wang 1,2, Qingyu Zhu 1, Yao Shi 1, Meiping Song 1,\* and Chunyan Yu <sup>1</sup>**


**Abstract:** The fusion of a hyperspectral image (HSI) and a multispectral image (MSI) can significantly improve the ability of ground target recognition and identification. The quality of spatial information and the fidelity of spectral information are normally contradictory; however, these two properties are non-negligible indicators for multi-source remote-sensing image fusion. The smoothing filter-based intensity modulation (SFIM) method is a simple yet effective model for image fusion, which can improve the spatial texture details of the image well and maintain the spectral characteristics of the image. However, traditional SFIM has a poor effect on edge information sharpening, leading to a bad overall fusion result. In order to obtain better spatial information, a spatial filter-based improved LSE-SFIM algorithm is proposed in this paper. Firstly, the least square estimation (LSE) algorithm is combined with SFIM, which can effectively improve the spatial information quality of the fused image. At the same time, in order to better maintain the spatial information, four spatial filters (mean, median, nearest and bilinear) are used for the simulated MSI image to extract fine spatial information. Six quality indexes are used to compare the performance of the different algorithms, and the experimental results demonstrate that LSE-SFIM based on bilinear interpolation (LSE-SFIM-B) performs significantly better than the traditional SFIM algorithm and the other spatially enhanced LSE-SFIM algorithms proposed in this paper. Furthermore, LSE-SFIM-B obtains performance similar to three state-of-the-art HSI-MSI fusion algorithms (CNMF, HySure, and FUSE), while its computing time is much shorter.

**Keywords:** hyperspectral image; multi-source image fusion; SFIM; least square estimation; spatial filter

#### **1. Introduction**

In recent years, a large number of remote-sensing satellites have been launched continuously with the development of Earth observation technology [1,2]. Modern remote-sensing technology has reached a new developmental stage of multi-platform, multi-sensor, and multi-angle observation [3–5]. The continuous development of remote-sensing applications such as geological exploration [6], resource and environmental investigation [7–9], agricultural monitoring [10–12], urban planning [13–16], etc., has greatly promoted the demand for remote-sensing data and the improvement of the performance of satellite sensors. However, due to the limitations of optical diffraction, the modulation transfer function, the signal-to-noise ratio, and the sensor hardware conditions, a single sensor normally cannot obtain data with both high-spatial and high-spectral resolutions at the same time. Multi-sensor data fusion has emerged in response, as it can effectively exploit the complementary information from multi-platform observations, making land surface monitoring more accurate and comprehensive. Multi-source remote-sensing data fusion refers to the processing of multi-source data with complementary information in time or space according to certain rules, so as to obtain a more accurate and informative composite image than any single data source provides.

**Citation:** Wang, Y.; Zhu, Q.; Shi, Y.; Song, M.; Yu, C. A Spatial-Enhanced LSE-SFIM Algorithm for Hyperspectral and Multispectral Images Fusion. *Remote Sens.* **2021**, *13*, 4967. https://doi.org/ 10.3390/rs13244967

Academic Editors: Chein-I Chang, Haoyang Yu, Jiaojiao Yu, Lin Wang, Hsiao-Ch Li and Xiaorun Li

Received: 14 October 2021 Accepted: 3 December 2021 Published: 7 December 2021


A variety of multi-source remote-sensing fusion techniques have been developed in the last decade to enhance the spatial resolution of hyperspectral images and obtain information-rich HSI data with both high-spectral and high-spatial resolutions. The HSI-MSI fusion algorithms can be divided into the following four categories: component substitution (CS), multi-resolution analysis (MRA), spectral unmixing (SU), and Bayesian representation (BR). The idea of CS-based fusion algorithms is straightforward: they transform the original HSI data, replace the spatial information in the low-spatial HSI data set with the spatial information in the high-spatial MSI data set, and finally invert the reconstructed data to obtain the fused hyperspectral image. The typical CS-based methods were proposed by generalizing existing pansharpening methods for HSI-MSI fusion, including the IHS [17] transform method proposed by W. J. Carper in 1990, the PCA [18] transform proposed by P. S. Chavez in 1991, the Gram–Schmidt (GS) [19] transform proposed by B. Aiazzi in 2007, and their variants [20–22], etc. These methods are simple and easy to implement, but suffer serious spectral distortion and cannot be used well for the fusion of hyperspectral images. The MRA-based methods obtain the fusion result by filtering the high-resolution image and adding the high-frequency detail information to the hyperspectral image. The earliest MRA-based methods realized multi-scale image decomposition through pyramid transforms, while the most representative and most widely used multi-resolution analysis methods include the fusion methods based on various wavelet transforms [23], the smoothing filter-based intensity modulation (SFIM) [24] proposed by Liu, and the generalized Laplacian pyramid (GLP) method proposed by Aiazzi [25]. The advantages of these MRA-based methods are less spectral distortion and anti-aliasing, but the algorithms are complex and spatial feature loss often occurs in the fusion results. The SU-based methods utilize the hyperspectral linear unmixing model and apply it to the fusion optimization model. The advantages of these methods are less spatial and spectral information loss, but they always have higher computational complexity. Typical methods include the coupled non-negative matrix factorization (CNMF) [26] method proposed by Yokoya in 2012, the subspace-based regularization (HySure) [27] method proposed by Simoes in 2015, and the coupled sparse tensor factorization (CSTF) [28] method proposed by Li in 2018. The BR-based methods transform the problem of high-resolution image and hyperspectral image fusion into the problem of solving a Bayesian optimization model, and obtain the fusion result by solving the optimization. Typical BR-based methods include the maximum a posteriori-stochastic mixing model (MAP-SMM) [29] proposed by Eismann in 2004, the Bayesian sparse method [30] proposed by Wei in 2015, and the fast fusion based on the Sylvester equation (FUSE) [31] method proposed by Wei in 2015. The BR-based methods have the advantage of less spatial and spectral information loss, but at the cost of high computational complexity.

Recently, an increasing number of HSI-MSI fusion algorithms have been proposed [32–34]. These algorithms have been proven to be effective with good fusion performance. However, most researchers focus too much on performance improvements using modern techniques such as sparse representation, deep learning, etc., while ignoring the computing time. In other words, these algorithms improve the fusion performance at the cost of increased computational complexity. As one of the effective MRA-based fusion methods, SFIM was proposed by Liu [24] for image fusion, as mentioned above. Compared with traditional methods, SFIM is simple to calculate and easy to implement, and the spectral information is normally retained well, but there are problems such as fuzzy edge information and insufficient improvement of detailed spatial information. In recent years, many improved SFIM algorithms have been studied, most of which focus on how to obtain simulated multispectral images with spatial information characteristics consistent with the hyperspectral image and spectral features consistent with the multispectral image. This paper combines the least square estimation (LSE) algorithm with SFIM, which can effectively improve the spatial information quality of the fused image. This paper also compares several spatial filters for extracting spatial information to enhance the boundary spatial information of the simulated MSI, and proposes an improved LSE-SFIM fusion algorithm based on spatial information enhancement to obtain an optimal fusion result. Experimental results on three HSI-MSI data sets show the effectiveness of the proposed algorithm using six image quality indexes.

The remainder of this article is organized as follows. Section 2 gives a detailed description of the proposed method. In Section 3, experimental results and analysis of different data sets are presented. Finally, conclusions are drawn in Section 4.

#### **2. Proposed Method**

#### *2.1. Basic Smoothing Filter-Based Intensity Modulation (SFIM) Algorithm*

The SFIM algorithm was proposed by Liu for image fusion in 2000 and is based on a simplified solar radiation and surface reflection model. Even though it was proposed some time ago, this algorithm is still in use due to its simplicity and good spectral preservation. The basic principle of the traditional SFIM is expressed as follows [24]:

$$\text{DN}_{\text{SFIM}} = \frac{\text{DN}_{\text{low}} \cdot \text{DN}_{\text{high}}}{\text{MeanDN}_{\text{high}}}, \tag{1}$$

where $\text{DN}_{\text{low}}$ and $\text{DN}_{\text{high}}$ represent the gray values of the low-resolution and high-resolution images, respectively, and $\text{MeanDN}_{\text{high}}$ represents the simulated low-resolution image obtained by taking the local mean of $\text{DN}_{\text{high}}$.
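A minimal NumPy/SciPy sketch of Equation (1) for a single band pair; it assumes the low-resolution band has already been up-sampled to the high-resolution grid, and the 3 × 3 smoothing window and epsilon guard are our choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sfim_band(dn_low, dn_high, win=3, eps=1e-12):
    """Basic SFIM of Eq. (1): modulate the (up-sampled) low-resolution band by
    the ratio of the high-resolution band to its local mean."""
    mean_high = uniform_filter(dn_high.astype(np.float64), size=win)
    return dn_low * dn_high / (mean_high + eps)
```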

#### *2.2. The Proposed Spatial Filter-Based Least Square Estimation (LSE)-SFIM*

For HSI-MSI image fusion, Formula (1) can be expressed as:

$$\text{Fusion} = \frac{\text{HSI}' \times \text{MSI}}{\text{MSI}''}, \tag{2}$$

where $\text{HSI}'$ is the up-sampled version of the original low-resolution hyperspectral data HSI, MSI represents the original high-resolution multispectral data, and $\text{MSI}''$ is the up-sampled version of $\text{MSI}'$, where $\text{MSI}'$ represents the simulated low-resolution image obtained from the MSI. The algorithm performance is influenced by two factors: (1) how to obtain the simulated low-resolution image $\text{MSI}'$; and (2) how to obtain the up-sampled $\text{HSI}'$ and $\text{MSI}''$. The traditional SFIM uses a mean filter to obtain the simulated low-resolution $\text{MSI}'$ (down-sampling) and uses the same filter to obtain the up-sampled $\text{HSI}'$ and $\text{MSI}''$. Edge information is lost by the mean filters, and this paper takes two steps to solve this problem: (1) least squares estimation (LSE) is used to adjust the coefficients so that $\text{MSI}'$ has spatial information as similar as possible to the original HSI image, with the details discussed in Section 2.2.1; (2) filtering and interpolation methods are compared in the up-sampling stage to obtain the best enhanced spatial information, with the details discussed in Section 2.2.2. The bilinear approach proved to be the best in the experiments of this paper. The flow chart of the proposed algorithm is shown in Figure 1.

In order to make clear how we obtain MSI, $\text{MSI}'$, $\text{MSI}''$, HSI, and $\text{HSI}'$, Figure 2 gives a graphic abstract with the detailed steps of the proposed algorithm. It can easily be seen from Figure 2 that MSI is the original high-spatial multispectral data set; $\text{MSI}'$ is the down-sampled version of MSI, whose spatial size is shrunk to that of the original HSI (LSE is used here to adjust $\text{MSI}'$ so as to preserve better spatial information); and $\text{MSI}''$ is the up-sampled version of $\text{MSI}'$ with the same size as the original high-spatial resolution MSI. HSI is the original low-spatial resolution hyperspectral data set, and $\text{HSI}'$ is the up-sampled version of HSI with the same size as the high-spatial resolution MSI.

**Figure 1.** Flowchart of the proposed fusion algorithm.

**Figure 2.** Graphic abstract with detailed steps of the proposed algorithm.

#### 2.2.1. Least Square Estimation Based SFIM Algorithm (LSE-SFIM)

Assuming that there is an ideal simulated multispectral image $\text{MSI}'$, it should have two characteristics: (1) its spatial information characteristics are consistent with the original hyperspectral image, which ensures that the spatial information of the hyperspectral image is counteracted; and (2) its spectral information characteristics are consistent with the original multispectral image, which ensures that the spectral characteristics of the multispectral image are counteracted. The least square estimation algorithm solves these two problems well.

The LSE algorithm finds the best matching function by minimizing the sum of squared errors. It is often used to solve for linear regression coefficients in the processing of remote-sensing images. The LSE-SFIM algorithm uses LSE to solve for the linear regression coefficient that minimizes the spatial information error between the hyperspectral image and the simulated multispectral image $\text{MSI}'$, so that the latter has as much of the hyperspectral image's spatial information as possible.

The LSE-SFIM fusion algorithm first down-samples the high-resolution multispectral image MSI and extracts its spatial information to obtain the simulated $\text{MSI}'$. It then uses the least square estimation algorithm to solve for the linear regression coefficient that minimizes the spatial information error between $\text{MSI}'$ and the hyperspectral image, and uses this linear regression coefficient to update $\text{MSI}'$ and $\text{MSI}''$. This ensures that the spectral information of $\text{MSI}'$ stays close to that of the MSI, while both $\text{MSI}'$ and $\text{MSI}''$ carry spatial information as close to that of the HSI as possible (an illustrative least-squares sketch is given below). Finally, $\text{HSI}'$ and $\text{MSI}''$ are obtained by up-sampling, which is introduced in the next section, and fused band by band to obtain the fused image.
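The following is an illustrative least-squares sketch of the adjustment step, fitting a gain and offset per band with `np.linalg.lstsq`; the exact regression variables used by the paper may differ.

```python
import numpy as np

def lse_adjust(msi_sim, hsi_band):
    """Fit hsi_band ~ a * msi_sim + b by least squares and apply the fit,
    so the simulated MSI band carries spatial information close to the HSI."""
    x = msi_sim.astype(np.float64).ravel()
    y = hsi_band.astype(np.float64).ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)     # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * msi_sim + b                         # adjusted simulated band
```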

#### 2.2.2. Spatial Information Enhanced LSE-SFIM

When using LSE-SFIM for fusion, the most critical step concerns how to obtain a simulated multispectral image whose spatial information features are consistent with the hyperspectral image and whose spectral features are consistent with the multispectral image, so as to effectively improve the spatial resolution of the hyperspectral image and achieve the purpose of fusion. This step is the up-sampling process to obtain $\text{MSI}''$ and $\text{HSI}'$ in Figure 2. This paper compares several methods of extracting boundary spatial information from low-spatial resolution multispectral images, and selects the one that gives the best fusion result.

#### Filtering Method

Mean filtering and median filtering are two commonly used filtering methods. The main idea of mean filtering is to replace the gray value of the central pixel with the mean of the gray values of that pixel and its surrounding pixels within a window, so as to achieve the purpose of filtering. Mean filtering can be simplified as Equation (3):

$$g(x, y) = \frac{1}{M} \sum_{(i, j) \in W} f(i, j), \tag{3}$$

where $M$ is the filtering window size (the number of pixels within the current window), and $W$ is the current window.

Median filtering, as the name implies, replaces the value of a pixel with the median of the gray-scale values in the neighborhood window of that pixel. Median filtering can be simplified as Equation (4):

$$g(x, y) = \text{med}\{f(x - k, y - l)\}, \quad (k, l) \in W, \tag{4}$$

Taking a 3 × 3 window size as an example, mean and median filtering are shown in Figure 3, where (a) represents the gray values before filtering, (b) represents the gray values after mean filtering, and (c) represents the gray values after median filtering.

**Figure 3.** 3 × 3 mean and median filtering, (**a**) gray values before filtering in green color, (**b**) gray values after mean filtering in red color, and (**c**) gray values after median filtering in purple color.
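Equations (3) and (4) correspond directly to standard library filters; a brief SciPy sketch with an assumed 3 × 3 window and a stand-in image:

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

band = np.random.rand(64, 64)               # stand-in gray-value image
mean_out = uniform_filter(band, size=3)     # 3 x 3 mean filtering, Eq. (3)
median_out = median_filter(band, size=3)    # 3 x 3 median filtering, Eq. (4)
```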

#### Interpolation Method

Image interpolation is a basic technique in image processing. Nearest neighbor interpolation and bilinear interpolation are two commonly used image interpolation algorithms. The nearest neighbor interpolation algorithm requires the least computation and has the simplest principle: the gray value of the nearest point among the neighboring pixels around the point to be sampled is used as the gray value of that point. There is an approximately linear relationship between the pixel values of nearby points in an image. Following this idea, the bilinear interpolation algorithm considers the pixel values in the horizontal and vertical directions at the same time, so that the problem of grayscale discontinuity in the image is alleviated and the overall effect of the image is improved; a brief sketch of both schemes follows.
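Both schemes are available through `scipy.ndimage.zoom`; a brief sketch, with the scale factor of 8 and the 70 × 40 band size borrowed from the Pavia University setup below purely for illustration:

```python
import numpy as np
from scipy.ndimage import zoom

lr_band = np.random.rand(70, 40)            # stand-in low-resolution band
nearest = zoom(lr_band, 8, order=0)         # nearest neighbor interpolation
bilinear = zoom(lr_band, 8, order=1)        # bilinear interpolation
```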

#### **3. Experimental Results and Analysis**

In order to verify the effectiveness of the algorithm proposed in this paper, three simulation data sources are selected, namely Pavia University, Chikusei, and HyMap Rodalquilar. The experiments are programmed in MATLAB R2018b on a Windows 10 64-bit system with an Intel (R) Core (TM) i5-8250U processor and 8 GB of memory.

#### *3.1. Hyperspectral Datasets*

In order to evaluate the performance of the fusion method objectively and quantitatively, we use low-spatial resolution hyperspectral images obtained by re-sampling real data in the spatial domain, and high-spatial resolution multispectral images obtained by degrading the same data in the spectral domain, to carry out simulation experiments. Table 1 shows the parameters of the three datasets used in this paper for the verification experiments.


**Table 1.** Parameters of three hyperspectral datasets.

#### 3.1.1. Pavia University

The Pavia University dataset was acquired by the ROSIS sensor in 2001. The image size is 610 × 340 with a spatial resolution of 1.3 m per pixel, and the experimental data selected in this section contain 560 × 320 pixels. The spectral range is 0.43–0.86 μm with a total of 115 bands, of which 103 bands remain after removing 12 noise bands. The low-spatial resolution HSI image was obtained from the original HSI data through isotropic Gaussian point spread function down-sampling by a factor of eight, with a total of 103 bands and 70 × 40 pixels; the pseudo-color image is shown in Figure 4a. The high-spatial resolution MSI image data were synthesized from the original HSI data by spectral down-sampling according to the SRF of the ROSIS sensor. There are four bands in total with an image size of 560 × 320, as shown in Figure 4b. The reference image of the original HSI is shown in Figure 4c.

**Figure 4.** Pavia University datasets with (**a**) low spatial resolution hyperspectral image (HSI), (**b**) high spatial resolution multispectral image (MSI) and (**c**) high spatial resolution HSI as the reference.

#### 3.1.2. Chikusei

The Chikusei dataset was collected by a Headwall Hyperspec-VNIR-C sensor on 29 July 2014 in Chikusei City, Japan. It was then produced and published by Naoto Yokoya and Akira Iwasaki of the University of Tokyo [26]. The spatial resolution is 2.5 m and the scene consists of 2517 × 2335 pixels, mainly covering agricultural and urban areas. In the experiment, a 540 × 420 pixel image is selected. The spectral range of the data is 0.36–1.02 μm, including 128 bands in total. The low-spatial resolution HSI image was obtained from the original HSI data through isotropic Gaussian point spread function down-sampling by a factor of six, with a total of 128 bands and 90 × 70 pixels; the pseudo-color image is shown in Figure 5a. The high-spatial resolution MSI data were synthesized from the original HSI data according to the SRF of the WV-2 sensor, with eight bands and 540 × 420 pixels; the pseudo-color image of the MSI is shown in Figure 5b. The reference image of the original HSI is shown in Figure 5c.

**Figure 5.** Chikusei datasets with (**a**) low-spatial resolution HSI, (**b**) high-spatial resolution MSI and (**c**) high-spatial resolution HSI as the reference.

#### 3.1.3. HyMap Rodalquilar

The HyMap image was taken in Rodalquilar, Spain in June 2003 [35], covering a gold mining area in the Cabo de Gata Mountains. The spatial resolution of the data is 10 m. The experimental data selected in this paper contain 867 × 261 pixels. After removing the water absorption bands, 167 bands are selected for experimentation, and the spectral range is 0.4–2.5 μm. The low-spatial resolution HSI image was obtained from the original HSI data through isotropic Gaussian point spread function down-sampling by a factor of three, with a total of 167 bands and 289 × 87 pixels; the resulting pseudo-color image is shown in Figure 6a. The high-spatial resolution MSI image data were synthesized from the original HSI data by spectral down-sampling according to the SRF of the HyMap sensor. There are four bands in total with an image size of 867 × 261 pixels, as shown in Figure 6b. The reference image is shown in Figure 6c.

#### *3.2. Comparative Analysis of the Proposed Spatial Enhanced LSE-SFIM Using Different Spatial Filters*

In this section, the different spatial enhancement methods of Section 2.2.2 are used to extract boundary information and obtain better fusion results, and their performance is discussed to find the best method. Six methods are compared: the traditional SFIM (named SFIM), LSE-based SFIM (named LSE-SFIM), mean filtering LSE-SFIM (named LSE-SFIM-M), median filtering LSE-SFIM (named LSE-SFIM-Med), nearest neighbor interpolation LSE-SFIM (named LSE-SFIM-N), and bilinear interpolation LSE-SFIM (named LSE-SFIM-B). Both subjective and objective evaluations are discussed, and spectral distortion is compared among all six algorithms.

**Figure 6.** HyMap Rodalquilar datasets with (**a**) low spatial resolution HSI, (**b**) high spatial resolution MSI and (**c**) high spatial resolution HSI as the reference.

#### 3.2.1. Subjective Evaluation

The subjective evaluation mainly uses human eyes to observe the fusion results. The comparison of the fusion results of the three groups of experiments is shown in Figures 7–9. Observing the fusion results from a subjective point of view, it can be seen that using bilinear interpolation to obtain the simulated MSI gives better visibility, and the fusion result obtained has a clearer texture and better spectrum retention.

**Figure 8.** Fusion results of Chikusei using six SFIM-based algorithms.

**Figure 9.** Fusion results of HyMap Rodalquilar using six SFIM-based algorithms.

Figure 7a–f shows the fusion results of the Pavia University dataset using SFIM, LSE-SFIM, LSE-SFIM-M, LSE-SFIM-Med, LSE-SFIM-N, and LSE-SFIM-B, respectively. It can be seen from the figures that the spectral characteristics of the fusion results of all methods are maintained well. In terms of spatial geometric features, the fusion result of LSE-SFIM-N is visually blurred, and the edge details are not highlighted. The fusion results obtained by the other algorithms have higher clarity, and the LSE-SFIM-B algorithm, whether in terms of spectral or spatial characteristics, has the visual effect closest to the reference image.

Figure 8a–f shows the fusion results of the Chikusei dataset using the same six algorithms, namely SFIM, LSE-SFIM, LSE-SFIM-M, LSE-SFIM-Med, LSE-SFIM-N, and LSE-SFIM-B. It is obvious that, in terms of spectral characteristics, the fusion results of each method show no obvious spectral distortion of the ground objects, and the color information performs well. In terms of spatial features, the fusion result of LSE-SFIM-N has unclear ground textures and indistinct edge details. The fusion results obtained by the other algorithms maintain both the texture features and the edge details of the ground features, especially the LSE-SFIM-B algorithm, which retains more spatial features and renders the edges of the ground features more distinctly.

Figure 9a–f shows the fusion results of the HyMap Rodalquilar data set using the six algorithms, namely SFIM, LSE-SFIM, LSE-SFIM-M, LSE-SFIM-Med, LSE-SFIM-N, and LSE-SFIM-B. In terms of spectral characteristics, the fusion results of each method are not overly distorted in maintaining the ground object spectrum, and the color information is maintained well. In terms of spatial features, the LSE-SFIM-N fusion result has a poor spatial information enhancement effect. The fusion results obtained by the other algorithms maintain the texture features of the hyperspectral and multispectral images well, especially the LSE-SFIM-B algorithm, which maintains the spatial characteristics better and shows the most obvious feature edges.

In general, through the subjective evaluation of the fusion results by human eyes, it can be found that the proposed LSE-SFIM-B fusion algorithm has the best performance, and fusion images with clearer boundaries can be obtained from the visual results, especially in Figures 7 and 8. The LSE-SFIM-B algorithm makes full use of the complementary characteristics of the HSI and MSI images, realizes the fusion of spectral and spatial features of multiple source images, improves the geometric features of ground objects, and verifies the effectiveness of this algorithm. As for Figure 9, due to the reduced display size of the images, the spatial information enhancement of some images is not easy to see, and it is difficult to judge subjectively which method is better. Therefore, objective evaluation is particularly important.

#### 3.2.2. Objective Evaluation

By observing the fusion results in Figures 7–9, it can be seen that obtaining the simulated multispectral images with the LSE-SFIM-B method gives better visibility, and the fusion results obtained have clearer texture and better spectrum retention. To further objectively evaluate the quality of the fusion images produced by the different algorithms, this paper also analyzes the fusion results from a quantitative perspective by comparing six objective evaluation indicators: PSNR (peak signal-to-noise ratio), SAM (spectral angle mapper), CC (cross correlation), Q2*<sup>n</sup>* (quality 2*n*), RMSE (root mean square error), and ERGAS (erreur relative globale adimensionnelle de synthèse). The quantitative comparisons are shown in Tables 2–4, with the histogram comparisons of the evaluation indicators shown in Figures 10–12.

**Table 2.** Quantitative comparisons of fusion performance by six algorithms for Pavia University.


**Table 3.** Quantitative comparison of fusion results by six algorithms for Chikusei.


**Table 4.** Quantitative comparison of fusion results by six algorithms for HyMap Rodalquilar.


According to Table 2 and Figure 10, it can be seen that for the Pavia University data, the methods based on the LSE-SFIM algorithm have a better fusion effect than traditional SFIM; the LSE-SFIM-M and LSE-SFIM-B algorithms are better than the original LSE-SFIM algorithm; and LSE-SFIM-B is even better than LSE-SFIM-M, making it the best among the compared methods. In terms of PSNR, CC, and Q2*n*, the LSE-SFIM-B fusion algorithm scores significantly higher than the other algorithms, indicating that the spatial quality of the fused image is better, the fusion result has more detailed spatial information, and it is more strongly correlated with the reference image. In terms of SAM, RMSE, and ERGAS, the LSE-SFIM-B fusion algorithm is still superior to the other algorithms, indicating that the fusion result better maintains the spectrum, its error with respect to the reference image is the smallest, and it is the closest to the reference image.

**Figure 10.** Histograms comparison of evaluation indicators by six algorithms for Pavia University.

**Figure 11.** Histogram comparison of evaluation indicators by six algorithms for Chikusei.

**Figure 12.** Histogram comparison of evaluation indicators by six algorithms for HyMap Rodalquilar.

In order to further illustrate that the algorithms proposed in this paper perform well on different images, Table 3 and Figure 11 give the objective evaluation for the Chikusei data, and Table 4 and Figure 12 give the objective evaluation for the HyMap Rodalquilar data. It can be seen that for all the data sets, the fusion algorithm using LSE-SFIM-B is also the most outstanding in terms of spatial resolution enhancement and spectral characteristic maintenance. Specifically, the LSE-SFIM-B algorithm scores significantly higher than the other algorithms in terms of PSNR, CC, and Q2*n*, indicating that the fused image has good spatial quality information, the fusion result is more detailed, and the correlation with the reference image is relatively high. In terms of SAM, RMSE, and ERGAS, the LSE-SFIM-B fusion algorithm has the best performance, indicating that the error between the fusion result and the reference image is the smallest, and the spectrum is better maintained.

#### 3.2.3. Spectral Distortion Comparison

A good fusion method should minimize spectral distortion as much as possible while improving the spatial resolution. In this section, to further analyze the spectral distortion of the different SFIM-based algorithms, Figures 13–15 show SAM plots of the experimental results on the three hyperspectral data sets. The SAM plot computes, for every pixel, the SAM value between the fusion result and the reference image. In the figures, each pixel is colored on a cold-to-warm scale to indicate the level of spectral similarity at that pixel: the warmer the color (the closer to dark red), the lower the spectral similarity and the worse the spectral quality relative to other pixels; the cooler the color (the closer to dark blue), the higher the spectral similarity and the spectral quality. The larger the area occupied by the blue part in the figure, the better the overall spectral quality. Compared with the other algorithms, it can be seen from the SAM plots of the experimental results that the spectral performance of LSE-SFIM-B on the three data sets is relatively better; a per-pixel sketch of this computation is given below.
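A sketch of how such a per-pixel SAM map can be computed (our helper, in radians); plotting the returned map with a cool-to-warm colormap reproduces the visualization described above.

```python
import numpy as np

def sam_map(ref, fused, eps=1e-12):
    """Per-pixel spectral angle between two (B, W, H) cubes; returns (W, H)."""
    num = np.sum(ref * fused, axis=0)
    den = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0) + eps
    return np.arccos(np.clip(num / den, -1.0, 1.0))
```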

**Figure 15.** SAM map of six algorithms for HyMap Rodalquilar data experiment results.

#### 3.2.4. Influence of Spatial Scale Factor between MSI and HSI

The above experiments have shown the LSE-SFIM-B algorithm to be the most effective among all the SFIM-based algorithms. It would be interesting to know how the proposed LSE-SFIM-B algorithm performs as the spatial scale factor between the high-resolution MSI and the low-resolution HSI changes. In order to see how the algorithm performs for different scale factors, the Pavia University data are used in this section, where the spatial scale factors (SF) are set to SF = 2, 4, and 8, respectively. The performance comparison is given in Table 5, where the spatial scale value is the down-sampling rate of the HSI data.


**Table 5.** Performance of different spatial scale factors of MSI and HSI data using LSE-SFIM-B algorithm for Pavia University.

The results in Table 5 are very interesting and show that the PSNR, SAM, CC, Q2*<sup>n</sup>* and RMSE values tend to worsen as the spatial scale factor increases, while the ERGAS value becomes smaller (better) as the spatial scale factor increases.

#### *3.3. Performance Analysis of the Proposed SFIM-Based Algorithm and Other Commonly Used Algorithms*

In order to compare the fusion performance of the proposed SFIM-based algorithm with existing representative algorithms, this section chooses some state-of-the-art algorithms for comparison: CNMF proposed in 2012, HySure proposed in 2015, and FUSE proposed in 2015. Since the LSE-SFIM-B method was shown to be the most effective in the previous section, it is used for the comparison. Since the robustness of the algorithm across different data sets was also demonstrated in the previous section, only the Chikusei data set is used here to reduce repetition.

The experimental settings are as follows: (1) for the Chikusei data set, the number of endmembers is set to *D* = 30 for any algorithm that requires it. (2) In the CNMF algorithm, the maximum number of iterations for the inner loops (*Iin*) and the maximum number of iterations for the outer loops (*Iout*) are *Iin* = 200 and *Iout* = 2. (3) In the HySure algorithm, the parameters are set to *λϕ* = 10<sup>−3</sup> and *λ<sup>B</sup>* = *λ<sup>R</sup>* = 10.

Figure 16a–e shows the fusion results of the Chikusei dataset obtained by five algorithms: the conventional SFIM, the proposed LSE-SFIM-B, CNMF, HySure, and FUSE. The visual effects of the last four algorithms seem similar, while SFIM appears to perform the worst. In order to further evaluate the performance of the proposed LSE-SFIM-B algorithm and the three state-of-the-art algorithms, Table 6 gives the comparison of the objective evaluation indicators PSNR, SAM, CC, Q2*n*, RMSE, and ERGAS.

**Figure 16.** Fusion results of Chikusei dataset by five different algorithms.


**Table 6.** Quantitative comparison of fusion results by five different algorithms for Chikusei.

It can be seen that the conventional SFIM algorithm has the worst performance on all six indicators. The other four algorithms, including the proposed LSE-SFIM-B, CNMF, HySure, and FUSE, have similar performance. The HySure algorithm is optimal on four of the six indicators, and the other two optimal indicators are obtained by the proposed LSE-SFIM-B algorithm. However, as mentioned in Section 1, the SFIM-based algorithm is simple to calculate and easy to implement. To verify the computational complexity of these different algorithms, Table 7 shows the computing time comparisons, and the proposed LSE-SFIM-B algorithm has the shortest computing time. To further show the time efficiency of LSE-SFIM-B, the speed-up ratio is also provided in Table 7, calculated as the computing time of each of the other three algorithms (CNMF, HySure, and FUSE) divided by the computing time of the proposed LSE-SFIM-B algorithm. As a result, even though the HySure algorithm has four performance indicators better than the LSE-SFIM-B algorithm, it is time consuming; especially when the data set is very large, LSE-SFIM-B demonstrates excellent time efficiency while maintaining comparable performance.

**Table 7.** Computational complexity analysis by five different algorithms for Chikusei.


#### **4. Discussion and Conclusions**

This paper proposes a spatial-enhanced LSE-SFIM algorithm for HSI-MSI images fusion. The contributions of the proposed algorithm can be summarized as follows:


The proposed algorithm achieves good performance in most cases, performs better than the traditional SFIM algorithm with better spatial preservation and less spectral distortion, and also has lower computational complexity than the state-of-the-art fusion algorithms. However, the spectral fidelity is not good enough, since the SFIM-based model performs the fusion band by band without considering the spectral correlations. Adding spectral constraints to the model can be considered in a future study.

**Author Contributions:** Conceptualization, Y.W.; Methodology, Y.W., Q.Z. and Y.S.; Experiments: Y.S.; Data Curation, Y.W. and M.S.; Formal Analysis, M.S. and C.Y.; Writing—Original Draft, Y.W. and Y.S.; Writing—Review & Editing: Y.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work of Y. Wang was supported in part by the National Nature Science Foundation of China (61801075), China Postdoctoral Science Foundation (2020M670723), the Fundamental Research Funds for the Central Universities (3132019341), and Open Research Funds of State Key Laboratory of Integrated Services Networks, Xidian University (ISN20-15). The work of Meiping Song was supported by the National Nature Science Foundation of China (61971082).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


*Article*
