1. Introduction
Hyperspectral images, also known as hyperspectral remote sensing images, are three-dimensional image cubes captured by airborne or spaceborne platforms equipped with hyperspectral imagers. They consist of two spatial dimensions and one spectral dimension. The spectral dimension contains tens or even hundreds of spectral bands, which gives these images broad application prospects in areas such as military target detection [
1]; atmospheric and environmental research [
2]; forest vegetation cover detection [
3]; and change area detection [
4]. Hyperspectral image classification is a commonly used technique in the applications listed above. However, the excessive redundancy of spectral information and the limited number of training samples pose a great challenge for hyperspectral image classification.
In the early research on hyperspectral image classification, methods such as support vector machine (SVM) [
5], multinomial logistic regression (MLR) [
6] and sparse representation classification (SRC) [
7] were proposed. These methods take the original spectral vectors directly as training samples and train the classifier on the spectral information of the hyperspectral image alone. However, such methods ignore two problems: (1) the large amount of redundant information in the spectral bands makes it difficult to train the model; (2) hyperspectral images exhibit strong spatial correlation, which purely spectral classifiers leave unexploited. To solve problem (1), dimensionality reduction strategies [
8,
9] (feature selection [
10] and feature extraction [
11]) are applied to hyperspectral image classification tasks. To solve problem (2), morphological contours [
12] and Gabor features [
13] are used to extract spatial information, and the morphological kernel [
14] and composite kernel [
15] methods are used to extract spectral–spatial information. Although the aforementioned methods improve classifier accuracy, they struggle to achieve good classification results in complex scenes because they rely on shallow models and large numbers of labeled samples and therefore cannot extract deep features from the samples.
Deep learning (DL) has shown strong capabilities in automatically extracting nonlinear and hierarchical features, and thus has been widely used in information extraction [
16], image classification [
17], semantic segmentation [
18] and object detection [
19]. Therefore, a number of hyperspectral image classification methods based on deep learning have been proposed. In [
20], Zhou et al. used a stacked auto-encoder (SAE) to extract spectral and spatial features and used logistic regression to obtain classification results. In [
21], Szegedy et al. used a restricted Boltzmann machine (RBM) and a deep belief network (DBN) for classification. In [
22], Ma et al. used a spatially updated deep auto-encoder (DAE) to extract spectral–spatial features and designed a co-representation mechanism to handle small-scale training sets. In [
23], Zhang et al. utilized a recursive auto-encoder to learn the spatial and spectral information and adopted a weighting scheme to fuse the spatial information. Although these methods can extract the spectral–spatial features of hyperspectral images to a certain extent, they destroy the spatial structure. Since convolutional neural networks (CNNs) can exploit spatial features while preserving the original spatial structure, some methods based on CNNs have been proposed. In [
24], Zhao et al. employed a CNN as the feature extractor. In [
25], Zhang et al. proposed a method based on a differentiated region convolutional neural network (DRCNN), which takes different image patches within the neighborhood of the target pixel as the CNN input, thereby effectively enriching the input data. In [
26], Lee et al. proposed a contextual deep CNN (CDCNN) with deeper and wider network layers.
In general, deep-level features in the image can be extracted by increasing the depth of the network, but this also causes problems such as difficulty in model training and gradient vanishing. The residual network (ResNet) [
27] and dense convolutional network (DenseNet) [
28] address this problem effectively, and such networks can extract deep-level features without increasing the depth of the network structure. Inspired by ResNet, the authors of [
29] proposed a spectral–spatial residual network (SSRN) that contains a spectral residual block and a spatial residual block for sequentially extracting spectral and spatial features. Inspired by DenseNet, the authors of [
30] proposed a fast dense spectral–spatial convolutional network (FDSSC), which achieves better performance while reducing training time. Although the aforementioned methods address feature extraction with CNNs, the convolutional layers do not attach equal importance to all features during training. Attention mechanisms, which weight different features differently in order to refine the extracted features, have therefore become a research hotspot in recent years. One study [
31] proposed a feedback attention-based dense CNN, while another [
32] proposed a dual-branch multi-attention mechanism network (DBMA) based on the convolutional block attention module (CBAM) [
33]. Moreover, [
34] proposed a dual-branch dual-attention mechanism network (DBDA) based on the dual-attention network (DANet) [
35]. Although these methods are effective, they do not sufficiently extract and exploit the spatial and spectral information of hyperspectral images, and therefore fail to achieve better classification results when training samples are limited.
Inspired by pyramidal convolution (PyConv) [
36] and DenseNet, and aiming at the two problems of missing features and insufficient feature extraction, this paper proposes a hyperspectral image classification method based on dense pyramidal convolution and multi-feature fusion (DPCMF). The proposed method consists of two branches, a spatial branch and a spectral branch, which capture spatial and spectral features, respectively. In the spatial branch, principal component analysis (PCA) is first performed to reduce the dimensionality of the image samples, removing noise and redundant spectral information while retaining the significant spectral information. Then, the dense pyramidal convolution module and a non-local block [
37] are used to extract multi-scale local spatial information and global spatial information from the image samples, which are then fused to obtain a spatial feature map. In the spectral branch, a convolutional layer first operates on the image samples, and the spectral information is then extracted through dense pyramidal convolution to obtain spectral feature maps. Lastly, the spatial and spectral feature maps are fused and fed into the classification module to obtain the classification results. The three main contributions of this paper are described below.
To address the problem of missing features, a hyperspectral image classification method (DPCMF) based on dense pyramidal convolution and multi-feature fusion is proposed. The method uses a spatial branch and a spectral branch to extract multi-scale local spatial features and spectral features, respectively, while non-local blocks capture global spatial information; these features are fused to obtain the final feature maps.
To address the problem of insufficient feature extraction, the feature extraction part of both the spatial and spectral branches combines pyramidal convolution with DenseNet-style dense connectivity. Without increasing the depth of the network, convolutional kernels of decreasing size extract deep-level features at different scales (a minimal sketch of such a dense pyramidal unit follows this list).
DPCMF achieves state-of-the-art classification accuracies on four datasets with limited training samples.
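To make the combination of pyramidal convolution and dense connectivity concrete (see the second contribution above), the following is a minimal PyTorch sketch of one spatial dense pyramidal unit. The kernel sizes (9/7/5/3), growth rate, number of layers and class names are illustrative assumptions rather than the exact DPCMF configuration; the spectral branch would use the analogous 1 × 1 × k kernels.

```python
import torch
import torch.nn as nn

class PyConvSpatial(nn.Module):
    """Parallel 3-D convolutions with decreasing spatial kernel sizes (k x k x 1).
    Kernel sizes and channel split are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(9, 7, 5, 3)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)   # out_ch is assumed divisible
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=(k, k, 1),
                      padding=(k // 2, k // 2, 0), bias=False)
            for k in kernel_sizes
        ])
        self.bn = nn.BatchNorm3d(branch_ch * len(kernel_sizes))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Each pyramid level sees the same input; outputs are concatenated channel-wise.
        return self.act(self.bn(torch.cat([b(x) for b in self.branches], dim=1)))

class DensePyConvBlock(nn.Module):
    """Dense connectivity: every layer receives the concatenation of all
    preceding feature maps, as in DenseNet."""
    def __init__(self, in_ch, growth=24, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(PyConvSpatial(ch, growth))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

# Example: a 9 x 9 spatial patch with 30 PCA components and one input channel.
# out = DensePyConvBlock(in_ch=1)(torch.randn(2, 1, 9, 9, 30))
```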
The rest of this paper is arranged as follows:
Section 2 presents the specific implementation of the proposed method;
Section 3 and
Section 4 present and analyze the experimental results; and
Section 5 summarizes the conclusions of the article and proposes future directions for research.
3. Experimental Results
In order to verify the effectiveness of the method proposed in this paper, four public hyperspectral datasets, namely, Indian Pines (IP), Pavia University (UP), Salinas Valley (SV) and Botswana (BS), are used to conduct experiments. The accuracy of each method is measured by three evaluation indicators: overall accuracy (OA), average accuracy (AA) and Kappa coefficient. OA represents the proportion of correctly classified samples to the total test samples, and AA represents the average accuracy across all categories. The Kappa coefficient represents the level of consistency between the true value and the classification result. The greater the values of these three evaluation metrics, the better the classification results.
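For reference, the three metrics can be computed from the confusion matrix of the test set as in the sketch below; the function name and implementation details are our own.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average accuracy (AA) and the Kappa coefficient,
    computed from the confusion matrix of the test samples."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1

    total = cm.sum()
    oa = np.trace(cm) / total                       # correctly classified / all test samples
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))      # mean of per-class accuracies
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                    # consistency between truth and prediction
    return oa, aa, kappa
```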
3.1. Introduction and Division of the Dataset
Indian Pines (IP): The Indian Pines dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992 over the Indian Pines test site in Indiana, USA, and has a size of 145 × 145 pixels. AVIRIS images ground objects continuously in 220 bands over the wavelength range 0.4–2.5 μm. However, because the 104th–108th, 150th–163rd and 220th bands are affected by water absorption, 200 bands covering 16 types of ground objects are actually used for training.
Pavia University (UP): The Pavia University dataset is part of the hyperspectral data of the city of Pavia, Italy, collected in 2003 by the German airborne Reflective Optics System Imaging Spectrometer (ROSIS-03). The size of this dataset is 610 × 340 pixels. The spectral imager captures 115 contiguous bands within the wavelength range 0.43–0.86 μm, and the spatial resolution of the resulting images is 1.3 m. Among these bands, 12 were eliminated due to noise, and 103 bands covering 9 types of ground objects were actually used for training.
Salinas Valley (SV): The Salinas Valley dataset was acquired by the AVIRIS sensor in Salinas Valley, California. The size of this dataset is 512 × 217, the spatial resolution is 3.7 m, and it contains 224 continuous bands. In total, 20 water-absorbing bands (108–112, 154–167, 224) were removed, and 204 bands covering 16 types of ground objects were actually used for training.
Botswana (BS): The Botswana dataset was acquired by the NASA EO-1 satellite in the Okavango Delta of Botswana in May 2001. The size of this dataset is 1476 × 256. The sensor on EO-1 has a wavelength range of 400–2500 nm and a spatial resolution of about 20 m. Among the 242 bands, the noise bands (1–9, 56–81, 98–101, 120–133, 165–186) were removed, and 145 bands covering 14 types of ground objects were actually used for training.
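Removing the noise and water absorption bands listed above amounts to simple array indexing on the loaded data cube. The sketch below uses the Indian Pines band list as an example; the file name and `scipy.io` loading path are assumptions about how the data are stored.

```python
import numpy as np
from scipy.io import loadmat

# Hypothetical file name and key; the raw Indian Pines cube is assumed to be
# stored as a (145, 145, 220) array.
cube = loadmat("Indian_pines.mat")["indian_pines"].astype(np.float32)

# Water absorption bands listed above (1-based indices): 104-108, 150-163, 220.
removed = np.concatenate([np.arange(104, 109), np.arange(150, 164), [220]])
keep = np.setdiff1d(np.arange(1, cube.shape[-1] + 1), removed)
cube = cube[:, :, keep - 1]
print(cube.shape)   # (145, 145, 200) -- 200 bands are retained for the experiments
```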
Before conducting the experiments, we split each dataset into three parts, namely, training set, validation set, and test set. The training set is used to update model parameters, the validation set is used to monitor the temporary models generated during the training phase, and the test set is used to evaluate the optimal model. For different datasets, the proportions of the three parts are different. The division of Indian Pines (IP) is shown in
Table 4, the division of Pavia University (UP) is shown in
Table 5, the division of Salinas Valley (SV) is shown in
Table 6 and the division of Botswana (BS) is shown in
Table 7.
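The per-class proportions in Tables 4–7 can be reproduced with a simple stratified random split, as sketched below; the function name is an assumption, and the example ratios are the IP setting (3% training, 3% validation, 94% test) used in Section 3.3.1.

```python
import numpy as np

def split_per_class(labels, train_ratio, val_ratio, seed=0):
    """Randomly split the labelled pixels of every class into train/val/test sets.
    `labels` holds the class id of each labelled pixel; background pixels are
    assumed to be excluded beforehand."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx, test_idx = [], [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = max(1, int(round(train_ratio * idx.size)))
        n_val = max(1, int(round(val_ratio * idx.size)))
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)

# Example with the Indian Pines ratios (3% train, 3% validation, 94% test).
# train, val, test = split_per_class(ip_labels, 0.03, 0.03)
```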
3.2. Experimental Setting
To validate the classification performance of the proposed method, we conducted experiments comparing the DPCMF network with the SVM, SSRN, FDSSC, DBMA and DBDA classification networks. All experiments were performed on a system with an Intel(R) Xeon(R) 4208 CPU @ 2.10 GHz and an Nvidia GeForce RTX 2060Ti graphics card. The programming language used is Python. All classification networks were implemented in PyTorch with PyCharm as the IDE; the batch size was set to 16, RMSprop was used as the optimizer, the initial learning rate was set to 0.00005, and the cross-entropy loss function was used in all experiments.
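These settings map directly onto a standard PyTorch training loop; the sketch below assumes that the DPCMF model, the training `DataLoader` (batch size 16) and the number of epochs have already been constructed as described in Section 2.

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs, device="cuda"):
    """Training loop with the settings reported above:
    RMSprop optimizer, initial learning rate 5e-5, cross-entropy loss."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-5)
    for epoch in range(num_epochs):
        model.train()
        for patches, labels in train_loader:
            patches, labels = patches.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(patches), labels)
            loss.backward()
            optimizer.step()
        # After each epoch the validation set monitors the temporary model
        # and the best-performing weights are kept (validation code omitted).
    return model
```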
3.3. Classification Maps and Categorized Results
3.3.1. Classification Maps and Categorized Results for the IP Dataset
In this experiment, 3% of the samples were used as training samples, 3% as validation samples, and 94% as test samples. The categorized results of different methods on the IP dataset are listed in
Table 8, and the classification maps are shown in
Figure 6.
3.3.2. Classification Maps and Categorized Results for the UP Dataset
In this experiment, 0.5% of the samples were used as training samples, 0.5% as validation samples, and 99% as test samples. The categorized results of different methods on the UP dataset are listed in
Table 9, and the classification maps are shown in
Figure 7.
3.3.3. Classification Maps and Categorized Results for the SV Dataset
In this experiment, 0.5% of the samples were used as training samples, 0.5% as validation samples, and 99% as test samples. The categorized results of different methods on the SV dataset are listed in
Table 10, and the classification maps are shown in
Figure 8.
3.3.4. Classification Maps and Categorized Results for the BS Dataset
In this experiment, 1.2% of the samples were used as training samples, 1.2% as validation samples, and 97.6% as test samples. The categorized results of the different methods on the BS dataset are listed in
Table 11, and the classification maps are shown in
Figure 9.
3.4. Impact of Convolution Kernel Size
In the process of feature extraction, the size of the convolution kernel affects the extent to which information is extracted. In this set of experiments, convolution kernels of different sizes were used to extract features from the hyperspectral images. To evaluate the impact of the convolution kernel size on the experimental results, experiments were conducted in which convolution kernels of a single fixed size replaced PyConv. The experimental results are shown in
Table 12. In the table, DPCMF_3 represents the use of a 3 × 3 × 1 convolution kernel in the spatial block and a 1 × 1 × 3 convolution kernel in the spectral block; DPCMF_5 represents the use of a 5 × 5 × 1 convolution kernel in the spatial block and a 1 × 1 × 5 convolution kernel in the spectral block; DPCMF_7 represents the use of a 7 × 7 × 1 convolution kernel in the spatial block and a 1 × 1 × 7 convolution kernel in the spectral block; and DPCMF_9 represents the use of a 9 × 9 × 1 convolution kernel in the spatial block and a 1 × 1 × 9 convolution kernel in the spectral block.
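In terms of the illustrative PyConvSpatial module sketched in the Introduction, the DPCMF_k variants simply collapse the pyramid to a single kernel size; the helper below is a sketch of how the compared variants could be built, not the exact experiment code.

```python
def make_spatial_unit(in_ch, out_ch, variant="pyconv"):
    """Build the spatial convolution unit for one of the compared variants.
    'pyconv' uses the full 9/7/5/3 pyramid; '3', '5', '7' or '9' use a single
    kernel size, corresponding to DPCMF_3 ... DPCMF_9 in Table 12."""
    if variant == "pyconv":
        return PyConvSpatial(in_ch, out_ch, kernel_sizes=(9, 7, 5, 3))
    k = int(variant)
    return PyConvSpatial(in_ch, out_ch, kernel_sizes=(k,))
```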
3.5. Impact of Training Sample Size
To verify the impact of the number of training samples on the experimental results, experiments were conducted using a varying number of samples as training samples. For the IP dataset, 0.5%, 1%, 3%, 5% and 10% of the samples were used as training sets, and the experimental results are shown in
Table 13. For the UP and SV datasets, 0.1%, 0.5%, 1%, 3% and 5% of the samples were used as training sets, and the experimental results are shown in
Table 14 and
Table 15, respectively. For the BS dataset, 0.5%, 1.2%, 3%, 5% and 10% of the samples were used as training sets, and the experimental results are shown in
Table 16.
3.6. Ablation Experiment
To verify the impact of the spatial block, spectral block, and non-local block on OA, experiments were conducted on these three modules using four datasets, as shown in
Table 17.
Table 17 displays the classification results of DPCMF, DPCMF-AE, DPCMF-AN, DPCMF-EN and DPCMF-D on the four datasets. DPCMF-AE represents the absence of the non-local block, DPCMF-AN represents the absence of the spectral block, DPCMF-EN represents the absence of the spatial block, and DPCMF-D represents the absence of the Dense structure. The dataset partitioning process is consistent with that described in the previous section.
4. Discussion
The experimental results are shown in
Figure 10a. The DPCMF method achieves significant improvements in all three metrics: OA, AA and Kappa. In terms of time, because convolutional neural networks take large input volumes and require more training parameters, the time cost of the DPCMF method is higher than that of the SVM method; in terms of classification accuracy, however, the SVM method falls far below the other deep learning methods. In most cases, the DPCMF method takes less time than the other deep learning-based methods.
For the IP dataset, the OA of the DPCMF method is 96.74%, which is 27.39%, 4.79%, 1.29%, 3.89% and 0.61% higher than the OA levels of the other five methods; its AA is 96.12%, which is 30.26%, 7.18%, 1.44%, 9.94% and 0.43% higher than their AA levels; and its Kappa coefficient is 96.32%, which is 31.67%, 5.51%, 1.5%, 4.47% and 0.89% higher than their Kappa coefficients. The classification accuracy for every category reaches more than 90%. Compared to the other ground objects, the classification accuracy is lower for Grass-pasture-mowed because this class has fewer training samples, and it is difficult to extract a large amount of feature-related information from a small number of samples during model training.
For the UP dataset, the OA of the DPCMF method is 98.10%, which is 15.03%, 9.78%, 1.18%, 2.82% and 0.98% higher than the OA levels of the other five methods; its AA is 98.33%, which is 16.09%, 5.19%, 1.55%, 2.64% and 0.99% higher than their AA levels; and its Kappa coefficient is 97.77%. The classification accuracy for all categories reaches more than 94%. Compared to the other ground objects, Self-blocking bricks has lower classification accuracy because its class characteristics are not distinctive, making it difficult to extract highly discriminative features.
For the SV dataset, the OA of the DPCMF method is 98.92%, which is 10.83%, 4.5%, 2.4%, 1.97% and 1.61% higher than the OA levels of the other five methods; its AA is 98.76%, which is 7.31%, 2.14%, 0.95%, 1.79% and 1.42% higher than their AA levels; and its Kappa coefficient is 98.73%, which is 12.03%, 4.96%, 3.35%, 2.13% and 0.92% higher than their Kappa coefficients. The classification accuracy for each category reaches more than 91%.
For the BS dataset, the OA of the DPCMF method is 96.67%, which is 18.04%, 6.41%, 3.50%, 1.48% and 0.28% higher than the OA levels of the other five methods; its AA is 97.08%, which is 17.51%, 6.26%, 2.63%, 1.36% and 0.58% higher than their AA levels; and its Kappa coefficient is 96.57%, which is 19.70%, 7.10%, 3.98%, 1.78% and 0.48% higher than their Kappa coefficients. Compared to the other ground objects, Floodplain Meadows 2 has lower classification accuracy because it has fewer training samples and more complex characteristics, which makes feature extraction during model training more difficult.
From the results in
Figure 10b and
Table 12, it can be seen that as the size of the convolution kernel increased, the OA gradually increased, indicating that larger kernels can capture a wider range of features. However, once the kernel size grew beyond a certain point, the OA started to decrease because excessively large kernels may capture noise or irrelevant information, thereby reducing the accuracy of the model. In this group of experiments, the best OA is achieved with pyramidal convolution, because the pyramidal convolution module uses multi-scale kernels to capture features at different scales, thereby improving the model's generalization ability. Choosing an appropriate combination of convolution kernels can also improve the accuracy of the model; for certain tasks, smaller kernels may be more suitable because they better capture local features. Therefore, when choosing the size of the convolution kernel, a good balance should be struck according to the specific task requirements and data characteristics in order to obtain the best experimental results.
In the experiments on the impact of training sample size, the classification accuracies of the SVM, SSRN, FDSSC, DBMA, DBDA and DPCMF methods all improved as the number of training samples increased. Moreover, the performance gap between the different models narrowed as the number of training samples grew. These results show that, when the number of training samples is limited, the DPCMF method can better extract multiple classes of features from the samples by using densely connected pyramidal convolution layers to capture spectral features and multi-scale spatial features and by using non-local modules to capture global spatial information; it therefore achieves good classification results.
From the results of
Figure 10c and
Table 17, it can be seen that the absence of any module reduces the model's accuracy. DPCMF-AE performed poorly because of its inadequate perception of global spatial features. When processing images, it is necessary not only to understand the characteristics of each pixel but also to understand global information such as the structure, background, layout and composition of the image. Such global information helps the model better understand the image and improves its performance in image processing. DPCMF-AN performed poorly because hyperspectral images contain many contiguous spectral bands, each corresponding to spectral information at a different wavelength; they therefore have high dimensionality and rich information that can accurately describe the spectral characteristics of objects. The inability to extract spectral features deprives the model of this important spectral information, making it unable to distinguish and classify different objects. DPCMF-EN performed poorly because considering only the spectral information of pixels is often insufficient. For example, when distinguishing vegetation from non-vegetation, the spectral signature of vegetation may vary across positions, making it difficult to separate the two classes using spectral information alone; in this case, the spatial information of pixels, i.e., the positional relationships between pixels in the image, must be considered to improve classification accuracy. DPCMF-D performed poorly because the Dense structure plays an important role in feature extraction. The Dense structure shares features between different layers through feature reuse, which effectively improves the network's expressive ability and enables it to better learn the complex features of the input data. This mechanism avoids the vanishing-gradient problem often encountered in traditional deep networks, thus improving the network's feature extraction ability. Parameter sharing between different layers in the Dense structure also greatly reduces the number of parameters that need to be trained, thereby reducing the network's complexity; this makes the network more compact and lightweight, helping to avoid overfitting and improving generalization. Since each layer in the Dense structure accepts inputs from all preceding layers, the network can learn the features of the input data more quickly. In addition, the Dense structure can achieve the same performance as traditional networks with a shallower structure, which reduces training time and computational cost. Therefore, the Dense structure is essential in the DPCMF network.
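As background for the DPCMF-AE comparison, a non-local block lets every spatial position attend to all other positions, which is exactly the global spatial context discussed above. The 2-D embedded-Gaussian sketch below follows the general formulation of [37]; the channel sizes and class name are chosen for illustration rather than taken from the DPCMF configuration.

```python
import torch
import torch.nn as nn

class NonLocalBlock2D(nn.Module):
    """Embedded-Gaussian non-local block: each spatial position is updated with a
    weighted sum over all positions, supplying global spatial context."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (N, HW, C')
        k = self.phi(x).flatten(2)                     # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise similarities over all positions
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                         # residual connection
```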
5. Conclusions
In this paper, we propose a hyperspectral image classification method based on dense pyramidal convolution and multi-feature fusion to address the difficulty in adequately extracting and exploiting the spatial and spectral information of hyperspectral images when the sample size is limited. In this approach, two branches—i.e., spatial and spectral branches—are designed, and in each branch, dense pyramidal convolution layers are used as feature extractors. In the spatial branch, multiple local and global spatial features in image samples are extracted using dense pyramidal convolution and non-local blocks. In the spectral branch, the spectral features in the image samples are extracted by the dense pyramidal convolution module. Finally, the spatial and spectral features are fused through fully connected layers to obtain classification results.
The results of experiments conducted to compare the proposed method with the SVM, SSRN, FDSSC, DBMA and DBDA methods on four public hyperspectral datasets (Indian Pines, Pavia University, Salinas Valley, and Botswana) show that the DPCMF method achieves the best experimental results in terms of OA, AA and Kappa coefficients. In the follow-up study, we will continue to build more efficient classification models to resolve the problem of limited sample size and further improve the current model classification accuracy for hyperspectral images.