1. Introduction
Hyperspectral Remote Sensing Image (HRSI) data contain abundant spectral and spatial information about ground objects, so they are widely used in land use and land cover [1,2], forestry [3,4], precision agriculture [5,6], environmental monitoring [7,8], and military surveillance [9,10]. In these applications, HRSI classification is a universal and significant task whose purpose is to identify the class of the ground object at every pixel, and the achievable classification accuracy largely determines the effectiveness of these applications. Unfortunately, HRSI not only provides rich spectral and spatial features of ground objects but also contains a large amount of redundant information, which increases the difficulty of feature extraction and reduces classification accuracy; this is the so-called curse of dimensionality. In addition, manually specifying the class of the ground object for each pixel is expensive, so the shortage of labeled pixels further compounds the difficulty of feature extraction. The difficulty of improving HRSI classification accuracy with insufficient labeled samples restricts these applications. To extract effective features and improve classification accuracy, a large number of algorithms have been proposed.
Early methods mainly reduced the spectral dimensionality through hand-designed features. Band selection aims to select a subset of bands to replace the original data for feature extraction according to certain criteria, such as the semantic information of bands [11], the kernel similarity of discriminative information [12], and an affinity propagation algorithm based on unsupervised learning [13]. Although band selection methods can effectively reduce the number of bands, the discarded bands may contain information important for classification, resulting in reduced classification accuracy.
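To make the idea of band selection concrete, the following sketch keeps the k bands with the highest variance from an (H, W, B) cube. The variance criterion is a deliberately simple stand-in chosen for illustration; it is not one of the cited selection criteria.

```python
import numpy as np

def select_bands_by_variance(cube, k):
    """Keep the k bands with the highest variance from an (H, W, B) cube.

    A deliberately simple, unsupervised criterion used only to illustrate
    band selection; the cited methods use more sophisticated criteria
    (semantic information, kernel similarity, affinity propagation).
    """
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b)                 # one row per pixel
    variances = pixels.var(axis=0)               # per-band variance
    keep = np.sort(np.argsort(variances)[-k:])   # indices of the k most variable bands
    return cube[:, :, keep], keep

rng = np.random.default_rng(0)
cube = rng.normal(size=(8, 8, 20))
cube[:, :, 5] *= 10.0                            # make band 5 clearly dominant
reduced, keep = select_bands_by_variance(cube, 3)
```

Whatever the criterion, the output is a thinner cube whose bands are a literal subset of the input, which is exactly why discarded bands can take useful information with them.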
More dimensionality reduction methods are based on Feature Extraction (FE), which maps the original spectral bands to a new feature domain. Based on Principal Component Analysis (PCA), a linear unsupervised statistical transformation, the Tensor PCA (TPCA) [14], Joint Group Sparse PCA (JGSPCA) [15], and Superpixelwise Kernel PCA (SuperKPCA) [16] have been proposed for HRSI classification in the spectral domain. Morphological Attribute Profiles (MAPs) [17] are another feature extraction method widely used in HRSI classification. Ye et al. [18] employed PCA and Extended Multiple Attribute Profiles (EMAPs) to extract features. Liu et al. [19] combined MAPs and a deep random forest for small-sample HRSI classification. Yan et al. [20] improved 2D Singular Spectrum Analysis (2DSSA) for extracting global and local spectral features by fusing PCA and Folded PCA (FPCA). However, traditional FE methods can only extract shallow features, which makes it difficult to further improve classification accuracy.
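The core PCA step these methods build on can be sketched in a few lines: each pixel's spectrum is projected onto the leading principal components of the spectral covariance. This is plain linear PCA for illustration only, not a reimplementation of TPCA, JGSPCA, or SuperKPCA.

```python
import numpy as np

def pca_spectral(cube, n_components):
    """Project each pixel's spectrum onto its leading principal components.

    Minimal linear PCA over the spectral axis of an (H, W, B) cube; the
    cited tensor/sparse/kernel variants refine this basic transformation.
    """
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)                              # center each band
    cov = x.T @ x / (x.shape[0] - 1)                 # (B, B) spectral covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :n_components]  # top components first
    return (x @ components).reshape(h, w, n_components)

rng = np.random.default_rng(1)
cube = rng.normal(size=(10, 10, 30))
reduced = pca_spectral(cube, 5)
```

Unlike band selection, every output feature is a linear mixture of all input bands, so no band is discarded outright; but the mapping is still linear, which is one reason such methods remain shallow.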
In recent years, Deep Learning (DL) has achieved significant success in image processing due to its powerful deep feature extraction capabilities. Inspired by this, researchers have introduced DL methods into HRSI classification tasks and achieved better classification results than traditional methods. Chen et al. [21] proposed a 1D autoencoder network to extract spatial features and spectral features, respectively. Mario et al. [22] employed 1D Stacked Autoencoders (SAEs) with three encoder and decoder layers for pixel-based classification. Bai et al. [23] proposed a two-stage multi-dimensional convolutional SAE for HRSI classification, composed of an SAE-1 sub-model based on a 1D Convolutional Neural Network (CNN) and an SAE-2 sub-model based on 2D and 3D convolution operations. Zhao et al. [24] proposed a deep learning architecture combining an SAE and a 3D Deep Residual Network (3DDRN), where the SAE is designed for dimensionality reduction and the 3DDRN extracts joint spatial–spectral features. Cheng et al. [25] proposed a Deep Two-stage Convolutional Sparse Coding Network (DTCSCNet) for HRSI classification without backpropagation or a fine-tuning process. Although SAE-based models can be trained without any labeled samples, the presence of the decoder restricts the number of layers in the encoder, as overfitting easily occurs when the SAE is too deep. This constrains the improvement of feature extraction ability.
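The autoencoder principle behind these SAE-based models can be illustrated with a one-hidden-layer linear autoencoder trained on unlabeled spectra by gradient descent on the reconstruction error. This is a minimal sketch of the idea only; the cited models stack several (convolutional) encoder and decoder layers.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for unlabeled pixel spectra: 200 pixels x 30 bands.
x = rng.normal(size=(200, 30))

# w1 encodes 30 -> 8 features, w2 decodes 8 -> 30; no labels are needed,
# because the training target is the input itself.
w1 = rng.normal(scale=0.1, size=(30, 8))
w2 = rng.normal(scale=0.1, size=(8, 30))
lr = 0.1

def recon_loss(x, w1, w2):
    return float(np.mean((x @ w1 @ w2 - x) ** 2))

before = recon_loss(x, w1, w2)
for _ in range(500):
    h = x @ w1                          # encode
    g = 2.0 * (h @ w2 - x) / x.size     # gradient of the loss w.r.t. the reconstruction
    grad_w2 = h.T @ g                   # backprop through the decoder
    grad_w1 = x.T @ (g @ w2.T)          # backprop through the encoder
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
after = recon_loss(x, w1, w2)           # reconstruction error has decreased
```

After unsupervised training, only the encoder half is kept as a feature extractor; the decoder exists purely to define the reconstruction objective, which is also why its presence limits how deep the encoder can grow before overfitting.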
CNNs are the most widely used models in HRSI classification [26]. Jacopo et al. [27] showed that a shallow 1D-CNN with only one hidden layer can achieve state-of-the-art performance through label-based data augmentation. Li et al. [28] proposed the DCNR, composed of a deep cube 1D-CNN and a random forest classifier, for extracting spectral and spatial information. From the perspective of kernel structure, 1D-CNNs are only suitable for extracting spectral features: if they are used to extract spatial features, the 2D spatial tensors must be flattened into 1D vectors, and this transformation loses spatial information. To fully utilize spatial information, 2D-CNNs have been introduced for HRSI classification. Haque et al. [29] proposed multi-scale CNNs with three different sizes of 2D convolutional kernels to extract spectral and spatial features. Jia et al. [30] proposed an end-to-end deep 2D-CNN based on U-Net that takes the entire HSI as input instead of pixel patches. In 2D-CNNs, however, the spatial and spectral information are extracted separately, which ignores the correlation between spatial and spectral features. Gao et al. [31] proposed a lightweight spatial–spectral network composed of a 3D Multi-group Feature Extraction Module (MGFM) based on a 3D-CNN and a 2D MGFM based on depthwise separable convolution. Yu et al. [32] proposed a lightweight 2D-3D-CNN combining 2D and 3D convolutional layers, which can extract fused spatial and spectral features. Hueseyin et al. [33] proposed a 3D-CNN based on the LeNet-5 model to extract spatial and spectral features from data preprocessed by PCA. Due to the structural consistency between 3D convolutional kernels and 3D-cube HRSIs, 3D-CNNs can effectively extract joint spatial–spectral features. Unfortunately, their excellent feature extraction ability requires sufficient labeled samples for training, and the high cost of labeling means that sufficient labeled samples are rarely available.
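The structural match between 3D kernels and HRSI cubes can be seen in a naive "valid" 3D convolution (implemented, as in most DL frameworks, as cross-correlation): the kernel slides jointly over both spatial axes and the spectral axis, so every output value mixes spatial and spectral neighbors. This loop version is for illustration, not for efficiency.

```python
import numpy as np

def conv3d_valid(cube, kernel):
    """Naive 'valid' 3D cross-correlation of an (H, W, B) cube with a
    (kh, kw, kb) kernel: each output mixes a local spatial-spectral block."""
    h, w, b = cube.shape
    kh, kw, kb = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1, b - kb + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(cube[i:i+kh, j:j+kw, k:k+kb] * kernel)
    return out

cube = np.arange(5 * 5 * 6, dtype=float).reshape(5, 5, 6)
kernel = np.ones((3, 3, 3)) / 27.0          # local averaging kernel
out = conv3d_valid(cube, kernel)            # shape (3, 3, 4)
```

A 1D kernel would slide over the spectral axis alone and a 2D kernel over the two spatial axes alone; only the 3D kernel couples all three, which is the source of both its expressiveness and its larger appetite for labeled training data.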
Recently, improving the feature extraction ability of 3D-CNNs with small training sample sizes has become a research hotspot. Li et al. [34] proposed the MMFN based on a 3D-CNN, in which a multi-scale architecture and residual blocks are introduced to fuse related information among features at different scales and extract more discriminative features. Zhou et al. [35] proposed a Shallow-to-Deep Feature Enhancement (SDFE) model composed of PCA, a shallow 3D-CNN (SSSFE), a channel-attention residual 2D-CNN, and a Vision Transformer network. Ma et al. [36] proposed a Multi-level Feature extraction Block (MFB) for spatial–spectral feature extraction and a Spatial Multi-scale Interactive Attention (SMIA) module for spatial feature enhancement. Paoletti et al. [37] proposed a pyramidal residual module architecture, which is used to build deep pyramidal residual networks for HRSI classification. In addition to residual structures, attention mechanisms are also widely used to improve the feature extraction ability of CNNs. Zhu et al. [38] proposed a Residual Spectral–Spatial Attention Network (RSSAN) for HRSI classification, in which the raw data are sequentially processed by a spectral attention module, a spatial attention module, and a residual block. In the Cooperative Spectral–Spatial Attention Network (CS2ADN) [39], spectral and spatial features are extracted by independent spectral and spatial attention branches, respectively, and the fused features are then further extracted by a densely connected structure. In general, such models contain two residual branches, a spatial residual branch and a spectral residual branch, or, in attention-based models, a spatial attention branch and a spectral attention branch. This structure of two independent branches cannot extract joint spatial–spectral features and increases the number of trainable parameters in the model. When the number of labeled samples is small, the trainable parameters cannot be fully trained and the classification accuracy decreases.
To improve the classification accuracy of deep-learning-based HRSI models with a small number of labeled samples, a model with few trainable parameters and robust deep feature extraction capability is necessary. Inspired by these studies, a Lightweight 3D Dense Autoencoder Network (L3DDAN) is proposed in this paper for HRSI classification. At the top level, the network is an SAE composed of an encoder for extracting features and a decoder for reconstructing data. First, the SAE is trained without any labeled samples through unsupervised learning. Then, all labeled samples are randomly divided into training, validation, and testing groups, and the encoder is fine-tuned and the classifier trained by supervised learning on the small training and validation groups. Finally, the classification ability of the trained classifier is evaluated on the testing group. The experimental results indicate that the L3DDAN can extract deep spatial–spectral features with only a small number of labeled samples; thanks to this, high classification accuracy can still be achieved when labeled samples are insufficient. The major contributions of this paper include the following:
- (1)
A Lightweight 3D Dense Autoencoder Network (L3DDAN) is proposed for HRSI classification. The architecture of the L3DDAN is an SAE based on 3D convolution operations, and the Spectral–Spatial Joint Dense Block (S2DB) is introduced into the encoder to enhance the deep feature extraction ability. High classification accuracy can be maintained when the number of training samples is small.
- (2)
An SAE architecture is proposed to train the encoder by unsupervised learning. The encoder of the SAE combines 3D convolution operations and the S2DB to extract deep features from HRSI, while the decoder is implemented with 3D convolution operations. The SAE enables the L3DDAN to extract deep features from the original data without labeled samples.
- (3)
The Spectral–Spatial Joint Dense Block (S2DB) is proposed to replace the traditional separate spatial residual branch and spectral residual branch. The S2DB not only avoids the loss of joint spectral–spatial features but also reduces the number of trainable parameters in the L3DDAN.
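The dense connectivity pattern underlying blocks of this kind can be sketched generically: each layer receives the channel-wise concatenation of the block input and all earlier layer outputs. The stand-in "layer" below (a channel average broadcast to a fixed growth rate) only mimics the shapes; it does not reproduce the actual 3D convolutional layers of the S2DB.

```python
import numpy as np

def dense_block(x, layers):
    """Dense connectivity over the channel axis: each layer sees the
    concatenation of the block input and all earlier layer outputs.

    `x` is (C, H, W, B); each entry of `layers` maps such a tensor to a
    fixed-channel tensor. This sketches only the connectivity pattern,
    not the actual layers of the proposed S2DB.
    """
    features = x
    for layer in layers:
        new = layer(features)
        features = np.concatenate([features, new], axis=0)  # stack channels
    return features

growth = 2
# Hypothetical stand-in layer: average the current channels, then
# broadcast to `growth` output channels (a real block would convolve).
layer = lambda f: np.repeat(f.mean(axis=0, keepdims=True), growth, axis=0)

x = np.ones((4, 5, 5, 6))           # 4 channels, 5x5 spatial window, 6 bands
out = dense_block(x, [layer, layer, layer])  # 4 + 3*growth = 10 channels
```

Because every layer reuses all earlier feature maps instead of recomputing them in parallel branches, dense connectivity tends to need fewer trainable parameters than two independent spectral and spatial branches, which is the motivation stated for the S2DB.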
The rest of this paper is organized as follows. Section 2 illustrates the detailed framework and related principles of the L3DDAN. Section 3 provides the details and results of extensive experiments. Finally, Section 4 presents the conclusion.