1. Introduction
Hyperspectral images (HSIs) generally consist of tens to hundreds of contiguous spectral bands [1] and provide rich spatial and spectral information simultaneously, which offers great potential for subsequent information extraction and practical applications [2]. Therefore, HSI is becoming a valuable tool for monitoring the Earth's surface and is used in a wide range of applications, such as environmental monitoring [3], precision agriculture [4], military reconnaissance [5], and so on.
Hyperspectral image classification (HSIC) is one of the hot topics in hyperspectral research. Taking advantage of the rich spectral information, numerous classification methods have been developed. The support vector machine (SVM) [6] is robust to high-dimensional hyperspectral data. K-nearest neighbor (KNN) [7] is one of the simplest classifiers for HSI classification. Random forest (RF) [8] is an ensemble learning method that constructs multiple decision trees during training. In addition, decision trees [9], extreme learning machines [10], sparse representation-based classifiers [11] and many other methods have been adopted to further improve the performance of hyperspectral image classification. Nevertheless, it is difficult to accurately distinguish different land-cover categories using spectral information alone [12]. Zhan et al. [13] used factor analysis to learn effective spectral and spatial features and applied a large-margin distribution machine (LDM) to hyperspectral remote sensing image classification. Meanwhile, morphological profile-based methods [14] have been proposed to effectively combine spatial and spectral information.
However, conventional methods are based on handcrafted spectral–spatial features [15], which heavily depend on professional expertise and are quite empirical. Deep learning-based methods can automatically extract spectral, spatial, or joint spectral–spatial features of HSIs for classification. Chen et al. [16] proposed a stacked autoencoder (SAE) to extract joint spectral–spatial features for accurate HSI classification. Li et al. [17] used a single restricted Boltzmann machine (RBM) and a multilayer deep belief network (DBN) to extract spectral–spatial features and obtained superior classification performance compared to SVM-based methods. Makantasis et al. [18] introduced a 2-D CNN to HSI classification, which encoded spectral–spatial information with the CNN and conducted classification with a multilayer perceptron, achieving satisfactory performance. Chen et al. [19] used a 3-D CNN to simultaneously extract spectral–spatial features and achieved better results for HSI classification. Nonetheless, due to the information loss caused by the vanishing gradient problem, training deep CNNs remains difficult. Recently, He et al. [20] proposed the residual network (ResNet) to address this problem, which uses residual blocks as building elements to facilitate the training of substantially deeper networks. Zhong et al. [21] designed a spectral–spatial residual network (SSRN), which applies spectral residual blocks and spatial residual blocks consecutively to learn deep discriminative features from the abundant spectral features and spatial contexts of HSI, achieving state-of-the-art HSI classification accuracy on agricultural, urban–rural and urban datasets. Moreover, a deep pyramidal residual network (PyResNet) [22] was developed to learn more robust spectral–spatial representations from HSI cubes and provided competitive advantages (in terms of both classification accuracy and computational time) over state-of-the-art HSI classification methods.
Although CNN-based models have achieved good performance for HSI classification, the intrinsic complexity of remote sensing hyperspectral images still limits their performance. Firstly, the number of CNN parameters grows rapidly with the number of convolutional layers, and model size keeps growing with the increase in available computing power. In addition, owing to the long multiplication and addition time, computational cost has become a bottleneck for practical applications. Finally, the translation invariance and local connectivity of CNNs can limit the HSI classification effect. The MLP, as a less constrained neural network, can eliminate the negative effects of translation invariance and local connectivity and has proven to be a promising machine learning technique. The recent MLP-Mixer [23] is regarded as a pioneering MLP model. Furthermore, Liu et al. [24] proposed gMLP, which combines MLPs with gating, and showed that it can perform as well as Transformers in key language and vision applications. Touvron et al. [25] proposed the ResMLP network, built entirely upon multi-layer perceptrons, which attained surprisingly good accuracy/complexity trade-offs on ImageNet. In addition, RaftMLP [26] aims to achieve cost-effectiveness and ease of application to downstream tasks with fewer resources in developing a global MLP-based model.
MLPs avoid the constraints of translation invariance and local connectivity, while residual networks prevent model degradation and facilitate rapid convergence by retaining the original information. Therefore, in this paper we propose two MLP-based classification frameworks: an improved MLP (IMLP) model, and IMLP combined with ResNet (IMLP-ResNet), to achieve superior HSI classification performance.
In summary, the main contributions of this study are as follows.
MLP, as a less constrained network, can eliminate the negative effects of translation invariance and local connectivity. Therefore, this paper introduces MLP into HSI classification to fully exploit the spectral–spatial features of each sample and improve the classification performance of HSI.
Based on the characteristics of hyperspectral images, we designed IMLP by introducing depthwise over-parameterized convolution, a Focal Loss function and a cosine annealing algorithm. Firstly, in order to improve network performance without increasing inference computation, a depthwise over-parameterized convolutional layer replaces the ordinary convolution, which speeds up training with more learnable parameters. Secondly, a Focal Loss function is used to enhance the important spectral–spatial features and suppress useless ones in the classification task, which allows the network to learn more useful hyperspectral image information. Finally, a cosine annealing algorithm is introduced to avoid oscillation and accelerate the convergence of the proposed model.
This paper inserts IMLP between two 3 × 3 convolutional layers in the ordinary residual block, termed IMLP-ResNet, which has a stronger ability to extract deeper features from HSI. Firstly, the residual structure can retain the original characteristics of the HSI data and avoid gradient explosion and gradient vanishing during training. In addition, the residual structure improves the modeling ability of the model. Moreover, IMLP improves the feature extraction ability of the residual network, so that the model strengthens the key features while retaining the original features of the hyperspectral data.
The rest of this article is organized as follows. Section 2 describes our proposed classification approach. Section 3 reports the experimental results and appraises the performance of the proposed methods. Section 4 discusses how to choose the experimental parameters of the IMLP-ResNet classification model. Section 5 gives the final conclusions and discusses future research directions.
2. The Proposed MLP-Based Methods for HSI Classification
Considering that deepening the network layers in deep learning causes gradient vanishing and gradient explosion, the classification model adopts a residual network as its basic framework. Figure 1 shows the overall flowchart of the improved MLP combined with ResNet (IMLP-ResNet) for HSI classification. First of all, the improved MLP (IMLP) model for HSI classification is described in detail.
2.1. The Proposed Improved MLP (IMLP) for HSI Classification
Figure 2 gives the overall architecture of the proposed IMLP for HSI classification, which consists of two stages: a training stage and a testing stage. In the training stage, the network consists of a Global Perceptron module, a Partition Perceptron module and a Local Perceptron module. Structural reparameterization means that the training-time model has one set of parameters and the inference-time model has another [27], and the latter is parameterized with the former's parameters. The details are as follows. It is assumed that the HSI dataset has a size of $H \times W \times B$, where $H$ and $W$ represent the spatial height and width and $B$ is the number of spectral bands. First, each pixel of the hyperspectral image is processed with a fixed window of size $s \times s$, and a single sample with a shape of $s \times s \times B$ is generated. Each patch then serves as one input sample to the network. In this paper, the patch size $s$ is set to 4.
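To make the patch-generation step concrete, the following is a minimal NumPy sketch of the windowing described above; the function name, the use of reflect padding, and the dense per-pixel sampling are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def extract_patches(hsi, s=4):
    # hsi: HSI cube of shape (H, W, B); s: patch (window) size.
    H, W, B = hsi.shape
    pad = s // 2
    # Pad spatial borders so every pixel gets a full s x s window.
    padded = np.pad(hsi, ((pad, pad), (pad, pad), (0, 0)), mode='reflect')
    patches = np.empty((H * W, s, s, B), dtype=hsi.dtype)
    for i in range(H):
        for j in range(W):
            # One s x s x B sample centered (up to rounding) on pixel (i, j).
            patches[i * W + j] = padded[i:i + s, j:j + s, :]
    return patches
```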
The Global Perceptron module consists of two branches. The first branch splits the input hyperspectral feature map into partitions, changing the feature map from $(H_{in}, W_{in}, C_{in})$ to partitions of size $(h_{out}, w_{out}, C_{out})$. In the second branch, the original feature map is evenly pooled, and the size of the hyperspectral feature map becomes $(h_{p}, w_{p}, C_{in})$. Here, $H_{in}$, $W_{in}$ and $C_{in}$ indicate the height, width and number of input channels of the input hyperspectral feature map, respectively; $h_{out}$, $w_{out}$ and $C_{out}$ respectively represent the height, width and number of output channels of each split partition. Finally, $h_{p}$ and $w_{p}$ indicate the height and width of the hyperspectral feature map after average pooling, as follows:

$$h_{p} = \frac{H_{in}}{h_{out}}, \qquad w_{p} = \frac{W_{in}}{w_{out}} \tag{1}$$
The second branch uses average pooling to obtain one pixel for each partition, and then feeds the result through BN and a two-layer MLP; that is, the pooled hyperspectral feature map is sent to a BN layer and two fully connected layers. A ReLU function is introduced between the two fully connected layers to help avoid gradient explosion and gradient vanishing. For a fully connected layer with input $V_{in}$, output $V_{out}$ and kernel $W$, the matrix multiplication (MMUL) is defined as follows:

$$V_{out} = \mathrm{MMUL}(V_{in}, W) = V_{in} \cdot W^{T} \tag{2}$$

The hyperspectral vector is transformed by the BN layer and the two fully connected layers, after which the outputs of all branches are added to obtain the hyperspectral feature map. The hyperspectral features are then input to the Partition Perceptron and the Local Perceptron without further division.
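As an illustration of how the two branches interact, here is a minimal PyTorch sketch under the usual (N, C, H, W) layout; the partition sizes h and w, the hidden width of the two-layer MLP, and all function and module names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def global_perceptron(x, h, w, mlp, bn):
    # x: (N, C, H, W) feature map; h, w: partition height/width.
    n, c, H, W = x.shape
    # Branch 1: split into (N * H/h * W/w, C, h, w) partitions.
    parts = x.reshape(n, c, H // h, h, W // w, w)
    parts = parts.permute(0, 2, 4, 1, 3, 5).reshape(-1, c, h, w)
    # Branch 2: average-pool to one pixel per partition, then BN + 2-layer MLP.
    pooled = F.avg_pool2d(x, kernel_size=(h, w))          # (N, C, H/h, W/w)
    vec = bn(pooled).permute(0, 2, 3, 1).reshape(-1, c)   # one vector per partition
    vec = mlp(vec)                                        # FC -> ReLU -> FC
    # Add the per-partition vector back onto its partition (broadcast over h, w).
    return parts + vec.reshape(-1, c, 1, 1)

# Example modules (hidden width equal to C is an assumption):
# bn = nn.BatchNorm2d(C)
# mlp = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C))
```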
The Partition Perceptron module block contains a BN layer and a group convolution. The input of the Partition Perceptron is the split hyperspectral feature map $M_{in}$, which is processed by a group convolution with $g = 4$ groups and a BN layer, yielding an output with the same shape as the original hyperspectral feature input. Let $M_{out}$ denote the output hyperspectral feature, $p$ the number of padded pixels, $F$ the convolution kernel, and $g$ the number of convolution groups. The calculation of $M_{out}$ is shown in Equation (3):

$$M_{out} = \mathrm{BN}\big(\mathrm{gConv}(M_{in}, F, g, p)\big) \tag{3}$$
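A minimal sketch of this module, following the description above (a group convolution with g = 4 followed by BN); the kernel size and the assumption that input and output channel counts match are illustrative choices not fixed by the text.

```python
import torch.nn as nn

class PartitionPerceptron(nn.Module):
    # Group convolution (groups = 4, per the text) followed by BN.
    # kernel_size=3 with padding=1 keeps the partition shape unchanged;
    # `channels` must be divisible by `groups`.
    def __init__(self, channels, groups=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, m_in):
        return self.bn(self.conv(m_in))  # M_out = BN(gConv(M_in, F, g, p))
```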
The Local Perceptron module contains a depthwise over-parameterized convolutional layer (DO-Conv) [28] and a BN layer. First, the Local Perceptron module sends the split hyperspectral feature map to the depthwise over-parameterized convolutional layer. The feature map is then fed into the BN layer, and the outputs of all convolution branches are added to the output of the Partition Perceptron as the final output. In the test phase, reparameterization is carried out to fuse the Local Perceptron module and the Partition Perceptron module into a single fully connected layer. The FC kernel equivalent to a convolution kernel is the result of convolving an identity matrix with the kernel, followed by a proper reshaping operation. Formula (4) shows how to build the FC kernel $W^{FC}$ from the convolution kernel $F$:

$$W^{FC} = \mathrm{RS}\big(\mathrm{Conv}(\mathrm{RS}(I), F, p)\big) \tag{4}$$

where $I$ is an identity matrix and $\mathrm{RS}(\cdot)$ denotes the reshaping operation.
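The identity-matrix trick in Formula (4) can be sketched as follows: convolving each basis input with the kernel yields the columns of the equivalent FC matrix. The function name and the (C_out·h·w) × (C_in·h·w) matrix layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conv_kernel_to_fc_kernel(conv_w, h, w, padding=1):
    # conv_w: (C_out, C_in, k, k) kernel; h, w: partition spatial size.
    # Assumes 'same' padding (k odd, padding = k // 2) so output size equals input size.
    # Returns a matrix M of shape (C_out*h*w, C_in*h*w) such that
    # flatten(conv(x)) == M @ flatten(x) for any x of shape (C_in, h, w).
    c_out, c_in = conv_w.shape[:2]
    # Each row of the identity is one basis input, reshaped to (C_in, h, w).
    eye = torch.eye(c_in * h * w).reshape(c_in * h * w, c_in, h, w)
    out = F.conv2d(eye, conv_w, padding=padding)         # (C_in*h*w, C_out, h, w)
    return out.reshape(c_in * h * w, c_out * h * w).t()  # (C_out*h*w, C_in*h*w)
```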
In order to increase the learnable parameters of the proposed model, a depthwise over-parameterized convolutional layer is introduced to replace the ordinary convolutional layer in constructing IMLP. In addition, IMLP introduces Focal Loss to address the problem of data imbalance in hyperspectral image classification, and a cosine annealing algorithm to improve the training of IMLP and speed up network convergence. The three modifications are described in the following parts.
2.1.1. DO-Conv
In order to improve the training speed of the model, DO-Conv is introduced to replace the traditional convolutional layer in the Local Perceptron module. The architecture of DO-Conv is shown in Figure 3. There are two compositions in DO-Conv: a feature composition and a convolution kernel composition. The model is more efficient with the kernel composition, so this paper uses the kernel composition to train the network. DO-Conv is composed of a conventional convolution with kernel $W$ and a depthwise convolution with kernel $D$. In conventional convolution, the convolution layer slides over the input data, and each element of the output feature is obtained as the dot product of a slice of the convolution kernel and an image patch. In the depthwise convolution layer, the kernel is convolved with each input channel separately during the training phase. At the end of the training phase, the multi-layer composite linear operation used for over-parameterization is folded into a compact single-layer representation. Then, only this one layer is used for inference, reducing the computation to be fully equivalent to that of a regular convolutional layer.
$M$ and $N$ are the spatial dimensions of the depthwise kernel $D$, $C_{in}$ is the number of input feature maps, $C_{out}$ is the number of output feature maps, $D^{T}$ is the transposition of $D$, and the convolution kernel of DO-Conv is $W$. First, the depthwise convolution kernel $D$ and the ordinary convolution kernel $W$ are combined into a composite kernel $W' = D^{T} \circ W$. The convolution output feature $O$ is then generated as $O = W' \ast P$, where $\ast$ denotes convolution, $\cdot$ denotes the dot product, $P$ is the input patch, and $\circ$ is the kernel composition operator defined by DO-Conv.
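Below is a minimal sketch of the kernel composition W' = Dᵀ ∘ W and its folding into a single convolution at inference time; the parameter shapes, initialization, and class name follow the spirit of DO-Conv but are simplified assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DOConv2d(nn.Module):
    # Depthwise over-parameterized conv sketch: a depthwise kernel D is
    # composed with a conventional kernel W during training; the two fold
    # into one standard conv kernel, so inference cost is unchanged.
    def __init__(self, c_in, c_out, k, d_mul=None):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, k
        self.d_mul = d_mul or k * k  # depth multiplier (>= k*k for full rank)
        # Conventional kernel W: (C_out, C_in, d_mul).
        self.W = nn.Parameter(torch.randn(c_out, c_in, self.d_mul) * 0.02)
        # Depthwise kernel D: (C_in, d_mul, k*k).
        self.D = nn.Parameter(torch.randn(c_in, self.d_mul, k * k) * 0.02)

    def folded_kernel(self):
        # Compose W' = D^T ∘ W, then reshape into a standard conv kernel:
        # (C_out, C_in, d_mul) x (C_in, d_mul, k*k) -> (C_out, C_in, k*k).
        Wp = torch.einsum('oid,idk->oik', self.W, self.D)
        return Wp.reshape(self.c_out, self.c_in, self.k, self.k)

    def forward(self, x):
        # O = W' * P, a single regular convolution with the folded kernel.
        return F.conv2d(x, self.folded_kernel(), padding=self.k // 2)
```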
2.1.2. Focal Loss
Data imbalance is common in hyperspectral remote sensing images. Because a hyperspectral scene contains various objects of different sizes, labeling samples is very difficult in practice, so there is usually a serious imbalance among the classes of hyperspectral data [29]. Thus, this paper introduces the Focal Loss function in place of the cross-entropy (CE) loss function. CE is written as follows:

$$\mathrm{CE}(p, y) = -\log(p_{t}) \tag{5}$$

where $y$ specifies the ground-truth class and $p$ is the model's estimated probability for the class with label $y = 1$, and $p_{t}$ is defined as follows:

$$p_{t} = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{6}$$

Focal Loss is calculated as follows:

$$\mathrm{FL}(p_{t}) = -(1 - p_{t})^{\gamma} \log(p_{t}) \tag{7}$$

In practice, an $\alpha$-balanced variant is used:

$$\mathrm{FL}(p_{t}) = -\alpha_{t} (1 - p_{t})^{\gamma} \log(p_{t}) \tag{8}$$

The weighting factor $\alpha_{t}$ adjusts the weight of positive and negative samples, while the focusing parameter $\gamma$ controls the weight of difficult and easy samples. When a sample is misclassified and $p_{t}$ is very small, the modulating factor $(1 - p_{t})^{\gamma}$ is close to 1, which has little influence on the loss function. However, as $p_{t}$ tends to 1, this factor gradually tends to 0, and the loss for well-classified samples also decreases, achieving the effect of down-weighting them. $\gamma$ smoothly adjusts the proportion by which easily classified samples are down-weighted. Increasing $\gamma$ enhances the influence of the modulating factor, which reduces the loss contribution of easily classified samples and broadens the range over which samples receive low loss.
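For concreteness, here is a minimal multi-class Focal Loss sketch following Equation (8); the α and γ defaults are the common choices from the original Focal Loss paper, not values tuned in this work.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights well-classified samples: the modulating factor
    # (1 - p_t)**gamma -> 0 as p_t -> 1, so easy samples contribute
    # little to the loss, which helps with class imbalance.
    log_pt = F.log_softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()  # p_t: predicted probability of the true class
    loss = -alpha * (1.0 - pt) ** gamma * log_pt
    return loss.mean()

# Usage: logits of shape (batch, num_classes), integer class targets:
# loss = focal_loss(model(x), y, gamma=2.0)
```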
2.1.3. Cosine Annealing Algorithm
The batch gradient descent (BGD) and stochastic gradient descent (SGD) methods are mainly used to update parameter values in deep learning. BGD updates each parameter using the whole dataset; if the sample size is too large, training becomes too slow, which increases the computational cost. SGD, in contrast, trains quickly because it uses only part of the data, but it easily falls into a local optimum [30]. Therefore, considering both training speed and computational cost, this article introduces the cosine annealing algorithm to update the parameter values, with the learning rate reduced by a cosine function. We decay the learning rate with cosine annealing for each batch as follows:

$$\eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{\max}}\pi\right)\right) \tag{9}$$

where $\eta_{\min}$ and $\eta_{\max}$ define the range of the learning rate, $T_{\max}$ is the total number of epochs, and $T_{cur}$ is the current epoch. When $T_{cur} = T_{\max}$, the learning rate $\eta_{t}$ reaches its minimum.
When the gradient descent algorithm is used to optimize the objective function, the learning rate should become smaller as training approaches the global minimum of the loss function, so that the model can get as close as possible to this point. The cosine annealing algorithm reduces the learning rate with a cosine function: the cosine decreases slowly at first, then more quickly, and then slowly again.
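The schedule in Equation (9) can be written in a few lines; the η range values below are placeholders. PyTorch's built-in CosineAnnealingLR implements the same decay.

```python
import math

def cosine_annealing_lr(epoch, t_max, eta_min=0.0, eta_max=0.1):
    # Learning rate starts at eta_max, decays slowly, then faster,
    # and flattens out again as it approaches eta_min at epoch t_max.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Equivalent built-in:
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=t_max)
```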
2.2. The Proposed IMLP-ResNet Model for HSI Classification
The main idea of the IMLP-ResNet model is to insert IMLP between the two convolutional layers in the ordinary residual block; specifically, the IMLP module inserted into the third layer of ResNet has a stronger ability to extract deeper features from HSI. First of all, ResNet34 can retain the original characteristics of the HSI data and alleviate gradient explosion and gradient vanishing during training. In the meantime, ResNet34 improves the modeling ability of the model. IMLP improves the feature extraction ability of the residual network and strengthens the key features while retaining the original features. Compared with other CNN models, ResNet34 also helps overcome over-fitting. The ResNet family includes ResNet18, ResNet34, ResNet50, ResNet152, etc.; in order to improve classification efficiency, ResNet34, with relatively few parameters, is used in this paper.
2.2.1. The Structure of ResNet34
The classification performance of a deep learning model can decrease as its depth increases [31]. Inspired by the deep residual learning framework, this degradation problem can be alleviated by adding shortcut connections and propagating feature values between layers.
The core of the deep residual network lies in the residual learning module, which can preserve part of the original input information during the training of a deep CNN model [32,33]. In this way, the learning target is transferred so as to avoid the saturation of classification accuracy caused by network depth. As shown in Figure 4, $x$ represents the input, $H(x)$ represents the output, and $F(x)$ represents the residual function. The output of the residual unit is shown in Equation (10):

$$H(x) = F(x) + x \tag{10}$$

When the shortcut is not interrupted, the residual module computes the residual error. Using $F(x, \{W_{i}\})$ to denote the residual mapping of the module block, the output actually computed by the residual module is shown in Equation (11):

$$y = F(x, \{W_{i}\}) + x \tag{11}$$

The residual mapping $F(x, \{W_{i}\})$ can be obtained by back propagation (BP). For the case of two weight layers, the calculation process is shown in Equation (12) when the bias is ignored:

$$F = W_{2}\,\sigma(W_{1} x) \tag{12}$$

where $\sigma$ denotes the ReLU activation. The calculation of the residual module requires that $x$ and $F$ have the same dimensions. When they differ, a linear projection $W_{s}$ is applied in the shortcut connection to match the dimensions:

$$y = F(x, \{W_{i}\}) + W_{s} x \tag{13}$$
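To tie Equations (10)–(13) together, the following is a minimal PyTorch sketch of a basic residual block with the projection shortcut W_s; the exact layer hyperparameters are assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Basic residual block: y = F(x) + x, with a 1x1 projection W_s
    # on the shortcut whenever the dimensions differ (Equation (13)).
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        self.proj = None
        if stride != 1 or c_in != c_out:  # linear projection W_s
            self.proj = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out))

    def forward(self, x):
        identity = x if self.proj is None else self.proj(x)
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x, {W_i})
        return self.relu(out + identity)                                # y = F + x
```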
Figure 5 shows the overall architecture of ResNet34, which adds shortcut connections between every two layers and directly downsamples the input with convolutions of stride 2. There are four layers in the structure of ResNet34, containing 3, 4, 6, and 3 residual blocks, respectively. The convolutional layers mostly have 3 × 3 filters for the same output feature map size. To maintain the time complexity of each layer, the number of filters is doubled whenever the feature map size is halved.
2.2.2. IMLP-ResNet Model
Figure 6a shows the structure of the ordinary residual block, which contains two 3 × 3 convolutional layers and a shortcut connection. BN is applied after each convolutional layer and before the activation function to accelerate the convergence of the module. The shortcut connection enables the gradient to propagate directly from later to earlier layers, thus mitigating gradient vanishing. Stacking multiple residual blocks allows a deeper network to be developed while alleviating overfitting.
As shown in Figure 6b, this paper inserts IMLP between the two 3 × 3 convolutional layers in the ordinary residual block to constitute a symmetric structure. Traditional convolutional networks obtain long-range dependencies through the large receptive fields formed by deep stacks of convolutional layers. However, repeating local operations requires considerable computation and may cause optimization difficulties. At the same time, some images have an intrinsic positional prior, which cannot be fully utilized by a convolutional layer because it shares parameters among different positions. IMLP runs faster than a CNN with the same number of parameters and has global capacity and positional perception. Therefore, our proposed IMLP-ResNet can perform fine feature extraction at different network levels and learn more comprehensive feature representations for HSI classification.
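A minimal sketch of the modified block in Figure 6b follows: the IMLP module, abstracted here as any shape-preserving nn.Module, is inserted between the two 3 × 3 convolutions; the IMLP internals follow Section 2.1 and are omitted, and the class name is illustrative.

```python
import torch.nn as nn

class IMLPResidualBlock(nn.Module):
    # Ordinary residual block with an IMLP module inserted between
    # the two 3x3 convolutions (Figure 6b). `imlp` must preserve the
    # (N, C, H, W) shape of the intermediate feature map.
    def __init__(self, channels, imlp: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.imlp = imlp                   # shape-preserving IMLP module
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.imlp(out)               # IMLP between the two convs
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # shortcut connection
```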
4. Discussion
In order to find the optimal architecture, it is necessary to experiment with different main parameters, which play a crucial role in the size and complexity of the proposed IMLP-ResNet. By comparing the overall accuracy under different parameter settings, the influence of these parameters on the model can be analyzed. For the Indian Pines dataset, Pavia University dataset and Xuzhou dataset, the effects of varying these parameters on the model are shown in Figure 16, Figure 17, Figure 18 and Figure 19.
The first parameter verified experimentally was the patch size. The hyperspectral images were first divided into fixed-size patches as input to IMLP-ResNet, and several increasing patch sizes were tested, with the input dataset divided correspondingly. As shown in Figure 16, for all three datasets, the OA, AA and Kappa coefficients show a decreasing trend as the patch size increases. When the patch size is 4, the proposed IMLP-ResNet model achieves the best classification accuracy, because the correlation between the internal information of an image patch weakens as the patch size increases.
The second parameter is the choice of the ResNet layer into which the proposed IMLP module should be inserted to obtain the best classification results. As shown in Figure 17, the IMLP module inserted into the third layer of ResNet achieves the highest accuracy on the three datasets. This is because the numbers of residual blocks in the four layers of ResNet34 are 3, 4, 6 and 3; that is, the third layer contains more residual blocks than the other three layers. The IMLP module inserted into the third layer of ResNet therefore sits in a deeper sub-network than in the other layers, which gives it a stronger ability to extract deeper features of hyperspectral images and hence higher classification accuracy.
The third parameter is the proportion of training samples to total samples. With the patch size set to 4 and the IMLP module inserted into the third layer of ResNet, 5% and 10% of the samples in each of the three datasets are taken as training samples, as shown in Figure 18 and Figure 19.
It can be seen from Figure 18 and Figure 19 that, when the training samples account for 10% of the total samples, the OA is higher than when they account for 5%. This is because the more training samples there are, the more accurately the model can estimate the data distribution, and thus the better the generalization performance on the validation set, which leads to higher accuracy. The above results show that when the patch size is 4, the IMLP module is inserted into the third layer of ResNet, and the training samples account for 10% of the total samples, the proposed IMLP-ResNet achieves the best classification performance on the three datasets.
5. Conclusions
In this paper, two HSI classification frameworks based on MLP are proposed: the IMLP model and IMLP-ResNet. Firstly, according to the characteristics of HSI, three improvements were made to the original model to design IMLP: in order to improve network performance without increasing inference computation, a depthwise over-parameterized convolutional layer was introduced in place of ordinary convolution; in order to enable the network to learn more useful hyperspectral image information and suppress useless features, a Focal Loss function was used to enhance the key spectral–spatial features in the classification task; and, in order to avoid oscillation, a cosine annealing algorithm was introduced to accelerate the convergence of the model. Secondly, the residual structure retains the original characteristics of the data, avoids gradient explosion and gradient vanishing during training, and improves the modeling ability of the model. In addition, IMLP improves the feature extraction capability of ResNet, so that the model can enhance key features while preserving the original features of hyperspectral data. Therefore, in this paper, we proposed IMLP-ResNet, which extracts 3D spectral–spatial features at different levels of the network and learns more comprehensive feature representations for HSI classification.
The proposed IMLP and IMLP-ResNet were tested on two public datasets (Indian Pines and Pavia University) and a real HSI dataset (Xuzhou). Compared with classic methods and deep learning-based methods, the proposed IMLP and IMLP-ResNet show obvious improvements. The results show that both proposed algorithms are effective and obtain better results in HSI classification.
However, in hyperspectral image classification tasks, the available labeled samples are usually very limited. When analyzing the effect of the number of training samples, we found that the proposed IMLP-ResNet performs better with 10% of the samples than with 5%. Therefore, in future work, we will consider data augmentation, active learning, transfer learning, meta-learning and other techniques to design MLP-based network models for small-sample settings. In addition, how to use unlabeled samples more effectively for semi-supervised MLP-based hyperspectral classification is also worthy of further research.