#### *2.1. CNN*

Convolutional Neural Networks (CNNs) have been increasingly used for HSI analysis. Works aimed at improving the performance of Deep CNNs (DCNNs) have focused on different aspects, e.g., the network architecture, the type of nonlinear activation function, supervision methods, regularization mechanisms, and optimization techniques [4,49,50]. Based on the network architecture, specifically the convolutional layers, CNNs can be divided into 1D-CNNs, 2D-CNNs, and 3D-CNNs. A 1D-CNN has one-dimensional filters in its convolutional layers, which naturally fit the spectral data structure: the input has size *K* × 1 and the kernel has size *B* × 1, where *K* is the number of HSI bands and *B* << *K*. Wei Hu et al. [11] used 1D convolutional layers to extract the spectral features of HSI; their network input is an HSI pixel vector of size (*K*, 1). This research initiated the use of multiple convolutional layers for HSI classification. Compared to 2D and 3D convolution, the 1D convolution process, shown in Equation (1), is the simplest.

A 2D-CNN has two-dimensional filters in its convolutional layers. It has been widely used to solve several computer vision problems, such as object detection [51], scene recognition [52], and image classification [51], because of its ability to extract features directly from a raw image. For the HSI classification problem, 2D convolutions have been used to extract spatial features [25,30]. In contrast to RGB images, HSI has a much larger number of channels, and applying 2D convolutions along hundreds of channels results in many more learned parameters [18]; hence, several studies on HSIs that employ 2D convolutions do not use all of the channels. Most of them apply a dimensionality reduction technique as a preprocessing step [53–55] or average the images over the spectral bands of the HSI [25].

A 3D-CNN employs 3D convolutions in its convolutional layers. 3D-CNNs are popular for video classification [36], 3D object reconstruction [56], and action recognition [57]. For HSIs, the shape of the 3D filter suits the data structure of the HSI cube, and several papers on HSI classification use 3D convolutional layers to extract the spectral-spatial features directly [18,33]. This research shows that 3D-CNNs outperform both 1D-CNNs and 2D-CNNs. However, as shown in Equation (3), 3D convolution requires more parameters and memory and incurs higher computational time and complexity than 1D convolution in Equation (1) and 2D convolution in Equation (2).

$$
v_{ij}^{z} = f\left(\sum_{m} \sum_{b=0}^{B_i - 1} k_{ijm}^{b} \, v_{(i-1)m}^{z+b} + r_{ij}\right),
\tag{1}
$$

$$v_{ij}^{xy} = f\left(\sum_{m} \sum_{h=0}^{H_i - 1} \sum_{w=0}^{W_i - 1} k_{ijm}^{hw} \, v_{(i-1)m}^{(x+h)(y+w)} + r_{ij}\right),\tag{2}$$

$$v_{ij}^{xyz} = f\left(\sum_{m} \sum_{b=0}^{B_i - 1} \sum_{h=0}^{H_i - 1} \sum_{w=0}^{W_i - 1} k_{ijm}^{hwb} \, v_{(i-1)m}^{(x+h)(y+w)(z+b)} + r_{ij}\right),\tag{3}$$

where *i* is the layer under consideration, *m* is the index of the feature map, and *z* is the index along the spectral dimension; $v_{ij}^{z}$ is the output of the *j*th feature map of the *i*th layer at position *z*; $k_{ijm}^{b}$ is the kernel value at index *b* in layer *i* and feature map *j*, applied to feature map *m* of the previous layer; and $r_{ij}$ is the bias at layer *i* and feature map *j*. For the 1D convolution in Equation (1), $B_i$ is the size of the 1D filter in layer *i*, while, for the 3D convolution in Equation (3), $B_i$ is the depth of the 3D kernel. $W_i$ and $H_i$ are the width and height of the kernel, respectively, for both 2D and 3D convolutions.
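To make the parameter costs of Equations (1)–(3) concrete, the following minimal PyTorch sketch counts the learnable parameters of comparable 1D, 2D, and 3D layers. It is our illustration only; the channel counts (16 and 32) and kernel sizes are assumptions, not settings from any cited network.

```python
import torch.nn as nn

# Illustrative layer sizes (assumptions): 16 input maps, 32 output maps.
conv1d = nn.Conv1d(16, 32, kernel_size=7)          # B_i = 7, Equation (1)
conv2d = nn.Conv2d(16, 32, kernel_size=(3, 3))     # H_i = W_i = 3, Equation (2)
conv3d = nn.Conv3d(16, 32, kernel_size=(7, 3, 3))  # B_i = 7, H_i = W_i = 3, Equation (3)

for name, layer in [("1D", conv1d), ("2D", conv2d), ("3D", conv3d)]:
    n_params = sum(p.numel() for p in layer.parameters())
    print(f"{name} conv parameters: {n_params}")
# 1D: 16*32*7   + 32 =  3,616
# 2D: 16*32*3*3 + 32 =  4,640
# 3D: 16*32*7*3*3 + 32 = 32,288
```

With the same 3 × 3 spatial extent, the 3D layer here costs roughly $B_i$ (= 7) times as many parameters as the 2D layer, which is the overhead the next paragraph addresses.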

The expensive computational cost and memory demand of 3D convolution has led studies to investigate alternative network architectures based on (2D + 1D) convolutions. For instance, for action recognition, Reference [58] proposed replacing 3D convolution with *m* parallel streams of *n* 2D convolutions and one 1D convolution; this study empirically showed that their (2D + 1D) network achieves around a 40% reduction in model size and a drastic reduction in the number of learnable parameters compared to an equivalent network with 3D convolution. In Reference [36], a simplified 3D convolution was implemented using 2D and 1D convolutions in three different blocks: 2D followed by 1D, 2D and 1D in parallel, and 2D followed by 1D with skip connections. These blocks were then interleaved using a sequence network. The proposed architecture has a depth of 199 and a model size of 261 MB, much lighter than the 3D-CNN, whose model size is 321 MB at a depth of only 11; it was also shown to be faster than its 3D-CNN counterpart [36]. Because model size and parameter count grow dramatically as a network becomes deeper, training deep 3D-CNNs is extremely difficult and prone to overfitting. Using (2D + 1D) convolutions instead of 3D convolutions allows the network to be deeper without significantly increasing the number of parameters; such deep networks can extract richer features and have been shown to outperform 3D-CNN architectures [36,58,59].
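A minimal sketch of one such factorization is given below, in the spirit of the "2D followed by 1D" block of Reference [36]. It is an illustrative PyTorch module, not the exact block from either cited work; the channel widths and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """Factorizes a 3D convolution into a 2D spatial convolution
    followed by a 1D convolution along the depth axis (cf. [36,58])."""
    def __init__(self, in_ch, out_ch, spatial_k=3, depth_k=3, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch  # width of the intermediate representation
        # 2D spatial convolution: kernel (1, k, k) touches only H and W.
        self.spatial = nn.Conv3d(in_ch, mid_ch,
                                 kernel_size=(1, spatial_k, spatial_k),
                                 padding=(0, spatial_k // 2, spatial_k // 2))
        # 1D convolution along the depth/spectral axis: kernel (k, 1, 1).
        self.depth = nn.Conv3d(mid_ch, out_ch,
                               kernel_size=(depth_k, 1, 1),
                               padding=(depth_k // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (batch, channels, depth, H, W)
        return self.relu(self.depth(self.relu(self.spatial(x))))

x = torch.randn(2, 16, 20, 9, 9)   # toy input
y = Conv2Plus1D(16, 32)(x)
print(y.shape)                     # torch.Size([2, 32, 20, 9, 9])
```

Note that the factorization also inserts an extra nonlinearity between the 2D and 1D stages, which is one reason such networks can be more expressive per parameter than a single 3D convolution.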

#### *2.2. Residual Network*

Deeper CNNs can extract richer features [39]. However, as networks get deeper, their accuracy can degrade because of the vanishing/exploding gradients problem [60]. Hence, He et al. [40] proposed using a shortcut connection to perform identity mapping without adding extra parameters or extra computational time. The shortcut connection output is added to the output of the stacked layers, and a ReLU is applied as the activation function. This network, named ResNet, has achieved outstanding classification accuracy on image benchmarks such as ImageNet (ILSVRC 2015) and CIFAR-10.

He et al. [48] followed up their work on ResNet by analyzing the propagation formulation behind the residual unit. Their analysis showed that a "clean" information path in the skip connection results in the lowest training loss compared to variants with scaling, gating, and convolution. Regarding the position of the activation functions in the residual building blocks, they showed that putting ReLU and batch normalization (BN) before the addition (full pre-activation) generalizes better than the original ResNet [40], which applied ReLU after the addition (post-activation). The difference between pre-activation and post-activation in the residual building blocks is shown in Figure 3.

**Figure 3.** (**a**) Original residual unit with clean shortcut connection. (**b**) Rectified linear unit (ReLU)-only pre-activation with dropout shortcut connection. (**c**) Full pre-activation with clean shortcut connection.
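For illustration, a full pre-activation residual unit with a clean shortcut, as in Figure 3c, can be sketched as follows. This is a two-convolution 1D variant with assumed channel counts; He et al. [48] describe the general pattern, not this exact code.

```python
import torch.nn as nn

class PreActResUnit1D(nn.Module):
    """Full pre-activation residual unit (BN -> ReLU -> Conv, twice)
    with a clean identity shortcut, after He et al. [48]."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv1 = nn.Conv1d(channels, channels, k, padding=k // 2)
        self.bn2 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, k, padding=k // 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))  # pre-activation: BN/ReLU first
        out = self.conv2(self.relu(self.bn2(out)))
        return x + out                            # clean shortcut: plain addition
```

Because the shortcut is a plain addition with no scaling, gating, or convolution, the identity signal propagates unchanged through the unit, which is exactly the "clean" information path analyzed in Reference [48].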

Regarding the use of ResNet for HSI classification, Zhong et al. [44] showed that, with convolutional layers of the same size, ResNet achieves better recognition accuracy than a plain CNN. They then extended ResNet by applying a greater variety of kernel sizes to sequentially extract the spectral features and the spatial features [19]. Roy et al. [61] used 3D convolutional layers followed by 2D convolutional layers in their residual network, which achieved high performance with 30% of the samples used for training. Meanwhile, Reference [45] explored ResNet by implementing a variable number of kernels in each convolutional layer: the kernel number increases gradually across the convolutional layers, like a pyramid, to increase the diversity of the spectral-spatial features. In contrast to References [19,45,61], which focus on exploring the convolutional layers, Reference [46] improved the ResNet architecture by combining it with a dense convolutional network, which helps the ResNet explore new features. All of these works improve ResNet performance by using a single network to extract both spectral and spatial features. Our proposed architecture instead extracts the spectral and spatial features with two separate stream networks to produce distinctive spectral and spatial features.

#### **3. Proposed TSRN Network**

The flowchart of the proposed TSRN is displayed in Figure 1. As the diagram shows, TSRN has two residual network streams: a spectral residual network (sRN) and a spatial residual network (saRN). Since the number of bands in the spectral dimension is very large (hundreds of channels) and thus highly redundant, we first apply PCA to extract the first *K* principal components. Then, for each pixel, we take a 3 × 3 × *K* cube along the spectral direction, centered at that pixel, as the input of the sRN stream to learn the spectral features. Similarly, we take an *n* × *n* × *K* cube and feed it to the saRN stream to extract the spatial features. We propose to use the same number of spectral and spatial features; hence, the feature map sizes at the outputs of the sRN and the saRN must be the same, which we achieve by applying a dense layer with the same number of units in each stream. We then apply a fully connected layer to fuse the spectral features and the spatial features and, finally, a softmax layer to classify the fused features. The next subsections explain both networks in more detail.
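The fusion logic of this pipeline can be summarized by the schematic sketch below. This is our illustration only: the stream bodies are stand-in placeholders for the sRN and saRN described in the following subsections, and the class count of 16 and input sizes are assumptions.

```python
import torch
import torch.nn as nn

class TSRNSketch(nn.Module):
    """Schematic of the two-stream pipeline: fuse two 128-d feature
    vectors with a fully connected layer, then classify with softmax."""
    def __init__(self, srn, sarn, n_features=128, n_classes=16):
        super().__init__()
        self.srn, self.sarn = srn, sarn  # each stream ends in a 128-unit dense layer
        self.fuse = nn.Linear(2 * n_features, n_features)  # fully connected fusion
        self.classify = nn.Linear(n_features, n_classes)   # softmax regression layer

    def forward(self, x_spec, x_spat):
        f_s = self.srn(x_spec)    # spectral features, (batch, 128)
        f_sa = self.sarn(x_spat)  # spatial features,  (batch, 128)
        f_ssa = torch.relu(self.fuse(torch.cat([f_s, f_sa], dim=1)))
        return torch.softmax(self.classify(f_ssa), dim=1)

# Toy stand-in streams mapping their inputs to 128-d features:
srn = nn.Sequential(nn.Flatten(), nn.Linear(30, 128))           # mean spectrum, K = 30
sarn = nn.Sequential(nn.Flatten(), nn.Linear(9 * 9 * 30, 128))  # 9 x 9 x 30 cube
p = TSRNSketch(srn, sarn)(torch.randn(4, 30), torch.randn(4, 30, 9, 9))
print(p.shape)                                                  # torch.Size([4, 16])
```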

#### *3.1. Spectral Residual Network*

Although we have minimized the data redundancy of the HSI by using PCA, the spectral values can still be noisy. In the sRN, we therefore propose to compute the mean reflectance in each spectral band before inputting the spectral cube into the network, to minimize the effects of the noise. Given a pixel $x_{ij}$, we choose a spectral cube $X_{ij}^{s} \in \mathbb{R}^{3 \times 3 \times K}$ centered at $x_{ij}$, where *K* is the number of PCA components. Then, we compute the mean reflectance of each band $k \in \{1, \dots, K\}$ using Equation (4), producing $\overline{X}_{ij}^{s} = \{\overline{x}_{ij1}^{s}, \overline{x}_{ij2}^{s}, \dots, \overline{x}_{ijK}^{s}\}$ with $\overline{X}_{ij}^{s} \in \mathbb{R}^{1 \times K}$.

$$\overline{x}_{ijk}^{s} = \frac{\sum_{g=i-1}^{i+1} \sum_{h=j-1}^{j+1} x_{ghk}^{s}}{9}.\tag{4}$$
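Equation (4) simply averages the nine pixels of the 3 × 3 neighborhood in each band. A minimal NumPy sketch, with an assumed *K* = 30, is:

```python
import numpy as np

def mean_reflectance(cube):
    """Equation (4): average the 3 x 3 spatial neighborhood in each of
    the K bands, producing the 1 x K mean spectral vector."""
    # cube: (3, 3, K) neighborhood centered at pixel (i, j)
    return cube.mean(axis=(0, 1))  # sum over the 9 pixels, divide by 9

cube = np.random.rand(3, 3, 30)    # toy example with K = 30 components
x_bar = mean_reflectance(cube)
print(x_bar.shape)                 # (30,)
```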

Then, we input the mean spectral vector $\overline{X}_{ij}^{s}$ into the sRN. In this proposed architecture, we use three full pre-activation residual units with clean skip connections, inspired by Reference [48]; these residual units have been shown to outperform the traditional residual units. Each residual unit consists of three different layers.


In the full pre-activation residual unit, the BN layer and the ReLU activation layer are placed before the convolution layer, as shown in Figure 2a. The figure also shows that an Average Pooling layer, a Dropout layer, and a Dense layer are applied at the end of the sRN. The Average Pooling and Dropout layers act as regularizers to mitigate the overfitting caused by the small number of training samples, while the Dense layer performs the high-level reasoning to produce 128 spectral features.
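Putting these pieces together, the sRN stream can be sketched as follows, reusing the `PreActResUnit1D` class sketched after Figure 3. The stem width, pooling size, and dropout rate here are assumptions, not the paper's tuned values.

```python
import torch
import torch.nn as nn

class SRNSketch(nn.Module):
    """Sketch of the sRN: stem conv, three pre-activation residual units,
    average pooling and dropout as regularizers, and a 128-unit dense layer."""
    def __init__(self, k_bands=30, channels=24, n_features=128, p_drop=0.5):
        super().__init__()
        self.stem = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[PreActResUnit1D(channels) for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.drop = nn.Dropout(p_drop)
        self.dense = nn.Linear(channels * (k_bands // 2), n_features)

    def forward(self, x_bar):            # x_bar: (batch, K) mean spectra
        x = x_bar.unsqueeze(1)           # -> (batch, 1, K) for 1D convolution
        x = self.body(self.stem(x))
        x = self.drop(self.pool(x))
        return self.dense(x.flatten(1))  # 128 spectral features

f_s = SRNSketch()(torch.randn(4, 30))
print(f_s.shape)                         # torch.Size([4, 128])
```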

#### *3.2. Spatial Residual Network*

The saRN is devised to extract the spatial features of a pixel $x_{ij}$. The input is an *n* × *n* × *K* cube centered at pixel $x_{ij}$, which is then processed by full pre-activation residual units. As shown in Figure 2b, the rest of the layer architecture of the saRN is similar to that of the sRN; the main difference is that the saRN uses 2D convolutional layers, while the sRN uses 1D convolutional layers. At the end of this network, 128 spatial features are extracted.

Figure 2c illustrates the 2D convolution process with a spatial cube of size *n* × *n* × *K*, where *K* is the number of channels, or the input depth. Since the input is 3D, the kernel is also 3D; that is, the kernel depth must be the same as the input depth, i.e., *K*. Thus, we can only select the kernel width and height. The convolution is performed along the *x*- and *y*-directions and produces a 2D matrix as output [64].
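The following minimal PyTorch lines illustrate this constraint, with assumed *n* = 9 and *K* = 30: a single 2D filter over a *K*-channel input implicitly has depth *K* and collapses all channels into one 2D map.

```python
import torch
import torch.nn as nn

K, n = 30, 9                      # assumed input depth and spatial size
conv = nn.Conv2d(in_channels=K, out_channels=1, kernel_size=3)
cube = torch.randn(1, K, n, n)    # (batch, K, n, n)
out = conv(cube)
print(out.shape)                  # torch.Size([1, 1, 7, 7]) -- a single 2D map
```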

Given a pixel $x_{ij}$, the spectral features $F_{ij}^{s}$ produced by the sRN and the spatial features $F_{ij}^{sa}$ produced by the saRN are given by Equations (5) and (6), respectively.

$$F_{ij}^{s} = sRN(\overline{X}_{ij}^{s}),\tag{5}$$

$$F_{ij}^{sa} = saRN(X_{ij}^{sa}).\tag{6}$$

Using $F_{ij}^{s}$ and $F_{ij}^{sa}$, we implement a fully connected layer to obtain the joint spectral-spatial features $F_{ij}^{ssa}$ using the formula in Equation (7), where $W^{fcl}$ and $b^{fcl}$ are the weight matrix and the bias of the fully connected layer, respectively, and ⊕ is the concatenation operation.

$$F_{ij}^{ssa} = f\left(W^{fcl} \cdot \left(F_{ij}^{s} \oplus F_{ij}^{sa}\right) + b^{fcl}\right).\tag{7}$$

After obtaining the joint spectral-spatial feature $F_{ij}^{ssa}$, we use a softmax regression layer (SRL) to predict the class probability distribution of the pixel $x_{ij}$ using Equation (8). Here, *N* is the number of classes, and $P(x_{ij})$ is a vector containing the probability of each class for pixel $x_{ij}$. Finally, the label of the pixel $x_{ij}$ is decided using Equation (9).

$$P(x_{ij}) = \frac{1}{\sum_{n=1}^{N} e^{w_n^{srl} \cdot F_{ij}^{ssa}}} \begin{bmatrix} e^{w_1^{srl} \cdot F_{ij}^{ssa}} \\ e^{w_2^{srl} \cdot F_{ij}^{ssa}} \\ \vdots \\ e^{w_N^{srl} \cdot F_{ij}^{ssa}} \end{bmatrix}\tag{8}$$

$$
l(x_{ij}) = \arg\max P(x_{ij}).\tag{9}
$$
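As a worked illustration of Equations (5)–(9), the sketch below transcribes the fusion and classification steps with explicit weights. The weights are randomly initialized placeholders, *N* = 16 is assumed, and the activation *f* is taken to be ReLU.

```python
import torch

def classify_pixel(f_s, f_sa, W_fcl, b_fcl, W_srl):
    """Equations (5)-(9): fuse the 128-d spectral and spatial feature
    vectors, apply the softmax regression layer, and take the argmax."""
    f_ssa = torch.relu(W_fcl @ torch.cat([f_s, f_sa]) + b_fcl)  # Eq. (7)
    p = torch.softmax(W_srl @ f_ssa, dim=0)                     # Eq. (8)
    return int(p.argmax())                                      # Eq. (9)

f_s, f_sa = torch.randn(128), torch.randn(128)  # outputs of sRN / saRN
W_fcl, b_fcl = torch.randn(128, 256), torch.randn(128)
W_srl = torch.randn(16, 128)                    # N = 16 classes (illustrative)
print(classify_pixel(f_s, f_sa, W_fcl, b_fcl, W_srl))
```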
