2.2. 3D Depthwise Separable Convolutional Network
Depthwise separable convolution was first proposed by Howard et al. and used in MobileNetV1 [34]. The standard convolution is split into two parts through the depthwise separable convolution. The first part is the depthwise convolution, which is utilized to extract the features from each input channel separately. The second part is the pointwise convolution, which uses a $1 \times 1$ convolution to combine the outputs of the depthwise convolution.
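As an illustration of the two stages (not the paper's implementation), the following NumPy sketch applies a depthwise convolution and then a pointwise convolution to a toy feature map; all sizes and values are arbitrary:

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Depthwise convolution: each input channel is filtered separately.

    x: (M, H, W) input feature map; kernels: (M, k, k), one kernel per channel.
    Returns (M, H-k+1, W-k+1) -- 'valid' convolution, no padding.
    """
    M, H, W = x.shape
    k = kernels.shape[1]
    out = np.zeros((M, H - k + 1, W - k + 1))
    for m in range(M):                      # one channel at a time
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[m, i, j] = np.sum(x[m, i:i + k, j:j + k] * kernels[m])
    return out

def pointwise_conv2d(x, weights):
    """Pointwise (1x1) convolution: mixes the M channels at every position.

    x: (M, H, W); weights: (N, M). Returns (N, H, W).
    """
    return np.tensordot(weights, x, axes=([1], [0]))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))                       # M=8 channels, 16x16 map
dw = depthwise_conv2d(x, rng.standard_normal((8, 3, 3)))   # -> (8, 14, 14)
pw = pointwise_conv2d(dw, rng.standard_normal((16, 8)))    # -> (16, 14, 14)
print(dw.shape, pw.shape)
```

The depthwise stage never mixes channels; only the pointwise stage recombines them, which is exactly the division of labor described above.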
Compared with the standard convolution, the depthwise separable convolution significantly reduces the number of parameters and the computational complexity of the convolution layer. We assume that the size of the input feature map is $H \times W \times M$ and that the parameters of a standard convolution layer are $k \times k \times M \times N$, where $H$ and $W$ represent the height and width of the input data, respectively, $M$ denotes the number of channels in the input feature map, $k$ represents the size of the convolution kernel for performing 2D convolutions, and $N$ represents the number of output channels. If the output feature map size is still $H \times W$, we set $C_{2D}$ as the computational complexity of the standard 2D convolution. Next, $C_{2D}$ is calculated as follows [34]:

$$C_{2D} = k \times k \times M \times N \times H \times W \tag{1}$$
If 2D depthwise separable convolution is adopted, we assume its computational cost is $C_{DS}$. $C_{DS}$ consists of two parts. The first part denotes the computational cost of the 2D depthwise convolution, and the second part denotes the computational cost of the 2D pointwise convolution. The costs are represented by $C_{DW}$ and $C_{PW}$, respectively. In order to compare the computational costs of the 2D depthwise separable convolution and the standard 2D convolution, we assume that the size of the convolution kernel is $k \times k$, the numbers of input and output channels are $M$ and $N$, respectively, and the height and width of the input data are $H$ and $W$, respectively. Next, $C_{DS}$ is calculated as follows:

$$C_{DS} = C_{DW} + C_{PW} = k \times k \times M \times H \times W + M \times N \times H \times W \tag{2}$$
By comparing the computational costs of the two convolutions, the ratio of the computation is obtained as follows:

$$\frac{C_{DS}}{C_{2D}} = \frac{k \times k \times M \times H \times W + M \times N \times H \times W}{k \times k \times M \times N \times H \times W} = \frac{1}{N} + \frac{1}{k^{2}} \tag{3}$$
For convenience, we define the computational cost factor $R$ as the ratio of the computational cost of the current 2D convolution to that of the standard 2D convolution, as shown in Equation (4):

$$R = \frac{C_{conv}}{C_{2D}} \tag{4}$$
Generally, the values of $N$ and $k$ are greater than 2; thus, $R < 1$ can be obtained from Equations (3) and (4), which shows that 2D depthwise separable convolution can effectively decrease the computational cost. If a convolution kernel of size $3 \times 3$ is used, the computational cost of 2D depthwise separable convolution is reduced by about 9 times as compared with the standard 2D convolution. Therefore, a lightweight network can be created using depthwise separable convolution, which also improves the network's training efficiency.
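Under illustrative sizes (not taken from the paper), the 2D cost comparison can be checked numerically:

```python
# Illustrative sizes: H=W=32 feature map, M=64 input channels,
# N=128 output channels, k=3 convolution kernel.
H, W, M, N, k = 32, 32, 64, 128, 3

c_std = k * k * M * N * H * W        # standard 2D convolution cost
c_dw  = k * k * M * H * W            # depthwise part
c_pw  = M * N * H * W                # pointwise part
c_ds  = c_dw + c_pw                  # depthwise separable total

R = c_ds / c_std                     # cost factor (Equations (3) and (4))
assert abs(R - (1 / N + 1 / k**2)) < 1e-12
print(f"cost factor R = {R:.4f}")    # ~1/k^2 = 0.1111 once N is large
```

With a 3x3 kernel the factor is dominated by the $1/k^{2}$ term, matching the roughly 9-fold reduction stated above.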
In the 2D depthwise convolution part, the features are extracted separately from each input channel. If 2D depthwise convolution is adopted, the connection between different bands of the same pixel is ignored, and the spectral features cannot be learned completely. Moreover, channel-by-channel convolution easily overlooks the relationship between spatial and spectral features. Although pointwise convolution partially remedies this defect, many features still cannot be captured.
Considering the limitations of 2D depthwise separable convolution, we propose the 3D depthwise separable convolution technique, which can fully extract the spatial–spectral features and learn joint features from multiple bands to enhance the classification performance. As each 3D convolution convolves a data block, it is possible to capture the features of adjacent groups of bands.
Figure 2 depicts the structure of the proposed 3D depthwise separable convolution (3D-DW) module. The proposed technique also splits the standard 3D convolution into two parts, namely 3D depthwise convolution and 3D pointwise convolution.
In addition, the proposed 3D depthwise separable convolution retains the advantages of 2D depthwise separable convolutions. Note that the computational complexity of 3D depthwise separable convolution is lower than that of the standard 3D convolution.
Assume that the size of the input data cube is $M \times B \times H \times W$, where $M$ is the number of input channels, $B$ is the number of bands, and $H$ and $W$ are the height and width of the data cube, respectively. The number of parameters in a standard 3D convolution is $k \times k \times k \times M \times N$, where $k$ is the size of the 3D convolution kernel and $N$ is the number of output channels. If the spatial size of the output data cube remains unchanged, we consider $C_{3D}$ as the computational cost of the standard 3D convolution. $C_{3D}$ is computed as follows:

$$C_{3D} = k \times k \times k \times B \times H \times W \times M \times N \tag{5}$$
If 3D depthwise separable convolution is adopted, we assume its computational cost is $C_{3DS}$. $C_{3DS}$ consists of two parts, i.e., the computational cost of the 3D depthwise convolution and the computational cost of the 3D pointwise convolution, which are denoted as $C_{3DW}$ and $C_{3PW}$, respectively. To compare the computational costs of 3D depthwise separable convolution with those of the standard 3D convolution, we assume that the size of the convolution kernel is $k \times k \times k$, the numbers of input channels and output channels are $M$ and $N$, respectively, and the height and width of the input data are $H$ and $W$, respectively. Next, $C_{3DS}$ is calculated as follows:

$$C_{3DW} = k \times k \times k \times B \times H \times W \times M \tag{6}$$

$$C_{3PW} = M \times N \times B \times H \times W \tag{7}$$

$$C_{3DS} = C_{3DW} + C_{3PW} \tag{8}$$

$$C_{3DS} = k^{3} \times B \times H \times W \times M + M \times N \times B \times H \times W \tag{9}$$

To compare the computational costs of the convolutions, we define the computational cost factor $R_{3D}$ as follows:

$$R_{3D} = \frac{C_{3DS}}{C_{3D}} \tag{10}$$

$$R_{3D} = \frac{k^{3} \times B \times H \times W \times M + M \times N \times B \times H \times W}{k^{3} \times B \times H \times W \times M \times N} = \frac{1}{N} + \frac{1}{k^{3}} \tag{11}$$

Since $N > 1$ and $k > 1$, $R_{3D} < 1$ is obtained from Equation (11). Therefore, it is evident that 3D depthwise separable convolution greatly reduces the computational cost.
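As a quick numerical check with illustrative sizes (not from the paper), the 3D cost factor can be computed directly:

```python
# Illustrative sizes: data cube with B=20 bands, 16x16 spatial extent,
# M=16 input channels, N=32 output channels, 3x3x3 kernel (k=3).
B, H, W, M, N, k = 20, 16, 16, 16, 32, 3

c3_std = k**3 * B * H * W * M * N    # standard 3D convolution cost
c3_dw  = k**3 * B * H * W * M        # 3D depthwise part
c3_pw  = B * H * W * M * N           # 3D pointwise part

r3 = (c3_dw + c3_pw) / c3_std        # cost factor, per Equation (11)
assert abs(r3 - (1 / N + 1 / k**3)) < 1e-12
print(f"3D cost factor = {r3:.4f}")
```

Because the 3D case adds a $1/k^{3}$ term instead of $1/k^{2}$, the relative saving over the standard convolution is even larger than in 2D.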
Figure 3 shows the difference between the filters of the 3D depthwise separable convolution and the filters of the standard 3D convolution. Since each channel of the input layer is convolved separately in depthwise convolution, it is difficult to efficiently utilize the feature information from multiple channels at the same spatial position. The convolution kernels of the 3D depthwise convolution have three dimensions, so each convolution kernel extracts features from a group of adjacent bands, effectively avoiding this defect of depthwise convolution. Additionally, the number of channels is adjusted, and features are captured again, using the 3D pointwise convolution, whose kernel size is only $1 \times 1 \times 1$. Therefore, compared with the standard convolution, the 3D depthwise separable convolution has significantly fewer parameters and a lower computational cost.
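To make the band-group behavior concrete, here is a minimal NumPy sketch (illustrative only, not the paper's implementation) of a 3D depthwise convolution followed by a $1 \times 1 \times 1$ pointwise convolution, together with the parameter comparison:

```python
import numpy as np

def depthwise_conv3d(x, kernels):
    """3D depthwise convolution: one k x k x k kernel per input channel,
    sliding over (bands, height, width), so each output voxel aggregates a
    group of adjacent bands -- unlike 2D depthwise convolution.

    x: (M, B, H, W); kernels: (M, k, k, k). 'Valid' convolution, no padding.
    """
    M, B, H, W = x.shape
    k = kernels.shape[1]
    out = np.zeros((M, B - k + 1, H - k + 1, W - k + 1))
    for m in range(M):
        for b in range(B - k + 1):
            for i in range(H - k + 1):
                for j in range(W - k + 1):
                    out[m, b, i, j] = np.sum(
                        x[m, b:b + k, i:i + k, j:j + k] * kernels[m])
    return out

def pointwise_conv3d(x, weights):
    """1x1x1 convolution: recombines the M channels into N at every voxel."""
    return np.tensordot(weights, x, axes=([1], [0]))   # (N, B', H', W')

rng = np.random.default_rng(1)
M, N, k = 4, 8, 3
x = rng.standard_normal((M, 10, 9, 9))                 # small HSI-like cube
y = pointwise_conv3d(depthwise_conv3d(x, rng.standard_normal((M, k, k, k))),
                     rng.standard_normal((N, M)))
print(y.shape)                                         # (8, 8, 7, 7)

# Parameter comparison with a standard 3D convolution of the same shape:
params_separable = M * k**3 + M * N                    # depthwise + pointwise
params_standard  = k**3 * M * N
assert params_separable < params_standard
```

The loop-based implementation is for clarity only; in practice a framework's grouped-convolution primitive would be used.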
The 3D depthwise separable convolutional network contains three 3D-DW modules, whose parameters differ. After each depthwise convolution and pointwise convolution, batch normalization (BN) is applied, along with the ReLU activation function. Since all the bands corresponding to each pixel of an HSI collectively reflect the features of that pixel, it is necessary to aggregate the information from multiple bands as much as possible when extracting features; accordingly, each of the three 3D-DW modules uses a different kernel size for its 3D depthwise convolution.
In addition, the stride and padding parameters of the depthwise convolution and pointwise convolution are set to 1 and 0, respectively. As a result, the number of channels can be increased without changing the height and width of the input images. Due to the low spatial resolution of HSIs, it is easy to lose small features if the data size is compressed too early. This operation ensures that the receptive field of the convolution kernels does not increase during the 3D convolution and that spectral and spatial dimension information can be aggregated.
The essence of 3D depthwise convolution is still 3D convolution. For a pixel at spatial position $(x, y, z)$ in the $j$-th feature map of the $i$-th layer, we assume that the activation value $v_{i,j}^{x,y,z}$ is expressed as follows [22]:

$$v_{i,j}^{x,y,z} = f\!\left(b_{i,j} + \sum_{m} \sum_{h=0}^{H_{i}-1} \sum_{w=0}^{W_{i}-1} \sum_{r=0}^{R_{i}-1} w_{i,j,m}^{h,w,r} \, v_{(i-1),m}^{(x+h),(y+w),(z+r)}\right) \tag{12}$$
where $f(\cdot)$ represents the ReLU activation function, $b_{i,j}$ represents the bias parameter for the $j$-th feature map of the $i$-th layer, $m$ indexes the feature maps of the $(i-1)$-th layer that are connected to the $j$-th feature map of the $i$-th layer, $w_{i,j,m}^{h,w,r}$ is the value at position $(h, w, r)$ of the kernel connected to the $m$-th feature map, $W_{i}$ is the width of the convolution kernel, $H_{i}$ is the height of the convolution kernel, and $R_{i}$ is the depth of the convolution kernel along the spectral dimension.
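The activation rule above can be written out directly for a toy case; all sizes and values below are illustrative:

```python
import numpy as np

def relu(t):
    return max(t, 0.0)

def activation(prev, w, b, x, y, z):
    """Compute the activation v at position (x, y, z) of one output feature map.

    prev: (M, D, Hp, Wp) activations of the previous layer (M feature maps,
          D spectral positions, Hp x Wp spatial extent),
    w:    (M, R, H, W) kernel connecting all M maps to this output map,
          where R, H, W are the kernel's spectral depth, height, and width,
    b:    scalar bias for this output feature map.
    """
    M, R, H, W = w.shape
    s = b
    for m in range(M):                       # sum over connected feature maps
        for r in range(R):                   # spectral offset
            for h in range(H):               # height offset
                for q in range(W):           # width offset
                    s += w[m, r, h, q] * prev[m, z + r, x + h, y + q]
    return relu(s)

rng = np.random.default_rng(2)
prev = rng.standard_normal((2, 5, 6, 6))     # M=2 maps, 5 bands, 6x6 spatial
w = rng.standard_normal((2, 3, 3, 3))        # 3x3x3 kernel per input map
v = activation(prev, w, b=0.1, x=1, y=2, z=0)
print(v)
```

For a depthwise layer, the sum over $m$ would collapse to a single input map; the full sum shown here is the general 3D convolution of Equation (12).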