1. Introduction
With the continuous development of computer vision technology, machine learning provides a variety of techniques and tools for identifying and extracting important symmetric features in remote sensing images [1,2,3]. However, unlike natural images, remote sensing images have a wide imaging range and complex, diverse backgrounds, and they present more spectral channels and more complex image structures [4]. Class imbalance and the presence of small objects are two factors that degrade semantic segmentation performance on remote sensing images [5].
Convolutional neural networks (CNNs) have been successfully applied to many semantic segmentation tasks [6,7]. The classical semantic segmentation models and their contributions are shown in Table 1. Great efforts have been made to apply deep learning methods to the segmentation of remote sensing data [8,9]. Compared with natural image datasets, remote sensing image datasets have higher intraclass variance and lower interclass variance, making the labeling task difficult [10]. To deal with the special data structure of remote sensing images, Geng et al. [11] extended the long short-term memory (LSTM) network [12] to extract contextual relationship information, where the LSTM learns latent spatial correlations. Mou et al. [13] and Tao et al. [14] designed a spatial relationship module and a spatial information inference structure, respectively, to build more effective contextual spatial relationship models. To better acquire long-range contextual and location-sensitive information among features in remote sensing images, the multiscale module is improved in this paper. ASPP [15] uses dilated convolution to enlarge the receptive field while controlling the number of parameters. However, when remote sensing images contain objects of widely varying size, the pyramid pooling module cannot capture small objects well [16]. To address the uneven data distribution among different labels in image segmentation, some researchers have incorporated symmetry into deep learning models and architectures [17,18,19,20]. Lv et al. [21] proposed a new way to detect and track objects at night, inspired by symmetric neural networks, which uses algorithms to enhance object features together with location and appearance information. Park et al. [22] proposed a symmetric graph convolutional autoencoder that produces a low-dimensional latent representation from a graph. These approaches not only help balance the data distribution but also reduce model complexity. To address unbalanced classes in remote sensing image datasets, Kampffmeyer et al. [23] and Kemker et al. [24] modified the cross-entropy loss function by introducing different weighting mechanisms. Inspired by this idea, we propose a new loss function which effectively improves the Jaccard index.
Moreover, attention mechanisms have been successfully applied in semantic segmentation [30,31] over the past few years: introducing an attention mechanism into a semantic segmentation model allows it to better focus on meaningful image features [32]. In CNNs, channel attention [33] is usually implemented after each convolution [34], while spatial attention is typically implemented at the end of the network [35]. As a symmetric semantic segmentation model, U-Net can capture the contextual information of an image while accurately locating segmentation boundaries. In U-Net-based networks, channel attention is usually added in each layer of the upsampling part [36]. However, channel attention only considers interchannel information and ignores location information, which is crucial for capturing the object structure of remote sensing images [37]. To enhance the perception of informative channels and important regions, Woo et al. [38] proposed the convolutional block attention module (CBAM), which links channel attention and spatial attention in tandem. However, convolution can only capture local relationships and ignores relational information between distant objects. Therefore, Hou et al. [39] proposed a coordinate attention mechanism that embeds location information into channel attention and successfully applied it to the semantic segmentation of natural images.
The above deep learning methods for class imbalance and small objects in remote sensing images do not fully utilize the spatial feature information and location-sensitive information present at different scales. In this paper, a novel semantic segmentation network, CAS-Net, is proposed, which integrates coordinate attention and SPD-Conv [40] layers for remote sensing images. CAS-Net adopts SPD-Conv to adjust the backbone network, reducing the loss of fine-grained information and improving the learning efficiency of feature information. In the feature extraction stage, a coordinate attention mechanism enables the model to capture direction-aware and position-sensitive information simultaneously, so that small objects can be located more accurately at multiple scales. In addition, the Dice coefficient is introduced into the cross-entropy loss function, enabling the model to directly maximize the region overlap ratio and mitigate the accuracy problems caused by class imbalance.
The main contributions of this paper are as follows:
- (1)
A new segmentation method for small objects in remote sensing images is proposed. We construct an asymmetric encoder–decoder structure based on ResNet101 with added SPD-Conv layers, which reduces the loss of fine-grained information and improves the segmentation accuracy of small objects.
- (2)
We adopt the coordinate attention mechanism in the feature extraction stage to obtain more direction-sensitive and position-sensitive feature information in remote sensing images and improve the segmentation accuracy of object edges.
- (3)
We introduce the Dice coefficient into the cross-entropy loss function; it reflects the degree of overlap between the predicted and true regions when classes are extremely unbalanced, reducing the accuracy problems caused by class imbalance.
3. Method
To better capture location and spatial information during the semantic segmentation of remote sensing images, reduce the accuracy problems caused by the unbalanced distribution of classes in the dataset, and improve the segmentation of small objects, a new remote sensing image semantic segmentation network with coordinate attention and SPD-Conv is proposed in this paper. ResNet101 [50] has powerful generalization ability: the residual connections in ResNet forcibly break the symmetry of the network, improve its representational capacity, and allow more effective feature information to be extracted. To improve the classification accuracy of small objects, detailed information is captured and fed into the multiscale module together with low-level feature information in the feature extraction stage; we adapted the ResNet101 network by incorporating SPD-Conv into the original network and adding a nonstrided convolutional layer to the downsampled feature maps. As the intraclass variance of remote sensing images is high while the interclass variance is low, we added the coordinate attention mechanism to the generic ASPP module to encode channel relationships and long-range contextual information with precise locations, thus effectively improving segmentation accuracy. To address the class imbalance of ground objects, the Dice coefficient was introduced into the cross-entropy loss function, and the training results were optimized by gradient descent. The proposed semantic segmentation model comprises a feature extraction module, a pyramid pooling module based on the coordinate attention mechanism (CA-ASPP), and an upsampling module. The entire model is an asymmetric encoder–decoder structure trained in an end-to-end manner. The overall structure of the proposed model is depicted in Figure 1.
3.1. Feature Extraction Module
ResNet uses shortcut connections to solve the problem of gradient disappearance, and the ResNet-101 structure was chosen as the backbone network for this paper. In the feature extraction module, an asymmetric convolution structure was constructed based on the ResNet101 backbone network to obtain receptive fields suited to various ground objects. ResNet was originally designed for image classification tasks, where feature extraction can safely discard detailed information. Unlike image classification, segmentation tasks, especially the semantic segmentation of remote sensing images, require detailed feature information to be retained. Therefore, on the basis of a network containing asymmetric convolution structures, we modified the ResNet-101 structure. ResNet-101 uses four convolutions with a stride of 2 and a maximum pooling layer. We added SPD-Conv to the ResNet-101 structure by replacing these four strided convolutions with SPD-Conv. As remote sensing images, even after cropping, are typically much larger than 64 × 64 pixels, the maximum pooling layer was retained. SPD-Conv is comprised of a space-to-depth (SPD) layer followed by a nonstrided convolution layer. The SPD layer is an image transformation applied throughout the downsampling stages of ResNet-101. Given an intermediate feature map X of size S × S × C1, the SPD layer first slices X into a series of sub-feature maps, each obtained by sampling X with stride scale along both spatial dimensions, so that each submap has size (S/scale) × (S/scale) × C1; the submaps are then concatenated along the channel dimension to obtain the feature map X' of size (S/scale) × (S/scale) × (scale^2 × C1).
. The structure of the SPD layer is shown in
Figure 2.
Figure 2a–d gives an example with scale = 2. After the SPD feature transformation layer, a nonstrided (i.e., stride = 1) convolution layer with C2 filters is added, where C2 < scale^2 × C1, as shown in Figure 2e.
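As a concrete illustration, the space-to-depth rearrangement described above can be sketched in a few lines of NumPy. The function name and the H × W × C array layout are our own choices for illustration, not part of the original SPD-Conv implementation:

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange an (H, W, C1) feature map into (H/scale, W/scale, scale**2 * C1).

    Each submap samples x with stride `scale` along both spatial axes; the
    submaps are then concatenated along the channel dimension, so no
    information is discarded (unlike a strided convolution).
    """
    h, w, c = x.shape
    assert h % scale == 0 and w % scale == 0, "spatial dims must be divisible by scale"
    subs = [x[i::scale, j::scale, :] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=-1)

# Example: a 4 x 4 x 3 map becomes 2 x 2 x 12 with scale = 2
x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
print(space_to_depth(x).shape)  # (2, 2, 12)
```

A nonstrided convolution with C2 < scale^2 × C1 filters would then follow this layer, as in Figure 2e.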
To retain as much valuable feature information as possible, a nonstrided convolution layer is added after the SPD layer. As there is a downsampling module before the first residual block in each stage of the ResNet-101 structure, the strided 1 × 1 convolution in the shortcut path of this downsampling module maps the input shape to the output shape of the other path; this process leads to half of the information in the feature map being ignored. To solve this problem, we added a pooling layer before the 1 × 1 convolution. The improved ResNet-101 structure is detailed in Table 2. The feature information obtained in the feature extraction stage is then input into the multiscale module for the next operation.
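To see why the extra pooling layer helps, compare a stride-2 shortcut, which simply drops spatial positions, with average pooling followed by a stride-1 projection, where every input pixel contributes. The toy functions below are a hypothetical NumPy sketch that ignores the learned 1 × 1 convolution weights and models the projection as an identity:

```python
import numpy as np

def strided_shortcut(x):
    # A 1 x 1 convolution with stride 2 reads only every other row and
    # column, so most spatial positions never influence the output.
    return x[::2, ::2, :]

def pooled_shortcut(x):
    # 2 x 2 average pooling first: every input pixel contributes to the
    # output before the (here identity-like) 1 x 1 projection.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(strided_shortcut(x)[0, 0, 0])  # 0.0 -> pixels 1, 4, 5 are ignored
print(pooled_shortcut(x)[0, 0, 0])   # 2.5 -> mean of 0, 1, 4, 5
```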
3.2. Multiscale Module Based on Coordinate Attention Mechanism (CA-ASPP)
The feature information obtained from the feature extraction module is input into the multiscale module. Atrous convolution can expand the receptive field of the convolution kernel without downsampling, so the resolution of the remote sensing images is not lost. The upsampling part of ASPP in DeepLabV3+ has been improved by transforming the 8× upsampling operation into 2× and 4× upsampling operations, providing richer semantic information and improving feature edge segmentation. For this paper, the ASPP module in DeepLabV3+ was selected as the multiscale module. The Squeeze-and-Excitation (SE) channel attention mechanism and CBAM are the most popular attention mechanisms. SE ignores spatial features, and CBAM adds convolution on top of channel pooling to weight the spatial dimension; however, convolution cannot capture relevant information about distant objects. Coordinate attention (CA) uses global average pooling (GAP) to calculate channel attention weights and then globally encodes spatial information, allowing correlation information among distant objects to be captured effectively. Therefore, CA was added to the end of ASPP to improve the model's acquisition of direction-aware and position-sensitive features of objects in remote sensing images, preserving long-distance dependencies and accurate position information along the vertical (H) and horizontal (W) directions. First, global pooling is performed for each channel over the spatial dimensions; the expression for the lth channel is given as Equation (1):

$z_l = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_l(i, j)$, (1)

where $z_l$ is the output of channel l, and $x_l(i, j)$ denotes the value of channel l at the coordinate $(i, j)$.
To avoid the loss of spatial information, the global pooling is decomposed along the horizontal and vertical directions, and the outputs of the lth channel at height h (vertical direction) and at width w (horizontal direction) are expressed in Equations (2) and (3), respectively:

$z_l^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_l(h, i)$, (2)

$z_l^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_l(j, w)$. (3)
The decomposed feature maps are encoded to obtain two attention weights, and the aggregated features are concatenated and transformed, as expressed in Equation (4):

$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$, (4)

where $[\cdot, \cdot]$ denotes concatenation along the spatial dimension, $\delta$ is the nonlinear activation function, $F_1$ is the convolutional transformation function with a convolutional kernel of 1 × 1, and $f \in \mathbb{R}^{(C/r) \times (H + W)}$, where $r$ is used to control the block size reduction ratio.
The intermediate feature $f$ is split independently along the spatial dimension into two tensors, $f^h \in \mathbb{R}^{(C/r) \times H}$ and $f^w \in \mathbb{R}^{(C/r) \times W}$, and two 1 × 1 convolutional transformation functions $F_h$ and $F_w$ transform $f^h$ and $f^w$ back to the same number of channels as the input. The expressions are as shown in Equations (5) and (6), respectively, where $\sigma$ is the sigmoid function:

$g^h = \sigma\left(F_h\left(f^h\right)\right)$, (5)

$g^w = \sigma\left(F_w\left(f^w\right)\right)$. (6)
Finally, the two attention weights are multiplied with the input feature to obtain the final attention output, which allows different objects to maintain relevant information at a distance. The expression is given in Equation (7):

$y_l(i, j) = x_l(i, j) \times g_l^h(i) \times g_l^w(j)$. (7)
The structure of the CA module is shown in
Figure 3. In this paper, a coordinate attention mechanism is added after the ASPP module to acquire direction-aware and position-sensitive features without losing long-range contextual information in remote sensing images.
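Equations (1)–(7) can be condensed into a minimal NumPy sketch. For brevity, this hypothetical version omits the learned transformations F1, Fh, and Fw of Equations (4)–(6) and applies the sigmoid gates directly to the pooled statistics; a real implementation would insert the 1 × 1 convolutions and nonlinearity between the pooling and gating steps:

```python
import numpy as np

def coordinate_attention(x):
    """Simplified coordinate attention for a (C, H, W) feature map."""
    c, h, w = x.shape
    z_h = x.mean(axis=2)              # Eq. (2): pool along width  -> (C, H)
    z_w = x.mean(axis=1)              # Eq. (3): pool along height -> (C, W)
    g_h = 1.0 / (1.0 + np.exp(-z_h))  # Eq. (5): sigmoid gate (learned transform omitted)
    g_w = 1.0 / (1.0 + np.exp(-z_w))  # Eq. (6): sigmoid gate (learned transform omitted)
    # Eq. (7): reweight each position by its row gate and its column gate
    return x * g_h[:, :, None] * g_w[:, None, :]

rng = np.random.default_rng(0)
x = rng.random((3, 4, 5))
y = coordinate_attention(x)
print(y.shape)  # (3, 4, 5): attention preserves the feature map shape
```

Because the two gates are indexed by row and by column separately, every output position is modulated by statistics gathered across an entire row and an entire column, which is how long-range positional context enters the channel weighting.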
3.3. Loss Function
The most common loss function used in image semantic segmentation tasks is the cross-entropy loss function, which assigns weights to different classes and performs well on datasets with relatively balanced classes. However, remote sensing images are typically large, so the classes of a remote sensing image dataset tend to be more unbalanced, and each batch of data may not contain all of the classes, which may lead to training errors. The Dice coefficient directly maximizes the region overlap ratio and improves model performance in the case of extremely unbalanced classes. Therefore, in this paper, the Dice coefficient is introduced into the cross-entropy loss function to solve the problem of class imbalance during model training. This was achieved by combining the cross-entropy loss function and the Dice coefficient according to the number of pixels of each class in a batch of images, using the weighted cross-entropy loss function ($L_{WCE}$) and the generalized Dice loss ($L_{GD}$), expressed as follows:

$w_k = \frac{1}{N_k^2}$,

where $N_k$ is the number of pixels in category $k$ [51], and $w_k$ is the weight of category $k$.
$L_{WCE} = -\sum_{n=1}^{N} \sum_{k=1}^{K} w_k \, g_n^k \log\left(p_n^k\right)$,

where $p_n^k$ and $g_n^k$ denote the predicted and true values of image element $n$ out of the total number $N$ of image elements, respectively, and $K$ is the number of categories.
$L_{GD} = 1 - 2 \cdot \frac{\sum_{k=1}^{K} w_k \sum_{n=1}^{N} g_n^k p_n^k}{\sum_{k=1}^{K} w_k \sum_{n=1}^{N} \left(g_n^k + p_n^k\right)}$,

where $g_n^k$ and $p_n^k$ denote the true and predicted values of pixel $n$ for category $k$, respectively.
The weighted cross-entropy loss function (WCELoss) and the generalized Dice loss are combined as follows:

$L = L_{WCE} + L_{GD}$.
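The combined loss can be sketched directly in NumPy. This is an illustrative version under our own conventions (predictions p and one-hot labels g as (N, K) arrays, per-class weights w derived from squared pixel counts, and a small epsilon for numerical stability), not the authors' exact training code:

```python
import numpy as np

EPS = 1e-7

def wce_loss(p, g, w):
    """Weighted cross-entropy; p, g: (N, K) predictions / one-hot labels, w: (K,)."""
    return -np.mean(np.sum(w * g * np.log(p + EPS), axis=1))

def generalized_dice_loss(p, g, w):
    """Generalized Dice loss with per-class weights w."""
    inter = np.sum(w * np.sum(p * g, axis=0))
    union = np.sum(w * np.sum(p + g, axis=0))
    return 1.0 - 2.0 * inter / (union + EPS)

def combined_loss(p, g, w):
    return wce_loss(p, g, w) + generalized_dice_loss(p, g, w)

# Perfect predictions drive both terms toward zero; a uniform
# (uninformative) prediction is penalized much more heavily.
g = np.eye(3)[[0, 1, 2, 0]]            # 4 pixels, 3 classes, one-hot labels
w = 1.0 / (g.sum(axis=0) ** 2 + EPS)   # class weights from squared pixel counts
print(combined_loss(g, g, w))                       # ~0
print(combined_loss(np.full((4, 3), 1 / 3), g, w))  # clearly larger
```

Because each $w_k$ shrinks with the squared pixel count of class $k$, rare classes contribute to both loss terms on a more equal footing with dominant ones, which is the intended countermeasure to class imbalance.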
5. Conclusions
In this paper, we proposed a new semantic segmentation model addressing class imbalance and small-object segmentation in remote sensing images. The model improves the backbone network with SPD-Conv and adds the coordinate attention mechanism to the multiscale module. The SPD-Conv layer added to the original network reduces the loss of fine-grained information and retains detailed feature information, thus improving the segmentation accuracy for small objects. A multiscale module based on the coordinate attention mechanism is proposed; the coordinate attention mechanism enables the model to capture the relationships between different objects and avoid the loss of spatial relationship information in remote sensing images. We combine the Dice loss and cross-entropy loss functions to effectively mitigate the accuracy problems caused by class imbalance during training. The experimental results show that, on the ISPRS Vaihingen and ISPRS Potsdam datasets, the F1-scores for trees reached 87.73% and 88.97%, respectively, and the F1-scores for cars reached 80.78% and 81.46%, respectively. The overall accuracy on the two datasets reached 89.83% and 89.94%, respectively. These results show that CAS-Net can effectively improve the segmentation accuracy and OA for small and medium objects in remote sensing images, improving on existing semantic segmentation models.
Remote sensing images usually contain a great many symmetrical objects; in future research, we will aim to make full use of the symmetry of ground objects and reduce the number of parameters on the basis of symmetric quantization, moving toward a lightweight model while maintaining the demonstrated performance.