1. Introduction
In recent years, semantic segmentation models for high-resolution remote sensing images have emerged one after another, which has promoted the development of many applications, including precision agriculture, natural disaster assessment, and urban planning [1,2,3,4,5]. Many traditional semantic segmentation methods have achieved good results, but with the development of deep learning, deep-learning-based semantic segmentation models have become more applicable to actual situations because of their stronger timeliness. As the pioneering work of semantic segmentation, the fully convolutional network (FCN) [6] opened up a new path for image segmentation. Subsequently, encoding and decoding structures, multi-scale feature extraction blocks, and attention-based blocks appeared in semantic segmentation models to improve performance. Each structure has its advantages and disadvantages, and some models even combine them to boost the performance of semantic segmentation for remote sensing images.
Since its emergence, the FCN model has outperformed traditional image segmentation methods [7,8] in terms of segmentation accuracy and time consumption. FCN achieved good results, but because many convolution operations are used to extract features, much primary feature information is lost, so skip connections are used to achieve feature compensation. Models with encoding and decoding structures have also gradually emerged. Encoding means that the resolution of the feature maps is gradually reduced while features are extracted with convolution and pooling operations, and decoding means that the resolution of the feature maps is gradually restored through upsampling operations, so that in the end the model produces input and output images of the same size. At present, much research [9,10,11] still uses the encoding and decoding structure, whose advantage is that it can alleviate the loss of feature map information. For instance, UNet [12] was a classic encoding and decoding model that used skip connections to fuse features before and after feature extraction to compensate for missing important features. DeconvNet [13] used VGGNet [14] with the fully connected layers removed as a feature extractor in an encoding and decoding structure, and it employed deconvolution operations in the decoding stage to restore the resolution of the feature map, which alleviated the problem of missing features; however, the model complexity increased considerably. LinkNet [15] also adopted the encoding and decoding structure to meet the real-time requirements of semantic segmentation, and it reduced the number of model parameters and the time consumption of the model by increasing the stride of the convolution operation.
The multi-scale feature extraction block has been widely used in semantic segmentation models because it is effective at mining continuous context information, and it aims to enhance the model's ability to recognize objects of different scales in the image. DeepLabV3+ [16] used dilated convolution to achieve feature extraction at different scales, and PSPNet [17] used parallel pooling at different scales to extract key features of different types of ground objects, thereby increasing the segmentation performance of the model. DFCNet [18] adopted multi-scale convolution operations to widen the model and increase the variety and abundance of the extracted features, and it also fused multi-modal data to refine the segmentation result map. Hoin et al. [19] proposed a model that uses features from different decoding stages to enhance the features of the holistically nested edge detection unit, finally achieving feature fusion at different scales to enhance the generalization ability of the model. MSA-UNet [20] combined a multi-scale feature extraction module with the UNet model and utilized a feature pyramid module to refine object edges. The sub-modules of RCCT-ASPPNet [21] were a dual decoding structure and atrous spatial pyramid pooling (ASPP), which cross-fused global and local semantic features of different scales, further promoting the segmentation performance of the model.
The attention mechanism module is derived from the study of human vision and aims to focus more on the key areas of an image than on other areas. In the field of computer vision, the purpose of the attention mechanism module is to enhance the weight of salient features and reduce the weight of noise and useless information in the image so as to extract the salient features of the image. Recently, many studies [22,23,24,25] have been proposed based on the attention mechanism. For instance, SE-Block [26] was a classic attention mechanism method. Its motivation was to explicitly establish the interdependence between feature channels: the weight of each channel is obtained automatically through model learning, and according to this weight, useful features are promoted and features that are useless for the current task are suppressed. CBAM-Block [27] was an attention mechanism module that combined spatial and channel information; compared with SE-Block, which only focused on channel attention, it achieved better performance. In MCAFNet [28], features were extracted through the feature extractor, a global-local transformer block mined the context dependencies in the image, and a channel optimization attention module mined the context dependencies between feature map channels, thereby increasing the expressiveness of the features. Zhang et al. [29] proposed a model whose semantic attention (SEA) submodule consisted of a fully convolutional network; this module enhanced the stimulation of regions of interest in the feature map and suppressed useless information and noise, while the scale complementary mask branch (SCMB) module realized feature extraction at different scales and made full use of multi-scale features. MsanlfNet [30] contained two sub-modules, multiscale attention (MSA) and nonlocal filter (NLF): the former enhanced the expressive ability of features through multi-scale feature attention, the latter captured the dependence of global context information, and together these modules improved the performance of the model.
In summary, models with encoding and decoding structures, multi-scale feature extraction blocks, and attention mechanism blocks have significantly improved the accuracy of semantic segmentation of remote sensing images, but these models do not perform obvious refinement operations on the feature maps. In addition, they do not thoroughly mine the contextual dependencies of the position information and channel information in the feature map. Therefore, in the proposed hierarchical refinement residual network (HRRNet), a channel attention module (CAM) and a pooling residual attention module (PRAM) are put forward to fully exploit the contextual dependencies of the feature map positions and channels, thus enhancing the deep expressive ability of features, while the fusion of features realizes the refinement of the feature maps. In addition, the attention block, the pooling residual structure in PRAM, and the residual structure between the CAM and PRAM modules significantly promote the performance of the network.
The main contributions of the proposed HRRNet are summarized as follows:
- (1) The proposed CAM and PRAM sub-modules of HRRNet can fully exploit the contextual dependencies of the feature map position information and of the context information between channels, respectively, to enhance the deep expressive ability of features.
- (2) Using ResNet50 as a feature extractor, the layered fusion of features extracted at different stages and different scales realizes the refinement of the feature map, and the fusion of multi-scale features also enhances the model's ability to recognize various types of ground objects and promotes the generalization ability of the model.
- (3) By setting different residual structures, the correlation between gradient and loss in the model training process is improved, which enhances the learning ability of the network and alleviates the problem of gradient disappearance.
3. Proposed Network Model
The flowchart of HRRNet is shown in Figure 1. HRRNet consists of a backbone (the feature extraction stage), four attention blocks, and five decoders. In this study, ResNet50 is used as the feature extractor, also known as the backbone. An attention block is composed of CAM, PRAM, and residual structures, which enhance the feature expression ability and aim to mine the context dependencies in the feature maps. A decoder consists of convolution, activation function, and deconvolution operations, and its function is to change the number of channels of the feature map and perform upsampling. The convolution operation is defined as:
O = (I + 2P − k) / s + 1

where O stands for the dimensions of the output feature map, I represents the dimensions of the input feature map, k is the size of the convolution kernel, s stands for the stride during the convolution operation, and P represents the padding, i.e., the number of pixels added to the edge of the feature map matrix. Specifically, the input of HRRNet is an image with a size of 3 channels × 256 × 256, and ResNet50 is used as the backbone to extract features. The features of the four stages are defined as S1, S2, S3, and S4, and their numbers of channels and sizes are 256 channels and 64 × 64, 512 channels and 32 × 32, 1024 channels and 16 × 16, and 2048 channels and 8 × 8, respectively. Then, S4 is fed to the attention block (AB), and the output feature matrix is defined as A4. Next, A4 is input to Decoder4, and the number of channels and the size of the output feature matrix are 1024 channels and 16 × 16; this output is defined as D4. Then, the corresponding elements of D4 and the S3 feature map are calculated by the sum operation to realize the fusion of the feature maps, which is defined as the feature matrix F3:

F3 = D4 ⊕ S3

where ⊕ denotes the element-wise summation of corresponding elements. Then, F3 is fed to CAM and PRAM, and the output feature map matrices are C3 and P3, respectively. The output attention matrix of the attention block is defined as A3:

A3 = C3 ⊕ P3 ⊕ F3
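The convolution output-size relation above can be checked numerically by tracing a 256 × 256 input through the downsampling layers of ResNet50. The sketch below assumes the standard ResNet50 layout (7 × 7 stride-2 stem convolution, 3 × 3 stride-2 max pooling, then stride-2 transitions between stages), which this section does not spell out:

```python
def conv_output_size(i: int, k: int, s: int, p: int) -> int:
    """Output spatial size of a convolution/pooling layer:
    O = floor((I + 2P - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

# Trace a 256 x 256 input through ResNet50's downsampling layers
# (standard layout assumed, not taken from the paper).
size = conv_output_size(256, k=7, s=2, p=3)   # stem conv: 256 -> 128
size = conv_output_size(size, k=3, s=2, p=1)  # max pool:  128 -> 64 (stage 1)
stage_sizes = [size]
for _ in range(3):                            # stride-2 transitions: 32, 16, 8
    size = conv_output_size(size, k=3, s=2, p=1)
    stage_sizes.append(size)
print(stage_sizes)  # [64, 32, 16, 8]
```

The resulting resolutions match the stage sizes 64 × 64, 32 × 32, 16 × 16, and 8 × 8 stated above.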
3.1. Illustration of the Proposed HRRNet
In addition, A3, A2, and A1 also achieve feature fusion with the previous stage in the same way to complete the refinement of the feature matrix, and the network finally outputs 3 channels and 256 × 256. The training details of the HRRNet network are shown in Algorithm 1.
Algorithm 1 Train an optimal HRRNet model.
Input: Input a set of images and their labels. Output: Get the segmentation results of the test set.
- 1: Initialize batch_size to be 5, set the learnable parameters' weight attenuation, set the number of maximum iterations m, choose Adam as the optimizer, and use the cross-entropy loss function (19);
- 2: Preprocess the high-resolution images and their labels to a fixed size;
- 3: Start training the HRRNet network;
- 4: for i = 1 to m do
- 5: Define the features output from the four stages of ResNet50 as S1, S2, S3, and S4;
- 6: Pass these features through the attention block, and define the enhanced features as A1, A2, A3, and A4;
- 7: Fuse these feature maps step by step to obtain the final prediction result map;
- 8: Compute the loss between the prediction result and the label according to the loss function, update the parameters, and obtain the current training model;
- 9: Validate the saved weights with the validation set;
- 10: Save the model when there are better validation results;
- 11: end for
- 12: Test the optimal HRRNet model on the test set to get the experimental results.
3.2. Channel Attention Module
Figure 2 shows the channel attention module (CAM) of the proposed HRRNet. Assume that M and C represent the dimension of the input feature matrix and the number of channels, respectively, where H and W represent the height and width of the input feature matrix. Assume that we input a feature matrix F. CAM generates the corresponding feature matrices Q, K, and V by operating on the different branch feature matrices as:

Q = R_Q(F), K = R_K(F), V = R_V(F)

where R_Q, R_K, and R_V represent the reshape operation on the feature map F in the different branches. It is worth noting that the reshape operations performed on the feature map matrix F in each stage of the CAM process are represented by R_Q, R_K, and R_V, T represents the transpose of the feature map, and I represents the dimension of the transformed feature map channel. Because the Q, K, and V feature maps have the same channel dimensions, we use the same expression.

The feature matrices K and Q are multiplied element by element between channels, where the symbol ⊗ represents the channel element-level multiplication operation, and the similarity between the feature matrices is calculated through this operation. The result of an activation function usually represents the similarity between feature maps. The feature map output by this operation is defined as Y, and the similarity between channels calculated by the activation function is expressed as a weight:

Y = δ(K ⊗ Q^T)

where δ represents the softmax activation function:

δ(x_i) = exp(x_i) / Σ_j exp(x_j)

Here, Y represents the similarity matrix between the feature matrices K and Q, corresponding to the channels. Next, the output of the product operation of Y and the feature matrix V is defined as the feature matrix F′, which is reshaped to have the same channel number and size as the input feature map F. Finally, the summation operation is performed to output the result X:

X = R(Y ⊗ V) ⊕ F
It is worth noting that the input and output of CAM are feature maps of the same dimension and size. Through this series of operations, the contextual dependencies between feature map channels are fully exploited, and the representation ability of salient features is enhanced.
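The channel attention computation described above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the paper's implementation: the three reshape branches, the softmax over channel similarities, and the residual sum follow the description above, while the exact branch operations are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(f):
    """Channel-attention sketch in the spirit of CAM (NumPy stand-in).
    f has shape (C, H, W); the output has the same shape."""
    c, h, w = f.shape
    q = f.reshape(c, h * w)          # Q: (C, I), I = H * W
    k = f.reshape(c, h * w)          # K: (C, I)
    v = f.reshape(c, h * w)          # V: (C, I)
    y = softmax(k @ q.T, axis=-1)    # (C, C) channel-similarity weights
    out = (y @ v).reshape(c, h, w)   # re-weighted channels, back to (C, H, W)
    return out + f                   # residual sum with the input

x = np.random.rand(4, 8, 8).astype(np.float32)
out = channel_attention(x)
```

As stated in the text, the input and output shapes are identical, so the module can be inserted anywhere in the network without changing tensor dimensions.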
3.3. Pooling Residual Attention Module
Figure 3 shows the PRAM of the proposed HRRNet. Assume that M and C represent the dimension of the input feature matrix and the number of channels, respectively, where H and W represent the height and width of the input feature matrix. The feature matrix X is output from the CAM module. PRAM implements a 1 × 1 convolution operation on the different branch feature matrices to generate the corresponding feature matrices A and B:

A = R(f_A(X)), B = R(f_B(X))

where f_A represents the 1 × 1 convolution operation performed on the X feature map matrix to generate the A feature map matrix, f_B represents the 1 × 1 convolution operation performed on the X feature map matrix to generate the B feature map matrix, R represents the reshape operation of the feature map, T is defined as the transpose of the feature matrix, and N is defined as the channel dimension of the transformed feature matrix. Because the A and B feature matrix channel dimensions are the same, we use the same expression.

The feature matrices A and B are multiplied element by element between channels, and, after the output is passed through the activation function δ, the final output feature matrix Z is obtained. The activation function is thus used to calculate the similarity of the feature map positions:

Z = δ(A^T ⊗ B)

Next is the operation of the third branch on the feature matrix X, where H and W are the height and width of the feature map X. Specifically, the average pooling operation is first performed on the feature map X, and the resulting feature map is defined as P. The correspondence between the original feature map X and the pooled feature map P is defined as:

p_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)

where p_c is the value of each element of the feature map after average pooling, c ∈ {1, ..., C}, and x_c is the cth channel of the feature map. By performing bilinear interpolation, the P feature map is upsampled to obtain U. The purpose is to perform the summation between corresponding channels with the feature map X to obtain the feature map E:

E = U ⊕ X

Next, we perform a 1 × 1 convolution operation on E and the reshape operation, which is defined as:

G = R(f_E(E))

Finally, the G and Z feature maps are multiplied between corresponding elements, and the feature map E is summed to output the feature map M:

M = (G ⊗ Z) ⊕ E
4. Experiments and Results
4.1. Dataset
The validity of the model is verified on two datasets, ISPRS Vaihingen and Potsdam. The first is the Vaihingen dataset. It is composed of 33 tiles with an average size of 2500 × 2100, and the ground resolution is 9 cm. Each tile consists of four channels, red, green, blue, and infrared (RGB-IR), and a digital surface model (DSM) is also provided in the Vaihingen dataset. The ground truth has six categories: buildings, impervious surfaces, low vegetation, tree, clutter, and car. For assessment, the 17 ground truth images are divided into three groups: 11 images are used as the training set, two images are used for the validation set, and four images are used for the test set.
The second is the Potsdam dataset. It is composed of 38 tiles with an average size of 6000 × 6000, and the ground resolution is 5 cm. The tiles are RGB-IR images with four channels, and the Potsdam dataset also has a DSM. The ground truth of the Potsdam dataset has the same categories as that of the Vaihingen dataset. For assessment, the 24 ground truth images are divided into three groups: 19 images are used as the training set, two images are used for the validation set, and three images are used for the test set.
4.2. Dataset Preprocessing and Evaluation Metrics
In high-resolution remote sensing images, the distribution of the various categories of ground objects is chaotic, so labeling the datasets is very difficult, which leads to a small number of annotated datasets. Therefore, the training sets of Vaihingen and Potsdam use random flipping and mirroring for data augmentation in order to expand the amount of data. We also employed test time augmentation (TTA) with flipping and mirroring of the images. In this study, the albumentations library was adopted to implement the Vaihingen and Potsdam data augmentation. After augmentation, the images of the training sets were normalized to [0, 1]. It is worth noting that the other models use the same data augmentation operations.
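The flipping and mirroring augmentation can be sketched as follows. This is a NumPy stand-in for illustration (the study itself uses the albumentations library); the key point is that the same random transform must be applied to the image and its label map:

```python
import numpy as np

def random_flip_mirror(image, label, rng):
    """Apply random horizontal/vertical flips identically to an image
    of shape (H, W, C) and its label map of shape (H, W)."""
    if rng.random() < 0.5:                      # horizontal mirror
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                      # vertical flip
        image, label = image[::-1, :], label[::-1, :]
    return np.ascontiguousarray(image), np.ascontiguousarray(label)

def normalize(image):
    """Scale 8-bit pixel values to [0, 1], as done for the training sets."""
    return image.astype(np.float32) / 255.0

rng = np.random.default_rng(42)
img = np.arange(2 * 2 * 3, dtype=np.uint8).reshape(2, 2, 3)
lab = np.arange(4).reshape(2, 2)
aug_img, aug_lab = random_flip_mirror(img, lab, rng)
```

Flips only permute pixels, so the set of label values (and hence the class distribution) is unchanged by the augmentation.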
The performance of the models on the Vaihingen and Potsdam datasets is verified by the mean intersection over union (mIoU), the overall accuracy (OA), the F1 score (F1), and the mean F1 score (mF1) indicators, which are calculated based on the confusion matrix as follows.
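A minimal NumPy sketch of how these indicators follow from the confusion matrix, using the standard definitions (cm[i, j] counts pixels of true class i predicted as class j):

```python
import numpy as np

def segmentation_metrics(cm):
    """Compute OA, per-class F1, mF1, and mIoU from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)                  # true positives per class
    fp = cm.sum(axis=0) - tp          # false positives per class
    fn = cm.sum(axis=1) - tp          # false negatives per class
    oa = tp.sum() / cm.sum()          # overall accuracy
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    return oa, f1, f1.mean(), iou.mean()

# Toy two-class confusion matrix: 95 of 100 pixels correctly classified.
cm = np.array([[50, 2],
               [3, 45]])
oa, f1, mf1, miou = segmentation_metrics(cm)
print(round(oa, 2))  # 0.95
```

mF1 and mIoU average the per-class scores, so they penalize models that neglect rare classes even when OA is high.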
4.3. Training Details
Before model training started, the initial learning rate was set, and it was reduced to 0.85 times its value after every 15 epochs. In addition, we used Adam as the optimizer for the training model, the polynomial learning rate was set as (1 − (cur_iter/max_iter)), the learnable parameters' weight attenuation was set, and the number of maximum iterations was set. In this study, the model utilized the following loss function, which combines the cross-entropy function with median frequency balancing weights:

L = −(1/N) Σ_{n=1}^{N} Σ_{a} w_a · y_{n,a} · log(p_{n,a}), with w_a = median(f) / f_a

where w_a is the weight for class a, f_a is the pixel frequency of class a, p_{n,a} is the probability of sample n belonging to class a, and y_{n,a} denotes the class label of sample n in class a. For the Vaihingen and Potsdam datasets, the training sets are cropped and augmented to 5000 images of size 256 × 256, and the batch size is set to 5. On the test set, we employed a sliding window (with a size of 448 × 448 and a step size of 100 pixels) and averaged the predicted results of the overlapping patches as the final results.
4.4. Ablation Study
The proposed attention block submodule of HRRNet fully explores contextual dependencies, and the enhanced features of the multiple branches are fused step by step to refine the segmentation results. In order to prove the effectiveness of the attention block, CAM, and PRAM on the experimental results, we designed different experimental settings. First, we added attention blocks to different branches on top of the backbone (ResNet50). It is worth noting that each added attention block was based on the previous model, and regardless of whether there was an attention block on a branch, the multiple stages of feature fusion were performed step by step. Second, with the attention blocks in the four branches unchanged, the CAM, PRAM, and residual structures were added sequentially to obtain experimental results proving their efficiency.
The results of the two groups of ablation experiments on the Vaihingen dataset are shown in Table 1 and Table 2. Table 1 shows the ablation results of the first set on the Vaihingen dataset. It is clear that OA and mF1 increased by 2.84% and 5.23% after adding an attention block on the basis of the backbone, and especially mIoU, which increased by 7.87%. This shows that the attention block greatly improves the performance of the model. In addition, as attention blocks are added, all of the indicators improve, which proves that the attention block can fully explore the contextual dependencies of positions and channels. Finally, the experimental results are optimal when we add the attention block to each branch.
Table 2 shows the results of the second set of ablation experiments. Here, we kept the number of attention blocks unchanged, and the CAM, PRAM, and residual structures were added to the attention block in order. Table 2 shows that OA, mF1, and mIoU increased by 2.95%, 5.19%, and 7.80%, respectively, after adding CAM, which proves that the CAM module fully exploits the contextual dependencies between feature map channels, thus increasing the segmentation performance of the model. After the PRAM module was added to the attention block, the contextual dependencies of the feature map positions could be fully exploited, and OA, mF1, and mIoU increased by 0.21%, 0.41%, and 0.62%, respectively, indicating that the pooling residual structure increases the feature expressive ability. Finally, adding the residual structure to the attention block yields HRRNet; in particular, mIoU increased by a further 0.54%, which proves the effectiveness of the residual structure.
The results of the two sets of ablation experiments on the Potsdam dataset are shown in Table 3 and Table 4. Table 3 shows the results of the first set of ablation experiments. The experimental results show the same trend as the indicators on the Vaihingen dataset: as the number of attention blocks increases, the performance of the model gradually reaches the optimum. Table 4 shows the results of the second group of ablation experiments. It is worth noting that, after adding PRAM, OA, mF1, and mIoU increased by 1.83%, 1.55%, and 2.64%, respectively, which fully proves that the contextual dependencies of the positions can be further exploited to advance the segmentation performance of HRRNet. All in all, the ablation experimental results on the Vaihingen and Potsdam datasets show the same trend, which proves the robustness of the HRRNet model, and it also shows that CAM and PRAM can fully mine the contextual dependencies of the channels and positions in the feature map. In addition, the attention blocks of the multiple branches realize the refinement of the feature map.
4.5. Quantitative Comparison of Different Models
In order to further verify the efficiency of the proposed HRRNet, we also reproduced classic semantic segmentation models and compared them with HRRNet. First, we considered methods based on the attention mechanism, including CBAM-Block [27], SE-Block [26], SK-Block [35], DANet [38], and CoTNet [39], all of which have a common feature: they enhance the expressive ability of features through their modules so that salient features can be extracted. In addition, we considered semantic segmentation models based on multi-scale feature extraction, including DeepLabV3+ [16] and PSPNet [17]; these models used dilated convolution or multi-scale pooling to extract features at different scales to grasp the dependencies of contextual information in images, but they improved model performance by increasing the complexity of the model. Furthermore, peer models designed for semantic segmentation of remote sensing images, such as LANet [40] and SPANet [41], were also chosen for comparison. For the fairness of the experimental results, all models use ResNet50 as the feature extractor.
Table 5 shows the experimental results of the various models on the Vaihingen dataset. It shows that the experimental results of the attention-based models are slightly better than those of the multi-scale models. The OA, mF1, and mIoU of the SK-Block model are 1.87%, 3.22%, and 4.93% higher than those of PSPNet. The purpose of the attention mechanism module is to increase the weight of salient features in the feature map and suppress noise and useless information, while the multi-scale feature extraction models increase the complexity of the model and improve the continuity of context information; nevertheless, the attention mechanism modules obtain better experimental results. In addition, the OA, mF1, and mIoU of the CoTNet model are 2.02%, 3.07%, and 4.68% higher than those of DeepLabV3+. The LANet and SPANet models have in common the fusion of high-level semantic features and shallow features to complement geometric and spatial information, but they differ in their feature enhancement modules: LANet employed pooling to enhance the representation ability of features, while SPANet employed a successive pooling strategy to extract key salient features, so its segmentation of target boundaries was more accurate. HRRNet makes up for the shortcomings of the above models and adopts the fusion of features of different scales at different stages to realize the refinement of the feature maps.
Table 6 shows the experimental results of the various models on the Potsdam dataset. The experimental results of the models based on the attention mechanism are better than those of the multi-scale feature extraction models; this trend is similar to the experimental results on the Vaihingen dataset. DANet performed very well on the Potsdam dataset, where its position attention module and channel attention module played a big role in the performance of the model. CoTNet promoted the performance of the model by further exploring the contextual dependencies in the feature map through the convolution operations of multiple branches and by adding the sum and product operations between the feature maps. Table 6 shows that the OA, mF1, and mIoU of the DANet model are 2.51%, 1.69%, and 2.78% higher than those of CoTNet, but the performance of the DANet model can still be improved. SPANet used a successive pooling strategy to improve model performance, and the OA, mF1, and mIoU of the SPANet model are 0.26%, 0.49%, and 0.88% higher than those of DANet. HRRNet explores the contextual dependencies between positions and channels and uses multi-stage feature map fusion to implement refinement operations, further improving the segmentation performance of the model. The OA, mF1, and mIoU of the HRRNet model are 0.97%, 0.53%, and 0.9% higher than those of SPANet. In addition, the F1 scores of the surface, building, and Low-veg categories are 1.14%, 0.98%, and 0.86% higher than those of SPANet.
For the experimental results, the OA, mF1, and mIoU of the HRRNet model are 0.98%, 1.71%, and 2.69% higher than those of LANet, and 0.58%, 0.93%, and 1.46% higher than those of SPANet, and the F1 scores of the Low-veg, tree, and car categories are 1.62%, 0.74%, and 1.86% higher than those of SPANet. The HRRNet model thus obtains better segmentation performance.
All in all, the experimental results of HRRNet compared with various models on these two datasets show that the HRRNet has strong robustness and good segmentation performance.
6. Conclusions
Many deep convolutional network models do not fully refine the segmentation result maps, and, in addition, the long-range dependencies in the semantic feature maps have not been fully exploited. This article proposed a hierarchical refinement residual network (HRRNet) to address these issues. HRRNet mainly consists of ResNet50 as the backbone, attention blocks, and decoders. The attention block consists of a channel attention module (CAM), a pooling residual attention module (PRAM), and residual structures. Specifically, the proposed CAM and PRAM sub-modules of HRRNet fully exploit the feature map position information and the context dependencies between channels to enhance the expressive ability of features. Then, using ResNet50 as a feature extractor, the layered fusion of features extracted at different stages and different scales realizes the refinement of the feature map, and the fusion of multi-scale features also enhances the model's ability to recognize various types of ground objects, thus promoting the generalization ability of the model. In addition, by setting different residual structures, the correlation between gradient and loss in the model training process is improved, which enhances the learning ability of the network and alleviates the problem of gradient disappearance. Experiments show that the proposed HRRNet improves the segmentation result maps compared with various models on the ISPRS Vaihingen and Potsdam datasets.
In the future, the precise segmentation of the Low-veg and tree categories in high-resolution remote sensing images remains a good research direction, and the problem of large intra-category differences and small inter-category differences is worthy of further study.