1. Introduction
Semantic segmentation of remote sensing images [1], also known as land-cover classification [2], aims at locating objects at the pixel level and predicting the semantic category label for each pixel in a remote sensing image. It plays an important role in many remote sensing applications [3,4], such as environmental change monitoring, precision agriculture, environmental protection, and urban planning and 3D modeling.
Driven by the rapid development of aeronautics and astronautics technology, together with Earth observation and remote sensing technology, massive numbers of high-quality, high-resolution remote sensing images have been captured. These high-resolution (HR) remote sensing images are rich in information and contain substantial spatial detail, which provides the data support for land-cover classification and segmentation. Although semantic segmentation of natural images has achieved considerable progress, semantic segmentation of HR remote sensing images remains challenging, since larger scenes always contain more complex ground information with heterogeneous objects. HR remote sensing images often exhibit large intra-class variations and small inter-class variations at the semantic object level, due to the diversity and complexity of ground objects. For example, the structure, color, and size of buildings vary significantly even within a single scene, as shown by the two buildings outlined with blue rectangles in Figure 1, while trees and low vegetation are often indistinguishable due to their similar colors and fuzzy boundaries, as shown by the areas in yellow circles in Figure 1. This makes it difficult to classify high-resolution remote sensing images pixel by pixel.
Extensive studies have addressed this challenging HR remote sensing image semantic segmentation task, including traditional methods and deep-learning-based methods. The earlier traditional methods [5,6] mainly consisted of two parts: first extracting features based on the color, shape, texture, and spatial position relations of a potential semantic object, then adopting clustering or classification methods to segment the image. They depended heavily on hand-crafted features and often achieved unsatisfactory performance. Recently, with the advent of deep learning, deep convolutional neural networks (DCNNs) have made great progress in semantic segmentation, owing to their ability to automatically extract nonlinear and hierarchical features at different semantic levels. Most current semantic segmentation methods are based on the fully convolutional network (FCN) [7] framework, which replaces the fully connected layers with convolutional ones to output spatial feature maps and then applies an upsampling operation to generate the predicted maps. Generally, FCN-based architectures consist of a contracting path (also known as an encoder), which extracts information from the input image and obtains high-level feature maps, and an expanding path (also known as a decoder), where the high-level feature maps are used to generate the pixel-wise segmentation mask via single-level (e.g., FCN [7], DeepLab [8]) or multilevel (e.g., UNet [9]) upsampling procedures.
Semantic context information is a key factor in HR remote sensing image semantic segmentation. Although DCNNs can automatically extract hierarchical semantic features, they simultaneously reduce the spatial resolution and degrade the spatial detail in high-level feature maps. Based on the high-level semantic features, single-level upsampling methods (e.g., FCN [7], DeepLab [8]) directly adopt upsampling and/or dilated (atrous) convolution to generate the high-resolution segmentation output. They may fail in some cases, especially when there are many relatively small objects (e.g., cars) in the fine-spatial-resolution remote sensing images. Alternatively, fusing multi-scale contextual features is a feasible solution for differentiating semantic categories at different spatial scales. The most commonly used techniques for aggregating multi-scale contextual features include pyramid pooling modules [10], atrous spatial pyramid pooling [8,11], and context encoding modules [12]. These strategies are usually incorporated into the decoder part of the UNet framework. However, in the UNet framework, the low-level, fine-grained detail features extracted by the encoder are simply copied and concatenated with the high-level, coarse-grained semantic features in the decoder, leading to insufficient exploitation of the features' discrimination ability. Therefore, in this study, we investigated how to fuse multi-scale semantic contexts for parsing HR remote sensing images.
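To make one such multi-scale aggregation technique concrete, a minimal ASPP-style block in PyTorch might look as follows (a sketch only; the dilation rates and channel sizes here are illustrative assumptions, not those of any cited model):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal atrous spatial pyramid pooling: parallel convolutions with
    different dilation rates are applied to the same input, and their
    outputs are concatenated and projected back to a single feature map."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch,
                      kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0,
                      dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch sees a different receptive field; concatenation mixes
        # the scales, and a 1x1 convolution fuses them.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

Because every branch uses "same" padding, the spatial size of the input is preserved while the effective receptive field grows with the dilation rate.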
In addition, the boundary information of a semantic object may determine the final performance of HR remote sensing image semantic segmentation. It is well known that, in the semantic segmentation task, pixels at the boundary are more likely to be misclassified. HR remote sensing images often contain large and complex scenes with heterogeneous objects of various shapes, scales, textures, and colors. The boundaries of objects are often ambiguous and blurry [13,14] due to lighting conditions, imaging angles, occlusions, and shadows, as shown by the areas marked by yellow circles in Figure 1. Although DCNNs can learn robust and discriminative features, the features extracted by a DCNN often fail to distinguish adjacent objects, since in HR remote sensing images the semantic objects are adjacent and often similar in appearance (color). The reason may be that the high-level features of DCNNs tend to capture local semantic information while ignoring the global geometric prior and over-smoothing object boundaries, which are important for object localization. There are generally two ways to improve boundary accuracy: adding a boundary loss [15,16,17] or adding extra boundary detection sub-networks [18,19,20]. These methods mainly focus on detecting the boundary, ignoring the relationship between the boundary and the semantic context. That is, they pay little attention to utilizing the boundary information to guide the semantic context, which could improve the final performance of semantic segmentation at the object level. Therefore, we also investigated how to explicitly adopt the extracted boundary information to enhance the semantic context for parsing HR remote sensing images.
To address these two research points, in this paper we propose the boundary enhancing semantic context network (BES-Net) for high-resolution remote sensing image semantic segmentation. Based on the FCN framework, a ResNet [21] model is adopted as the backbone to extract hierarchical semantic features. On top of it, BES-Net builds three modules to emphasize the boundary and semantic context information: the boundary extraction (BE) module, the multi-scale semantic context fusion (MSF) module, and the boundary enhancing semantic context (BES) module. The BE module predicts the binary boundaries of objects, adopting both the low-level detail features and the highest-level semantic feature from the backbone as input. It is supervised by binary boundary labels generated from the segmentation ground truth using the Laplacian operation [22]. The MSF module takes the high-level semantic features from the backbone as input and fuses them with an attention mechanism in a hierarchical manner, obtaining fused semantic features covering objects at multiple scales. Finally, the BES module explicitly adopts the extracted boundary information to enhance the fused semantic context using simple addition and multiplication operations. By aggregating the semantic context information along the boundaries, pixels of the same semantic category receive similar responses, enhancing the semantic consistency.
The main contributions can be summarized as follows:
We present a simple yet effective framework, BES-Net, for HR remote sensing image semantic segmentation.
We explicitly, not implicitly, adopt the well-extracted boundary to enhance the semantic context for semantic segmentation. Accordingly, three modules are designed to enhance the semantic consistency in complex HR remote sensing images.
Experimental results on two HR remote sensing image semantic segmentation datasets demonstrate the effectiveness of our proposed approach compared with state-of-the-art methods.
3. Methodology
In this section, we introduce the framework of our proposed boundary enhancing semantic context network (BES-Net) for parsing high-resolution remote sensing images.
3.1. The Framework of BES-Net
As depicted in Figure 2, BES-Net explicitly adopts boundary information to enhance the semantic context, ensuring that the pixels within one object achieve similar responses after semantic feature aggregation. Specifically, our method employs the ResNet [21] model as its backbone to extract hierarchical features, where the vanilla convolutions are replaced by dilated ones to enlarge the receptive field (also known as FCN_8s [7]). The backbone outputs five feature maps: $X_1$ and $X_2$ at 1/4 of the input resolution, with low-level detail information, and $X_3$, $X_4$, and $X_5$ at 1/8 of the input resolution, with high-level semantic information. Moreover, BES-Net contains three modules for aggregating the boundary and semantic context information: the boundary extraction (BE) module, the multi-scale semantic context fusion (MSF) module, and the boundary enhancing semantic context (BES) module.
The BE module focuses on extracting the boundary information by treating it as an independent sub-task alongside the mainstream semantic segmentation. The low-level detail features are concatenated with the highest-level semantic feature to capture the semantic boundaries. The boundary prediction is supervised by binary boundary labels generated from the segmentation ground truth using a Laplacian operation. Furthermore, for the three high-level semantic features, which have different semantic scales, the MSF module fuses them using an attention mechanism in a hierarchical manner to obtain the refined semantic features. Finally, the BES module is carefully designed to enhance the semantic context with the semantic boundaries extracted by the BE module. By aggregating the semantic information along the boundaries, pixels of the same semantic category can receive similar responses, thus enhancing the semantic consistency. We give details of each module in the following subsections.
3.2. Boundary Extraction Module
The boundary extraction module is designed to extract the boundaries of semantic objects. Since deep convolutional neural networks learn features containing both low-level detail information and high-level semantic information in a hierarchical manner, the BE module directly borrows the intermediate features from the backbone. Although boundary information exists in the low-level detail feature maps, these maps mostly lack semantic information. Therefore, to extract the semantic boundary, in addition to the two low-level detail features ($X_1$ and $X_2$), the BE module also utilizes the highest-level semantic feature ($X_5$) as an input, as shown in Figure 3.
Based on the inputs, the BE module firstly unifies their channel numbers to $C$ (e.g., $C = 64$) using $1\times1$ convolutional layers. The boundary feature map extracted from $X_5$ is upsampled to the size of $X_1$ (i.e., 1/4 of the size of the input), and then the three maps are concatenated together as the boundary features $F_b$.
Furthermore, to ensure that the boundary features $F_b$ truly contain the boundary information of semantic objects, the extracted multi-scale boundary feature maps are supervised by binary boundary labels generated from the segmentation ground truth. Specifically, the multi-scale boundary feature maps are mapped to 1-channel boundary maps by convolutional layers and the sigmoid function, then merged by element-wise addition.
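The channel unification, upsampling, and concatenation just described can be sketched in PyTorch (a minimal sketch; the channel sizes and the per-scale prediction heads are assumptions consistent with the description, not the published implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEModule(nn.Module):
    """Sketch of a boundary extraction module: unify the channels of the
    two low-level detail features and the highest-level semantic feature,
    upsample to a common resolution, and concatenate them."""
    def __init__(self, ch1, ch2, ch5, c=64):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(ch, c, kernel_size=1) for ch in (ch1, ch2, ch5)])
        self.heads = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for _ in range(3)])

    def forward(self, x1, x2, x5):
        size = x1.shape[-2:]
        feats = [self.proj[0](x1), self.proj[1](x2),
                 F.interpolate(self.proj[2](x5), size=size,
                               mode='bilinear', align_corners=False)]
        fb = torch.cat(feats, dim=1)  # multi-scale boundary features
        # Per-scale 1-channel sigmoid boundary maps, merged by addition;
        # this merged map is what the boundary labels supervise.
        b = sum(torch.sigmoid(h(f)) for h, f in zip(self.heads, feats))
        return fb, b
```

The concatenated tensor `fb` is what later modules consume, while `b` is the supervised boundary prediction.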
The binary boundary labels are generated from the semantic segmentation ground truth using the Laplacian operation, as shown in Figure 4. Since the semantic objects in a scene have various sizes, inspired by ASPP [8] we performed the boundary extraction at different scales. Given the ground truth $G$, we can obtain the corresponding boundary label $B$ as follows:

$B = f_{1\times1}\big(\mathcal{C}\big(\mathcal{U}(G \ast_i L)\big)\big), \quad i \in \{1, 2, 4\},$

where $\ast_i$ is the 2D convolution that adopts the Laplacian operator $L$ as the convolutional kernel, performed with strides of $i$ ($i = 1$, 2, and 4, respectively). This produces soft, thin-detail feature maps with multi-scale semantic boundaries. The feature maps are then bilinearly upsampled ($\mathcal{U}$) to the original size, concatenated ($\mathcal{C}$) along the channel dimension, and mapped to a 1-channel boundary map by a $1\times1$ convolutional layer ($f_{1\times1}$).
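The label-generation procedure can be sketched in NumPy (a simplified version: instead of a learned $1\times1$ convolution merging the multi-scale maps, this sketch simply ORs the binarized strided Laplacian responses):

```python
import numpy as np

LAPLACIAN = np.array([[-1, -1, -1],
                      [-1,  8, -1],
                      [-1, -1, -1]], dtype=np.float32)

def strided_laplacian(gt, stride):
    """2D convolution of the label map with the Laplacian kernel,
    sampled with the given stride (zero padding)."""
    g = np.pad(gt.astype(np.float32), 1)
    h, w = gt.shape
    out = np.zeros(((h + stride - 1) // stride,
                    (w + stride - 1) // stride), dtype=np.float32)
    for oi, i in enumerate(range(0, h, stride)):
        for oj, j in enumerate(range(0, w, stride)):
            out[oi, oj] = np.sum(g[i:i + 3, j:j + 3] * LAPLACIAN)
    return out

def boundary_label(gt, strides=(1, 2, 4)):
    """Binary boundary label: a pixel is boundary if any strided Laplacian
    response (upsampled back by repetition) is non-zero; uniform regions
    give a zero response, class transitions a non-zero one."""
    h, w = gt.shape
    b = np.zeros((h, w), dtype=bool)
    for s in strides:
        r = strided_laplacian(gt, s)
        up = np.kron(r, np.ones((s, s)))[:h, :w]  # nearest upsampling
        b |= up != 0
    return b.astype(np.float32)
```

Larger strides thicken the detected boundary, so the multi-scale union is robust to the varying object sizes mentioned above.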
3.3. Multi-Scale Semantic Context Fusion Module
The three high-level features ($X_3$, $X_4$, and $X_5$) have different semantic scales, due to their different receptive fields. The multi-scale semantic context fusion module is designed to fuse the multi-scale semantic information into one feature map. As shown in Figure 5, the three high-level features are firstly converted to $\hat{X}_3$, $\hat{X}_4$, and $\hat{X}_5$ with the same channel number $C'$ (e.g., $C' = 128$ for ResNet18, $C' = 512$ for ResNet50 and ResNet101) using $1\times1$ convolutional layers. Then, they are fused by the fusion blocks in a hierarchical manner to obtain the fused semantic features.
The fusion block is based on the channel attention mechanism, since different channels of features may correspond to different semantic classes [36]. Compared to features at different spatial positions, features in different channels may have higher class discriminability. Considering this, the fusion block aims to effectively exploit the cross-scale complementary information by re-weighting the importance of single-scale features in a channel-dependent way.
As shown on the right of Figure 5, given two features at different scales ($\hat{X}_i$ and $\hat{X}_j$, $i < j$), they are concatenated and then fed into two convolutional layers to obtain the relative importance of the paired features from different scales in the same channels. The channel importance (or attention) weights $W$ can be calculated as follows:

$W = \sigma\big(\mathcal{P}\big(f(\mathcal{C}(\hat{X}_i, \hat{X}_j))\big)\big),$

where $\mathcal{C}$ is the channel concatenation operation, and $f$ is a convolutional block with a $1\times1$ convolutional layer (for channel reduction) and a $3\times3$ convolutional layer (for feature refining). $\sigma$ is the sigmoid function, and $\mathcal{P}$ denotes the global average pooling operation.
Higher values of $W$ indicate that the corresponding channels of the features at the $j$th scale are more likely to be important than those at the $i$th scale, and vice versa. As a result, the relative importances of the channels of features from different scales are obtained. Therefore, we can adopt the gate fusion method [37] to fuse the two scales of features, where the channel importance weights act as the gate. Based on the channel importance weights, the fused features can be computed as follows:

$\mathcal{F}(\hat{X}_i, \hat{X}_j) = W \bullet \hat{X}_j + (1 - W) \bullet \hat{X}_i,$

where $\mathcal{F}$ represents the whole fusion block, and $\bullet$ denotes channel-wise multiplication.
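The gated fusion block can be sketched as follows (a sketch; the ordering of pooling and convolutions, the reduction ratio, and the kernel sizes are assumptions consistent with the description):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Channel-attention gated fusion of two same-shape feature maps:
    a per-channel gate in [0, 1] decides, channel by channel, how much
    of each scale's features to keep."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * c, c // reduction, kernel_size=1),  # reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, kernel_size=3, padding=1),  # refine
            nn.AdaptiveAvgPool2d(1),  # global average pooling
            nn.Sigmoid(),
        )

    def forward(self, xi, xj):
        w = self.attn(torch.cat([xi, xj], dim=1))  # shape (B, c, 1, 1)
        return w * xj + (1.0 - w) * xi             # gated fusion
```

The gate `w` broadcasts over the spatial dimensions, so the re-weighting is purely channel-wise, as the text describes.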
Finally, the three high-level features ($\hat{X}_3$, $\hat{X}_4$, and $\hat{X}_5$) are fused by the MSF module in a hierarchical manner with the fusion blocks, to obtain the final fused feature $F_s = \mathcal{F}(\mathcal{F}(\hat{X}_5, \hat{X}_4), \hat{X}_3)$.
3.4. Boundary Enhancing Semantic Context Module
Since the semantic boundary has an intrinsic partitioning capability for an object, the goal of the BES module is to enhance the fused semantic features $F_s$, giving them more intra-class consistency, using the extracted boundary features $F_b$. The key point is how to aggregate the two features. $F_s$ has sufficient semantic information (focusing on the body area of an object, without boundary information), while $F_b$ has salient boundary information. They complement each other in describing an object.
Like the fusion block of the MSF module in the previous subsection, the BES module is designed based on two fundamental mathematical operations (element-wise addition, $+$, and element-wise multiplication, $\times$) to aggregate the two complementary features. Generally, the multiplication operation can filter out and emphasize the boundary-related information, while the addition operation can complement the two features. With these two simple, parameter-free operations, the two complementary features can be effectively fused to describe the complete information of objects. Moreover, to obtain more robust performance, we also utilize the extracted boundary features $F_b$ to enhance the highest-level semantic feature $X_5$.
As shown in Figure 6, given the three features, i.e., the fused semantic features $F_s$, the extracted boundary features $F_b$, and the highest-level semantic feature $X_5$, the BES module firstly unifies their channel numbers to $C''$ (e.g., $C'' = 128$) using $1\times1$ convolutional layers. Taking the highest-level semantic feature $X_5$ as an example, it is updated as follows:

$\hat{X}_5 = f_{1\times1}\big(\mathrm{ASPP}(X_5)\big),$

where $\mathrm{ASPP}$ is the atrous spatial pyramid pooling module [8], and $f_{1\times1}$ denotes the $1\times1$ convolutional layer for unifying the channel numbers.
All the features are then upsampled to the size of $X_1$ (i.e., 1/4 of the size of the input).
Finally, the fused semantic features $F_s$ are enhanced by the extracted boundary features $F_b$ using the addition and multiplication operations. Simultaneously, in order to make the boundary-related information more salient, the features $\hat{X}_5$ are also enhanced by $F_b$ using the multiplication operation. The final enhanced features $F_e$ are calculated as follows:

$F_e = (F_s + F_b) + F_s \times F_b + \hat{X}_5 \times F_b.$
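The parameter-free aggregation can be sketched as follows (one consistent reading of the textual description; the exact combination of additions and multiplications in the published model may differ):

```python
import torch

def bes_aggregate(fs, fb, x5_hat):
    """Boundary-enhanced aggregation of three same-shape feature maps:
    the fused semantic features, the boundary features, and the
    ASPP-transformed highest-level feature. Addition complements the
    two features; multiplication highlights boundary-related responses."""
    return (fs + fb) + fs * fb + x5_hat * fb
```

Note that when the boundary features are zero everywhere, the output degenerates to the fused semantic features alone, which matches the intuition that the boundary branch only adds information.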
3.5. Loss Function
Based on the final enhanced features $F_e$, which contain sufficient semantic and boundary information, we can predict the segmentation results with a convolution block (sequentially a $3\times3$ convolutional layer and a $1\times1$ convolutional layer), in the same way as FCN [7] and DeepLab [8].
In our framework, BES-Net outputs two main results, aiming to generate the segmentation masks and the boundaries, respectively. For semantic segmentation, we adopt the standard cross-entropy loss $\mathcal{L}_{seg}$ to measure the difference between the predicted masks $P$ and the ground truth $G$:

$\mathcal{L}_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} G_{ik}\log P_{ik},$

where $K$ is the number of semantic categories corresponding to the $K$ predicted feature maps, $N$ is the number of pixels, and the subscript $ik$ denotes the $i$th pixel in the $k$th predicted feature map.
Similarly, for boundary prediction, we adopt the binary cross-entropy to measure the difference between the predicted boundary $\hat{B}$ and the ground truth boundary $B$:

$\mathcal{L}_{bd} = -\frac{1}{N}\sum_{i=1}^{N}\big[B_i\log \hat{B}_i + (1 - B_i)\log(1 - \hat{B}_i)\big].$
Additionally, following the settings in DeepLab [8], to accelerate model convergence we also apply an auxiliary cross-entropy loss $\mathcal{L}_{aux}$ to the intermediate feature representations of the backbone. Therefore, the overall training loss is

$\mathcal{L} = \mathcal{L}_{seg} + \lambda_1\mathcal{L}_{bd} + \lambda_2\mathcal{L}_{aux},$

where $\lambda_1$ and $\lambda_2$ are hyperparameters. We empirically set $\lambda_1 = 1$ and $\lambda_2 = 0.4$.
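The overall loss can be sketched in PyTorch as follows (a sketch; the auxiliary-loss weight of 0.4 is the common DeepLab-style setting and is an assumption here, as is supervising the auxiliary head with the segmentation ground truth):

```python
import torch
import torch.nn.functional as F

def bes_loss(seg_logits, seg_gt, bd_pred, bd_gt, aux_logits,
             lambda_bd=1.0, lambda_aux=0.4):
    """Total loss = segmentation cross-entropy
                  + lambda_bd  * boundary binary cross-entropy
                  + lambda_aux * auxiliary cross-entropy."""
    l_seg = F.cross_entropy(seg_logits, seg_gt)
    l_bd = F.binary_cross_entropy(bd_pred, bd_gt)  # bd_pred after sigmoid
    l_aux = F.cross_entropy(aux_logits, seg_gt)
    return l_seg + lambda_bd * l_bd + lambda_aux * l_aux
```

The boundary term expects probabilities (post-sigmoid), while the two cross-entropy terms expect raw logits, matching the usual PyTorch conventions.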
4. Experiments and Results
In this section, we evaluate the effectiveness of our proposed method for HR remote sensing image semantic segmentation on two public 2D semantic labeling datasets provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) (https://www2.isprs.org/commissions/comm2/wg4/benchmark/semantic-labeling/, accessed on 26 February 2022): Vaihingen and Potsdam. Both datasets cover urban scenes. The reference data are labeled according to six land-cover classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background.
4.1. Experimental Settings
4.1.1. Datasets and Settings
The ISPRS Vaihingen dataset records a relatively small village with many detached buildings and small multi-story buildings. It contains 33 orthophoto image patches of varying sizes, with a spatial resolution of 9 cm. The near-infrared (IR), red (R), and green (G) channels, together with the corresponding digital surface models (DSMs) and normalized DSMs (NDSMs), are provided in the dataset. We only utilized the IR-R-G images; the DSMs were not used in the experiments. Following the official data split, we employed 16 images for training and 17 images for testing.
The ISPRS Potsdam dataset records a typical historic city with large building blocks, narrow streets, and a dense settlement structure. It contains 38 orthophoto image patches. The images have a size of 6000 × 6000 pixels and a spatial resolution of 5 cm. The dataset provides the near-infrared, red, green, and blue channels, as well as DSMs and NDSMs. Again, we only utilized the IR-R-G images in the experiments. Following the official data split, we employed 24 images for training and 14 images for testing.
4.1.2. Evaluation Metrics
Following existing studies, the overall accuracy (OA), mean intersection over union (mIoU), and mean F1-score (mF1) were adopted as evaluation metrics. Based on the accumulated confusion matrix, the OA, mIoU, and mF1 were calculated as

$\mathrm{OA} = \frac{\sum_{k=1}^{K}\mathrm{TP}_k}{N}, \quad \mathrm{mIoU} = \frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}, \quad \mathrm{mF1} = \frac{1}{K}\sum_{k=1}^{K}\frac{2\,\mathrm{P}_k\,\mathrm{R}_k}{\mathrm{P}_k + \mathrm{R}_k},$

where $\mathrm{TP}_k$, $\mathrm{FP}_k$, $\mathrm{TN}_k$, and $\mathrm{FN}_k$ denote the true positive, false positive, true negative, and false negative counts for class $k$, respectively, and $N$ is the total number of pixels. In addition, $\mathrm{P}_k = \mathrm{TP}_k/(\mathrm{TP}_k + \mathrm{FP}_k)$ and $\mathrm{R}_k = \mathrm{TP}_k/(\mathrm{TP}_k + \mathrm{FN}_k)$ are the precision and recall indicators for class $k$, respectively.
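These metrics follow directly from an accumulated confusion matrix, as the following NumPy sketch shows:

```python
import numpy as np

def metrics_from_confusion(cm):
    """OA, mIoU, and mF1 from a K x K confusion matrix, where cm[i, j]
    counts pixels of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)                 # per-class true positives
    fp = cm.sum(axis=0) - tp         # predicted as k but not k
    fn = cm.sum(axis=1) - tp         # actually k but missed
    oa = tp.sum() / cm.sum()
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, iou.mean(), f1.mean()
```

Accumulating one confusion matrix over all test tiles before computing the means matches the "accumulated confusion matrix" convention used above.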
4.1.3. Implementation Details
The implementation (https://github.com/FlyC235/BESNet, accessed on 26 February 2022) of our method uses the PyTorch framework. Following existing HR remote sensing image semantic segmentation studies, the ResNet model was adopted as the backbone network for a fair comparison, and the pre-trained ImageNet parameters were adopted for network initialization. We used the dilated FCN_8s as the baseline. In the training phase, considering the limited GPU memory, we cut the training images, as well as the corresponding labels, into 512 × 512 patches using a sliding window with an overlap of 171 (≈ 512/3) pixels. To avoid overfitting, common data augmentation methods were adopted, including random flipping and rotation at 30-degree intervals. We adopted the mini-batch stochastic gradient descent (SGD) optimizer, with the momentum and weight decay parameters set to 0.9 and 0.0005, respectively. The maximum number of training epochs was set to 120 for the Vaihingen dataset and 200 for the Potsdam dataset. We set the initial learning rate to 0.005. The "poly" learning rate strategy [38] was adopted to update the learning rate, where at each iteration the learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{0.9}$. Following the settings used in other studies [19,24,27], in the testing phase we also adopted data augmentation strategies, including horizontal and vertical flipping, multiple scales, and overlay fusion on the full test tiles, which is also known as test-time augmentation (TTA).
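The "poly" schedule reduces to a one-line function (a sketch; the exponent 0.9 is the value commonly used with this schedule and is an assumption here):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning rate: base_lr * (1 - cur_iter / max_iter) ** power.
    Starts at base_lr and decays smoothly to 0 at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

In practice the returned value is assigned to each optimizer parameter group at every iteration.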
4.2. Ablation Experiments
In this section, we evaluate the effectiveness of our proposed BES-Net method, including its three components: the boundary extraction (BE) module, the multi-scale semantic context fusion (MSF) module, and the boundary enhancing semantic context (BES) module. The ablation experiments were only conducted on the ISPRS Vaihingen dataset (note that, to simply show the effectiveness of the different components, we report the ablation results without test-time augmentation (TTA)). The dilated FCN_8s with the ResNet50 backbone was adopted as the baseline. On this basis, we tested the effectiveness of each module by incorporating them separately or simultaneously. The results are shown in Figure 7.
From the figure, we can see that the three modules all have a positive effect on improving the baseline performance, since:
Compared to the baseline, using only the BE module (+BE) improved the mF1, mIoU, and OA by 0.59%, 1.66%, and 0.24%, respectively.
Compared to the baseline, using only the MSF module (+MSF) improved the mF1, mIoU, and OA by 0.34%, 1.80%, and 0.56%, respectively.
When combining the BE and MSF modules using the BES module (our BES-Net), the mF1, mIoU, and OA were improved by 1.28%, 2.36%, and 0.72%, respectively, compared to the baseline.
Compared to the +BE and +MSF variants, our BES-Net performed much better. This demonstrates the effectiveness of explicitly adopting boundary information to enhance the semantic context.
The ablation experiments demonstrated the effectiveness of our proposed three modules, BE, MSF, and BES, for HR remote sensing images semantic segmentation.
Furthermore, based on the framework of BES-Net, we visualized the intermediate feature maps generated by our proposed three modules. As shown in Figure 8, the feature map of $F_b$ has rich boundary information at the semantic object level, while the feature map of $F_s$ has sufficient high-level semantic information (focusing on the body areas of objects, without boundary information) for multi-scale objects, including those at a fine scale (the small cars) and at a coarse scale (the large buildings). $F_b$ and $F_s$ complement each other in describing an object. After enhancing the fused semantic features $F_s$ with the extracted boundary features $F_b$ to obtain the final enhanced features $F_e$, the feature map of $F_e$ has abundant boundary and semantic context information simultaneously. Finally, the results predicted from $F_e$ matched the ground truth more accurately, with more intra-class semantic consistency.
4.2.1. Boundary Extraction
In our proposed boundary extraction module, we adopted the two low-level detail features ($X_1$ and $X_2$) and only the highest-level semantic feature ($X_5$) as the inputs. To verify why only $X_5$ was utilized, and not the other two high-level features $X_3$ and $X_4$, we conducted experiments (Baseline+BE) with different input combinations. Since the goal of the BE module is to extract semantic boundaries, we fixed the two low-level detail features as inputs and gradually added the high-level semantic features from $X_5$ to $X_3$. The results are listed in Table 1. We can see that combining only the highest-level semantic feature $X_5$ achieved the best performance. When more features were added, the performance degraded, especially on the mIoU metric. This is reasonable, because $X_5$ can provide sufficient semantic information to guide the boundary learning, while adding $X_3$ and $X_4$ introduces more convolution parameters.
4.2.2. Multi-Scale Semantic Context Fusion
In our proposed multi-scale semantic context fusion module, we fuse the three high-level semantic features $X_3$, $X_4$, and $X_5$ in a hierarchical manner (Equation (4)), firstly fusing two features and then fusing the last one with the fused result. To determine the fusion strategy, we conducted experiments (Baseline+MSF) with different fusion orders. $X_3$, $X_4$, and $X_5$ have sequentially higher semantic scales. According to their semantic scales, we set three kinds of orders, corresponding to indexes ④, ⑤, and ⑥, as listed in Table 2. We can see that method ④, whose fusion order ($X_5 \rightarrow X_4 \rightarrow X_3$) proceeds in a coarse-to-fine scale manner, achieved the best performance.
Moreover, we conducted a further experiment by concatenating the results of the above three methods, corresponding to index ⑦. Compared to our method (index ④), this introduced twice as many convolution parameters, achieving a slightly better mF1 result (+0.32%) but a much worse mIoU result (−0.79%).
4.2.3. Boundary Enhancing Semantic Context
In our proposed boundary enhancing semantic context module, we introduced a feature aggregation method with two simple mathematical operations to utilize the extracted boundary features to enhance the fused multi-scale semantic features. To verify the effectiveness of this feature aggregation method, we conducted experiments (BES-Net) with different boundary-enhanced semantic features. Table 3 shows that:
When the highest-level backbone semantic feature $\hat{X}_5$ is enhanced by the boundary features $F_b$, corresponding to index ⑧, better performance is achieved compared to the baseline.
When $F_b$ enhances the fused multi-scale semantic features $F_s$, corresponding to index ⑨, it slightly outperforms method ⑧.
Finally, simultaneously enhancing $\hat{X}_5$ and $F_s$, corresponding to index ⑩, achieves the best performance.
The experimental results demonstrate the effectiveness of our BES-Net in explicitly adopting boundary information to enhance the semantic context.
4.2.4. Backbone
Furthermore, we conducted experiments based on our BES-Net with different ResNet backbones, including ResNet18, ResNet50, and ResNet101. As shown in Table 4, as the number of model parameters increases, the segmentation performance improves according to the mF1 and OA metrics. This is reasonable, because the well-structured ResNet models with more parameters have more powerful feature representation capabilities. However, from ResNet18 to ResNet101, the number of parameters increases by about five times, while the results on the Vaihingen dataset show improvements of only 0.82% in mF1 and 0.65% in OA, and the results on the Potsdam dataset show improvements of only 0.07% in mF1 and 0.18% in OA.
Moreover, the results of ResNet18 on the Potsdam dataset are better than those on the Vaihingen dataset, which is reasonable since the Potsdam dataset has a finer resolution (5 cm), while the Vaihingen dataset has a slightly coarser resolution (9 cm). Fine-resolution images provide more spatial detail about the objects.
4.2.5. The Hyperparameters in the Loss Function
There are two hyperparameters in the loss function (Equation (9)): $\lambda_1$ for the boundary loss $\mathcal{L}_{bd}$ and $\lambda_2$ for the auxiliary loss $\mathcal{L}_{aux}$. As most related studies follow the settings of DeepLab [8] for the hyperparameter $\lambda_2$, we also adopted this setting here. For the hyperparameter $\lambda_1$, we conducted experiments with different values, $\lambda_1 \in \{0.1, 0.5, 1.0, 5.0, 10.0\}$.
As shown in Figure 9, we find that the mF1/OA of BES-Net improves as the weight increases from 0.1 through 0.5 to 1.0, then drops as the weight increases from 1.0 through 5.0 to 10.0. The highest performance is achieved when $\lambda_1 = 1.0$ (the boundary loss and segmentation loss have equal weight). Furthermore, the performance of BES-Net drops slightly for weights of 0.1 and 0.5 but drops dramatically for weights of 5.0 and 10.0. These results suggest that focusing too much on the boundary information may lead to ignoring the inter-class semantic information. Therefore, the boundary loss and segmentation loss should have equal weight.
4.3. Comparison to the State of the Art
This section compares our BES-Net (with ResNet18, ResNet50, and ResNet101 backbones) to state-of-the-art HR remote sensing image semantic segmentation methods. The results with test-time augmentation (TTA) on the ISPRS Vaihingen and Potsdam datasets are listed in Table 5.
The experiments on the Vaihingen dataset show that:
The experiments on the Potsdam dataset show that our proposed BES-Net method with the ResNet18 backbone obtains the best performance with respect to the mF1 and mIoU metrics. The experiments suggest that our proposed method can boost the semantic segmentation performance, with more intra-class segmentation consistency, at a holistic semantic object level. Regarding the backbones, ResNet18 achieves comparable (or slightly better) performance on mF1 and mIoU compared to ResNet50 and ResNet101, while having considerably fewer parameters.
The experimental results indicate that for the Potsdam dataset, with its fine resolution (5 cm), ResNet18 is sufficient to extract the spatial details and semantic information for parsing the HR remote sensing images. It provides the best balance between the computational burden of forward inference and the richness of feature learning.
4.4. Qualitative Analysis
Figure 10 and Figure 11 show the visualization results of our BES-Net (ResNet18) method and the corresponding baseline (FCN_8s) on the Vaihingen and Potsdam test sets, respectively. It can be seen that, compared to the baseline, our BES-Net method significantly improved the segmentation performance, especially in the regions marked with red dashed circles (or boxes). Benefiting from the BE, MSF, and BES modules, our BES-Net method obtains coherent and accurately labeled results in these heterogeneous regions, which are hard to distinguish. The three modules together learn features at the entire semantic object level, resulting in segmentation results with more complete boundaries. Some representative samples can be found in Figure 10 and Figure 11. We can observe that:
Moreover, to show the effectiveness of our proposed method in challenging situations with cloud shadow, some sample visualization results of our BES-Net (ResNet18) method and the corresponding baseline (FCN_8s) method are shown in Figure 12. As the red circles illustrate, the baseline method may miss some cars in shadow, while our BES-Net is able to perceive them. Our network achieves better performance in challenging situations with shadows.
4.5. Computational Complexity
We compared the computational complexity of our method with that of state-of-the-art methods such as ABCNet [27], FANet [45], MAResU-Net [46], and SwiftNet [47]. The model parameters and computational FLOPs are listed for comparison in Table 6. Note that, for a fair comparison, we used the same backbone (ResNet18) to evaluate the computational complexity. We can see that, compared to the state-of-the-art lightweight method ABCNet, our BES-Net achieved better performance with fewer parameters and FLOPs, maintaining both low computational cost and high accuracy.
5. Conclusions and Discussion
In this paper, we presented a boundary enhancing semantic context network (BES-Net) that improves the semantic segmentation performance for parsing high-resolution remote sensing images. The main idea is to explicitly, not implicitly, use the well-extracted boundary to enhance the semantic context for semantic segmentation. BES-Net simultaneously takes the boundary and semantic context information into account using three designed modules: the BES module enhances the fused multi-scale semantic context extracted by the MSF module, using the boundary information extracted by the BE module, to boost the semantic segmentation performance with more intra-class segmentation consistency at a holistic semantic object level. Experiments on the ISPRS Vaihingen and Potsdam datasets showed that incorporating only the boundary information or only the multi-scale semantic context slightly improved the segmentation performance, while adding our BES module to explicitly enhance the fused multi-scale semantic context with boundary information improved the performance considerably. This demonstrates the superiority of our proposed BES-Net method, even compared to the current state-of-the-art methods.
Although it achieves a relatively fine combination of boundary information and semantic context, the proposed BES-Net still has room for improvement. (1) Efficiency: at present, due to the high feature dimensionality, our BE module still has a relatively high computational cost. (2) Performance: to reduce the computational cost, our BES module only utilizes two simple, parameter-free mathematical operations (element-wise addition and element-wise multiplication) to fuse the two complementary features; there may be more effective ways to design the BES module. Therefore, our future work will focus on further optimizing the BE and BES modules, reducing the complexity of the BE module while maintaining its performance, and adopting more effective methods for using boundary information to enhance the semantic context. Furthermore, focusing on more challenging situations in actual scenarios, e.g., images with clouds and shadows, is also an interesting topic.