4.1. Datasets
The experiments were conducted on three RS datasets: the ISPRS Vaihingen dataset, the ISPRS Potsdam dataset, and the LoveDA dataset. The two ISPRS datasets are widely used for RS semantic segmentation tasks, while the LoveDA dataset is a more recently released and more challenging dataset. The details are described as follows.
(1) Vaihingen Dataset: The ISPRS Vaihingen and Potsdam datasets [68] are both commonly used benchmarks for the semantic segmentation of RS images. The Vaihingen dataset is a small-scale dataset with 33 true orthophoto (TOP) images, which were collected by advanced airborne sensors and are usually divided into small images for training and testing. The Vaihingen dataset covers an area of 1.38 km² with 6 different classes of objects: buildings, trees, low vegetation, cars, clutter, and impervious surfaces. The Vaihingen dataset has a ground sampling distance (GSD) of 0.09 m, making it a high-resolution dataset. The side lengths of the TOP images range from 1887 to 3816 pixels, and each TOP image has infrared (IR), red (R), and green (G) channels, which compose a pseudo-color image. In the experiments, 16 images were selected for training and 17 images for testing, following common practice. For convenient processing, the TOP images were cropped into 512 × 512 pixel patches with a stride of 256 pixels (see the cropping sketch at the end of this subsection).
(2) Potsdam Dataset: Compared with the Vaihingen dataset, the Potsdam dataset is slightly larger, containing 38 TOP images collected by advanced airborne sensors. The Potsdam dataset covers an area of 3.42 km² with the same 6 classes as the Vaihingen dataset and a GSD of 0.05 m. The TOP images of the Potsdam dataset, with a fixed size of 6000 × 6000 pixels, were also cropped into 512 × 512 pixel patches with a stride of 256 pixels. Specifically, 24 images were used for training, and the remaining 14 images were used for testing, following previous studies [48,69,70].
(3) LoveDA Dataset: The Land-cover Domain Adaptive semantic segmentation (LoveDA) dataset [71] is a recently released high-resolution remote sensing dataset. Unlike the comparatively small-scale Vaihingen and Potsdam datasets, the LoveDA dataset contains 5987 images collected by spaceborne sensors, covering an area of 536.15 km² with 7 classes: buildings, roads, water, barren, forest, agriculture, and background. Each image in the LoveDA dataset is 1024 × 1024 pixels with a GSD of 0.3 m. Following the official dataset split, 2522 images were used for training and 1669 images for testing in this work.
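For reference, the patch extraction used on the two ISPRS datasets can be sketched as a simple sliding window. The patch size (512) and stride (256) are taken from the text; the function name and the edge-alignment choice are illustrative assumptions, not the exact preprocessing code.

```python
import numpy as np

def crop_to_patches(image: np.ndarray, patch: int = 512, stride: int = 256):
    """Slide a patch × patch window over an H × W × C image with the given
    stride (assumes H and W are at least `patch`); border windows are
    aligned to the image edge so the full extent is covered."""
    h, w = image.shape[:2]
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    if ys[-1] != h - patch:  # cover the bottom border
        ys.append(h - patch)
    if xs[-1] != w - patch:  # cover the right border
        xs.append(w - patch)
    return [image[y:y + patch, x:x + patch] for y in ys for x in xs]
```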
4.3. Implementation Details
(1) Training Configuration: To provide consistent inputs, the images in the LoveDA dataset were resized to 512 × 512 pixels, the same size as the images in the ISPRS Vaihingen and Potsdam datasets. Random resizing, cropping, and flipping, as well as photometric distortion and band normalization, were employed for data augmentation. The proposed BiTSRS and the compared models were implemented with the PyTorch framework. The adaptive momentum estimation with weight decay (AdamW) [72] optimizer was employed for model training, with an initial learning rate $\eta_0$ and a weight decay of 0.01. To guarantee a reasonable number of training iterations, a polynomial learning rate schedule [73] with a linear warm-up over the first 1500 iterations was adopted, and the maximum number of iterations was set to 16,000 for the entire training process. The polynomial policy for the remaining iterations can be described as follows:
$$\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{p},$$
where $\eta_t$ denotes the learning rate at iteration $t$, $\eta_0$ is the initial learning rate, $T$ is the total number of iterations, and $p$ is the power of the adopted polynomial. Furthermore, for a fair comparison, the following experiments did not employ pretrained models; namely, all methods in the experiments were trained from scratch.
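As a concrete illustration, this warm-up plus polynomial decay policy can be implemented with PyTorch's LambdaLR. The 1500 warm-up iterations, 16,000 total iterations, and 0.01 weight decay come from the text; `base_lr`, `warmup_ratio`, and `power` are placeholder values rather than the paper's exact settings, and applying the decay over the post-warm-up iterations is one reasonable reading of the schedule, not the paper's definitive code.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholders: base_lr, warmup_ratio, and power are illustrative values,
# not the exact settings used in the paper.
base_lr, warmup_ratio, power = 6e-5, 1e-6, 1.0
warmup_iters, max_iters = 1500, 16000

model = torch.nn.Conv2d(3, 6, kernel_size=3)  # stand-in for the segmentor
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.01)

def lr_factor(t: int) -> float:
    """Multiplicative LR factor: linear warm-up, then polynomial decay."""
    if t < warmup_iters:
        # Ramp linearly from warmup_ratio * base_lr up to base_lr.
        return warmup_ratio + (1.0 - warmup_ratio) * t / warmup_iters
    # Polynomial decay over the remaining iterations.
    return (1.0 - (t - warmup_iters) / (max_iters - warmup_iters)) ** power

scheduler = LambdaLR(optimizer, lr_factor)
# Call scheduler.step() once per training iteration.
```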
(2) Experimental Environment: All the experiments were conducted on a Linux Ubuntu 18.04 LTS server with a single NVIDIA GeForce RTX 3080 GPU with 10,018 MB of memory.
4.4. Ablation Analysis
A simple framework consisting of a Swin transformer encoder and a UperNet-style decoder, without any of the proposed extra designs, was employed as the baseline [43]. To illustrate the effectiveness of the proposed model, the following ablation experiments were performed.
(1) Baseline: The baseline model simply combined the Swin transformer with UperNet in an encoder–decoder structure, in which the Swin transformer encoder extracted features from the input images, and UperNet restored the resolution and generated the segmentation results (a minimal sketch of this composition follows).
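The sketch below shows this composition under stated assumptions: `swin_encoder` and `upernet_decoder` stand in for actual implementations (e.g., those available in mmsegmentation) and are not defined here; the class and argument names are illustrative, not the paper's code.

```python
import torch.nn as nn
import torch.nn.functional as F

class BaselineSegmentor(nn.Module):
    """Baseline: Swin transformer encoder + UperNet-style decoder.

    Assumes `swin_encoder` returns a pyramid of multi-scale feature maps
    (strides 4/8/16/32) and `upernet_decoder` fuses them into class logits
    at a reduced resolution.
    """

    def __init__(self, swin_encoder: nn.Module, upernet_decoder: nn.Module):
        super().__init__()
        self.encoder = swin_encoder
        self.decoder = upernet_decoder

    def forward(self, x):
        feats = self.encoder(x)       # multi-scale features
        logits = self.decoder(feats)  # fused logits at reduced resolution
        # Restore the full input resolution for per-pixel prediction.
        return F.interpolate(logits, size=x.shape[2:],
                             mode='bilinear', align_corners=False)
```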
(2) Effectiveness of the Input Transform Module: The motivation for designing the ITM was to make the size of the input images more appropriate for the Swin transformer encoder. With the cascaded ITM, the designed framework is capable of performing semantic segmentation on RS images with large spatial scales while conveniently using the weights of the Swin transformer pretrained on the ImageNet dataset. The comparison results are shown in Table 2. The first group shows the results of the baseline model loading the pretrained weights (denoted as "with pr"), with and without the ITM. The middle group is the comparison when the baseline model does not load the pretrained weights, with and without the ITM. The bottom group is the comparison of the baseline model loading the pretrained weights, with and without the ITM, when a larger input of 1024 × 1024 pixels is employed (smaller inputs of 512 × 512 pixels were employed in the previous two groups). It can be observed that when the pretrained Swin transformer weights are loaded, improvements of 2.12% in mIoU and 0.28% in mAcc are obtained by employing the ITM. Even without the pretrained weights, improvements of 1.56% in mIoU and 1.21% in mAcc can still be obtained by employing the ITM. Furthermore, when the framework is applied to larger-scale RS images of 1024 × 1024 pixels, improvements of 2.25% in mIoU and 0.73% in mAcc are achieved. These experimental results illustrate that the designed ITM can effectively improve the performance of RS image semantic segmentation, especially for large-scale RS images. Even without pretrained weights, the use of the ITM still leads to a salient improvement in semantic segmentation performance.
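The ITM's actual architecture is described earlier in the paper and is not reproduced here; purely to illustrate the idea of adapting large inputs to the resolution the pretrained encoder expects, a learnable strided-convolution transform could look like the following sketch (all names and layer choices are assumptions, not the paper's module).

```python
import torch
import torch.nn as nn

class InputTransformSketch(nn.Module):
    """Illustrative stand-in for an input transform (not the paper's ITM).

    Maps a large input (e.g., 1024 × 1024) down to the resolution the Swin
    encoder was pretrained on (e.g., 512 × 512) with a learnable transform
    rather than a fixed bilinear resize.
    """

    def __init__(self, channels: int = 3):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.transform(x)  # halves the spatial resolution

x = torch.randn(1, 3, 1024, 1024)
print(InputTransformSketch()(x).shape)  # torch.Size([1, 3, 512, 512])
```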
(3) Effectiveness of the Bi-decoder Design: In this part, the baseline with the ITM was employed as the fundamental framework, and the performance improvements from the three specific components of the framework (the Dilated-Uper decoder, the FDCN, and the focal loss) are discussed. The experimental results are shown in Table 3. It can be observed that incorporating any one of the three designs into the baseline framework leads to a performance improvement in semantic segmentation. Improvements are also observed when arbitrary combinations of them are employed. The best performance was achieved by combining all three designs in the framework, with improvements of 2.94% in mIoU and 3.25% in mAcc. These improvements can be attributed to their capabilities in receptive field expansion, image deformation acquisition, and handling the problem of training sample imbalance, respectively.
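Of these three components, the focal loss has a compact standard form (Lin et al.); the sketch below is a minimal multi-class version for dense prediction, with an illustrative gamma rather than the paper's exact setting.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """Standard multi-class focal loss, FL = (1 - p_t)^gamma * (-log p_t).

    logits: (N, C, H, W) raw class scores; target: (N, H, W) class indices.
    Well-classified pixels (large p_t) are down-weighted, which mitigates
    the training-sample-imbalance problem discussed above.
    """
    ce = F.cross_entropy(logits, target, reduction='none')  # -log p_t
    p_t = torch.exp(-ce)                                    # recover p_t
    return ((1.0 - p_t) ** gamma * ce).mean()
```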
4.5. Comparison with Some State-of-the-Art Methods
To further demonstrate the effectiveness and superiority of the proposed BiTSRS, 11 state-of-the-art algorithms were adopted for the comparison, including 7 algorithms proposed for general scene segmentation, i.e., FCN [24], U-net [25], PSPNet [28], DeepLabv3+ [27], HRNet [11], SETR [63], and BeiT [74], and 4 algorithms specially designed for RS images, i.e., STransFuse [48], UNetFormer [75], SwinB-CNN [49], and ST-UNet [47]. The baseline of the proposed BiTSRS mentioned above is also included in the comparison. Note that all the CNN-based methods are highly optimized versions of the official implementations; for example, FCN, U-net, and PSPNet employ dilated convolutions in their ResNet50 backbones for better performance.
Notably, the training settings are inconsistent among the RS-specific methods; for example, the image crop size is 1024 × 1024 in UNetFormer [75], 256 × 256 in ST-UNet [47], and 300 × 300 in SwinB-CNN [49]. In addition, test-time augmentation (TTA) strategies are used in UNetFormer [75] but not in the other methods. Furthermore, the amount of training also differs among the methods; for example, ST-UNet [47] is trained for 100 epochs, UNetFormer [75] for 50 epochs, and SwinB-CNN [49] for 16,000 iterations. To make a fair comparison, the RS-specific methods, including UNetFormer, SwinB-CNN, and ST-UNet, were re-trained and re-evaluated under the same experimental settings, i.e., without pretrained weights or test-time augmentation, and with the same image crop size, number of training iterations, etc. As the source code of STransFuse [48] is not currently available, its results were taken directly from its paper, since our experimental settings were the same as those in [48].
The comparison experiments were conducted on the three commonly used RS image datasets described above: the ISPRS Vaihingen dataset, the ISPRS Potsdam dataset, and the LoveDA dataset. The comparison results on the Vaihingen dataset are listed in Table 4. It can be observed that the proposed BiTSRS achieved the highest mAcc and mIoU on the Vaihingen dataset. In terms of per-class IoU, the proposed BiTSRS achieves a clearly higher IoU for the clutter class and IoUs comparable to those of the other methods for the remaining five classes. This is also confirmed by the visualization results depicted in Figure 5: the proposed BiTSRS effectively segments the clutter class, which is actually the most challenging class in the Vaihingen dataset, as can be seen in the last rows of the visualization images.
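For clarity, the per-class IoU and mIoU reported in these tables can be computed from a confusion matrix as in the minimal sketch below (the function name is illustrative; this is the standard definition, not the paper's evaluation code).

```python
import numpy as np

def iou_per_class(confusion: np.ndarray) -> np.ndarray:
    """Per-class IoU from a C × C confusion matrix (rows: ground truth,
    columns: prediction): IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp  # predicted as class c but not c
    fn = confusion.sum(axis=1) - tp  # actually class c but missed
    return tp / np.maximum(tp + fp + fn, 1e-12)

# mIoU is the mean of the per-class IoUs:
# miou = iou_per_class(confusion).mean()
```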
The comparison results for the Potsdam dataset are shown in Table 5. The highest mIoU and mAcc were both achieved by the proposed BiTSRS. It can be observed that BiTSRS outperforms the other methods for all classes except the tree and building classes. Similarly to its performance on the Vaihingen dataset, BiTSRS segments the clutter class more accurately than the other methods, which is consistent with the visualization results in Figure 6. Being larger than the Vaihingen dataset, the Potsdam dataset provides more data for training, so better performance is achieved on it than on the Vaihingen dataset.
Unlike on the Vaihingen and Potsdam datasets, semantic segmentation on the LoveDA dataset is far more challenging because of its more complex background samples, inconsistent class distributions, and multi-scale objects. The comparison results on the LoveDA dataset are shown in Table 6, and it can be seen that the proposed BiTSRS still clearly outperforms the other methods, producing the best mIoU and mAcc. In particular, for the most challenging class in the LoveDA dataset, the barren class, the proposed BiTSRS achieves the highest IoU, which can also be observed in Figure 7.
It is also worth noting that the simple FCN method achieves better results for almost all classes on the Vaihingen dataset, though it has the lowest IoU for the clutter class. However, on the larger and more complex Potsdam and LoveDA datasets, FCN performs comparably to or even worse than the other methods, indicating that its architecture does not generalize robustly. The highly developed CNN-based methods, such as DeepLabv3+ and HRNet, in contrast, are more robust, producing promising results on all three datasets. The general transformer-based methods, such as SETR and BeiT, do not achieve promising results on the three datasets, mainly because the unique characteristics of RS images were not specifically considered when designing these networks. The RS-specific transformer-based methods, such as UNetFormer, SwinB-CNN, and ST-UNet, achieve better performance than the other algorithms when re-trained and re-evaluated under the same experimental settings as the proposed BiTSRS. For example, UNetFormer produces promising results even though it is a real-time method with the smallest model size. The Swin-transformer-based methods are particularly competitive for RS image semantic segmentation; e.g., ST-UNet achieves the best performance among all the compared algorithms, mainly because the Swin transformer structure can better acquire the semantic information in RS images. However, even the best transformer-based algorithm, ST-UNet, is inferior to the proposed BiTSRS.
In summary, the proposed BiTSRS clearly outperforms the compared state-of-the-art methods on all three datasets, which demonstrates its effectiveness. Specifically, the three carefully designed modules of BiTSRS (the ITM, the Dilated-Uper decoder, and the FDCN) make effective contributions to improving RS image semantic segmentation performance. Furthermore, the outstanding capability of BiTSRS in segmenting challenging classes illustrates the superiority of the proposed bi-decoder structure, which makes BiTSRS more robust to challenging and complex samples by acquiring a wide range of receptive fields and more detailed features.