Figure 1.
Swin Transformer backbone. Each Swin Transformer block contains a W-MSA block and an SW-MSA block. Except for the first stage, each stage is preceded by a Patch Merging layer that reduces the feature map size and deepens the feature map depth, generating a multilevel feature representation. The image is divided into nonoverlapping patches by a Patch Partition layer and mapped to C dimensions by a Linear Embedding layer as the input to the backbone network.
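The Patch Partition / Linear Embedding stem and the Patch Merging layer described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code; the 4×4 patch size and C = 96 follow the standard Swin Transformer.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into nonoverlapping 4x4 patches and map them to C dims."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        # A strided convolution implements partition + linear embedding in one step.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.proj(x)          # (B, C, H/4, W/4)

class PatchMerging(nn.Module):
    """Halve the spatial size and double the channel depth between stages."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, C, H, W), H and W even
        # Gather each 2x2 neighborhood into the channel axis, then project 4C -> 2C.
        tl, bl = x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2]
        tr, br = x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]
        x = torch.cat([tl, bl, tr, br], dim=1)           # (B, 4C, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                        # (B, H/2, W/2, 4C)
        return self.reduction(x).permute(0, 3, 1, 2)     # (B, 2C, H/2, W/2)

x = torch.randn(1, 3, 224, 224)
y = PatchEmbed()(x)                  # (1, 96, 56, 56)
z = PatchMerging(96)(y)              # (1, 192, 28, 28)
```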
Figure 2.
(a) Window self-attention. The window self-attention mechanism divides the input feature map into multiple windows within which self-attention is computed; information interaction across windows is achieved with shifted windows. (b) Cross-shaped window self-attention. The cross-shaped self-attention mechanism divides the feature map into vertical and horizontal branches, adjusts the strip-window size on each branch, and computes self-attention within each strip window. Finally, the attention results of the two directions are fused.
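The two partitioning schemes contrasted in the figure can be sketched as the following tensor reshapes (an illustrative sketch, not the authors' code): square windows for (a), horizontal and vertical strips for (b). Self-attention is then computed independently inside each partition.

```python
import torch

def square_windows(x, w):
    """(B, C, H, W) -> (B * H/w * W/w, C, w, w) nonoverlapping square windows."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // w, w, W // w, w)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, C, w, w)

def strip_windows(x, sw):
    """Split into horizontal strips of height sw and vertical strips of width sw."""
    B, C, H, W = x.shape
    horiz = x.reshape(B, C, H // sw, sw, W).permute(0, 2, 1, 3, 4).reshape(-1, C, sw, W)
    vert = x.reshape(B, C, H, W // sw, sw).permute(0, 3, 1, 2, 4).reshape(-1, C, H, sw)
    return horiz, vert

x = torch.randn(2, 96, 56, 56)
wins = square_windows(x, 7)      # (2*8*8, 96, 7, 7): attention within each 7x7 window
h, v = strip_windows(x, 7)       # horizontal (16, 96, 7, 56) and vertical (16, 96, 56, 7) strips
```

Each strip spans the full width (or height) of the map, which is why the fused result of the two branches covers a cross-shaped region around every position.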
Figure 3.
Cross-shaped Swin Transformer backbone. The cross-Swin Transformer block consists of an LN layer, cross-shaped window self-attention, and an MLP layer. We reduce the feature map size and deepen the feature map depth with a convolution of kernel size 3 and stride 2 to generate a multiscale representation. Before stage 1, the image is encoded by a convolutional layer with kernel size 7 and stride 4.
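The convolutional stem (kernel 7, stride 4) and inter-stage downsampling (kernel 3, stride 2) can be sketched as below; the channel widths (64, 128) and paddings are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Stem before stage 1: kernel 7, stride 4 -> 1/4-resolution encoding of the image.
stem = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)
# Between stages: kernel 3, stride 2 -> halved size, deepened channels.
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)
f1 = stem(x)    # (1, 64, 56, 56)
f2 = down(f1)   # (1, 128, 28, 28)
```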
Figure 4.
The overall structure of SaSPE. High-level feature maps contain rich semantic information, while low-level feature maps contain detailed information about the features. The position-aware module establishes bidirectional global position relationships through two branches. The semantic-space optimization module optimizes the semantic embedding space to obtain a spatial attention map that preserves detailed and global information. Cross-channel attention maps are generated adaptively by 1-D convolution to combine the advantages of the two levels of feature maps. (a) Position-aware module. (b) Semantic-space optimization module. (c) Adaptively selected kernel size k.
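The adaptive 1-D cross-channel convolution in (c) can be sketched as follows. This assumes the ECA-Net rule for choosing k from the channel count C (k = |log2(C)/γ + b/γ| rounded to the nearest odd number); the paper may use a different mapping, so treat the formula and the γ = 2, b = 1 defaults as assumptions.

```python
import math
import torch
import torch.nn as nn

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Pick an odd 1-D kernel size that grows with log2 of the channel count."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1          # force an odd kernel size

class ChannelAttention1D(nn.Module):
    """Cross-channel attention via a 1-D conv of adaptively chosen size k."""
    def __init__(self, channels):
        super().__init__()
        k = adaptive_kernel_size(channels)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                 # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))            # global average pool -> (B, C)
        w = self.conv(w.unsqueeze(1))     # 1-D conv across the channel axis
        w = torch.sigmoid(w).squeeze(1)[:, :, None, None]
        return x * w                      # reweight channels

out = ChannelAttention1D(96)(torch.randn(1, 96, 8, 8))
```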
Figure 5.
The visual effects of the semantic-space optimization module. The optimized feature information is more distinct, and both interclass differences and intraclass similarities are significantly enhanced. (a) Input image. (b) Original semantic space. (c) Optimized semantic space.
Figure 6.
Progressive feature fusion decoder. The high-level feature map is downsampled by convolution and upsampled to the same size as the sub-level feature map; high-level and sub-level feature maps are then fused in turn by the SaSPE module, forming a progressive feature fusion.
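The top-down fusion loop can be sketched as below. This is a minimal sketch, not the authors' decoder: the real fusion step is the SaSPE module, for which a plain 3×3 convolution stands in here, and the channel widths mirror typical Swin-T stage dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def progressive_fuse(feats, dims, out_dim=96):
    """feats: stage outputs from low level (large map) to high level (small map)."""
    lateral = [nn.Conv2d(d, out_dim, 1) for d in dims]   # channel reduction per stage
    fuse = nn.Conv2d(out_dim, out_dim, 3, padding=1)     # stand-in for the SaSPE module
    x = lateral[-1](feats[-1])                           # start from the highest level
    for f, lat in zip(reversed(feats[:-1]), reversed(lateral[:-1])):
        # Upsample the running high-level map to the sub-level size, then fuse.
        x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
        x = fuse(x + lat(f))
    return x

feats = [torch.randn(1, c, s, s) for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
out = progressive_fuse(feats, [96, 192, 384, 768])       # (1, 96, 56, 56)
```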
Figure 7.
Progressive MLP feature fusion decoder. The high-level feature map is downsampled by convolution and upsampled to the same size as the sub-level feature map, then fused with the low-level feature map to form a progressive feature fusion.
Figure 8.
Multiscale MLP feature fusion decoder. The feature maps of each stage are mapped to the same scale by MLP blocks and input to the SaSPE module, which models the global information at different scales level by level and fuses the multilevel features.
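The scale-unification step can be sketched as follows: each stage's map is projected to a common width (a 1×1 convolution acts as the per-stage MLP here) and resized to a common scale before the multiscale global modeling. The output width and target size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unify_scales(feats, dims, out_dim=96, size=(56, 56)):
    """Map each stage's feature map to the same width and spatial scale."""
    outs = []
    for f, d in zip(feats, dims):
        f = nn.Conv2d(d, out_dim, 1)(f)      # per-stage MLP block (1x1 projection)
        f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
        outs.append(f)
    # Stack all stages for the subsequent multiscale global modeling (SaSPE).
    return torch.cat(outs, dim=1)

feats = [torch.randn(1, c, s, s) for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
fused_in = unify_scales(feats, [96, 192, 384, 768])      # (1, 384, 56, 56)
```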
Figure 9.
Multiscale feature extraction decoder. The feature maps of each stage are mapped to the same scale by MLP blocks. The MLA head extracts features from each layer's result, and finally global modeling is performed by the SaSPE module.
Figure 10.
Some samples in the Beijing land use dataset. We show the categories corresponding to the tags.
Figure 11.
Some samples in the Gaofen image dataset (GID). We show the categories corresponding to the tags and the abbreviation for each GID category.
Figure 12.
Comparison of the visualization results of features on the BLU test dataset with different semantic segmentation methods. The blue label indicates “Water” and the gray label indicates “Road”. Observe the highlighted area in the lower right corner, where our model can identify “Road” regions that are obscured by other features, while the other models cannot.
Figure 13.
Comparison of the visualization results of features on the GID test dataset with different semantic segmentation methods. The real labels in the highlighted area indicate RV (River) and TL (Traffic Land), respectively.
Figure 14.
Visualization of the SaSPE intermediate layer. (a) Image. (b) Ground truth. (c) Input of the position-aware module. (d) Output of the position-aware module. (e) Semantic-space optimization results. (f) Output of the SaSPE module. (g) Prediction.
Figure 15.
Visualization of the effective receptive field of the Swin Transformer and the cross-shaped Swin Transformer at each stage. From left to right: stages 1 to 4. Red areas indicate a high impact on the target value and blue areas a low impact, i.e., the nonblue area indicates the effective receptive field at the current stage.
Figure 16.
Comparison of visualization results of the PFF decoder, the PMLP decoder, the MSMLP decoder, and the MFE decoder on the BLU test set. The yellow label indicates “Agricultural”, the blue label indicates “Water”, and the gray label indicates “Road”.
Table 1.
SaSPE module ablation experiments on the BLU dataset which uses the PFF decoder and the swin-T backbone.
(Position-Aware and Semantic-Space Optimization are the two components of the SaSPE module.)

| Decoder | Backbone | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| PFF | swin-T | √ | | 95.98 | 83.39 | 72.39 |
| PFF | swin-T | | √ | 95.89 | 84.88 | 71.85 |
| PFF | swin-T | √ | √ | 96.05 | 83.84 | 72.99 |
Table 2.
SaSPE module ablation experiments on the GID dataset which uses the PFF decoder and the swin-T backbone.
(Position-Aware and Semantic-Space Optimization are the two components of the SaSPE module.)

| Decoder | Backbone | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| PFF | swin-T | √ | | 97.84 | 71.28 | 55.41 |
| PFF | swin-T | | √ | 97.88 | 70.68 | 54.91 |
| PFF | swin-T | √ | √ | 97.89 | 67.35 | 55.73 |
Table 3.
Ablation experiments with window self-attention and cross-shaped window self-attention on the BLU dataset, with results on four decoders and using the SaSPE module by default.
(Window Self-Attention and Cross-Shaped Window Self-Attention indicate which backbone is used.)

| Decoder | FLOPs (GMac) | Params (M) | Window Self-Attention | Cross-Shaped Window Self-Attention | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| PFF | 30.17 | 30.93 | √ | | √ | 96.05 | 83.84 | 72.99 |
| PFF | 24.16 | 23.48 | | √ | √ | 96.18 | 84.49 | 73.84 |
| PMLP | 28.94 | 28.51 | √ | | √ | 95.96 | 83.66 | 72.71 |
| PMLP | 23.56 | 22.34 | | √ | √ | 96.15 | 84.28 | 73.56 |
| MSMLP | 28.58 | 28.19 | √ | | √ | 95.95 | 83.57 | 72.56 |
| MSMLP | 23.40 | 22.19 | | √ | √ | 96.12 | 84.34 | 73.66 |
| MFE | 38.15 | 28.77 | √ | | √ | 95.94 | 83.65 | 72.71 |
| MFE | 26.67 | 22.45 | | √ | √ | 96.16 | 84.45 | 73.80 |
Table 4.
Ablation experiments with window self-attention and cross-shaped window self-attention on the GID dataset, with results on four decoders and using the SaSPE module by default.
(Window Self-Attention and Cross-Shaped Window Self-Attention indicate which backbone is used.)

| Decoder | FLOPs (GMac) | Params (M) | Window Self-Attention | Cross-Shaped Window Self-Attention | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| PFF | 30.17 | 30.93 | √ | | √ | 97.89 | 67.35 | 55.73 |
| PFF | 24.16 | 23.48 | | √ | √ | 97.89 | 68.81 | 56.87 |
| PMLP | 28.94 | 28.51 | √ | | √ | 97.88 | 68.61 | 56.32 |
| PMLP | 23.56 | 22.34 | | √ | √ | 97.87 | 73.14 | 56.73 |
| MSMLP | 28.58 | 28.19 | √ | | √ | 97.89 | 68.54 | 56.37 |
| MSMLP | 23.40 | 22.19 | | √ | √ | 97.93 | 70.03 | 57.63 |
| MFE | 38.15 | 28.77 | √ | | √ | 97.85 | 68.42 | 56.20 |
| MFE | 26.67 | 22.45 | | √ | √ | 97.94 | 69.56 | 56.96 |
Table 5.
Experimental results of four decoders on the BLU dataset using the swin-T backbone and SaSPE module by default.
| Decoder | Backbone | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| PFF | swin-T | √ | 96.05 | 83.84 | 72.99 |
| PMLP | swin-T | √ | 95.96 | 83.66 | 72.71 |
| MSMLP | swin-T | √ | 95.95 | 83.57 | 72.56 |
| MFE | swin-T | √ | 95.94 | 83.65 | 72.71 |
Table 6.
Experimental results of four decoders on the GID dataset using the swin-T backbone and SaSPE module by default.
| Decoder | Backbone | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| PFF | swin-T | √ | 97.89 | 67.35 | 55.73 |
| PMLP | swin-T | √ | 97.88 | 68.61 | 56.32 |
| MSMLP | swin-T | √ | 97.89 | 68.54 | 56.37 |
| MFE | swin-T | √ | 97.85 | 68.42 | 56.20 |
Table 7.
Ablation experiments with various combinations of decoders, backbones, and the SaSPE module on the BLU dataset.
(Backbone: Window Self-Attention / Cross-Shaped Window Self-Attention; SaSPE module: Position-Aware / Semantic-Space Optimization.)

| Decoder | Window Self-Attention | Cross-Shaped Window Self-Attention | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|
| PFF | √ | | √ | | 95.98 | 83.39 | 72.39 |
| PFF | √ | | √ | √ | 96.05 | 83.84 | 72.99 |
| PFF | | √ | √ | | 96.12 | 84.26 | 73.55 |
| PFF | | √ | √ | √ | 96.18 | 84.49 | 73.84 |
| PMLP | √ | | √ | | 95.91 | 83.45 | 72.41 |
| PMLP | √ | | √ | √ | 95.96 | 83.66 | 72.71 |
| PMLP | | √ | √ | | 96.09 | 84.11 | 73.28 |
| PMLP | | √ | √ | √ | 96.15 | 84.28 | 73.56 |
| MSMLP | √ | | √ | | 95.92 | 83.44 | 72.38 |
| MSMLP | √ | | √ | √ | 95.95 | 83.57 | 72.56 |
| MSMLP | | √ | √ | | 96.05 | 84.09 | 73.23 |
| MSMLP | | √ | √ | √ | 96.12 | 84.34 | 73.66 |
| MFE | √ | | √ | | 95.93 | 83.52 | 72.52 |
| MFE | √ | | √ | √ | 95.94 | 83.65 | 72.71 |
| MFE | | √ | √ | | 96.06 | 84.30 | 73.56 |
| MFE | | √ | √ | √ | 96.16 | 84.45 | 73.80 |
Table 8.
Ablation experiments with various combinations of decoders, backbones, and the SaSPE module on the GID dataset.
(Backbone: Window Self-Attention / Cross-Shaped Window Self-Attention; SaSPE module: Position-Aware / Semantic-Space Optimization.)

| Decoder | Window Self-Attention | Cross-Shaped Window Self-Attention | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|
| PFF | √ | | √ | | 97.84 | 71.28 | 55.41 |
| PFF | √ | | √ | √ | 97.89 | 67.35 | 55.73 |
| PFF | | √ | √ | | 97.93 | 72.76 | 56.71 |
| PFF | | √ | √ | √ | 97.89 | 68.81 | 56.87 |
| PMLP | √ | | √ | | 97.86 | 67.48 | 55.44 |
| PMLP | √ | | √ | √ | 97.88 | 68.61 | 56.32 |
| PMLP | | √ | √ | | 97.87 | 72.29 | 56.63 |
| PMLP | | √ | √ | √ | 97.87 | 73.14 | 56.73 |
| MSMLP | √ | | √ | | 97.91 | 68.12 | 55.99 |
| MSMLP | √ | | √ | √ | 97.89 | 68.54 | 56.37 |
| MSMLP | | √ | √ | | 97.92 | 73.06 | 56.53 |
| MSMLP | | √ | √ | √ | 97.93 | 70.03 | 57.63 |
| MFE | √ | | √ | | 97.90 | 67.66 | 55.80 |
| MFE | √ | | √ | √ | 97.85 | 68.42 | 56.20 |
| MFE | | √ | √ | | 97.88 | 68.64 | 56.65 |
| MFE | | √ | √ | √ | 97.94 | 69.56 | 56.96 |
Table 9.
Comparison of experiment results with other models on the BLU dataset.
(The Background through Water columns report per-class IoU (%).)

| Methods | Backbone | Background | Built-Up | Vegetation | Road | Agricultural | Water | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| DANet [35] | swin-T | 49.02 | 69.18 | 80.12 | 43.09 | 78.44 | 72.79 | 94.84 | 78.15 | 65.44 |
| SE-Net [14] | swin-T | 59.36 | 78.16 | 83.61 | 49.79 | 79.82 | 77.47 | 95.87 | 82.65 | 71.37 |
| CBAM [19] | swin-T | 59.27 | 78.20 | 83.78 | 50.84 | 80.05 | 77.00 | 95.90 | 82.78 | 71.52 |
| SKNet [20] | swin-T | 59.92 | 79.37 | 83.90 | 52.35 | 80.03 | 77.05 | 95.96 | 83.23 | 72.10 |
| ECA-Net [15] | swin-T | 59.67 | 78.35 | 83.71 | 51.03 | 80.49 | 76.06 | 95.92 | 82.82 | 71.55 |
| EMANet [36] | swin-T | 59.55 | 79.15 | 83.36 | 52.69 | 79.44 | 75.70 | 95.87 | 82.94 | 71.65 |
| UperNet [37] | swin-T | 59.14 | 78.86 | 83.97 | 52.06 | 80.23 | 77.19 | 95.95 | 82.45 | 71.91 |
| Efficient Transformer [40] | Efficient-T | 59.51 | 78.92 | 83.78 | 55.06 | 79.96 | 78.27 | 95.94 | 83.62 | 72.58 |
| WicoNet [38] | ResNet50+transformer | 55.08 | 76.79 | 81.22 | 49.55 | 75.02 | 74.53 | 95.17 | 80.82 | 68.70 |
| SETR-PUP [39] | swin-T | 60.56 | 78.77 | 83.54 | 52.75 | 79.92 | 77.35 | 95.91 | 83.29 | 72.15 |
| SETR-MLA [39] | swin-T | 60.90 | 78.39 | 83.91 | 53.56 | 79.58 | 77.76 | 95.96 | 83.45 | 72.35 |
| SegFormer [41] | MiT-B2 | 59.96 | 80.14 | 84.19 | 54.80 | 80.36 | 78.06 | 96.04 | 83.83 | 72.92 |
| PFF | cross-shaped swin-T | 61.35 | 80.21 | 84.60 | 56.60 | 81.52 | 78.73 | 96.18 | 84.49 | 73.84 |
| PMLP | cross-shaped swin-T | 60.87 | 79.85 | 84.57 | 55.86 | 81.42 | 78.78 | 96.15 | 84.28 | 73.56 |
| MSMLP | cross-shaped swin-T | 61.00 | 80.00 | 84.38 | 55.63 | 80.96 | 79.99 | 96.12 | 84.34 | 73.66 |
| MFE | cross-shaped swin-T | 61.59 | 80.39 | 84.34 | 55.88 | 81.36 | 79.21 | 96.16 | 84.45 | 73.80 |
Table 10.
Comparison experiment results with other models on the GID dataset.
(The BG through PN columns report per-class IoU (%).)

| Methods | Backbone | BG | IDL | UR | RR | TL | PF | IL | DC | GP | AW | SL | NG | AG | RV | LK | PN | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DANet [35] | swin-T | 72.33 | 58.40 | 68.37 | 48.66 | 38.25 | 78.92 | 73.32 | 39.65 | 16.58 | 67.42 | 1.73 | 36.99 | 7.47 | 91.88 | 72.55 | 79.75 | 97.76 | 64.89 | 53.27 |
| SE-Net [14] | swin-T | 73.24 | 59.20 | 67.83 | 47.17 | 52.12 | 79.89 | 74.08 | 38.75 | 25.20 | 68.48 | 7.04 | 36.88 | 10.82 | 93.54 | 76.91 | 78.88 | 97.86 | 67.63 | 55.63 |
| CBAM [19] | swin-T | 73.71 | 59.58 | 69.81 | 48.96 | 48.94 | 77.80 | 76.01 | 46.57 | 18.86 | 68.39 | 0.00 | 37.23 | 7.16 | 94.25 | 73.42 | 79.24 | 97.92 | 70.76 | 55.00 |
| SKNet [20] | swin-T | 74.50 | 58.87 | 68.63 | 47.00 | 51.51 | 79.29 | 75.08 | 38.12 | 19.71 | 68.47 | 7.18 | 27.97 | 10.22 | 94.66 | 76.98 | 81.75 | 97.74 | 66.65 | 55.00 |
| ECA-Net [15] | swin-T | 74.27 | 60.43 | 69.91 | 51.15 | 51.08 | 79.66 | 74.63 | 41.69 | 15.93 | 67.12 | 0.00 | 37.69 | 14.57 | 95.02 | 69.99 | 82.35 | 97.94 | 71.28 | 55.34 |
| EMANet [36] | swin-T | 73.95 | 58.09 | 69.17 | 50.65 | 51.98 | 79.03 | 73.80 | 45.90 | 23.22 | 65.74 | 6.78 | 38.78 | 13.99 | 93.41 | 74.15 | 80.32 | 97.90 | 68.32 | 56.19 |
| UperNet [37] | swin-T | 73.21 | 57.04 | 69.74 | 42.96 | 51.73 | 79.89 | 70.41 | 41.88 | 25.67 | 66.65 | 8.32 | 42.43 | 10.71 | 93.59 | 74.21 | 81.34 | 97.84 | 67.80 | 55.61 |
| Efficient Transformer [40] | Efficient-T | 71.98 | 56.37 | 68.06 | 52.46 | 52.18 | 82.65 | 74.17 | 40.38 | 21.56 | 68.32 | 1.89 | 45.62 | 13.03 | 94.84 | 64.35 | 76.08 | 97.81 | 67.22 | 55.25 |
| WicoNet [38] | ResNet50+transformer | 69.54 | 40.81 | 65.01 | 49.47 | 47.28 | 79.99 | 67.21 | 23.29 | 11.87 | 64.49 | 7.11 | 46.79 | 13.81 | 93.29 | 61.17 | 82.15 | 97.52 | 63.74 | 51.46 |
| SETR-PUP [39] | swin-T | 73.49 | 58.94 | 69.00 | 46.15 | 51.09 | 76.64 | 73.83 | 30.64 | 26.98 | 69.79 | 0.00 | 39.13 | 8.35 | 92.47 | 72.87 | 79.92 | 97.89 | 70.37 | 54.33 |
| SETR-MLA [39] | swin-T | 72.54 | 55.84 | 65.39 | 46.09 | 52.53 | 81.24 | 74.80 | 45.41 | 25.84 | 66.44 | 13.81 | 33.36 | 5.87 | 93.48 | 74.98 | 82.10 | 97.79 | 67.70 | 55.61 |
| SegFormer [41] | MiT-B2 | 72.97 | 58.84 | 69.37 | 51.77 | 53.49 | 82.22 | 73.76 | 39.46 | 17.48 | 67.24 | 11.14 | 43.28 | 12.79 | 93.09 | 74.74 | 76.87 | 97.86 | 68.26 | 56.16 |
| PFF | cross-shaped swin-T | 73.36 | 60.80 | 68.74 | 52.38 | 54.62 | 81.50 | 74.61 | 36.96 | 25.59 | 66.64 | 5.08 | 39.08 | 17.53 | 93.18 | 82.71 | 77.17 | 97.89 | 68.81 | 56.87 |
| PMLP | cross-shaped swin-T | 73.48 | 60.36 | 67.66 | 52.54 | 53.65 | 81.75 | 74.32 | 40.26 | 23.43 | 67.92 | 0.00 | 41.64 | 21.91 | 92.64 | 79.65 | 76.41 | 97.87 | 73.14 | 56.73 |
| MSMLP | cross-shaped swin-T | 73.71 | 61.67 | 69.62 | 52.50 | 54.88 | 80.87 | 75.54 | 41.74 | 25.80 | 67.12 | 7.18 | 43.11 | 25.30 | 92.51 | 72.76 | 77.81 | 97.93 | 70.03 | 57.63 |
| MFE | cross-shaped swin-T | 73.52 | 63.00 | 70.88 | 53.27 | 53.86 | 81.59 | 74.35 | 35.30 | 29.13 | 67.46 | 10.51 | 42.13 | 21.21 | 93.32 | 64.98 | 76.77 | 97.94 | 69.56 | 56.96 |
Table 11.
Comparison of model parameters, FLOPs, accuracy, and mIoU.
| Methods | Backbone | BLU mIoU (%) | GID mIoU (%) | Params (M) | FLOPs (GMac) |
|---|---|---|---|---|---|
| DANet [35] | swin-T | 65.44 | 53.27 | 30.89 | 25.54 |
| SE-Net [14] | swin-T | 71.37 | 55.63 | 27.99 | 26.27 |
| CBAM [19] | swin-T | 71.52 | 55.00 | 28.02 | 26.27 |
| SKNet [20] | swin-T | 72.10 | 55.00 | 29.34 | 29.10 |
| ECA-Net [15] | swin-T | 71.55 | 55.34 | 28.00 | 26.27 |
| EMANet [36] | swin-T | 71.65 | 56.19 | 28.38 | 27.18 |
| UperNet [37] | swin-T | 71.91 | 55.61 | 30.35 | 31.42 |
| Efficient Transformer [40] | Efficient-T | 72.58 | 55.25 | 29.92 | 23.06 |
| WicoNet [38] | ResNet50+transformer | 68.70 | 51.46 | 39.07 | 83.56 |
| SETR-PUP [39] | swin-T | 72.15 | 54.33 | 27.30 | 28.10 |
| SETR-MLA [39] | swin-T | 72.35 | 55.61 | 35.98 | 28.33 |
| SegFormer [41] | MiT-B2 | 72.92 | 56.16 | 27.35 | 56.70 |
| PFF | cross-shaped swin-T | 73.84 | 56.87 | 23.48 | 24.16 |
| PMLP | cross-shaped swin-T | 73.56 | 56.73 | 22.34 | 23.56 |
| MSMLP | cross-shaped swin-T | 73.66 | 57.63 | 22.19 | 23.40 |
| MFE | cross-shaped swin-T | 73.80 | 56.96 | 22.45 | 26.67 |