4.4. Comparison with the State-of-the-Art
In this part, we compare the proposed method with several other state-of-the-art (SOTA) instance segmentation methods, including Mask R-CNN [14], Mask Scoring R-CNN [40], Cascade Mask R-CNN [41], HTC [43], SCNet [44], PointRend [45], ViTAE [32], ViTDet [33], YOLACT [47], CondInst [51], BoxInst [56], QueryInst [52], HQ-ISNet [57], CATNet [60], LFGNet [5], RSPrompter [63], and YOLOv8. Among the compared methods, ViTDet and ViTAE use Transformer-based pre-trained models, while the other methods use ResNet-101 [19] as the pre-trained model. We compare the quantitative results of CSNet with the other SOTA methods on the NWPU and SSDD datasets in terms of the threshold AP metrics ($AP$, $AP_{50}$, and $AP_{75}$) and the scale AP metrics ($AP_S$, $AP_M$, and $AP_L$) defined in Section 4.2, reported for both the detection (box) and segmentation (mask) tasks.
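For convenience, the threshold AP convention can be restated briefly (assuming Section 4.2 follows the standard COCO protocol; the definition given there governs):

$$AP=\frac{1}{10}\sum_{t\in\{0.50,\,0.55,\,\ldots,\,0.95\}}AP_t,\qquad AP_{50}=AP_{t=0.50},\qquad AP_{75}=AP_{t=0.75},$$

where $AP_t$ denotes the area under the precision-recall curve computed at IoU threshold $t$, and the scale metrics $AP_S$, $AP_M$, and $AP_L$ restrict the evaluation to small, medium, and large objects, respectively.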
To further improve the performance of the model, we additionally conducted multi-scale training of CSNet. The quantitative results on the NWPU dataset are shown in Table 1. Without multi-scale training, CSNet achieves 71.6% $AP^{box}$, 95.0% $AP^{box}_{50}$, 84.2% $AP^{box}_{75}$, 67.9% $AP^{mask}$, 94.8% $AP^{mask}_{50}$, and 74.2% $AP^{mask}_{75}$. Compared with Mask R-CNN, CSNet obtains increments of 7.3% $AP^{box}$ and 7.6% $AP^{mask}$; compared with ViTDet, increments of 5.9% $AP^{box}$ and 3.7% $AP^{mask}$; and compared with Cascade Mask R-CNN, increments of 4.9% $AP^{box}$ and 5.0% $AP^{mask}$. Among the 12 test indicators, CSNet achieves the best performance. For multi-scale training of CSNet, we use Standard Scale Jittering (SSJ) [71], which resizes and crops an image within a resize range of 0.8 to 1.25 of the original image size. When multi-scale training is adopted, the resulting model, termed CSNet*, achieves 73.1% $AP^{box}$, 95.2% $AP^{box}_{50}$, 86.2% $AP^{box}_{75}$, 70.5% $AP^{mask}$, 95.1% $AP^{mask}_{50}$, and 78.4% $AP^{mask}_{75}$, the highest quantitative results among the state-of-the-art methods. Compared with ViTDet, CSNet* obtains increments of 7.4% $AP^{box}$ and 6.3% $AP^{mask}$; compared with HQ-ISNet, increments of 3.9% $AP^{box}$ and 6.4% $AP^{mask}$. These results show that CSNet performs very well on the instance segmentation task for optical remote sensing images and that, despite the small size of the NWPU dataset, the generalization ability of SAM enhances the accurate segmentation of objects.
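As a concrete illustration of this augmentation, the following is a minimal sketch of Standard Scale Jittering under the stated 0.8 to 1.25 resize range; the function name, the PyTorch-style implementation, and the crop/zero-padding policy are assumptions for illustration rather than the exact training pipeline used here (box and mask targets, omitted below, must be transformed consistently).

```python
import random
import torch
import torch.nn.functional as F

def standard_scale_jitter(image, scale_range=(0.8, 1.25)):
    """Standard Scale Jittering (sketch): resize by a random factor in
    [0.8, 1.25] of the original size, then crop or zero-pad back to the
    original size. Names and the padding policy are illustrative."""
    _, h, w = image.shape                         # image: (C, H, W) float tensor
    s = random.uniform(*scale_range)              # random resize factor
    nh, nw = int(round(h * s)), int(round(w * s))
    resized = F.interpolate(image[None], size=(nh, nw),
                            mode="bilinear", align_corners=False)[0]
    if s >= 1.0:                                  # larger than original: random crop
        top = random.randint(0, nh - h)
        left = random.randint(0, nw - w)
        return resized[:, top:top + h, left:left + w]
    out = image.new_zeros(image.shape)            # smaller: paste into a zero canvas
    top = random.randint(0, h - nh)
    left = random.randint(0, w - nw)
    out[:, top:top + nh, left:left + nw] = resized
    return out
```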
The quantitative results on the SSDD dataset are shown in Table 2. The SSDD dataset differs from the NWPU dataset: the NWPU dataset contains many different instance categories, whereas SSDD only requires detecting and segmenting the ship objects in the images; in addition, the SSDD dataset contains more images than the NWPU dataset. Without multi-scale training, CSNet achieves 73.4% $AP^{box}$, 97.4% $AP^{box}_{50}$, 90.0% $AP^{box}_{75}$, 70.5% $AP^{mask}$, 96.6% $AP^{mask}_{50}$, and 96.7% $AP^{mask}_{75}$.
Compared with Mask R-CNN, CSNet obtains increments of 3.3% $AP^{box}$ and 2.7% $AP^{mask}$; compared with ViTDet, increments of 4.9% $AP^{box}$ and 3.4% $AP^{mask}$; and compared with Cascade Mask R-CNN, increments of 1.0% $AP^{box}$ and 1.6% $AP^{mask}$. When multi-scale training is adopted, CSNet* achieves 75.3% $AP^{box}$, 98.0% $AP^{box}_{50}$, 93.1% $AP^{box}_{75}$, 72.7% $AP^{mask}$, 98.0% $AP^{mask}_{50}$, and 92.0% $AP^{mask}_{75}$, the highest quantitative results among the state-of-the-art methods. Compared with ViTDet, CSNet* obtains increments of 6.8% $AP^{box}$ and 5.6% $AP^{mask}$; compared with HQ-ISNet, increments of 2.3% $AP^{box}$ and 4.0% $AP^{mask}$. CSNet and CSNet* achieve the best results on all indicators, indicating that our model performs very well on SAR-image ship instance segmentation.
To visually compare our model with the SOTA methods, we chose Mask R-CNN, SCNet, CondInst, ViTDet, and HQ-ISNet as representatives for a qualitative comparison with CSNet. Figure 5 and Figure 6 illustrate sample segmentation results on the NWPU and SSDD datasets. In the qualitative results on the NWPU dataset, CSNet attains the best precision and recall in the car-park scene and segments the vehicle contours accurately; in the airport scene, CSNet handles the edges of the aircraft masks best; in the playground scene, CSNet produces smooth, continuous, and well-defined angular segmentation edges for both large and small objects; and in the port scene, CSNet segments each port accurately. In the qualitative results on the SSDD dataset, CSNet also accurately detects and segments images with complex backgrounds and densely arranged ships. These visualization results validate the superiority of our CSNet in the instance segmentation task.
In summary, with the help of SAM’s zero-shot segmentation ability, the segmentation accuracy of the model is significantly improved, and the introduction of context information gives CSNet higher detection precision and recall. CSNet has stronger instance segmentation ability than the SOTA methods and performs better on both optical and SAR images.
4.5. Ablation Studies
In this section, we undertake a series of experiments, mainly on the NWPU dataset, to investigate the significance of each component and parameter setting within our proposed method.
The impacts of the main components are as follows: we designed module ablation experiments to verify the impact of each module on model performance, adding PFFPN, CABH, and SGMH to the baseline model in turn (the baseline model uses SAM’s image encoder as the backbone, a simple FPN as the neck, and the bbox head and mask head of Cascade Mask R-CNN as the head). To fully demonstrate the effectiveness of each module, we conducted experiments on both the NWPU and SSDD datasets; the results are shown in Table 3. On the NWPU dataset, with the successive addition of PFFPN, CABH, and SGMH, the $AP^{box}$ and $AP^{mask}$ of the model increase by 0.9%/2.2%, 0.7%/0.7%, and 1.1%/1.1%, respectively, for a total increase of 2.9%/4.0%. On the SSDD dataset, with the successive addition of PFFPN, CABH, and SGMH, the $AP^{box}$ and $AP^{mask}$ of the model increase by 1.2%/2.0%, 1.0%/1.0%, and 1.4%/1.4%, respectively, for a total increase of 3.6%/4.4%. On the other evaluation indicators, PFFPN, CABH, and SGMH all have a positive impact on the model, indicating that the designed modules effectively improve instance segmentation accuracy. To show the behavior of each module more intuitively, we also visualize the experimental results, as shown in Figure 7 and Figure 8. The feature maps show that PFFPN enhances the response of the object and highlights its location; the segmentation results show that, after the object response is highlighted, the instance masks are more complete and mask aliasing is reduced. CABH further puts the whole object region into a high-response state, reducing false detections in the results. SGMH constrains the edges, so the feature responses conform more closely to the object shape; in the experimental results, the mask edges produced with SGMH are very smooth and the edge corners are more distinct. The visualization experiments show that our modules clearly help instance segmentation.
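For concreteness, the ablation compositions described above can be summarized as the following illustrative Python dictionary; the module names mirror the text, but the keys and strings are a sketch rather than the actual configuration files used in the experiments.

```python
# Illustrative ablation configurations (not the actual config files).
# Baseline: SAM image encoder backbone + simple FPN neck + Cascade Mask R-CNN
# bbox/mask heads; each subsequent row swaps in one of the proposed modules.
ablation_configs = {
    "baseline": dict(backbone="SAM image encoder",
                     neck="Simple FPN",
                     bbox_head="Cascade bbox head",
                     mask_head="Cascade mask head"),
    "+ PFFPN": dict(backbone="SAM image encoder",
                    neck="PFFPN",                   # replaces the simple FPN
                    bbox_head="Cascade bbox head",
                    mask_head="Cascade mask head"),
    "+ PFFPN + CABH": dict(backbone="SAM image encoder",
                           neck="PFFPN",
                           bbox_head="CABH",        # context-aware bbox head
                           mask_head="Cascade mask head"),
    "+ PFFPN + CABH + SGMH (CSNet)": dict(backbone="SAM image encoder",
                                          neck="PFFPN",
                                          bbox_head="CABH",
                                          mask_head="SGMH"),  # SAM-guided mask head
}
```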
The impacts of different image encoders as the backbone are as follows: the backbone is a very important part of tasks such as instance segmentation, and a good backbone achieves better performance under the same model. Our model uses the SAM image encoder as the backbone, which is available in three versions: ViT-B, ViT-L, and ViT-H. We ran experiments with each version of the image encoder as the backbone to explore its influence on the model; the results are shown in Table 4. The table shows that model performance increases as the parameter count of the image encoder grows. For the best performance, ViT-H reaches 71.6% $AP^{box}$ and 67.9% $AP^{mask}$; for the fastest running speed, ViT-B still reaches 68.8% $AP^{box}$ and 65.4% $AP^{mask}$; and to balance performance and speed, ViT-L reaches 69.7% $AP^{box}$ and 66.2% $AP^{mask}$.
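As a sketch of how the three encoder variants can be instantiated as backbones, the snippet below uses the public segment_anything package; the checkpoint filenames are placeholders, and whether the encoder is frozen or fine-tuned is a design choice not specified here.

```python
# Sketch: selecting a SAM image encoder (ViT-B / ViT-L / ViT-H) as the backbone.
# Requires the public `segment_anything` package; checkpoint paths are placeholders.
from segment_anything import sam_model_registry

def build_sam_backbone(variant="vit_h", checkpoint="sam_vit_h.pth"):
    sam = sam_model_registry[variant](checkpoint=checkpoint)
    encoder = sam.image_encoder            # ViT image encoder used as the backbone
    for p in encoder.parameters():         # freezing is an assumption, not a given
        p.requires_grad = False
    return encoder

backbone = build_sam_backbone("vit_b", "sam_vit_b.pth")  # fastest variant
```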
The impacts of the training schedule are as follows: we want as few training epochs as possible without affecting accuracy, so as to reduce the training time. We therefore designed experiments to explore the influence of the training schedule on the results, which are shown in Table 5. Under the 1x and 2x training schedules, $AP^{box}$ and $AP^{mask}$ are significantly lower than under the 3x and 6x schedules, while the 3x and 6x schedules give essentially the same performance. Considering the training time, we therefore adopt the 3x schedule.
The impacts of different hierarchical features and feature dimensions are as follows: FPN design is still under-explored for ViT-based instance segmentation methods, and which of the hierarchical features in ViT should be selected to construct the FPN needs further investigation. Because the feature dimension in ViT is large, it should be reduced appropriately, losing as little performance as possible while reducing the amount of computation. We therefore designed experiments on our proposed PFFPN to explore the effects of different hierarchical features and feature dimensions on FPN performance; the results are shown in Table 6. The results show that when one feature is extracted every 8 layers, for a total of 4 feature levels used to construct the FPN, the model performs better than when 1, 2, or 8 feature levels are extracted. With 4 feature levels, the performance differences among dimensions 256, 128, and 64 are not significant; however, with dimension 64, the best values are obtained on five indicators, and the $AP^{box}$ and $AP^{mask}$ of the model reach 71.6% and 67.9%. PFFPN therefore takes the features of layers [7, 15, 23, 31] as input and reduces the dimension to 64 after interpolation, which maximizes the performance of the model.
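The layer selection and dimension choice above can be sketched as follows; the block indices [7, 15, 23, 31] and the 64-dimensional output come from the text, while the assignment of blocks to pyramid strides and the simple interpolate-then-project fusion are simplifying assumptions about PFFPN's exact structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PFFPNSketch(nn.Module):
    """Illustrative sketch: take the outputs of ViT blocks 7, 15, 23, and 31
    (one feature every 8 layers of a 32-block encoder), interpolate each map
    to a pyramid stride, and reduce the channel dimension to 64.
    The real PFFPN fusion is more elaborate than this."""
    def __init__(self, embed_dim=1280, out_dim=64,
                 layer_ids=(7, 15, 23, 31), strides=(4, 8, 16, 32)):
        super().__init__()
        self.layer_ids = layer_ids
        self.strides = strides                     # target strides for P2..P5
        self.reduce = nn.ModuleList(
            [nn.Conv2d(embed_dim, out_dim, kernel_size=1) for _ in layer_ids])

    def forward(self, vit_feats):
        # vit_feats: dict {block_index: (B, embed_dim, H, W)} taken at stride 16
        outs = []
        for conv, idx, s in zip(self.reduce, self.layer_ids, self.strides):
            x = F.interpolate(vit_feats[idx], scale_factor=16 / s,
                              mode="bilinear", align_corners=False)
            outs.append(conv(x))                   # reduce channels to 64
        return outs                                # [P2, P3, P4, P5]
```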
The impacts of different FPNs in different ViT structures are as follows: the proposed PFFPN achieves its best performance after the above parameter optimization. To test whether PFFPN also performs well in other models with ViT backbones, we compare PFFPN with the FPN in ViTAE and the Simple FPN in ViTDet; the experimental results are shown in Table 7. The table shows that replacing the FPN of ViTAE with PFFPN raises $AP^{box}$ and $AP^{mask}$ from 65.4% and 63.3% to 68.4% and 65.4%, increases of 3.0% and 2.1%, respectively. Replacing the Simple FPN of ViTDet with PFFPN raises $AP^{box}$ and $AP^{mask}$ from 65.7% and 64.2% to 68.3% and 64.7%, increases of 2.6% and 0.5%, respectively. For CSNet without PFFPN, $AP^{box}$ and $AP^{mask}$ decrease by 1.3% and 1.8%, respectively. These experiments show that, in networks with ViTAE, ViTDet, and SAM backbones, PFFPN has stronger feature extraction and fusion capability and can significantly improve network performance.
The impacts of the object branch and context branch are as follows: in CABH, we design an object branch and a context branch so that context information assists object recognition and localization. However, how to design the two branches to maximize performance still requires further experiments. At present, the most common bbox head is the two shared fully connected layers used in Fast R-CNN; ViTDet instead uses four convolution layers followed by one fully connected layer. In these bbox head designs, object recognition and localization pass through the same network. To remove the coupling between the recognition and localization tasks, we also designed a bbox head that predicts the classification score and the box through two separate two-layer fully connected branches. We then designed experiments to explore which network is most effective for the object branch and the context branch, respectively; the results are shown in Table 8. The experiments show that CABH achieves the best detection performance when the object branch uses Separated 2fc and the context branch uses Shared 4conv1fc. Separated 2fc removes the coupling between the object recognition and localization tasks and improves detection performance, while Shared 4conv1fc first captures context information through convolutions and then transforms it, through a fully connected layer, into information useful for object recognition and localization.
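The two branch structures selected above can be sketched as follows; the Separated 2fc and Shared 4conv1fc layouts follow the text, whereas the feature sizes, class count, and the way the context vector is fused back into the object branch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Separated2FC(nn.Module):
    """Object branch (sketch): separate two-layer FC stacks for classification
    and box regression, decoupling recognition from localization."""
    def __init__(self, in_dim=256 * 7 * 7, hidden=1024, num_classes=10):
        super().__init__()
        self.cls_fcs = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                                     nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
        self.reg_fcs = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                                     nn.Linear(hidden, hidden), nn.ReLU(inplace=True))
        self.cls_score = nn.Linear(hidden, num_classes + 1)   # classes + background
        self.bbox_pred = nn.Linear(hidden, 4)

    def forward(self, roi_feat):                  # roi_feat: (N, 256, 7, 7)
        x = roi_feat.flatten(1)
        return self.cls_score(self.cls_fcs(x)), self.bbox_pred(self.reg_fcs(x))

class Shared4Conv1FC(nn.Module):
    """Context branch (sketch): four shared 3x3 convolutions followed by one FC
    layer that turns the context features into a vector for the object branch."""
    def __init__(self, in_ch=256, hidden=1024, roi_size=7):
        super().__init__()
        self.convs = nn.Sequential(*[nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(4)])
        self.fc = nn.Linear(in_ch * roi_size * roi_size, hidden)

    def forward(self, ctx_feat):                  # ctx_feat: (N, 256, 7, 7)
        return self.fc(self.convs(ctx_feat).flatten(1))
```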
The impacts of CABH as the bbox head are as follows: we hope that CABH not only improves the detection performance of our model but also improves detection when applied to other (Mask R-CNN-based) models. We therefore replaced the bbox head of Mask R-CNN, Cascade Mask R-CNN, and ViTDet with CABH and judged the effect of CABH on each model by the $AP^{box}$ metric; the experimental results are shown in Table 9. CABH improves the $AP^{box}$ of Mask R-CNN from 64.3% to 66.4%, the $AP^{box}$ of Cascade Mask R-CNN from 66.7% to 68.5%, and the $AP^{box}$ of ViTDet from 65.7% to 67.1%. The introduction of CABH improves all six detection indicators in all three models. Conversely, replacing the CABH in CSNet with a general bbox head decreases $AP^{box}$ by 1.1%, and the other five indicators also decrease. These experiments show that CABH effectively improves detection performance.
The impacts of using SAM, FCN, and SGMH as the mask head are as follows: because SAM has zero-shot capability, it can generate masks for untrained images without fine-tuning, so SAM can act as a mask head on its own. We designed experiments that replace the SGMH in CSNet with SAM and with FCN to evaluate which mask head performs best; the results are shown in Table 10. Among SAM, FCN, and SGMH, SGMH performs best and SAM performs worst. This shows that although SAM can generate masks without training, their accuracy is not as good as that of a model trained with pixel-level labels. Compared with FCN, SGMH improves performance on both the NWPU and SSDD datasets, indicating that learning SAM’s masks through a student network captures additional information that enriches the deep features and thereby improves mask quality.
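The student-network idea can be sketched as a distillation-style mask loss; only the teacher (SAM-generated masks) and student (FCN mask head) roles come from the text, while the binary cross-entropy form and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def sgmh_style_loss(student_logits, gt_masks, sam_masks, distill_weight=0.5):
    """Sketch of a SAM-guided mask loss: supervised BCE against ground-truth
    masks plus a distillation term against SAM-generated masks.
    student_logits: (N, S, S) raw mask logits from the FCN student head
    gt_masks, sam_masks: (N, S, S) float {0,1} targets resized to the mask size
    The exact loss composition in SGMH may differ; the weight is illustrative."""
    supervised = F.binary_cross_entropy_with_logits(student_logits, gt_masks)
    distill = F.binary_cross_entropy_with_logits(student_logits, sam_masks)
    return supervised + distill_weight * distill
```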
The impacts of the mask size are as follows: in SGMH, an FCN is used as the structure that predicts the instance mask. The larger the mask size, the greater the computation consumed; however, to learn the knowledge of SAM in SGMH, a larger mask size preserves more detail. The mask size therefore has to be chosen appropriately, and we designed experiments on the performance of CSNet under different mask sizes; the results are shown in Table 11. At the best-performing mask size in the table, CSNet achieves the best results on the five mask-related indicators, with $AP^{mask}$ reaching 67.9%; considering both performance and computation time, SGMH adopts this mask size.
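To illustrate the trade-off discussed above, the following quick calculation shows how the mask-head output grows quadratically with the mask size; the candidate sizes and the RoI/class counts are illustrative values, not the settings of Table 11.

```python
# Output of an FCN mask head is roughly N_roi x C x S x S, so memory and the
# final-layer compute grow quadratically with the mask size S (illustrative).
def mask_head_output_elements(num_rois, num_classes, mask_size):
    return num_rois * num_classes * mask_size * mask_size

for s in (14, 28, 56, 112):                     # candidate mask sizes (assumed values)
    print(s, mask_head_output_elements(num_rois=100, num_classes=10, mask_size=s))
# Doubling S quadruples the output size, e.g. going from 28 to 56 multiplies it by 4.
```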
The impacts of SGMH as the mask head are as follows: similar to CABH, we hope that SGMH can also improve the segmentation performance of other (Mask R-CNN-based) models. We therefore replaced the mask head of Mask R-CNN, Cascade Mask R-CNN, and ViTDet with SGMH(ViT-B), SGMH(ViT-L), and SGMH(ViT-H) and judged the effect of SGMH on each model by the $AP^{mask}$ metric; the experimental results are shown in Table 12. SGMH(ViT-L) improves the $AP^{mask}$ of Mask R-CNN from 60.3% to 64.7%, the $AP^{mask}$ of Cascade Mask R-CNN from 62.9% to 65.0%, and the $AP^{mask}$ of ViTDet from 66.2% to 69.5%; SGMH(ViT-B) and SGMH(ViT-H) also improve the $AP^{mask}$. The introduction of SGMH improves all six segmentation indicators in all three models. Conversely, replacing the SGMH in CSNet with a general mask head decreases $AP^{mask}$ by 1.0%, and the other five indicators also decrease. These experiments show that SGMH effectively improves segmentation performance.