Author Contributions
Conceptualization, J.X. and F.D.; methodology, J.X.; validation, J.X.; formal analysis, J.X.; investigation, J.X.; resources, F.D.; data curation, J.X.; writing—original draft preparation, J.X.; writing—review and editing, J.X.; visualization, J.X.; supervision, J.X.; funding acquisition, F.D. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Schematic diagram of the overall network flow.
Figure 2.
Overall architecture of the proposed 3D point cloud instance segmentation method.
Figure 3.
Feature fusion module. Convolutional layers fuse the global shape features and encode the shape information.
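A minimal PyTorch sketch of this fusion step, assuming the common concatenate-then-convolve design; all module and argument names here are illustrative, not the released code:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Illustrative fusion module: the global shape feature vector is
    broadcast to every point, concatenated with the per-point features,
    and mixed by a 1x1 convolution acting as a channel-wise MLP."""

    def __init__(self, point_dim: int, shape_dim: int, out_dim: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv1d(point_dim + shape_dim, out_dim, kernel_size=1),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, point_feats, shape_feats):
        # point_feats: (B, C_p, N); shape_feats: (B, C_s) global shape vector
        shape_feats = shape_feats.unsqueeze(-1).expand(-1, -1, point_feats.size(-1))
        return self.fuse(torch.cat([point_feats, shape_feats], dim=1))
```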
Figure 4.
Global Shape Attention (GSA) module. This module mainly consists of the Shape Feature Aggregation (SFA) module and the Cross-Attention (CA) module. Its output is attention features fused with the global shape contours.
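To make the data flow concrete, the sketch below gives one plausible PyTorch reading of GSA: aggregated shape tokens (standing in for the SFA output) serve as keys and values, and point features attend to them through cross-attention with a residual connection. Names, head counts, and the residual placement are our assumptions:

```python
import torch
import torch.nn as nn

class GlobalShapeAttention(nn.Module):
    """Sketch of GSA: point features (queries) cross-attend to aggregated
    shape tokens (keys/values) produced by an SFA-like step upstream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, shape_tokens):
        # point_feats: (B, N, D); shape_tokens: (B, S, D) from the SFA step
        attended, _ = self.cross_attn(point_feats, shape_tokens, shape_tokens)
        # Residual keeps the original point features in the fused output
        return self.norm(point_feats + attended)
```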
Figure 5.
Instance Query Module. This module produces the instance query vectors (Instance Query) that represent the characteristics of each instance. Its inputs are the initialized instance query vectors and the point features that incorporate the global shape contour features; Cross-Attention and Self-Attention are then applied to obtain the final instance query vectors. The number of query vectors equals the preset number of instances in the scene.
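This refinement maps naturally onto a transformer-decoder-style layer. The sketch below is one plausible PyTorch rendering, assuming K learned query vectors; it is not the paper's code:

```python
import torch
import torch.nn as nn

class InstanceQueryModule(nn.Module):
    """One query-refinement layer: K learnable instance queries
    cross-attend to the shape-aware point features, then self-attend
    to each other, as described for Figure 5."""

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, point_feats):
        # point_feats: (B, N, D); returns refined queries of shape (B, K, D)
        q = self.queries.unsqueeze(0).expand(point_feats.size(0), -1, -1)
        q = self.norm1(q + self.cross_attn(q, point_feats, point_feats)[0])
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return q
```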
Figure 6.
Instance Mask Module (IMM). The inputs of this module are the point features and the refined instance query vectors; the outputs are the mask label of each instance and the semantic category label of the corresponding instance.
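A common way to realize such a head in query-based methods such as Mask3D is to score every point feature against every query with a dot product and to classify each query with a linear layer; the sketch below assumes that design and is only illustrative:

```python
import torch
import torch.nn as nn

class InstanceMaskModule(nn.Module):
    """Illustrative mask/classification head: per-instance masks come
    from query-point dot products; class logits come from a linear head
    (with an extra "no object" slot)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)

    def forward(self, point_feats, queries):
        # point_feats: (B, N, D); queries: (B, K, D)
        mask_logits = torch.einsum("bnd,bkd->bkn", point_feats, queries)
        masks = mask_logits.sigmoid() > 0.5       # (B, K, N) binary instance masks
        class_logits = self.cls_head(queries)     # (B, K, num_classes + 1)
        return masks, class_logits
```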
Figure 7.
Visualization of Area 5 instance segmentation results. The left shows the original input; the middle shows the ground truth and prediction of semantic segmentation, where different colors indicate different categories; and the right shows the ground truth and prediction of instance segmentation, where different colors indicate different instance objects.
Figure 8.
Visualization results for the ScanNet test dataset. The left shows the original RGB input; the middle shows the semantic prediction results, where different colors indicate different categories; and the right shows the instance prediction results, where different colors indicate different instance objects.
Figure 9.
Visualization of STPLS3D results. The bottom shows the original input, the middle shows the instance ground truth, and the top shows the predicted instances.
Figure 10.
Visualization of comparison results on the S3DIS dataset. The upper image shows the result of our proposed network, and the lower image shows the result of the Mask3D network. In the black boxes, different colors indicate different instances; the comparison shows that our proposed network alleviates the difficulty of distinguishing spatially distributed similar instances in a scene.
Figure 11.
Comparison results on the ScanNet dataset. The bottom shows the original input, the middle shows the Mask3D result, and the top shows the prediction of the proposed network. In the black boxes, different colors indicate different instances; the comparison shows that our proposed network alleviates the difficulty of distinguishing spatially distributed similar instances in a scene.
Figure 12.
Visualization of the learned shape contour features.
Table 1.
Quantitative analysis results for Area 5 instance segmentation. The evaluation metrics are mAP, mAP50, Prec50, and Rec50.
| Method | mAP | mAP50 | Prec50 | Rec50 |
|---|---|---|---|---|
| SGPN [4] | - | - | 36.0 | 28.7 |
| ASIS [5] | - | - | 55.3 | 42.4 |
| 3D-BoNet [3] | - | - | 57.5 | 40.2 |
| PointGroup [22] | - | 57.8 | 61.9 | 62.1 |
| MaskGroup [23] | - | 65.0 | 62.9 | 64.7 |
| SSTNet [24] | 42.7 | 59.3 | 65.5 | 64.2 |
| Mask3D [9] | 56.6 | 68.4 | 68.7 | 66.3 |
| Ours | 52.4 | 66.9 | 63.6 | 64.3 |
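For readers unfamiliar with the Prec50/Rec50 columns, the snippet below illustrates how precision and recall at an IoU threshold of 0.5 can be computed with greedy matching; the official benchmark scripts may differ in matching details:

```python
import numpy as np

def prec_rec_at_50(pred_masks, gt_masks, iou_thresh=0.5):
    """Greedy matching: a prediction is a true positive if it overlaps
    some unmatched ground-truth instance with IoU >= iou_thresh."""
    matched, tp = set(), 0
    for pred in pred_masks:                 # each mask: boolean (N,) array
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            union = np.logical_or(pred, gt).sum()
            iou = np.logical_and(pred, gt).sum() / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            tp += 1
            matched.add(best_j)
    precision = tp / max(len(pred_masks), 1)
    recall = tp / max(len(gt_masks), 1)
    return precision, recall
```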
Table 2.
ScanNet dataset quantitative analysis results. The evaluation metrics are mAP and mAP50 on the validation and test splits.
| Method | mAP (Val) | mAP50 (Val) | mAP (Test) | mAP50 (Test) |
|---|---|---|---|---|
| SGPN [4] | - | - | 4.9 | 14.3 |
| GSPN [1] | 19.3 | 37.8 | - | 30.6 |
| 3D-SIS [2] | - | 18.7 | 16.1 | 38.2 |
| 3D-BoNet [3] | - | - | 25.3 | 48.8 |
| MTML [25] | 20.3 | 40.2 | 28.2 | 54.9 |
| 3D-MPA [25] | 35.5 | 59.1 | 35.5 | 61.1 |
| DyCo3D [26] | 35.4 | 57.6 | 39.5 | 64.1 |
| PointGroup [22] | 34.8 | 57.6 | 40.7 | 63.6 |
| MaskGroup [23] | 42.0 | 63.3 | 43.4 | 66.4 |
| OccuSeg [27] | 44.2 | 60.7 | 48.6 | 67.2 |
| SSTNet [24] | 49.4 | 64.3 | 50.6 | 69.8 |
| HAIS [28] | 43.5 | 64.1 | 45.7 | 69.9 |
| SoftGroup [16] | 46.0 | 67.6 | 50.4 | 76.1 |
| Mask3D [9] | 55.2 | 73.7 | 50.6 | 78.0 |
| Ours | 54.5 | 70.5 | 49.8 | 77.3 |
Table 3.
Quantitative analysis results on the three STPLS3D test sets.
| Class | AP | AP50 | AP25 |
|---|---|---|---|
| Building | 0.822 | 0.905 | 0.918 |
| Low vegetation | 0.329 | 0.617 | 0.731 |
| Middle vegetation | 0.361 | 0.539 | 0.676 |
| High vegetation | 0.488 | 0.922 | 0.985 |
| Car | 0.831 | 0.912 | 0.985 |
| Trucks | 0.815 | 0.900 | 0.946 |
| Aircraft | 0.603 | 0.808 | 0.850 |
| Military vehicles | 0.816 | 0.818 | 0.886 |
| Bikes | 0.218 | 0.547 | 0.695 |
| Motorcycle | 0.604 | 0.819 | 0.912 |
| Light pole | 0.602 | 0.816 | 0.902 |
| Street sign | 0.213 | 0.431 | 0.541 |
| Clutter | 0.601 | 0.733 | 0.792 |
| Fence | 0.442 | 0.627 | 0.860 |
| Average (mAP) | 0.551 | 0.717 | 0.815 |
Table 4.
Model sizes and run times.
| Method | Model Size | Run Time (ms) |
|---|---|---|
| HAIS [28] | 30.856 M | 339 |
| SoftGroup [16] | 30.858 M | 345 |
| Mask3D [9] | 39.617 M | 339 |
| Ours | 15.7 M | 253 |
Table 5.
Effect of the initial sampling location.
| Setting | mAP | mAP50 |
|---|---|---|
| Same initial location | 50.1 | 63.4 |
| Different initial location | 50.4 | 63.9 |
Table 6.
Ablation experiments for position encoding.
| Setting | mIoU |
|---|---|
| With PE | 64.0 |
| Without PE | 63.9 |