Article
Peer-Review Record

3D Point Cloud Instance Segmentation Considering Global Shape Contour Constraints

Remote Sens. 2023, 15(20), 4939; https://doi.org/10.3390/rs15204939
by Jiabin Xv and Fei Deng *
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 27 August 2023 / Revised: 10 October 2023 / Accepted: 11 October 2023 / Published: 12 October 2023

Round 1

Reviewer 1 Report

This paper proposes a 3D point cloud instance segmentation network that considers the global shape contour. A Transformer module is designed that incorporates the shape contour information into the Transformer structure as the K value. Experiments are conducted on three datasets -- S3DIS, ScanNet, and STPLS3D -- which show that the proposed network can efficiently capture the shape contour information of scene instances.

This is a pretty good paper. However, there are a few problems.

1.      Experiments are conducted on three typical datasets, and network performance metrics such as AP, Prec, and Rec are given. The paper would be stronger if the runtime performance were also reported.

2.      The English of the paper is not good enough for the journal. The authors should have a professional editor, ideally an experienced research-paper author, go through the paper and greatly improve the formatting and grammar. Some examples are listed below:

1)      Line 49, “the instance segmentation task of 3D point cloud Mask3D[9]”;

2)      Line 63, “The experimental results show that indoor S3DIS data In”;

3)      Line 71, “and AP50 reached 71.7%”;

4)      Line 77 to Line 83, “1,”, “2,”, “3,”;

5)      Line 184, “The Whole Framework” maybe “The Whole Network Framework”?

6)      Line 188, “is other color feature, etc., and is the number of point clouds”;

7)      Line 198, “feature Extract”;

8)      Line 207 and Line 208, two adjacent subordinate clauses are both “which is”;

9)      ……

Some of the figures are too simple and may confuse readers, such as Fig. 1, Fig. 3, Fig. 5, and Fig. 6.


Author Response

This paper proposes a 3D point cloud instance segmentation network that considers the global shape contour. A Transformer module is designed that incorporates the shape contour information into the Transformer structure as the K value. Experiments are conducted on three datasets -- S3DIS, ScanNet, and STPLS3D -- which show that the proposed network can efficiently capture the shape contour information of scene instances.

 

This is a pretty good paper. However, there are a few problems.

 

  1. Experiments are conducted on three typical datasets, and network performance metrics such as AP, Prec, and Rec are given. The paper would be stronger if the runtime performance were also reported.

Response: Thanks for your advice. We have reported the runtime of our proposed network on the ScanNet dataset, along with comparisons to several other networks, in Table 5 in Subsection 4.2.3.

 

  2. The English of the paper is not good enough for the journal. The authors should have a professional editor, ideally an experienced research-paper author, go through the paper and greatly improve the formatting and grammar. Some examples are listed below:

 

1)      Line 49, “the instance segmentation task of 3D point cloud Mask3D[9]”;

2)      Line 63, “The experimental results show that indoor S3DIS data In”;

3)      Line 71, “and AP50 reached 71.7%”;

4)      Line 77 to Line 83, “1,”, “2,”, “3,”;

5)      Line 184, “The Whole Framework” maybe “The Whole Network Framework”?

6)      Line 188, “is other color feature, etc., and is the number of point clouds”;

7)      Line 198, “feature Extract”;

8)      Line 207 and Line 208, two adjacent subordinate clauses are both “which is”;

9)      ……

 

Response: We apologize for our carelessness. The typos have been corrected in the resubmitted manuscript. Thank you for pointing them out.

We have tried our best to improve the manuscript and have made changes throughout using the MDPI English editing service. These changes do not affect the content or framework of the paper. The changes are not listed here but are marked in red in the revised paper. We sincerely appreciate the Editors' and Reviewers' work and hope that the corrections meet with approval.

 

  3. Some of the figures are too simple and may confuse readers, such as Fig. 1, Fig. 3, Fig. 5, and Fig. 6.

Response: Thank you for your comments. These figures are mainly used to describe the constituent modules of the whole network and the specific architecture of each module. We designed them with two purposes: first, to clearly show which modules the network contains and their respective functions; second, to show concretely how the modules are constructed and computed, which helps readers understand the network.

Author Response File: Author Response.docx

Reviewer 2 Report

This paper proposes an instance segmentation network that follows a global-local design idea, which tries to solve the problem that similar instances cannot be segmented properly, and can directly predict the instance mask in an end-to-end manner. Overall this is a well-written paper, with a clear structure and reasonable results. There are some concerns that should be addressed before publication.

 

(1)   On line 187: “where 3 is the coordinate dimension of the point cloud, is other color features, etc., and is the number of point clouds.” seems to be an incomplete sentence.

(2)   On line 198, “The Point Embedding layer in our design is a two-layer shared multi-layer perceptron (Multi-Layer Perception, MLP), followed by a Relu activation function.” What is the point embedding here and what does it do? What are its input and output? From the context it seems that the points of the cloud are represented in a continuous space by this point embedding layer; if so, what is the input of this layer, the coordinates of the points? And why is the “first layer of multilayer perceptron MLP 32, and the second layer 64”? Are “32” and “64” arbitrary, or something else?

(3)   On line 226, “through 1x1 convolution to obtain the fused point feature”: please explain why this pointwise convolution is used. Is it for channel-wise feature learning?

(4)   On line 237, “In the target detection task, the idea from the whole to the part can effectively alleviate the problem that spatially distributed similar instances cannot be distinguished”: please provide a reference for this statement.

(5)   On line 249, please specify what K and N mean in this context.

(6)   Explain the notation in detail; for example, on line 277, M is not explained. And what is the FFN on line 287? Is it a feed-forward neural network? It needs to be clearly stated in the text. This applies to the whole article.

(7)   On line 292, “The research combines 3D. The instance segmentation task of the point cloud is regarded as a collection”, this sentence seems incomplete.

(8)   Please check all the typos (and missing punctuation) in this manuscript. On line 289, “as shown in Figure 4.6.”: there is no Figure 4.6 in this manuscript.

(9)   Why are the evaluation metrics different for the different datasets? For the S3DIS dataset, AP, AP50, Prec50, and Rec50 were used; for ScanNet, mAP50 and mAP; for STPLS3D, AP, AP50, and AP25. Why not use the same evaluation metrics? Please explain in detail.

(10)  On line 461, “to verify the effect of introducing Transformer to capture global shape contour features, we used the Mask3D to conduct comparative experiments on the S3DIS and Scan-Net dataset.” Can it be understood that the only difference between Mask3D and the model proposed in this paper is whether GSA is introduced, so that it can be concluded that GSA works? If not, the comparison may not support the conclusion that the Transformer-based GSA made the difference, because the two models have other differences besides GSA. An ablation analysis should be used if needed.

 

(11)  Please add a take-home message for Section 4.2.4.

Minor editing of the English language may be needed.

Author Response

This paper proposes an instance segmentation network that follows a global-local design idea, which tries to solve the problem that similar instances cannot be segmented properly, and can directly predict the instance mask in an end-to-end manner. Overall this is a well-written paper, with a clear structure and reasonable results. There are some concerns that should be addressed before publication.

 

  • On line 187: “where 3 is the coordinate dimension of the point cloud, is other color features, etc., and is the number of point clouds.” seems to be an incomplete sentence.

Response: Sorry for our carelessness. We have corrected the sentence “where 3 is the coordinate dimension of the point cloud, is other color features, etc., and is the number of point clouds” to “is other color features, such as RGB, etc., and is the number of point clouds.”

  • On line 198, “The Point Embedding layer in our design is a two-layer shared multi-layer perceptron (Multi-Layer Perception, MLP), followed by a Relu activation function.” What is the point embedding here and what does it do? What are its input and output? From the context it seems that the points of the cloud are represented in a continuous space by this point embedding layer; if so, what is the input of this layer, the coordinates of the points? And why is the “first layer of multilayer perceptron MLP 32, and the second layer 64”? Are “32” and “64” arbitrary, or something else?

Response: The point embedding is a module commonly used in point cloud neural networks, usually located between the input and the convolution module or Transformer. As you said, its purpose is to project the low-dimensional input space into a high-dimensional continuous space for subsequent convolution or attention operations. The input is the original coordinates plus other features (RGB, normals, intensity, etc.); in this study we use the coordinates plus RGB. The output is a high-dimensional feature space that is used for the subsequent convolution or attention operations. The module is flexible: taking computational efficiency and memory into account, it generally has one or two layers; with one layer the width is typically 64 or 128, and with two layers typically 32/64 or 64/128.
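For illustration, a minimal PyTorch sketch of such a point embedding is given below. It assumes a 6-channel input (xyz + RGB) and the 32/64 channel widths described above, with a ReLU after each layer; this is an illustrative reading of the description, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PointEmbedding(nn.Module):
    """Shared two-layer MLP that lifts per-point inputs (assumed xyz + RGB = 6
    channels) into a higher-dimensional continuous space (32 -> 64 channels)."""
    def __init__(self, in_channels=6, hidden=32, out_channels=64):
        super().__init__()
        # 1x1 convolutions act as a point-wise MLP shared across all N points.
        self.mlp = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, out_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, points):
        # points: (B, N, 6) -> per-point features: (B, N, 64)
        return self.mlp(points.transpose(1, 2)).transpose(1, 2)

# Example: embed a batch of 2 scenes with 4096 points each.
feats = PointEmbedding()(torch.rand(2, 4096, 6))
print(feats.shape)  # torch.Size([2, 4096, 64])
```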

  • On line 226, “through 1x1 convolution to obtain the fused point feature”: please explain why this pointwise convolution is used. Is it for channel-wise feature learning?

Response: The 1x1 convolution plays two roles: one is to fully fuse the concatenated global and convolutional features, and the other is to change the dimensionality so that the features fit the subsequent operations.
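A minimal sketch of this fusion step is shown below (PyTorch; the channel widths are illustrative placeholders, not values from the paper):

```python
import torch
import torch.nn as nn

# Illustrative placeholder shapes: (batch, channels, points).
local_feat  = torch.rand(2, 64, 4096)   # convolutional (local) point features
global_feat = torch.rand(2, 128, 4096)  # global shape-contour features

# Concatenate along the channel axis, then a 1x1 convolution mixes the channels
# per point (fusing the two feature sources) and changes the dimensionality
# for the subsequent operations.
fuse = nn.Conv1d(64 + 128, 96, kernel_size=1)
fused = fuse(torch.cat([local_feat, global_feat], dim=1))  # (2, 96, 4096)
print(fused.shape)
```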

  • On line 237, “In the target detection task, the idea from the whole to the part can effectively alleviate the problem that spatially distributed similar instances cannot be distinguished”: please provide a reference for this statement.

Response: Thank you for the correction. We have added the relevant literature at line 252.

(5)   On line 249, please specify what K and N mean in this context.

Response: We apologize for overlooking this; we have corrected it.

(6)   Explain the notation in detail; for example, on line 277, M is not explained. And what is the FFN on line 287? Is it a feed-forward neural network? It needs to be clearly stated in the text. This applies to the whole article.

Response: Based on your suggestions, we have checked and explained the corresponding symbols and terms.

(7)   On line 292, “The research combines 3D. The instance segmentation task of the point cloud is regarded as a collection”, this sentence seems incomplete.

Response: We have removed the incomplete phrase “the research combines 3D”.

(8)   Please check all the typos (and missing punctuation) in this manuscript. On line 289, “as shown in Figure 4.6.”: there is no Figure 4.6 in this manuscript.

Response: Sorry for the error; we have checked the whole article. We have changed “Figure 4.6” to “Figure 5”.

(9)   Why are the evaluation metrics different for the different datasets? For the S3DIS dataset, AP, AP50, Prec50, and Rec50 were used; for ScanNet, mAP50 and mAP; for STPLS3D, AP, AP50, and AP25. Why not use the same evaluation metrics? Please explain in detail.

Response: We are sorry for the confusion. In fact, for S3DIS and ScanNet we use the same evaluation metrics, the mean average precision (mAP) and mAP50 under different IoU thresholds; we have corrected the wrong labels in Subsection 4.1.1. As for the Prec50 and Rec50 metrics, they are included for consistency, so that our results can be compared with previous work such as MaskGroup, Mask3D, etc.

For the STPLS3D dataset, Table 4 shows the per-class performance as average precision (AP); only the bottom row of Table 4 gives the mean value (mAP).
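For clarity, the relation between these metrics can be sketched as follows (NumPy; the per-class AP values are random placeholders, not results from the paper): AP is computed per class at a fixed IoU threshold, mAP50 averages the per-class AP at IoU 0.5, and the ScanNet-style mAP additionally averages over the IoU thresholds 0.50-0.95.

```python
import numpy as np

# ap[t, c] = average precision of class c at IoU threshold 0.50 + 0.05 * t.
# Random placeholder values, for illustration only.
ap = np.random.rand(10, 5)

mAP50 = ap[0].mean()  # mean over classes at the single IoU threshold 0.50
mAP   = ap.mean()     # mean over classes and over the ten thresholds 0.50-0.95
print(f"mAP50 = {mAP50:.3f}, mAP = {mAP:.3f}")
```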

(10)  On line 461, “to verify the effect of introducing Transformer to capture global shape contour features, we used the Mask3D to conduct comparative experiments on the S3DIS and Scan-Net dataset.” Can it be understood that the only difference between Mask3D and the model proposed in this paper is whether GSA is introduced, so that it can be concluded that GSA works? If not, the comparison may not support the conclusion that the Transformer-based GSA made the difference, because the two models have other differences besides GSA. An ablation analysis should be used if needed.

Response: Thanks for your suggestion. Essentially, the difference between our network and Mask3D is the introduction of the GSA. However, due to computational resource constraints, we replaced Mask3D's convolution-based encoder backbone (4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks, IEEE Conference on Computer Vision and Pattern Recognition, 2019, https://doi.org/10.48550/arXiv.1904.08755) with another lighter and more efficient convolutional network (reference 19, doi:10.3390/ijgi11120591); both share the same U-Net architecture.

(11)  Please add a take-home message for Section 4.2.4.

Response: Thanks for the tip. We have added the following sentences to Subsection 4.2.4: “The GSA module needs to sample the input scene point cloud in each layer in order to obtain global features. To obtain the sampling points more uniformly, we use the farthest point sampling (FPS) method. Since the sampling results of this method are affected by the initial sampling point, different initial sampling points produce different sampling results, which may affect the acquisition of global features. To validate this idea, we conduct a set of comparative experiments between a sampling method that fixes the initial position in each layer of the network and one that does not.”
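A minimal NumPy sketch of farthest point sampling with an optionally fixed initial point is given below, illustrating the two variants compared in Subsection 4.2.4; this is an assumption-laden illustration, not the authors' implementation.

```python
import numpy as np

def farthest_point_sampling(points, k, first_idx=None, seed=None):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    Passing first_idx fixes the initial sampling point (deterministic variant);
    leaving it None starts from a random point (non-fixed variant)."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [first_idx if first_idx is not None else int(rng.integers(n))]
    min_dist = np.full(n, np.inf)
    for _ in range(k - 1):
        # Update each point's distance to the nearest already-chosen point.
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(points - points[chosen[-1]], axis=1))
        chosen.append(int(np.argmax(min_dist)))
    return np.asarray(chosen)

pts = np.random.rand(4096, 3)
idx_fixed  = farthest_point_sampling(pts, 256, first_idx=0)  # fixed initial point
idx_random = farthest_point_sampling(pts, 256)               # random initial point
```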

Author Response File: Author Response.docx

Reviewer 3 Report

Table 2 (results for Area 5 of the S3DIS dataset) misses the comparison to Mask3D, giving the impression that the proposed method is the current state of the art.

Line 414 has some typos.

Author Response

  1. Table 2 (results for Area 5 of the S3DIS dataset) misses the comparison to Mask3D, giving the impression that the proposed method is the current state of the art.

Response: We are very sorry for causing such confusion. We believe you are referring to Table 1 (Quantitative analysis results of Area 5 instance segmentation), and we have added the Mask3D results to Table 1.

  2. Line 414 has some typos.

Response: Thank you; we have corrected it. It contains 1202 training scenes and 312 validation scenes.

Round 2

Reviewer 1 Report

All my concerns have been addressed. I think this paper is ready to publish now.

Author Response

All my concerns have been addressed. I think this paper is ready to publish now.

Response: Thanks!

Reviewer 2 Report

The authors revised the paper according to the comments from the previous round. Some minor editing may still be needed (line 261, what is K □ N?).

     

Author Response

The authors revised the paper according to the comments from the previous round. Some minor editing may still be needed (line 261, what is K □ N?).

Response: Thanks for the suggestion. The K and N are explained on line 260. Following your suggestion, we have gone through the whole article again.
