MPVF: Multi-Modal 3D Object Detection Algorithm with Pointwise and Voxelwise Fusion
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Title: “MPVF: Multi-Modal 3D Object Detection Algorithm with Pointwise and Voxelwise Fusion”
In this paper, the MPVF algorithm is proposed, which integrates pointwise and voxelwise fusion for multi-modal 3D object detection. The PVWF module is introduced to combine local features from the PWF module with global features from the VWF module, enhancing the interaction between features across modalities. The work is good, but there are major points that should be considered, as follows:
Comments to the authors:
1- This work addresses the insufficient interaction between local and global features and the limited fine-grained feature extraction, which are considered general problems. It is recommended to briefly specify the particular problem statements that are addressed and to state them clearly in the Abstract to show the main contribution of this work.
2- It is recommended to define abbreviations at their first appearance, for example, MPVF, which is the name of the proposed algorithm.
3- In the introduction section, an overview of previous methods was presented, with an explanation of the gaps in each of them; however, it is recommended to cite a reliable source for each weak point to substantiate the stated claims.
4- The recognition accuracy for pedestrians is considered low compared to other existing works, even though the method exhibited slightly longer running times. How can this point be explained?
5- The related work section is good, but it is advisable to focus on the most relevant work, Subsection 2.3, to provide a clear view of the proposed work and its actual contribution in the context of other recent related research. In addition, it is very important to discuss these works and mention the gaps in each of them that motivated the proposed work, rather than merely listing them.
6- Please cite any specific figures, datasets, equations, and measurements with their reliable sources unless they belong to the authors.
7- It is mentioned that the size of the RGB image is expanded to 600×1988×3 through standard data augmentation techniques. Please explain this point more clearly.
8- It is important to clearly define how the validity and reliability of the proposed work have been ensured.
9- The KITTI dataset is considered the most well-known benchmark for evaluating the performance of computer vision algorithms. Please give some details of it in terms of its use in this work, in addition to the information given in Section 4.1.
10- On what basis were the IoU thresholds set at 0.7 for cars and 0.5 for both pedestrians and cyclists? Additionally, why isn't 0.5 used for all object classes?
11- Is it necessary to compare the proposed work with other related works based on distance criteria, providing more explanations? This is an important point to be taken into consideration in such works.
Besides, it is required to evaluate the recognition performance of the proposed work in different road environments to highlight its robustness, since only a visual comparison between MPVF and GraphAlign++ has been performed in the Qualitative Analysis section.
12- The captions of some figures, such as Figures 4 and 5, should be clearer.
13- It is better to make the conclusion section a single paragraph; in addition, it is better to restate the problem statement, summarize the key findings, and demonstrate the significance of the proposed work.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Please see the attachment.
Comments for author File: Comments.pdf
Author Response
Comments 1: The introduction is well structured and the principal contributions are well exposed. Also, the references are up to date. However, in this section I think it would be interesting to say something about autonomous cars, which are one of the driving forces behind the development of multi-modal 3D object detection technology.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have incorporated a discussion on autonomous driving in the introduction, emphasizing its role as a key driving force behind the advancement of multi-modal 3D object detection technology (Pages 1–2, Lines 40–46).
Comments 2: In Subsection 4.1, I suggest writing a sentence about why the KITTI dataset was selected over other commonly used datasets in this field, such as Waymo Open or nuScenes.
Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have added a comparative analysis of the KITTI dataset with other commonly used datasets, such as Waymo Open and nuScenes, to justify our selection of KITTI for this study (Page 11, Lines 433–437).
Reviewer 3 Report
Comments and Suggestions for Authors
The authors proposed the MPVF algorithm to integrate pointwise and voxelwise fusion for multi-modal 3D object detection. The PVWF module is introduced to combine local features with global features. The IR module uses a group convolution strategy to inject high-level semantic features into the PWF and VWF modules to improve extraction efficiency. The FP module captures fine-grained features at various resolutions to further enhance detection precision. Detection accuracy for cars, pedestrians, and cyclists reaches 85.12%, 48.61%, and 70.12%, respectively.
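To make the group convolution strategy mentioned above concrete, the following is a minimal sketch of how high-level semantic features could be injected into a fusion branch via a grouped convolution. It is written in PyTorch purely for illustration; the class name, channel sizes, and the channel-shuffle step are assumptions for this sketch, not the authors' actual IR implementation.

```python
import torch
import torch.nn as nn

class GroupedSemanticInjection(nn.Module):
    """Illustrative sketch (hypothetical, not the authors' IR module):
    inject high-level semantic features into fusion features using a
    grouped 3x3 convolution, which costs fewer parameters and FLOPs
    than a dense convolution over the concatenated channels."""

    def __init__(self, channels: int = 256, groups: int = 8):
        super().__init__()
        self.inject = nn.Conv2d(2 * channels, channels, kernel_size=3,
                                padding=1, groups=groups)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, fused_feat: torch.Tensor,
                semantic_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the two sources along the channel dimension.
        x = torch.cat([fused_feat, semantic_feat], dim=1)  # (B, 2C, H, W)
        # Interleave channels (a simple channel shuffle) so that every
        # convolution group sees both fused and semantic channels;
        # without this, each group would see only one of the sources.
        b, c2, h, w = x.shape
        x = x.view(b, 2, c2 // 2, h, w).transpose(1, 2).reshape(b, c2, h, w)
        return self.act(self.norm(self.inject(x)))

# Toy usage on BEV-like feature maps (shapes are assumptions).
layer = GroupedSemanticInjection()
fused = torch.randn(2, 256, 200, 176)     # e.g., point/voxel fusion features
semantic = torch.randn(2, 256, 200, 176)  # e.g., high-level image semantics
out = layer(fused, semantic)              # shape: (2, 256, 200, 176)
```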
Overall, the submission is well written and innovative. However, there are some suggestions for improvement.
- Some abbreviations, such as MPVF, PVWF, IR, and FP, need to be defined before use in the abstract and text.
- The proposed IRFP module looks like a YOLO deep learning module; the authors should describe the differences between them.
- Some state-of-the-art methods based on YOLO, such as YOLO3D, are missing and should be referenced.
- The proposed method achieved 3D AP of 85.12%, 48.61%, and 70.12% for medium-difficulty detection of cars, pedestrians, and cyclists, respectively. However, 48.61% for real-world pedestrian detection seems too low.
Author Response
Comments 1: Some abbreviations, such as MPVF, PVWF, IR, and FP, need to be defined before use in the abstract and text.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have clearly and consistently defined each abbreviation upon its first occurrence. Specifically, in both the Abstract and the main text, we have explicitly defined the following abbreviations: MPVF (Page 1, Line 17; Page 2, Line 91), PVWF (Page 1, Line 19; Page 3, Line 94), PWF (Page 1, Line 20; Page 3, Line 95), VWF (Page 1, Line 21; Page 3, Line 96), IRFP (Page 1, Line 24; Page 3, Line 101), IR (Page 1, Line 24; Page 3, Line 101), FP (Page 1, Line 25; Page 3, Line 102), and mAP (Page 1, Line 29; Page 11, Line 448).
Comments 2: The proposed IRFP module looks like a YOLO deep learning module; the authors should describe the differences between them.
Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have added relevant descriptions in the text to elaborate on the differences between the IRFP module and the YOLO deep learning module (Pages 6–7, Lines 260–265).
Comments 3: Some state-of-the-art methods based on YOLO, such as YOLO3D, are missing and should be referenced.
Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we have supplemented the relevant references and incorporated YOLO3D into the single-stage LiDAR-based 3D object detection section to enhance the completeness of the literature and theoretical support (Page 4, Lines 154–156).
Comments 4: The proposed method achieved 3D AP of 85.12%, 48.61%, and 70.12% for medium-difficulty detection of cars, pedestrians, and cyclists, respectively. However, 48.61% for real-world pedestrian detection seems too low.
Response 4: Thank you for pointing this out. In response to this issue, our explanation is as follows: the proposed MPVF algorithm achieves a 3D detection accuracy (3D AP) of 48.61% for pedestrians under medium difficulty. This result does not indicate low detection accuracy; rather, it reflects relatively strong performance. It is important to note that in pedestrian detection tasks, 3D detection accuracy is generally lower due to the complexity of pedestrian movements and pose variations, a phenomenon that has been widely validated in existing research. Therefore, although the inference time is slightly longer, our method still outperforms many state-of-the-art techniques, especially in complex scenarios. We have provided additional explanations regarding this phenomenon (Page 13, Lines 521–526).
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
All comments have been addressed correctly, and no further comments are required.