Printed Edition

A printed edition of this Special Issue is available at MDPI Books....

Peer-Review Record

The Stability Optimization of Indoor Visible 3D Positioning Algorithms Based on Single-Light Imaging Using Attention Mechanism Convolutional Neural Networks

Photonics 2024, 11(9), 794; https://doi.org/10.3390/photonics11090794

by Wenjie Ji¹, Lianxin Hu¹, Xun Zhang¹,²,*, Jiongnan Lou¹, Hongda Chen¹ and Zefeng Wang¹

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 17 July 2024 / Revised: 17 August 2024 / Accepted: 21 August 2024 / Published: 26 August 2024

(This article belongs to the Special Issue Machine Learning Applied to Optical Communication Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In general the paper is well structured and the proposed methodology appears sound. However, the following requires improvement or further elaboration:

1. The resolution of Figures 1, 2 and 6 needs to be improved. Some of the text is unclear, which makes it difficult to read and understand the presented framework.

2. The authors propose to use MHA-Resnet50 based on its generalization capability and efficiency. However, further explanation is recommended to strengthen or justify its use over other well established models.

3. In section 4, the authors compare the proposed approach to the 'original algorithm', however it is not very clearly stated what this is. Further explanation should be explicitly provided.

4. The proposed method should be compared to traditional well established methodologies for positioning (albeit more costly methods) to evaluate its comparative accuracy. This would provide a better understanding of system cost to accuracy ratio and whether this approach is justified.

5. Please discuss the potential limitations of the study.

Comments on the Quality of English Language

In general the paper is logical and reads well, however it is recommended that the paper should be checked for spelling errors and grammar. Refer to Line 346, "It unaffected.....". Line 483. "Highâ€•resolution" for some errors that were detected.

Author Response

Dear Editor and Reviewer,

We would like to begin with our sincere appreciation for all the valuable comments, insightful suggestions and thoughtful corrections offered by the reviewers and editor to our manuscript (photonics-3136334). The comments and suggestions definitely helped us to improve the quality of the manuscript. We have revised the manuscript in which major changes are highlighted in green. These changes are summarized below following a point-by-point response to reviewer's comments.

1. The resolution of Figures 1, 2 and 6 needs to be improved. Some of the text is unclear, which makes it difficult to read and understand the presented framework.

Response: Thank you for your suggestion. We have made revisions to three of the images in the manuscript.

Action:

A. We have adjusted the font size in Figures 1 and 2 to make the text clearer.

B. Due to the complexity of Figure 6 (Figure 7 in the latest manuscript), we have revised the overall layout and font size of the figure to better convey the structure of the entire model.

C. Finally, we have formatted the figures according to the journal's resolution requirements. Please see lines 81, 95 and 194.

2. The authors propose to use MHA-Resnet50 based on its generalization capability and efficiency. However, further explanation is recommended to strengthen or justify its use over other well established models.

Response: Thanks for your suggestion. To further explain and justify the strengths of MHA-ResNet50, we have added Part 4.2.4 to the manuscript. This part compares the performance of multiple models through feature map and heatmap analysis. Please refer to line 385 on page 13. Below, we explain the advantages of MHA-ResNet50 based on experiments.

Firstly, the ResNet models can extract higher-dimensional features from images. This allows some edge detail features to be extracted, as evidenced by the comparisons between DenseNet, MobileNet, ResNet50, and ResNet101.

Secondly, MHA uses multiple heads for feature extraction so that a wider variety of features can be extracted and dependencies between these features can be established. MHA also learns the importance of different features and assigns weights accordingly. This allows extraction of light intensity features that are not obvious in the image. This is evident from the comparison of ResNet50, ResNet101, MHA-ResNet50 and MHA-ResNet101.

Finally, MHA-ResNet50 requires fewer computational resources and less memory than MHA-ResNet101. We obtained FLOPs and Params for both models and added them to Table 4 on page 12.
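For readers unfamiliar with the two cost measures named above, the following is a minimal sketch of how Params and FLOPs are commonly counted for a single convolutional layer. It is not the authors' measurement procedure, which the response does not describe. The function name `conv2d_cost` and the example layer (a 7×7 convolution from 3 to 64 channels with stride 2 on a 224×224 input, as in the standard ResNet stem) are assumptions chosen for illustration, and counting one multiply-accumulate as two FLOPs is a convention that varies between tools.

```python
def conv2d_cost(in_ch, out_ch, kernel, in_size, stride=1, padding=0, bias=False):
    """Return (params, macs, flops) for one square 2D convolution layer."""
    # Spatial size of the output feature map (square input assumed).
    out_size = (in_size + 2 * padding - kernel) // stride + 1
    # Learnable weights: one kernel x kernel filter per (input, output) channel pair.
    weights = kernel * kernel * in_ch * out_ch
    params = weights + (out_ch if bias else 0)
    # Each output position applies every weight once (one multiply-accumulate each).
    macs = weights * out_size * out_size
    # Assumed convention: 1 MAC = 2 FLOPs (one multiply plus one add).
    flops = 2 * macs
    return params, macs, flops


# Hypothetical example layer: the usual ResNet stem convolution.
params, macs, flops = conv2d_cost(3, 64, kernel=7, in_size=224, stride=2, padding=3)
print(params, macs, flops)  # 9408 118013952 236027904
```

Summing such per-layer counts over a network gives the model-level totals; a deeper backbone such as ResNet101 accumulates more layers and therefore more of both.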

Action: To justify the strengths of MHA-Resnet50, we have made two additions to the manuscript.

A. We added comparative experiments for validation. Please refer to Part 4.2.4 on page 13.

B. The FLOPs and Params for the model were obtained and are supplemented in Table 4 on page 12.

3. In section 4, the authors compare the proposed approach to the 'original algorithm', however it is not very clearly stated what this is. Further explanation should be explicitly provided.

Response: Thank you for your suggestion. The original algorithm in section 4 is described in Part 2.2, "Foundation," of the manuscript. Our description may have been unclear, making the original algorithm difficult to understand. Therefore, we have made three revisions to the manuscript as follows.

Action:

A. At line 108 on page 3, we added "The original positioning..."

B. At line 129 on page 5, we added "From the above..."

C. At line 115 on page 4, we revised the figure caption to "Original positioning principle."

4. The proposed method should be compared to traditional well established methodologies for positioning (albeit more costly methods) to evaluate its comparative accuracy. This would provide a better understanding of system cost to accuracy ratio and whether this approach is justified.

Response and Action: Thank you for your suggestion. We have added Part 4.4 to section 4, which compares our method with traditional positioning methods and discusses the differences. The comparison covers the number of LEDs needed, system cost, angle, resolution, receiver type, and accuracy. Please refer to line 504 on page 17.

5. Please discuss the potential limitations of the study.

Response: Thank you for your suggestion. In our future work, we will first add more training data with light intensity interference caused by varying lighting conditions and changes in LED characteristics to enhance the model's robustness. Additionally, we will attempt to use lower-resolution signal frames for positioning to reduce the algorithm's computational complexity and cost. Finally, we will test larger camera pose variations to assess the method's limits.

Action: Based on your suggestion, we added a description of these potential limitations in the conclusion. As shown in line 534, "Our proposed positioning...".

6. Comments on the Quality of English Language: In general the paper is logical and reads well, however it is recommended that the paper should be checked for spelling errors and grammar. Refer to Line 346, "It unaffected.....". Line 483. "Highâ€•resolution" for some errors that were detected.

Response: Thank you for pointing out the grammatical errors. We have corrected them and re-checked the entire manuscript for similar problems. The corrections are marked in blue. Please see lines 439 and 594.

We have tried our best to improve the manuscript, and we sincerely appreciate the hard work of the editors and reviewers. We hope that the corrections will meet with approval. Once again, thank you very much for your comments and suggestions.

Yours sincerely

Xun Zhang (corresponding author)

On behalf of all the co-authors

Author Response File: Author Response.doc

Reviewer 2 Report

Comments and Suggestions for Authors

The authors proposed a CNN based 3D positioning algorithm for indoor VLP. Using the camera for self-localisation is an interesting solution for VLP, but the authors do not make the motivation for the introduction of CNN networks clear. In addition, how the attention mechanisms improve positioning accuracy is also unclear. Following are my concerns:

1) The keyword VLP was mistakenly written as VLC.

2) How does MHA-Resnet50 avoid model overfitting and make training more efficient by introducing multi-head attention mechanism? Please give more explanation.

3) According to Fig. 3 and Fig. 5, we can calculate the localisation of the receiver by light propagation calculations, so the motivation for introducing MHA-Resnet50 is unclear.

4) What are the specific outputs of the MHA-Resnet50? Is it a direct output of the 3D coordinates or the LED shape box?

5) The reader may want to know how the MHA attention mechanism can enhance the key edge features for these black-and-white stripe signals.

6) The authors proposed a 3D positioning method but used a 2D RMSE metric in the experimental part.

Comments on the Quality of English Language

The language is readable

Author Response

Dear Editor and Reviewer,

We would like to begin with our sincere appreciation for all the valuable comments, insightful suggestions and thoughtful corrections offered by the reviewers and editor to our manuscript (photonics-3136334). The comments and suggestions definitely helped us to improve the quality of the manuscript. We have revised the manuscript in which major changes are highlighted in yellow. These changes are summarized below following a point-by-point response to reviewer's comments.

1. The keyword VLP was mistakenly written as VLC.

Response: Thanks for your suggestion. The keyword VLC was not incorrect; however, due to our oversight, we had omitted the keyword VLP.

Action: In the latest manuscript, we have added the keyword VLP. Please refer to line 25 on the first page.

2. How does MHA-Resnet50 avoid model overfitting and make training more efficient by introducing multi-head attention mechanism? Please give more explanation.

Response: Thanks for your suggestion. To prevent model overfitting, we incorporated the multi-head attention mechanism to increase the diversity and quantity of features, as well as to strengthen the relationships between features. Additionally, the multi-head attention mechanism allows multiple heads to extract features in parallel and assign weights to these features based on their importance. This approach enhances both the efficiency of feature extraction and the overall training efficiency. A more detailed explanation is provided below.

Avoiding Overfitting:

A. MHA can compute attention distributions in parallel across different subspaces, allowing each attention head to focus on different features. This enables the model to extract more diverse light intensity and deformation features, rather than overly relying on any specific feature, thereby reducing the risk of overfitting.

B. MHA-ResNet50 can focus attention on global light intensity features, which, compared to ResNet50, helps prevent the model from focusing too much on local imaging deformation features.

Improving Training Efficiency:

A. Compared to ResNet50, adding MHA allows the image to be processed in parallel by multiple heads, enabling quicker capture of global dependencies among different features, which can enhance feature extraction efficiency.

B. MHA can automatically assign weights to different features, enabling the model to automatically focus on the most useful features, thereby speeding up the training process.

Action: Based on your suggestions, we have added a targeted paragraph of explanation and a set of experiments to the manuscript.

A. We have added explanations regarding how MHA reduces overfitting risk and improves training efficiency in Part 3.2.1. Please refer to the "Motivation" at line 232 on page 8.

B. Based on the experimental results in Part 4.2.3, "Training Results and Comparison," we have added a comparative experiment on feature heatmaps in Part 4.2.4. These two parts fully support the explanations provided in Part 3.2.1. Please refer to the "Comparison and analysis of feature" at line 385 on page 13.

3. According to Fig. 3 and Fig. 5, we can calculate the localisation of the receiver by light propagation calculations, so the motivation for introducing MHA-Resnet50 is unclear.

Response: Thanks for your suggestion. Figures 3 and 5 indeed describe the traditional optical imaging positioning algorithm, which is based on the principle of monocular camera imaging. However, traditional algorithms struggle to extract the light intensity of the stripes in the signal frame and the deformation of the LED imaging. We therefore use MHA-ResNet50 to extract these features and predict the position. This approach not only reduces the errors caused by camera pose variations but also avoids the use of IMU sensors.

Action: To give the reader a clear understanding of our motivation for using MHA-ResNet50, we have added Part 2.4 to the manuscript. Please refer to the "Solution concept" at line 169 on page 6.

4. What are the specific outputs of the MHA-Resnet50? Is it a direct output of the 3D coordinates or the LED shape box?

Response: Thanks for your suggestion. The direct output of MHA-ResNet50 is the 3D coordinates of the camera relative to the LED. The input images are processed by the backbone network ResNet50 and the multi-head attention mechanism to extract light intensity features and deformation features, which are then input into the XGBoost regressor to predict the 3D coordinates.

Action: In order to make the reader understand the output of MHA-Resnet50 more intuitively, we have made two revisions.

A. We revised the description of the model's input, feature extraction, and final coordinate output. Please refer to line 196 on page 7, "The input to...", and line 209 on page 7, "Introducing the multiple...".

B. In the output part of the model representation in Figure 7, we have added the final coordinate output identifier. Please see Figure 7 on page 7.

5. The reader may want to know how the MHA attention mechanism can enhance the key edge features for these black-and-white stripe signals.

Response: Thanks for your suggestion. Before giving the explanation, we would like to note that the black-and-white stripe edge features enhanced by MHA reflect the light intensity attenuation characteristics. The principle of this enhancement is explained below.

Firstly, the MHA uses multiple heads to extract features from the image, allowing it to capture more light intensity attenuation features and establish dependencies between these features.

Secondly, the multiple heads can learn the importance of different light intensity attenuation features and assign weights accordingly, which enhances the impact of some less noticeable light intensity attenuation features.

Overall, the essence of MHA's enhancement of these black-and-white stripe features is to help the model extract more light intensity features and increase the weight of some of those features.
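The mechanism described in the two points above (several heads attending in parallel, each assigning its own weights) can be illustrated with a generic scaled dot-product multi-head self-attention written in NumPy. This is only a sketch under stated assumptions, not the authors' MHA-ResNet50: the feature-map size (4×4×8), the number of heads (2), the random projection matrices, and the function name `multi_head_self_attention` are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-head self-attention over an (n, d) token matrix."""
    n, d = tokens.shape
    d_head = d // num_heads

    def split(x):
        # (n, d) -> (heads, n, d_head): each head works in its own subspace.
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    # Per-head similarity between every pair of spatial positions.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    heads = weights @ v                 # weighted mix of value vectors
    merged = heads.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, weights


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 4, 8))   # hypothetical H x W x C backbone output
tokens = feature_map.reshape(-1, 8)        # one token per spatial position
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(4))

out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads=2)
print(out.shape, attn.shape)
```

In a trained model the projection matrices are learned, so the per-head weights in `attn` are what allow faint features, such as weak intensity gradients, to receive more emphasis; with random matrices, as here, the weights carry no such meaning.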

Action: We have made additions to the manuscript in three aspects: motivation, theoretical explanation, and experimental validation.

A. We have added Part 2.4, describing the role of light intensity attenuation features in positioning and the motivation for using CNN. Please refer to the "Solution concept" at line 169 on page 6.

B. We have added Part 3.2.1, explaining from a theoretical perspective why MHA is integrated into ResNet50 and how MHA enhances light intensity attenuation features. Please refer to the "Motivation" at line 232 on page 8.

C. We have added Part 4.2.4, where we demonstrate the validity of the descriptions in Part 2.4 and Part 3.2.1 by comparing feature heatmaps before and after using MHA. Please refer to the "Comparison and analysis of feature" at line 385 on page 12.

6. The authors proposed a 3D positioning method but used a 2D RMSE metric in the experimental part.

Response: Thank you for your remark. What we propose is indeed a 3D positioning method. In Section 4, during the model training comparison and positioning experiments, we used the three-dimensional RMSE to calculate both the loss and the error. However, due to our oversight, we did not include the z-coordinate in equation 16. We sincerely apologize for the confusion this oversight may have caused.

Action: We have revised the RMSE of equation 16 in the manuscript to three dimensions. Please refer to page 11, line 345.
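To show why the missing z-coordinate matters, the sketch below contrasts a 2D and a 3D RMSE on the same predictions. It assumes the common definition in which the squared Euclidean error per sample is averaged and then square-rooted; the exact form of equation 16 is not reproduced in this record, and the function names and sample coordinates are hypothetical.

```python
import math


def rmse(predicted, actual, dims):
    """Root-mean-square Euclidean error over the first `dims` coordinates."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        # Squared Euclidean distance restricted to the chosen coordinates.
        total += sum((p[i] - a[i]) ** 2 for i in range(dims))
    return math.sqrt(total / len(predicted))


# Hypothetical samples: x and y are predicted exactly, z is off by 2 units.
predicted = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
actual = [(0.0, 0.0, 2.0), (1.0, 1.0, 3.0)]

rmse_2d = rmse(predicted, actual, dims=2)
rmse_3d = rmse(predicted, actual, dims=3)
print(rmse_2d, rmse_3d)  # 0.0 2.0
```

A 2D metric reports zero error here even though every height estimate is wrong, which is why a 3D positioning method needs the z term in its error measure.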

Yours sincerely

Xun Zhang (corresponding author)

On behalf of all the co-authors

Author Response File: Author Response.doc

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

All my concerns have been addressed.

Comments on the Quality of English Language

The language is readable
