Article
Peer-Review Record

Learning Motion Constraint-Based Spatio-Temporal Networks for Infrared Dim Target Detections

Appl. Sci. 2022, 12(22), 11519; https://doi.org/10.3390/app122211519
by Jie Li 1,2, Pengxi Liu 1, Xiayang Huang 1, Wennan Cui 1,* and Tao Zhang 1,2,3,*
Submission received: 21 September 2022 / Revised: 8 November 2022 / Accepted: 11 November 2022 / Published: 13 November 2022

Round 1

Reviewer 1 Report

1. The model should indicate its layers' specific properties and values, like kernel size, filters, strides, padding, and others, to help indicate their validity and reproducibility. The authors should also declare why and how they decided on the specific values and not simply discuss these critical values.

2. Additional information about the traditional methods, and clarity on why the authors proposed this strategy, is needed. The authors should provide a broader analysis to further strengthen the study's intuition.
3. The given annotations in Figure 6 should have an adequate explanation. The given figure contains information that does not even have enough details in the paper.

4. The construction of the model shows an alternating inclusion of batch normalization layers. The authors should provide adequate information about their intuition of putting several alternating layers. The authors should also indicate their effects on the previous and following layers. Can the authors provide experimentation without them?

5. Explanation about Figure 4 should use the presented variables and annotations in the figure.

6. In lines 172-174, can the authors provide additional details about these lines and references?  

7. The authors should mention the dimension of their Conv-LSTM consistently in the paper.

8. Equations 1-6 should have more precise explanations and better introductions of each variable used. The equations should also have added references.

9. The authors should provide the calculation of their shape and their compatibility. The structure shows a composition of the Conv-LSTM2D and Conv3D to the UpSampling3D. Issues can arise during reproduction without information about the calculation of the 2D to 3D shapes and their compatibility.

10. The authors should provide references for the models used in Table 4.

11. The results presented in Figs. 11-16 should have proper annotations on the image itself. Relying on colored boxes alone can limit its meaning if printed on a grayscale medium.

12. The authors need to identify and highlight their work's drawbacks or limitations.

13. The STM, DFM, and SATM do not seem to reflect or have enough information on how they get integrated into the overall proposed model or structure.

14. The authors need to provide enough mathematical proof to justify the significance of the Kalman filter in their proposed model. The authors must also supply more references regarding their intuition in section 2.2.1.

15. Figure 10 has (a) and (b) but does not have a proper discussion about each. Why did the authors put these illustrations in the same figure number? It does not seem to provide a comparison, only confusion.

16. The parameter settings in Table 3 need additional details.

17. Please improve the abstract by adding more info regarding how they achieved their highest score of 95.87%. The abstract mainly indicates that it is because the SNR lands at 1-3.  

18. The authors should add more references that will strengthen the novelty and significance of their work.

19. Conclusions present more of a recap or overview than an actual conclusion about the findings. The authors need to deepen and substantially highlight the most significant findings and drawbacks during experimentation and production.

20. The authors should consider open-sourcing their codes for better replication and validity of their study.

21. Most of the figures require re-work. The figures should have higher DPIs and better text quality. The authors may omit the text from the figures during rendering and add it afterward using their preferred editing software.

22. The overall paper still needs proofreading and further checking before consideration, not only in grammar but also in technicalities.


Author Response

Response to Reviewer 1 Comments

Dear reviewer:

Manuscript ID: applsci-1954749

Title: Learning Motion Constraint Based Spatio-Temporal Network for Infrared Dim Target Detection

Authors: Jie Li, Pengxi Liu, Xiayang Huang, Wennan Cui *, Tao Zhang *

Thank you for your valuable suggestions. The comments are very helpful, and we have read through them carefully and made corrections point by point.

Accordingly, we have uploaded a copy of the original manuscript with all the changes highlighted by using the “Track Changes” function. Appended to this letter is our point-by-point response. The comments are reproduced and our responses are given directly afterward in a different color (red).

Point 1: The model should indicate its layers' specific properties and values, like kernel size, filters, strides, padding, and others, to help indicate their validity and reproducibility. The authors should also declare why and how they decided on the specific values and not simply discuss these critical values.

Response 1: Thank you for underlining this deficiency. Following the suggestion, we have added a more detailed interpretation of the network design, as follows: “The specific layer design is shown in Figure 6. The input layer has four parameters: the input sequence length t, the image height and width, and the number of channels. Given that infrared images are normally grayscale, we used a single channel as the network input. These are followed by three Conv-LSTM layers, whose convolutional kernel sizes are [t, 3, 3, 64], [t, 3, 3, 128] and [t, 3, 3, 64], respectively. The choice of dimensions and channels follows common design practice and empirical results. To preserve feature maps in the temporal dimension, 3D convolutional kernels are used for the subsequent max-pooling operation; the parameter meanings are kept consistent with the preceding layer. Layers at the same scale are then concatenated directly after the upsampling operation. The network outputs feature maps at three scales: 8, 16, and 32. The choice of scales mainly takes into account the size of the target. Before computing the loss function, we integrate the multi-scale feature maps for ease of calculation, using a Python list.” (Lines 188-200)

In addition to the textual explanation, we also added Figure 6 to illustrate the layer details in the revised manuscript.
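For readers reconstructing the architecture, the shape bookkeeping described in this response can be sketched in plain Python. This is an illustrative trace under stated assumptions ('same' spatial padding for the Conv-LSTM layers, a 2× spatial pooling factor, and a placeholder 256×256 resolution); it is not the authors' code.

```python
# Hypothetical shape trace of the Conv-LSTM stack described in the
# response. 'Same' padding and the 2x pooling factor are assumptions.

def conv_lstm_shape(shape, filters):
    """Conv-LSTM with 'same' padding keeps (t, H, W), changes channels."""
    t, h, w, _ = shape
    return (t, h, w, filters)

def maxpool3d_shape(shape, pool=2):
    """3D max pooling halves the spatial dims, keeps t and channels."""
    t, h, w, c = shape
    return (t, h // pool, w // pool, c)

# Input: t frames of single-channel (grayscale) infrared images.
t, H, W = 5, 256, 256          # resolution is a placeholder value
shape = (t, H, W, 1)

for filters in (64, 128, 64):  # the three Conv-LSTM layers
    shape = conv_lstm_shape(shape, filters)

shape = maxpool3d_shape(shape)
print(shape)                   # (5, 128, 128, 64)
```

Tracing shapes this way before building the model makes the 2D/3D compatibility concerns of Point 9 easy to verify on paper.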

Point 2: Additional information about the traditional methods, and clarity on why the authors proposed this strategy, is needed. The authors should provide a broader analysis to further strengthen the study's intuition.

Response 2: Thank you for this advice. According to the comment, we provided more analysis to strengthen our intuition, as follows: “As analyzed above, the traditional methods mainly rely on the edge gradient between the background and the target. Typical single-frame methods perform well when the background is relatively uniform, but the results are very sensitive to bad pixels caused by production defects of infrared sensors. Widely used multi-frame algorithms take advantage of serial correlation; however, they consume too many computing resources. The calculation of fixed parameters and the sensitivity to bad pixels limit the effectiveness of traditional algorithms.” (Lines 76-82)

Point 3: The given annotations in Figure 6 should have an adequate explanation. The given figure contains information that does not even have enough details in the paper.

Response 3: Thank you for your careful review. We apologize for not describing the criteria more clearly. We have now added a more detailed explanation, as follows: “The STM has 3 Conv-LSTM layers followed by 1 pooling layer. It takes multiple frames as input, e.g., t frames; t is related to the target's motion speed, and we set t to 5 in this paper.” (Lines 231-233) and “Multiple Conv-LSTM layers can improve the memory ability of the network [28]. Considering the size of the target and the resolution of the dataset images, we chose 3 layers to learn the sequence motion features. Ci,j denotes the cell state along the time line, and Hi,j is the output of a single cell, where i denotes the network layer and j the specific unit. Pool is the pooling layer, employed to refine the learning result. The output of the STM is the time-weighted feature layer, displayed as the pink square.” (Lines 234-240)

Point 4: The construction of the model shows an alternating inclusion of batch normalization layers. The authors should provide adequate information about their intuition of putting several alternating layers. The authors should also indicate their effects on the previous and following layers. Can the authors provide experimentation without them?

Response 4: We deeply appreciate the suggestion and have added the following explanation: “Note that we place batch normalization layers after each feature extraction layer, mainly to accelerate the convergence of the model. The batch normalization layers can remove the correlation between features and make all features have the same mean and variance [25].” (Lines 201-204)

Given how widely batch normalization is now used in neural networks, and the space limitations of this paper, we have cited reference [25] in the revised manuscript for interested readers.
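For intuition, the normalization that a batch normalization layer applies can be sketched in a few lines of Python. This is a simplified, single-feature version that omits the learnable scale and shift (gamma, beta) and the running statistics used at inference time.

```python
# Minimal sketch of batch normalization over one feature: subtract the
# batch mean and divide by the batch standard deviation (epsilon added
# for numerical stability). Gamma/beta are omitted for brevity.
import math

def batch_norm(xs, eps=1e-5):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
# The normalized features now have (approximately) zero mean and unit
# variance, which is what decorrelates scale across layers.
```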

Point 5: Explanation about Figure 4 should use the presented variables and annotations in the figure.

Response 5: We agree with the comment and have amended the related content, as follows: “The time stream handling of this paper is shown in Figure 4. The indices i, i-1, …, i-t in the squares indicate the image sequence. When a new frame arrives, the time pipe discards the first frame (Framei-t) and appends the new image (Framei) at the tail. This operation fully ensures the continuity of the target's movement and improves the efficiency of memory usage.” (Lines 171-175)
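The sliding time pipe described here maps naturally onto a fixed-length queue. A minimal Python sketch (frame IDs standing in for real image data) could look like this:

```python
# Sketch of the time-pipe update in Figure 4: a fixed-length window
# that drops Frame(i-t) and appends Frame(i) when a new frame arrives.
# collections.deque with maxlen implements exactly this eviction.
from collections import deque

t = 5                           # window length (set to 5 in the paper)
time_pipe = deque(maxlen=t)

for frame_id in range(8):       # frame_id stands in for image data
    time_pipe.append(frame_id)  # oldest frame is evicted automatically

print(list(time_pipe))          # [3, 4, 5, 6, 7]
```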

Point 6: In lines 172-174, can the authors provide additional details about these lines and references?

Response 6: Thank you for your careful review. We have cited references [26, 27] to provide additional details in the discussion. (Line 213)

Point 7: The authors should mention the dimension of their Conv-LSTM consistently in the paper.

Response 7: We agree with the comment and are very sorry for the inconsistencies in the manuscript and any inconvenience they caused. The manuscript has been thoroughly revised, with the changes marked in red.

Point 8: Equations 1-6 should have more precise explanations and better introductions of each variable used. The equations should also have added references.

Response 8: Thank you for underlining this deficiency; we have cited references [29, 30] for the related equations.

Also, we have rewritten the explanation of the equations to provide a better illustration, as follows: “In Equation (1), boxpre and boxgt denote the network prediction and the ground-truth coordinates, respectively. The term 1-IoU(boxpre, boxgt) evaluates the overlap between the predicted and true boxes. ρ(boxpre, boxgt) computes the Euclidean distance between their centers, and C is the diagonal length of the minimum bounding box enclosing boxpre and boxgt; the second term thus measures the distance between the centers of the prediction and the true label. υ measures the consistency of the aspect ratios, where w and h are the width and height of the predicted box, and wgt and hgt are those of the true box. α is a positive coefficient computed from υ and the IoU. The third term is used to make the width and height converge as quickly as possible. Equation (4) defines the confidence, which is evaluated by Equation (5), where P denotes the probability of an object: P = 0 means there is no target in the box, and P = 1 means a target exists.” (Lines 276-287)
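The three terms described here match the widely used CIoU formulation. A hedged Python sketch may help readers check the equations; the box format (x1, y1, x2, y2) is our assumption, not taken from the paper.

```python
# Sketch of a CIoU-style loss: overlap term, normalized center
# distance, and aspect-ratio consistency. Boxes are (x1, y1, x2, y2);
# this is an illustration, not the authors' implementation.
import math

def ciou_loss(pre, gt):
    px1, py1, px2, py2 = pre
    gx1, gy1, gx2, gy2 = gt
    # IoU term: 1 - IoU(box_pre, box_gt)
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # squared center distance rho^2 over squared enclosing diagonal C^2
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2
            + (py1 + py2 - gy1 - gy2) ** 2) / 4
    cx1, cy1 = min(px1, gx1), min(py1, gy1)
    cx2, cy2 = max(px2, gx2), max(py2, gy2)
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    # aspect-ratio consistency term alpha * v
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for identical boxes
```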

Point 9: The authors should provide the calculation of their shape and their compatibility. The structure shows a composition of the Conv-LSTM2D and Conv3D to the UpSampling3D. Issues can arise during reproduction without information about the calculation of the 2D to 3D shapes and their compatibility.

Response 9: We fully agree with your suggestion and have added the corresponding description, as follows: “The input layer has four parameters: the input sequence length t, the image height and width, and the number of channels. Given that infrared images are normally grayscale, we used a single channel as the network input. These are followed by three Conv-LSTM layers, whose convolutional kernel sizes are [t, 3, 3, 64], [t, 3, 3, 128] and [t, 3, 3, 64], respectively. The choice of dimensions and channels follows common design practice and empirical results. To preserve feature maps in the temporal dimension, 3D convolutional kernels are used for the subsequent max-pooling operation; the parameter meanings are kept consistent with the preceding layer. Layers at the same scale are then concatenated directly after the upsampling operation.” (Lines 188-196)

Also, the added Figure 6 in the revised manuscript provides a more concise illustration.

Point 10: The authors should provide references for the models used in Table 4.

Response 10: Thank you for underlining this deficiency. We have added the references [10, 11, 32, 33] in the original Table 4.

Point 11: The results presented in Figs. 11-16 should have proper annotations on the image itself. Relying on colored boxes alone can limit its meaning if printed on a grayscale medium.

Response 11: Thank you for your careful consideration. We have added the text description and drawing statement to provide a better comparison in the revised manuscript. (The symbols in the revised manuscript are Figs. 13-18.)

Point 12: The authors need to identify and highlight their work's drawbacks or limitations.

Response 12: We agree with the comment and have added the limitations in Section 4, as follows: “However, the LSTM cell inevitably needs large training datasets to learn the motion features, which reduces its usefulness in certain circumstances. In addition, the proposed method still involves manual hyperparameters, such as the length of the input frames, which is determined by the motion speed. These manual parameters also limit the application.” (Lines 527-531)

Point 13: The STM, DFM, and SATM do not seem to reflect or have enough information on how they get integrated into the overall proposed model or structure.

Response 13: Thank you for your careful review. We apologize for not describing the inclusion criteria more clearly.

We have re-emphasized the names in the abstract of the revised manuscript. (Lines 16-18)

And we have renamed subsection in Section 2 to better illustrate the structure of the proposed model. (Lines 209, 242, 297)

Point 14: The authors need to provide enough mathematical proof to justify the significance of the Kalman filter in their proposed model. The authors must also supply more references regarding their intuition in section 2.2.1.

Response 14: Thank you for your careful comments; we appreciate your concerns. Considering the popularity of the Kalman filter and the focus of the paper, we kept the formula derivation brief. To better illustrate its feasibility, we have cited references [35, 36, 37] as theoretical support and added a flow diagram (Figure 11). We hope these provide sufficient mathematical grounding for interested readers.

Section 2.2.1 mainly addresses problems encountered in practical applications, as mentioned in references [31, 32, 33]. Inspired by reference [34], we proposed an inference strategy that may be suitable for infrared small target detection. We have cited the related articles in the discussion. (Lines 298-322)
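As a toy illustration of the predict/update cycle that underlies the Kalman filter discussion, here is a scalar (random-walk) Kalman filter in Python. The noise settings are placeholders, not the paper's values, and a real tracker would use a constant-velocity state vector rather than a single coordinate.

```python
# Illustrative scalar Kalman filter: predict/update for one target
# coordinate under a random-walk motion model. q and r are placeholder
# process/measurement noise values, not the paper's settings.
def kalman_1d(measurements, q=1e-3, r=0.25):
    x, p = measurements[0], 1.0   # initial state and covariance
    estimates = []
    for z in measurements:
        p = p + q                 # predict: covariance grows by process noise
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update with the new measurement
        p = (1 - k) * p
        estimates.append(x)
    return estimates

noisy = [10.2, 9.8, 10.4, 9.9, 10.1]
smooth = kalman_1d(noisy)
# The estimates stay close to the underlying value (~10) as frames arrive,
# which is the smoothing behavior exploited when constraining target motion.
```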

Point 15: Figure 10 has (a) and (b) but does not have a proper discussion about each. Why did the authors put these illustrations in the same figure number? It does not seem to provide a comparison, only confusion.

Response 15: Thank you for highlighting this shortfall. The two subfigures are mainly presented to illustrate that our proposed network achieves convergence.

We have provided corresponding details to illustrate the possible reasons, as follows: “Figure 12 shows the convergence of our model. Figure 12a plots the training error curve and the validation variation tendency. It can be seen that the loss value decreases considerably at the beginning of the training phase, indicating that the learning rate is appropriate and gradient descent is proceeding. The loss curve smooths out after a certain stage of learning, and the network achieves convergence. Figure 12b visualizes the evolution of the learning rate, which decreases normally during the training phase and stabilizes after 25 epochs.” (Lines 392-399)

Point 16: The parameter settings in Table 3 need additional details.

Response 16: Thank you for your careful review. We are very sorry for the omission and have added the required information, as follows: “For the TVF, the time-domain window size and the iteration step size are T = 16 and S_Z = 8, respectively. For the DP, the number of frames in a batch is T = 8. For the WSLCM, the Gaussian filter coefficient is set to K = 9 and the threshold factor to λ = 0.8. For the MCLoG, the number of scales is set to K = 4. Considering that the target in our dataset moves no more than 5 pixels per frame, the input frame length of the proposed method is set to T = 5 in the experiments.” (Lines 382-388)
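For convenience, the comparison settings quoted above can be collected into a single configuration mapping; the key names below are illustrative, not taken from the authors' code.

```python
# Hypothetical configuration dictionary gathering the baseline and
# proposed-method parameters listed in the response.
params = {
    "TVF":   {"T": 16, "step_size": 8},  # time-domain window, iteration step
    "DP":    {"T": 8},                   # frames per batch
    "WSLCM": {"K": 9, "lambda": 0.8},    # Gaussian coefficient, threshold factor
    "MCLoG": {"K": 4},                   # number of scales
    "ours":  {"T": 5},                   # input length; target moves <= 5 px/frame
}
```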

Point 17: Please improve the abstract by adding more info regarding how they achieved their highest score of 95.87%. The abstract mainly indicates that it is because the SNR lands at 1-3.

Response 17: Thank you for the suggestion. We agree with the comment and have rewritten the sentence in the revised manuscript as follows: “On the mid-wave infrared datasets collected by the authors, the proposed method achieves a 95.87% detection rate. The SNR of the dataset is around 1-3 dB, and the backgrounds of the sequences include sky, asphalt roads, and buildings.” (Lines 22-24)

Point 18: The authors should add more references that will strengthen the novelty and significance of their work.

Response 18: We agree with the comment, and we have added references [25-29] and [30-38] in the revised manuscript.

Point 19: Conclusions present more of a recap or overview than an actual conclusion about the findings. The authors need to deepen and substantially highlight the most significant findings and drawbacks during experimentation and production.

Response 19: Thank you for underlining this deficiency. This section was revised according to the suggestion, as follows: “This method could hopefully be applied to IRST for anti-UAV remote monitoring and early warning. Unlike existing single-frame deep learning methods, the proposed method introduces a recurrent network to learn the temporal features of moving targets. The experiments demonstrate the potential of multi-frame neural networks and emphasize the importance of joint detection with spatial-temporal information. However, the LSTM cell inevitably needs large training datasets to learn the motion features, which reduces its usefulness in certain circumstances. In addition, the proposed method still involves manual hyperparameters, such as the length of the input frames, which is determined by the motion speed. These manual parameters also limit the application. In the future, self-supervised training without labels will be explored to improve the network training and the intelligence of the method.” (Lines 522-532)

Point 20: The authors should consider open-sourcing their codes for better replication and validity of their study.

Response 20: We completely agree with the comment. However, we apologize that the project has not yet concluded, and the code cannot be released at this time due to confidentiality requirements. Once the project is completely finished, we plan to release it on GitHub. We hope our study can make a modest contribution to research on infrared small target detection.

Point 21: Most of the figures require re-work. The figures should have higher DPIs and better text quality. The authors may not include the text on the figures during rendering and include them using their selected text editing software.

Response 21: Thank you for your valuable advice. We have remade all the figures in the revised manuscript.

Point 22: The overall paper still needs proofreading and further checking before consideration, not only in grammar but also in technicalities.

Response 22: Our deepest gratitude goes to you for your careful work and thoughtful suggestions, which have helped improve this paper substantially. Following the suggestions, we have rechecked the manuscript and verified the experimental data. The modified content has been marked using the “Track Changes” function.

We gained a great deal from your detailed and careful review, in both technical matters and content arrangement.

Thank you for allowing us to resubmit a revised copy of the manuscript and we highly appreciate your time and consideration.

Looking forward to your decision,

With kind personal regard,

Sincerely yours,

Ms. Jie Li

E-Mail: [email protected]

Author Response File: Author Response.docx

Reviewer 2 Report

The paper is well written, and the design intuition and concepts are clearly explained.

The novelty of the algorithm (ConvLSTM + U-Net) is limited, but in this application scenario it is definitely pioneering work.

The results are significantly better than baselines. 

The design of the inference stage method is sound.

However, I have some doubts about the performance of more recent object detection algorithms (for generic videos) when used in this infrared scenario, such as newer versions of YOLO. It would be interesting to have more discussion on why these methods may not be suitable for infrared videos.

Author Response

Response to Reviewer 2 Comments

Dear reviewer:

Manuscript ID: applsci-1954749

Title: Learning Motion Constraint Based Spatio-Temporal Network for Infrared Dim Target Detection

Authors: Jie Li, Pengxi Liu, Xiayang Huang, Wennan Cui *, Tao Zhang *

Thank you for your careful review and summary. We appreciate your positive evaluation of our work and agree with the comments regarding the limitations of our study.

We have been following the latest small target detection methods, including the very promising Vision Transformer and YOLO. We are now conducting a new study on small object detection using optical flow and a self-supervised network. Thank you for your valuable suggestions. We plan to conduct comprehensive experiments comparing our work with other SOTA methods to explain why they may not be suitable for infrared videos. We hope our study can make a modest contribution to research on infrared small target detection.

Thank you once again for your attention to our paper.

We wish good health to you, your family, and community.

With kind personal regard,

Sincerely yours,

Ms. Jie Li

E-Mail: [email protected]

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors satisfied my requirements and concerns. However, the paper requires some further improvement in its writing. Hence, the authors need to have their paper proofread more strictly.

Author Response

Response to Reviewer 1 Comments

Dear reviewer:

Manuscript ID: applsci-1954749

Title: Learning Motion Constraint Based Spatio-Temporal Network for Infrared Dim Target Detection

Authors: Jie Li, Pengxi Liu, Xiayang Huang, Wennan Cui *, Tao Zhang *

Thank you for this valuable feedback. The paper has been carefully revised by a professional language editing service to improve its grammar and readability.

Accordingly, we have uploaded the revised manuscript. We sincerely hope that this revised manuscript has addressed the deficiency in English Writing.

We sincerely appreciate your careful work. Once again, thank you very much for your comments and suggestions.

Looking forward to your decision,

With kind personal regard,

Sincerely yours,

Ms. Jie Li

E-Mail: [email protected]

Author Response File: Author Response.docx
