Improved Feature Extraction and Similarity Algorithm for Video Object Detection
Round 1
Reviewer 1 Report
- The motivation should be presented earlier than the methods section. In general, the whole paper structure needs to be improved so that there are Introduction and Related Work, Materials and Methods, and Results and Discussion sections.
- What is the value of the IoU used for the reported results in the tables?
- What are the numbers and locations of detection heads?
- What is the throughput of your methods? This is necessary for real-time video object recognition.
- Provide the precision-recall curves.
- Include references for the well-known models (e.g., ResNet-101).
- Why are there apostrophes in Equations 9 and 10?
- Similar studies utilizing YOLO can be cited so as to establish the trustworthiness of the models and provide reliability to the baseline settings; see Detection of K-complexes in EEG waveform images using faster R-CNN and deep transfer learning. BMC Med Inform Decis Mak 22, 297 (2022). https://doi.org/10.1186/s12911-022-02042-x
- The table of abbreviations is missing but required by the template.
- The section numbering should start from 1.
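As context for the IoU question in the comments above: detection metrics such as mAP count a prediction as correct only if its intersection over union with a ground-truth box exceeds a chosen threshold (commonly 0.5), so the threshold used must be reported. A minimal sketch of the standard IoU computation follows; the function name and the (x1, y1, x2, y2) box format are illustrative, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two unit-offset 2x2 boxes share a 1x1 overlap out of a union of 7, giving an IoU of 1/7, well below the common 0.5 matching threshold.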
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
The paper describes a machine learning algorithm for object detection in video streams. The proposed approach has been trained and tested on the ImageNet video dataset, with a reported final mean average precision of 83.55%. A major problem with the manuscript is its tutorial content and language. The background material is insufficient, and the language is hard to read at various points. For instance,
1- Several abbreviations have been used without being defined first.
2- The introduction section is too short. The problem statement, reference works and the contributions of the paper have been summarized in only 30 lines.
3- A full section (No. 1) has been dedicated to only one reference work, i.e., Faster R-CNN, and still does not describe it in full detail. It is advisable to merge such short sections into one large section with sub-sections.
4- Section 3 on the proposed methodology needs to be elaborated as well. At present, it is not apparent how the concepts of similarity and feature aggregation relate to the optimization of the object detection problem. Sub-section 3.1 has only 5 lines dedicated to this effect and does not sufficiently cover the background.
5- The modified SSIM has been described in Eq. 4. However, its differences from the original SSIM have not been discussed.
6- The S-SELSA architecture shown in fig. 3 is not clear. The figure quality is too poor.
7- What is the relation of transition probability (eq. 7) and normalized min cut (eq. 8) to the proposed architecture?
8- In Table 2, the comparison has been made with other reference works. However, the citations and expanded abbreviations for these works (TCN, TCN+LSTM, FGFA, D&T) are missing.
9- The claimed mean average precision of 83.55% is not state-of-the-art. Please see the following link for the latest results on the ImageNet VID dataset.
https://paperswithcode.com/sota/video-object-detection-on-imagenet-vid
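As background for point 5 above: the original SSIM of Wang et al. (2004) compares two images through their luminance, contrast, and structure statistics, so any modified variant should be contrasted against this baseline. Below is a simplified sketch computing SSIM globally over whole images rather than with the usual sliding Gaussian window; the function name and defaults are illustrative, not from the paper.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Original SSIM formula applied over entire images (no sliding window).

    SSIM(x, y) = ((2*mu_x*mu_y + c1) * (2*cov_xy + c2))
               / ((mu_x^2 + mu_y^2 + c1) * (var_x + var_y + c2))
    with the standard stabilizing constants c1 = (0.01*L)^2, c2 = (0.03*L)^2.
    """
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

By construction the measure is symmetric, bounded above by 1, and equals 1 only when the two inputs are identical, which is the reference behavior a modified SSIM should be compared against.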
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
The authors addressed my comments. However, there is one comment relating to YOLO, where I actually meant Faster R-CNN, as indicated by the cited reference and pointed out in the authors' response. The trustworthiness of the models and chosen parameters can provide reliability to the baseline settings; refer to Detection of K-complexes in EEG waveform images using faster R-CNN and deep transfer learning. BMC Med Inform Decis Mak 22, 297 (2022). https://doi.org/10.1186/s12911-022-02042-x
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
The authors have adequately addressed all the points raised in the previous review cycle.
Author Response
Thank you very much for your careful guidance.