Article
Peer-Review Record

Teacher–Student Model Using Grounding DINO and You Only Look Once for Multi-Sensor-Based Object Detection

Appl. Sci. 2024, 14(6), 2232; https://doi.org/10.3390/app14062232
by Jinhwan Son and Heechul Jung *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 January 2024 / Revised: 24 February 2024 / Accepted: 28 February 2024 / Published: 7 March 2024
(This article belongs to the Special Issue Information Fusion and Its Applications for Smart Sensing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

•    A brief summary (one short paragraph) outlining the aim of the paper, its main contributions and strengths
The paper presents a study of real-time object detection in environments requiring fast inference, using the YOLOv8 model on three types of multi-image datasets captured in identical situations (CCTV, automotive dashcam, and smartphone scenarios). The authors used Grounding DINO, a zero-shot object detector with high mAP performance, to auto-label the dataset. The labels generated by Grounding DINO (teacher) are used alone or mixed with manual labels to train YOLO (student). The dataset used in the experiments is The Open AI Dataset Project (Multi-Video Same Situation and Object Identification Data), which contains 11431 images in 4 classes (person, scooter, vehicle, and bicycle) from CCTV cameras, automotive dashcams, and smartphones. The authors present experimental results showing that using auto-generated labels for object detection does not degrade performance, and that combining auto-labeling with manual labeling enhances performance.
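
As an illustration of the teacher-to-student hand-off summarized above, the following minimal sketch converts detector output into YOLO-format label lines (class index followed by normalized center x, center y, width, and height). The detection dictionary layout, the `min_score` threshold, and the function names are assumptions made for this example and are not taken from the manuscript; only the four class names come from the report, and their index order is assumed.

```python
# Hypothetical sketch: turn teacher detections into YOLO-format pseudo-labels.
# The class names come from the report; their index order is an assumption.
CLASSES = ["person", "scooter", "vehicle", "bicycle"]


def to_yolo_line(label, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / img_w   # normalized box center, x
    cy = (y_min + y_max) / 2 / img_h   # normalized box center, y
    w = (x_max - x_min) / img_w        # normalized box width
    h = (y_max - y_min) / img_h        # normalized box height
    return f"{CLASSES.index(label)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def pseudo_labels(detections, img_w, img_h, min_score=0.35):
    """Keep confident detections of known classes and format them for training.

    `detections` is assumed to be a list of dicts with keys
    "label", "box", and "score"; the 0.35 threshold is illustrative only.
    """
    return [
        to_yolo_line(d["label"], d["box"], img_w, img_h)
        for d in detections
        if d["score"] >= min_score and d["label"] in CLASSES
    ]
```

Each returned string corresponds to one line of the per-image `.txt` label file that a YOLO-style trainer reads.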

•    General concept comments
Main observations are:
- the bibliography contains relatively recent references: out of 31 references, 15 are from the last 5 years, 12 are from the last 10 years, and 4 are older than 10 years;
- equations are not referenced in text
- figures are not referenced in text
- tables are referenced incorrectly in the text (some cited table numbers do not exist)
- description should be added for label combination / label fusion method
- more details can be added to some tables
- a table for comparing results for manual / generated and combined annotations can be added
- English language could be improved


•    Specific comments referring to line numbers, tables or figures that point out inaccuracies within the text or sentences that are unclear.
- in chapter 2.2.1. Information Recognition Technique for Substation Terminal Block Diagrams Based on Regional Segmentation, lines 136-143 are repeated in lines 144-150 - please rephrase lines 144-150
- figure 1 is not referenced in text
- figure 2 is not referenced in text
- figure 3 is not referenced in text
- line 229 - more details should be added regarding the combination method (label fusion) and how the weights are computed and used
- line 233 - referenced tables 4.1 and 4.2 do not exist in text
- line 237 - referenced table 4.3 does not exist in text
- for clarity, in tables 2 and 3 a description for each group (CT, BB, SP) can be added like it was done in table 4
- in description of figure 7 - "for group 1" should be added - because training results are for group 1
- in description of figure 8 - "for group 1" should be added - because confusion matrix is for group 1
- in table 5 the name of the header should be "Test Group" (instead of "Text Group")
- in table 6 the name of the header should be "Test Group" (instead of "Text Group")
- in table 7 the name of the header should be "Test Group" (instead of "Text Group")
- in table 8 the name of the header should be "Test Group" (instead of "Text Group")
- a table for comparing results for manual / auto generated / combination of manual and auto-labels can be added to compare results side by side
- line 311 - conclusions "it was confirmed that there is no degradation in object detection performance when utilizing automatically generated labels. Furthermore, combining auto-labels with manual labels resulted in an enhancement of object detection performance." are true when classes of objects can be detected by Grounding DINO ( like the 4 classes - person, scooter, vehicle and bicycle - used in this study )
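
To make concrete what a description of the label combination step might need to cover, here is one possible fusion rule, sketched under stated assumptions: manual boxes are kept as-is, and an auto-generated box is added only if no manual box of the same class overlaps it above an IoU threshold. This is not the authors' method, which is not described in this record; the function names, the `(class, box)` tuple layout, and the 0.5 threshold are hypothetical, and the sketch uses no per-label weights.

```python
# Hypothetical label-fusion sketch (NOT the manuscript's method):
# union of manual and auto labels, dropping auto boxes that duplicate
# a manual box of the same class.

def area(box):
    """Area of a box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def merge_labels(manual, auto, iou_thr=0.5):
    """Return manual labels plus the auto labels that do not duplicate them.

    Both inputs are lists of (class_name, box) tuples; iou_thr is an
    illustrative value, not one reported by the authors.
    """
    merged = list(manual)
    for cls, box in auto:
        duplicate = any(
            cls == m_cls and iou(box, m_box) >= iou_thr
            for m_cls, m_box in manual
        )
        if not duplicate:
            merged.append((cls, box))
    return merged
```

A weighted scheme, as the comment on line 229 implies, would additionally need a rule for how much each label source contributes, which this sketch deliberately leaves out.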

•    Is the manuscript clear, relevant for the field and presented in a well-structured manner?
Manuscript is relevant for the field and well presented.

•    Are the cited references mostly recent publications (within the last 5 years) and relevant? Does it include an excessive number of self-citations?
- the bibliography contains relatively recent references: out of 31 references, 15 are from the last 5 years, 12 are from the last 10 years, and 4 are older than 10 years;

•    Is the manuscript scientifically sound and is the experimental design appropriate to test the hypothesis?
The experiment design is appropriate to test the hypothesis.

•    Are the manuscript’s results reproducible based on the details given in the methods section?
More information should be added in order for the results to be reproducible




Comments on the Quality of English Language

English language can be improved.

Author Response

We thank the editor and reviewers for their constructive comments. We have addressed all of them and modified the manuscript accordingly. Our detailed answers follow. Please note that additions to the original manuscript are indicated in blue.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

A very interesting paper, in which the authors took into account the influence of the image source and its properties (orientation, luminosity) on the identification and labeling of objects. The paper is well organized in presenting the research steps and the obtained results.

The authors should also give the full name of each abbreviation in the abstract and in the introduction. They do so later in the paper, but the full name should appear in each section (abstract, main article) at the first mention of the technology.

 

•    Originality and relevance

A very interesting paper, in which the authors investigated the influence of the image source and its properties (orientation, luminosity) on the identification and labeling of objects. The paper is well organized in presenting the research steps and the obtained results.

 

•    Contribution to the subject area

The authors address the problem of improving image labeling with a new approach in which a slow but reliable algorithm is used to teach a faster one to label identified objects properly. With this method, the authors manage to remove human error, which can appear when going through large amounts of data while training the system. By combining both techniques (teaching through Grounding DINO and human intervention), the system achieves the best response.

 

•    Methodology improvements

The authors could report the system’s speed and accuracy in labeling the objects in the images, to demonstrate its real-time capability. A comparison between the system's speed and that of other algorithms would improve the paper.

 

•    Consistency of conclusions

The presented results and conclusions demonstrate that a teacher-student approach to training the faster algorithm improves the labeling process, and that adding human intervention improves the labeling accuracy even further.

 

•    References:

 The references are appropriate, covering a wide range of foundational and recent studies that support the article's methodologies, experiments, and discussions. They provide a solid background and justification for the conducted research.

 

The references support the authors’ choice of YOLO and Grounding DINO for the object labeling process. The cited references also substantiate the challenges that image source, orientation, and brightness pose for object detection.

 

•    Tables and figures

The tables and figures effectively illustrate the dataset composition, experimental setup, and results. They are well-designed, providing clear and concise information that enhances the reader's understanding of the study's findings and methodologies.

Just a note to improve the manuscript's clarity: the author should ensure that Figures 1, 2, and 3 are explicitly mentioned in the text.

Additionally, it is important to verify the names of the tables referenced throughout the document (for instance, if Table 3 is mentioned in the text, ensure it is not incorrectly referred to as Table 4.3 at line 232 in the article).

•    Conclusion:

The article contributes to the field of object detection by integrating auto-labeling techniques with established object detection frameworks, showcasing practical applicability and potential areas for future research.

Author Response

We thank the editor and reviewers for their constructive comments. We have addressed all of them and modified the manuscript accordingly. Our detailed answers follow. Please note that additions to the original manuscript are indicated in blue.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors present an approach that mixes two different architectures of neural networks in order to improve the object detection rate of the simplest one of them, by using the other one to support the automatic labeling process of the training data. The idea behind this scheme is interesting, although not entirely new.

Figure 4 contains metadata from a JSON file and labels that are barely legible. I suggest that the authors rethink how they present the concept conveyed by the figure.

In lines 191-193: What do you mean by "its detection performance on this dataset did not meet expectations"? What are those expectations?

Lines 199-202: What would an optimal value be for real-time object detection purposes? It is clear that Grounding DINO is slower than YOLO, but it is not clear what counts as real-time for the application under study.
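
One way to make the real-time criterion explicit is to state a target frame rate and check measured per-image latency against it. The sketch below is only an illustration of that check: the 30 FPS target, the warm-up count, and the function names are assumptions for this example, not values from the manuscript, and `infer` stands for any callable that runs a detector on one input.

```python
# Hypothetical sketch of an explicit real-time check.
# The 30 FPS target is an assumed example value, not one from the manuscript.
import time


def measure_fps(infer, inputs, warmup=2):
    """Return measured frames per second for `infer` over `inputs`."""
    for x in inputs[:warmup]:
        infer(x)                      # warm-up runs are not timed
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed if elapsed > 0 else float("inf")


def meets_real_time(latency_ms, target_fps=30.0):
    """True if a per-image latency (in milliseconds) reaches the target FPS."""
    return 1000.0 / latency_ms >= target_fps
```

Reporting both the target and the measured frame rate for each model would let readers judge the real-time claim directly.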

The labels in Figure 5 are a little bigger than the typeface used for the text in the document. The figure can be made smaller, although I suggest increasing the width of the arrows.

Figure 6, much like Figure 4, is not very illustrative.

Line 290: What does it mean that "performance was generally satisfactory"? 

Overall, even though the paper presents a highly visual application of neural networks, the authors should aim to improve the contents, style and presentation of the figures.

Comments on the Quality of English Language

I detected few errors in syntax, but it is worth assessing the quality of the English writing at least once more before moving forward with the process.

 

Author Response

We thank the editor and reviewers for their constructive comments. We have addressed all of them and modified the manuscript accordingly. Our detailed answers follow. Please note that additions to the original manuscript are indicated in blue.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you very much for your effort. The observations were addressed and the recommendations have been implemented.

Comments on the Quality of English Language

English language was improved in the new version.
