Article
Peer-Review Record

Integration of ShuffleNet V2 and YOLOv5s Networks for a Lightweight Object Detection Model of Electric Bikes within Elevators

Electronics 2024, 13(2), 394; https://doi.org/10.3390/electronics13020394
by Jingfang Su, Minrui Yang and Xinliang Tang *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 28 November 2023 / Revised: 13 January 2024 / Accepted: 15 January 2024 / Published: 18 January 2024
(This article belongs to the Special Issue Novel Methods for Object Detection and Segmentation)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This work addresses the increasing demand for electric bikes in densely populated areas, highlighting the safety risks associated with charging them in residential buildings. It underscores the need for a detection system to identify electric bikes entering elevators. The paper proposes a lightweight YOLO detection network, offering a high frame rate (FPS) and low computational load, suitable for edge deployment. The paper is well-structured, with clear motivation, detailed methodological descriptions, and well-designed experiments. The results show an improvement in the detection rate with a significant reduction in model size.

Concerns:

  1. The authors did not specify whether the proposed model is pretrained on the COCO or ImageNet dataset. This information is crucial, as it provides context about the model's foundational training and capabilities.

  2. Although the authors demonstrated effective detection of both persons and motorcycles in Figures 11 and 12, the training datasets mentioned only include classifications for persons, electric bikes, and bikes. This discrepancy raises questions about the model's ability to accurately detect the intended target classes, suggesting a potential failure in this aspect.

  3. The camera positioning shown in Figures 11 and 12 appears to be from a handheld perspective, which is atypical for elevator environments, where cameras are usually mounted to the top corners of the elevator cabins. Given that viewing angles are critical for object detection in machine vision models, the authors should address this discrepancy to ensure the model's applicability in real-world elevator scenarios.

Author Response

Dear Reviewer,

Greetings! The document titled 'Reviewer1' is a response to the questions you raised. Additionally, the manuscript has been annotated using red, yellow, and blue colors. Blue indicates modifications made based on the reviewer's suggestions, yellow represents refinement of English grammar, and red signifies content deletion from the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Review report:
•    A brief summary (one short paragraph) outlining the aim of the paper, its main contributions and strengths

The paper presents a lightweight object detection model (YOLOv5s-Sh + SimAM + Swin Transformer + GSConv + VoV-GSCSP + EIOU) for edge deployment in elevator environments, designed to detect electric bikes. The authors propose a model based on the YOLOv5s network in which the backbone replaces the original CSPDarknet53 with a lightweight multilayer ShuffleNet V2 convolutional neural network; Swin Transformer modules are introduced between layers to enhance the feature expression capability of images, and a SimAM attention mechanism is applied at the final layer to improve the feature extraction capability of the backbone network. In the neck network, the Conv and C3 basic convolutional modules are replaced by lightweight, depth-balanced GSConv and VoV-GSCSP modules, reducing the number of parameters while enhancing the cross-scale connection and fusion capabilities of feature maps. The authors use a more accurate EIOU error function for the prediction network, which allows faster convergence. They present results from four experiments with the proposed model. The dataset used in the experiments contains 1900 images: 1000 images from online resources and 900 images obtained by applying augmentation techniques such as cropping, flipping, scaling, and noise injection. Images were manually annotated for three classes: electric bikes, bikes, and people, and the dataset was divided into training (70%), testing (15%), and validation (15%) subsets. The experimental results show that the proposed model reduces computational complexity by 84.8% and decreases model size and parameter count by 81.0% and 84.3%, respectively, while mean Average Precision (mAP) decreases by only 0.9%. The model meets the real-time detection requirements for electric bikes in elevator scenarios, providing a feasible technical solution for deploying edge devices inside elevators.
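As background on the loss function named in the summary: EIOU extends CIOU by penalizing the width and height differences between the predicted and ground-truth boxes directly, rather than through an aspect-ratio term, alongside the standard IoU and center-distance penalties. A minimal single-box sketch in plain Python, assuming corner-format (x1, y1, x2, y2) boxes (the function name and box format are illustrative, not taken from the manuscript):

```python
def eiou_loss(box_a, box_b, eps=1e-7):
    """EIOU loss for one predicted box vs. one ground-truth box,
    both in (x1, y1, x2, y2) corner format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection-over-union term
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + eps)

    # Smallest enclosing box: diagonal for the center-distance term,
    # width/height for the shape terms
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between box centers
    acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
    bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
    rho2 = (acx - bcx) ** 2 + (acy - bcy) ** 2

    # Direct width/height difference penalties (this is what
    # distinguishes EIOU from CIOU's aspect-ratio penalty)
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    loss_shape = (wa - wb) ** 2 / (cw ** 2 + eps) + (ha - hb) ** 2 / (ch ** 2 + eps)

    return (1.0 - iou) + rho2 / c2 + loss_shape
```

Identical boxes give a loss near zero, while disjoint boxes incur the full IoU penalty plus distance and shape terms, which is what drives the faster convergence the authors report.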

•    General concept comments
Main observations are:
- equations are not referenced in text
- the bibliography contains recent references: of the 21 references cited, 19 are from the last 5 years, 2 are from the last 10 years, and none are older than 10 years
- English language could be improved
- some paragraphs should be rephrased


•    Specific comments referring to line numbers, tables or figures that point out inaccuracies within the text or sentences that are unclear.
- in chapter 1. Introduction, line 68, expression "Literature [7]" could be replaced by "Authors from ... propose...", "In paper/article..."
- in chapter 1. Introduction, line 73, expression "In literature [8]" could be replaced by "Authors from ... propose...", "In paper/article..."
- in chapter 1. Introduction, line 77, expression "Literature [9] focused on " could be replaced by "Authors from ... focused on...", "Paper/article... is focused on"
- in chapter 1. Introduction, line 81, expression "In a different approach, Literature [10] deployed the YOLOv3 algorithm" could be replaced by "In a different approach, in article [10] YOLOv3 algorithm is deployed"
- in chapter 1. Introduction, line 84, expression "Furthermore, Literature [11] introduced a machine-learning algorithm" could be replaced by "Furthermore, in article [11] a machine-learning algorithm is introduced"
- in chapter 2. Related Work, line 114, expression " 2017 Face++ proposed ShuffleNet [15]," should be rephrased
- in subchapter 3.2. Lightweight improvements to YOLOv5, lines 171, 173, 176, and 178, references could be added for the Swin Transformer, GSConv, VoV-GSCSP, and the EIOU error function (references for all of them appear in the bibliography but are cited later in the text; it would be good to add them at first use, or at least where the improved YOLOv5 model is presented)
- equations 1 and 2 are not referenced in text
- in subchapter 3.2.5. Using the EIOU loss function, line 278, "The YOLOv5s network utilizes the CIOU error function as the loss function, represented by Formula 6:" - there is no formula/equation 6 (it should be equation 1)
- in subchapter 3.2.5. Using the EIOU loss function, line 288, "Formula (7) ..." - there is no formula/equation 7 (it should be equation 2)
- in subchapter 3.2.5. Using the EIOU loss function, line 288, "In the formula, ... and ... replace the ... term in the CIOU formula" - should be rephrased - for example "In equation 2 terms ... and ... "
- in subchapter 4.3.3. Ablation Experiment, line 368, "In the neck network, replacing the GSCONV+VOVGSPSP module will cause a slight increase in parameter count and model volume." - should be rephrased - for example "In the neck network, replacing original Conv and C3 modules with lightweight GSConv and VOV-GSCSP modules will cause a slight increase in parameter count and model volume."
- in subchapter 4.3.3. Ablation Experiment, line 371, "Finally, by replacing the EIOU loss function while keeping other metrics unchanged," - should be rephrased - for example "Finally, by replacing the original loss function with EIOU loss function while keeping other metrics unchanged,"
- the term "obscured targets" could be replaced with "occluded targets/objects"


•    Is the manuscript clear, relevant for the field and presented in a well-structured manner?
Manuscript is relevant for the field and well presented.

•    Are the cited references mostly recent publications (within the last 5 years) and relevant? Does it include an excessive number of self-citations?
- the bibliography contains recent references: of the 21 references cited, 19 are from the last 5 years, 2 are from the last 10 years, and none are older than 10 years

Comments on the Quality of English Language

English language can be improved.

Author Response

Dear Reviewer,

Greetings! The document titled 'Reviewer2' is a response to the questions you raised. Additionally, the manuscript has been annotated using red, yellow, and blue colors. Blue indicates modifications made based on the reviewer's suggestions, yellow represents refinement of English grammar, and red signifies content deletion from the manuscript.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you to the authors for their reply and the efforts made to improve this work. However, my concerns have not been fully addressed: While I understand that the authors have collected extensive data to cover new classes specifically for this work, it is still unclear to me whether the model was trained from scratch or fine-tuned on a pre-trained model from the COCO or ImageNet datasets. If it was trained from scratch, the size of the collected image dataset seems too small for effective model learning. My second question relates to the first: if the model was fine-tuned on a pre-trained model, then it makes sense that the model can recognize motorcycles, which were not included in the dataset collected by the authors. However, how can the authors demonstrate the effectiveness of the additional dataset used in training the model? The resolution of Figures 11 and 12 is quite low, making it difficult for readers to discern the information in them.

Author Response

Dear Reviewer,

Greetings! The document titled 'Reviewer1' is a response to the questions you raised. Additionally, the manuscript has been annotated using red, yellow, and blue colors. Blue indicates modifications made based on the reviewer's suggestions, yellow represents refinement of English grammar, and red signifies content deletion from the manuscript.

Please see the attachment

Author Response File: Author Response.pdf
