Peer-Review Record

Real-Time Embedded Implementation of Improved Object Detector for Resource-Constrained Devices

J. Low Power Electron. Appl. 2022, 12(2), 21; https://doi.org/10.3390/jlpea12020021
by Niranjan Ravi and Mohamed El-Sharkawy *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 15 February 2022 / Revised: 24 March 2022 / Accepted: 31 March 2022 / Published: 13 April 2022
(This article belongs to the Special Issue Advanced Researches in Embedded Systems)

Round 1

Reviewer 1 Report

The paper is related to improving object detection performance in CV algorithms. The authors propose a new metric called “improved intersection over union” (IIoU), which basically augments the previously used generalized IoU to normalize over the scaling and aspect-ratio issues associated with bounding-box detection. They train the YOLOv5 network with a loss function based on the new metric and show that its performance is better than that of other IoU-based methods.

In general, the paper is written well, but the authors need to better motivate the actual inefficiency they are trying to address and explain how the different metrics connect together.

The sentences on lines 41-43 need to be rewritten. The second part of the sentence seems to contradict the first part.

Line 135: what does the 16 GFLOPS specification mean here? Is it the expected computation needed to meet real-time goals? Clarify this in the paper.

Figure 2 has not been explained; it is only referred to in different places. You should use this diagram to explain the different types of issues with the vanilla IoU method.

Lines 182-184: you need to expand on what these problems mean. These seem like the central issues your newly proposed method is addressing, aren't they?

The sentence on line 165 is not clear. Can you use simpler terms to explain the primary drawback?

Section 4.4: it is not clear how your improved IoU improves performance over DIoU. Also, how will your IIoU handle the case of concentric rectangles, where the centers match exactly but the predicted rectangle is smaller than the ground truth?

Algorithm 1: why do you calculate IIoU twice, on lines 25 and 26?

Section 8: before using mAP, please explain in the paper what mAP is, how it is computed, and how it depends on IoU.

In Figure 8: can you clarify how the range of mAP over IoU values is represented on a single x-axis?

What confidence thresholds did you use during your evaluation?

Line 312: does 0.067 s per image mean your algorithm took 2 s to process a video of length 1 s?

Section 8.3: can you compare the performance (latency) of SSD and the default YOLOv5 in real-time outdoor conditions and contrast them with your model?

Author Response

Respected Reviewers,

We would like to thank the reviewers for taking the time to review our research. The valuable comments from the reviewers have helped us to re-evaluate our research findings and to extend our research. We have noted below the responses/changes we made in the article for each comment. In the revised version, many sections were modified and new sections were added. The number of references has also changed. Latexdiff did not perform effectively because the revision includes algorithms and equations, so we marked the changes in yellow in the updated PDF document.

Reviewer 1:

The paper is related to improving object detection performance in CV algorithms. The authors propose a new metric called “improved intersection over union” (IIoU), which basically augments the previously used generalized IoU to normalize over the scaling and aspect-ratio issues associated with bounding-box detection. They train the YOLOv5 network with a loss function based on the new metric and show that its performance is better than that of other IoU-based methods.

In general, the paper is written well, but the authors need to better motivate the actual inefficiency they are trying to address and explain how the different metrics connect together.

The sentences on lines 41-43 need to be rewritten. The second part of the sentence seems to contradict the first part.

Response: The lines have been modified.

Line 135: what does the 16 GFLOPS specification mean here? Is it the expected computation needed to meet real-time goals? Clarify this in the paper.

Response: We re-examined the YOLOv5s architecture and observed that its 16 GFLOPS adds more computation than is suitable for edge devices. We overcame this problem by choosing a lightweight model, YOLOv5n6, which requires 4.3 GFLOPS. The number of GFLOPS can be reduced further to increase the speed of the network for real-time object detection, but this comes with a trade-off in accuracy. Lines 82-86 of Section 2.1 provide an introduction to the relationship between FLOPS, accuracy, and network parameters.

Figure 2 has not been explained; it is only referred to in different places. You should use this diagram to explain the different types of issues with the vanilla IoU method.

Response: Figure 2 has been renumbered as Figure 3. Yes, in the updated revision we have reused the figure to address the drawbacks of IoU and GIoU in lines 182-192 and lines 202-204.

Lines 182-184: you need to expand on what these problems mean. These seem like the central issues your newly proposed method is addressing, aren't they?

Response: The drawbacks of the IoU metric are now analyzed in detail, and we explain how the loss metric behaves in three major cases. In the case of no overlap, we explain how the gradient of the loss becomes 0. We also added more detail on the behavior of the loss metric during partial or complete overlap.
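
To make the no-overlap case concrete, here is a minimal PyTorch sketch (an illustration, not the paper's implementation): for disjoint boxes the intersection is zero, so the IoU loss is a constant 1 and the gradient with respect to the predicted box is exactly zero, leaving the network with no signal to move the box.

```python
import torch

def iou_loss(pred, target):
    # boxes as (x_min, y_min, x_max, y_max)
    ix1 = torch.max(pred[0], target[0]); iy1 = torch.max(pred[1], target[1])
    ix2 = torch.min(pred[2], target[2]); iy2 = torch.min(pred[3], target[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    return 1.0 - inter / (area_p + area_t - inter)

pred = torch.tensor([0.0, 0.0, 1.0, 1.0], requires_grad=True)
target = torch.tensor([3.0, 3.0, 4.0, 4.0])   # disjoint from pred
iou_loss(pred, target).backward()
print(pred.grad)                              # tensor([0., 0., 0., 0.])
```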

The sentence on line 165 is not clear. Can you use simpler terms to explain the primary drawback?

Response: Lines 197-204 address the primary drawback of the GIoU metric. In this revision, we explain the drawback with the help of a figure to aid the reader's understanding.

Section 4.4: it is not clear how your improved IoU improves performance over DIoU. Also, how will your IIoU handle the case of concentric rectangles, where the centers match exactly but the predicted rectangle is smaller than the ground truth?

Response: In the proposed metric, we consider the center coordinates along the x and y axes. For instance, for a rectangular box, the x-axis center point is (x_center, y_min) and the y-axis center point is (x_min, y_center). Figure 4 shows a sample of three cases in which two rectangles share a common center; they can lie one inside the other with different aspect ratios.

Consider Figure 4b as an example. In this particular scenario, x_center, y_center, and x_min of the boxes match perfectly, but the y_min values of the boxes differ. Since we evaluate the Euclidean distance between these coordinates, the gradient of the IIoU metric does not become completely 0. A similar explanation applies to Figure 4c, where, in addition to the centers aligning, y_min also aligns; x_min, however, does not, so the gradient acquires a value that allows the model to regress further. We calculated the loss values in Figures 4a, 4b, and 4c by assigning the boxes different areas and aspect ratios to verify our proposal. We also modified lines 219 and 220 to provide more insight into the coordinate points we consider.
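
To illustrate the idea, here is a minimal sketch of the distance term as described above (a plausible reading of the description, not the exact formula from the paper): it measures Euclidean distances between the edge-midpoint coordinates (x_center, y_min) and (x_min, y_center) of the two boxes, so the term stays nonzero for concentric boxes of different sizes, where a center-to-center distance such as DIoU's would be 0.

```python
import math

def iiou_distance_sketch(box_a, box_b):
    """Distance term in the spirit of the described IIoU (illustrative,
    hypothetical form); boxes are (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # midpoint of the bottom edge: (x_center, y_min)
    d_x = math.dist(((ax1 + ax2) / 2, ay1), ((bx1 + bx2) / 2, by1))
    # midpoint of the left edge: (x_min, y_center)
    d_y = math.dist((ax1, (ay1 + ay2) / 2), (bx1, (by1 + by2) / 2))
    return d_x + d_y

# Concentric boxes as in Figure 4: centers coincide, sizes differ.
print(iiou_distance_sketch((0, 0, 4, 4), (1, 1, 3, 3)))  # 2.0, not 0
```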

We also carried out a simulation experiment in this revision that compares all the loss metrics, how they converge, and the total error of each loss metric at the end of the experiment. This is detailed in Section 5.

Algorithm 1: why do you calculate IIoU twice, on lines 25 and 26?

Response: The redundant lines have been removed.

Section 8: before using mAP, please explain in the paper what mAP is, how it is computed, and how it depends on IoU.

Response: In the background, we added a new section called Performance Metrics. This section provides a brief introduction to precision and recall and how they are estimated. Average precision can be calculated across various IoU thresholds depending on the application goals.
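
As a generic illustration of the new section's content (not the evaluation code used in the paper), AP for one class at a single IoU threshold can be computed from the ranked detections as the area under the precision-recall curve; mAP then averages AP over classes.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class at one IoU threshold. `is_tp` flags whether each
    detection matched an unmatched ground-truth box with IoU above the
    threshold; `num_gt` is the number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores))          # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_gt
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # monotone precision envelope, then area under the PR curve
    prec_env = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * prec_env))

print(average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=3))  # ~0.556
```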

In Figure 8: can you clarify how the range of mAP over IoU values is represented on a single x-axis?

Response: Figure 8 has been removed. The initial figure showed results on the validation dataset, and we realized it would not provide additional information to the literature. Instead, we added Table 1 to show the results of the experiment on the evaluation datasets. To extend the current research to cases where objects are densely packed together, the CGMU dataset was also considered and Table 2 was added. Section 7 was also modified to provide background on the datasets.

What confidence thresholds did you use during your evaluation?

Response: We evaluated our results across IoU thresholds ranging from 50% to 95% and also took the average over all thresholds. Tables 1 and 2 show how each loss metric performs compared with the IoU metric on each dataset. We observed that the proposed metric outperformed the existing metrics at most thresholds, and the overall AP also increased in the evaluation of both datasets.
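
Continuing the sketch above, the averaged figure (AP@[.5:.95]) simply averages the single-threshold AP over the ten IoU thresholds; `ap_at_threshold` here is a hypothetical helper that re-matches detections to ground truth at threshold `t` and calls the AP routine above.

```python
import numpy as np

thresholds = np.arange(0.50, 0.96, 0.05)              # 0.50, 0.55, ..., 0.95
ap_values = [ap_at_threshold(t) for t in thresholds]  # hypothetical helper
ap_50_95 = float(np.mean(ap_values))                  # averaged AP across thresholds
```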

Line 312: does 0.067 s per image mean your algorithm took 2 s to process a video of length 1 s?

Response: This statement has been removed. In the updated revision, we calculated the FPS (frames per second) value based on how many camera frames the model was able to process in a second. We tested YOLOv5n6 (baseline), improved YOLOv5n6 (with our metric), and SSD in different outdoor environments and calculated an average FPS.
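
A minimal sketch of this FPS measurement (`camera` and `model` are hypothetical handles standing in for the actual capture device and detector):

```python
import time

frames, t0 = 0, time.time()
while time.time() - t0 < 10.0:        # measure over a fixed 10 s window
    frame = camera.read()             # hypothetical capture call
    detections = model(frame)         # hypothetical detector call
    frames += 1
fps = frames / (time.time() - t0)     # average frames processed per second
print(f"average FPS: {fps:.1f}")
```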

Section 8.3: can you compare the performance (latency) of SSD and the default YOLOv5 in real-time outdoor conditions and contrast them with your model?

Response: Figure 9 has been added in this revision. Latency is measured in terms of FPS for all three architectures. We observed that the baseline and improved YOLOv5n6 performed at almost the same speed, whereas the SSD network was slower in real time. Figure 10 also provides sample images of inaccurate detections by SSD in frames where the proposed YOLOv5n6 detects most of the objects at a higher speed of approximately 25 FPS.

Author Response File: Author Response.pdf

Reviewer 2 Report

The manuscript presents an implementation of YOLOv5 on an NVIDIA Jetson Xavier device with an improved Intersection-over-Union (IIoU) loss function. The proposed IIoU is a combination of Generalized Intersection-over-Union (GIoU) (Ref. [42]) and Distance Intersection-over-Union (DIoU) (Ref. [43]), but it calculates the distance loss in a different way from Ref. [43]. In [43], the distance loss is computed from the Euclidean distance between the central points of the two bounding boxes, whereas the proposed IIoU calculates the distance loss along the x-axis and y-axis separately. The authors demonstrate the improvement experimentally in terms of mAP, but the improvement is not significant. The authors do not explain in detail why IIoU is theoretically much better than the other metrics. The contribution of the work is very limited.

I have the following concerns about the manuscript:

  • The proposed IIoU improves object detection performance (insignificantly), at the cost of a more complicated IoU loss function. This is not desirable for resource-constrained devices. Moreover, although the manuscript title says “for resource-constrained devices”, the manuscript does not address any issue related to resource-constrained devices.
  • The implementation on the NVIDIA Jetson device does not strongly support the advantage of IIoU, and is thus redundant, unnecessary, and removable. The performance (Fig. 7 and Fig. 8) is obtained from the training infrastructure.
  • The writing is not concise and informative. The authors spend a lot of space presenting less relevant (or too general) content, for example YOLO, ROS, and the PASCAL VOC dataset, but do not present IoU in sufficient detail and depth. The analysis of the performance is not sufficient or convincing. For example, in Fig. 7 and Fig. 8, IIoU delivers better results than the other three metrics, but those three deliver almost the same performance as one another (this is doubtful and should be explained). Some figures (Fig. 4, Fig. 6, Fig. 9) are not cited in the text. Fig. 5, Fig. 6, Fig. 9, and Fig. 11 are redundant and do not provide important information.

Author Response

Respected Reviewers,

We would like to thank the reviewers for taking the time to review our research. The valuable comments from the reviewers have helped us to re-evaluate our research findings and to extend our research. We have noted below the responses/changes we made in the article for each comment. In the revised version, many sections were modified and new sections were added. The number of references has also changed. Latexdiff did not perform effectively because the revision includes algorithms and equations, so we marked the changes in yellow in the updated PDF document.

Reviewer 2:

The manuscript presents an implementation of YOLOv5 on an NVIDIA Jetson Xavier device with an improved Intersection-over-Union (IIoU) loss function. The proposed IIoU is a combination of Generalized Intersection-over-Union (GIoU) (Ref. [42]) and Distance Intersection-over-Union (DIoU) (Ref. [43]), but it calculates the distance loss in a different way from Ref. [43]. In [43], the distance loss is computed from the Euclidean distance between the central points of the two bounding boxes, whereas the proposed IIoU calculates the distance loss along the x-axis and y-axis separately. The authors demonstrate the improvement experimentally in terms of mAP, but the improvement is not significant. The authors do not explain in detail why IIoU is theoretically much better than the other metrics. The contribution of the work is very limited.

I have the following concerns about the manuscript:

  • The proposed IIoU improves object detection performance (insignificantly), at the cost of a more complicated IoU loss function. This is not desirable for resource-constrained devices. Moreover, although the manuscript title says “for resource-constrained devices”, the manuscript does not address any issue related to resource-constrained devices.

Response: Thank you for the valuable comments. They helped us identify various areas for improvement in our research.

Sections 4.1 and 4.4 have been improved. We address the drawbacks of the existing loss metrics and how the gradient value is affected in various scenarios. For instance, IoU has a lower computational cost than the other metrics, but its gradient becomes 0 when there is no overlap between the bounding box and the target box during the training phase. This reduces the convergence speed, and the network in turn requires more iterations to converge, as seen in Figure 5. GIoU and DIoU address the drawbacks of IoU and help the network converge faster, but, as Figure 4 shows, in cases of concentric rectangles whose centers are aligned they degrade to the IoU metric and the convergence rate is affected. The proposed metric addresses these drawbacks and helps the network converge at an optimal speed.
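
To illustrate the degradation, here is a minimal PyTorch sketch (an illustration, not the paper's code): when one box lies entirely inside the other, the smallest enclosing box equals the outer box, the GIoU penalty term vanishes, and GIoU reduces to plain IoU, so the only remaining gradient comes from the IoU term.

```python
import torch

def giou_loss(pred, target):
    ix1 = torch.max(pred[0], target[0]); iy1 = torch.max(pred[1], target[1])
    ix2 = torch.min(pred[2], target[2]); iy2 = torch.min(pred[3], target[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    # smallest enclosing box C (the GIoU penalty is (C - union) / C)
    c_area = ((torch.max(pred[2], target[2]) - torch.min(pred[0], target[0])) *
              (torch.max(pred[3], target[3]) - torch.min(pred[1], target[1])))
    return 1.0 - (inter / union - (c_area - union) / c_area)

# pred centred inside target: C equals target, the penalty term is 0,
# and the loss (and its gradient) is identical to the plain IoU loss.
pred = torch.tensor([1.0, 1.0, 3.0, 3.0], requires_grad=True)
target = torch.tensor([0.0, 0.0, 4.0, 4.0])
giou_loss(pred, target).backward()
print(pred.grad)
```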

 

During the evaluation phase, target samples are estimated only on the basis of IoU thresholds ranging from 50% to 95%. The number of parameters in the baseline YOLOv5n6 trained with the IoU metric and in the improved YOLOv5n6 trained with the IIoU metric remained the same: 3.1 million. Both networks ran at approximately 25 FPS in real time, as shown in Figure 9. From our observations across architectures, the network size and the number of layers play a crucial role in building a faster network.

 

For resource-constrained devices, various lightweight models have been adopted in recent years, but the trade-off between accuracy and latency always remains a key consideration. Lines 41-43 provide a brief background on the concepts of accuracy and latency.

 

  • The implementation on the NVIDIA Jetson device does not strongly support the advantage of IIoU, and is thus redundant, unnecessary, and removable. The performance (Fig. 7 and Fig. 8) is obtained from the training infrastructure.

Response: NVIDIA Jetson was chosen as an example of a resource-constrained device and was used in this research to observe the capabilities of the loss metrics in real-time scenarios. For instance, SSD has been established as a benchmark for most object detection algorithms and has high accuracy scores, but in the real-time testing we carried out, we observed that the network was slow and produced inaccurate predictions, as shown in Figure 10.

 

  • The writing is not concise and informative. The authors spend a lot of space presenting less relevant (or too general) content, for example YOLO, ROS, and the PASCAL VOC dataset, but do not present IoU in sufficient detail and depth. The analysis of the performance is not sufficient or convincing. For example, in Fig. 7 and Fig. 8, IIoU delivers better results than the other three metrics, but those three deliver almost the same performance as one another (this is doubtful and should be explained). Some figures (Fig. 4, Fig. 6, Fig. 9) are not cited in the text. Fig. 5, Fig. 6, Fig. 9, and Fig. 11 are redundant and do not provide important information.

 

Response: The loss metrics section (Section 4) has been modified to provide an in-depth understanding of the drawbacks and the need for an improved metric. We removed Figures 7 and 8, as they showed a very minimal difference and did not provide relevant information to the literature. Figure 4 visually presents how the loss metrics behave in scenarios where the centers of the boxes align, and it also shows that the proposed metric behaves differently from the existing metrics.

 

A simulation experiment (Section 5) was carried out to analyze the performance of the loss metrics in various scenarios, and Figure 5 shows their convergence speed. We observed that the proposed metric has the lowest total loss value, with a convergence speed differing only minimally from that of DIoU (~0.8). We can also see that IoU and GIoU required more iterations to converge because of their drawbacks. The redundant figures mentioned were removed, and all remaining figures are cited in the text and explained.

 

We further extended the research to test the performance of the improved metric on a dense dataset, CGMU, where we observed a significant increase in overall AP. The evaluation results are tabulated in Tables 1 and 2.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper addresses an important topic, real-time object detection in edge computing. A new loss metric is proposed that is basically an improvement of an already known metric, Intersection over Union (IoU). My main question is how this new metric improves the mAP results relative to its higher computational cost, and whether the benefits justify it in real implementations. YOLO has been widely used, and its version 5 in particular shows very interesting results compared with other methods. However, this is already implemented by Ultralytics; I see this work as just its use in a ROS environment, which in itself is not something new. The potential novelty is restricted to the new loss metric, but in my opinion more experimental results need to be presented to demonstrate that it is a good cost-benefit solution.

At the end of the introduction, a paragraph should be added with the organization of the paper.

Author Response

Respected Reviewers,

We would like to thank the reviewers for taking the time to review our research. The valuable comments from the reviewers have helped us to re-evaluate our research findings and to extend our research. We have noted below the responses/changes we made in the article for each comment. In the revised version, many sections were modified and new sections were added. The number of references has also changed. Latexdiff did not perform effectively because the revision includes algorithms and equations, so we marked the changes in yellow in the updated PDF document.

Reviewer 3:                                

The paper addresses an important topic, real-time object detection in edge computing. A new loss metric is proposed that is basically an improvement of an already known metric, Intersection over Union (IoU). My main question is how this new metric improves the mAP results relative to its higher computational cost, and whether the benefits justify it in real implementations. YOLO has been widely used, and its version 5 in particular shows very interesting results compared with other methods. However, this is already implemented by Ultralytics; I see this work as just its use in a ROS environment, which is not something new. The potential novelty is restricted to the new loss metric, but in my opinion more experimental results need to be presented to demonstrate that it is a good cost-benefit solution.

Response: The IoU metric has the lowest computational cost compared with GIoU, DIoU, and the proposed IIoU metric, but it suffers a major drawback in cases of no overlap, where the gradient becomes 0; this slows the convergence of the network and requires more iterations to minimize the loss. GIoU and DIoU address this drawback, provide improved performance, and help the network converge faster. However, all of these metrics suffer a significant drawback when the centers of the bounding boxes align with each other: in such cases GIoU and DIoU degrade to the IoU metric, which affects the gradient calculation and leads to slower convergence.
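
For comparison, a minimal sketch of DIoU's distance penalty (following its standard definition, not the paper's code): the squared distance between box centers divided by the squared diagonal of the smallest enclosing box, which is 0 whenever the centers coincide.

```python
import math

def diou_penalty(box_a, box_b):
    """DIoU penalty: squared center distance over squared diagonal of
    the smallest enclosing box; boxes are (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    center_a = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    center_b = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    diag_sq = ((max(ax2, bx2) - min(ax1, bx1)) ** 2 +
               (max(ay2, by2) - min(ay1, by1)) ** 2)
    return math.dist(center_a, center_b) ** 2 / diag_sq

# Concentric boxes: the penalty is 0 and DIoU degrades to plain IoU,
# which is the drawback the proposed IIoU is meant to address.
print(diou_penalty((0, 0, 4, 4), (1, 1, 3, 3)))  # 0.0
```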

In the proposed metric, we consider the center coordinates along the x and y axes. For instance, for a rectangular box, the x-axis center point is (x_center, y_min) and the y-axis center point is (x_min, y_center). Figure 4 shows a sample of three cases in which two rectangles share a common center; they can lie one inside the other with different aspect ratios.

Consider Figure 4b as an example. In this particular scenario, x_center, y_center, and x_min of the boxes match perfectly, but the y_min values of the boxes differ. Since we evaluate the Euclidean distance between these coordinates, the gradient of the IIoU metric does not become completely 0. A similar explanation applies to Figure 4c, where, in addition to the centers aligning, y_min also aligns; x_min, however, does not, so the gradient acquires a value that allows the model to regress further. We calculated the loss values in Figures 4a, 4b, and 4c by assigning the boxes different areas and aspect ratios to verify our proposal. We also modified lines 219 and 220 to provide more insight into the coordinate points we consider.

The cost associated with the loss metric is incurred during the training phase; the real-time detection phase is based on IoU thresholds. From Figure 9, we can observe that the baseline and improved YOLOv5n6 have the same number of model parameters and both run at approximately 25 FPS. Section 5 describes a simulation experiment carried out on a synthetic dataset, and Figure 5 compares the convergence rates of the different metrics. We observed that the proposed metric has the lowest total loss error of all the metrics and converges at an optimal speed.

The YOLOv5 architectures were adopted in this research owing to their good trade-off between accuracy and latency in both real-time and lab-environment testing. The scope of the research was further extended to the dense object detection dataset CGMU, where we observed a significant increase in overall AP. The ROS platform was used as a common pipeline to control the frames generated by the camera and as an initial step towards integrating additional sensors with the camera.

At the end of the introduction, a paragraph should be added with the organization of the paper.

Response: A paragraph has been added at the end of introduction to explain the organization of the paper.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have addressed all the review comments.

Reviewer 2 Report

You addressed my comments. Please run an English spell check to eliminate any remaining errors. I have no further comments or suggestions.
