Article
Peer-Review Record

Embedded Object Detection with Custom LittleNet, FINN and Vitis AI DCNN Accelerators

J. Low Power Electron. Appl. 2022, 12(2), 30; https://doi.org/10.3390/jlpea12020030
by Michal Machura, Michal Danilowicz and Tomasz Kryjak *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 4 April 2022 / Revised: 6 May 2022 / Accepted: 13 May 2022 / Published: 20 May 2022
(This article belongs to the Special Issue Hardware for Machine Learning)

Round 1

Reviewer 1 Report

This paper presents a set of works related to deep neural network (DNN) accelerators. While it includes many works, it has several drawbacks to be addressed. Please refer to the following for more details.

  1. In line 99, “reminder” must be “remainder”.
  2. In line 101, the section number is missing.
  3. In Figures 5(b) and 5(d), “U” in “IoU” needs to be properly capitalized.
  4. Many Figures in Section 5 are too abstract. They need to be elaborated further.
  5. The schemes in Figure 1 are well known, and the results of implementation are rather obvious. While the confirmation of such facts does involve a great deal of hard work, as presented in the paper, it is questionable whether this is a novel contribution.
  6. Much of the content is from prior works, and this paper's own contribution needs to be clearly highlighted. From a similar viewpoint, the paper is unnecessarily verbose.
  7. In Figure 17, how can the energy consumption be reduced by increasing the operating frequency?
  8. The proposed architectures need to be compared with the existing works in the literature.

Author Response

Dear Reviewer,

Thank you for your valuable comments. Below please find our replies and comments. 

Issues 1, 2, 3 (language errors):

We have corrected the errors and have also proofread the manuscript once more. We hope that the language is now of a sufficient standard.

Issue 4: Many Figures in Section 5 are too abstract. They need to be elaborated further.

We have modified the mentioned figures:

  • Figure 10 - Input Layer:
    • Splitter Unit - the block diagram shows how the data representation was changed.
  • Figure 11 - Depthwise Layer:
    • Sliding Window Unit - the block diagram shows how the context is generated.
    • Weight Loading Unit - the block diagram shows how the weights are read through the serial-parallel register.
  • Figure 13 - Pointwise Layer:
    • Cyclic Streamer Unit - the block diagram represents reading data from ROM. The bias and normalisation weights read at the beginning (of each filter) are written to a serial-parallel register.
    • Point Streamer Unit - the block diagram represents how the input feature map elements are read, with the address determined by appropriately incremented counters.
    • Max Pooling Unit - the scheme shows how, for each input channel, the larger value of two consecutive columns and then of two consecutive rows is selected (a behavioural sketch follows this list).
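For illustration, the following minimal NumPy model reproduces this two-stage selection (a behavioural sketch written for this response, not the actual RTL):

```python
import numpy as np

def max_pool_2x2(fmap: np.ndarray) -> np.ndarray:
    """2x2 max pooling, stride 2, applied per input channel (c, h, w).

    Mirrors the two-stage selection described above: first the larger
    value of every two consecutive columns is kept, then the larger
    value of every two consecutive rows.
    """
    # Stage 1: compare pairs of consecutive columns.
    cols = np.maximum(fmap[:, :, 0::2], fmap[:, :, 1::2])   # (c, h, w/2)
    # Stage 2: compare pairs of consecutive rows.
    return np.maximum(cols[:, 0::2, :], cols[:, 1::2, :])   # (c, h/2, w/2)

x = np.random.rand(3, 8, 8).astype(np.float32)
assert np.allclose(max_pool_2x2(x), x.reshape(3, 4, 2, 4, 2).max(axis=(2, 4)))
```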

 

The Max Finder Layer's diagram was not expanded due to its extensive sequential part.

Issue 5: The schemes in Figure 1 are well known, and the results of implementation are rather obvious. While the confirmation of such facts does involve a great deal of hard work, as presented in the paper, it is questionable whether this is a novel contribution.

We agree with the comment that the accelerator diagrams are well known, but we have included them to better structure the discussion. We also agree that the results of network implementations on particular types of accelerators are quite predictable, e.g. a sequential accelerator may (but does not necessarily) obtain worse results than a coarse-grained one. However, the purpose of this paper was to demonstrate in practice how large these differences might be (in the case under consideration, it turned out that they were not that large).

Furthermore, we deliberately chose a relatively "small" platform to show how the given approaches perform with significantly limited logic resources. In particular, for the coarse-grained approach, it would be possible to select a device "large" enough to accommodate the implementation of each network and always give better results than the others, but this would not be a valuable comparison.

However, in our opinion, the most important aspect and novelty of our work is the comparison of the custom (LittleNet), FINN (YoloFINN) and Vitis AI approaches. To the best of our knowledge, such a comparison has not been presented before in the scientific literature. Additionally, the FINN user community was expecting a research paper with a comparison between the tool and Vitis AI.

Issue 6: Much of the content is from prior works, and this paper's own contribution needs to be clearly highlighted. From a similar viewpoint, the paper is unnecessarily verbose.

Due to the comparative nature of the paper, much of the content and the concepts are based on previous work. We have highlighted our contribution by adding the following sentences to the introduction (as part of "The main contribution"):

 

  • a comparison between FINN and Vitis AI – we are not aware of other papers that compare these environments. A number of different acceleration methods – including FINN – are compared in [10]; however, the Vitis AI environment was not available when that paper was written. The authors of [11] compare Vitis AI with a GPU implementation, using selected neural network architectures adapted to the detection task. Our comparison uses the same network architecture on the same device.
  • the proposal of two convolutional network architectures, LittleNet and YoloFINN, for the detection task, optimised for a reconfigurable embedded computing platform. For the first one, we applied multipliers to the YOLO width/height channels for anchor adjustment – a solution to the anchor box sizing problem (a simplified sketch of this idea is given after this list). We are not aware of any work using similar approaches.
  • a coarse-grained accelerator for the LittleNet network. We used caching of the results of successive layers, as well as multiple use of memory blocks by selected accelerators. This allows access to the full input feature map of a given accelerator and limits the transfers to and from external memory. Our accelerator also supports a multi-depthwise convolution operation, which is a rather unique feature.
  • a formulation of optimisation rules to reduce the energy consumption of a system that processes a finite dataset. The method takes into account the frequency and the degree of parallelisation of the computations.
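For the anchor multipliers mentioned above, the following Python fragment gives a simplified illustration of the general idea (the function name, shapes, and the exact placement of the multipliers were chosen for this response and are not the exact formulation from the paper):

```python
import numpy as np

def decode_wh(tw, th, anchors, mults):
    """Standard YOLO width/height decoding, b = p * exp(t), extended
    with trainable per-anchor multipliers that rescale each prior.

    tw, th  : raw network outputs, shape (A,)
    anchors : (A, 2) anchor (w, h) priors
    mults   : (A, 2) trainable multipliers for anchor adjustment
    """
    pw, ph = anchors[:, 0], anchors[:, 1]
    mw, mh = mults[:, 0], mults[:, 1]
    # The multipliers let training stretch or shrink an ill-fitting prior.
    return pw * mw * np.exp(tw), ph * mh * np.exp(th)

anchors = np.array([[0.10, 0.15], [0.30, 0.40]])
mults = np.ones_like(anchors)  # identity until adjusted by training
bw, bh = decode_wh(np.zeros(2), np.zeros(2), anchors, mults)
```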



The designed LittleNet coarse-grained accelerator is based on well-known methods with PE (processing element) modules. However, a more detailed discussion was required to allow a comparison of the designed architecture, which differs slightly from those described in the literature (FINN and DNNBuilder are fine-grained; Vitis AI and HybridDNN are sequential; DNNExplorer is coarse-grained, but it has only two parts and is based on connecting two different accelerators rather than on "truly" coarse-grained processing like ours). We also wanted to describe the various elements quite thoroughly to avoid misunderstandings and inaccuracies and to make it easier to repeat our experiments. Hence the rather extended form of this part of the paper - Sections 4, 5, and 6. On the other hand, in the case of YoloFINN and Vitis AI, we used ready-made tools from Xilinx, hence the description of the implementation is significantly shorter.



Issue 7: In Figure 17, how can the energy consumption be reduced by increasing the operating frequency?

This conclusion follows from the relationships presented earlier (Subsections 6.1-6.2). It is crucial to emphasise here that we consider the total energy (CPU + FPGA) needed to process a fixed number of images. We have added an additional explanation in the caption of Figure 17:

“... Applying higher frequencies and higher parallelism allows us to decrease the total energy consumption (CPU+FPGA) for the processing of a fixed-size dataset ...”

The main idea here is to reduce the running time of the entire processing system, including the CPU (in general, the elements outside the programmable logic). Using a higher accelerator frequency (the CPU frequency remains fixed and independent of the programmable logic) increases the throughput of the entire system. This reduces the time required to process a given amount of data and thus shortens the activity time of the CPU and other elements not directly involved in the processing. As a result, these elements dissipate less energy over the whole process.
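To make this reasoning concrete, below is a hypothetical first-order model in Python (the power and throughput figures are illustrative assumptions, not our measurements). As long as the fixed-power part (CPU, static power) dominates, raising the accelerator clock reduces the total energy for a fixed dataset:

```python
# Hypothetical first-order model: FPGA dynamic power grows roughly
# linearly with clock frequency f, but the time to process a fixed
# dataset shrinks as 1/f, so the energy drawn by the frequency-
# independent part (CPU, static power) shrinks with it.

def total_energy(f_mhz, n_images,
                 p_fixed=1.5,           # W, CPU + other fixed-power parts (assumed)
                 p_fpga_per_mhz=0.005,  # W/MHz, assumed dynamic-power slope
                 fps_per_mhz=0.4):      # assumed linear throughput scaling
    t = n_images / (fps_per_mhz * f_mhz)           # processing time [s]
    return (p_fixed + p_fpga_per_mhz * f_mhz) * t  # E = P * t [J]

for f in (100, 200, 300):
    print(f, "MHz ->", round(total_energy(f, 10_000), 1), "J")
# 100 MHz -> 500.0 J, 200 MHz -> 312.5 J, 300 MHz -> 250.0 J
```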



Issue 8: The proposed architectures need to be compared with the existing works in the literature.

The fundamental goal of our project was to compare three architectures and approaches to accelerator design: custom, FINN and Vitis AI. For this purpose, we proposed the LittleNet (custom) and YoloFINN (FINN) architectures and their equivalents for the Vitis AI accelerator. The former was inspired by SkyNet and the latter by UltraNet. Furthermore, the LittleNet architecture was primarily focused on energy efficiency while maintaining an acceptable classless detection performance.

Due to the above, a direct comparison of the analysed architectures with existing works is not straightforward. In addition, to our knowledge, there is no reliable data on the effectiveness of methods similar to ours, and comparing against much larger models would be pointless. Thus, obtaining a meaningful comparison would require a number of time-consuming experiments (e.g., training different models on a common dataset and possibly also implementing them).

However, to try to address the remark, we have included a comparison of our two networks with similar architectures, i.e. SkyNet, UltraNet, YOLOv3-tiny and MobileNetV2. We have considered the number of parameters and the unique features of each model in the new Section 10, "Comparison with similar network architectures".



Best regards,

Michał Machura, Michał Daniłowicz and Tomasz Kryjak

Reviewer 2 Report

The paper shows different implementations of CNN accelerators on FPGA devices.

The manuscript is very detailed, the state of the art was extensively evaluated, and the results are clearly presented.

However, I have a few amendments.

 

  • Why did the authors choose a generic object detector without classification? What is the application or the rationale for such a choice?
  • Line 101. There is a Section reference missing (I suppose Section 3)
  • Line 187. Typo: "a embedded" -> "an embedded".
  • Fig. 4. A legend should be added to the figure to make it easier to read (DW, BN, etc.)
  • Sec 6.4. Typo: "Concussions" -> "Conclusions"
  • Tables 5, 7 and 8. A power evaluation, not only an energy one, should be added.

Author Response

Dear Reviewer,

Thank you for your valuable comments. Below please find our replies and comments. 

Issue 1: Why did the authors choose a generic object detector without classification? What is the application or the rationale for such a choice?

The aim of this paper was to compare different methods of DCNN hardware implementations. For this purpose, we have chosen a relatively simple task - classless detection. It is more difficult than typical classification, but easier than full detection (which requires focusing more attention on model training).

Classless detection can be used for tasks where it is not important what is detected or there is one type of object, e.g. vehicles - bicycle, car, boat. Such a task is also used in the DAC SDC competition (https://byuccl.github.io/dac_sdc_2022/), which was one of the inspirations for the discussed work.

Issue 2: Line 101. There is a Section reference missing (I suppose Section 3)

Corrected.

Issue 3: Line 187. Typo: "a embedded" -> "an embedded".

Corrected.

Issue 4: Fig. 4. A legend should be added to the figure to make it easier to read (DW, BN, etc.)

We have explained all abbreviations.

Issue 5: Sec 6.4. Typo: "Concussions" -> "Conclusions"

Corrected.

Issue 6: Tables 5, 7 and 8. A power evaluation, not only an energy one, should be added.

Initially, we did not record the power measurements. However, it was possible to determine this value based on the size of the test set, the throughput obtained, and the energy measurement. We have placed the power values determined in this way in additional columns (a simple derivation is sketched below).
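The derivation is simple arithmetic; a short sketch with purely illustrative numbers (not the measured values from the paper):

```python
n_images = 10_000     # size of the test set (illustrative value)
fps = 80.0            # measured throughput [frames/s] (illustrative)
energy_j = 250.0      # measured energy for the whole set [J] (illustrative)

t_s = n_images / fps       # processing time: 125 s
power_w = energy_j / t_s   # average power: 2.0 W
```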

Adding the information suggested by the reviewer allows us to show the relationship between the increasing accelerator power and the decreasing energy consumption for the derived optimisation formulas.

Best regards,

Michał Machura, Michał Daniłowicz and Tomasz Kryjak

Round 2

Reviewer 1 Report

The concerns have been addressed.

Author Response

Dear Reviewer,

Thank you very much for accepting our manuscript.

All the best,

Tomasz Kryjak
