Peer-Review Record

mNet2FPGA: A Design Flow for Mapping a Fixed-Point CNN to Zynq SoC FPGA

Electronics 2020, 9(11), 1823; https://doi.org/10.3390/electronics9111823
by Tomyslav Sledevič * and Artūras Serackis
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 24 September 2020 / Revised: 27 October 2020 / Accepted: 30 October 2020 / Published: 2 November 2020
(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))

Round 1

Reviewer 1 Report

The paper is well written. There are a few points which require some modification:

  1. Section 3.1, Training of CNN: the authors have used the Matlab Deep Learning Toolbox for CNN training. Is there any specific reason for choosing this toolbox? What are the shortcomings/limitations of this method?
  2. Pg. 5, Line 161: the parameters of the training phase are stored in CSV files. What is the maximum size of these files, and is there any limitation on the size that the Python function can read? Also, how much latency is involved in reading the files and truncating the precision?
  3. Please discuss what modifications/improvements can be made to reduce the number of DMA transfers and to use the BRAM more efficiently.
 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report


In this paper, the authors propose a framework to deploy CNNs on a low-cost FPGA. The proposed design accelerates fixed-size convolutions and FC layers. Although the results seem on par with similar frameworks, the proposed design lacks flexibility. Many issues should be addressed if this paper is to be published.
a) There are several constraints, such as the fixed size/stride of the kernels and the number of layers.
b) Similar designs can accept pre-trained weights from common deep learning frameworks (Caffe, Tensorflow) or even load the whole network architecture. In this design, the authors train the model in Matlab and use a proxy program in Python to generate the instructions. There are not enough details about this process.
c) A comparison table of the accuracies of each design with quantization should be provided.
e) The title is wrong. The authors state "design flow ... SoC FPGA". In reality, the paper shows the implementation of a CNN on the ZedBoard. This cannot be generalized to every SoC FPGA. The authors even mention that they target low-cost SoCs only, so the title is too vague and does not accurately describe the paper.
f) The motivation is weak. The authors state that ARM architectures are not able to meet the required performance, but they do not say whether other CPUs, such as Intel or AMD, can. Why do they compare with ARM? It would be the same flaw in logic if the authors said "Intel Atom CPUs cannot meet the required performance". If they want to compare with a low-end CPU, they should motivate this choice better.
g) Some claims of the authors are misleading. In line 31: "A high-end Titan ... consumes five times more power". This is a misleading statement that leads the audience to believe that GPUs are bad. But if you read reference [4] carefully, you can see that "the GPU consumes five times more power, but also has five times more performance than FPGAs". Taking something out of context just to support your claims is very bad in research. Objectivity should be the mantra of a good publication.
h) The word 'tricks' is inappropriate in scientific work; it is very informal and should be replaced with 'techniques'.
i) The authors do not motivate why a low-cost SoC FPGA should be used. In the beginning, they introduce large FPGAs, like the Arria 10 (line 32); then they suddenly focus on low-cost devices (line 62).
j) The main contributions (the authors wrote 'contribution' even though there are three) on line 74 say nothing about the design flow, which is the title of the paper. Is the design flow not a contribution? The contributions are also very vague. For example, a Python program is mentioned, but in reality it is not discussed in detail; only the hardware core is discussed in detail. Everything else lacks the necessary details.
k) In this work, the PC seems to play a significant role, even at runtime (it sends images and reconfiguration characteristics). This small SoC therefore does not seem to operate stand-alone. If this is true, then the PC's energy consumption should also be taken into account.
l) From line 131, various specific numbers are given (for example, an input of 224 x 224 RGB), most of them without any explanation. Furthermore, the authors do not specify how some numbers are obtained, and they are specific to the ZedBoard used. This is one of my main objections: this paper does not give a design flow for every SoC; it gives a specific implementation for the ZedBoard. If the authors wanted a design flow, they would present the generic design flow first and then use the ZedBoard as an example of how it is applied. Another objection is that the ZedBoard is not representative of low-cost SoC FPGAs; there are even lower-end CPU boards. The authors should explain why this board represents the low-cost SoC FPGA class. The Altera DE-SoC has even fewer resources and is also an SoC.
m) At line 169, it seems that the PC sends one image and gets the result back, which means that the process is not pipelined or interleaved, possibly incurring serious performance penalties. Is this process pipelined? If not, then I believe a major improvement with pipelining should happen before this is published. All modern accelerators should support pipelining or interleaving; if this one does not, we should skip it and use something else.

n) Line 248: duplicated word ('the the').
o) At line 312, the methodology seems to involve trial-and-error experiments to find the optimum number of AXI streams. This is not a good approach. The theoretical computation should be done using equations, and the conclusions then verified with the implementation.
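For illustration, the kind of analytical sizing the reviewer asks for might take a form like the following (a generic sketch, not taken from the paper; all symbols are assumptions):

```latex
N_{\text{streams}} = \left\lceil \frac{B_{\text{required}}}{B_{\text{stream}}} \right\rceil,
\qquad
B_{\text{stream}} = w_{\text{AXI}} \cdot f_{\text{AXI}}
```

where B_required is the bandwidth the convolution core must sustain, w_AXI the stream width in bytes, and f_AXI the AXI clock frequency; the implementation would then be used only to verify the predicted N_streams.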
p) The results are very limited. The authors should compare with [4] or other state-of-the-art approaches. Currently, we cannot draw any conclusions.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors present a novel design flow for implementing a CNN on a Xilinx SoC FPGA.

The work is well organized, but I do not understand the novelty of the proposed solution with respect to other flows.

FPGA mapping of CNNs has been one of the most popular topics of recent years. CNNs have been widely covered by a large number of studies; as a consequence, the comparison to just one solution (NullHop-FPGA) turns out to be insufficient.

Another aspect regards the hardware design of the CNN. How has it been implemented? Did the authors write the HDL code by hand, or did they use some IP?

Finally, when discussing hardware accelerators, it is good practice to specify the speed-up factor with respect to a standard solution. Acceleration with respect to what? For instance, a comparison with a GPU implementation would be fine.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

This paper has been significantly improved and all issues have been taken care of. Thus, I believe it is suitable for publication.

Author Response

We thank the reviewer for his time and recommendations.

Reviewer 3 Report

The authors have significantly improved their work, but some concepts are still not clear. I seem to understand that you developed a general CNN core in VHDL on the FPGA and then configure it using the mNet2FPGA flow. Is this correct, or do you sometimes re-synthesize a new core, depending on the configuration?

If you configure a general core, please give some details about its structure; otherwise, if I am wrong, please clarify this aspect. I miss the passage from Matlab training to core implementation.

Author Response

We thank the reviewer for his time and recommendations. We have applied modifications to the manuscript accordingly and discuss each of them in more detail below. All changes are marked in red in the revised version of the paper.

The authors have significantly improved their work, but some concepts are still not clear. I seem to understand that you developed a general CNN core in VHDL on the FPGA and then configure it using the mNet2FPGA flow. Is this correct, or do you sometimes re-synthesize a new core, depending on the configuration?

ANS: Thank you for the remark. It is correct: we do not re-synthesize the core. The general CNN core stays the same for different types of CNN. For example, to change the configuration of the core to another type of CNN, we upload only the new parameters (kernel weights, neuron weights, biases) and a list of instructions from the PC to the board.

If you configure a general core, please give some details about its structure; otherwise, if I am wrong, please clarify this aspect.

ANS: Thank you for the remark. In the "4. Design" section, we present the core structure (Fig. 10). The parameters (kernel weights with batch-normalization coefficients and biases) for that core are loaded from the convolution core configuration memory (Fig. 5). We have extended subsection "4.4. Multi-Channel Convolution Core".

I miss the passage from Matlab training to core implementation.

ANS: Thank you for the remark. All the parameters of the network (obtained after training in Matlab) are passed through the so-called conversion/scheduling program (implemented in Python), briefly presented in section "3.3. Conversion/Scheduling Program". That program knows the hardware structure of the CNN core. Its main job is to convert the Matlab-trained CNN into a format understandable by the core. Because the hardware core has a limited number of input/output channels, the processing of the CNN is sectioned into portions by the scheduling program. We have added a schematic diagram to Section 3.3.
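The two operations the authors describe here, quantizing Matlab-trained floating-point parameters to the core's fixed-point format and sectioning a layer's channels into portions that fit the core, can be sketched as follows. This is a hypothetical illustration only: the function names, the Q8.8 format, and the channel counts are assumptions, not the authors' actual conversion/scheduling program.

```python
import numpy as np

def to_fixed_point(weights, frac_bits=8, total_bits=16):
    """Quantize floating-point weights to signed fixed-point integers
    (Q8.8 by default) by scaling, rounding, and saturating."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return np.clip(np.round(weights * scale), lo, hi).astype(np.int32)

def schedule_channels(num_channels, core_channels):
    """Split a layer's channels into portions that fit the core's
    limited number of input/output channels, one pass per portion."""
    return [(start, min(start + core_channels, num_channels))
            for start in range(0, num_channels, core_channels)]

# 0.5 -> 128 and -1.25 -> -320 in Q8.8
print(to_fixed_point(np.array([0.5, -1.25])))
# a 10-channel layer on a hypothetical 4-channel core needs three passes
print(schedule_channels(10, 4))
```

The quantized integers and the per-pass channel ranges would then be serialized, together with the instruction list, into whatever binary layout the hardware core expects.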

Round 3

Reviewer 3 Report

The authors have greatly improved the work and have answered my requests exhaustively.
