Article
Peer-Review Record

Efficient Distributed Mapping-Based Computation for Convolutional Neural Networks in Multi-Core Embedded Parallel Environment

Electronics 2023, 12(18), 3747; https://doi.org/10.3390/electronics12183747
by Long Jia 1, Gang Li 2, Meili Lu 3, Xile Wei 4 and Guosheng Yi 4,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 15 June 2023 / Revised: 17 August 2023 / Accepted: 29 August 2023 / Published: 5 September 2023

Round 1

Reviewer 1 Report (Previous Reviewer 3)

The manuscript proposes a novel multi-core ARM-based embedded hardware platform with a three-dimensional mesh structure to support decentralized algorithms. A distributed mapping mechanism is proposed to decentralize computation tasks in the form of a multi-branch assembly line, so that deep convolutional neural networks (CNNs) can be applied in an embedded parallel environment. My comments are as follows:

1. Please provide the application of the CNN in a step-by-step procedure or algorithmic form. This will enable other researchers to reproduce the algorithm, as well as its integration into an embedded parallel environment.

2. Please declare the algorithmic parameters, as well as the parameters employed in the computational experiments during benchmarking. This will improve the reproducibility of the proposed techniques by other researchers.

3. Please discuss potential improvements to the procedure in terms of alternative algorithmic implementations (besides the CNN), algorithmic hybridization, or modification of the current algorithm (for performance improvement).

4. Please provide a discussion of the robustness of the proposed approach in terms of its implementation in other embedded system applications.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report (New Reviewer)

Given the increasing demand for high-performance computing tasks in embedded systems, the chosen topic is interesting and relevant. However, this manuscript needs significant revision in several key areas before being considered for publication. The paper merely involves model construction and running programs but lacks an in-depth mathematical analysis of the model.

Firstly, the language used throughout the manuscript needs refinement. Currently, the language is not as precise or fluent as one would expect in a research paper. Several sentences are awkwardly phrased, leading to a lack of clarity in conveying the message. I strongly recommend that you have a native English speaker or a professional service edit the manuscript.

Secondly, the use of mathematical symbols in your manuscript is non-standard. This can confuse readers familiar with the standard notations. I urge you to revise the mathematical notation and ensure it aligns with the common usage in the field.

The manuscript could also benefit significantly from including complexity analysis and convergence experiments. These analyses will allow readers to understand the proposed strategies' computational efficiency and practical effectiveness. I suggest you refer to chapter 8 of the book "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville for details on conducting and presenting such analyses.

Finally, a detailed review of the relevant literature appears to be missing from the manuscript. In particular, more references to recent publications on similar topics would not only strengthen your arguments but also position your work within the existing body of knowledge. Again, the book "Deep Learning" by Goodfellow et al. can serve as a starting point for these references, as it provides extensive coverage of your research topic.

In conclusion, the manuscript requires substantial revisions on several fronts. While the subject matter is intriguing, the manuscript in its current state fails to meet the academic standards of our journal. 

Good

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report (New Reviewer)

1. If the target application of the proposed multi-core embedded processor is CNNs, the memory is insufficient. In general, DRAMs are required to store the large amount of weight data. This architecture can execute only tiny CNNs like LeNet. At a minimum, ResNet18 or MobileNet should be implemented, even for embedded systems.

2. The architecture seems to cause access conflicts at the RAMs in a BEM. I don't think 6 CUs and 2 RAMs make a good balance.

However, I think this paper is acceptable since the empirical results will give good examples for students who want to build such systems.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 4 Report (New Reviewer)

Who or what (software) maps CNNs to BEMs in Figure 1? Section 3.2, "CNN distribution mapping mechanism", covers various strategies for mapping, but for a particular CNN network, who decides what will be mapped to which BEM? Is there software for that, or is it done manually? What is the algorithm for this mapping?

For example, take one network and map it to a particular BEM, listing the steps to achieve this. What is mapped to the BEMs, and inside a BEM, what is mapped to each CU? What happens when there are more kernels than BEMs?
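To make the question concrete, here is a minimal sketch of one answer a reader might expect: a round-robin assignment of convolution kernels to BEMs, which handles the case of more kernels than BEMs by wrapping around. This is purely illustrative; the function and naming are assumptions, not the authors' actual mapping algorithm.

```python
# Hypothetical sketch (not the paper's scheme): assign convolution kernels
# to BEMs round-robin so that excess kernels wrap around and a BEM may
# process several kernels sequentially within one pipeline stage.

def assign_kernels_to_bems(num_kernels, num_bems):
    """Return a dict mapping each BEM index to the kernel indices it runs."""
    mapping = {b: [] for b in range(num_bems)}
    for k in range(num_kernels):
        mapping[k % num_bems].append(k)
    return mapping

# Example: 16 kernels onto 6 BEMs -> BEM 0 runs kernels [0, 6, 12], etc.
print(assign_kernels_to_bems(16, 6))
```

An answer in the paper would ideally pin down exactly such a rule (or the software that computes it), step by step.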

What is [1, 3, 105, 105] in Figure 4 (is the leading 1 the batch size)? Add an explanation for that.

The explanation of Figure 5 is not clear. Draw the picture differently, or explain it more clearly. For someone not versed in CNNs, what do the dashed lines even mean? (They mark the start and the end of the calculation for CU3, but that is not written.) There is no explanation for them. What about the other sliding windows? Rewrite that paragraph, and maybe change the picture (into more than one).

For Figure 5, how is the load balanced in the case of 5 CUs?
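One common way to answer this kind of load-balancing question is to split the output rows of a feature map as evenly as possible across the CUs. The sketch below is an assumption for illustration only, not the paper's actual scheme.

```python
# Hypothetical sketch: split the sliding-window output rows of one feature
# map across N CUs as evenly as possible (illustrative, not the paper's method).

def partition_rows(num_out_rows, num_cus):
    """Return per-CU (start_row, end_row) half-open ranges covering all rows."""
    base, extra = divmod(num_out_rows, num_cus)
    ranges, start = [], 0
    for cu in range(num_cus):
        count = base + (1 if cu < extra else 0)  # first `extra` CUs take one more row
        ranges.append((start, start + count))
        start += count
    return ranges

# Example: 28 output rows over 5 CUs -> loads of 6, 6, 6, 5, 5 rows.
print(partition_rows(28, 5))
```

With 5 CUs the loads differ by at most one row, so the imbalance is bounded; the paper should state whether its scheme gives a similar guarantee.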

Add a comment on Figure 12 explaining why there is no benefit for fully connected layers.

Add a sentence stating that the experiments are done only for 105×105 input.

Is there a benefit to distributing 32×32 input?

The authors wrote:

"The C1 and P1 layers are scattered according to the size of the convolution kernel, and then a single feature map is scattered into two parts and mapped into 12 computing units."

Where is P1? What is scattering?

What is Figure 11 about? I completely fail to understand it. What are the layer, conv, pool, full, and no-mapping methods? The whole paragraph about this figure, from my point of view, has no connection to the data shown in the figure.

The drop in accuracy from 88.2% to 83.3% is very large. From my point of view, such a large drop is not acceptable. How many calculations are needed for a 2% drop? Proceed with experiments at that value. What k is used for the calculation?
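The trade-off behind the question about k can be illustrated with a k-term truncated Taylor series for exp(x), a common way to approximate exponentiation on small MCUs. The paper's actual approximation and its k are not specified here; this is only a sketch of how error shrinks as k (and the per-call multiply-add count) grows.

```python
import math

# Hypothetical sketch: approximate exp(x) by a k-term Taylor series around 0.
# More terms -> smaller relative error, at the cost of more multiply-adds.

def exp_taylor(x, k):
    """k-term truncated Taylor series of exp(x): sum of x**n / n! for n < k."""
    total, term = 0.0, 1.0
    for n in range(k):
        total += term
        term *= x / (n + 1)
    return total

for k in (3, 5, 8):
    approx = exp_taylor(1.0, k)
    rel_err = abs(approx - math.e) / math.e
    print(f"k={k}: approx={approx:.6f}, relative error={rel_err:.2e}")
```

Reporting a curve like this (accuracy vs. k) would let the authors justify the chosen operating point quantitatively.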

What will happen with other CNNs, for example MobileNet or EfficientNet? Why not experiment with these more modern networks for edge computing? Give a reason why MobileNet is not used in the experiments, for example.

Connect the terms "convolution kernel mapping", "single convolution", and "fully connected distributed mapping" to the explanations in Section 3.2.

Explain Figure 7. What is C1 6@28*28? What is F5 120? Is this figure taken from some other paper?

I do not think I am qualified to comment on this. To me, the text quality seems good.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report (New Reviewer)

The manuscript in its current state fails to meet the academic standards of our journal.

OK

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 4 Report (New Reviewer)

The authors have clarified all my objections. The paper can be accepted.

I suggest the authors add information from their responses to my objections to the paper. It seems clearer to me than the actual text in the paper.

Author Response

Thank you for your suggestion. Unfortunately, due to the limited space in the article, we are unable to present all of the explanations given to you. After receiving your proposal, we extracted some important viewpoints from the response and added them at the corresponding positions in the article (Lines 411-415, 482-486, 493-495, 560-563). We hope our supplement satisfies you. Thank you again for your valuable feedback.

 

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The authors present the implementation of a parallel computing system for ANN execution based on single-core ARM Cortex-M4 microcontrollers (STM32F4).

The authors state: "single-core ARM chip does not have parallel computing capabilities"

Yes, this is true. But there is a multitude of chips available implementing multiple ARM cores on a single die (see, e.g., https://en.wikipedia.org/wiki/ARM_big.LITTLE). Your assumption that there are no parallel ARM-based products is wrong.

 
The authors conclude based on the above (wrong) observation: "Therefore, it is of great significance to develop a low-cost, low-power, easy-to-develop, and parallel computing embedded hardware platform for edge computing."

While this is in principle correct, it has been attempted many times already. You failed to properly research and present related work. Therefore, your work is a documentation of an engineering project. I cannot see a scientific contribution.

Line 51: "RAM architecture series microcontrollers are widely used in the industrial field"

Typo: you mean ARM architecture, don't you? Please be more careful in proofreading.

 
"In this work, we try to build an ARM-based CNN parallel computing hardware platform to provide the possibility of embedded edge computing for neural networks"

The same as above: this has been attempted many times, and respective products are available (see, e.g., https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Edge_TPU).

"We treat a single ARM as the smallest compute unit (CU)."

What is "single ARM"? Further down you mention the STM32F4. According to the product webpage (https://www.st.com/en/microcontrollers-microprocessors/stm32f4-series.html), "The STM32F4 series consists of eight compatible product lines". Please be more accurate.

 

The English is readable but often inaccurate. The major flaw is the incorrect reasoning.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposes a multi-core ARM-based platform to optimize the execution of convolutional neural networks (CNNs). It solves memory problems with a dimensionality reduction, and long execution times by splitting the tasks or operations among many cores.

 

I have two main issues with this paper. On the one hand, none of the proposals is sufficiently described, in the sense that a reader who wishes to implement something similar should be able to do so with just the description given in this paper. On the other hand, I fail to see the benefit of this platform compared with, for example, a multi-core ARM processor such as the one a Raspberry Pi could have. I cannot accept the paper without these two issues being solved or clarified.

 

A proper description of the CNN is missing. Therefore, unless the reader is quite familiar with CNNs, it is hard to understand the need for this platform, the operations that can be executed in parallel, and how the data can be merged again.

 

What is inter-layer expansion in the context of this paper? I assume it has something to do with the distribution of the different BEMs in the network, but it is unclear to me.

 

The paper mentions that, due to the large storage space of the ROM, historical data can be stored. Again, it should be clarified how much data can be stored and what the cost of accessing it is.

 

Regarding the external resource sharing and access in a loop, it is not clear at all which CU would be the next to get access to the IOCOM and, consequently, to the SRAM. As designed, it seems that one CU can acquire access to the IOCOM and never release it. Further, the paper does not evaluate how long one CU can wait for a resource and how much this affects the final time.
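One standard remedy for the starvation scenario raised here is a bounded-hold round-robin arbiter: an owner is forced to yield after a fixed number of cycles, so every requesting CU is eventually served. The sketch below is an assumption of what such a policy could look like (IOCOM and CU follow the paper's terminology; the policy itself is not from the paper).

```python
# Hypothetical sketch: bounded-hold round-robin arbitration so no CU can hold
# the shared IOCOM forever. Illustrative only; not the paper's actual design.

class RoundRobinArbiter:
    def __init__(self, num_cus, max_hold_cycles):
        self.num_cus = num_cus
        self.max_hold = max_hold_cycles
        self.owner = 0   # CU currently allowed to use the IOCOM
        self.held = 0    # cycles the current owner has held it

    def tick(self, wants_access):
        """Advance one cycle; wants_access[cu] is True if that CU has a request.

        Returns the CU granted the IOCOM this cycle, or None if all are idle.
        """
        # Force a handoff after max_hold cycles, or when the owner goes idle.
        if self.held >= self.max_hold or not wants_access[self.owner]:
            for step in range(1, self.num_cus + 1):
                nxt = (self.owner + step) % self.num_cus
                if wants_access[nxt]:
                    self.owner, self.held = nxt, 0
                    break
            else:
                self.held = 0
                return None  # nobody is requesting
        self.held += 1
        return self.owner

# Usage: 6 CUs per BEM as in the paper, with a made-up 4-cycle hold limit.
arb = RoundRobinArbiter(num_cus=6, max_hold_cycles=4)
```

Under such a scheme, the worst-case wait is bounded by (num_cus - 1) * max_hold cycles, which is exactly the kind of quantitative statement the review asks the authors to provide.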

 

Is the RAM shared between the different computing boards? If so, why does the RU need to both write the data to the RAM and then transmit it to the RU in the next layer?

 

Real-time means that something executes within some time boundaries; therefore, claiming that something is not real-time if it executes in more than 1 second is wrong. First, the time boundaries should be defined; then one can claim whether the application meets these requirements or not.
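The point can be stated in two lines of code: real-time is a relation between an execution time and an application-defined deadline, not an absolute threshold. The deadline value below is a made-up example, not from the paper.

```python
# Illustrative sketch of the point above: "real-time" means meeting a stated
# deadline. The 250 ms deadline is a hypothetical example, not from the paper.

DEADLINE_S = 0.250  # assumed application requirement: 250 ms per frame

def meets_realtime(exec_time_s, deadline_s=DEADLINE_S):
    """An execution is real-time iff it finishes within the defined deadline."""
    return exec_time_s <= deadline_s

# A 0.9 s run is fine against a 1 s deadline; a 0.3 s run misses a 250 ms one.
print(meets_realtime(0.9, deadline_s=1.0))  # True
print(meets_realtime(0.3))                  # False
```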

 

The paper mentions that the accuracy of the results is reduced due to the approximation of the exponentiation function; however, it does not mention by how much it is degraded.

 

Finally, it is not clear at all what the limitations of the proposed platform are in terms of maximum input size, number of layers, neurons per layer, and number of operations per second. Note that the paper assumes that data has to be processed quickly (my assumption is as quickly as the data acquisition system produces it), but no further details are given.

 

Typos: "RAM architecture"; Figure 4 says "feature map".

Figure 11 does not have any unit on the time axis.

The paper is quite readable, but some sentences and constructions need to be reviewed so they are clear. 

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 3 Report

The authors propose a novel multi-core ARM-based embedded hardware platform with a three-dimensional mesh structure to support decentralized algorithms. A distributed mapping mechanism is proposed to deploy deep convolutional neural networks (CNNs) in this embedded parallel environment, efficiently decentralizing computation tasks in the form of a multi-branch assembly line. My minor comments are as follows:

1. The experiments verify that the neural network parallel computing hardware platform can implement the CNN model, with advantages in low power consumption, scalability, and low cost. Please provide explanations of possible setbacks or weaknesses of the proposed strategy or platform implementation, e.g., algorithmic complexity, implementation cost, etc.

2. Please include explanations of the possible weaknesses of implementing the dimensionality reduction method (e.g., data loss). In addition, please include the method the authors employ to measure the performance/effectiveness of the dimensionality reduction approach from a data-loss standpoint.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Dear authors,

I appreciate your feedback. Unfortunately, the revised version of your paper does not address my raised concern: "You failed to properly research and present related work. Therefore, your work is a documentation of an engineering project."

I therefore stick to my original rating: "your work is a documentation of an engineering project. I cannot see a scientific contribution."

I see two ways to improve the paper:
- either you properly research existing solutions for low-cost low-power distributed machine learning and show that none of them meets the design point of your work (this survey might serve as a starting point: S. S. Saha, S. S. Sandha and M. Srivastava, "Machine Learning for Microcontroller-Class Hardware: A Review," in IEEE Sensors Journal, vol. 22, no. 22, pp. 21362-21390, 15 Nov.15, 2022, doi: 10.1109/JSEN.2022.3210773.),
- or you do not claim the architecture itself as a novelty and focus on methods to adapt neural networks to specific ARM instruction set architectures. But again, you first need to research existing solutions and relate your work to them. The following publication might serve as a starting point: L. Deng, G. Li, S. Han, L. Shi and Y. Xie, "Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey," in Proceedings of the IEEE, vol. 108, no. 4, pp. 485-532, April 2020, doi: 10.1109/JPROC.2020.2976475.

With best regards

Moderate editing of English language required

Author Response

Please see the attachment

Author Response File: Author Response.pdf
