**Citation:** Ferianc, M.; Fan, H.; Manocha, D.; Zhou, H.; Liu, S.; Niu, X.; Luk, W. Improving Performance Estimation for Design Space Exploration for Convolutional Neural Network Accelerators. *Electronics* **2021**, *10*, 520. https://doi.org/10.3390/electronics10040520

Academic Editor: Alexander Barkalov

Received: 28 December 2020; Accepted: 10 February 2021; Published: 23 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**1. Introduction**

Recently, neural networks (NNs) have demonstrated superhuman performance in a multitude of tasks, such as image classification [1], object detection [2], semantic segmentation [3] and natural language processing [4]. NNs are also making their way into real-life practical applications, such as medical diagnostics [5], autonomous driving [6] and aviation [7–9]. While in medicine the applications of NNs are limited primarily by their algorithmic performance, in other practical scenarios, such as autonomous driving, their hardware performance needs to be considered in addition to their decision-making capabilities. Hardware performance is usually considered in terms of latency or energy efficiency, which is especially crucial when aiming at real-time response rates. While it is indeed possible to run NNs on stock hardware platforms such as central processing units (CPUs) or graphics processing units (GPUs), to achieve peak hardware performance, it is also necessary to consider reconfigurable hardware accelerators [10]. Considering the rapid pace of NN architecture design, accelerators need to be partially reconfigurable such that they are adaptable to new generations of NN designs, while still achieving favourable hardware performance.

Therefore, to fully utilise the performance capabilities of a reconfigurable accelerator, it is necessary to perform design space exploration (DSE) [11] to determine the optimal hardware configuration of the accelerator, given the desired NN architectures. The search space when performing DSE is determined by the available accelerator's configuration domains which can, for example, be determined by the levels of implementable parallelism [10]. Naively, DSE is conducted by systematically synthesising different configurations of a given accelerator on the hardware platform and measuring the real-world performance of the desired NNs on the accelerator. Given a large search space, consisting of different configurations of the accelerator, the time and resource costs of actually implementing the accelerator on the target hardware platform limit the speed of DSE. Practically, it is therefore necessary to accurately estimate the hardware performance during DSE with respect to multiple different hardware specifications, to enable the fast exploration and exploitation of the available configurations for the given NNs.

There are several performance estimation frameworks for reconfigurable accelerators [12–14]; however, estimating the performance without knowing the run-time intricacies of running different NNs remains a challenging task. There are two main reasons for this: (1) the cost of executing a certain operation on hardware varies with on/off-chip communication, synchronisation, control signals and I/O interruptions, and, for NN accelerators in particular, with the NN's architecture, complicating the estimation; (2) it is difficult to accurately select the most representative design features for all hardware specifications during performance estimation.

In this work, we propose a novel approach for performance estimation of custom convolutional neural network (CNN) accelerators. The proposed method constitutes a Gaussian process regression model [15] coupled with features that can be readily read off datasheets for the underlying hardware platform or the target algorithm (a tutorial code is available at https://git.io/Jv31c). We evaluate the method for estimating layer-wise latency, as well as network-wise latency and energy consumption. Experiments were conducted on two hardware platforms: the Intel Arria GX 1150 field-programmable gate array (FPGA) and a structured application-specific integrated circuit (ASIC) implementation of the targeted accelerator. We compared the proposed approach to other machine learning-inspired methods, namely linear regression (LR), gradient tree boosting (GTB) and a feed-forward fully-connected NN. The proposed approach is simple to implement, fast in providing predictions and more accurate than the compared methods in estimating both latency and energy. This article extends our previous work [16] by evaluating an additional hardware metric, energy consumption, by benchmarking the proposed method on an additional hardware implementation platform (ASIC) and by supportive software experiments. The further experimentation shows that the Gaussian process is an accurate estimator of the hardware performance of running CNNs.

In Section 2, we discuss the background on NN design and the related work on performance estimation. Then, in Section 3, we introduce the proposed method, followed by Section 4, where we describe the implemented hardware design of the benchmarked accelerator. Then, we present the experiments, results and discussion in Section 5. Lastly, we conclude the work in Section 6.

#### **2. Background and Related Work**

In this section, we present an overview of NNs and their compute pattern and related work on performance estimation methods.

#### *2.1. Neural Networks*

NNs are built by stacking several mathematical operations, otherwise known as layers, on top of each other. In this work, we mainly demonstrate our method on an accelerator for CNNs; however, the proposed method is not limited to accelerators for CNNs. A CNN is usually processed in a layer-by-layer fashion; nevertheless, most modern networks [17–19] also feature residual or concatenative connections between layers [17]. Frequently used CNN layers are 2D convolutional, fully-connected and pooling layers, interleaved with element-wise applied non-linearities [20]. Convolutional and fully-connected layers aim to learn useful features for recognising patterns in the input data, while pooling reduces the representation and retains the most important information as the data pass through the NN. Practically, convolutional and fully-connected layers take up over 90% of the computation and energy consumption in a CNN model [2,21,22]. The algorithm behind 2D convolution is shown in Algorithm 1. The notation used in this paper is presented in Table 1.

#### **Algorithm 1** Convolution.

**Input**: Input feature map **I** of shape *C* × *H<sub>I</sub>* × *W<sub>I</sub>*; weight matrix **W** of shape *F* × *C* × *K* × *K*

**Output**: Output feature map **O** of shape *F* × *H<sub>O</sub>* × *W<sub>O</sub>*


#### **Table 1.** Notation used in this paper.


As illustrated in Algorithm 1, the convolution accepts a *C* × *H<sub>I</sub>* × *W<sub>I</sub>* sized input feature map, which is convolved with a kernel of shape *F* × *C* × *K* × *K*. Each *K* × *K* kernel window is applied to one *H<sub>I</sub>* × *W<sub>I</sub>* channel of the input by sliding the kernel with a stride of *s* to produce one *H<sub>O</sub>* × *W<sub>O</sub>* output feature map; the results of the *C* channels are then accumulated to produce one filter of the output. All *F* × *H<sub>O</sub>* × *W<sub>O</sub>* output feature maps are generated by repeating this process *F* times. A fully-connected layer can be re-interpreted as a convolution by considering the kernel size *K* = 1. Using this compute pattern, it is then possible to summarise the number of compute operations, as well as the number of memory transfers, as shown in Table 2. At the same time, given the different for-loops in Algorithm 1, it is possible to parallelise the convolution operation in each for-loop dimension: filter, channel, data vector or kernel. In Section 4, we introduce the implemented accelerator, which is capable of taking advantage of this property in multiple dimensions.
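As a concrete illustration, the loop nest that Algorithm 1 describes can be sketched in plain NumPy; this is a simplified reference version assuming no padding, not the accelerator's implementation:

```python
import numpy as np

def conv2d(I, W, s=1):
    """Direct 2D convolution: I is C x H_I x W_I, W is F x C x K x K.
    Returns O of shape F x H_O x W_O with H_O = (H_I - K) // s + 1 (no padding)."""
    C, H_I, W_I = I.shape
    F, _, K, _ = W.shape
    H_O = (H_I - K) // s + 1
    W_O = (W_I - K) // s + 1
    O = np.zeros((F, H_O, W_O))
    for f in range(F):                    # filter dimension
        for c in range(C):                # channel dimension (accumulated)
            for h in range(H_O):          # output rows
                for w in range(W_O):      # output columns
                    window = I[c, h*s:h*s+K, w*s:w*s+K]
                    O[f, h, w] += np.sum(window * W[f, c])
    return O
```

Each of the four explicit loops corresponds to one of the parallelisable dimensions mentioned above: filter (`f`), channel (`c`) and the spatial data dimensions (`h`, `w`).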


**Table 2.** Number of operations and the data size for a convolution.

#### *2.2. Performance Estimation*

As discussed in Section 1, the most accurate and reliable method for determining the performance of a CNN for a specific system configuration is deploying the CNN on the hardware platform and measuring its performance. A significant drawback of this method is that it requires re-implementation of each hardware specification on the hardware's fabric. Given the large number of potential configurations that might need to be benchmarked during DSE, this approach is too time consuming and resource demanding. It is therefore more feasible and practical to perform DSE with respect to a software-level estimate of the performance, rather than running the CNN for each hardware configuration of different hardware architectures. Considering a complex accelerator for multi-layer CNNs, the intricacy of the data manipulation or the compute makes it likely that the performance will need to be estimated on a case-by-case basis; such case-by-case estimation is infeasible in general, as it is usually constrained to a single hardware configuration. Nevertheless, a few researchers have proposed general performance estimation methodologies [12–14].

A performance estimation framework for reconfigurable dataflow platforms was proposed by Yasudo et al. [12], which can analytically determine the number of accelerator units suitable for an application. Dai et al. [13] proposed an estimation method based on GTB and a high-level synthesis report. However, their method requires a significant amount of data and features from the synthesis report, which might not be available, especially when high-level synthesis is not used to implement the accelerator. Liu et al. [14] proposed a general heuristic-based method for estimating the performance of FPGA based CNN accelerators, which is now the standard go-to estimation method. This heuristic analytic approach does not depend on any collected measurements and is simple to implement, since it relies only on variables that can easily be read from the respective datasheets for the hardware platform or the algorithmic configuration. Nevertheless, it usually computes the most optimistic estimate, as it does not take communication, synchronisation or control into account. One way to refine the estimation is to collect a few runtime data points and use them to improve the estimate.

Therefore, in our work, we propose using a Gaussian process (GP) regression model [23] together with data samples collected by running the CNN on real hardware. GP is a model built on Bayesian probabilistic theory, which can embody prior knowledge into the predictive model and can be used for the regression of real-valued non-linear targets [23].

#### **3. Method**

In this section, we motivate and describe the proposed method for performance estimation, which is based on a GP regression model.

Given a dataset D = {(*x<sub>i</sub>*, *y<sub>i</sub>*)}; *i* = 1, . . . , *N* consisting of *N* observations with inputs *x<sub>i</sub>* ∈ ℝ<sup>*M*</sup> and outputs *y<sub>i</sub>* ∈ ℝ, a function *f* needs to be induced to hypothesise *y*<sub>∗</sub> on new, previously unseen, inputs *x*<sub>∗</sub>. *x* represents a vector of *M* features, while *y* represents the real-valued target that is to be estimated. As discussed in Section 2.2, there are multiple function classes that can be used to perform this task.

A naive parametric approach would make use of a predictive conditional distribution *p*(*y*<sub>∗</sub>|*w*, D, *x*<sub>∗</sub>). This approach constitutes an LR with parameters *w*, such that the prediction is made as *y* = ∑<sup>*M*</sup><sub>*m*=1</sub> *w<sub>m</sub>x<sub>m</sub>*. It requires learning the parameters *w*, which represent one potential function realisation *f* that fits the data.

Assuming a Gaussian weight prior *p*(*w*) = N(*w*|**0**, **Σ**<sub>*w*</sub>), with some pre-defined covariance matrix **Σ**<sub>*w*</sub>, we can induce a Gaussian distribution on any set of *y*: *p*(*y*|*x*) = N(*y*|*µ*, *K*), where *K* ∈ ℝ<sup>*N*×*N*</sup> is the covariance matrix characterised by a covariance function and *µ* represents the mean. This leads to the consideration of a non-parametric predictor, where instead of learning *w*, the focus is shifted towards inferring an entire distribution of function classes for explaining the data. Specifically, a non-parametric predictor uses a parametric model and integrates out the parameters. A prior *p*(*θ*) induces a distribution over plausible functions, where *θ* is a latent random variable. Using such a probabilistic modelling framework, we can sample plausible data-fitting functions directly. This approach avoids having to decide on a predefined class of function predictors, as it considers all of them. The assumption that any set of function values specified at arbitrary points **x**<sub>*i*</sub> is Gaussian distributed leads to a GP model.

A GP is a flexible Bayesian model: a collection of Gaussian random variables [*f*<sub>1</sub>, *f*<sub>2</sub>, . . .] such that any finite subset follows a joint Gaussian distribution [23]; in particular, for any finite set of plausible inputs *X*<sub>∗</sub>, the vector *f*<sub>∗</sub> = *f*(*X*<sub>∗</sub>) is Gaussian distributed. The stochastic process can be entirely determined by second-order statistics: a mean function *m*(.) and a kernel (covariance) function *k*(., .). The mean function represents the value that the mean across the functions *f* tends towards. The covariance matrix *K* is characterised by the kernel function values [*K*]<sub>*i*,*j*</sub> = *k*(*x<sub>i</sub>*, *x<sub>j</sub>*) = *φ*(*x<sub>i</sub>*)<sup>*T*</sup>*φ*(*x<sub>j</sub>*), for some non-linear function *φ*(.), which represent the value that the sample covariance of all sampled functions tends towards for the points *x<sub>i</sub>* and *x<sub>j</sub>*. The kernel encodes structural information about the latent function *f* and must be symmetric and positive semi-definite.

For *N* Gaussian observations *X<sub>N</sub>* ∈ ℝ<sup>*N*×*M*</sup>; *Y<sub>N</sub>* ∈ ℝ<sup>*N*×1</sup>, with *y<sub>i</sub>* = *f*(*x<sub>i</sub>*) + *e<sub>i</sub>* where *e<sub>i</sub>* ∼ N(*e<sub>i</sub>*|0, *σ*<sup>2</sup>), the posterior for unseen data *X*<sub>∗</sub> is defined as in Equations (1) and (2) (for a detailed derivation, please refer to [23]):

$$f\_\*|y \sim \mathcal{N}(\mathbf{m}\_{\*|N}, \mathbf{K}\_{\*,\*|N})\tag{1}$$

$$\begin{aligned} \mathbf{m}\_{\*|N} &= m(\mathbf{X}\_\*) + \mathbf{K}\_{\*,N} (\mathbf{K}\_{N,N} + \sigma^2 \mathbf{I})^{-1} (\mathbf{Y}\_N - m(\mathbf{X}\_N)) \\ \mathbf{K}\_{\*,\*|N} &= \mathbf{K}\_{\*,\*} - \mathbf{K}\_{\*,N} (\mathbf{K}\_{N,N} + \sigma^2 \mathbf{I})^{-1} \mathbf{K}\_{N,\*} \end{aligned} \tag{2}$$

Furthermore, training the GP requires finding appropriate latent random variables, or hyperparameters, *θ*. Considering the posterior over hyperparameters, *p*(*θ*|*X*, *y*) = *p*(*y*|*X*, *θ*)*p*(*θ*)/*p*(*y*|*X*), the hyperparameters *θ*<sup>∗</sup> are obtained by maximising the log of the marginal likelihood: *θ*<sup>∗</sup> = arg max<sub>*θ*</sub> log *p*(*y*|*X*, *θ*) + log *p*(*θ*).
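As a companion to Equations (1) and (2), the posterior computation can be sketched in NumPy; the RBF kernel, the zero-mean default and the hyperparameter values below are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = v * exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_N, Y_N, X_star, noise=1e-2, mean_fn=lambda X: np.zeros(len(X))):
    """Posterior mean and covariance at test points X_star (Equations (1)-(2))."""
    K_NN = rbf_kernel(X_N, X_N) + noise * np.eye(len(X_N))
    K_sN = rbf_kernel(X_star, X_N)
    K_ss = rbf_kernel(X_star, X_star)
    # Cholesky factorisation: the O(N^3) step mentioned later in the text
    L = np.linalg.cholesky(K_NN)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y_N - mean_fn(X_N)))
    m_star = mean_fn(X_star) + K_sN @ alpha          # Equation (2), posterior mean
    v = np.linalg.solve(L, K_sN.T)
    K_star = K_ss - v.T @ v                          # Equation (2), posterior covariance
    return m_star, K_star
```

Passing a non-zero `mean_fn` is how an analytic estimate can serve as the GP mean function, as proposed below.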

In this paper, we propose to use a GP regression model as outlined above to predict the performance of an algorithm realisation on a given accelerator and hardware platform. We propose to use the design-time characteristics of the accelerator and the target NN as features, as shown in Table 1, with respect to which we can predict the target performance measure (a tutorial code is available at https://git.io/Jv31c). Practically, this means that an input vector *x* is a vector of *M* algorithmic or hardware features for one configuration of the system, while *y* represents the performance to be estimated. The features of the input vector *x* are those already known and used in the standard analytic estimation [14], avoiding the need for any additional feature extraction from the dataset or the datasheets. These features consist of characteristics of the CNN to be run, as well as of the hardware accelerator. Additionally, it is possible to embed the standard analytic method into the GP based estimator by using it as the mean function *m*(.). This model enables us to use any available measurements as training data and does not restrict us to one class of predictors; it considers a plausible family of best-fitting models characterised by the kernel and the mean function. The proposed method is able to make predictions outside of the observed data samples without collapsing [23]. At the same time, by choosing the features given by the datasheets, the model is more interpretable than an NN or an LR, where the corresponding uninterpretable weights *w* need to be learned. Moreover, the Gaussian noise assumption can be interpreted as an additive instrumentation error incurred while collecting measurements. Furthermore, if used during DSE, the GP model can additionally provide an uncertainty estimate for its predictions, which can more precisely guide the exploration and exploitation of the search space [23]. The overall system diagram, including all the necessary parts of the prediction methodology, is presented in Figure 1. The dashed lines symbolise the fitting of the GP: hardware measurements, along with the characteristic NN and hardware features, are provided to the GP to obtain the *θ*<sup>∗</sup>, *Y<sub>N</sub>*, *K<sub>N,N</sub>* used during the evaluation. During the evaluation, the features and the fitted GP model are then used for prediction.
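For illustration, assembling the feature vector *x* from datasheet and algorithm characteristics might look as follows; the dictionary key names are hypothetical and simply mirror the notation of Table 1:

```python
def make_features(layer, hw):
    """Assemble one input vector x from algorithmic and hardware characteristics.
    `layer` and `hw` are plain dicts; the key names follow Table 1 and are
    illustrative assumptions, not a fixed API."""
    return [
        layer["H_I"], layer["W_I"], layer["H_O"], layer["W_O"],  # feature map sizes
        layer["K"], layer["F"], layer["C"],                      # kernel, filters, channels
        hw["PF"], hw["PC"], hw["PV"],                            # parallelism levels
        hw["M_CLK"], hw["L_CLK"], hw["M_EFF"], hw["S"], hw["DW"],  # clocks, memory, widths
    ]
```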

For a training set of *N* samples, the computational complexity of the training scales as ∼O(*N*<sup>3</sup>) due to the unavoidable Cholesky factorisation, while prediction is ∼O(*N*<sup>2</sup>) and the memory requirements are ∼O(*NM* + *N*<sup>2</sup>). Therefore, given a typical number of collected real-world measurements for different configurations of the accelerator (fewer than 1000), the method is scalable enough to be used in practice.

**Figure 1.** Overview of the proposed prediction methodology based on a Gaussian process (GP).

In the next section, we present the CNN accelerator to which we applied the proposed method. We compare our approach with other estimators in predicting layer-wise latency, as well as network-wise latency and energy consumption.

#### **4. Hardware Design**

In this section, we detail the accelerator architecture whose performance, for multiple different CNN architectures, we aim to estimate.

#### *4.1. Accelerator's Architecture*

The hardware design of our accelerator is illustrated in Figure 2. The design consists of a CNN engine, a central communication interconnect and an off-chip main memory. The weights of the whole network are transferred and stored in the off-chip memory via the central communication interconnect before the processing. The CNN engine is composed of an input buffer, a weight buffer, a convolutional processing engine (PE) and other functional modules including batch normalisation (BN) [24], shortcut (SC) [17], pooling (Pool) and rectified linear unit (ReLU) activation. In order to fully utilise the extensive concurrency exhibited by CNNs and improve the hardware efficiency, we support three types of fine-grained parallelism in our CNN engine: filter parallelism (PF), channel parallelism (PC) and vector parallelism (PV). The accelerator processes each layer of a CNN one-by-one, and the intermediate results between layers are transferred to and stored in the off-chip memory in case the output size is bigger than the available on-chip memory. To achieve higher hardware performance, the accelerator is designed to support 8 bit operations.

**Figure 2.** The convolutional neural network accelerator's design. SC, shortcut; PC, channel parallelism; PV, vector parallelism; PF, filter parallelism; DMA, direct memory access.

To avoid large on-chip memory consumption, we adopt the channel-major computational pattern for convolution, which is illustrated in Algorithm 2. In our channel-major PE, the computation required along the channel dimension of each filter is finished first. In this way, the on-chip memory only needs to cache the intermediate results for one filter, which largely decreases the memory usage.

In this paper, we used this accelerator design to perform the benchmarking of our proposed estimator method in estimating layer-wise latency, network-wise latency and energy consumption.

#### **Algorithm 2** Channel-major computational pattern.

**Input**: Input feature map **I** of shape *C* × *H<sub>I</sub>* × *W<sub>I</sub>*; weight matrix **W** of shape *F* × *C* × *K* × *K*

**Output**: Output feature map **O** of shape *F* × *H<sub>O</sub>* × *W<sub>O</sub>*

1: **for** (*f* = 0; *f* < *F*/*PF*; *f*++)
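Since Algorithm 2 is only partially reproduced above, the channel-major ordering can be sketched as follows; this is a simplified single-threaded view assuming *PF* = *PC* = *PV* = 1, unit stride and no padding, not the hardware implementation itself:

```python
import numpy as np

def conv2d_channel_major(I, W):
    """Channel-major convolution: for each filter, accumulation over all C
    channels finishes first, so only one filter's partial sums must be buffered."""
    C, H_I, W_I = I.shape
    F, _, K, _ = W.shape
    H_O, W_O = H_I - K + 1, W_I - K + 1
    O = np.zeros((F, H_O, W_O))
    for f in range(F):                       # one filter at a time
        acc = np.zeros((H_O, W_O))           # stands in for the on-chip buffer
        for c in range(C):                   # channel dimension completed first
            for h in range(H_O):
                for w in range(W_O):
                    acc[h, w] += np.sum(I[c, h:h+K, w:w+K] * W[f, c])
        O[f] = acc                           # finished filter written back
    return O
```

The point of the ordering is visible in `acc`: only one *H<sub>O</sub>* × *W<sub>O</sub>* buffer of partial sums is live at any time, rather than all *F* of them.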


#### *4.2. Standard Analytical Latency Model*

In this section, we outline the layer-wise processing latency model for the proposed accelerator, which constitutes the standard method [14] that we compare against.

The simplest form of a heuristic that estimates layer-wise latency on a hardware accelerator partitions the overall processing time into individual per-layer latencies *T<sub>i</sub>*, each corresponding to the time to perform one convolution in a feed-forward CNN consisting of *B* convolutions/layers. The per-layer latency of an implemented CNN accelerator consists of three parts: (1) the time for loading the input; (2) the computation time; (3) the time for storing the results.

The complete input has to be loaded into the on-chip memory only once for the first layer, while the partial results that do not fit into the on-chip memory are off-loaded to the off-chip memory. Nevertheless, the time spent on this memory transfer is assumed to be negligible.

The size of the weights and the input/output for a convolution is shown in Table 2, following the notation defined in Table 1. The per-layer latency *T<sub>i</sub>* for a single convolutional layer *i*; *i* = 1, . . . , *B* of a CNN with *B* layers is given by Equations (3)–(5) as follows:

1. Loading time, i.e., the time to load the input into the on-chip memory. Note that the loading of the data is in parallel with respect to the channel parallelism *PC*:

$$T\_{weight\_i} = \frac{K\_i \times K\_i \times F\_i \times C\_i \times DW}{PC \times PV \times M\_{CLK} \times S \times M\_{EFF}}$$

$$T\_{data\_i} = \frac{H\_{I\_i} \times W\_{I\_i} \times C\_i \times DW}{PC \times PV \times M\_{CLK} \times S \times M\_{EFF}}$$

$$T\_{load\_i} = T\_{weight\_i} + T\_{data\_i} \tag{3}$$

2. Computation time, i.e., the time to compute *PF* × *PC* parallel filters and channels, respectively:

$$T\_{compute\_i} = \frac{F\_{i} \times C\_{i} \times H\_{I\_{i}} \times W\_{I\_{i}} \times K\_{i} \times K\_{i}}{PF \times PC \times L\_{CLK}} \tag{4}$$

3. Storing time, i.e., the time to store the output back to the off-chip memory. Note that similar to the input loading time, the storage time is divided by the channel parallelism *PC*:

$$T\_{store\_i} = \frac{H\_{O\_i} \times W\_{O\_i} \times F\_i \times DW}{PC \times PV \times M\_{CLK} \times S \times M\_{EFF}} \tag{5}$$

Therefore, the time required to process a single convolutional layer can be written as in Equation (6) below:

$$T\_i = \begin{cases} T\_{load\_i} + T\_{compute\_i} & i = 1 \\ \max(T\_{weight\_i}, T\_{compute\_i}) & 1 < i < B \\ \max(T\_{weight\_i}, T\_{compute\_i}) + T\_{store\_i} & i = B \end{cases} \tag{6}$$

Note that the *max* operations, which are present due to the pipelining of the design, result in a latency determined by the slowest operation.
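The per-layer model of Equations (3)–(6) can be collected into a single sketch; the parameter names follow Table 1 and the defaults mirror the FPGA configuration used in Section 5.1, but this is a simplified reference implementation:

```python
def layer_latency(i, B, H_I, W_I, H_O, W_O, K, F, C,
                  PF=64, PC=64, PV=1, M_CLK=200e6, L_CLK=200e6,
                  M_EFF=0.7, S=64, DW=8):
    """Per-layer latency T_i in seconds, following Equations (3)-(6)."""
    mem = PC * PV * M_CLK * S * M_EFF              # shared memory-transfer denominator
    T_weight = (K * K * F * C * DW) / mem          # weight loading, Equation (3)
    T_data = (H_I * W_I * C * DW) / mem            # input loading, Equation (3)
    T_compute = (F * C * H_I * W_I * K * K) / (PF * PC * L_CLK)   # Equation (4)
    T_store = (H_O * W_O * F * DW) / mem           # store-back, Equation (5)
    if i == 1:                                     # first layer: full load, then compute
        return T_weight + T_data + T_compute
    T = max(T_weight, T_compute)                   # pipelined: slowest stage dominates
    if i == B:                                     # last layer: add the store-back time
        T += T_store
    return T
```

Summing `layer_latency(i, B, ...)` over all *B* layers gives the network-wise analytic estimate used as the GP mean function in Section 5.1.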

#### **5. Experiments**

In this section, we present the experimental settings, as well as the results with respect to both latency and energy estimation on different CNN architectures on the implemented accelerator (Section 4). The experiments were performed on an FPGA, as well as a custom ASIC. The networks were quantized into 8 bits [25], such that *DW* = 8 bits.

#### *5.1. Evaluation for FPGA Design*

This section describes the accelerator implemented on an Intel Arria GX 1150 FPGA; we evaluate the proposed GP based method with respect to layer-wise latency estimation while running CNNs on the accelerator. The fixed hardware parameters used for the FPGA implementation were as follows: the filter, channel and data parallelism were set to *PF* = 64, *PC* = 64 and *PV* = 1. The memory and logic clock frequencies were *M<sub>CLK</sub>* = 200 MHz and *L<sub>CLK</sub>* = 200 MHz. The memory efficiency was assumed to be *M<sub>EFF</sub>* = 70%, and the communication data-width was *S* = 64 bits. The evaluation dataset comprised several different configurations of convolutional layers, which are the building blocks of three different CNNs, namely SSD [18] with 24 convolutions, Yolo [19] with 75 convolutions and ResNet-50 [17] with 57 convolutions. The characteristics of the dataset from a software perspective are shown in Table 3. These networks were chosen because their algorithmic structures present challenges to the accelerator design, its control and its scheduling. In particular, SSD and Yolo are characterised by their irregularities, which result in outputs being produced at different times, while ResNet is known for its residual blocks, which require implementing additional control in hardware.


**Table 3.** Dataset for the evaluation of the layer-wise latency on an FPGA.

In total, the dataset for layer-wise latency estimation consisted of *N* = 156 training samples, and the input feature size *M* was 15, corresponding to: *H<sub>I<sub>i</sub></sub>*, *W<sub>I<sub>i</sub></sub>*, *H<sub>O<sub>i</sub></sub>*, *W<sub>O<sub>i</sub></sub>*, *K<sub>i</sub>*, *F<sub>i</sub>*, *C<sub>i</sub>*, *PF*, *PC*, *PV*, *M<sub>CLK</sub>*, *L<sub>CLK</sub>*, *M<sub>EFF</sub>*, *S* and *DW*. The recorded latency per convolution represents the target *y*. Due to the limited size of the dataset, leave-one-out cross-validation (LOOCV) with respect to the mean absolute error (MAE) was used to compare the estimators. LOOCV is the special case of leave-*k*-out cross-validation with *k* = 1: a model is trained on all samples except one, on which its performance is then evaluated. Although potentially more expensive, it provides a less biased estimate of the test error. In this instance, the performance of a predictor is measured by the absolute error between the prediction and the target value; the errors are summed over all samples and divided by the number of samples to obtain the mean.
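The LOOCV MAE procedure just described can be sketched generically; the `fit`/`predict` callable interface below is an assumption for illustration, not a specific library API:

```python
def loocv_mae(X, y, fit):
    """Leave-one-out cross-validated mean absolute error.
    `fit(X_train, y_train)` must return a `predict(x)` callable (assumed API)."""
    N = len(X)
    total = 0.0
    for i in range(N):
        X_train = X[:i] + X[i+1:]            # hold out sample i
        y_train = y[:i] + y[i+1:]
        predict = fit(X_train, y_train)
        total += abs(predict(X[i]) - y[i])   # absolute error on the held-out sample
    return total / N                         # mean over all N held-out errors
```

Any of the compared estimators (GP, LR, GTB, NN) can be wrapped into such a `fit` callable and scored identically.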

In the evaluation, the proposed method is compared with the standard analytical method, as well as LR, GTB and a fully-connected multi-layer NN. Due to the few data samples, we used the layer-wise latency model presented in Section 4.2 as the mean function *m*(.) of the GP model. We considered several hyperparameters for the proposed GP based method, such as the learning rate, ranging from 0.1 to 0.000001 on a logarithmic scale, and the kernel, chosen among linear, Gaussian and Matérn kernels [23] and their combinations. The best parameters were found by a grid search with respect to the LOOCV MAE. For GTB and the NN, we needed to determine the most influential parameters, such as the learning rate, ranging from 0.01 to 0.0001 on a logarithmic scale, and, for the GTB, the number of trees and the tree depth, determined by gradual pruning. For the NN, we needed to decide the number of hidden nodes, between [10, 1], [10, 10, 1] and [10, 10, 10, 1], and the activation function, for which we considered tanh, ReLU and sigmoid. The hyperparameters were similarly found through a grid search with respect to the LOOCV MAE. For the standard method and LR, no hyperparameters needed to be determined. The results for latency estimation are presented in Table 4.



Overall, the best method proved to be the combination of the standard method as the mean function for the GP and the collected data. In comparison to the other approaches, the proposed method achieved approximately a 30.7% improvement in LOOCV MAE, decreasing it to 0.312 ms, compared with the second best-performing methods, LR and the standard method, at 0.450 ms MAE.

#### *5.2. Evaluation on the ASIC Design*

In this section, we implement the outlined hardware accelerator using 28 nm eASIC [31] technology on the Intel N3XS platform with 8 GB of DDR3 installed as off-chip memory. The whole design was clocked at *M<sub>CLK</sub>*, *L<sub>CLK</sub>* = 333 MHz, and *PF*, *PC* and *PV* were set to 64, 64 and 1, respectively. The example design used in this experiment kept the same parallelism configuration for the entire CNN model. Other designs, such as the streaming design [32], can support layer-wise configurable parallelism; however, the layer-wise instantiation of a modern deep CNN requires extensive hardware resources, which are often not available.

Before evaluating our GP based estimation, we compare the FPGA and eASIC implementations in terms of latency and power efficiency (frames per second per Watt (FPS/W)) on four CNN models: SSD, ResNet-50, Yolo and VGG-16. It can be clearly seen from Table 5 that the eASIC design achieved higher energy efficiency and lower latency than the FPGA implementation on all four CNN models.

**Table 5.** Hardware performance comparison between the FPGA and eASIC design.


Next, we evaluated the GP based estimation for the eASIC design with respect to latency and energy consumption. Instead of estimating per-layer latency, this experiment aimed at validating the GP based estimation of a whole NN for both latency and energy consumption. We ran ResNet-50 [17] using different network configurations and recorded the energy and latency to form the training and evaluation datasets, as illustrated in Figure 3.

**Figure 3.** ResNet-50 with different depths, channel numbers and expansion ratios.

The network contains three parts: a head, a middle and a tail. The head part includes a convolutional layer and a stride-2 pooling layer, while the tail part consists of an average pooling layer followed by a fully-connected layer. We fixed the head and tail parts while changing the network configuration of the middle part, which contains four residual blocks with a gradually reduced feature map size and increased channel numbers. In each residual block, the depth ranges from two to *D<sub>i</sub>*, where *D<sub>i</sub>* denotes the maximal depth of the *i*th block. In each cell of a residual block, the expansion ratio (*E*) was chosen from [0.5, 0.75, 1.0]. For the regression, as the hardware properties are fixed for the eASIC design, we only needed to encode the network configuration as a 13-dimensional vector representing the expansion ratios used in the 13 cells, giving *M* = 13; the expansion ratio was set to zero if a cell was skipped. We randomly sampled 800 different network configurations and evaluated these networks on our eASIC design with respect to latency and energy consumption. We used 600 samples for training and 200 samples for evaluation. Therefore, even though the hardware configuration remained fixed, we benchmarked the methodology with respect to varying software parameters.
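The configuration encoding can be sketched as follows; for simplicity, this illustrative version treats each of the 13 cells independently and ignores the per-block depth constraints described above:

```python
import random

def sample_config(num_cells=13, seed=None):
    """Randomly sample one middle-part configuration: each cell is either
    skipped (encoded as 0.0) or assigned an expansion ratio from [0.5, 0.75, 1.0].
    The resulting list is the 13-dimensional feature vector x used for regression."""
    rng = random.Random(seed)
    return [rng.choice([0.0, 0.5, 0.75, 1.0]) for _ in range(num_cells)]
```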

To demonstrate the advantages of GP-based estimation over other regression techniques, we also compared it with LR, GTB and an NN, as shown in Table 6. In this instance, we used a zero-mean function, so that the methods rely on the data rather than on any bias potentially induced by an inaccurate analytical approximation. All methods used the same hyperparameters as in Section 5.1, demonstrating the flexibility and implementation simplicity of the proposed GP regression model. Our method achieved a smaller MAE for both latency and energy estimation than the other methods. In comparison to LR, a simple and widely adopted estimator, the accuracy improves approximately twofold for both latency and energy estimates.


**Table 6.** Evaluation of network-wise latency and energy estimation for different methods on the convolutional neural network accelerator on an eASIC.

Furthermore, in Figure 4, we demonstrate the advantage of GP over the aforementioned methods on smaller datasets by varying the training dataset size and the number of input features when predicting the overall latency and energy consumption on the eASIC. Each experiment was repeated three times with varying numbers of available data points or features to evaluate the robustness of the compared methods. The GP is not only more accurate but also more robust, as its standard deviation is consistently smaller than those of the other methods across all experiments.

**Figure 4.** Prediction benchmarks for latency with respect to changing training data size (**a**) and feature set size (**b**). Benchmarks for energy with respect to changing training data size (**c**) and feature set size (**d**).

The main advantage of the proposed method lies in its implementation simplicity: it reuses variables that are commonly found in hardware or algorithmic datasheets and are already used in DSE, combined with recorded measurements. The method can be improved by recording more measurements and by simple fine-tuning of the hyperparameters related to the kernel *K*. Nevertheless, as demonstrated in Sections 5.1 and 5.2, the method is capable of estimating the performance even from few collected data samples.

A potential limitation of this method, as was alluded to in Section 3, stems from the kernel computation, whose complexity scales as O(*N*<sup>3</sup>) in the number of training samples *N*. This means that the inference time can be prolonged if there are many training samples. One possible solution is to use variational inference to determine the *k* most important points that have to be included in the kernel computation [34]. Nevertheless, the inference time is much shorter than the time needed for synthesising and then running the design on hardware.
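One common way to realise such a sparse approximation is an inducing-point (subset-of-regressors) predictor, a simpler relative of the variational scheme cited above: only a *K* × *K* system is solved for *K* ≪ *N* inducing points, reducing the cost from O(*N*³) to O(*NK*²). The sketch below uses assumed toy data and illustrative hyperparameters:

```python
import numpy as np

def rbf(A, B, lengthscale=0.2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def sparse_gp_predict(X, y, X_test, X_ind, noise=1e-2):
    """Subset-of-regressors predictive mean: only a K x K system is solved,
    so the cost is O(N K^2) instead of the exact GP's O(N^3)."""
    Kuf = rbf(X_ind, X)                    # K x N cross-covariance
    Kuu = rbf(X_ind, X_ind)                # K x K inducing covariance
    A = noise * Kuu + Kuf @ Kuf.T
    A += 1e-8 * np.eye(len(X_ind))         # small jitter for stability
    w = np.linalg.solve(A, Kuf @ y)
    return rbf(X_test, X_ind) @ w

# Toy 1-D example: 200 training points summarised by only 20 inducing points.
X = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
X_ind = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
X_test = np.linspace(0.05, 0.95, 50).reshape(-1, 1)
pred = sparse_gp_predict(X, y, X_test, X_ind)
mae = np.mean(np.abs(pred - np.sin(2.0 * np.pi * X_test[:, 0])))
```

In the estimation setting above, the inducing points would be a small set of representative design configurations, keeping inference cheap even as more measurements are collected.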

#### **6. Conclusions**

In this paper, we propose an accurate method for estimating the performance of an accelerator for convolutional neural networks and compare it with standard methods: linear regression, gradient tree boosting and an artificial neural network. Moreover, we evaluate our method on two hardware platforms, on which we accurately predict the overall latency or energy consumption of the given convolutional neural networks. The evaluation demonstrates that the proposed Gaussian process method, paired with collected data, improves accuracy with respect to the other compared methods. Future work includes providing tools to automate our approach and extending it to cover applications beyond machine learning designs.

**Author Contributions:** Conceptualization, M.F., H.F. and D.M.; data curation, M.F. and H.F.; investigation, M.F. and H.F.; resources, S.L. and X.N.; supervision, W.L.; validation, M.F. and H.F.; writing—original draft, M.F., H.F. and D.M.; writing—review and editing, M.F., H.F., D.M., H.Z., S.L. and X.N. All authors read and agreed to the published version of the manuscript.

**Funding:** The support of the U.K. EPSRC (EP/L016796/1, EP/N031768/1, EP/P010040/1 and EP/S030069/1), Corerain, Intel and Xilinx is gratefully acknowledged.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Acknowledgments:** We thank Yann Herklotz, Alexander Montgomerie-Corcoran, the ARC'20 and Electronics reviewers for insightful suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**

