*Article* **Logic-in-Memory Computation: Is It Worth It? A Binary Neural Network Case Study**

#### **Andrea Coluccio \*, Marco Vacca and Giovanna Turvani**

Department of electronics and telecommunication (DET), Politecnico di Torino, Corso Castelfidardo 39, 10129 Torino, Italy; marco.vacca@polito.it (M.V.); giovanna.turvani@polito.it (G.T.)

**\*** Correspondence: andrea.coluccio@polito.it

Received: 23 December 2019; Accepted: 4 February 2020; Published: 22 February 2020

**Abstract:** Recently, the Logic-in-Memory (LiM) concept has been widely studied in the literature. This paradigm represents one of the most efficient ways to solve the limitations of a Von Neumann's architecture: by placing simple logic circuits inside or near a memory element, it is possible to obtain a local computation without the need to fetch data from the main memory. Although this concept introduces a lot of advantages from a theoretical point of view, its implementation could introduce an increasing complexity overhead of the memory itself, leading to a more sophisticated design flow. As a case study, Binary Neural Networks (BNNs) have been chosen. BNNs binarize both weights and inputs, transforming multiply-and-accumulate into a simpler bitwise logical operation while maintaining high accuracy, making them well-suited for a LiM implementation. In this paper, we present two circuits implementing a BNN model in CMOS technology. The first one, called Out-Of-Memory (OOM) architecture, is implemented following a standard Von Neumann structure. The same architecture was redesigned to adapt the critical part of the algorithm for a modified memory, which is also capable of executing logic calculations. By comparing both OOM and LiM architectures we aim to evaluate if Logic-in-Memory paradigm is worth it. The results highlight that LiM architectures have a clear advantage over Von Neumann architectures, allowing a reduction in energy consumption while increasing the overall speed of the circuit.

**Keywords:** Logic-in-Memory (LiM); Von Neumann's bottleneck; memory-wall

#### **1. Introduction**

Nowadays, Logic-in-Memory (LiM) architectures are widely studied as a way to solve the memory-wall problem, a bottleneck caused by the communication between processing units and memories. A LiM implementation consists of very small computational units placed near a memory element. This enables distributed computation instead of the classical Von Neumann one. The big advantage of this design approach is the reduction of the Von Neumann bottlenecks (such as fetching latency and the power wasted in CPU-memory communication), which also enables a very fast and energy-efficient structure. From a theoretical perspective, LiM architectures bring many advantages, but are they worth it?

To answer this important question, we have to consider that implementing a LiM architecture means modifying the original structure of the memory, creating a structure that merges computation and memory (Figure 1). As a natural consequence, the overall complexity of the customized design flow increases. To explore the features of a LiM implementation in more depth, two architectures have been designed: an Out-Of-Memory (OOM) architecture that follows a classical Von Neumann approach, and the novel LiM alternative derived from it. The performance obtained in both cases is subsequently compared.

**Figure 1.** Von Neumann's classical architecture (**a**) composed of CPU and memory. Logic-in-Memory (LiM) novel architecture (**b**) that merges computation and memory.

As a case study, a memory-intensive application, a Neural Network (NN), was chosen, since it is a good candidate to demonstrate the benefits of a LiM architecture. NNs are used to perform very complex tasks, such as speech and image recognition, in a very efficient and accurate way. Convolutional Neural Network (CNN) and Multi-Layer Perceptron (MLP) models are employed, and both can achieve very high accuracy. In the literature, many CNNs have been proposed: they can be distinguished by their task, complexity and achieved accuracy. Considering image classification applications, the most common CNNs are LeNet-5 [1] and AlexNet [2]. LeNet-5 is a very small network which is able to achieve a TOP-1 error rate of 0.35% on the Modified National Institute of Standards and Technology (MNIST) dataset. AlexNet is a more complex structure, largely used for recognizing RGB images, which achieves a TOP-5 error rate of 16.4% on the ImageNet dataset. GoogLeNet [3], VGG-Net [4] and ResNet-152 [5] can also be used on the same dataset and achieve TOP-5 error rates of 6.67%, 7.3% and 3.6%, respectively. In general, these models require a lot of computational resources, implying very high energy consumption, which makes them unusable in low-energy contexts such as embedded applications.

In this work, a binarized NN is chosen. Binary Neural Network (BNN) approximations have been proposed in several works, such as BinaryConnect (BC) [6], Binary-Weight Network (BWN) and XNOR-Net [7], in order to reduce the computational complexity by lowering the precision of weights and inputs through a binarization process. Weights, and possibly inputs, are approximated with only two values (−1, +1) that can be represented on a single bit: '0' indicates −1 and '1' indicates +1. The approximation chosen for this work is XNOR-Net [7]. XNOR-Net reaches high accuracy rates compared to the original floating-point model and is particularly well suited for a LiM solution, since the binary multiplication can be performed by a simple XNOR gate. While a specific Neural Network model was chosen, the architectures were developed with reconfigurability in mind, meaning that most NNs can be implemented by the hardware. Our goal is to demonstrate the effectiveness of a LiM design, so our contributions in this work can be summarized as:


The rest of the paper is organized as follows: Section 2 gives a brief explanation of what a LiM architecture is and recalls a useful classification from [9]. Section 3 briefly discusses the NN background, giving an overview of its main components; binary approximations are compared and explained in more detail. Section 4 reports the detailed design flow adopted for both OOM and LiM architectures, and Section 5 makes an initial qualitative comparison between them. In Section 6, performance evaluations are reported, first taking two NN models as a case study and then performing parametric sweeps. Lastly, Section 7 presents conclusions and future work.

#### **2. LiM Background**

#### *A Quick Overview*

The LiM concept is widely discussed in the literature and many different approaches have been adopted. In [9], an interesting classification of the various types of LiM paradigms is presented. Four possible typologies can be found.


determined by the conductivity of a conduction path that can be broken (high resistance state) or reformed (low resistance state). Sometimes it is used in a 1 transistor 1 RRAM (1T1R) configuration, to avoid unwanted or sneak current paths. In [15], authors have presented a memristor-based implementation of a BNN able to achieve both high accuracy on MNIST and IRIS dataset and low power consumption. In some others, improvements in memristor architectures have been proposed that enable multiple bits per cell. Reference [16] has exploited the frequency dependence of GeSeSn-W memristor devices to obtain multiple conductance values representing different weights. In [17], the memory array has been modified, including up to 4 memristors arranged in parallel in the same cell, in order to have multiple resistance values and so higher precision weights. Based on a similar approach to [12], a GAN training accelerator has been discussed in [18] which is able to efficiently perform approximated add/sub operations in a memristor array, achieving both speed-up and high energy efficiency.


As can be deduced, LiM is a widely studied and heterogeneous topic, and it has become increasingly important over the years. Many works presented in the literature implement an application-specific LiM solution. The emerging technologies discussed are very promising, especially in Neural Network applications, because of their high efficiency in computing multiply-accumulate operations [16]. In our work, we concentrated on CMOS technology because, while it is not the best available, RRAM and MTJ devices are still under development. As a future task, we will focus our attention on them once these circuits are optimized.

#### **3. Neural Networks: An Introduction**

#### *3.1. Neuron's Model*

A NN is a computational model that is able to perform very complex tasks. It is composed of "neurons", which are the basic building blocks. By organizing them in an interconnected network, the NN can take decisions and learn when these decisions are wrong [19].

Figure 2 depicts an example of a neuron structure. It is made of two main parts: *net*, which computes the weighted sum, and *f*(*net*), which is an activation function applied to the neuron's output. In general, the expression of *net* can be written as:

$$net = \sum_{i=0}^{N} X_i \times W_i + Bias \tag{1}$$

where *Xi* is the input value, *Wi* is the corresponding weight and *Bias* is an additive term. Neurons' weights and biases can be adjusted to achieve the desired output with a procedure called training.
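As a minimal sketch of Equation (1), the weighted sum and a ReLU activation can be written as follows. The function names (`neuron_net`, `relu`, `neuron_output`) are hypothetical and chosen only for illustration; they do not come from the original work.

```python
def neuron_net(inputs, weights, bias):
    """Weighted sum of Equation (1): net = sum_i X_i * W_i + Bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def relu(net):
    """Rectified Linear Unit, one possible activation function f(net)."""
    return max(0.0, net)


def neuron_output(inputs, weights, bias, activation=relu):
    """Neuron output f(net) for a chosen activation function."""
    return activation(neuron_net(inputs, weights, bias))
```

For instance, with three inputs as in Figure 2, `neuron_output([1, 2, 3], [0.5, -1, 2], 0.5)` evaluates the weighted sum and then applies ReLU.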

**Figure 2.** Schematic of a neuron, representing its structure. Three inputs example [19].

Figure 2 also shows the activation function *f*(*net*), which is usually nonlinear. The most important activation functions are the Rectified Linear Unit (ReLU), the hyperbolic tangent (tanh) and the sigmoid function, which are discussed in great detail in [20].

#### *3.2. Neural Network's Structure*

Usually, NNs are made up of layers, each composed of a set of neurons. The most common structures are the Convolutional Neural Network and the Multi-Layer Perceptron.

Figure 3 shows the LeNet-5 CNN as an example. The network is composed of 2 convolutional, 2 pooling and 3 fully connected (FC) layers. Each of them is discussed in detail:

• Convolutional layers perform the convolution of the input feature map (IFMAP) with a set of weights called the kernel. An example of a convolution computation is depicted in Figure 4. The parameters taken into account are the kernel's weights, the input feature map and the stride. After the first convolution is finished, the kernel window is moved by a step equal to the stride, and a new convolution can start. In this example, the convolution computation perfectly matches the neuron's equation reported in Equation (1); in fact, a convolutional layer is usually followed by an activation function that normalizes the results. In the LeNet-5 CNN [1] example in Figure 3, all the convolutional layers have the same 5 × 5 kernel size. The first one produces six output feature maps (OFMAPs), meaning that the same IFMAP has been convolved with six different kernels. The second convolutional layer instead produces 16 OFMAPs starting from 6 IFMAPs: for each input, there are 16 kernels that produce 16 outputs, so 16 from the first IFMAP, 16 from the second IFMAP and so on. This implies a total number of OFMAPs equal to

$$\#OFMAPs = 6 \times 16\tag{2}$$

To obtain the 16 OFMAPs indicated by the LeNet-5 scheme, the partial OFMAPs obtained from the different input channels are added together.

These considerations lead to the following formula for a convolutional layer, derived from [21]:

$$y_o(j, i) = Bias_o + \sum_{c_{in}=0}^{\#C_{in} - 1} \sum_{k=0}^{W_y - 1} \sum_{p=0}^{W_x - 1} k_{o, c_{in}}(k, p) \times X_{o, c_{in}}(j \times stride + k, i \times stride + p) \tag{3}$$

where *i*, *j* are the indexes for the OFMAP corresponding pixel, *cin* is the input channel index, #*Cin* the total number of input channels, *Wx*, *Wy* are the kernel's matrix size indicating number of rows and columns respectively, *o* subscript refers to the OFMAP considered and *p*, *k* are the kernel's indexes.
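The triple summation of Equation (3) can be sketched directly as nested loops. This is an illustrative, unoptimized reference for a single output channel *o*; the function name `conv2d_ofmap` and the nested-list data layout are assumptions made here, not part of the original design.

```python
def conv2d_ofmap(ifmaps, kernels, bias, stride=1):
    """Compute one OFMAP following Equation (3).

    ifmaps:  list over c_in of 2D lists (the input feature maps).
    kernels: list over c_in of 2D lists (the kernels of output channel o).
    bias:    scalar Bias_o added to every output pixel.
    """
    k_rows = len(kernels[0])        # range of index k in Equation (3)
    k_cols = len(kernels[0][0])     # range of index p in Equation (3)
    out_rows = (len(ifmaps[0]) - k_rows) // stride + 1
    out_cols = (len(ifmaps[0][0]) - k_cols) // stride + 1

    ofmap = []
    for j in range(out_rows):
        row = []
        for i in range(out_cols):
            acc = bias
            # Sum over input channels, then over the kernel window.
            for x, k_mat in zip(ifmaps, kernels):
                for k in range(k_rows):
                    for p in range(k_cols):
                        acc += k_mat[k][p] * x[j * stride + k][i * stride + p]
            row.append(acc)
        ofmap.append(row)
    return ofmap
```

Calling the function once per output channel, each time with that channel's set of kernels, yields all the OFMAPs of the layer.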

• Pooling layers behave similarly to convolutional layers. In the literature, different kinds of pooling are used, such as average or max pooling [22]. They compute the maximum (or the average) of the selected input pixels and return only one value, performing the so-called subsampling operation. Pooling, and more specifically max pooling, is widely used to reduce the size and the complexity of the CNN. In Figure 3, the kernel size is 2 × 2 in all cases.

• FC layers are MLP subnetworks included in the CNN to perform the classification operation. They are made of layers of fully interconnected neurons, as shown in Figure 3.

**Figure 3.** Structure of LeNet 5 Convolutional Neural Network (CNN) [1], composed of 2 convolutional, 2 pooling and 3 fully connected layers and their sizes are indicated in the model.

**Figure 4.** Convolution computation example with a 2×2 kernel.

There are also normalization layers (not reported in Figure 4). One of the most used is the Batch Normalization (BatchNorm) [23] that is very useful in BNNs to recover a portion of the accuracy lost from the binarization [24]. BatchNorm equation is reported from [23]:

$$\bar{X} = \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}} \times \gamma + \beta \tag{4}$$

where *μ*, *σ*<sup>2</sup> are the batch mean and variance, while *γ*, *β* are correction values. These four variables are trainable, meaning that during the training procedure they are modified in order to increase the accuracy. *ε* is usually added to the variance to avoid a division by zero when the variance is 0. Since *ε* is a very small number, the following approximation can be made for non-zero variance:

$$\bar{X} \approx \frac{X - \mu}{\sigma} \times \gamma + \beta = X \times \frac{\gamma}{\sigma} + \left(-\frac{\mu \times \gamma}{\sigma} + \beta\right) = X \times A + B \tag{5}$$
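The folding of BatchNorm into a single multiply-add, as in Equation (5), can be checked numerically with a short sketch. The helper names and the default value `eps=1e-5` are assumptions made for illustration only; the coefficients follow Equation (5), with *A* = *γ*/*σ* and *B* = *β* − *μγ*/*σ*.

```python
import math


def batchnorm(x, mu, var, gamma, beta, eps=1e-5):
    """Full BatchNorm of Equation (4); eps is an assumed small constant."""
    return (x - mu) / math.sqrt(var + eps) * gamma + beta


def fold_batchnorm(mu, var, gamma, beta):
    """Return (A, B) of Equation (5), valid only for non-zero variance."""
    if var <= 0:
        raise ValueError("the approximation requires a non-zero variance")
    sigma = math.sqrt(var)
    a = gamma / sigma
    b = beta - mu * gamma / sigma
    return a, b
```

Once *A* and *B* are precomputed, each normalized value costs one multiplication and one addition, `x * a + b`, instead of a subtraction, a square root and a division.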

#### *3.3. Binary Approximation*

Since NNs are very complex models, they can be very power hungry, and implementing them on systems with a low energy budget, such as embedded applications, can be challenging [25]. For this reason, a BNN approximation is chosen, trying to reach a good trade-off between complexity and accuracy. An interesting comparison between some BNN approximations, which also introduces XNOR-Net, is presented in [7]. The values are recalled in Figure 5. In the plot, TOP5 is the classification accuracy rate for hitting one of the five most probable classes. The accuracy of the BNNs is compared with that of the original floating-point (FP) implementation of the AlexNet neural network [2].


**Figure 5.** TOP5 accuracy comparison between different binary approximations [7].

In the considered approximation, all the weights are in binary format, meaning that *w* ≈ *wb* ∈ {−1, 1} where *wb* is the binary weight value. The binarization techniques are now briefly summarized from [7].

• BWN [7] binarizes only the weights of the NN, keeping the activations and the inputs at full precision. By binarizing only the weights, the convolution operation can be performed with only additions and subtractions, avoiding multiplications, as reported in Equation (6) [7].

$$Conv_{out,BWN} = X \ast w + Bias \approx \alpha (X \ast w_b) + Bias \tag{6}$$

The convolution result is multiplied by an extra factor *α* in order to compensate for precision losses [7]:

$$\alpha = \frac{\sum\_{i=0}^{N} ||w\_i||}{N} \tag{7}$$

where *wi* is the considered full-precision weight and *N* is the number of weights. BWN is a very good alternative for reducing the complexity of a CNN. However, it requires full-precision inputs and activations.

• XNOR-Net [7] binarizes both weights and inputs. The convolution result is obtained by performing the binary convolution and multiplying by a correction factor *α* (the same as in Equation (7)) and a matrix **K**. **K** is defined in Equation (8).

$$\mathbf{K} = \overbrace{\frac{\sum_{c_{in}=0}^{\#C_{in}-1} \left| X(\cdot, \cdot, c_{in}) \right|}{\#C_{in}}}^{\text{First term}} \ast \begin{bmatrix} \frac{1}{W_x \times W_y} & \frac{1}{W_x \times W_y} & \cdots\\ \frac{1}{W_x \times W_y} & \frac{1}{W_x \times W_y} & \cdots\\ \vdots & \vdots & \ddots \end{bmatrix} \tag{8}$$

In Equation (8), the first term is the pointwise sum of the absolute values of the multiple IFMAPs divided by the number of input channels, that is, the number of IFMAPs. The second term is a regular matrix of size *Wx* × *Wy*, which contains 1/(*Wx* × *Wy*) in all positions. Finally, the XNOR-Net convolution can be rewritten as [7]:

$$Conv_{out,\text{XNOR-Net}} \approx (X_b \oplus w_b) \cdot \mathbf{K} \times \alpha \tag{9}$$

where *Xb* is the binarized input, ⊕ is the binary convolution, · is the pointwise multiplication and × is a simple product. In [7], the binary convolution is performed by XNOR pop-counting of the binary inputs and weights. The XNOR truth table matches that of multiplication if −1 is mapped to logic '0' and +1 to logic '1'. Pop-counting computes the difference between the number of 1s and the number of 0s of the input sample.

• BC [6] binarizes both inputs and weights, without applying any correction factor to the final convolutional equation. This implies a lower recognition accuracy, as shown in Figure 5.

Taking into account all the considerations on the binarization techniques, we chose XNOR-Net [7] as the reference model, since it represents a very good trade-off between accuracy and complexity.
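The equivalence between XNOR pop-counting and a dot product over {−1, +1} values, which the XNOR-Net description relies on, can be illustrated with a small sketch. The helper names below are hypothetical; the mapping '0' → −1 and '1' → +1 is the one stated in the text.

```python
def to_bits(values):
    """Map values in {-1, +1} to logic bits: -1 -> 0, +1 -> 1."""
    return [1 if v > 0 else 0 for v in values]


def xnor(a, b):
    """XNOR of two single bits: 1 when the bits are equal."""
    return 1 - (a ^ b)


def xnor_popcount(x_bits, w_bits):
    """Pop-count of the XNOR products: (number of 1s) - (number of 0s)."""
    products = [xnor(a, b) for a, b in zip(x_bits, w_bits)]
    ones = sum(products)
    zeros = len(products) - ones
    return ones - zeros


def signed_dot(x, w):
    """Reference dot product over the original {-1, +1} values."""
    return sum(a * b for a, b in zip(x, w))
```

For any pair of equally long vectors with entries in {−1, +1}, `xnor_popcount(to_bits(x), to_bits(w))` returns the same value as `signed_dot(x, w)`, which is why a multiplier can be replaced by a single XNOR gate per element.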

#### *3.4. NN Implementations Based on LiM Concept*

The LiM approach is often applied in NN implementations. Some of them consider binary approximations and choose an implementation based on emerging technologies. Some works [12,13,26,27] are based on MTJ technology, while [15–18,28,29] use RRAM. In each of these works, the resistive element is used to perform simple logical operations based on a current-sensing technique. In [26,27,30,31], several Binary Convolutional Neural Network (BCNN) implementations are discussed: they achieve very good results in terms of energy and power, thanks to the intrinsic low-power nature of the MTJ and RRAM devices. Reference [28] proposes a BNN design based on an SRAM array. The logic parts perform the computations and are placed below the memory array. The memory parts store the parameters required for the NN computation (such as weights and biases) and the logic parts compute the results for the next layer, forming an alternation of memory and logic. This architecture achieves very good performance in terms of energy and speed, thanks to its pipeline-like structure. In [29], the NN has been mapped onto a Wide-IO2 DRAM, using TSVs as a high-speed communication link and obtaining remarkable results in terms of execution time.

#### **4. OOM and LiM Architectures**

Here we discuss the adopted processing flow in more detail. The goal of a LiM architecture is to move part of the computation inside a memory array, which already contains the logical elements needed to complete the calculations. Using the XNOR-Net as a case study, we can see that the main part of the algorithm is the calculation of the XNOR products combined with pop-counting to determine the result of the binary convolution. The adopted design flow is the following:


#### *4.1. OOM Architecture Design*

#### 4.1.1. Single Input/Multiple Output Channels Design

By looking at Equation (9) and Figure 6, the core part of the OOM architecture is composed of a set of XNOR gates and a pop-counter. In Figure 6, a small example of a 2 × 2 convolution is reported: since the kernel contains 4 elements, 4 XNOR gates are required, because each of them performs one multiplication. In general, their number must be at least equal to *Wx* × *Wy*. The kernel sizes of NNs depend on the chosen model: for example, the maximum kernel size is 11 × 11 in AlexNet [2] and 5 × 5 in LeNet-5 [1]. The flexibility of the hardware circuit depends strictly on the largest kernel size considered, so a worst-case analysis must be carried out. To the best of our knowledge, kernel sizes larger than 11 × 11 are very rare, since accuracy usually decreases with bigger filter sizes. Binarized inputs/weights are fed directly to the XNOR inputs from a memory implemented as a register file (RF) in the design, named Binary Input RF in Figure 7. In the Binary Input RF, each row contains all the input elements required for the computation of one convolutional window, implying a bitwidth equal to *Wx* × *Wy* bits. The number of rows has to be at least equal to the total number of convolutional windows required, *D*<sup>2</sup><sub>*out*</sub>, which is also the dimension of the OFMAP. *Dout* can be computed from the kernel size, the IFMAP size (*Din*) and the stride.

$$D\_{\rm out} = \frac{D\_{\rm in} - \mathcal{W}\_{\rm x}}{stride} + 1\tag{10}$$

Also in this case, the number of memory rows *D*<sup>2</sup><sub>*out*</sub> has to account for the maximum OFMAP size of the NN model considered. When a different OFMAP has to be computed, the weight set is simply switched by using a multiplexer. The total number of multiplexers is equal to the maximum number of OFMAPs, called the number of output channels (#*Cout*) of the NN. Equation (10) contains only *Wx*, since the kernels are usually square matrices with *Wx* = *Wy*.
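As a quick worked example of Equation (10) and of the resulting Binary Input RF dimensions, the following sketch computes *Dout*, the number of rows *Dout*² and the row width *Wx* × *Wy* for a square kernel. The function names are hypothetical, and the use of integer division assumes that (*Din* − *Wx*) is a multiple of the stride.

```python
def ofmap_size(d_in, w_x, stride=1):
    """D_out of Equation (10) for a square IFMAP and a square kernel."""
    if stride <= 0 or w_x > d_in:
        raise ValueError("invalid stride or kernel larger than the IFMAP")
    # Assumption: (d_in - w_x) is a multiple of stride, so // is exact.
    return (d_in - w_x) // stride + 1


def binary_input_rf_shape(d_in, w_x, stride=1):
    """Return (rows, bits_per_row) needed by the Binary Input RF."""
    d_out = ofmap_size(d_in, w_x, stride)
    return d_out * d_out, w_x * w_x
```

With a 4 × 4 IFMAP, a 2 × 2 kernel and stride 1, `binary_input_rf_shape(4, 2)` gives 9 rows of 4 bits each, one row per convolutional window.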

Regarding the pop-counting computation, handling many parallel inputs requires too many hardware resources. For this reason, the outputs of the XNOR gates are multiplexed and only one of them is processed per clock cycle. A pop-counter can be simply implemented with an adder, a NOT gate and a register as shown in Figure 8. Together with the pop-counting circuit, the main computational part has been called XNOR-Pop Unit, as shown in Figure 7.

**Figure 6.** Example of Binary convolution based on XNOR-Pop procedure.

**Figure 7.** Out-Of-Memory (OOM) main computational part. Each Binary Input RF's row holds the binary inputs required for a convolutional window computation, while weights are provided by an external memory. The outputs of the XNOR gates have been multiplexed to reduce the computational overhead of the pop-counting part.

**Figure 8.** Four bits example of a pop-counter circuit.

#### 4.1.2. Multiple Input Channels Design

Many CNNs take multiple IFMAPs as input. Each convolutional window must be computed separately and, in the end, the results are summed together to get the resulting OFMAP. This can be obtained by increasing the level of parallelism of the architecture, having multiple XNOR-Pop Units working at the same time. As can be seen in Figure 9, #*Cin* XNOR-Pop Units are required, and a final accumulation circuit computes the sum over the single channels. XNOR-Pop Units are multiplexed to reduce the hardware complexity for bigger networks.

**Figure 9.** Multiple input channels OOM design.

#### 4.1.3. FC Layer Integration

Until this point, the convolution algorithm has been mapped on the hardware architecture described so far. To implement the fully connected layer, the same circuit can be reused by simply swapping the sources of weights and inputs. To better understand this concept, the example reported in Figure 10 is considered. The weight values for each input neuron are *w*<sup>0</sup><sub>0</sub>, *w*<sup>0</sup><sub>1</sub>, *w*<sup>0</sup><sub>2</sub> for *X*<sub>0</sub>, *w*<sup>1</sup><sub>0</sub>, *w*<sup>1</sup><sub>1</sub>, *w*<sup>1</sup><sub>2</sub> for *X*<sub>1</sub> and *w*<sup>2</sup><sub>0</sub>, *w*<sup>2</sup><sub>1</sub>, *w*<sup>2</sup><sub>2</sub> for *X*<sub>2</sub>. The output *O*<sub>0</sub> can be computed considering Equation (11).

$$O_0 = \text{pop-count}(X_0 \oplus w_0^0,\; X_1 \oplus w_0^1,\; X_2 \oplus w_0^2) \tag{11}$$

**Figure 10.** Example of a 3-3 FC network mapping.

As depicted in Figure 10, the Binary Input Register File (RF) contains the binary weights instead of the inputs: by addressing each line, the multiplication of the weights with the inputs is performed, and the result is then pop-counted.

The size of the Binary Input RF is also bound by the characteristics of the FC network, so the width and height of the Binary Input RF are given by the following relations:

$$\begin{cases} \text{Memory size}_{x} = \max \left( W_{x} \times W_{y}, \#\text{input neurons}_{FC} \right) \\ \text{Memory size}_{y} = \max \left( D_{out}, \#\text{output neurons}_{FC} \right) \end{cases} \tag{12}$$

Although this is the most straightforward way to map an FC algorithm on the architecture, it can become very complex with a high number of input neurons. Considering LeNet-5 [1] depicted in Figure 3, the first FC layer has 120 output neurons, which can be acceptable, but for more sophisticated networks like AlexNet [2], which has 4096 input neurons, this kind of scheduling becomes very inefficient. The equation of a generic output neuron *Oi* is given by

$$O\_i = \sum\_{j=0}^{4095} X\_j \times w\_i^j + Bias = X\_0 \times w\_i^0 + X\_1 \times w\_i^1 + \dots + X\_{4095} \times w\_i^{4095} + Bias \tag{13}$$

This sum can be computed by performing fewer additions in each clock cycle. The partial result is stored and accumulated in each clock cycle. The algorithm steps become:

$$\begin{aligned} \text{Store temp}(0) &= 0\\ \text{Store temp}(1) &= \sum\_{j=0}^{n} X\_j \times w\_i^j + \text{Store temp}(0) \\ \text{Store temp}(2) &= \sum\_{j=n+1}^{2n} X\_j \times w\_i^j + \text{Store temp}(1) \\ &\dots \end{aligned} \tag{14}$$

where *n* is the total number of considered terms for each summation and Store temp holds temporary additions partial results.
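The serialized accumulation can be sketched as a loop that processes *n* terms per step and keeps the running total in a temporary store. This is only an illustration: the function name is hypothetical, each step takes exactly *n* terms (following the definition of *n* given in the text), and adding the bias once at the end is an assumption, since the steps above do not show where it enters.

```python
def fc_neuron_serialized(x, w, n, bias=0):
    """Compute one FC output neuron by accumulating n terms per step.

    x, w: equally long sequences of inputs and of the weights of neuron i.
    n:    number of terms summed in each algorithmic step.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")

    store_temp = 0  # Store temp(0) = 0
    for start in range(0, len(x), n):
        # Partial sum over the current group of at most n terms.
        partial = sum(a * b for a, b in
                      zip(x[start:start + n], w[start:start + n]))
        store_temp = partial + store_temp
    # Assumption: the bias is added once, after the last step.
    return store_temp + bias
```

The result does not depend on *n*; a smaller *n* only trades more steps for fewer additions per step and a narrower memory row.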

Figure 11 shows an example of serialization with 2 input neurons per cycle, meaning that only a subset of the weights (highlighted by the dashed lines in the figure) is stored inside the Binary Input RF.

The partial result is computed and then temporarily stored in each algorithmic step. In Equation (14), *n* is 2 and consequently the Memory size values can be rewritten as

$$\begin{cases} \text{Memory size}_{x} = \max(W_{x} \times W_{y}, n) \\ \text{Memory size}_{y} = \max(D_{out}, \#\text{output neurons}_{FC}) \end{cases} \tag{15}$$

**Figure 11.** Example of serialization of the FC computation.

#### 4.1.4. OOM Convolution-FC Unit

A scheme of the complete OOM architecture is provided in Figure 12, and the functionality of each element is described in detail.

**Figure 12.** Complete OOM architecture. The thicker red dashed line frames the units which are the main components of the SurroundingLogic unit. Inputs are provided to the SurroundingLogic unit from the external world. Outputs are processed and saved outside in the testbench.


#### *4.2. LiM Architecture*

#### 4.2.1. XNOR-Pop LiM Unit

The driving design concept of a LiM architecture is to increase the level of parallelism as much as possible. Starting from the standard OOM implementation, we designed two LiM arrays that perform bitwise XNOR and pop-counting operations. Since we did not have the possibility to implement a custom memory, we used a flip-flop as the memory element and static CMOS for the logic part.

These choices imply higher power and area estimates in the synthesis phase, which will be discussed in more detail in Section 6. Regarding the XNOR part, the idea is to put an XNOR gate inside each memory cell and to perform the binary product between the content of the cell and an external binary input. An example is depicted in Figure 13, which shows how a simple 2 × 2 convolution is mapped inside a LiM array. To perform the bitwise multiplication between each binary input and the corresponding weight, the highlighted portion of the IFMAP in Figure 13 has to be convolved with the kernel in the following way:

$$\text{Incoming bit}_0 = \text{pop-count}(\overline{X_0 \oplus w_0}, \overline{X_1 \oplus w_1}, \overline{X_4 \oplus w_2}, \overline{X_5 \oplus w_3}) \tag{16}$$

Since one of the XNOR inputs is hardwired to an external connection, it is sufficient to store inside the memory array the inputs required to perform the convolution. The same holds for the following rows: the convolution is performed with the same kernel, so each memory row corresponds to a convolutional window.

Regarding pop-counting procedure, in order to reduce the complexity of the memory cell, we can simplify the pop-count equation in the following way:

$$\text{pop-count} = \#\text{1s} - \#\text{0s} = 2 \times \#\text{1s} - \text{length(word)}\tag{17}$$

where length(word) is the size of the array entering the pop-counter, which is 4 in Figure 13. A ones counter is simply made of half adders (HA) connected as depicted in Figure 14, so in the pop-counting part there will be a HA for each memory cell. Figure 15 provides an overview of the entire LiM implementation. It is possible to distinguish between the LiM XNOR part and the LiM ones counter, whose detailed architectures are depicted in Figures 13 and 14, respectively. Together with the multiplexer depicted in Figure 15, they form the LiM XNOR-Pop unit.
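Equation (17) can be illustrated with a small behavioural sketch: count the ones, shift left by one position and subtract the word length. The half-adder chain below is one possible way to build a ones counter (a ripple incrementer) and is an assumption made for illustration; the actual arrangement of Figure 14 may differ. All function names are hypothetical.

```python
def half_adder(a, b):
    """Half adder on single bits: returns (sum, carry)."""
    return a ^ b, a & b


def ones_count_ha(bits):
    """Count the 1s by rippling each input bit through a half-adder chain."""
    width = max(1, len(bits).bit_length())  # enough bits to hold len(bits)
    count = [0] * width                     # count register, LSB first
    for bit in bits:
        carry = bit
        for i in range(width):
            count[i], carry = half_adder(count[i], carry)
    return sum(b << i for i, b in enumerate(count))


def pop_count(bits):
    """Equation (17): pop-count = 2 * #1s - length(word)."""
    return (ones_count_ha(bits) << 1) - len(bits)
```

The shift by one position implements the multiplication by 2, so only a ones counter, a shifter and a subtractor are needed instead of separate counters for 1s and 0s.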

**Figure 13.** XNOR part of the XNOR-Pop Unit LiM implementation: example of 2 × 2 kernel and 4 × 4 IFMAP sizes with stride 1.

**Figure 14.** Example of a 4 bits ones counter LiM implementation.

#### 4.2.2. LiM convolution-FC Unit

From the previous considerations, the entire LiM architecture can be designed as shown in Figure 15.

*J. Low Power Electron. Appl.* **2020**, *10*, 7

**Figure 15.** LiM entire architecture. The main blocks of the LiM implementation are the LiM XNOR part, interface decoder, LiM one-counter, and shifters-subtractors for the pop-counting computation, that are replicated for *Cin* number of times. The surrounding logic is the same as the OOM case reported in Figure 12.

The Surrounding logic unit remains the same, since the interface has been kept between OOM-LiM XNOR-Pop units. The other units are replicated #*Cin* times, depending on the total number of input channels required by the algorithm. The LiM alternative can achieve a higher level of parallelism, because XNOR-ones counter parts can perform the operations in parallel. In Figure 15, there are also "<< 1" blocks: they perform the shift by 1 position, corresponding to multiplication by 2.

#### *4.3. Top-Level Entity*

The top-level entity contains both the Convolution-FC and Pooling circuits. Pooling is simply made of a multiplexed comparator that takes the maximum out of *Wx* × *Wy* inputs. The top-level entity also contains an Interface, which is in charge of dispatching the inputs coming from the testbench and providing the results of Pooling/Convolution-FC to the outside. The top-level entity is schematized in Figure 16.

**Figure 16.** Top-level entity of both LiM and OOM architectures.

#### **5. Qualitative Comparison OOM-LiM Architectures**

In order to make a qualitative comparison, algorithm execution time was considered as a benchmark parameter. We can distinguish between convolution, fully connected and max pooling execution times, since they are completely different. The computation is based on a CNN, since it generally contains all of those layers. The CNN's parameters are not specified, since we are doing a parametric estimation.

#### *5.1. OOM Execution Time*

#### 5.1.1. Pooling Layer

Our analysis starts from the Pooling layer. As stated in Section 4.3, Pooling is made of a simple multiplexed comparator. The input scanning ends when all the inputs have been considered, that is, after the content of an entire pooling window has been evaluated. This value is multiplied by the total number of pixels of the resulting OFMAP, obtaining

$$\text{Pool}\_{\text{time}} \approx D\_{\text{out}(\text{pool})}^2 \times \left(\mathcal{W}\_{\text{x}(\text{pool})} \times \mathcal{W}\_{\text{y}(\text{pool})}\right) \times t\_{\text{ck}} = D\_{\text{out}(\text{pool})}^2 \times \left(\mathcal{W}\_{\text{x}(\text{pool})}^2\right) \times t\_{\text{ck}} \tag{18}$$

where *D*<sup>2</sup><sub>*out*(*pool*)</sub> is the pooled OFMAP size. The worst-case filter dimension is set to *W*<sup>2</sup><sub>*x*</sub> for both the convolutional and pooling layers.
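As a minimal sketch of Equation (18), the snippet below models the multiplexed comparator as a loop that consumes one input per clock cycle and then scales the per-window cost by the number of OFMAP pixels. The one-input-per-cycle behaviour and the function names are assumptions made for illustration.

```python
def max_pool_window(window):
    """Scan one pooling window with a single comparator, one input per cycle."""
    best = window[0]
    cycles = 1  # the first input is loaded into the running-maximum register
    for value in window[1:]:
        if value > best:
            best = value
        cycles += 1
    return best, cycles


def pool_cycles(d_out, w_x, w_y=None):
    """Approximate cycle count of Equation (18): D_out^2 * (W_x * W_y)."""
    w_y = w_x if w_y is None else w_y
    return d_out * d_out * w_x * w_y


# Illustrative values: a 2 x 2 window and a 12 x 12 pooled OFMAP.
print(max_pool_window([3, 7, 1, 5]))
print(pool_cycles(d_out=12, w_x=2))
```

Multiplying the cycle count by the clock period gives the pooling time of Equation (18).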

#### 5.1.2. Convolutional Layer

At the beginning of the convolution algorithm, the binary inputs are precharged inside the Binary Input RF and the **K** matrix is computed in the meantime, meaning that *W*<sup>2</sup><sub>*x*</sub> clock cycles are required for each input set. Since an entire OFMAP has a number of pixels equal to *D*<sup>2</sup><sub>*out*(*conv*)</sub>, this step requires a total of *D*<sup>2</sup><sub>*out*(*conv*)</sub> × *W*<sup>2</sup><sub>*x*</sub> clock cycles. After that, convolution is performed: considering Figure 7, an entire convolutional window is computed when all the XNOR outputs have been scanned. The number of XNOR gates is equal to the word length of the Binary Input RF, which is *W*<sup>2</sup><sub>*x*</sub>. By multiplying the time required by a convolutional window computation by the total number of convolutional windows *D*<sup>2</sup><sub>*out*(*conv*)</sub>, we get the total convolution time, Convolution*time*,*OOM*. The last set of contributions comprises the BatchNorm, which is applied after each convolutional window, and the *α* computation together with the storing of the results. Each of them takes only 1 clock cycle. The convolutional layer execution time with 1 input/output feature map can then be derived as follows.

$$\text{Convolution}\_{time,OOM} \approx \left( \overbrace{D^2\_{out(conv)} \times W^2\_x}^{\text{Store inputs \& K computation}} + \overbrace{D^2\_{out(conv)} \times (W^2\_x + 1)}^{\text{Convolution \& BatchNorm}} + \overbrace{(1 + 1)}^{\alpha \text{ \& Store results}} \right) \times t\_{ck} \tag{19}$$

When multiple output channels are considered, the computation of the convolution windows has to be repeated for each OFMAP:

$$\begin{aligned} \text{Convolution}\_{time,OOM} & \approx \overbrace{D^{2}\_{out(conv)} \times W^{2}\_{x}}^{\text{Store inputs \& K computation}} \times t\_{ck} \\ &+ C\_{out} \times \left( \overbrace{D^{2}\_{out(conv)} \times (W^{2}\_{x} + 1)}^{\text{Convolution \& BatchNorm}} + \overbrace{(1 + 1)}^{\alpha \text{ \& Store results}} \right) \times t\_{ck} \end{aligned} \tag{20}$$

The last situation is the multiple input/output channels case. Since the convolution operation is parallelized, the convolutional windows coming from each XNOR-Pop Unit are added in a serial fashion. This means that, to obtain the final convolution value, all the contributions have to be added together before executing the BatchNorm. The final Convolution*time* expression is the following.

$$\begin{aligned} \text{Convolution}\_{time,OOM} & \approx \overbrace{D^{2}\_{out(conv)} \times W^{2}\_{x}}^{\text{Store inputs \& K computation}} \times t\_{ck} \\ &+ C\_{out} \times \left( \overbrace{D^{2}\_{out(conv)} \times (W^{2}\_{x} + 1 + C\_{in})}^{\text{Convolution \& Sum multiple } C\_{in}} + \overbrace{(1+1)}^{\alpha \text{ \& Store results}} \right) \times t\_{ck} \end{aligned} \tag{21}$$
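Equation (21) can be evaluated directly as a cycle count. The sketch below is only a transcription of the approximate expression (idle states are neglected, as in the text); the function name and the parameter values in the example are illustrative assumptions, not figures taken from the tables.

```python
def conv_cycles_oom(d_out, w_x, c_in=1, c_out=1):
    """Approximate OOM convolution cycles, transcribed from Equation (21)."""
    # Precharge of the Binary Input RF, overlapped with the K computation.
    store_inputs = d_out**2 * w_x**2
    # Per OFMAP: window scan (W_x^2), BatchNorm (1) and serial sum over C_in,
    # repeated for every output pixel, plus alpha and result storing (1 + 1).
    per_ofmap = d_out**2 * (w_x**2 + 1 + c_in) + (1 + 1)
    return store_inputs + c_out * per_ofmap


# Illustrative parameters: 24 x 24 OFMAP, 5 x 5 filter, 1 input, 6 outputs.
cycles = conv_cycles_oom(d_out=24, w_x=5, c_in=1, c_out=6)
print(cycles)  # multiply by t_ck to obtain the execution time
```

Note that, as written, Equation (21) with a single input channel counts one more cycle per window than Equation (19), because the serial accumulation term is always included.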

#### 5.1.3. FC Layer

For the FC computation, we have to consider the scheduling explained in Section 4.1.3. As in the convolution case, the algorithm starts by precharging the inputs inside the array, taking *D*<sub>*out*(*FC*)</sub> clock cycles, where *D*<sub>*out*(*FC*)</sub> is the total number of output neurons. Since the width of the Binary Input RF is Memory size<sub>*x*</sub>, only Memory size<sub>*x*</sub> input neurons are considered at a time; so, as for the convolutional layer, the time required for an FC output is equal to *D*<sub>*out*(*FC*)</sub> × Memory size<sub>*x*</sub>, which has to be added to the previous contribution. The FC results have to be stored, which is done by scanning the content of the Store temp register (depicted in Figure 12), taking *D*<sub>*out*(*FC*)</sub> clock cycles. The execution time for a single step of the FC scheduling is given by:

$$\text{FC}\_{\text{time,OOM}} \approx \left( \overbrace{D\_{\text{out(FC)}}}^{\text{Store inputs}} + \overbrace{D\_{\text{out(FC)}} \times \text{Memory size}\_x}^{\text{FC output computation}} + \overbrace{D\_{\text{out(FC)}}}^{\text{Store temp scanning}} \right) \times t\_{ck} \tag{22}$$

This partial computation has to be repeated for the total number of iterations (*niter*) required to calculate the FC layer. The final FC execution time expression is:

$$\text{FC}\_{\text{time,OOM}} \approx \left[ n\_{\text{iter}} \times \left( \overbrace{D\_{\text{out(FC)}}}^{\text{Store inputs}} + \overbrace{D\_{\text{out(FC)}} \times \text{Memory size}\_{x}}^{\text{FC output computation}} \right) + \overbrace{D\_{\text{out(FC)}}}^{\text{Store temp scanning}} \right] \times t\_{\text{ck}} \tag{23}$$
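A short numerical sketch of Equations (22) and (23) follows. It only transcribes the approximate cycle count; the function name is hypothetical and the example values (196 output neurons, 14-bit words) are used purely for illustration.

```python
def fc_cycles_oom(d_out, memory_size_x, n_iter=1):
    """Approximate OOM FC cycles; n_iter=1 reproduces Equation (22)."""
    store_inputs = d_out                  # precharge of the input array
    fc_output = d_out * memory_size_x     # serial scan of every stored word
    store_temp_scan = d_out               # final scan of the Store temp register
    return n_iter * (store_inputs + fc_output) + store_temp_scan


single_step = fc_cycles_oom(d_out=196, memory_size_x=14)
full_layer = fc_cycles_oom(d_out=196, memory_size_x=14, n_iter=14)
print(single_step, full_layer)
```

The dominant term is the product *D*<sub>*out*(*FC*)</sub> × Memory size<sub>*x*</sub>, which is paid at every iteration of the scheduling.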

#### *5.2. LiM Execution Time*

Similarly to the OOM case, the Pooling, Convolution and FC execution times are provided and explained. Since the Pooling layer is the same in both cases, it is not analyzed in this part.

#### 5.2.1. Convolutional Layer

As in the OOM case, the array has to be precharged, taking *D*<sup>2</sup><sub>*out*(*conv*)</sub> clock cycles. After that, all the XNOR gates inside the XNOR part work at the same time, and the Interface Decoder, depicted in Figure 13, takes each XNOR result one by one and provides it to the ones counter. When all the XNOR outputs have been scanned, after *W*<sup>2</sup><sub>*x*</sub> clock cycles, the ones counter results are stored inside the LiM ones counter reported in Figure 15. At this point, all the LiM ones counter values must be fetched for each input channel, requiring *Cin* × *D*<sup>2</sup><sub>*out*</sub> clock cycles to perform the remaining part of the algorithm. The final formula for the LiM convolution execution time is

$$\begin{aligned} \text{Convolution}\_{time, LiM} &\approx \overbrace{D^2\_{out(conv)} \times W^2\_x}^{\text{Store inputs \& K computation}} \times t\_{ck} \\ &+ C\_{out} \times \left[ \overbrace{W^2\_x + D^2\_{out(conv)} \times (1 + \#C\_{in})}^{\text{Convolution \& Sum multiple } C\_{in}} + \overbrace{(1 + 1)}^{\alpha \text{ \& Store results}} \right] \times t\_{ck} \end{aligned} \tag{24}$$
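The LiM counterpart can be transcribed in the same way. The sketch below follows Equation (24) term by term; the function name and the example parameters are illustrative assumptions, and idle states are neglected as in the text.

```python
def conv_cycles_lim(d_out, w_x, c_in=1, c_out=1):
    """Approximate LiM convolution cycles, transcribed from Equation (24)."""
    # Precharge of the LiM array, overlapped with the K computation.
    store_inputs = d_out**2 * w_x**2
    # Per OFMAP: all windows are scanned in parallel (W_x^2 cycles in total),
    # then every pixel needs BatchNorm (1) and the serial fetch over C_in,
    # followed by alpha and result storing (1 + 1).
    per_ofmap = w_x**2 + d_out**2 * (1 + c_in) + (1 + 1)
    return store_inputs + c_out * per_ofmap


# Illustrative parameters: 24 x 24 OFMAP, 5 x 5 filter, 1 input, 6 outputs.
print(conv_cycles_lim(d_out=24, w_x=5, c_in=1, c_out=6))
```

The window scan cost *W*<sup>2</sup><sub>*x*</sub> is paid once per OFMAP instead of once per output pixel, which is where the parallelism of the array shows up in the expression.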

#### 5.2.2. FC Layer

Similarly to the OOM case, we have scheduled the algorithm to reduce the complexity. After the *D*<sub>*out*(*FC*)</sub> clock cycles required to store the values inside the LiM array, an entire FC step is computed in Memory size<sub>*x*</sub> clock cycles and the final results are scanned from the LiM ones counter in *D*<sub>*out*(*FC*)</sub> cycles. In the LiM architecture, the Store temp register file is not required, since the pop-count values are already stored in the LiM part. By iterating the entire algorithm *niter* times, we get the final FC execution time:

$$\text{FC}\_{\text{time},\text{LiM}} \approx \left[ n\_{\text{iter}} \times \left( \overbrace{D\_{\text{out}(\text{FC})}}^{\text{Store inputs}} + \overbrace{\text{Memory size}\_{\text{x}}}^{\text{FC output computation}} \right) + \overbrace{D\_{\text{out}(\text{FC})}}^{\text{Store temp scanning}} \right] \times t\_{\text{ck}} \tag{25}$$
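Dividing Equation (23) by Equation (25) gives a compact view of the FC speed-up. The following is only a sketch based on the approximate cycle counts: it assumes the same clock period for both designs and uses the shorthand *D* = *D*<sub>*out*(*FC*)</sub> and *M* = Memory size<sub>*x*</sub>, introduced here for brevity.

$$\frac{\text{FC}\_{\text{time,OOM}}}{\text{FC}\_{\text{time,LiM}}} \approx \frac{n\_{\text{iter}} \times D \times (1 + M) + D}{n\_{\text{iter}} \times (D + M) + D} \;\xrightarrow{\; n\_{\text{iter}} \gg 1 \;}\; \frac{D \times (1 + M)}{D + M}$$

For example, with *D* = 196 and *M* = 14 the limit evaluates to 196 × 15/210 = 14, so under these assumptions the gain grows with both the number of output neurons and the word length of the array.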

#### *5.3. Comparison Results*

The results obtained by taking the ratio between the OOM and LiM execution times are now provided. Sections 5.1 and 5.2 give an approximate computation of the execution time, since the overheads of idle/dummy states were not considered for the sake of simplicity. In this part, we show the actual estimations, which consider all the contributions. Figure 17 shows how the delay ratio (obtained as execution time OOM/execution time LiM) changes in different cases. A series of sweeps was made over the most important variables, in particular #*Cin*, #*Cout*, *Wx*, *Din*, *Dout*(*fc*) and *niter*. The vertical axis of every plot reports the delay ratio. Some of the estimations were performed considering the convolution timing expressions reported in Equations (21) and (24); these plots are labelled with the "Convolution computation" flag in Figure 17. The remaining ones consider the FC delay expressions reported in Equations (23) and (25).

**Figure 17.** Delay ratio obtained as OOM/LiM for different parameters, in order to see how the two architectures behave for different cases.

• Delay ratio vs #Cin & Wx: the Delay ratio with respect to #*Cin* has a decreasing trend because, as shown in Figure 15, the Interface Decoder, the multiplexers placed after the LiM ones counter and the serial accumulation of the values of each channel represent a bottleneck for LiM architecture. As a result a higher execution time for higher values of #*Cin* is observed. In general, for high values of *Wx*, the Delay ratio increases, because of the parallelization in LiM architecture.


From these considerations, it is evident that the LiM architecture introduces a gain in terms of execution time, because by increasing the level of parallelism in the architecture, multiple operations can be performed at the same time. The LiM bottlenecks are the Interface Decoder and the multiplexers depicted in Figure 15, which introduce both higher delay and higher power consumption, but are required to interface the design blocks.
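The trends with respect to #*Cin* and *Wx* can also be read from the approximate expressions. As a sketch, assuming the same clock period for both designs and writing *D* for *D*<sub>*out*(*conv*)</sub>, subtracting Equation (24) from Equation (21) leaves a difference that does not depend on the number of input channels, so the ratio can be written as:

$$\frac{\text{Convolution}\_{time,OOM}}{\text{Convolution}\_{time,LiM}} \approx 1 + \frac{C\_{out} \times W\_x^2 \times (D^2 - 1)}{D^2 \times W\_x^2 + C\_{out} \times \left[ W\_x^2 + D^2 \times (1 + C\_{in}) + 2 \right]}$$

The numerator is independent of *Cin* while the denominator grows linearly with it, so the ratio decreases towards 1 for a large number of input channels. For increasing *Wx*, the fraction has the form *aW*<sup>2</sup>/(*bW*<sup>2</sup> + *c*), which increases monotonically and saturates, so that the ratio tends to:

$$\lim\_{W\_x \to \infty} \frac{\text{Convolution}\_{time,OOM}}{\text{Convolution}\_{time,LiM}} \approx \frac{D^2 \times (1 + C\_{out})}{D^2 + C\_{out}}$$

These limits hold only for the simplified Equations (21) and (24); the plotted results also include the idle/dummy state overheads.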

#### **6. Performance Evaluation**

In this part, the evaluation steps will be explained. In this work the memories were implemented as register files and each memory cell is a flip flop, so the results obtained are an overestimation (especially for the LiM case). The real performance values can be obtained with a more precise memory model. The performance evaluation is made of three parts:


3. An analysis of the differences between our LiM, where the memory elements are flip flops, and a LiM circuit with a custom memory is performed. In [8], a very similar XNOR-Net has been implemented with a CAM memory-based XNOR-Pop procedure. That work provides some useful results, since the authors implemented a modified memory array in 65 nm CMOS technology. For this reason, a synthesis with 65 nm CMOS technology @ 1.0 V is performed, using the same metrics as [8] where possible, to evaluate how a more realistic memory model can influence the results obtained.

#### *6.1. Two NN Models Examined*

#### 6.1.1. Fashion-MNIST CNN Results

The first NN model classifies Fashion-MNIST images [33] with an accuracy of 81%. Each image is a greyscale picture of 28 × 28 pixels that belongs to one of 10 categories: T-shirts, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags and ankle boots. The NN model is reported in Table 1. The parameters listed in Table 1 determine the dimensions of the hardware implementations, such as the dimensions Memory size<sub>*x*</sub> and Memory size<sub>*y*</sub> of the Binary Input RF/LiM arrays, so that an ad-hoc synthesis optimized for that NN model can be performed. The Binary Input RF, LiM XNOR part and LiM ones counter have Memory size<sub>*y*</sub> = 24 × 24 = 576 rows with a bitwidth of Memory size<sub>*x*</sub> = 32 bits to host both the convolution and FC algorithms. Since the maximum number of input channels is 6, as reported in Table 1, the XNOR-Pop Unit for both OOM and LiM has been replicated 6 times.


**Table 1.** Fashion-MNIST CNN under test parameters.

The extra calculations required (such as multiplications, BatchNorm, etc.) are performed on 18 bits, expressed in fixed-point format. After performing the synthesis with Synopsys Design Compiler, the results obtained for area, CPD and power are reported in Table 2.

**Table 2.** Synthesis results for the Fashion-MNIST CNN model.


A preliminary analysis of the results listed in Table 2 highlights a higher area and power consumption in LiM with respect to the OOM alternative, since the LiM implementation is highly parallelized and, consequently, a higher number of logic elements is required. The CPD is slightly higher in the OOM case because of a more complicated FC scheduling handling circuit (depicted in Figure 12), which requires the Store temp register file. These results show that a simple comparison of power, area and CPD is not enough to determine which of the two proposed architectures is the best. For this reason, we also estimate the execution time and the energy, obtained as Power × Execution time, as shown in Table 3.


**Table 3.** Power-Execution time-Energy results for Fashion-MNIST CNN model.

From Table 3, we can derive two very important values which are the Energy ratio and Delay ratio as follows

$$\text{Delay ratio} = \frac{\text{Execution time}\_{OOM}}{\text{Execution time}\_{LiM}} = \frac{0.92 \,\text{ms}}{0.21 \,\text{ms}} \simeq 4.38 \tag{26}$$

$$\text{Energy ratio} = \frac{\text{Energy}\_{OOM}}{\text{Energy}\_{LiM}} = \frac{178.41 \,\text{μJ}}{53.44 \,\text{μJ}} \simeq 3.34 \tag{27}$$
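The two ratios can be cross-checked with a few lines of arithmetic. The sketch below uses only the execution times and energies quoted in Equations (26) and (27); the average power values it prints are back-computed from those rounded figures as Energy/Execution time, so they are indicative only and are not the synthesis results of the tables.

```python
# Values quoted in Equations (26) and (27).
time_ms = {"OOM": 0.92, "LiM": 0.21}
energy_uj = {"OOM": 178.41, "LiM": 53.44}

delay_ratio = time_ms["OOM"] / time_ms["LiM"]
energy_ratio = energy_uj["OOM"] / energy_uj["LiM"]

# Energy = Power x Execution time, hence Power = Energy / time (uJ / ms = mW).
implied_power_mw = {name: energy_uj[name] / time_ms[name] for name in time_ms}

print(round(delay_ratio, 2), round(energy_ratio, 2))
print({name: round(p, 1) for name, p in implied_power_mw.items()})
```

The back-computed power is higher for LiM than for OOM, yet the energy is lower, because the shorter execution time more than compensates for the higher power.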

The energy and delay ratios give very important indications on the strong points of LiM: it consumes less energy and it is faster, although it has a higher power value. From an energy point of view, the LiM architecture is more efficient for this particular NN model. Figure 18 and Table 4 report the results obtained from a post place&route estimation with .vcd backannotation. Considering both switching activity and interconnections, the resulting power of the LiM architecture increases by ∼22%, leading to a lower energy ratio, but still promoting LiM as an energy-efficient architecture.

$$\text{Energy ratio}\_{\text{post place and route}} = \frac{130.91 \,\upmu\text{J}}{68.9 \,\upmu\text{J}} \simeq 1.9\tag{28}$$

**Figure 18.** Snapshot of the chips obtained after a post place and route procedure for Fashion-MNIST CNN.

**Table 4.** Post place and route estimation of Fashion-MNIST CNN implementation.


#### *6.2. MNIST-MLP Network*

Another NN model is evaluated, in order to verify the behavior of both LiM and OOM architectures with different computational models. The realized MLP network is made of a set of FC layers organized as 784-196-196-10 neurons and it achieves up to ∼90% accuracy on the MNIST dataset. Further details on the MLP structure are presented in Table 5. The Dropout layers indicated in Table 5 are useful in the training procedure, since they prevent the network from overfitting by simply "turning off" neurons with a given probability [34]. As in Section 6.1.1, the results presented are the ones obtained from the synthesis and the estimated value of energy, based on the execution time. The chosen dimensions of the implementation are Memory size<sub>*x*</sub> = 14 and #*Cin* = 1, while the memory arrays have 196 rows because the maximum number of output neurons (*D*<sub>*out*(*FC*)</sub>) is 196.


**Table 5.** MNIST MLP under test parameters.

In Table 6, the architectures have a similar power consumption, because OOM requires a Store temp register file that has 196 rows. The hardware complexity of the LiM implementation is not so different from that of the OOM one, because the memory arrays have a very small size of 196 × 14. Since it is an MLP network, the *niter* parameter becomes very important, because it indicates how many times the FC scheduling has to be executed for each layer: *niter* changes for each FC layer and can be obtained as *niter* = *Din*(*FC*)/Memory size<sub>*x*</sub>. From the energy and execution time results in Table 6, it is evident that OOM is not competitive with the LiM version. This is due to the much less efficient FC handling, since the whole Binary Input RF has to be scanned, while the LiM version performs all the calculations directly inside the array.

$$\text{Delay ratio} = \frac{1.62 \,\text{ms}}{0.132 \,\text{ms}} \simeq 12.27 \,\tag{29}$$

$$\text{Energy ratio} = \frac{23.20 \,\upmu\text{J}}{1.99 \,\upmu\text{J}} \simeq 11.7 \tag{30}$$
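As a rough cross-check of the order of magnitude of Equation (29), the 784-196-196-10 network can be mapped onto the approximate expressions of Equations (23) and (25) with Memory size<sub>*x*</sub> = 14. The sketch below is a simplification made for illustration: it ignores idle states, assumes the same clock period for both designs, ignores any detail of Table 5 beyond the layer sizes, and rounds *niter* up when the division is not exact (an assumption, since all divisions are exact here).

```python
import math

LAYERS = [784, 196, 196, 10]   # neurons per layer, as stated in the text
MEMORY_SIZE_X = 14


def n_iter(d_in, memory_size_x):
    """n_iter = D_in(FC) / Memory size_x; ceil is assumed for inexact cases."""
    return math.ceil(d_in / memory_size_x)


def fc_cycles(d_out, memory_size_x, iters, in_memory):
    """Equation (25) when in_memory is True, Equation (23) otherwise."""
    fc_output = memory_size_x if in_memory else d_out * memory_size_x
    return iters * (d_out + fc_output) + d_out


iterations = [n_iter(d_in, MEMORY_SIZE_X) for d_in in LAYERS[:-1]]
oom_cycles = sum(fc_cycles(d_out, MEMORY_SIZE_X, n, False)
                 for d_out, n in zip(LAYERS[1:], iterations))
lim_cycles = sum(fc_cycles(d_out, MEMORY_SIZE_X, n, True)
                 for d_out, n in zip(LAYERS[1:], iterations))

print(iterations, oom_cycles, lim_cycles, round(oom_cycles / lim_cycles, 2))
```

The resulting cycle ratio is of the same order as the reported delay ratio; an exact match is not expected, because the reported figure includes the overheads and the actual clock periods of the two designs.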

After performing post place and route estimation, the results obtained are reported in Figure 19 and in Table 7.

**Table 6.** Performance parameters of MLP implementation.


**Figure 19.** Snapshot of the chips obtained after a post place and route procedure for MLP NN.

**Table 7.** Post place and route estimation of MLP implementation.


In Table 7, the power results are slightly lower than the synthesis ones, since the .vcd file and the switching activity information give a more precise power estimation than the worst case reported in Table 6. The energy ratio is equal to ∼10×, compared to the ∼11.7× obtained from the synthesis.

#### *6.3. Parametric Sweeps*

The meaningful parameters of the designs, such as #*Cin* and the memory array dimensions (Memory size<sub>*x*</sub>, Memory size<sub>*y*</sub>), were varied to determine the differences between the two architectures in terms of performance. Two parameters are chosen at a time and a sweep is executed on them, while the remaining ones are kept constant. For the sake of clarity, from now on the following substitution is considered:

$$\begin{cases} H = \text{Memory size}\_y^2\\ W\_a = \text{Memory size}\_x \end{cases} \tag{31}$$

Figures 20 and 21 depict the Power, Area and CPD for different values of #*Cin*, *Wa* and √*H*. By increasing *Wa*, power and area increase almost quadratically, since *Wa* directly influences the bitwidth of the memories. The trends depending on √*H* also behave quadratically, meaning that the memory complexity strongly influences the performance of both architectures. In general, the power and area for the LiM case are slightly higher than the OOM ones, since the total number of logic elements required by the LiM implementation is greater than in OOM. The CPD remains almost the same, even for more complex implementations. To better understand the differences between the parameters obtained for the two architectures, a ratio was computed for all the cases: the results are reported in Figure 22, where in general, for increasing #*Cin*, *Wa* and √*H*, the power and area ratios decrease, confirming the higher complexity of the LiM. Another useful estimation can be performed on the energy ratios for the various cases.

**Figure 20.** Power, Area and CPD results for different values of #*Cin*, *Wa* and √*H* considering LiM implementation.

**Figure 21.** Power, Area and CPD results for different values of #*Cin*, *Wa* and √*H* considering OOM implementation.

**Figure 22.** Power, Area and CPD ratios with respect to #*Cin*, *Wa* and <sup>√</sup>*H*.

The energy ratios, obtained as Energy*OOM*/Energy*LiM* and shown in Figure 23, decrease for higher memory dimensions, since the power of the LiM architecture starts to assume a predominant contribution in the energy equation. It is important to keep in mind that these are very pessimistic estimations, and they can be improved by employing more realistic memory cells. The pessimistic case, which is reported in Figure 23 in the #*Cin* & √*H* size plot, is a very long (large √*H*) and narrow (very small *Wa*) memory structure that is replicated many times (large #*Cin*): this set of conditions describes an improbable situation, because the driving force for a memory design is to have a regular, square-shaped array. The last energy estimation reported in Figure 23, flagged by FC #*Cin* & √*H* size, takes into account an FC algorithm mapped on the implementations, considering the worst case of large √*H* and #*Cin*. By varying √*H*, the trend of the energy ratio is increasing, meaning that the more complex the FC algorithm is, the lower the energy of the LiM implementation is compared to the OOM one.

**Figure 23.** Energy ratio values obtained by varying #*Cin*, *Wa* and √*H*.

#### 6.3.1. Qualitative Estimation

To give a definitive answer on which architecture performs better, a qualitative estimation is performed, considering the mean values of all the cases explained before.

A ratio obtained as OOM/LiM is proposed for each parameter, which clarifies the main points of both implementations. As shown in Figure 24, the values of the area and power ratios are below 1, meaning that in general the LiM architecture behaves worse than OOM on these metrics, for the reasons explained before. On the other hand, the execution time and energy ratios are equal to ∼6× and ∼4× respectively, implying that a very good improvement can be achieved by the LiM implementation on these two quantities. These trends confirm our expectations on LiM, and further improvements can be achieved by having a more precise LiM array model.

**Figure 24.** Mean performance ratios obtained as an average of all the cases analyzed from the previously discussed results.

#### 6.3.2. LiM Array Estimation: Impact on Performance

In order to estimate the performance of the LiM array, several syntheses were performed with different array dimensions. Taking into account the system structure depicted in Figure 15, the LiM power and area values are compared with the ones obtained by applying the same process only to the SurroundingLogic unit, in order to understand which are the main performance contributions. Figure 25 shows the performance values obtained by sweeping both √*H* and *Wa*, while #*Cin* is kept equal to 1, in order to estimate how the array sizes impact the overall performance. Area and power increase almost quadratically, because of the more complex LiM structure. Figure 26 reports an estimation of the SurroundingLogic unit obtained by varying the same parameters as in Figure 25. The CPD bottleneck is located in the SurroundingLogic unit rather than in the LiM parts, because of the multipliers/adders employed to compute the final convolution result. Power and area stay constant for higher values of *Wa*, since there is no correlation between the LiM Memory size<sub>*x*</sub> and the complexity of the SurroundingLogic unit. By increasing √*H*, power and area increase because of the higher complexity required, for example a larger **K** register file (Figure 12). By comparing the power values obtained in Figures 25 and 26, it is possible to see that the highest contribution comes from the LiM parts, as shown in the breakdown plot depicted in Figure 27. The percentage values are obtained following a rough approach: the total power/area is computed as the sum of the results in Figures 25 and 26, and the LiM power/area is then divided by the total. For bigger arrays, the LiM parts assume a predominant contribution to the power/area. This behavior underlines the need for a more accurate LiM model, instead of the discussed one based on flip flops and static logic gates.

**Figure 25.** LiM performance estimations by varying *Wa* and √*H* sizes. #*Cin* is kept equal to 1.

**Figure 26.** SurroundingLogic unit performance estimations by varying *Wa* and √*H* sizes. #*Cin* is kept equal to 1.

**Figure 27.** Power and area breakdown of LiM parts.

#### *6.4. A More Detailed LiM Model*

Reference [8] proposes a very similar approach, but it performs a Content Addressable Memory (CAM)-based XNOR-Pop procedure, implementing the second convolutional layer of the LeNet5 NN model [1], which is depicted in Figure 3. Five arrays are realized, each with a dimension of 30 × 10. They have been implemented in 65 nm CMOS technology: the performance results are reported in Table 8. To have a fair comparison with [8], the same conditions have been applied to our LiM design: only the XNOR-Pop part, reported in Figure 15, is synthesized in 65 nm technology with a dimension of 30 × 10 for the LiM XNOR part array. To obtain the energy estimation, we started from the power result given by Synopsys and mapped the second convolutional layer of the LeNet5 CNN, obtaining the corresponding execution time, called Convolution*time*,II-LeNet5, using the more precise version of Equation (24).

$$\text{Convolution}\_{time, \text{II} \cdot \text{LeNet5}} = 15852 \times t\_{ck} \tag{32}$$

The power obtained by Synopsys refers to only 1 LiM array, so this value has to be multiplied by five:

$$\text{Power}\_{\text{5-arrays}} = 0.2473 \,\text{mW} \times 5 \approx 1.24 \,\text{mW} \tag{33}$$

From the synthesis, CPD for the LiM array is equal to 1.91 ns, so the total energy is:

$$\text{Energy}\_{\text{II-LeNet5}} = \text{Power}\_{\text{5-arrays}} \times \text{Convolution}\_{\text{time,II-LeNet5}} \approx 38 \text{ nJ} \tag{34}$$
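The arithmetic behind Equations (32) to (34) can be retraced as follows. The sketch uses only the numbers quoted in the text, taking the clock period equal to the reported CPD of 1.91 ns; the variable names are illustrative.

```python
CYCLES = 15852               # Equation (32), in clock cycles
T_CK_NS = 1.91               # CPD of the LiM array from the synthesis
POWER_ONE_ARRAY_MW = 0.2473  # power of a single LiM array
N_ARRAYS = 5

power_mw = POWER_ONE_ARRAY_MW * N_ARRAYS   # Equation (33)
time_ns = CYCLES * T_CK_NS                 # execution time of the layer
energy_nj = power_mw * 1e-3 * time_ns      # W x ns = nJ, Equation (34)

print(round(power_mw, 2), round(time_ns / 1000, 2), round(energy_nj, 1))
```

The unrounded product lands slightly below the quoted ≈ 38 nJ; the small difference is compatible with rounding of the intermediate values.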

We can compare our less detailed LiM model, based on flip flops, with the case described in Table 8: the energy ratio between our work and the reference one is about 4.22, while the Bank Area ratio is almost equal to 4.92. This means that, if we design a custom memory instead of relying on flip flops, the performance of our architecture can be greatly improved. Even with this limitation, the results presented here highlight that LiM architectures have a clear advantage over traditional Von Neumann circuits in terms of energy and overall execution speed.


**Table 8.** CAM-based XNOR-Pop [8] and our LiM architectures performance parameters comparison.

#### **7. Conclusions and Future Works**

In this work, LiM and OOM architectures have been designed to determine whether a logic-in-memory approach is effectively better than a Von Neumann one in designing architectures for memory-intensive applications. From the results highlighted here, the LiM design obtains remarkable results in terms of energy dissipation, because of a higher degree of parallel execution of the algorithm. Since the memory part of our designs was synthesized with Synopsys, the results that we obtained are overestimated, meaning that the energy can be significantly smaller with a proper memory design. We can therefore conclude that Logic-in-Memory architectures are worth it. Even considering the increased complexity of the memory design, they provide significant advantages over Von Neumann architectures.

As future work, we are designing custom memories, based both on CMOS and, eventually, on emerging technologies, to further improve our analysis.

**Author Contributions:** Conceptualization, A.C, M.V. and G.T.; methodology, A.C.; software, A.C.; validation, A.C.; formal analysis, A.C.; investigation, A.C.; resources, A.C.; data curation, A.C.; writing—original draft preparation, A.C.; writing—review and editing, G.T. and M.V.; visualization, M.V.; supervision, M.V.; project administration, M.V. and G.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
