#### *7.1. Logic Resource Utilization and Reconfiguration Times*

Table 3 contains the resource utilization of the static system and of each PE in the BbNN implementation shown in Figure 15. The static system includes the BbNN controller, the fine-grain reconfiguration engine and an empty RR where the BbNN PE blocks can be allocated at run-time. Each PE uses 473 LUTs, 163 flip-flops and 1 DSP. As the PE can be implemented in different reconfigurable regions, the percentage of used resources varies slightly among RRs. The size of the partial bitstreams also depends on the region where the PE is implemented: the PE bitstream ranges from 21.8 kB to 26.9 kB depending on the reconfigurable region, while the input module bitstream is 5.8 kB.

As shown in Figure 14, the PE can adopt three different configurations, which have to be implemented as three different RMs. To implement each possible BbNN size, we have to analyze all the possible locations where the RMs can be allocated. The bypass RMs can only be allocated in the inner regions of the RR, whereas the edge RMs (e.g., west and east PEs) can be placed in every region except the opposite edge regions (i.e., a west RM cannot be allocated on the east side of the RR). With the Xilinx reconfiguration flow, one partial bitstream must be generated for each RM location, which would result in 33 partial bitstreams (12 × 2 for the edge RMs and 9 for the bypass RM) to cover all possible combinations. Generating all these combinations is avoided thanks to the flexibility provided by IMPRESS, which allows relocating one partial bitstream to any compatible region (i.e., a region with the same resource distribution). In the BbNN implementation shown in Figure 15, the total number of partial bitstreams needed to generate all the possible combinations is thus reduced to 9.
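As a reading aid, the bitstream counting above can be reproduced as follows. The 3-row by 5-column region grid and the three compatibility classes are our assumptions, chosen because they match the counts quoted in the text:

```python
# Sketch of the partial bitstream counting, with and without relocation.
# The 3x5 region grid and the three compatibility classes are assumptions.
ROWS, COLS = 3, 5

west = ROWS * (COLS - 1)      # 12: anywhere except the east edge column
east = ROWS * (COLS - 1)      # 12: anywhere except the west edge column
bypass = ROWS * (COLS - 2)    # 9: inner columns only

xilinx_flow = west + east + bypass
print(xilinx_flow)            # 33: one bitstream per RM location

# With IMPRESS a bitstream can be relocated to any region with the same
# resource distribution; assuming the regions fall into three such
# compatibility classes, one bitstream per RM type and class is enough.
rm_types, classes = 3, 3
print(rm_types * classes)     # 9
```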


**Table 3.** Resource utilization of the BbNN implemented on a Zynq XC7Z020.

\* Percentage of the resources available in the RR used by the PE.

Table 4 compares the proposed PE implementation with existing state-of-the-art proposals in terms of logic resources. The proposed architecture has the lowest footprint in memory elements and DSPs, thanks to the proposed sigmoid implementation and the strategy of reusing the single DSP across different clock cycles. The price of the dynamic scalability and flexibility of the proposed architecture is a higher utilization of logic elements; this logic overhead is a consequence of the online training feature. A further downside of dynamic partial reconfiguration is that other circuits cannot use the unused resources of the reconfigurable region where the PE is implemented. Table 3 shows that, in our PE proposal, LUTs are the bottleneck, leaving several FFs, DSPs and BRAMs unused.


**Table 4.** Resource utilization per individual PE in comparison with other works in the state-of-the-art.

\* The logic element implementation depends on the selected platform: 4-input LUTs (Virtex-II Pro), 6-input LUTs (Zynq XC7Z020, Stratix III, Virtex V) or 8-input LUTs (Stratix IV). \*\* Resource consumption only provided for an 8 × 16 BbNN; per-PE metrics are approximate.

Table 5 shows the breakdown of the time spent in each operation stage, both during the training and the inference phases. The time the BbNN needs to process a set of inputs depends on its latency. In a 3 × 3 BbNN, the maximum latency is 9 PEs; as each PE needs 7 clock cycles to compute its outputs, the BbNN takes at most 63 clock cycles per computation, i.e., 0.63 µs at 100 MHz. The transfer of inputs to the BbNN is carried out by the fine-grain reconfiguration engine and takes 6.1 µs, while the outputs are read through the AXI interface in 4.3 µs. Therefore, the maximum throughput in the inference phase is 90.66 Kilo Operations Per Second (KOPS), where one operation means completely processing a new set of inputs to obtain the desired output from the BbNN. In the training phase, it is also necessary to configure the BbNN. Configuring every parameter of the BbNN also relies on fine-grain reconfiguration and takes 41.7 µs. The training phase must additionally account for the time the software needs to calculate the fitness and to generate the chromosomes, which is application-dependent; the fitness computation times are reported below for each use case. The whole design operates at 100 MHz, except the fine-grain reconfiguration engine, which works at 175 MHz. While the ICAP configuration port has a nominal frequency of 100 MHz, it has been demonstrated in [50] that it can be overclocked to higher frequencies without malfunction. This overclocking reduces the total time needed to reconfigure the BbNN during evolution.
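For reference, the inference timing figures above can be reproduced with the following back-of-the-envelope computation (all values taken from the text):

```python
# Reproducing the 3x3 BbNN inference timing figures quoted above.
CLK_MHZ = 100
CYCLES_PER_PE = 7
max_latency = 9                                     # PEs on the longest path

compute_us = max_latency * CYCLES_PER_PE / CLK_MHZ  # 63 cycles -> 0.63 us
input_us = 6.1                                      # fine-grain reconfiguration
output_us = 4.3                                     # AXI read

total_us = compute_us + input_us + output_us        # 11.03 us per operation
throughput_kops = 1e3 / total_us                    # ~90.66 KOPS
print(f"{total_us:.2f} us/op -> {throughput_kops:.2f} KOPS")
```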


**Table 5.** Time breakdown for a 3 × 3 BbNN.

Table 6 shows how the BbNN size impacts the time spent in each stage. The reconfiguration times shown in the table are a consequence of how IMPRESS carries out the fine-grain reconfiguration process, which is described next. First, when the evolutionary algorithm commands a change of one parameter in an RM, IMPRESS searches the device column where the parameter is placed and modifies the column configuration accordingly. Once the user has changed all the parameters, the new configuration values are sent to the reconfiguration engine, a hardware component in charge of reconfiguring the selected columns with the new configuration data. The time spent in the first stage depends on the number of parameters that have to be changed, whereas the second stage only depends on the number of columns that have to be reconfigured. In the implementation shown in Figure 15a, the number of frames that have to be reconfigured grows with the BbNN depth; therefore, the BbNN configuration time of a 3 × 3 BbNN increases significantly compared to a 1 × 3 BbNN. However, when increasing the width of the BbNN, the number of frames that have to be reconfigured stays constant, resulting in a more efficient reconfiguration process. Table 6 shows that growing from a 3 × 3 to a 3 × 5 BbNN results in a more efficient reconfiguration, especially for the input data, whose transfer time only increases by 0.5 µs. The increment in the BbNN configuration time is higher because more parameters have to be changed in the first phase of the reconfiguration process.
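To make the two phases concrete, a minimal sketch is given below. The column map and the engine call are hypothetical stand-ins for the IMPRESS internals; only the two-phase structure follows the description above.

```python
# Sketch of the two-phase fine-grain reconfiguration flow described above.
PARAMETER_COLUMN = {}  # parameter name -> device column (hypothetical map)

def update_bbnn_parameters(changes, reconfigure_columns):
    # Phase 1 (software): cost grows with the number of changed parameters.
    touched = {}
    for parameter, value in changes.items():
        column = PARAMETER_COLUMN[parameter]      # locate the device column
        touched.setdefault(column, {})[parameter] = value

    # Phase 2 (hardware): the reconfiguration engine rewrites each touched
    # column once, so the cost grows only with the number of columns. This
    # is why widening the BbNN scales better than deepening it.
    reconfigure_columns(touched)
```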

Table 6 also compares fine-grain reconfiguration (working at 175 MHz) with an AXI lite interface in a non-reconfigurable BbNN operating at 100 MHz. Using fine-grain reconfiguration to configure the BbNN parameters is more efficient than using the AXI lite interface for all three sizes. In contrast, the best option to transfer the input data depends on the number of inputs: fine-grain reconfiguration is preferable with five inputs, while the AXI lite interface is the better option when the BbNN only has three inputs.


**Table 6.** Performance comparison for different BbNN sizes.

#### *7.2. Case Studies*

This section provides three case studies showing how the neuroevolvable hardware system can be adapted to different problems. All the results provided are the average of 100 training processes. The EA finishes when a candidate configuration achieves the goal fitness or when 1000 generations are exceeded. At each generation, 150 candidate configurations (i.e., chromosomes) are evaluated.
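As a reference, a minimal sketch of this evolutionary loop is given below; `evaluate()` and `evolve()` are hypothetical placeholders for the BbNN-based evaluation and the EA operators, which the text does not detail at code level.

```python
# Minimal sketch of the evolutionary loop used in the case studies:
# 150 chromosomes per generation, stopping at the goal fitness or after
# 1000 generations.
POPULATION_SIZE, MAX_GENERATIONS = 150, 1000

def train(population, goal_fitness, evaluate, evolve):
    best = None
    for generation in range(MAX_GENERATIONS):
        scores = [evaluate(chromosome) for chromosome in population]
        best = max(scores)
        if best >= goal_fitness:
            return generation, best          # convergent execution
        population = evolve(population, scores)
    return None, best                        # non-convergent execution
```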

We present one classification problem and two control problems. In the classification problem, each chromosome is evaluated against all the samples in the dataset. In the control problems, a new set of initial states is evaluated at each generation to avoid inconsistent solutions.

#### 7.2.1. Classification Domain: The XOR Problem

The XOR problem involves two inputs and one output, all 1 bit wide. The goal is to evolve the BbNN so that it behaves like a logic XOR gate: if the two inputs have the same value, the output is zero; otherwise, the output must be one.

This problem is solved by using the truth table of the XOR gate as the reference. The selected output of the BbNN is compared with the reference result for each input pair. The fitness function used to evaluate each configuration is based on the mean squared error between the BbNN output and the reference (see Equation (5)); both values are floats. The fitness expresses how accurately the chromosome approximates the binary values of the XOR truth table: a fitness over 0.9 corresponds to a mean squared error below 10% over the four cases of the table, and the problem is considered solved if the achieved fitness is over 0.9.

$$XOR_{fitness} = 1 - \left(\frac{1}{4}\sum_{i=1}^{4}(y_i - y_{real})^2\right) \tag{5}$$
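For illustration, Equation (5) maps directly to the software-side check below; the function and variable names are ours, not part of the authors' implementation.

```python
# Sketch of the XOR fitness in Equation (5), computed over the four rows
# of the truth table.
def xor_fitness(bbnn_outputs):
    """bbnn_outputs: the four float BbNN outputs, in truth-table order."""
    reference = [0.0, 1.0, 1.0, 0.0]   # XOR truth table for 00, 01, 10, 11
    mse = sum((y - y_real) ** 2
              for y, y_real in zip(bbnn_outputs, reference)) / 4
    return 1 - mse                     # solved if fitness > 0.9
```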

Fitness computation for this problem takes 21 µs and is executed once per sample in the batch; as the batch contains four samples, the fitness is computed four times per chromosome. Figure 16 shows the fitness progression along the evolution process and Figure 17 the selected configuration for a 2 × 2 network that solves the problem. It should be noted that the solution uses the links connecting the edges of the network (i.e., the structure is closed as a cylinder). Figure 18 and Table 7 show the influence of the BbNN size on training. Experimental results showed that the minimum BbNN that can solve this task is a 2 × 2 BbNN; BbNNs with more than 2 × 2 elements ease the evolution towards a solution, evaluating fewer configurations.

The four graphs in Figure 18 show the influence of the BbNN size on the convergence of the XOR problem. For each size, the generations needed by the EA to solve the problem are recorded. Each graph covers up to 1000 generations; beyond this value, the execution of the EA is considered non-convergent. From these measurements, it can be concluded that the most convenient BbNN size for XOR is the 4 × 2 BbNN, since it has the lowest rate of non-convergent executions and the highest rate of executions below 100 generations. This BbNN size also has the lowest average number of evaluated chromosomes and generations, as shown in Table 7.

**Figure 17.** Solution for the XOR problem.

**Table 7.** Influence of the BbNN size on the training process for the XOR problem. Average stats from 100 convergent training processes.


If the complexity of the classification problem is unknown at run-time, the dynamic scalability of the BbNN may play an essential role in the search for solutions. This situation is shown for the XOR problem in Figure 19. The system starts by searching for a solution with a 1 × 2 BbNN structure. After a period in which the fitness stalls completely, the system dynamically adds a new row of PEs to the BbNN (at generation 211 in Figure 19). After a few iterations with the new size, the neuroevolutionary system is able to find a solution. The EA does not support populations whose chromosomes encode BbNNs of different sizes, but it can compose a new BbNN architecture and reset the evolution process if the fitness does not show any improvement.
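A minimal sketch of this grow-and-reset policy is given below, under the assumption of a fixed stall threshold (the text does not specify one); the two callbacks are hypothetical placeholders.

```python
# Sketch of the dynamic scalability policy shown in Figure 19: when the
# fitness stalls, a row of PEs is added and the evolution restarts.
STALL_LIMIT = 100  # generations without fitness improvement (assumed)

def maybe_grow(rows, cols, stalled, grow_bbnn, reset_population):
    if stalled >= STALL_LIMIT:
        rows += 1                     # e.g., 1x2 -> 2x2 at generation 211
        grow_bbnn(rows, cols)         # stitch in a new row of PE modules
        reset_population()            # old-size chromosomes are discarded
        stalled = 0
    return rows, stalled
```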

**Figure 18.** Influence of the BbNN size in the convergence of the algorithm for XOR problem. Each graph exposes the data from 100 executions of the EA. The generations needed to achieve a solution are segmented in intervals, from 0 to 1000 generations. Executions over 1000 generations are stopped. Convergence of different BbNN sizes is analyzed: 2 × 2 BbNN (**a**), 3 × 2 BbNN (**b**), 4 × 2 BbNN (**c**) and 5 × 2 BbNN (**d**).

**Figure 19.** Resolution of the XOR problem using the dynamic scalability feature. At generation 211 the Evolutionary Algorithm (EA) increases the BbNN size and resets evolution. At generation 249 the EA converges towards a solution.

#### 7.2.2. Control Domain: Mountain Car

The Mountain Car is a standard control problem. It involves a car whose starting position is at the bottom of a valley. The car must reach the top of the hill on the right, but its engine is not powerful enough to reach the goal position by simply accelerating up the slope. Therefore, the car needs to gain momentum by oscillating from left to right.

A simulation environment for the Mountain Car problem is included in the OpenAI Gym [14] toolkit (Figure 20). The observation space of the environment has two variables: the position and the speed of the car. Both are floats that are transformed to the fixed-point representation before being processed by the network. Three possible actions can be performed on the car: push left, push right or do not push. The force of the engine is constant.

**Figure 20.** OpenAI Mountain Car environment and coordinate system used to determine the position of the car. The hills are generated with the sin(3x) function.

The BbNN is evolved to find a controller for this problem by interacting directly with the environment. The Zynq-7020 SoC FPGA device in which the neuroevolvable hardware system runs has been integrated as a hardware-in-the-loop platform with the OpenAI simulator running on a PC. The evaluation of each candidate circuit is called an episode. Each episode finishes when the car reaches the goal position or after 200 control actions (steps). This limit has been obtained experimentally, after observing that beyond this number of actions the likelihood that an unsuccessful candidate circuit reaches the goal position decreases.
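A minimal sketch of one such episode using the classic Gym step API follows; `bbnn_infer` is a hypothetical placeholder for the call that ships the observations to the Zynq board and reads back the selected action.

```python
# Sketch of one hardware-in-the-loop episode on the Gym Mountain Car
# environment (classic Gym API, circa this paper's Gym version).
import gym

def run_episode(bbnn_infer, max_steps=200):
    env = gym.make("MountainCar-v0")  # actions: 0 left, 1 none, 2 right
    observation = env.reset()         # [position, speed], floats
    steps = 0
    for steps in range(1, max_steps + 1):
        action = bbnn_infer(observation)           # fixed-point BbNN on FPGA
        observation, _, done, _ = env.step(action)
        if done:                      # goal position reached (or time limit)
            break
    env.close()
    return steps, observation[0]      # steps used and final X position
```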

A specific fitness function has been developed for this environment, shown in Equation (6). It rewards circuits able to drive the car close to the desired position with the fewest possible number of steps. The position of the car is its X coordinate according to Figure 20 and can vary in the range (−1.2, 0.5). The fitness expression presents three possible scenarios:


Fitness computation for this problem takes 41 µs. The fitness is computed once at the end of each episode and, hence, once per chromosome. An example of the progression of the fitness along the training is shown in Figure 21.

Table 8 shows the influence of the BbNN size on the training process. We only consider topologies with two columns, since this is the number of observable variables in the environment; the number of rows varies from 1 to 4. First, all the considered BbNNs can solve the problem, even with a single row. However, the size has a direct effect on the performance of the EA: small network architectures need fewer generations to solve the problem, since their chromosomes have fewer parameters to optimize. A 1 × 2 BbNN needs 323 generations on average to converge to a solution, while a 2 × 2 BbNN increases the number of generations needed. Although networks over the 1 × 2 size need more generations to be optimized, they enhance the quality of the solution. A 2 × 2 BbNN achieves a good compromise between best fitness and evaluated circuits.

Figure 22 shows the convergence of the EA for the different BbNN sizes, similarly to the previous problem. In this case, the smallest BbNN architecture converges in 81% of the executions below 100 generations and has a low rate of non-convergent executions. Moreover, this BbNN size presents the lowest average number of generations in Table 8. Therefore, the 1 × 2 BbNN is the most suitable size in this case.

$$fitness = \left(1 - \frac{steps}{200}\right) + \frac{finalPosition}{10} \tag{6}$$
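As a reading aid, Equation (6) translates directly into the following sketch; the names are of our choosing.

```python
# Direct transcription of Equation (6): fewer steps and a final position
# closer to the goal both raise the score.
def mountain_car_fitness(steps, final_position):
    return (1 - steps / 200) + final_position / 10
```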

**Figure 21.** Example of progression of the fitness value during the Mountain Car training for 2 × 2 BbNN.

**Table 8.** Influence of the BbNN size on the training process for Mountain Car problem. Average stats from 100 training processes.


**Figure 22.** Influence of the BbNN size in the convergence of the algorithm for Mountain Car problem. Each graph exposes the data from 100 executions of the EA. The generations needed to achieve a solution are segmented in intervals, from 0 to 1000 generations. Executions over 1000 generations are stopped. Convergence of different BbNN sizes is analyzed: 1 × 2 BbNN (**a**), 2 × 2 BbNN (**b**), 3 × 2 BbNN (**c**) and 4 × 2 BbNN (**d**).

#### 7.2.3. Control Domain: Cart Pole

The Cart Pole, or inverted pendulum, is a staple problem in the control domain. The center of mass of the pendulum is above its pivot point, so the pendulum is unstable if no control actions are performed on it. OpenAI also provides a Python-based simulation environment for this problem, whose graphical representation is shown in Figure 23.

**Figure 23.** Cart Pole environment.

Four variables are observed in the environment: the position and speed of the cart, the angle of the pole and the speed at the tip of the pole. All of them are floats with different ranges. The system must keep the pole in a balanced position. Two actions can be performed on the simulation environment: push the cart left or right.

In this case, an episode finishes if the angle of the pole exceeds 12° or the position of the cart exceeds the scenario boundaries; in those cases, the pole is considered unbalanced. After each control action on the pole, a partial error value is calculated, representing the instability of the pole after the action. The partial error involves the four parameters of the observation space and their maximum values, as shown in Equation (7). If the pole falls, the global fitness is calculated as the sum of all the partial errors, and the steps not performed are scored with the highest partial error value, as shown in Equation (8):

$$partialerror_j = \frac{1}{4} \cdot \frac{1}{200} \sum_{i=1}^{4} \left( \frac{x_i}{max_i} \right)^2 \tag{7}$$

$$fitness = 1 - \left(\sum_{j=1}^{200} partialerror_j + \frac{200 - steps}{200}\right) \tag{8}$$

where $x_i$ is the $i$-th variable of the observation space, $max_i$ is its maximum value and $steps$ is the number of control actions completed before the episode finishes.


The fitness function in Equation (8) is designed to assign high scores to chromosomes that complete the 200 control actions in balance with a low partial error. Fitness computation for this problem takes 46 µs and is performed after each control action, since it involves an accumulation of the partial error. The maximum fitness of 1 would be reached if the pole lasted the 200 control actions in a perfectly balanced and static position. However, two chromosomes able to balance the pole during 200 control actions can still have different fitness scores, since every *partial error* value depends on the four variables of the observation space. The problem is considered solved if a fitness over 0.95 is achieved. An example of the progression of the fitness during a Cart Pole training experiment with a 3 × 4 BbNN is shown in Figure 24.
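A minimal software sketch of Equations (7) and (8) follows; the container names are ours, and the accumulation is shown at episode end rather than after each action for brevity.

```python
# Sketch of the Cart Pole fitness in Equations (7) and (8). observations is
# the list of four-variable observation vectors recorded after each control
# action; max_values holds the maximum of each variable.
def partial_error(observation, max_values):
    # Equation (7): normalized, squared instability after one action.
    return sum((x / m) ** 2
               for x, m in zip(observation, max_values)) / (4 * 200)

def cart_pole_fitness(observations, max_values):
    steps = len(observations)                     # control actions performed
    accumulated = sum(partial_error(o, max_values) for o in observations)
    penalty = (200 - steps) / 200                 # missing steps score worst
    return 1 - (accumulated + penalty)            # Equation (8)
```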

**Figure 24.** Example of progression of the fitness value during the Cart Pole training for 3 × 4 BbNN.

The minimum BbNN width for this problem is four: one column for each variable in the observation space. The minimum network size compatible with this control problem is, therefore, a 1 × 4 BbNN. Adding a row (2 × 4 BbNN) broadens the design space exploration, so the EA evaluates more configurations to find a solution, and further rows have the same effect. Table 9 and the graphs in Figure 25 show the influence of the BbNN size on the training process; these data have been gathered in the same way as in the former case studies. In this case, increasing the size of the network leads to higher rates of non-convergent executions. Therefore, the most suitable size for this problem is the 1 × 4 BbNN, which has the lowest rate of non-convergent executions (Figure 25) and the lowest average number of generations needed to solve the problem (Table 9).



**Figure 25.** Influence of the BbNN size in the convergence of the algorithm for Cart Pole problem. Each graph exposes the data from 100 executions of the EA. The generations needed to achieve a solution are segmented in intervals, from 0 to 1000 generations. Executions over 1000 generations are stopped. Convergence of different BbNN sizes is analyzed: 1 × 4 BbNN (**a**), 2 × 4 BbNN (**b**), 3 × 4 BbNN (**c**) and 4 × 4 BbNN (**d**).

Figure 24 contains the fitness progression of a 3 × 4 BbNN during the training process of the Cart Pole problem. The initial unbalanced condition of the pole is different at each generation; therefore, the same BbNN configuration varies its fitness value depending on the initial state. Figure 26 presents a solution to this problem in which the evolutionary algorithm has created two feedback loops.

**Figure 26.** BbNN solution for Cart Pole problem.

#### *7.3. Online Adaptation for Control in Dynamic Environments*

As a proof of the online adaptation capability of the proposed system, two examples based on the previous control problems are provided. Both online training examples follow the same approach: first, the system is trained under normal conditions and, once it is capable of controlling the problem for at least 10 generations, the physical parameters are changed. This change hampers the capability of the trained BbNN to solve the initial problem. Table 10 shows the initial conditions for each problem and their modified values.


**Table 10.** Initial and modified conditions for online training.

| Problem | Parameter | Initial Value | Modified Value |
|---|---|---|---|
| Mountain Car | Engine power | 0.001 | 0.0008 |

Some of these modifications emulate changes in the environment: for instance, the modified value of the engine power in the Mountain Car problem emulates a loss of power in the engine. Other changes emulate conditions that make the problem harder to solve: for instance, an increment of the gravity acting on the pole seems unrealistic, but creates a handicap for the problem resolution.
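As an illustration, the Mountain Car modification of Table 10 can be injected into the simulator as sketched below; the `force` attribute is the engine power parameter of Gym's `MountainCarEnv`, though the exact attribute access depends on the Gym version in use.

```python
# Sketch of how the condition change of Table 10 can be applied online.
import gym

env = gym.make("MountainCar-v0")
env.reset()
# ... train under normal conditions until the car is controlled ...
env.unwrapped.force = 0.0008   # emulate a loss of engine power (was 0.001)
# ... the fitness drops and the EA re-trains the BbNN online ...
```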

Both control problems exhibit similar behavior: after the change in the conditions, a drop in the fitness can be observed. The re-training stage reaches a better average fitness than the first training stage, which means that the prior knowledge the system has about the problem provides a good basis for solving it when harder conditions appear. Figures 27 and 28 show the evolution of the fitness when the conditions change for both control problems.

**Figure 28.** Mountain Car training and re-train for 2 × 2 BbNN.

#### **8. Conclusions and Future Work**

In this paper, we propose a dynamically scalable hardware implementation of the Block-based Neural Network model which, under the control of an evolutionary algorithm, enables continuous system adaptation. The proposed neuroevolvable hardware system integrates advanced reconfiguration features that allow (1) composing the BbNN at run-time by stitching together individual PEs and (2) providing the inputs and changing each PE configuration with reduced reconfiguration times. The result is a scalable BbNN whose size can be adapted to the computational demands of a given application. Experimental results show how scalability allows changing the number of logic resources occupied by the network depending on the complexity of the problem or the expected quality of the results. The proposed system has been implemented on an SoC FPGA and integrated, using a hardware-in-the-loop scheme, with the OpenAI toolkit to show its efficiency in reinforcement learning problems such as the Cart Pole and the Mountain Car.

Regarding resource utilization, each PE uses 473 LUTs, 163 FFs and 1 DSP. Compared with other state-of-the-art solutions, our proposal uses more LUTs but fewer FFs and DSPs. Fine-grain reconfiguration has been proven to be a valid solution to train the BbNN online, as all the parameters of a 3 × 3 BbNN can be reconfigured in just 41.7 µs. In real applications, it is not necessary to change all the parameters at the same time, which further reduces the total time needed to configure the network. The inputs of the network are also provided using fine-grain reconfiguration in 6.1 µs, while the outputs are transferred using an AXI full interface in 4.3 µs. The time needed to compute the fitness is application-dependent and ranges from 21 µs to 46 µs for the use cases provided in this paper.

Further research will extend the reinforcement learning capabilities of the proposed solution to more complex scenarios and other applications. Different variants of the evolutionary algorithm will also be explored to increase the capacity of the system to deal with more complex problems. Moreover, the evolutionary algorithm will be modified to use the size of the BbNN as an additional parameter subject to evolution, which will allow selecting the most appropriate BbNN size for a given application without user intervention. The fixed-point data encoding of the network can be an obstacle when solving complex problems; therefore, more precise encoding schemes, like dynamic fixed-point or wider registers, will be studied, even though this improvement in data representation may increase FPGA resource consumption. Finally, other optimization algorithms, such as gradient-descent-based methods or multi-threaded EAs, will be analyzed as alternatives to the current EA.

**Author Contributions:** A.G. has contributed to the conceptualization, methodology, investigation, validation, writing—original draft preparation and writing—review and editing. R.Z. has contributed to the conceptualization, methodology, investigation, validation and writing—original draft preparation. A.O. has contributed to the conceptualization, methodology, investigation and writing—original draft preparation. E.d.l.T. has contributed to the investigation and writing—original draft preparation. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 732105 (CERBERO Project).

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
