**Embedded Bio-Mimetic System for Functional Electrical Stimulation Controlled by Event-Driven sEMG †**

#### **Fabio Rossi, Paolo Motto Ros, Ricardo Maximiliano Rosales and Danilo Demarchi \***

Dipartimento di Elettronica e Telecomunicazioni (DET), Politecnico di Torino, 10129 Torino, Italy; fabio.rossi@polito.it (F.R.); paolo.mottoros@polito.it (P.M.R.); s251237@studenti.polito.it (R.M.R.)

**\*** Correspondence: danilo.demarchi@polito.it

† This paper is an extended version of our paper published in: Rossi, F., Rosales, M. R., Motto Ros, P., Danilo, D. Real-Time Embedded System for Event-Driven sEMG Acquisition and Functional Electrical Stimulation Control. In Proceedings of the 2019 International Conference on Applications in Electronics Pervading Industry, Environment and Society (ApplePies), Pisa, Italy, 11–13 September 2019.

Received: 7 February 2020; Accepted: 7 March 2020; Published: 10 March 2020

**Abstract:** The analysis of the surface ElectroMyoGraphic (sEMG) signal for controlling the Functional Electrical Stimulation (FES) therapy is being widely accepted as an active rehabilitation technique for the treatment of neuro-muscular disorders. Portability and real-time functionalities are major concerns, and, among others, two correlated challenges are the development of an embedded system and the implementation of lightweight signal processing approaches. In this respect, the event-driven nature of the Average Threshold Crossing (ATC) technique, considering its high correlation with the muscle force and the sparsity of its representation, could be an optimal solution. In this paper we present an embedded ATC-FES control system equipped with a multi-platform software featuring an easy-to-use Graphical User Interface (GUI). The system has been first characterized and validated by analyzing CPU and memory usage in different operating conditions, as well as measuring the system latency (fulfilling the real-time requirements with a 140 ms FES definition process). We also confirmed system effectiveness, testing it on 11 healthy subjects: the similarity between the voluntary movement and the stimulated one has been evaluated by computing the cross-correlation coefficient between the angular signals acquired during the limb motion. We obtained high correlation values of 0.87 ± 0.07 and 0.93 ± 0.02 for the elbow flexion and knee extension exercises, respectively, proving good stimulation application in real therapy scenarios.

**Keywords:** surface electromyography; event-driven; functional electrical stimulation; embedded system

#### **1. Introduction**

Nowadays, an increasing number of active rehabilitation techniques are moving to the bio-mimetic approach, which relies on the analysis of the surface ElectroMyoGraphy (sEMG) signal for, e.g., the application of Functional Electrical Stimulation (FES) [1], with the aim of physiologically controlling the muscle functional restoration as much as possible [2]. In particular, FES employs low energy current pulses to modulate the muscle contraction [3] where a complex stimulation pattern, useful to activate the group of muscles involved in a movement, is regulated by sEMG envelope evaluation or by muscle force indicators (e.g., Root Mean Square (RMS), Absolute Rectified Value (ARV)) [4].

In a practical application, the sEMG processing and FES control is a fundamental task to be carried out in real-time [5]. Since the run-time performance bottleneck could be easily related to the use of a general purpose computer for the FES control (often concurrently running, or loaded with, many other unrelated applications or functionalities, leading to unpredictable performances), here the idea is to replace it with a dedicated embedded system. In this regard, major concerns will be the effectiveness and safety of the stimulation and the resulting performance, i.e., a latency short enough to fulfill the real-time constraints and the quality of the stimulated movement.

We propose an embedded bio-mimetic FES system based on the Average Threshold Crossing (ATC) event-driven technique applied to the sEMG signal. The ATC essentially compares the sEMG signal with a threshold [6]: the Threshold Crossing (TC) events generate the quasi-digital TC signal, which is characterized by a digital waveform carrying analog (time-based) information. The ATC parameter is then computed by counting the number of TC events during a time window. In [7], we have demonstrated the correlation among ATC, ARV and the muscle force: in particular, having 0.95 ± 0.02 ATC-force w.r.t. 0.97 ± 0.02 ARV-force correlation, the ATC parameter can be used as an indicator of muscle activity [8]. In this way, the event-driven approach enables the implementation of a low-complexity on-board feature extraction process, divided into two steps (TC generation and ATC computing), which can be directly performed in hardware [9,10], supporting, e.g., the recognition of different gestures [11–14]. While the theoretical background of ATC is quite similar to other common sEMG features, e.g., Zero-Crossing (ZC) or Willison Amplitude (WAMP) [15], our event-based approach could overcome signal processing limitations for embedded feature extraction [16]. In particular, ZC and WAMP calculations are achieved by analyzing an already digitized signal, leading to high time, processing and power costs, while by implementing our proposed ATC approach in hardware we are able to relax these issues.
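To make the ATC definition concrete, the following is a minimal software sketch of the two steps (TC generation and ATC computing) applied to an already sampled signal. It is illustrative only: in the proposed system the TC signal is generated in hardware, and the function names, the choice of counting upward crossings, and the fixed non-overlapping windows are assumptions made here.

```python
from typing import List, Sequence


def count_threshold_crossings(samples: Sequence[float], threshold: float) -> int:
    """Count upward crossings of `threshold` (one TC event per crossing)."""
    events = 0
    for previous, current in zip(samples, samples[1:]):
        # A TC event occurs when the signal goes from below to at/above the threshold.
        if previous < threshold <= current:
            events += 1
    return events


def atc_per_window(samples: Sequence[float], threshold: float,
                   samples_per_window: int) -> List[int]:
    """Split the signal into consecutive windows and count the TC events in each.

    Simplification: crossings that straddle two windows are not counted,
    and a trailing incomplete window is discarded.
    """
    if samples_per_window <= 0:
        raise ValueError("samples_per_window must be positive")
    values = []
    last_start = len(samples) - samples_per_window
    for start in range(0, last_start + 1, samples_per_window):
        window = samples[start:start + samples_per_window]
        values.append(count_threshold_crossings(window, threshold))
    return values
```

For example, `atc_per_window([0, 1, 0, 1, 0, 1, 0, 0], 0.5, 4)` yields one ATC value per four-sample window.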

Therefore, the minimal data size of the ATC information [10] and its sparsity (due to its event-driven nature) match well the low computational capabilities of an embedded system. Evolving from the architecture presented in previous works [10,17,18], with the aim of making the system portable and improving the run-time performance, we replaced the personal laptop, and the software based on the MATLAB® & SIMULINK® environment, with a Raspberry Pi 3 B+ as the processing and control core of the system, running a multi-platform software. Its main tasks are the management of the sEMG multi-channel wireless acquisition, the computation and update of the FES parameters from the ATC data, and the safe control of the stimulator. The software features a Graphical User Interface (GUI) as well, to monitor and control every aspect of the system, guiding the user in setting up different and personalized stimulation sessions.

From the application point of view, a typical scenario consists in the reproduction of functional movements between two subjects in the therapist-patient rehabilitation context: the muscular activity monitored from a healthy subject (therapist), e.g., a doctor or physiotherapist, during the execution of a movement, is processed in order to define the FES pattern to be applied to a second subject (patient) to induce the replication of the same movement.

This manuscript extends what was already presented in [19] by discussing the design choices and details about the system architecture, focusing both on hardware and software aspects, and the related development approaches; finally, further system characterization and validation results are presented, as well as in-vivo experiments.

The paper is organized as follows: Section 2 presents the overall system architecture and details the design and development of both the hardware and software parts; Section 3 presents the results of the validation and characterization of the developed embedded system, while in Section 4 the results of in-vivo experimental tests are reported; the results, with particular emphasis on the features of the embedded system, are then discussed and compared with related works in Section 5; finally, conclusions and future perspectives are outlined in Section 6.

#### **2. System Architecture: Design and Development**

#### *2.1. Overview*

A description of the proposed system can be conceptually schematized into inputs, control and output logical macro areas, as represented in Figure 1, according to the actions flow from signal acquisition to stimulation application. Data acquired by input devices (i.e., muscular activation and limbs motion) are processed by the control unit in order to drive the FES application through the output device.

**Figure 1.** System hardware and user interface architecture: The Raspberry Pi acts as control logic linking the input devices (i.e., surface ElectroMyoGraphic (sEMG) acquisition board and electro-goniometer) with the output (Functional Electrical Stimulation (FES) stimulator).

We designed a flexible enough framework by developing a multi-platform software core, compatible with widespread Operating Systems (OSs) (such as Microsoft® Windows®, GNU/Linux and Android), able to run on commonly available devices, i.e., PC, laptop, tablet, smartphone as well as Raspberry Pi.

Among all the possibilities, we defined our optimized embedded version of the system as Reference Hardware Setup (RHS), which comprises individual acquisition channels for sEMG and electro-goniometers as inputs, Raspberry Pi as control logic and the RehaStim 2 FES stimulator as output. With respect to RHS, other configurations are characterized by changes in the inputs and control devices (i.e., a Microsoft® Windows® or GNU/Linux PC), which lead to slight variations in the wireless connectivity management and software structure.

#### *2.2. Hardware Platform*

The input devices are the sensors useful to record the signals of interest, i.e., the electrical signals produced by the muscles contraction (sEMG signal) during the execution of a movement and the angular signals representing the limbs motion of the human body.

In the first case, the employed device has to amplify and filter the muscular signal in order to allow its interpretation, since the raw signal amplitude varies between hundreds of μV and tens of mV [20]. Therefore, referring to the guidelines reported in [21], we developed an analog conditioning circuit for the bio-signal [9], which, using the three-electrode differential approach (two as sources, one for reference), provides a gain factor of 1000 in the 30 Hz to 400 Hz bandwidth (obtained as a cascade of a differential first-order high-pass filter [22] and a second-order Sallen-Key low-pass filter [23]) in order to filter out electrode-skin movement artifacts and high-frequency noise. Moreover, since the sEMG module has to be coupled with the FES stimulator, we added overvoltage protection diodes on the channel inputs. As introduced, we carried out the first step of our event-driven signal processing by extracting the TC signal using a hysteresis voltage comparator (30 mV) so as to avoid spurious glitches. The average count of events (ATC) is then computed at the digital interface with the microcontroller (MCU): in [10], we demonstrated how to accomplish this task minimizing the MCU resources to a GPIO interrupt, which detects the TC digital events, and a timer, which defines the observation window. The length of the window is set to 130 ms as reported in the tests presented in [9], where this value has been proved to be an optimal trade-off between the time resolution of the muscle activation and the discrimination of different levels of generated muscular force.
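The interrupt-plus-timer scheme described above can be modeled in a few lines of Python. This is only a host-side sketch for reasoning about the logic, not the MCU firmware from [10]: the class and method names (`AtcCounter`, `on_tc_edge`, `on_window_timeout`) and the `replay` helper are hypothetical, and the timestamps stand in for the edges of the TC signal.

```python
class AtcCounter:
    """Software model of the GPIO-interrupt + timer scheme (hypothetical names)."""

    def __init__(self) -> None:
        self._events = 0       # TC events seen in the current window
        self.atc_values = []   # one ATC value per elapsed window

    def on_tc_edge(self) -> None:
        """Stands in for the GPIO interrupt fired by each TC event."""
        self._events += 1

    def on_window_timeout(self) -> int:
        """Stands in for the timer interrupt closing the observation window."""
        atc = self._events
        self.atc_values.append(atc)
        self._events = 0
        return atc


def replay(event_times_s, duration_s, window_s=0.130):
    """Replay timestamped TC events through the counter, window by window."""
    counter = AtcCounter()
    pending = sorted(event_times_s)
    index = 0
    window_index = 1
    # Small tolerance so that a duration equal to k windows yields k values.
    while window_index * window_s <= duration_s + 1e-12:
        window_end = window_index * window_s
        while index < len(pending) and pending[index] < window_end:
            counter.on_tc_edge()
            index += 1
        counter.on_window_timeout()
        window_index += 1
    return counter.atc_values
```

The only state kept between events is a single integer per channel, which is consistent with the minimal MCU resource usage claimed in the text.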

We propose two solutions based on this acquisition and processing architecture, shown in Figure 2, depending on the user needs: the first option (a) is a complete four-channel board suitable for multiple-muscle monitoring on the same limb, e.g., extensor and flexor muscles of the human forearm, while the second one (b) is a stand-alone single-channel module to be used independently when an individual detection is advantageous, e.g., biceps and triceps brachii muscles during the elbow flexion and extension. We equipped both solutions with wireless connectivity in order to improve freedom of movement, avoiding wiring hindrance, and to make the systems fully wearable: among the wide list of (standard) wireless options, we chose the Bluetooth Low Energy (BLE) protocol (stack 4.1 [24]) because of its low-energy features, which match the requirements of battery-powered devices. In particular, we equipped (a) with the Microchip RN4020 [25] module (with its own antenna), while in (b) the same MCU used for computing the ATC runs the BLE stack and directly feeds a PCB antenna, designed with reference to [26].

**Figure 2.** Custom sEMG acquisition device: (**a**) Four-channels board for multiple muscles monitoring, (**b**) independent single-channel module.

As second input typology, we developed custom electro-goniometers in order to record the limb motion in the form of electrical signals. Figure 3 shows their structure (very similar to that of a standard goniometer), which basically consists of two parts fixed by a pivot at one extremity. Employing an absolute capacitive modular encoder, i.e., the AMT20 [27], and placing its center at the pivot, we were able to detect the goniometer's angle by decoding the encoder shaft position related to its inner capacitance changes. Angle values are represented on 12 bits, with a 0.2° accuracy and, since the AMT20 presents an SPI output line, we interfaced it with an Arduino Micro MCU [28] in order to sample the signal at 80 Hz (appropriate w.r.t. human movement velocity [29]) and to transmit it to an external device (via USB cable) for graphical representation. The goniometer case has been manufactured by a 3D printing process, employing the Form 2 printer [30] with a bio-compatible photo-reactive resin, which allowed us to design an anatomically comfortable and lightweight structure. Four elastic strips secure the electro-goniometer in the proper location on the limb, ensuring that its pivot is aligned with the rotation center of the articulation.

**Figure 3.** Wearable electro-goniometer developed to test the system performance and to provide an angular feedback on the ongoing stimulation.
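As a small worked example of the 12-bit angle representation mentioned above, the sketch below converts a raw encoder count into degrees. It assumes a linear mapping of the 4096 counts onto one full turn; the zero-position offset and the wrapping to [-180°, 180°) are conveniences introduced here for illustration, not details taken from the goniometer firmware.

```python
ENCODER_BITS = 12
COUNTS_PER_TURN = 1 << ENCODER_BITS  # 4096 counts per full revolution


def counts_to_degrees(raw_counts: int) -> float:
    """Map a raw 12-bit position (0..4095) to an angle in [0, 360)."""
    if not 0 <= raw_counts < COUNTS_PER_TURN:
        raise ValueError("raw_counts must fit in 12 bits")
    return raw_counts * 360.0 / COUNTS_PER_TURN


def joint_angle(raw_counts: int, zero_counts: int) -> float:
    """Angle relative to a reference position, wrapped to [-180, 180).

    `zero_counts` is a hypothetical calibration value read with the limb
    in its reference posture.
    """
    delta = (raw_counts - zero_counts) % COUNTS_PER_TURN
    angle = delta * 360.0 / COUNTS_PER_TURN
    return angle - 360.0 if angle >= 180.0 else angle
```

With this mapping one count corresponds to 360°/4096 ≈ 0.088°, which is finer than the 0.2° accuracy quoted for the encoder.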

The control of the induced FES pulses depends on how they are electrically generated and which pulse parameters can be modified during the stimulation. We chose the medical-certified RehaStim 2 [31] because it allowed us to have an advanced control on the pulse definition per channel and the possibility to be easily interfaced with an external device by means of the ScienceMode2 bidirectional communication protocol [32].

The generated current pulses are characterized by a biphasic rectangular shape, shown in Figure 4, whose configurable parameters are the pulse amplitude, the stimulation frequency and the phase width, while the inter-phase interval is fixed to 150 μs guaranteeing a proper stimuli excitability [33].

**Figure 4.** Rectangular biphasic current pulse generated by RehaStim 2 stimulator and its parameters.
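The pulse parameters listed above can be captured in a small data structure. The sketch below is not the ScienceMode2 interface: the class name, field names and the positive-then-negative phase order are assumptions, and only the fixed 150 μs inter-phase interval comes from the text.

```python
from dataclasses import dataclass

INTER_PHASE_US = 150  # fixed inter-phase interval reported in the text


@dataclass(frozen=True)
class BiphasicPulse:
    """Rectangular biphasic pulse described by its three configurable parameters."""
    amplitude_ma: float    # pulse amplitude
    phase_width_us: int    # width of each phase
    frequency_hz: float    # stimulation frequency

    def duration_us(self) -> int:
        """Total pulse duration: two phases plus the inter-phase gap."""
        return 2 * self.phase_width_us + INTER_PHASE_US

    def period_us(self) -> float:
        """Time between the starts of two consecutive pulses."""
        return 1_000_000 / self.frequency_hz

    def charge_per_phase_uc(self) -> float:
        """Charge delivered by one phase (mA x us = nC, so divide by 1000 for uC)."""
        return self.amplitude_ma * self.phase_width_us / 1000.0

    def waveform(self):
        """(duration_us, current_mA) segments; the phase order is an assumption."""
        return [
            (self.phase_width_us, self.amplitude_ma),
            (INTER_PHASE_US, 0.0),
            (self.phase_width_us, -self.amplitude_ma),
        ]
```

For instance, an illustrative pulse of 20 mA, 300 μs per phase at 40 Hz lasts 750 μs within a 25 ms period, and its two phases carry equal and opposite charge.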

Therefore, considering the ATC dependency on the muscle force (e.g., correlation between ATC and sEMG amplitude/energy indicators), our idea has been to modulate the FES pulses intensity on the basis of such parameter, while for the other settings we referred to the physiotherapy manual provided with the stimulator [34]. In this way, the modulation approach allows us to excite the muscle fibers with the proper amount of current during all the phases of a movement session (warm up, increasing force, relaxation as well as resting state) and for a wide list of exercises.

The last part of the system is the Raspberry Pi, model 3 B+ [35], working as control logic which manages the entire system. Indeed, it runs the main software controlling the data acquisition, its processing, and the stimulation definition and application. Moreover, since this Raspberry Pi is equipped with four USB ports and a full-size HDMI port, we improved the system usability by developing a complete GUI and employing some peripherals, such as keyboard, monitor and mouse.

As discussed in Section 2.1, different devices can act as control unit by appropriately configuring the hardware: as an example, if a Microsoft® Windows® OS PC is used as control logic, the CC2540-Dongle [36] module is needed to communicate with the acquisition devices (limiting the maximum number of simultaneous connections to three) since Windows® machines do not allow an easy access to the Bluetooth interface.

#### *2.3. Software Overview*

As previously introduced in Section 2.1, although the project is aimed at the development of an embedded system, we want to provide a modular and flexible software core able to fulfill the compatibility requirements of different OSs. Consequently, from the development standpoint, we based the software on the Python language, because of its cross-platform nature, its widespread adoption, and the large availability of third-party multi-platform libraries (such as the standard library for multi-threading features or the Kivy library [37] for the GUI). Moreover, the embedded software has been based on an object-oriented (OO) framework in order to promote flexibility, modularity and robustness [38] (e.g., leveraging encapsulation, inheritance, and composition features), allowing a seamless integration and management of several devices (e.g., different input modules) along with the possibility of future integration of new processing algorithms. We also implemented a multi-threaded architecture in order to map the functional tasks onto different running threads [39], so as to optimize the use of computational resources and to avoid complex (run-time) code interdependencies.

#### 2.3.1. Classes Diagram Overview

As shown in the Unified Modeling Language (UML) diagram in Figure 5, the main System object is composed of four sub-objects: the FES class representing the stimulator, two Goniometer classes for the developed electro-goniometers and a Bluetooth class, which can have different implementations depending on the hardware configuration.

**Figure 5.** Classes diagram (UML) of our OO software organization. Bluetooth class implementation depends on system configurations.

Since both the goniometers and the stimulator are connected by wire to the control unit, their classes inherit from an abstract custom Serial Device class, which provides a standard interface for every serial device (i.e., attributes such as serial port, baudrate, stopbits, etc., and methods such as connect(), settings(), transmission() and so on). Specific methods of different serial devices have been overridden in order to provide the proper interfacing with the control unit.

The BLE software depends on the system configuration: if the RHS is used, we combine the BlueZ [40] Linux Bluetooth stack with the bluepy [41] Python library (specific for low energy features). In particular, the HW Bluetooth class is composed of a variable number (zero to four) of BLE connections, each of which in turn consists of a Delegate object (notification data handling) and a Peripheral object (bluepy instance encapsulating the BLE BlueZ connection), and of a Scanner, which looks for advertising devices. On the other hand, if a common PC is employed, the CC2540-Dongle module is needed and, since it communicates through a serial port with the workstation, the Bluetooth class inherits from the Serial Device one.
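The class organization described in this section can be sketched as follows. This is a structural illustration only: the attribute and method names follow those mentioned in the text (port, baudrate, stopbits, connect(), settings(), transmission()), while the class names, the placeholder framing bytes and the absence of any real serial I/O are assumptions made to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class SerialDevice(ABC):
    """Abstract interface shared by every wired device (illustrative sketch)."""

    def __init__(self, port: str, baudrate: int, stopbits: int = 1) -> None:
        self.port = port
        self.baudrate = baudrate
        self.stopbits = stopbits
        self.connected = False

    def connect(self) -> None:
        # A real implementation would open the serial port here.
        self.connected = True

    def settings(self) -> dict:
        return {"port": self.port, "baudrate": self.baudrate,
                "stopbits": self.stopbits}

    @abstractmethod
    def transmission(self, payload: bytes) -> bytes:
        """Return the device-specific frame for `payload` (overridden per device)."""


class FesStimulator(SerialDevice):
    def transmission(self, payload: bytes) -> bytes:
        return b"FES:" + payload  # placeholder framing, not ScienceMode2


class Goniometer(SerialDevice):
    def transmission(self, payload: bytes) -> bytes:
        return b"GON:" + payload  # placeholder framing


class DongleBluetooth(SerialDevice):
    """PC configuration: the BLE dongle is reached through a serial port."""

    def transmission(self, payload: bytes) -> bytes:
        return b"BLE:" + payload  # placeholder framing


class System:
    """Composition of the stimulator, two goniometers and a Bluetooth object."""

    def __init__(self, fes, goniometers, bluetooth) -> None:
        self.fes = fes
        self.goniometers = list(goniometers)
        self.bluetooth = bluetooth

    def connect_all(self) -> None:
        for device in [self.fes, *self.goniometers, self.bluetooth]:
            device.connect()
```

Because `System` only relies on the shared interface, swapping the Bluetooth implementation for another hardware configuration does not require changes elsewhere in this sketch.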

#### 2.3.2. Multi-Threading

Figure 6 shows the multi-threading structure of the system and the running state of the involved threads during a typical stimulation session.

**Figure 6.** Multi-threading structure during a typical stimulation scenario.

The Main Thread starts after the user login and runs for the whole session, handling the user interface and waiting for the user inputs, each of which triggers the creation of child threads. As primary sub-thread, the Output Control Thread manages the communication with the stimulator (e.g., watchdog timer, packet creation, etc.) during the calibration and stimulation phases. Moreover, the Main Thread runs all the calibration-step threads (i.e., ATCth, ATCmax, AROMmax and Imax, details in Section 2.4) during the settings and the Idefinition threads when the stimulation is applied; together these are defined as Processing Threads. Each of them is also supported by a Plot thread, represented by a white rectangle, which plots the signals of interest. Finally, we developed the Acquisition Threads, divided into ATCacq and Angularacq for the acquisition of the ATC and angular values, and the Service Thread for managing BLE notifications. Data exchange among threads is organized with queue objects; therefore, each thread implements a specific method that continuously checks the queue status.
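The queue-based data exchange between threads can be illustrated with the Python standard library. The sketch below shows one acquisition thread feeding one processing thread; the function names, the sentinel used to stop the consumer, and the placeholder processing step (taking the maximum of each packet) are assumptions for illustration and do not reproduce the actual thread bodies.

```python
import queue
import threading

STOP = object()  # sentinel telling the consumer that the session is over


def acquisition_thread(atc_packets, out_queue: queue.Queue) -> None:
    """Producer: pushes each incoming ATC packet onto the shared queue."""
    for packet in atc_packets:
        out_queue.put(packet)
    out_queue.put(STOP)


def processing_thread(in_queue: queue.Queue, results: list) -> None:
    """Consumer: keeps checking the queue and processes packets as they arrive."""
    while True:
        try:
            packet = in_queue.get(timeout=0.1)
        except queue.Empty:
            continue  # nothing yet, check again
        if packet is STOP:
            break
        results.append(max(packet))  # placeholder for the real processing


def run_session(atc_packets) -> list:
    shared = queue.Queue()
    results = []
    consumer = threading.Thread(target=processing_thread, args=(shared, results))
    producer = threading.Thread(target=acquisition_thread, args=(atc_packets, shared))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return results
```

`queue.Queue` is thread-safe, so the two threads need no explicit locking, which helps keep the run-time interdependencies between tasks simple.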

#### 2.3.3. Graphical User Interface

The GUI has been developed choosing the Kivy Python library [37], due to OS inter-compatibility, modern layout, open-access feature and optimized performance [42], in order to have an easy, intuitive, and practical high-level control of the application.

Figure 7 shows the four main screens of the GUI. In the Initialization one, the user inserts the personal information of therapist and patient, and chooses the system configuration (acquisition and stimulation channels) along with the movement that will be executed. The Calibration screen is designed to perform the calibration process, whereby the acquisition and stimulation parameters are optimized for the specific use case. Subject data are used to build up a database, useful to quickly configure the application settings without repeating the calibration steps. In the Main Stimulation screen, the stimulation can be started and stopped, and the signals of interest (i.e., pulse amplitude and angular signals) are plotted in order to provide a visual feedback for the therapist. Lastly, the Parameters screen allows the user to modify the parameters or save them if multi-session scenarios are expected. Transitions among the screens, represented by black arrows in Figure 7, have been arranged using the Screen Manager object, facilitating user navigation among sections.

From an OO perspective, all the screens directly inherit from the Kivy Screen class, with the exception of the ones containing graphs (i.e., Main Stimulation and Calibration), which are also defined by the MyPlotScreen class since it possesses Kivy plotting objects. Thus, the System is aggregated in every main screen, where the system actions run through the screen widgets.

Lastly, on the Raspberry Pi, we changed the RAM memory assigned to the Graphics Processing Unit (GPU) from 64 MB to 256 MB in order to execute the GUI without impacting on the graphical resources.

**Figure 7.** Kivy main screens: The Initialization one allows the user to store subjects information and set up the system; in the Calibration one the calibration process is achieved, and the optimized parameters can be visualized, saved and modified into the Parameter screen; finally, the Main Stimulation screen runs the stimulation and graphically represents the signals of interest. Arrows highlight transitions among screens.

#### *2.4. ATC Dataflow: Processing and Calibration*

The definition of the FES pulse amplitude as a function of the ATC values is the core of the FES control mechanism, linking the data acquisition with the stimulation. Since an embedded device has limited computational power, we needed to implement this process keeping the complexity as low as possible, also favoring fast computing approaches to respect the real-time requirements. In this scenario, taking advantage of the sEMG-ATC (pre-)processing, our idea is to mimic the simplicity of a look-up table structure: basically, we organized it as a two-matrix architecture, one for the ATC values and one for the FES current ones, with one-to-one cell correspondence between them.

A typical application scenario, considering *n* active channels, is represented in Figure 8: every time a new BLE packet arrives, containing the ATC data of *n* channels, the received data are appended to an *n*×4 matrix (ATC matrix), which also includes the data of the three previous ATC windows. Then, the row-median operation is computed in order to obtain an ATC value that is robust to noise. Since the ATC matrix is continuously updated (every ATC window), this operation basically represents a moving median. In this way, we obtain an *n*×1 array, whose values are interpreted as indexes pointing to the FES current values stored in the FES Current Matrix. Once the new stimulation data are defined through this algorithm, a FES data packet is built and the command is transmitted to the stimulator.

**Figure 8.** Average Threshold Crossing (ATC)-FES definition process: green cells represent the links between input ATC data and output pulse currents; orange labels identify the FES Current Matrix indexes defined by the Maximal ATC calibration step; blue values correspond to the maximal FES current calibrated with the Current limitation process.
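The ATC-FES definition process described above (append, row median, table look-up) can be sketched in plain Python. Several details are assumptions made here because the text does not specify them: the history starts filled with zeros, a non-integer median of four values is truncated to an integer index, and an index beyond the calibrated range is clamped to the last cell. The class name `AtcToCurrent` is hypothetical.

```python
from collections import deque
from statistics import median

HISTORY = 4  # current ATC window plus the three previous ones


class AtcToCurrent:
    """Moving-median look-up from ATC values to FES currents (illustrative)."""

    def __init__(self, current_table) -> None:
        # current_table[channel][atc_index] -> pulse amplitude in mA
        self.current_table = current_table
        # One row of the ATC matrix per channel; assumed to start at zero.
        self.atc_matrix = [deque([0] * HISTORY, maxlen=HISTORY)
                           for _ in current_table]

    def update(self, atc_packet):
        """Append the new ATC values and return one current per channel."""
        if len(atc_packet) != len(self.current_table):
            raise ValueError("one ATC value per channel is expected")
        currents = []
        for channel, atc in enumerate(atc_packet):
            row = self.atc_matrix[channel]
            row.append(atc)                   # oldest value is dropped
            index = int(median(row))          # assumption: truncate to an index
            index = min(index, len(self.current_table[channel]) - 1)  # clamp
            currents.append(self.current_table[channel][index])
        return currents
```

With a single channel and the illustrative table `[0, 0, 5, 10, 15]`, three consecutive packets with ATC = 4 produce 0, 5 and 15 mA: the median delays the response by a couple of windows but suppresses isolated spikes.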

However, since different subjects could produce different ATC values or be stimulated by different amounts of current, a calibration process for the optimization of the acquisition (therapist) and stimulation (patient) parameters is fundamental, permitting us to develop a flexible system, able to suit different users, while maintaining the benefits of a proper and safe per-subject stimulation. Hence, we defined a four-step calibration process as follows:


Following this approach, we are able to set up our structure with a close matching between the muscular activation of the therapist and the pulse amplitude needed to adequately stimulate the patient limb. Looking at the example represented in Figure 8, the FES Current Matrix has a different column dimension for each channel, defined by the Maximum ATC values. In this way, by setting the Maximum current values, we are able to define the step and range of the pulse amplitude. In conclusion, by simply controlling the lower values of the stimulation matrix (FES Current Matrix[(:, 1:2)], grey cells), combined with the moving median operation, we are able to implement a low-complexity but efficient noise gate.
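One possible way to populate a row of the FES Current Matrix from the calibrated values is sketched below. The text only states that the Maximum ATC sets the number of columns, that the Maximum current sets the step and range, and that the first two cells act as a noise gate; the linear ramp and the choice of zero current for the gated cells are assumptions made for this example, as are the function names.

```python
def build_current_row(atc_max: int, i_max_ma: float, gated_cells: int = 2):
    """Build one channel row: index = ATC value, cell = pulse amplitude (mA).

    Assumptions: the first `gated_cells` entries are zero (noise gate) and
    the remaining entries rise linearly up to `i_max_ma` at index `atc_max`.
    """
    if atc_max < gated_cells:
        raise ValueError("atc_max must be at least the number of gated cells")
    row = [0.0] * (atc_max + 1)
    steps = atc_max + 1 - gated_cells  # number of non-gated cells
    for index in range(gated_cells, atc_max + 1):
        row[index] = i_max_ma * (index - gated_cells + 1) / steps
    return row


def build_current_matrix(atc_max_per_channel, i_max_per_channel):
    """One row per channel; rows may have different lengths."""
    if len(atc_max_per_channel) != len(i_max_per_channel):
        raise ValueError("one ATC maximum and one current maximum per channel")
    return [build_current_row(a, i)
            for a, i in zip(atc_max_per_channel, i_max_per_channel)]
```

For example, `build_current_row(5, 20.0)` gives `[0.0, 0.0, 5.0, 10.0, 15.0, 20.0]`, so ATC values of 0 or 1 produce no stimulation while the maximum ATC maps to the calibrated maximum current.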

#### **3. System Validation and Characterization**

The performance of the control unit has been studied by analyzing the real-time FES control processing, fundamental to achieve the proper online modulation of the pulse amplitude, and by examining how the developed software impacts the workstation resources, thus evaluating memory, graphic and computational cost. Due to the multi-platform nature of our software, we carried out these tests comparing its behavior running on two different control units: in particular, we employed the Raspberry Pi 3 B+, equipped with a Cortex-A53 (ARMv8) 64 bit, running at 1 GHz, 1 GB RAM and Raspbian OS, to evaluate the performance of the embedded version; conversely, a Toshiba Satellite L830-14J PC, equipped with an Intel Core i3-3227U with 1.9 GHz clock frequency, 4 GB RAM and Microsoft® Windows® 10 OS, has been used to simulate a personal laptop application scenario.

#### *3.1. Latency Measurement*

As mentioned in Section 2.1, the system can adopt different architectures depending on which sEMG acquisition device is used and on the employed processing unit. As a consequence, we defined five hardware configurations (**Cx**) to be tested, listed in Table 1. The latency has been evaluated for two crucial sections of the application: the FES current definition, which concerns the definition of the new pulse amplitude on the basis of the latest ATC values, and the Plotting, which regards the representation of both the angular signals and the FES currents over time. The duration of the test has been set to 3 min in order to obtain enough values (180 s/*ATCwindow* = 180 s/0.13 s ≈ 1385 measures) to represent the system performance from the stimulation initialization to the stable working condition.


**Table 1.** Tested hardware configurations. Main differences concern the acquisition device (single channel or four-channel board) and the control unit (GNU/Linux Raspberry or Microsoft® Windows® PC).

\* up to three concomitant connections.

#### 3.1.1. FES Current Definition

This method represents the logical core that links the ATC values, describing the muscular activity, to the FES current values, which specify the amplitude of the incoming pulses. As described in Section 2.4, we implemented this functionality using a lookup table structure in order to minimize the complexity as well as the processing time. Indeed, the real-time FES definition is a fundamental task for a proper stimulation, avoiding any delay caused by data queueing; in particular, our time constraint is directly related to the ATC window, which defines the time interval between two ATC values, and so the FES current definition processing time has to be lower than 130 ms. We tested this process by studying how the workload is split among the different methods and by evaluating the total processing time. For the first test, the FES current definition is divided into five consecutive methods: the queue method continuously checks for the arrival of new ATC values; when they are available, we append them to the ATC matrix and the median operation is performed. Then the FES\_start method builds the FES packet to be transmitted to the stimulator, and in the meanwhile the plot thread is called for the graphical representation of the signals.
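The timing constraint discussed above (processing time below the 130 ms ATC window) can be checked with a simple measurement harness. The sketch below is a generic illustration based on `time.perf_counter`, not the profiling tool used for the reported results; the function names and the callable passed as `step` are hypothetical.

```python
import time

ATC_WINDOW_S = 0.130  # a new ATC value arrives every 130 ms


def measure_latencies(step, inputs):
    """Time one call of `step` per input item and return the durations (s)."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        step(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def meets_deadline(latencies, deadline_s: float = ATC_WINDOW_S) -> bool:
    """True when even the slowest update finishes before the next ATC window."""
    return max(latencies) < deadline_s


def expected_measures(test_duration_s: float, window_s: float = ATC_WINDOW_S) -> int:
    """Number of updates observed in a test of the given duration."""
    return round(test_duration_s / window_s)
```

Checking the maximum rather than the median is the stricter criterion, consistent with the requirement that outliers also stay below the ATC window.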

Table 2 reports the time profiling of the workload breakdown: as can be observed, the majority of the time (around 90% in **C1**, **C2** and **C3**, and 97% for **C4** and **C5**) is spent inside the queue method waiting for the arrival of new ATC values, while the other sub-functions run for a very short time. Therefore, from a methods breakdown point of view, this behavior confirms that the application works as expected, avoiding any queue build-up caused by slow processing.


**Table 2.** Time profiling results for the evaluation of the methods breakdown during the FES current definition process.

On the other hand, the real-time FES definition has been verified by looking at the delay data represented in the box plots in Figure 9. All the cases largely fulfill our time constraint, also considering the outliers, since none of the values is greater than 100 ms. In particular, in the Raspberry and PC cases the median values are below 10 and 5 ms, respectively, which avoids any possible delay between acquisition and stimulation caused by our FES definition processing. However, comparing the two platforms, laptop performance is considerably superior with respect to the Raspberry one, since both hardware and software resources differ between the two architectures.

**Figure 9.** Box plots representing the time delays related to the FES current definition method for the different system configurations. Real-time constraints are respected, considering our time limitation of 130 ms (dashed black line), both in GNU/Linux Raspberry (**C1**, **C2** and **C3**) and Microsoft® Windows® PC (**C4** and **C5**) cases.

In conclusion, these results prove the low-complexity implementation of our event-driven ATC-FES definition approach: considering the computational lightness of the ATC processing, based on simple mathematical relations (such as matrix and median operations) applied to a minimal data size, we were able to reach very fast pulse updates (lower than 10 ms) along with online modulation of the FES parameters.

#### 3.1.2. Plotting

The above measurements are repeated for the Plotting process in order to study whether the graphical representation of the signals of interest can affect the run-time performance. The plotting is based on a clock object, whose methods are get\_value and sleep: the former gets the new ATC and angular data, and represents them on the graphs; the latter puts the object into an idle state until new data are available. As in the previous test, this analysis allows us to detect queue build-up during the plotting process, but, looking at Table 3, we can assume this critical condition has not been reached since 99% of the time is spent in the clock sleep state. Moreover, we measured the time spent by the Plotting thread in order to verify the correctness of the data representation, ruling out the possibility of misinterpreting the stimulation visual feedback. This time, we report in Figure 10 our results only for the embedded configurations as worst-case scenario: again, since 95% of the values (IQR) are lower than 2 ms for all the cases, we can assume the real-time behavior of our plotting method.

**Table 3.** Time profiling results for the Plotting process, split into its two sub-threads: the get\_value and sleep methods represent the active and inactive actions, respectively, of the clock object that manages the plotting of the data.



**Figure 10.** Plotting delay for the configurations employing the Raspberry Pi (**C1**, **C2** and **C3**). The very low plotting-time proves the real-time graphical representation by our application.

#### *3.2. Computational Performance*

The information about the usage of resources at the processing side, such as the Central Processing Unit (CPU) and the Random Access Memory (RAM), is of crucial importance for the evaluation of application fluency, performance and usability. From this perspective, we carried out this test using the RHS (**C2**), considering it as the worst-case scenario in terms of processing power. Hence, we monitored the system processes through the htop GNU/Linux tool, running the system without the Raspbian GUI activated in order to take into consideration only the application and all its dependencies (i.e., BLE and Kivy).

Table 4 reports the results when the four-channel stimulation mode has been selected: as can be observed, the highest CPU load is reached in the main stimulation procedure, where the highest number of threads is active. Therefore, focusing on the Stimulation stage, we repeated the measurements studying two different but dependent cases: first, we varied the number of channels from one to four, looking at changes in resource usage; second, with the same purpose, we analyzed our ATC-FES current definition process both in the standard situation (i.e., append and median methods) and when a direct equivalence between ATC and current values is performed (no data processing). From the results listed in Table 5, we can see that by employing a single channel we are able to reduce the CPU usage by approximately 20% w.r.t. the complete channel configuration, and this depends entirely on the lower or higher number of co-running threads in the two situations. Instead, as can be observed from the two main columns, the FES current definition implementation does not affect the CPU usage, which further proves the lightweight computational cost of our approach.


**Table 4.** CPU and RAM measurements of the application main stages, which have been evaluated testing the Reference Hardware Setup (RHS) with four working channels.

\* four steps of the Calibration process.

**Table 5.** RHS resource usage during the Stimulation stage depending on the number of working channels and the FES current definition implementation. The ATC processing has been enabled (standard flow) or disabled (direct ATC-IFES equivalence) in order to study whether the implementation affects the run-time system performance.


On the other hand, the dynamic memory undergoes only small variations among the application stages and between the one- and four-channel cases. This behavior is mainly due to the different number and types of widgets the GUI owns, which are directly related to the number of active channels. Hence, since there are no differences between the single- and four-channel GUIs, the RAM consumption is almost constant (see Table 5).

#### **4. In Vivo Experimental Tests and Results**

As the system is intended for rehabilitative sessions, some tests have been carried out in order to prove the correctness and appropriateness of our approach in the control of the FES application. As introduced, the typical scenario involves a first subject (therapist), who performs a useful movement and whose muscle activity is monitored, and a second subject (patient), who replicates the movement as a consequence of the stimulation. We studied the similarity between the two movements by analyzing the limb motion signals, acquired using the developed electro-goniometers, which are worn by both the therapist and the patient. We compared the angular signals by calculating the maximum of the normalized cross-correlation coefficient (*σ*), as reported in the following formula:

$$\sigma = \max_{m}\,\sigma_{th\_pt,coeff}(m), \qquad \sigma_{th\_pt,coeff}(m) = \frac{\sigma_{th\_pt}(m)}{\sqrt{\sigma_{th\_th}(0) \cdot \sigma_{pt\_pt}(0)}} \tag{1}$$

where *m* is the lag between the two signals (th, pt), and the normalization by the autocorrelation product bounds *σ* between −1 (completely opposite signals) and 1 (perfect match between signals).
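As a minimal sketch of how such a similarity index can be computed, the following Python code evaluates the maximum of the normalized cross-correlation between two equal-length angular signals. It is not the authors' MATLAB® processing: the function names and the sample traces are hypothetical, and the normalization by the square root of the product of the zero-lag autocorrelations is an assumption that follows the convention of the MATLAB® `xcorr` function with the `'coeff'` option.

```python
import math


def raw_xcorr(a, b, lag):
    """Raw cross-correlation sum_n a[n + lag] * b[n] (zero outside the signals)."""
    total = 0.0
    for n in range(len(b)):
        j = n + lag
        if 0 <= j < len(a):
            total += a[j] * b[n]
    return total


def max_normalized_xcorr(th, pt):
    """Maximum over all lags of the normalized cross-correlation of th and pt."""
    if len(th) != len(pt):
        raise ValueError("signals must have the same length")
    # Assumed normalization: sqrt of the product of the zero-lag autocorrelations.
    norm = math.sqrt(raw_xcorr(th, th, 0) * raw_xcorr(pt, pt, 0))
    if norm == 0.0:
        raise ValueError("signals must not be identically zero")
    n = len(th)
    return max(raw_xcorr(th, pt, m) for m in range(-(n - 1), n)) / norm


# Hypothetical angular traces (degrees): the patient repeats the therapist
# movement with a two-sample delay and half the amplitude.
therapist = [0, 10, 20, 30, 20, 10, 0, 0, 0]
patient = [0, 0, 0, 5, 10, 15, 10, 5, 0]
sigma = max_normalized_xcorr(therapist, patient)
```

With these traces the index reaches its upper bound, since a delayed and scaled copy of a signal has the same shape; this is why the measure suits a comparison in which the stimulated movement necessarily lags the voluntary one.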

We enrolled 11 healthy subjects, who gave their informed consent to our testing protocol (approved by the Bio-ethical Committee of the Università degli Studi di Torino, Italy), and we divided them into therapist-patient couples. In the next sections, we introduce the methodologies adopted for choosing the electrode type and for preparing the subjects' skin, and we report our complete results for the upper limb exercise and the preliminary ones for the lower limb exercise.

#### *4.1. Electrodes and Skin Preparation*

A proper treatment of the electrode-skin interface is essential in order to enhance the signal acquisition quality. Cleaning the skin surface with medical alcohol removes fat, dust and dead cells, and increases the conductivity through the electrode [43]. We chose the Kendall™ Covidien H124G -24 mm [44] for the sEMG signal acquisition due to its Ag/AgCl sensor, pre-gelled surface and long-term stability. Instead, for the stimulation, we employed the 5 cm × 9 cm RehaTrode [45] rectangular self-adhesive electrodes produced by Hasomed®, which are designed to be coupled with the RehaStim 2 stimulator. The main difference between these two types regards the working area: the acquisition electrodes offer a higher spatial resolution, while the stimulation ones cover a larger surface to properly induce the stimulation.

#### *4.2. Upper Limb: Elbow Flexion*

As the upper limb benchmark exercise, we chose the Elbow Flexion (EF) movement, which consists of the forearm motion toward the upper arm, rotating around the elbow joint center. The active muscles of the arm are the brachialis, which attaches the humerus to the ulna, the brachioradialis, which connects the humerus and the radius, and the biceps brachii, which links the shoulder blade to the radius [46]. Since our idea was to perform these first tests with minimal complexity, we decided to monitor and stimulate only the biceps brachii, also due to its accessibility by surface electrodes. Therefore, we placed the pair of acquisition electrodes at 1/3 of the line between the fossa cubit and the medial acromion, with a 20 mm inter-electrode distance, and the reference one on the back of the hand, as an electrically neutral area [43]. In contrast, the position of the FES electrodes slightly differs from the previous ones, one being located on the muscle belly and the other closer to the crease of the elbow [47], in order to obtain a correct contraction of the muscle fibers. The experimental setup is shown in Figure 11a,b.

**Figure 11.** Acquisition (**a**) and stimulation (**b**) electrode locations on the biceps brachii muscle for the EF exercise.

Ten healthy subjects (five males and five females, 24–27 years old) took part in the testing phase: we divided them into five therapist-patient couples and, after the calibration of the acquisition and FES parameters, we asked each couple to repeat the EF exercise twelve times. A single repetition follows this flow: the starting position for both the therapist and the patient is upright sitting, with their forearms and hands resting completely on the table, forming a 90° angle with the upper arm; then, the therapist performs the movement up to his/her AROM and finally returns to the starting position; once the patient has also finished the exercise, a short pause of at least 10 s prevents any muscle fatigue effects.

ATC, FES current, therapist and patient angular values have been collected during the entire session. They were subsequently processed in the MATLAB® environment in order to extract the useful information for the comparison between the voluntary and stimulated movements. The angular signal processing consists of the following steps:


The box plot on the left of Figure 12 represents the entire dataset of *σ* values (60 measures: five couples with 12 repetitions each) extracted during our test campaign. As can be observed, the distribution is Q3-skewed toward unity, which indicates a good reproduction of the movement, further confirmed by a median value above 0.8. Indeed, looking at the angular signals on the right graph of Figure 12, representing a single repetition, we can see how similar the limb motion is between therapist and patient. Moreover, this graph also shows the online modulation of the stimulation current when the ATC values, directly proportional to the therapist limb angle, trigger the increasing, decreasing or plateau current phases. It is possible to notice that the total delay between the two movements is due to a first, short processing phase, visible as the distance between non-zero therapist angle and non-zero FES current, and a longer physiological one, i.e., the distance between non-zero FES current and non-zero patient angle, which depends on the muscle mass and fiber contraction.

**Figure 12.** (**left**) Example of the stimulation application of a repetition of the Elbow Flexion (EF) movement: Blue and red are the angular signals of the therapist and patient, respectively; the dashed black line represents the FES current injected through the electrodes. (**right**) Similarity analysis using the maximum of the cross-correlation coefficient to compare the limb motion angular signals of the therapist and patient for the EF movement.

#### *4.3. Lower Limb: Knee Extension*

We also tried to replicate the Knee Extension (KE) movement, due to its wide use as a physiotherapy exercise. From a sitting initial position, the contraction of the quadriceps femoris muscle allows the extension of the leg with respect to the knee joint. This muscle is composed of four separate muscles: the rectus femoris, in the middle of the thigh; the vastus lateralis, located on the lateral side of the femur; the vastus medialis, on the medial side; and the vastus intermedius, under the rectus femoris [48].

In our test, we decided to monitor the vastus lateralis and vastus medialis, setting up a two-channel stimulation layout. For the vastus medialis, the sEMG electrodes were placed at 80% of the line between the anterior superior iliac spine and the medial side of the patella, while for the vastus lateralis they were put at 2/3 of the line connecting the anterior superior iliac spine with the superior lateral side of the patella [43], as shown in Figure 13a. Both reference electrodes were located on the patella. On the other hand, as reported in Figure 13b, the stimulation electrodes were placed along the muscle bellies in order to cover a surface including both muscles and the rectus femoris.

Our preliminary results consist of 13 repetitions of the movement, performed by a single female subject (24 years old). As in the EF case, both the therapist and the patient need to start from an initial position, which we defined as 135° between thigh and calf. Then, the therapist extends the leg up to his/her AROM and, once the stimulation is completed, at least 10 s have to elapse before the next repetition.

The cross-correlation results, calculated with the same method as in the EF case, are reported in the box plot in Figure 14 (top left). However, these values are also represented by a time-graph (bottom left) in order to avoid any misinterpretation due to the low number of measures. Looking at the graphs, some considerations can be made. First of all, also for the KE movement, we obtained satisfactory results in terms of similarity between the two signals, proved by *σ* values all equal to or greater than 0.9. Indeed, considering the repetition example on the left graph, we can observe the similar morphology of the therapist motion and the patient one. However, the maximal angular value reached by the patient is lower than the therapist one. This behavior is possibly related to the physiology of muscle activation and the different fiber recruitment between voluntary and stimulated contraction. One cause could be associated with the stimulation of a healthy subject with a normal muscle condition, in whom large values of stimulating current could lead to a sense of pain. Hence, we limited the current to the values represented by the dashed line in the graph, avoiding this situation. A second possibility could be related to which muscles have been stimulated: a complete leg extension involves the total contraction of the quadriceps, while superficial electrodes might not produce the proper shortening of the deeper fibers, consequently producing a limited movement. In any case, further studies will allow us to set up more complex stimulation scenarios, which will induce a better reproduction of the movement.

**Figure 13.** Knee extension exercise. (**a**) sEMG acquisition electrodes on the vastus lateralis and vastus medialis muscles. (**b**) Stimulation electrodes placed directly on the muscle bellies of the vastus lateralis and vastus medialis. These locations and the electrode dimensions also allow the contraction of the rectus femoris muscle, improving the stimulation effectiveness.

**Figure 14.** (**left**) Single repetition example showing the similarity between the two angular signals with respect to the vastus lateralis (VL) and vastus medialis (VM) stimulation currents. (**right**) Maximum of the cross-correlation coefficient for the 13 repetitions of the knee extension movement.

#### **5. Discussion: sEMG-FES Systems Comparison**

Table 6 reports a summary of literature works in the field of FES application triggered by sEMG signal analysis. The classification includes which control feature (e.g., RMS, envelope, ATC) has been used to modulate one or more FES parameters online, as well as the employed hardware, which summarizes the processing capability and the possibility of transferring the FES algorithm into an embedded device. Moreover, these systems could also be analyzed by considering additional features such as wireless connectivity, modularity and number of active channels, which foster system wearability, future sensor and algorithm integration, and application typology. Lastly, since real-time behavior remains a major constraint, the rightmost column shows the latency (FES pulse update period) measurements, calculated as the FES processing delay or (whenever available) the therapist-patient delay.


**Table 6.** Comparison of sEMG-triggered FES systems.

<sup>1</sup> measured as therapist-patient delay.

A complete comparison between our system and those reported here is not straightforward due to the large variety of analyzed features; nevertheless, some considerations about the different methods and performance can be made. In [51], the authors used the threshold crossing feature extraction to modulate the stimulation frequency of the FES pulses, achieving a very promising FES definition latency of 142 ms, directly comparable to our outcomes. However, since the sEMG processing has been embedded in the MCU, a standard sampling approach is needed, which includes peripheral management and relatively expensive processing power; on the other hand, with our event-based approach, MCU resources could be drastically reduced by implementing ATC in hardware. At the same time, linking the TC events with the stimulation frequency is particularly interesting in order to reproduce the motor fiber firing rate; in our system we preferred, as a first step, to modulate the intensity, but a frequency-control approach could be easily implemented thanks to the flexible and modular architecture of our system. Another interesting frequency modulation has been presented in [54], which evaluates the sEMG entropy on an MCU architecture, but limits the number of controlled channels to one.

With reference to the pulse amplitude control, it could be performed by extracting different features from the muscle signal (e.g., RMS [49], envelope [50] and force [52]) or by using data-fusion techniques with different types of sensors (such as an Inertial Measurement Unit, IMU [53]). While the final latency of these works and of our proposed system is quite similar, and respects the real-time constraints for this type of application, the processing methodology does not always allow the use of an MCU [53] and, where it is possible, raw data acquisition seems to be the commonly adopted solution.

As another comparison point, to the best of our knowledge for this application, our architecture is the only one which presents a modular system structure, thanks to the combination of chosen programming language and strategies.

Lastly, looking at the ATC-FES system evolution from previous versions [10,18], we enhanced the real-time performance, obtaining a total latency of about 140 ms, defined as the sum of the processing time (10 ms, RHS configuration) and the ATC window (130 ms). Again, since the relevant upgrade in our architecture was to move towards an embedded device (i.e., Raspberry Pi), we confirm how the lightness and low complexity of the ATC technique perfectly match the low processing capability of an embedded system, while maintaining adequate FES control, usability and power performance.

#### **6. Conclusion and Future Perspectives**

In this paper we presented our latest prototype of the sEMG(ATC)-controlled-FES system, in which we replaced the previous MATLAB® & SIMULINK® software architecture [10] with the novel embedded version running on a Raspberry Pi, in order to overcome the performance limitation due to the use of a general-purpose computer. Taking advantage of the object-oriented and multi-threaded approach, along with the versatility of the Python programming language, we developed a multi-platform software core able to work on several devices and with different operating systems, also enhancing system usability by featuring a graphical user interface.

Since the main tasks of the system application concern the modulation of the FES pattern and its real-time application, we designed a processing structure able to match the low computational power of an embedded device. We implemented a calibrated ATC-FES lookup table structure, combined with noise-gateway controls, which allows us to obtain a total processing time, defined as the delay between ATC data and the new FES parameters, below 30 ms (corner cases), obtained without substantially impacting the CPU and RAM usage, therefore demonstrating the lightness and responsiveness of the event-driven technique in the control of the stimulation.

We proved system efficiency by studying the similarity between voluntary and stimulated movements in therapist-patient real FES scenarios (healthy subjects): using the maximum of the normalized cross-correlation coefficient as the comparison measurement between the signals of the involved limbs, we obtained a mean value of 0.87 ± 0.07 as the result of 60 repetitions (5 therapist-patient couples with 12 repetitions each) during the reproduction of the elbow flexion movement. These promising outcomes also allowed us to preliminarily evaluate the FES performance for the knee extension movement: adopting the same methodologies, we analyzed 13 exercise repetitions, achieving a correlation value of 0.93 ± 0.02.

Future improvements, i.e., full FES parameters modulation, pre-FES movement recognition, will permit us to further optimize the ATC-based stimulation in order to extend our testing phase to a wide list of rehabilitation exercises, also performing some clinical trials with the support of medical staff.

**Author Contributions:** Investigation, R.M.R.; Project administration, D.D.; Supervision, P.M.R.; Writing—original draft, F.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to thank the volunteers who kindly offered to take part in the experimental tests for this work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Fast Approximations of Activation Functions in Deep Neural Networks when using Posit Arithmetic**

#### **Marco Cococcioni <sup>1</sup>, Federico Rossi <sup>1</sup>, Emanuele Ruffaldi <sup>2</sup> and Sergio Saponara <sup>1,</sup>\***


Received: 25 January 2020; Accepted: 2 March 2020; Published: 10 March 2020

**Abstract:** With increasing real-time constraints being put on the use of Deep Neural Networks (DNNs) by real-time scenarios, there is the need to review information representation. A very challenging path is to employ an encoding that allows a fast processing and hardware-friendly representation of information. Among the proposed alternatives to the IEEE 754 standard regarding floating point representation of real numbers, the recently introduced Posit format has been theoretically proven to be really promising in satisfying the mentioned requirements. However, with the absence of proper hardware support for this novel type, this evaluation can be conducted only through a software emulation. While waiting for the widespread availability of the Posit Processing Units (the equivalent of the Floating Point Unit (FPU)), we can already exploit the Posit representation and the currently available Arithmetic-Logic Unit (ALU) to speed up DNNs by manipulating the low-level bit string representations of Posits. As a first step, in this paper, we present new arithmetic properties of the Posit number system with a focus on the configuration with 0 exponent bits. In particular, we propose a new class of Posit operators called L1 operators, which consists of fast and approximated versions of existing arithmetic operations or functions (e.g., hyperbolic tangent (TANH) and exponential linear unit (ELU)) only using integer arithmetic. These operators introduce very interesting properties and results: (i) faster evaluation than the exact counterpart with a negligible accuracy degradation; (ii) an efficient ALU emulation of a number of Posits operations; and (iii) the possibility to vectorize operations in Posits, using existing ALU vectorized operations (such as the scalable vector extension of ARM CPUs or advanced vector extensions on Intel CPUs). As a second step, we test the proposed activation function on Posit-based DNNs, showing how 16-bit down to 10-bit Posits represent an exact replacement for 32-bit floats while 8-bit Posits could be an interesting alternative to 32-bit floats since their performances are a bit lower but their high speed and low storage properties are very appealing (leading to a lower bandwidth demand and more cache-friendly code). Finally, we point out how small Posits (i.e., up to 14 bits long) are very interesting while waiting for PPUs to become widespread, since Posit operations can be tabulated in a very efficient way (see details in the text).

**Keywords:** alternative representations to float numbers; posit arithmetic; Deep Neural Networks (DNNs); neural network activation functions

#### **1. Introduction**

Due to the pervasiveness of real-time and critical systems like Internet of Things (IoT) platforms, automotives, and robotics, new types of requirements are being addressed in the use of Deep Neural Networks (DNNs).

The main challenges when dealing with DNNs are both the ubiquitous multiply-and-accumulate operations and the massive use of activation functions across the neural network layers. A big speed-up for these workloads is surely offered by parallelization (e.g., Graphics Processing Units (GPUs) or Single-Instruction Multiple-Data (SIMD) processors). However, these solutions are considerably demanding in terms of resources. Moreover, adding parallelization in critical systems may reduce the predictability of the said system (see References [1,2]). Furthermore, even the use of floating point SIMD engines is not always possible in embedded systems (e.g., ARM Cortex-M4 [3]). This means that we cannot always rely on high-performance processing units in critical and real-time scenarios, thus needing to address new challenges.

Therefore, the challenging topic is to satisfy the real-time requirements while guaranteeing computational efficiency and lowering the power and the cost of such applications. One of the main paths to reduce the computational complexity when evaluating DNNs is stepping away from cumbersome arithmetic such as double-precision floats (represented on 64 bits). The basic idea is to use compressed formats that may save resources in terms of power consumption and computational efficiency. Great examples of compact formats are Brain Floats (BFLOAT) and Flexpoint [4,5], which consist of an optimized version of the 16-bit standard floating point number (IEEE 754) used by Google for their TPU (tensor processing unit) engines. Other formats also come from the concept of transprecision computing [6,7] (NVIDIA Turing architectures allow computation with 4-, 8-, and 32-bit integers and with 16- and 32-bit floats). The up-and-coming Posit format has been theoretically [8–10] and practically [11] proven to be a perfect replacement for IEEE float numbers when applied to DNNs in terms of efficiency and accuracy.

Due to its novelty, this format lacks proper or standardized hardware support (e.g., a Posit Processing Unit (PPU)) to accelerate its computation, forcing the use of software implementations. However, in order to speed up the software emulation of Posits in DNNs, we present two different techniques. In this paper, we extend the work on deriving a fast and approximate version of the hyperbolic tangent (TANH) presented in Reference [12]. We introduce novel arithmetic properties of the Posit number system with a deep focus on the Posits with 0 exponent bits. This special case allows us to build common functions and arithmetic operators as simple bit manipulations on the bit-string representing a Posit number. This new class of functions (called L1 functions) has some interesting properties:


In particular, in this extension, we also propose a new fast and approximated version of the Exponential Linear Unit (ELU) activation function.

Moreover, if we consider really low-power devices that do not embed a floating point unit but only an arithmetic logic unit, the approach proposed can become very interesting to enable DNN processing even in this class of devices (although for inference only, not for training).

Furthermore, we investigate operator tabulation as a different approach to speed up Posit emulation without constraints on the exponent configuration. This allows us to accelerate basic arithmetic operators like sum and multiplication that are not suitable for being implemented as L1 functions. Although very powerful, this approach has clear limitations to its scalability, having a considerable spatial complexity.

#### *Paper Structure*

The paper is organized as follows: Section 2 introduces the Posit format, proposing novel approaches to approximation and speed-up of Posit arithmetic, exploiting the 0-bit exponent Posit configuration. Section 3 describes the cppPosit library implemented in Pisa for the computation of the new numerical format. Section 4 introduces the hyperbolic tangent and ELU activation functions along with their approximations. Section 5 shows the results of our approach with DNN and common benchmarking datasets. Finally, Section 6 provides the conclusions.

#### **2. Posit Arithmetic**

The Posit format was introduced by John L. Gustafson in Reference [8] and was further investigated in References [9,10,12]. The format is a fixed-length one with up to 4 fields, as also reported in Figure 1:



**Figure 1.** Illustration of the 32-bit Posit data type.

Given a Posit with *nbits* total bits and *esbits* exponent bits, represented by the signed integer *X*, and denoting by *k* the regime value and by *e* and *f* the exponent and fraction values, respectively, the real number *r* represented by that encoding is as follows:

$$r = \begin{cases} 0, & \text{if } X = 0 \\ \text{NaN}, & \text{if } X = -2^{(nbits - 1)} \\ \text{sign}(X) \times useed^{k} \cdot 2^{e} \cdot (1 + f), & \text{otherwise} \end{cases}$$

where *useed* = 2<sup>2<sup>*esbits*</sup></sup>.

An example of Posit decoding operation is shown in Figure 2.


**Figure 2.** An example of a 16-bit Posit with 3 bits for the exponent (*esbits* = 3): Given the sequence on top of the figure, after detecting that it starts with one 1, we have to compute the 2's complement of all the remaining bits (passing from 001-110-111011001 to 110-001-000100111). Then, we can proceed to decode the Posit. The associated real value is therefore −256<sup>1</sup> · 2<sup>1</sup> · (1 + 39/512). The final value is therefore −512 · (1 + 39/512) = −551 (exact value, i.e., no rounding, for this case).
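The decoding steps illustrated in Figure 2 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the cppPosit implementation: the function name `decode_posit` and its signature are hypothetical, *useed* is taken as 2<sup>2<sup>*esbits*</sup></sup>, and exponent bits truncated by a long regime are treated as zeros.

```python
def decode_posit(bits, nbits, esbits):
    """Decode an nbits-wide posit bit pattern into a Python float."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("nan")  # special pattern 100...0

    sign = -1.0 if bits >> (nbits - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask  # two's complement of a negative pattern

    # Regime: run of identical bits right after the sign bit.
    pos = nbits - 2
    first = (bits >> pos) & 1
    run = 0
    while pos >= 0 and ((bits >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run

    # 'pos' now indexes the stop bit (or is -1); the bits below it remain.
    remaining = max(pos, 0)
    es_avail = min(esbits, remaining)
    e = (bits >> (remaining - es_avail)) & ((1 << es_avail) - 1)
    e <<= esbits - es_avail  # truncated exponent bits count as zeros

    frac_len = remaining - es_avail
    f = (bits & ((1 << frac_len) - 1)) / (1 << frac_len)

    useed = 2 ** (2 ** esbits)
    return sign * float(useed) ** k * 2.0 ** e * (1.0 + f)


# Bit pattern of Figure 2: sign 1, then 001-110-111011001.
value = decode_posit(0b1001110111011001, 16, 3)  # -551.0
```

Taking the two's complement of the whole word, as done here, yields the same regime, exponent and fraction bits as complementing only the bits after the sign.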

The design of a hardware Posit Processing Unit (PPU) as a replacement for the FPU has already started at several universities worldwide, but it will take time before such units become available on real platforms. Fortunately, we can still do many things related to DNNs even in the absence of a hardware PPU. Furthermore, when DNN weights can be represented with Posits of less than 14 bits, we can tabulate some core operations like sum and multiplication (see Section 3.1) and can use the ALU for other operations that will be shown hereafter in order to reduce the number of tables.

As reported above, the process of decoding a Posit involves the following steps: obtaining regime value by reconstructing the bit-string, building exponent, and extracting fraction. We can make use of C low-level building blocks to speed up the decoding:

• Count leading zeros: using the \_\_builtin\_clz compiler built-in available in C (e.g., in GCC), which several CPU families implement in hardware [13].

• Next power of two: used to extract the fraction. An efficient way to obtain the next power of two, given a representation *X* on 32 bits, is the following:

    next_p2(X) -> Y
        Y = X - 1
        Y = Y | (Y >> 1)
        Y = Y | (Y >> 2)
        Y = Y | (Y >> 4)
        Y = Y | (Y >> 8)
        Y = Y | (Y >> 16)
        Y = Y + 1

This approach copies the highest set bit to all the lower bits. Adding one to such a string will result in a sequence of carries that will set all the bits from the highest set one to the least significant one to 0 and the next (in order of significance) bit after the highest set one to 1, thus producing the next power of two. Let us use an example. Suppose *X* = (5)<sub>10</sub> = (0101)<sub>2</sub>. At the first step, *Y* = (0100)<sub>2</sub>. At the second step, *Y* = (0100)<sub>2</sub> | (0010)<sub>2</sub> = (0110)<sub>2</sub>. At the next step, *Y* = (0110)<sub>2</sub> | (0001)<sub>2</sub> = (0111)<sub>2</sub>. From now on, *Y* will remain set to (0111)<sub>2</sub>. At the last step, *Y* = (0111)<sub>2</sub> + (0001)<sub>2</sub> = (1000)<sub>2</sub> = (8)<sub>10</sub>, that is, the next power of two starting from 5.
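The two building blocks above can be sketched in Python for 32-bit words as follows. The names `clz32` and `next_p2` are hypothetical, and explicit masking emulates the fixed 32-bit width of a C unsigned integer. Each step ORs the running value with a right-shifted copy of itself, as in the worked example.

```python
MASK32 = 0xFFFFFFFF


def clz32(x):
    """Number of leading zero bits of x viewed as a 32-bit word."""
    return 32 - (x & MASK32).bit_length()


def next_p2(x):
    """Smallest power of two >= x, for 1 <= x <= 2**31."""
    y = (x - 1) & MASK32
    for shift in (1, 2, 4, 8, 16):
        y |= y >> shift  # smear the highest set bit downwards
    return (y + 1) & MASK32


print(next_p2(5))   # 8, as in the worked example
print(clz32(5))     # 29 leading zeros in 0x00000005
```

Starting from *X* − 1 rather than *X* makes exact powers of two map to themselves (for instance, 8 stays 8). Note that `clz32(0)` returns 32 here, whereas the C built-in is undefined for a zero argument.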

#### *2.1. The Case of No Exponent Bits (esbits = 0)*

When using a Posit configuration with zero exponent bits (*esbits* = 0), some interesting properties arise. In this case, we can express the real number represented by the Posit as follows:

$$\mathbf{x} = \mathbf{2}^k \cdot (\mathbf{1} + \boldsymbol{\phi} \cdot \mathbf{2}^{-F}) \tag{1}$$

where *φ* is the fraction field and *F* the fraction length. The value of *k* depends on the regime length R. In particular, *<sup>k</sup>* <sup>=</sup> <sup>−</sup>*<sup>R</sup>* for *<sup>x</sup>* <sup>&</sup>lt; 1 (from now on *<sup>x</sup>*−) and *<sup>k</sup>* <sup>=</sup> *<sup>R</sup>* <sup>−</sup> 1 for *<sup>x</sup>* <sup>&</sup>gt;<sup>=</sup> 1 (from now on *<sup>x</sup>*+). If we denote the bit immediately following the regime bit string (stop-bit) as *σ*, we can express the value of *R* as *R* = *N* − 2 − *σp*, where *σ<sup>p</sup>* is the position of the stop-bit in the Posit bit-string. For *x*−, we can note that, substituting the expression for *F* = *N* − 2 − *R* in (1), we get the following expression:

$$x^- = 2^{-R} + \phi \cdot 2^{-(N-2)} = 2^{-(N-2)} \cdot (2^{F} + \phi) \tag{2}$$

Moreover, since the integer value of the bit-string is *X* = 2<sup>*F*</sup> + *φ* (the stop-bit followed by the fraction), we can link *x*<sup>−</sup> with its representation *X* using Equation (2), obtaining:

$$\mathbf{x}^- = \mathbf{X} \cdot \mathbf{2}^{-(N-2)} \tag{3}$$

A particular property emerges with 0-bit exponent Posits when considering the [0, 1] range. In fact, if we plot the resolution (that is, the decimal difference between the real numbers represented by two consecutive bit-strings) of a Posit⟨*X*, 0⟩ in [0, 1], we obtain the resolution of a fixed-point format. This property is visualized in Figure 3. This is a very important property that will be exploited below.

As we will see below, the novel equations introduced above play an important role in deriving fast approximations of activation functions in DNNs. Equation (3) also says that a Posit with zero exponent bits in [0, 1] can be interpreted as a fixed-point number with a shift of (*N* − 2) bits. This has implications for the accuracy and for further operations.
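The fixed-point interpretation of Equation (3) can be checked numerically with a small decoder. The sketch below is an illustration under stated assumptions: `decode_posit0` is a hypothetical helper (not part of cppPosit) that decodes a signed Python integer holding an *N*-bit Posit with *esbits* = 0, and NaR is deliberately not handled.

```python
def decode_posit0(X: int, N: int) -> float:
    """Value of a signed integer X holding an N-bit Posit with esbits = 0."""
    if X == 0:
        return 0.0
    if X < 0:                                    # negative Posits are two's complements
        return -decode_posit0(-X, N)
    bits = format(X, f"0{N - 1}b")               # the N-1 bits after the sign bit
    run = len(bits) - len(bits.lstrip(bits[0]))  # length of the regime run
    k = run - 1 if bits[0] == "1" else -run
    frac = bits[run + 1:]                        # bits after the stop-bit
    phi = int(frac, 2) if frac else 0
    return 2.0 ** k * (1 + phi / 2 ** len(frac))


N = 8
scale = 1 << (N - 2)
# Equation (3): in [0, 1] the value is the bit-string scaled by 2^-(N-2).
assert all(decode_posit0(X, N) == X / scale for X in range(scale + 1))
print(decode_posit0(0b00110, 5))                 # 5-bit Posit with bit-string (00110)
```

Outside [0, 1] the equality no longer holds, since the regime grows and the fraction shrinks as the magnitude increases.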

An example of how to exploit the expressions discovered in the previous section is building a fast approximated inversion operator. Given *x*, we want to find a fast and efficient way to compute *y* such that *x* · *y* ≈ 1. In the following, we will consider only positive values of *x*. The simplest case is when *f* = 0. Let us consider *x* > 1; since *k<sub>x</sub>* = *R<sub>x</sub>* − 1 and *k<sub>y</sub>* = −*R<sub>y</sub>*, we simply need to apply a reduction of the regime length by 1, as in Equation (4).

$$x \cdot y = 2^{R_x - 1 - R_y} = 1 \to R_y = R_x - 1 \tag{4}$$

A trickier case is when *f* > 0. Here, we can easily see that *k<sub>x</sub>* + 1 = −*k<sub>y</sub>*, which implies *R<sub>x</sub>* = *R<sub>y</sub>*. Therefore, we get Equation (5).

$$\mathbf{x} \cdot \mathbf{y} = 2^{-1} \cdot \left( 1 + f\_{\mathbf{x}} \cdot f\_{\mathbf{y}} \cdot 2^{-2F\_{\mathbf{x}}} + (f\_{\mathbf{x}} + f\_{\mathbf{y}}) \cdot 2^{-F\_{\mathbf{x}}} \right) = 1 \tag{5}$$

Then, discarding the term *f<sub>x</sub>* · *f<sub>y</sub>* · 2<sup>−2*F<sub>x</sub>*</sup>, we obtain Equation (6):

$$1 + (f\_{\mathbf{x}} + f\_{\mathbf{y}}) \cdot 2^{-F\_{\mathbf{x}}} = 2 \to f\_{\mathbf{y}} = 2^{F\_{\mathbf{x}}} - f\_{\mathbf{x}} \tag{6}$$

The latter can be obtained by simply applying a bitwise NOT to *f<sub>x</sub>* and adding 1, thus obtaining Equation (7):

$$Y = X \oplus (\neg \text{sign} \, \text{mask}) \tag{7}$$

where ⊕ is the exclusive or (XOR) operator, ¬ is the bitwise negation operator, and *signmask* is a mask obtained as shown in the following pseudo-code. For example, given a 5-bit Posit, the signmask is simply (10000)<sub>2</sub>. The pseudo-code also takes into account the holder size; in fact, a 5-bit Posit may be held by an 8-bit integer. This means that, for this holder type, the signmask produced by the pseudo-code is (11110000)<sub>2</sub>.

**Figure 3.** Resolution of a 12-bit Posit when varying the exponent size. With a 0-bit exponent, the Posit resolution in the [0, 1] range is the one of a 12-bit fixed point format.

A pseudo-code implementation for *f* > 0 (otherwise, we simply invert the sign) is as follows:

    inv(x) -> y
        X = x.v                               // 'v' field: bit-string representing the Posit
        msb = 1 << (N-1)
        signmask = ~((msb | (msb - 1)) >> 1)
        Y = X ^ (~signmask)                   // negation operator followed by XOR operator (C-style)
        y(Y)
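To get a feeling for the quality of this approximation, the inversion can be emulated on plain integers. The following sketch makes several assumptions: `decode_posit0` is a hypothetical decoder for Posits with *esbits* = 0 (NaR not handled), `fast_inv` transcribes the pseudo-code for positive arguments, and the three test bit-strings (values 1.25, 1.5 and 3 in an 8-bit Posit) are chosen only for illustration.

```python
def decode_posit0(X: int, N: int) -> float:
    """Value of a signed integer X holding an N-bit Posit with esbits = 0."""
    if X == 0:
        return 0.0
    if X < 0:
        return -decode_posit0(-X, N)
    bits = format(X, f"0{N - 1}b")               # the N-1 bits after the sign bit
    run = len(bits) - len(bits.lstrip(bits[0]))  # length of the regime run
    k = run - 1 if bits[0] == "1" else -run
    frac = bits[run + 1:]                        # bits after the stop-bit
    phi = int(frac, 2) if frac else 0
    return 2.0 ** k * (1 + phi / 2 ** len(frac))


def fast_inv(X: int, N: int) -> int:
    """Approximate reciprocal of a positive N-bit Posit (esbits = 0)."""
    msb = 1 << (N - 1)
    signmask = ~((msb | (msb - 1)) >> 1)
    return X ^ (~signmask)                       # flip every bit except the sign bit


N = 8
for X in (0b01001000, 0b01010000, 0b01101000):   # x = 1.25, 1.5, 3.0
    x = decode_posit0(X, N)
    y = decode_posit0(fast_inv(X, N), N)
    print(f"x = {x}, y = {y}, x*y = {x * y:.3f}")
```

The product stays close to 1 while the fraction field is reasonably long; near *maxpos* and *minpos*, where almost no fraction bits are left, the error grows noticeably.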

Another useful function to implement through bit manipulation is the one's complement operator (8):

$$y = 1 - \mathbf{x} \tag{8}$$

This is of interest when *x* ∈ [0, 1]. In this case, *y* ∈ [0, 1], of course. From Equations (1) and (2), we can rewrite the operator as in Equation (9).

$$y = 1 - 2^k \cdot (1 + \phi \cdot 2^{-F}) = 1 - 2^{-(N-2)} \cdot (2^{F} + \phi) \tag{9}$$

Since we can link *x* to its representation *X*, we obtain (10).

$$y = 1 - X \cdot 2^{-(N-2)}\tag{10}$$

Then, we can also link *y* to *Y*, obtaining (11).

$$Y = 2^{N-2} \cdot \left[1 - X \cdot 2^{-(N-2)}\right] = 2^{N-2} - X \tag{11}$$

The latter can be obtained easily with an integer subtraction only using the ALU. A pseudo-code implementation is the following:

```
comp_one(x) -> y
    X = x.v // 'v' field: bit-string representing the Posit
    invert_bit = 1 << (N-2)
    Y = invert_bit - X
    y(Y)
```
When *esbits* = 0, we know that *x* = 2<sup>*k*</sup> · (1 + *φ* · 2<sup>−*F*</sup>). When doubling/halving *x*, we simply increment/decrement the exponent *k* by 1. For 0-bit exponent Posits, this operation corresponds to one left shift for doubling and one right shift for halving the number. For instance, let us take a Posit⟨5, 0⟩ with the value 3/4. The corresponding bit-string is (00110)<sub>2</sub>. If we shift it right by one position, we get (00011)<sub>2</sub>, which is the bit-string corresponding to a Posit with value 3/8.
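The one's complement and the halving example can be reproduced on plain integers. The sketch below assumes that both the argument and the result lie in [0, 1], where Equation (3) lets us read a bit-string *X* as *X* · 2<sup>−(*N*−2)</sup>; the helper names `comp_one` and `half` are illustrative only.

```python
def comp_one(X: int, N: int) -> int:
    """Bit-string of 1 - x for a Posit with esbits = 0 and x in [0, 1]."""
    invert_bit = 1 << (N - 2)
    return invert_bit - X


def half(X: int) -> int:
    """Halve a Posit with esbits = 0 lying in [0, 1] with one right shift."""
    return X >> 1


N = 5
scale = 1 << (N - 2)                   # Equation (3): x = X / 2^(N-2) in [0, 1]
X = 0b00110                            # the 5-bit Posit with value 3/4
print(X / scale, comp_one(X, N) / scale, half(X) / scale)
```

For the 5-bit example, `comp_one` yields (00010)<sub>2</sub>, that is 1/4, and `half` yields (00011)<sub>2</sub>, that is 3/8.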

#### *2.2. FastSigmoid*

As pointed out in Reference [8], if we plot the 2's complement value of the signed integer representing the Posit against the real number obtained from Equation (2), we obtain an S-shaped function very similar to the sigmoid curve. What we need to do is to rescale it to have the co-domain [0, 1] and to shift it in order to center it in 0. To bring the Posit into [0, 1], we must notice that this quadrant is characterized by having the two most significant bits set to 00 (see Figure 4).

Moreover, we can notice that adding the *invert bit* seen in the previous sections to the Posit representation means moving it by a quarter of the ring. In fact, with *esbits* = 0, when adding the invert bit, we are adding 2<sup>*N*−2</sup>, which is equal to *L* = 1/*minpos*, the number of Posits that fit in a single quarter of the ring. This means moving *L* times along the Posit ring, thus skipping a quarter of it. A pseudo-code implementation of this transformation is the following:

    fastSigmoid(x) -> y
        X = x.v                               // 'v' field: bit-string representing the Posit
        Y = (invert_bit + (X >> 1)) >> 1
        y(Y)

In order to understand how this code works, we need to separate the analysis for *x*<sup>−</sup> and *x*<sup>+</sup>, considering only positive values, since the reasoning is symmetric for negative ones. Figure 5 shows the behaviour of the two sigmoid versions.

**Figure 5.** Accuracy comparison between the exact and approximated versions of the Sigmoid function.

We know that, for values of *x* ∈ [0, 1], the behaviour of *x* is like the one of a fixed point representation, so the first right shift is simply a division by two. When we add the *invert bit*, we move the Posit in the northeast ring quarter ([1, +*NaR*)). After this addition, the last shift can be considered as a division by two as well, thus obtaining the following:

$$y = \frac{\mathbf{x}}{4} + \frac{1}{2} \tag{12}$$

Equation (12) is also the first-order Taylor expansion of the Sigmoid function in *x*<sub>0</sub> = 0.
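The behaviour described by Equation (12) can be observed by running the FastSigmoid bit manipulation on plain integers. The sketch below assumes an 8-bit Posit with *esbits* = 0 stored in a signed Python integer (so that `>>` is an arithmetic shift) and restricts the inputs to [−1, 1], where a bit-string *X* can be read as *X* · 2<sup>−(*N*−2)</sup>; the function name `fast_sigmoid` is illustrative.

```python
import math


def fast_sigmoid(X: int, N: int) -> int:
    """FastSigmoid on the signed bit-string X of an N-bit Posit (esbits = 0)."""
    invert_bit = 1 << (N - 2)
    return (invert_bit + (X >> 1)) >> 1


N = 8
scale = 1 << (N - 2)
for X in (-64, -32, 0, 32, 64):               # x = -1, -0.5, 0, 0.5, 1
    x = X / scale                             # valid because |x| <= 1
    y = fast_sigmoid(X, N) / scale            # the output always lies in [0, 1]
    exact = 1.0 / (1.0 + math.exp(-x))
    print(f"x = {x:+.2f}  fast = {y:.4f}  x/4 + 1/2 = {x / 4 + 0.5:.4f}  exact = {exact:.4f}")
```

On these inputs the integer result coincides with *x*/4 + 1/2, and the distance from the exact sigmoid grows as |*x*| approaches 1, as expected from a first-order expansion around 0.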

For *x*<sup>+</sup>, with *x* represented as the bit-string *X* = (0, 1[*R<sub>x</sub>*], 0, *φ<sub>x</sub>*), the right shift will produce *X*′ = (0, 0, 1[*R<sub>x</sub>* − 1], 0, *φ<sub>x</sub>*). Now, with some computation, we can express the corresponding value *x*′ as a function of *x* and *R<sub>x</sub>*, obtaining (13).

$$x' = \frac{x}{2^{2 \cdot R_x}} + \frac{2^{2 \cdot R_x} - 3 \cdot 2^{R_x - 1}}{2^{2 \cdot R_x}} \tag{13}$$

When adding the *invert bit*, we obtain *X*″ = (0, 1[*R<sub>x</sub>* + 1], 0, *φ<sub>x</sub>*). Finally, with the last right shift, we obtain (14).

$$y = \frac{x}{2^{2 \cdot R_x + 1}} + 3 \cdot \frac{2^{2 \cdot R_x} - 2^{R_x}}{4 \cdot 2^{2 \cdot R_x}} \tag{14}$$

We know that we can approximate *R<sub>x</sub>* ∼ log<sub>2</sub>(*x*) → *x* ∼ 2<sup>*R<sub>x</sub>*</sup>. If we substitute it back in Equation (14), we obtain Equation (15), close to *sigmoid*(*R<sub>x</sub>*):

$$\frac{3 \cdot 2^{R\_x} - 1}{4 \cdot 2^{R\_x}} \tag{15}$$

#### **3. CppPosit Library**

For this paper, we employ our software implementation of Posit numbers developed at the University of Pisa, called cppPosit. As already described in References [9,12], the library classifies Posit operations into four different classes (from L1 to L4), with increasing computational complexity.

Among these, L1 operations are the ones we want to focus on, since they can be fully emulated with an ALU. For this reason, they provide means to produce very efficient operators, as reported in Table 1.

This level supports Posit reciprocation and sign-negation as well as one's complement. Furthermore, when dealing with a 0 exponent-bit configuration, it provides the fast and approximated sigmoid function (FastSigmoid) as described in Reference [8] and the fast approximation of the hyperbolic tangent (FastTanh) investigated in Reference [12]. Other interesting operators that require 0 exponent bits are the double and half functions. It is clear that, given these requirements, it is not always easy to derive a simple expression for a particular function that can be implemented in an L1 way. However, the effort put in this step is completely rewarded, since it brings both faster execution, in an emulated as well as in a hardware Posit Processing Unit (PPU), and a reduction of transistor count when dealing with a hardware implementation of the unit.

**Table 1.** Most interesting L1 operators implemented in cppPosit and their requirements to be applied on the argument *x*.


#### *3.1. Tabulated Posits*

In the absence of proper hardware support through a Posit Processing Unit (PPU), there is still the need for speeding up the computation. An interesting means to cope with this problem is the pre-computation of some useful Posit operators in look-up tables (LUTs). These LUTs become useful when the number of bits is low (e.g., *nbits* < 12). The core idea is to generate tables for the most important arithmetic operations (addition/subtraction and multiplication/division) for all combinations of a given Posit configuration ⟨*nbits*, *esbits*⟩. Moreover, some interesting functions, like *logarithm* or *exponentiation*, can be tabulated in order to speed up their computation. Given an *nbits*-bit Posit, with a naive approach, a table will be *T* ∈ *P*<sup>*R*×*C*</sup>, where *R* = *C* = 2<sup>*nbits*</sup> − 1.

Depending on the underlying storage type T, each table entry will occupy b=sizeof(T) bits. Typically, there will be between *N* = 8 and *N* = 10 tables for a Posit configuration. This means that the overall space occupation will be *S* = *N* · (*R* · *C*) · *b*.

Table 2 shows different per-table occupations of different Posit configurations. As reported, only Posits with 8 and 10 bits have reasonable occupation, considering current generation of CPUs. In fact, we can obtain a considerable speed-up when one or more tables can be entirely contained inside the cache.
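As a rough illustration of how quickly the naive tables grow, the per-table occupation *R* · *C* · *b* can be evaluated for a few configurations. The sketch below is not a reproduction of Table 2: the configurations and the entry sizes (given here in bytes, as returned by `sizeof` in C++) are assumptions chosen for the example, and `lut_bytes` is a hypothetical helper.

```python
def lut_bytes(nbits: int, entry_bytes: int) -> int:
    """Occupation in bytes of one naive two-operand table for an nbits-bit Posit."""
    side = 2 ** nbits - 1                  # R = C = 2^nbits - 1
    return side * side * entry_bytes


# (nbits, bytes of the storage type): illustrative choices only
for nbits, entry_bytes in ((8, 1), (10, 2), (12, 2), (16, 2)):
    size = lut_bytes(nbits, entry_bytes)
    print(f"{nbits}-bit Posit: {size / 2 ** 20:.2f} MiB per table")
```

The quadratic dependence on 2<sup>*nbits*</sup> is what makes tabulation attractive only for small configurations that fit in the cache.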


**Table 2.** Table occupation for various configurations.

In order to reduce both LUT size and their number, we can exploit some arithmetic properties:



**Table 3.** All the possible combinations for multiplying and dividing two Posit numbers.

**Table 4.** All the possible combinations for multiplying and dividing two Posit numbers: all the cells in italics correspond to the same LUT entry, and all the remaining ones correspond to another LUT entry.


#### *3.2. Type Proxying*

When dealing with a Posit configuration with *esbits* ≠ 0, it is not possible to exploit the fast approximations of operators that rely on the zero-exponent-bit property. A possible solution is to switch to a different Posit configuration with 0 exponent bits and a higher total number of bits, exploit the fast approximation there, and then switch back to the original one.

Increasing the number of bits is also useful when the starting Posit configuration has already 0 exponent bits. In fact, increasing *nbits* for the operator computation increases the accuracy of the computation, avoiding type overflows.

Given a Posit configuration *P*<sub>1</sub>⟨*X*, *Y*⟩, the basic idea is to proxy through a configuration *P*<sub>2</sub>⟨*Z*, 0⟩ with *Z* ≥ *X*. The core step in the approach is the Posit conversion between different configurations. The base case is converting *P*<sub>1</sub>⟨*X*, 0, *T*<sub>1</sub>⟩ → *P*<sub>2</sub>⟨*Z*, 0, *T*<sub>2</sub>⟩, with *Z* ≥ *X* and sizeof(T2) ≥ sizeof(T1). In this case, the conversion operation is the following:

```
convert0(p1) -> p2
   v1 = p1.v // 'v' field: bit-string representing the Posit
   v2 = cast<T2>(v1) << (Z - X)
   p2.v = v2 // store the widened bit-string in the destination Posit
```
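The base conversion can be checked exhaustively for a small source type. The sketch below is an illustration under assumptions: `decode_posit0` is a hypothetical decoder for Posits with *esbits* = 0 (NaR not handled), the cast to the wider holder type is modelled by Python's unbounded integers, and the pair ⟨8, 0⟩ to ⟨16, 0⟩ is an arbitrary choice.

```python
def decode_posit0(X: int, N: int) -> float:
    """Value of a signed integer X holding an N-bit Posit with esbits = 0."""
    if X == 0:
        return 0.0
    if X < 0:
        return -decode_posit0(-X, N)
    bits = format(X, f"0{N - 1}b")               # the N-1 bits after the sign bit
    run = len(bits) - len(bits.lstrip(bits[0]))  # length of the regime run
    k = run - 1 if bits[0] == "1" else -run
    frac = bits[run + 1:]                        # bits after the stop-bit
    phi = int(frac, 2) if frac else 0
    return 2.0 ** k * (1 + phi / 2 ** len(frac))


def convert0(v1: int, X: int, Z: int) -> int:
    """Widen the bit-string of an X-bit Posit (esbits = 0) to Z >= X bits."""
    return v1 << (Z - X)


# Every 8-bit value (NaR excluded) keeps its value after widening to 16 bits.
assert all(
    decode_posit0(convert0(v, 8, 16), 16) == decode_posit0(v, 8)
    for v in range(-127, 128)
)
print(decode_posit0(convert0(0b01010000, 8, 16), 16))
```

Padding with zeros on the right leaves sign, regime and stop-bit untouched and only appends zero fraction bits, which is why the value is preserved exactly.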
#### *3.3. Brain Posits*

The idea behind Brain Floats is to define a Float16 with the same number of exponent bits as an IEEE 754 Float32. BFloat16 is thus different from IEEE 754 Float16, and the rationale of its introduction is that, when we have a DNN already trained with IEEE Float32, we can perform the inference with a BFloat16 and expect a reduced impact on the accuracy, because the dynamic range of a BFloat16 is the same as that of IEEE Float32. Following the very same approach, we can define *Brain Posits* to be associated to the Posit16 and Posit32 that will be standardized soon. In particular, BPosit16 can be designed in such a way that it has the same dynamic range of a standard Posit32, which will be the one with 2 bits of exponent. Since we are using the Posit format, we can define the BPosit16 as the 16-bit Posit having a number of exponent bits such that its dynamic range is similar to the one of Posit⟨32, 2⟩. Using the same approach, we will define BPosit8, where the number of exponent bits must be the one that allows the BPosit8 to cover most of the dynamic range of the standard 16-bit Posit, which is the Posit⟨16, 1⟩. In the following, we will perform some computations to derive the two numbers of exponent bits.

Indeed, another interesting aspect of type proxying is that we can also reduce the total number of bits while increasing the exponent ones and still be able to accommodate the entire dynamic range. In doing so, we need to know the minimum number of exponent bits of the destination type. Suppose we are converting from Posit *P*<sub>1</sub>⟨*X*<sub>1</sub>, *Y*<sub>1</sub>⟩ to Posit *P*<sub>2</sub>⟨*X*<sub>2</sub>, *Y*<sub>2</sub>⟩, with *X*<sub>1</sub> > *X*<sub>2</sub>. We know that the maximum value for *P*<sub>1</sub> (similarly, it holds for *P*<sub>2</sub> as well) is *max*<sub>1</sub> = (2<sup>2<sup>*Y*<sub>1</sub></sup></sup>)<sup>*X*<sub>1</sub>−2</sup>. If we set the inequality *max*<sub>2</sub> ≥ *max*<sub>1</sub> and we apply logarithms to both sides, we get (*X*<sub>2</sub> − 2) · 2<sup>*Y*<sub>2</sub></sup> ≥ (*X*<sub>1</sub> − 2) · 2<sup>*Y*<sub>1</sub></sup>. From this, we obtain the rule for determining the exponent bits of the destination type:

$$Y_2 \ge \log_2\left(\frac{X_1 - 2}{X_2 - 2}\right) + Y_1 \tag{16}$$

From Equation (16), we can derive some interesting cases. A Posit *P*<sub>1</sub>⟨16, 1⟩ can be transformed into a Posit *P*<sub>2</sub>⟨8, 2⟩ without a significant loss in the dynamic range. Furthermore, the same holds for a Posit *P*<sub>1</sub>⟨32, 2⟩, which can be approximated using a Posit *P*<sub>2</sub>⟨16, 3⟩.
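The two cases can be worked out numerically. In the sketch below, `esbits_bound` evaluates the right-hand side of Equation (16) and `log2_maxpos` the base-2 logarithm of the maximum value, (*X* − 2) · 2<sup>*Y*</sup>; both names are hypothetical helpers introduced only for this example.

```python
import math


def esbits_bound(x1: int, y1: int, x2: int) -> float:
    """Right-hand side of Equation (16): lower bound on the destination exponent bits."""
    return math.log2((x1 - 2) / (x2 - 2)) + y1


def log2_maxpos(nbits: int, esbits: int) -> int:
    """Base-2 logarithm of the largest representable value, (nbits - 2) * 2^esbits."""
    return (nbits - 2) * 2 ** esbits


for (x1, y1), (x2, y2) in (((16, 1), (8, 2)), ((32, 2), (16, 3))):
    print(
        f"<{x1},{y1}> -> <{x2},{y2}>: bound = {esbits_bound(x1, y1, x2):.3f}, "
        f"log2(max) = {log2_maxpos(x1, y1)} vs {log2_maxpos(x2, y2)}"
    )
```

In both cases the bound lies slightly above the chosen number of exponent bits, so the smaller format covers most, but not all, of the source dynamic range (24 of 28 and 112 of 120 binades of the positive range, respectively).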

For all these reasons, the Brain Posits proposed in Table 5 might deserve a hardware implementation too.


**Table 5.** Brain Posits.

#### **4. Hyperbolic Tangent, Extended Linear Unit, and their Approximations**

The hyperbolic tangent (*tanh* from now on) is a commonly used activation function. Its use over the sigmoid function is interesting since it extends the sigmoid codomain to the interval [−1, 1]. This allows both the dynamic range of the sigmoid in the output to be exploited twice and the negative values in classification layers during training to be given meaning. The first advantage is particularly important when applied to Posits, especially small-sized ones. In fact, when we apply the sigmoid function to a Posit⟨*X*, *Z*⟩, we practically obtain in the output the dynamic range of a Posit⟨*X*/2, *Z*⟩, which is quite limiting, for instance, for Posits with 8 to 14 bits. Figure 6 stresses this point, highlighting how the tanh function spans the two densest quarters of the Posit circle (the interval [−1, 1] occupies half of the Posit circle).

However, the sigmoid function has an important property, as shown in Table 1 and in Reference [8]: it can be implemented as L1 function, thus having a fast and efficient approximation only using integer arithmetics. The idea is to use the sigmoid function as a building block for other activation functions, only using a combination of L1 operators. We know that the sigmoid function is:

$$\text{sigmoid}(x) = \frac{1}{e^{-x} + 1} \tag{17}$$

Now, we can scale and translate (17) to cover the desired range [−1, 1] on the output obtaining the scaled sigmoid:

$$\text{sSigmoid}_k(x) = k \cdot \text{sigmoid}(k \cdot x) - k/2 \tag{18}$$

Equation (18) is useful when setting *k* = 2, thus obtaining the tanh expression in (19):

$$\text{sSigmoid}_2(x) = (e^{2x} - 1)/(e^{2x} + 1) = \tanh(x) = 2 \cdot \text{sigmoid}(2 \cdot x) - 1 \tag{19}$$
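The identity between the scaled sigmoid with *k* = 2 and the hyperbolic tangent can be verified numerically in ordinary floating point. The snippet below is only a sanity check of Equations (17) to (19); the function names `sigmoid` and `s_sigmoid` are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Equation (17)."""
    return 1.0 / (math.exp(-x) + 1.0)


def s_sigmoid(x: float, k: float) -> float:
    """Scaled sigmoid of Equation (18)."""
    return k * sigmoid(k * x) - k / 2


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(s_sigmoid(x, 2), math.tanh(x), abs_tol=1e-12)
    print(f"x = {x:+.1f}  sSigmoid_2 = {s_sigmoid(x, 2):+.6f}  tanh = {math.tanh(x):+.6f}")
```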

From this formulation, we want to build an equivalent one that only uses L1 operators to build the approximated hyperbolic tangent, switching from sigmoid to the fast approximated version called FastSigmoid. Since we are dealing with 0 exponent bit Posits, the operations of doubling the Posit argument, computing the FastSigmoid, and doubling again are just a matter of bit manipulations and are thus efficiently computed. However, the last step of subtracting 1 from the previous result is not an L1 operator out-of-the-box; thus, we reformulate the initial expression, obtaining (20):

$$
tanh(\mathbf{x}) = -\left(\mathbf{1} - \mathbf{2} \cdot \operatorname{sigmoid}(\mathbf{2} \cdot \mathbf{x})\right) \tag{20}
$$

If we consider only negative arguments *x*, we know that the result of the expression 2 · *sigmoid*(2 · *x*) is always in the unitary region [0, 1]. This, combined with the 0 exponent bit hypothesis, allows us to implement the inner expression with the one's complement L1 operator seen in Table 1. The last negation is obviously an L1 operator; thus, we have the L1 fast approximation of the hyperbolic tangent in (21):

$$\text{FastTanh}(x) = -(1 - 2 \cdot \text{FastSigmoid}(2 \cdot x)) \tag{21}$$

Finally, thanks to the antisymmetry of the tanh function, we can extend what we have done before to positive values. The following is a pseudo-code implementation:

    FastTanh(x) -> y
        x_n = x > 0 ? -x : x
        g = x > 0
        y_n = neg(compl1(twice(FastSigmoid(twice(x_n)))))
        y = g > 0 ? -y_n : y_n
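The FastTanh chain can be emulated on plain Python integers holding 8-bit Posits with *esbits* = 0. This is a sketch under explicit assumptions, not the cppPosit implementation: all helper names are illustrative, and `twice` uses a plain left shift only below 1/2 and a hypothetical piecewise rule elsewhere, chosen here so that the doubling stays exact for larger magnitudes. The output always lies in [−1, 1], so it can be read as *Y* · 2<sup>−(*N*−2)</sup>.

```python
import math


def twice(X: int, N: int) -> int:
    """Double a non-negative Posit (esbits = 0); piecewise rule assumed for this sketch."""
    one = 1 << (N - 2)
    if X < one >> 1:
        return X << 1                     # below 1/2: plain left shift
    if X < one:
        return X + (one >> 1)             # [1/2, 1) maps onto [1, 2)
    return (X >> 1) | one                 # x >= 1: the regime grows by one bit


def fast_sigmoid(X: int, N: int) -> int:
    return ((1 << (N - 2)) + (X >> 1)) >> 1


def compl1(X: int, N: int) -> int:
    return (1 << (N - 2)) - X


def fast_tanh(X: int, N: int) -> int:
    x_n = -abs(X)                         # work on the negative half, use antisymmetry
    doubled = -twice(-x_n, N)
    y_n = -compl1(twice(fast_sigmoid(doubled, N), N), N)
    return -y_n if X > 0 else y_n


N = 8
scale = 1 << (N - 2)
# (bit-string, real value) pairs for an 8-bit Posit with esbits = 0
for X, x in ((-96, -2.0), (-64, -1.0), (-32, -0.5), (0, 0.0), (32, 0.5), (64, 1.0), (96, 2.0)):
    print(f"x = {x:+.1f}  FastTanh = {fast_tanh(X, N) / scale:+.4f}  tanh = {math.tanh(x):+.4f}")
```

Only integer shifts, additions and sign changes appear in the chain, which is the property that makes the operator L1.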

As already described, tanh and sigmoid functions can be implemented in their fast approximated version. However, such S-shaped functions suffer from the well-known vanishing gradient problem [15]; for this reason, ReLU-like functions (e.g., ELU, Leaky-ReLU, and others) are preferable when dealing with a large number of layers in neural networks. As in Reference [15], the ReLU activation function is defined as in (22):

$$\text{ReLU}(\mathbf{x}) = \begin{cases} 0, \text{if } \mathbf{x} \le 0 \\ \mathbf{x} \text{ otherwise} \end{cases} \tag{22}$$

Its use is important in solving the vanishing gradient problem, having a non-flat shape towards positive infinity. However, when used with Posit numbers, this function can only cover [0, inf), ignoring the very dense region [−1, 0].

In order to provide a function with wider coverage and similar properties, we switch to the Extended Linear Unit (ELU) (23):

$$\text{ELU}(x) = \begin{cases} \alpha \cdot (e^{x} - 1), \text{ if } x \le 0 \\ x \text{ otherwise} \end{cases} \tag{23}$$

*Sensors* **2020**, *20*, 1515

This function is particularly interesting when *α* = 1 (24), covering the missing dense region from the ReLU one:

$$\text{ELU}(\mathbf{x}) = \begin{cases} e^{\mathbf{x}} - 1, \text{ if } \mathbf{x} \le \mathbf{0} \\ \mathbf{x} \text{ otherwise} \end{cases} \tag{24}$$

Figure 7 shows the difference in Posit ring region usage of ELU and ReLU functions. It is remarkable how the ELU function manages to cover all the high density regions of the Posit ring. Moreover, the ELU function brings interesting normalization properties across the neural network layers as proven in Reference [16]. This helps in keeping stable the range of variation of the weights of the DNN.

**Figure 7.** The Posit circle when the total number of bits is 5: The extended linear unit uses all the numbers in [−1, inf), while the ReLU function uses only the ones in [0, inf).

From Equation (24), we can build an L1 approximation exploiting the operators in Table 1. The *ELU*(*x*) behaviour for *x* > 0 is the identity function, which is trivially L1. The first step for negative *x* values is noticing that the ELU expression is similar to the reciprocal of the Sigmoid function (17). We can manipulate (17) as follows:

$$\text{Sigmoid}(-x) = \frac{1}{1 + e^{x}} \tag{25}$$

$$1/\text{Sigmoid}(-x) = 1 + e^{x} \tag{26}$$

$$1/(2 \cdot \text{Sigmoid}(-x)) = \frac{1 + e^{x}}{2} \tag{27}$$

$$1/(2 \cdot \text{Sigmoid}(-x)) - 1 = \frac{1 + e^{x}}{2} - 1 = \frac{e^{x} - 1}{2} \tag{28}$$

$$2 \cdot \left[ 1 / (2 \cdot \text{Sigmoid}(-x)) - 1 \right] = e^{x} - 1 \tag{29}$$

We need to prove that the steps involved are L1 operations for negative arguments *x*. The step in Equation (25) is always L1 for *esbits* = 0 thanks to the FastSigmoid approximation, and its result is always in [1/2, 1] ⊂ [0, 1]. The reciprocation in Equation (26) is always L1, and its output is in [1, 2]. The halving in Equation (27) is always L1 for *esbits* = 0, and its output is in [1/2, 1]. The step in Equation (28) is L1 since the previous step output is in the unitary range [0, 1], so the one's complement can be applied; the magnitude of its output is in [0, 1/2] ⊂ [0, 1]. Finally, the last step, the doubling in Equation (29), is L1 for *esbits* = 0. Expression (29) is exactly the ELU expression for negative values of the argument.
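The chain of manipulations from Equation (25) to Equation (29) can be verified numerically in ordinary floating point, independently of any Posit implementation. The snippet below is only a sanity check of the algebra; the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (math.exp(-x) + 1.0)


def elu_from_sigmoid(x: float) -> float:
    """Follow Equations (25)-(29) step by step for x <= 0."""
    s = sigmoid(-x)        # Equation (25)
    r = 1.0 / s            # Equation (26)
    h = r / 2.0            # Equation (27)
    d = h - 1.0            # Equation (28)
    return 2.0 * d         # Equation (29)


for x in (-4.0, -1.0, -0.25, 0.0):
    assert math.isclose(elu_from_sigmoid(x), math.expm1(x), abs_tol=1e-12)
    print(f"x = {x:+.2f}  chain = {elu_from_sigmoid(x):+.6f}  e^x - 1 = {math.expm1(x):+.6f}")
```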

A pseudo-code implementation of the FastELU using only L1 operations is shown below:

```
FastELU(x) -> y
   y_n = neg(twice(compl1(half(reciprocate(FastSigmoid(neg(x))))))) // x <= 0 branch, Equations (25)-(29)
   y = x > 0 ? x : y_n                                              // identity for positive arguments
```
Figure 8 shows the behaviour of the two functions when approximated with our approach.

**Figure 8.** Comparison between the exact and approximated versions of hyperbolic tangent (TANH) and extended linear unit (ELU).

#### **5. Implementation Results**

In this section, the performance of the proposed activation functions is analyzed, in both the exact and approximated versions, when they are used in the LeNet-5 neural network model [17]. As shown in Figure 9, the neural network is trained with the MNIST digit recognition benchmark [17] and the German Traffic Road Sign Benchmark (GTRSB) [18] datasets using the Float32 type. The performance metrics involved are the testing accuracy on said datasets and the mean sample inference time. The testing phase is executed converting the model to the Posit⟨*X*, *Y*⟩ type and to SoftFloat32 (a software implementation of floats). We used SoftFloats in order to ensure a fair comparison between the two software implementations, due to the absence of proper hardware support for the Posit type.

[Figure 9 flowchart: network training using formats with a high number of bits (e.g., Float32 or Posit⟨16, 0⟩) → conversion of the model to formats with a lower number of bits (e.g., Posit⟨10, 0⟩ or Posit⟨8, 0⟩) → evaluation of accuracy and timing performance changes.]

**Figure 9.** Flowchart for the proposed method: models are trained using formats with high bit count like Float32 or, in the future, Posit⟨16, 0⟩. The models obtained this way are then converted to formats with lower bit count (e.g., Posit⟨8, 0⟩) to increase space efficiency and bandwidth.

Benchmarks are executed on a 7th generation Intel i7-7560U processor, running Ubuntu Linux 18.04, equipped with GCC 8.3. Benchmark data is publicly available in References [17]. The C++ source code can be downloaded from Reference [19].

As reported in Tables 6 and 7, the approximated hyperbolic tangent can replace the exact one, with a small degradation in accuracy but improving the inference time by about 2 ms in each Posit configuration. Moreover, FastTanh also outperforms FastSigmoid in terms of accuracy. Furthermore, as reported in Tables 8 and 9, the approximated ELU function can replace the exact one, with little-to-no accuracy degradation, improving the inference time by about 1 ms in each Posit configuration. Moreover, FastELU also outperforms ReLU in terms of accuracy, showing the benefits of covering the additional region in [−1, 0]. At the same time, FastELU is not much slower than ReLU, thus being an interesting replacement to increase the accuracy of Posits with few bits (e.g., Posit⟨8, 0⟩) without losing too much in time complexity.

**Table 6.** Comparison using Posits for the MNIST dataset for three different activation functions: fast approximated version of Tanh (FastTanh), exact Tanh, and FastSigmoid. Accuracy of the neural network and mean sample inference time are reported.


**Table 7.** Comparison using Posits for the GTRSB dataset (see Table 6).


**Table 8.** Comparison using Posits for the MNIST dataset for three different activation functions: fast approximated version of ELU (FastELU), exact ELU, and ReLU. Accuracy of the neural network and mean sample inference time are reported.



**Table 9.** Comparison using Posits for the GTRSB dataset (see Table 8).

If we compare FastELU and FastTanh, their performance is quite similar in the benchmarks provided. However, as already said in Section 4, increasing the number of layers in the neural network model can lead to the so-called "vanishing gradient" problem; S-shaped functions like sigmoid and hyperbolic tangent are prone to this phenomenon, whereas ReLU-like functions have been shown not to suffer from it.

The results highlight how Posits from Posit⟨16, 0⟩ to Posit⟨10, 0⟩ are a perfect replacement for float numbers; Posit⟨10, 0⟩ is a particularly interesting format since it offers the best data compression without any drop in accuracy. This reasonably makes Posit⟨10, 0⟩ the configuration of choice for low-precision inference when using Posits.

#### **6. Conclusions and Future Work**

In this work, we have introduced some interesting properties of the Posit format for the specific configuration having zero exponent bits (*esbits* = 0), which allow building fast arithmetic operators that only require ALU support. In particular, we have derived novel fast approximated versions of two important activation functions in neural networks: the hyperbolic tangent and the extended linear unit. These approximations are fast since they involve only bit manipulations (at the so-called "L1 level"). This means that such functions do not need to be implemented in hardware within the so-called Posit processing unit. Instead, they can be efficiently computed using the ALUs of most current CPUs. We have used these approximations to speed up the inference phase of deep neural networks. The proposed approximations have been tested on common deep neural network benchmarks. The use of these approximations resulted in a slightly less accurate neural network with respect to the use of the (slower) exact versions but with better performance in terms of mean sample inference time of the network. In our experiments, the FastTanh and FastELU functions also outperform both the ReLU and the FastSigmoid (a well-known approximation of the sigmoid function), a de facto standard activation function in neural networks. Future developments of the work will include porting the Posit format inside the Apollo Autonomous Driving Framework to test it in the assisted/autonomous driving scenario; this will allow us to test our approach in object detection and semantic segmentation tasks. We plan to implement a Field Programmable Gate Array (FPGA) based Posit Processing Unit (PPU) in order to evaluate the real-world hardware performance of our library. 
Furthermore, we are actively working to port the cppPosit library for the new RISC-V processor architecture; we plan to develop both a software-accelerated version using the vector extension of the RISC-V Instruction Set Architecture (ISA) and an intellectual property (IP) core for the RISC-V hardware architecture.

**Author Contributions:** The four authors have equally contributed to all the phases of this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is partially funded by H2020 European Processor Initiative (grant agreement No. 826647) and partially by the Italian Ministry of Education and Research (MIUR) in the framework of the CrossLab project (Departments of Excellence).

**Conflicts of Interest:** The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; and in the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
