1. Introduction
1.1. Background
Electronic devices are susceptible to damage from external ionizing radiation. These effects are traditionally classified into two groups, according to whether they emerge from the gradual degradation of the semiconductor properties due to the accumulated effects of multiple particles, or from a single ionizing particle impacting a particularly sensitive volume inside the material. The latter category is commonly referred to as Single Event Effects (SEE), and there exist design hardening strategies that designers can follow to protect electronic circuits against some of these effects.
SEEs cause anomalies in the behavior of electronic systems [1], which may lead to catastrophic consequences, especially in applications where a high level of reliability and security is required. These effects can be classified as destructive, when they cause permanent damage to the device, such as Single Event Latchup (SEL) or Single Event Burnout (SEB), or non-destructive, when they only affect the expected behavior of the device without physically destroying it [2], like Single Event Transient (SET) or Single Event Upset (SEU).
1.2. Problem of Interest
In the space sector, which is the most affected by these radiation effects, the use of SRAM (Static Random Access Memory) COTS (Commercial Off-The-Shelf) FPGAs (Field Programmable Gate Arrays) is becoming increasingly important [3,4]. These FPGAs are not necessarily hardened against SEUs, so a mitigation strategy must be applied to deal with these logic effects when deploying such high-performance FPGAs in missions that require high reliability. Many techniques can be used to mitigate these effects, such as those in [
5,
6,
7]. A compilation of techniques that can be applied in order to mitigate radiation effects, organized by different design stages and abstraction levels, can be found in [
8].
The most common mitigation techniques use spatial redundancy, which is also known as hardware redundancy. These techniques involve adding more hardware, thus consuming more resources, in order to detect discrepancies between replicated elements.
Dual Modular Redundancy (DMR) consists of duplicating hardware elements to allow for the detection, but not correction, of faults. Triple Modular Redundancy (TMR) triplicates the hardware elements and inserts a majority voter circuit to choose the correct output in case of a discrepancy between the replicated elements. While these techniques can be applied either to small elements [6] or complete modules, DMR is usually applied to full modules.
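As an illustration (a minimal sketch, not code from any cited work; the function name is ours), the bitwise majority function at the heart of a TMR voter can be expressed as follows:

```python
# Bitwise majority voter, the core of TMR: each output bit takes the value
# present in at least two of the three replicas.
def tmr_vote(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)

# A fault in one replica is outvoted by the other two:
assert tmr_vote(0b1010, 0b1010, 0b0011) == 0b1010
```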
Error Detection and Correction (EDAC) algorithms allow one to detect and correct errors in data words without triplicating the information, and are typically used in memories. Hamming codes are a commonly used family of error-correcting codes for this purpose.
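As an example of such a code, the following sketch implements a plain Hamming(7,4) encoder and single-error corrector (a textbook construction, not taken from the paper); the syndrome of a received word directly gives the position of a single flipped bit:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits (positions 1, 2, 4).
def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                    # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                    # covers codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                    # covers codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]  # codeword, positions 1..7

def hamming74_correct(c):
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3      # 0 means no single-bit error
    if syndrome:
        c[syndrome - 1] ^= 1             # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]      # recovered data bits

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                             # inject a single bit error (an SEU)
assert hamming74_correct(word) == [1, 0, 1, 1]
```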
In order to apply these techniques properly, a key recommendation is to carry out a study in the early phases of a design to determine the behavior of the electronic devices against these radiation effects. Such a study can tell the designer which elements of their design are most likely to propagate erroneous values to the circuit outputs, producing failures, when corrupted by these effects. The most sensitive design elements can then be hardened to optimally achieve the reliability required for a specific mission. In this context, fault injectors are a useful tool to study the behavior of electronic systems against SEEs.
1.3. Literature Survey
There are many types of fault injectors that can be found in the literature. They can be classified into five main categories according to the type of injection technique used [
9]. These are: Hardware-based fault injection, software-based fault injection, FPGA-based fault injection, simulation-based fault injection, and hybrid fault injection.
1.3.1. Hardware-Based Fault Injection
In the first category, the device under test is physically attacked by external sources, such as a laser [
10] or by injecting current through the pins of the device with active probes, as in [11]. These types of injectors may be destructive to the device, which can push the project cost beyond its initial budget. On the other hand, some of the advantages of these techniques are the wide range of possible injection locations compared with other techniques, and the accuracy of the results obtained, since real hardware and software are being used.
1.3.2. Software-Based Fault Injection
Software-based fault injection techniques use software to insert the faults in the DUT (Design Under Test) [
12,
13]. This technique has the benefit of being portable, allowing its use on many platforms without damaging the DUT, but it has two main drawbacks: It can only be used in microprocessor designs, and it cannot access the entire device to insert the faults, only the registers that are accessible through the microprocessor's ISA (Instruction Set Architecture). Furthermore, the technique is invasive, because the software code has to be instrumented in order to inject the faults.
1.3.3. FPGA-Based Fault Injection
Instrumentation techniques can also be applied to HDL code, leading to instrumented FPGA-based fault injection. The main criticism this technique receives is that the circuit being tested is not the same as the one intended to be deployed in the final mission application, since the VHDL or Verilog code has to be modified in order to perform the fault injection. In critical applications, invasive techniques carry the risk of masking functional failures due to the changes added to the DUT. However, these techniques require neither specific hardware nor hidden knowledge of the internal mechanisms of the chosen FPGA, and thus can be applied to many commercial development kits. An example of these techniques can be found in [
14].
Some SRAM-based FPGA families include internal circuitry that can be used to read and write internal circuit values, which allows injecting faults in an FPGA design without instrumenting the HDL code. This technique is called non-instrumented FPGA-based fault injection. Traditionally, researchers have developed their own techniques based on limited documentation and reverse engineering in order to inject faults using the internal FPGA circuitry, when the observing and controlling capabilities are implemented in the silicon [
15]. The least invasive way of performing this is to have a dedicated chip to perform fault injection, input/output vector control, and campaign execution, leaving the full target FPGA to host the user design [
16].
Due to the increasing popularity of fault injection techniques, some FPGA vendors now provide IP (Intellectual Property) cores to perform SEU injection [
17,
18]. The use of these SEU injection IP cores is less invasive than instrumenting the complete HDL design, but nevertheless requires some changes to the DUT, at least to instantiate the required IP cores and add some kind of control logic to manage the tests. We could call this technique minimally-instrumented fault injection. An example of the application of this technique can be found in [
19] where some debugging facilities from Altera FPGAs are used to inject faults in the device under test.
1.3.4. Simulation-Based Fault Injection
Simulation-based injectors have the benefit of being flexible and inexpensive tools. They use a simulation model of the DUT, which can be described in a hardware description language such as VHDL. This technique allows full control of the injection mechanisms, as can be seen in [
20].
A good review of the different techniques that can be used to perform simulation-based fault injection in VHDL can be found in [
21]. According to this reference, there are three possible techniques that can be used:
Simulator commands: This is the simulation equivalent of the non-instrumented fault injection technique. When simulator commands can be used to inject the faults, there is no need to instrument the VHDL design, which avoids the aforementioned issues related to design instrumentation. Depending on the fault model used, the required simulator command sequence may vary. The main drawback of this technique is that not all simulators support these commands. A second drawback is that, depending on how the faults are injected, the technique can be fairly demanding to implement. For example, using interactive commands is fairly easy, but implementing a complete solution that uses the Verilog Procedural Interface of a simulator can be very complex, because different simulation objects (such as signals, ports, or variables) may be accessed in different ways [
22].
Saboteurs: This technique consists of adding VHDL components that modify the characteristics of signals that interconnect the VHDL modules of the design under test. This way, values and timing characteristics of these interconnection signals can be altered during the simulation. The main drawback of this technique is that the circuit has to be instrumented; on the other hand, it can be applied using any VHDL simulator. Since the saboteurs must not interfere with the normal operation of the circuit, a number of control and selection signals must be added to the design and managed, either through simulator commands or through extra design inputs.
Mutants: The mutants technique is similar to the saboteurs technique in the sense that the VHDL design is instrumented, but in this case, design components are replaced by mutant components. These mutants operate like the original component in the absence of faults, but one or more parts of their functionality are altered when activated. The VHDL configuration keyword allows one to select, for each component, either its original architecture or one of a set of mutant architectures. In order to change the configuration of a component, the architecture-to-component binding (meaning which architecture a specific component will have) and the new configuration must be recompiled, but since this is a partial compilation, there is no need to recompile the complete design. Since this technique does not add new components and instead just changes the architecture of already existing design components, in the absence of mutations the obtained design is equal to the original design.
While every researcher or engineer may have their own preference, it must be noted that these techniques are in no way exclusive, as more than one of them can be applied at the same time, for example by including both mutants and saboteurs in the same instrumented design.
1.3.5. Hybrid Fault Injection
The last category of injectors uses a combination of the aforementioned techniques to improve the overall injection capabilities [
23].
1.4. Scope and Contribution of This Paper
This paper presents a virtual device to perform simulation-based fault injection using open source tools. The virtual device is fully compatible with the software used for an existing FPGA-based fault injection platform, and it also allows one to verify both the software and firmware parts of this fault injection platform. As mentioned in the previous section, simulation-based injectors have the advantage of being inexpensive and portable tools, unlike the other techniques described above.
One of the differences between previously mentioned simulation-based techniques, such as [20], and the proposed approach is the use of open source tools, which allows more flexibility in the use and applications of the proposed approach. The proposed approach uses the GHDL and cocotb tools to simulate the design and perform the fault injection, thus allowing the user free and complete use of the capabilities of these tools, for example running multiple instances of the device without any licensing limitations in order to perform multiple injections in parallel.
Another contribution of the proposed approach is full compatibility with the software that manages the real hardware: although simulating the test shell that goes in the service FPGA makes the approach slower than other simulation-based alternatives, in return this compatibility is guaranteed. This allows the tool to be used as a debugger for the firmware that goes in the actual hardware of the FPGA-based fault injection platform. Furthermore, this compatibility allows one to combine any number of physical and virtual devices in order to accelerate the execution of the fault injection campaigns.
1.5. Organization of the Paper
The paper is structured as follows:
Section 2 describes the architecture of the proposed approach. In
Section 3, experimental results of the fault injection campaigns are obtained using the virtual device, and the advantages and disadvantages of the approach are discussed. Finally, the conclusions and future work are presented in
Section 4.
2. Virtual Device Architecture
The virtual device (also shortened as vdev) is a VHDL model of the firmware architecture of an FPGA-based fault injection platform known as FTU-VEGAS. This platform has been developed by Universidad de Sevilla within the European H2020 project VEGAS (Validation of high capacity rad-hard FPGA and software tools). The system has two FPGAs: one to perform the injections, manage the command set, and compare faulty outputs with the golden outputs (outputs without injections), and another that hosts the Design Under Test (DUT). The software part of the system is named tntsh (Test aNalysis Tools shell). This software operates the hardware by sending the necessary commands and data to perform the injection campaigns, and also receives and stores the results. The architecture proposed for the vdev is shown in
Figure 1. The vdev communicates with the same software (tntsh) used for the physical hardware through a pair of pipes. The virtual device follows a modular architecture where each module communicates with another through a pair of streams. The functionality and architecture of each module are presented in the following subsections, starting from the top level.
2.1. Top Level Module
This is the top level of the virtual device. This module receives the commands from tntsh and returns the requested data to it through a pair of input/output pipes.
2.2. SRAM Simulation Model
This is a model of the R1WV6416R SRAM device from RENESAS, used to store the input/output vectors of a fault injection campaign in an internal format called wave. For each clock cycle, the wave contains a bit array with the concatenated inputs for the DUT and the corresponding bit array with the concatenated outputs; it is stored in the SRAM using a simple compression scheme.
2.3. Test Design Simulation Model
This module instantiates the design under test.
2.4. Core Module
This module is responsible for accepting instructions from the tntsh software and returning the appropriate values. It instantiates the command interpreter, which handles the commands received from the software and also interfaces with the vectors and configuration modules. The data interchange between the command interpreter and the rest of the modules is managed by stream modules.
2.5. Stream Module
The stream is the interface used for interchanging data between modules. It is composed of an encoder, a FIFO (First In First Out) memory, and a decoder. Every main module inside the virtual device uses an input stream to request input data and an output stream to provide output data. The stream has been designed to simplify the exchange of multiple data items of different widths between modules. Internal data inside the virtual device may have different widths: for example, an injection address may be a 32-bit value and a time value in cycles may be a 64-bit value, while a command is always an 8-bit value. In addition, in order to save SRAM memory space, the width of the input and output vectors depends on the characteristics of each design under test. This module thus needs to manage a stream of data items of different sizes, each between 1 and 8 bytes.
Figure 2 shows the architecture of the stream. The process for module A to send a single multi-byte data to module B is as follows:
A writes in the stream:
A waits for the write side of the stream to be ready (wr_ready active). If the write side is not ready, wr_op must be set to zero;
A sets wr_op to the number of bytes that need to be written (N), while at the same time setting wr_data(N*8-1 downto 0) to the value of the data word to write.
B reads from the stream:
B waits for the read side of the stream to be ready (rd_ready active). If the read side is not ready, rd_op must be set to zero;
B sets rd_op to the number of bytes that need to be read (N);
Starting from the next clock cycle, when rd_ready is asserted again, rd_data is valid, from which B reads the least significant N bytes.
It must be noted that B can ask for data before A writes anything into the stream; this will not cause any issue, since the stream will not assert rd_ready until it has enough data, effectively making B wait until A writes the requested data.
When bidirectional communication is needed, for example when a module sends commands to a submodule and reads the responses to these commands, two streams can be instantiated, one for each data direction.
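The following Python sketch models the framing behavior just described (the byte order and the blocking condition are our assumptions; the real stream is a VHDL encoder/FIFO/decoder chain):

```python
# Behavioral model of the stream's variable-width framing: a data word of
# N bytes (1 <= N <= 8) is split into bytes on the write side and
# reassembled on the read side.
from collections import deque

fifo = deque()  # stands in for the FIFO between encoder and decoder

def stream_write(value: int, n_bytes: int) -> None:
    """Encoder side: push the N least significant bytes, LSB first."""
    for i in range(n_bytes):
        fifo.append((value >> (8 * i)) & 0xFF)

def stream_read(n_bytes: int) -> int:
    """Decoder side: pop N bytes and rebuild the word. In hardware,
    rd_ready stays low until enough bytes are present."""
    assert len(fifo) >= n_bytes, "rd_ready would stay low"
    value = 0
    for i in range(n_bytes):
        value |= fifo.popleft() << (8 * i)
    return value

stream_write(0xDEADBEEF, 4)  # e.g., a 32-bit injection address
stream_write(0x2A, 1)        # e.g., an 8-bit command
assert stream_read(4) == 0xDEADBEEF and stream_read(1) == 0x2A
```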
2.6. Config Module
The configuration module is responsible for reading and writing the configuration of the simulated target FPGA. It can configure the FPGA with a valid bitstream, and perform operations on individual configuration bits, such as bit flips. It interfaces with a model of the target FPGA (the configuration interface simulation model), which interchanges data with the logic of the configuration module using the APB (Advanced Peripheral Bus) protocol.
2.7. Vectors Module
This module is responsible for handling the input and output vectors, both the golden data and the experiment results, as well as handling the emulation clock. It includes the SRAM controller, used to store and manage the input/output vectors in the SRAM, and the event queue.
The Event Queue
Every time a special condition is detected during the fault injection (such as a discrepancy with the golden outputs or the end of the vectors), an event is raised. When an event is raised, a new item is added to the event queue, with each item containing two fields: A 1-byte mask of all the events raised, and an 8-byte value containing the cycle in which these events were raised.
The event queue allows one to configure flags for the fault injection campaign that can alter the course of the test depending on what happens during the experiment, without continuously communicating with the software. For example, a run (a complete execution of the test vectors with zero, one, or more injections) can be stopped after detecting damage, without simulating the rest of the clock cycles. The complete event queue can be read by the software after a single run, reducing communication overhead.
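A possible packed representation of one queue item is sketched below (the field order, endianness, and mask bit assignments are illustrative assumptions, not the documented vdev layout):

```python
# One event queue item: a 1-byte mask of raised events plus an 8-byte
# cycle counter, packed into 9 bytes.
import struct

EVT_DAMAGE = 0x01          # hypothetical mask bit: discrepancy detected
EVT_END_OF_VECTORS = 0x02  # hypothetical mask bit: vectors exhausted

def pack_event(mask: int, cycle: int) -> bytes:
    return struct.pack("<BQ", mask, cycle)

item = pack_event(EVT_DAMAGE | EVT_END_OF_VECTORS, 123_456)
mask, cycle = struct.unpack("<BQ", item)  # read back by the software
```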
2.8. Flow Process and Required Files
The process of injecting faults is carried out using cocotb [24]. Cocotb is a cosimulation testbench environment written in Python. It is an open-source tool that can be used on multiple operating systems. The GHDL open-source simulator is also used to compile the VHDL source code of the vdev and obtain the executable file. Cocotb accesses the values inside the simulation using the simulator's VPI (Verilog Procedural Interface). A test shell wrapper for the GHDL executable has been written in Python so that cocotb can be used to access the internal simulation values. With respect to the classification described in
Section 1.3.4, we can consider this technique inside the ‘simulator commands’ category.
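A minimal cocotb test in this style might look as follows (the entity and signal names are hypothetical, and the paper's actual injection logic is more elaborate; with the cocotb version used in the paper, cocotb.fork would be used instead of cocotb.start_soon):

```python
# Sketch of 'simulator commands' style injection with cocotb: internal
# values are reached through the simulator's VPI, with no HDL changes.
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge

@cocotb.test()
async def flip_one_register(dut):
    cocotb.start_soon(Clock(dut.clk, 10, units="ns").start())
    await RisingEdge(dut.clk)
    # Walk the design hierarchy and deposit the complemented value
    # (a bit flip) on an internal register.
    reg = dut.core_i.counter_reg  # hypothetical hierarchical path
    reg.value = int(reg.value) ^ 1
    await RisingEdge(dut.clk)
```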
The tntsh software can be used interactively, since it provides a TCL (Tool Command Language) shell, but it can also be used in batch mode by providing .tcl scripts with the commands to execute. Makefiles can then be used to automate the execution of multiple fault injection campaigns, using one or multiple instances of the vdev, as sketched below.
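For instance, several campaigns could be launched in parallel with a small script along these lines (a sketch under the assumption that tntsh accepts a batch script as a tclsh-style argument; the script names are hypothetical):

```python
# Launch one tntsh batch instance per campaign script; each instance
# drives its own virtual device through its own pair of pipes.
import subprocess

scripts = ["campaign_a.tcl", "campaign_b.tcl",
           "campaign_c.tcl", "campaign_d.tcl"]

procs = [subprocess.Popen(["tntsh", script]) for script in scripts]

# Wait for every campaign to finish and report its exit status.
for script, proc in zip(scripts, procs):
    proc.wait()
    print(f"{script}: exit code {proc.returncode}")
```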
The following files are needed to perform a fault injection campaign:
pin file:
Contains the inputs, outputs, and clock pin signal of the DUT. This file must be written by the user, using a very simple format to indicate signal names, directions, and widths;
nxb file:
The configuration bitstream of the DUT. Generated by the NXmap FPGA vendor tool. While this file is obviously required when using the real hardware, a dummy nxb can be used when injecting faults with the vdev;
vcd file:
A value change dump with the recorded input/output vectors obtained by simulating the design. The user must generate this file using their own testbench with any simulator that supports the generation of VCD files;
ctxt file:
A logic location file that shows the position of the user logic inside the bitstream. Generated by the NXmap tool. This file can also be replaced with a dummy file when using the vdev, removing the need for the proprietary NXmap tool;
ctt file:
This file relates the register names inside the ctxt file with the hierarchical signal names inside the vdev. This file is generated semi-automatically by processing the ctxt file;
test.tcl:
A tcl script with the commands to be executed to perform a fault injection. It loads the configuration, the vectors, and the location files and selects the options desired to perform the campaign. These are the same commands that are supported by the real hardware. The tcl file is not strictly necessary, since the commands can be entered interactively in the tntsh shell.
It must be noted that the nxb and ctxt files are equivalent to bitstream and logic location files generated by software from other vendors, such as Xilinx. The specific nxb and ctxt files are used here so both the software and virtual device remain compatible with the hardware of the fault injection platform, but other file formats could be supported.
The tntsh software and the virtual device support multiple injection campaign options that can be selected to customize the fault injection experiments, such as the injector function and the injection mask.
2.9. Technical Requirements for the Virtual Device
While there are no specific hardware requirements for the use of the virtual device, it needs the following software:
The GHDL simulator, which is available both for Linux and Microsoft Windows;
A mechanism to create unix pipes or pipes that behave as such;
The tntsh software, which in turn requires a compiler that supports the C++17 revision of the C++ standard (such as gcc or clang), and some libraries readily available in most modern GNU/Linux systems;
The cocotb coroutine simulation framework, which in turn requires Python 3.
The experimental results for this paper were obtained on an Acer EX2540 series computer, model NX.EFHEB.2002, with 8 GB of RAM and an Intel Core i5 7200U processor, running the Debian GNU/Linux operating system, version 10.
3. Experimental Results
A number of DUTs of increasing complexity have been chosen to perform fault injection with the vdev. The selected designs are described below:
counter:
Implements an 8-bit counter;
adder acum:
This design accumulates the value of an 8-bit input vector in a 20-bit vector;
shiftreg:
Implements an 8-bit shift register;
b13:
An interface to meteo sensors [
25];
FIFO:
A simple 32-bit FIFO memory [
26];
pcm:
An Integrated Interchip Sound (IIS) interface for the PCM3168 codec [
27].
For each DUT, a campaign using a Gaussian injector and no injection masks was performed. The Gaussian injector was configured so that a single SEU was injected in each run. Each campaign performed a total of 1000 injections among the list of candidates, which may or may not have propagated to the primary outputs, causing output damage. For each design, the Architectural Vulnerability Factor (AVF), which is the percentage of damages obtained over the total number of injections performed [
28], was calculated and is presented in
Table 1,
Table 2,
Table 3,
Table 4,
Table 5 and
Table 6. Note that registers that do not produce output errors when injected are not shown in the tables, but they are included in the global AVF calculation of each design. Designs can sometimes break when adapting them to a fault injection platform if the process is not done with special care. To check the correct functionality of the designs before performing a campaign, an emuvssim test was performed for each design. Emuvssim tests check that the emulation in the virtual device and the simulation with a testbench provided by the user match, by comparing both waveforms. This proves that the virtual device does not break functionality and also helps to verify the firmware/software of the fault injection platform. All designs passed the emuvssim tests.
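In these terms, the AVF reported in the tables can be written as:

$$\mathrm{AVF} = \frac{N_{\mathrm{damage}}}{N_{\mathrm{injections}}} \times 100\%$$

where $N_{\mathrm{damage}}$ is the number of injections that produced output damage and $N_{\mathrm{injections}}$ is the total number of injections performed (1000 per campaign here).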
3.1. Campaign Execution Times
To save execution time, multiple instances of the virtual device can be launched in parallel. The maximum number of instances depends on the processing capacity available to the user, thus allowing faster execution than with proprietary tools when enough processing capability is available.
Table 7 shows the percentage of execution speed improvement when using two, four, and eight devices:
3.2. Discussion
The main disadvantage of the proposed approach is that it is slower than FPGA-based fault injection. On the other hand, one of its advantages is that it does not require any hardware devices.
An advantage that can mitigate the previous disadvantage is that the fault injection campaigns can be parallelized by instantiating multiple virtual devices, up to the available computing capacity. By using open source tools, this approach does not have any arbitrary license-based limitations on the number of devices that can be executed at the same time.
Another advantage of the approach is the compatibility with the same software that is used with the real hardware. The cost of having this compatibility is the increase of complexity in the virtual device.
The current version of the virtual device requires recompiling the VHDL sources whenever a new design is added. In order to make the device available to more users and simplify design preparation, it would be desirable to separate these two compilation steps. For example, an object file of the virtual device could be provided for the user to link against the object files of their design under test.
Another issue to consider is that special care must be taken with the timing of the force and release actions on the signal where the fault is being injected. If the forced signal is released before the active clock edge of the design under test, the fault may be erased before being captured by the design. However, if it is released after the active clock edge, there is a risk that the fault does not propagate correctly (for example, if the fault crosses some logic cones and propagates back to the input of the same flip-flop) or that it remains for longer than needed, which would result in an incorrect SEU model. This was solved by using the trigger methods provided by cocotb to synchronize with the clock signal of the design under test, as in the sketch below.
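A minimal sketch of this synchronization, using cocotb's Force and Release together with the RisingEdge trigger (the helper name and its arguments are ours; depending on the simulator's VPI support, a plain value deposit may be used instead of Force):

```python
# Hold a flipped value on `target` for exactly one active clock edge,
# so the SEU is captured once and then the design drives the signal again.
from cocotb.handle import Force, Release
from cocotb.triggers import RisingEdge

async def inject_seu(target, clk):
    await RisingEdge(clk)            # synchronize to the DUT clock
    flipped = int(target.value) ^ 1  # complement the current value
    target.value = Force(flipped)    # force the faulty value...
    await RisingEdge(clk)            # ...across exactly one active edge
    target.value = Release()         # release; the design drives again
```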
A current limitation of this approach is that the latest version of cocotb (1.5.dev0) cannot access signals inside a record with the latest version of GHDL (1.0-dev/v0.37.0), which may impact the fault coverage of complex designs that use these datatypes. This is a known limitation of the simulator and is currently an open issue in the GHDL issue tracker, so it can be expected to be fixed in the future. A possible workaround until this limitation is removed is to inject the faults into a post-synthesis version of the design under test. To achieve this, the design under test must be synthesized, and afterwards a netlist of the synthesized design must be generated in VHDL format. For example, the netgen tool from Xilinx allows one to do this.
4. Conclusions and Future Work
A virtual device to perform fault injection by simulation was designed, developed, and demonstrated. This virtual device is also a model of the firmware of the FTU-VEGAS emulation-based fault injector, extended with fault injection capabilities, and is fully compatible with the software that communicates with the real hardware. A set of injection campaigns on designs of increasing complexity was performed, and the resulting Architectural Vulnerability Factors were presented. The emulation-versus-simulation tests demonstrated that the virtual device does not disturb the correct functionality of the test designs in the absence of injected faults. The virtual device can be fully compiled and used with only free and open source software, avoiding the use of expensive proprietary simulators, and campaigns can be parallelized by running multiple instances of the virtual device without any restrictions, which helps to bridge the speed gap with respect to FPGA-based solutions when running the tests on powerful servers with multiple processor cores.
Future work will include comparing the results obtained with the virtual device against those obtained using the real hardware when it becomes available, and decoupling the HDL compilation of the virtual device from that of the DUTs, so that binaries of the virtual device can be distributed and linked to, or co-simulated with, a third party's confidential designs.