1. Introduction
Electronic computing is evolving at a pace where chip technology reaches high densities whilst incorporating more functionalities and new technologies into their software, such as kernels, device drivers and large data processing applications. Consequently, the higher densities increase chip susceptibility to ionizing radiation [
1], increasing soft error rates. Propagation of such errors into complex software stacks may cause system instability, compromising safety, security, reliability and performance.
Concerns for these metrics are most important on safety or mission-critical applications, for instance the aerospace, automotive, and defence industries, leading them to adhere to strict safety and reliability requirements, defined by specific standards such as the ISO 26262 Road vehicles Functional Safety for the automotive sector [
2]. Since these systems directly relate to safety, it is vital to prove that they implement the correct functionality in time and with a sufficient level of reliability, even in the presence of soft errors and/or faults.
Fault mitigation comes in the form of redundancy, either hardware- or software-implemented [
3], by replicating components that mask faults and prevent them from propagating. From a safety standpoint, either technique can be employed, but from a cost point of view, software-implemented techniques are considered preferable, since safety-oriented hardware components typically cost more than the corresponding software development [
4]. On high volume production areas with Size, Weight, Power and Cost (SWaP-C) constraints, such as automotive, Software-Implemented Hardware Fault Tolerance (SIHFT) techniques may be preferred over replicating hardware components of the Electronic Control Units (ECUs).
However, assessing the correctness of these techniques may become a difficult task, as hardware-based tests become destructive and lack repeatability, and software-only based tests lack observability and traceability. Hence, researchers and market leaders have adopted alternatives in the form of virtual simulation platforms to avoid the consequences of destroying hardware at the cost of modelling the system in a virtual environment.
In the automotive industry, the ISO 26262 standard stipulates that the role of simulation is critical in validating system behaviour, and recommends simulation at all development phases. Furthermore, it advises the use of Fault Injection (FI) testing to not only evaluate the hardware architectural metrics, but also fault metrics, such as diagnostic coverage (DC) of the safety mechanisms (SM) [
2]. To aid with this assessment, the safety standard released part 11, which provides failure modes that support the assessment of the safety mechanisms.
With that in mind, we propose an open-source tool, namely QEFIRA, that helps developers to assess fault mitigation techniques with failure modes supported by the ISO 26262 standard, part 11. It is based on the open-source QEMU emulator with modifications for executing runtime Fault Injection campaigns. The test bench performs architectural emulation of the platform, Fault Injection during runtime, result logging, and classification of fault runs.
The paper is organised as follows.
Section 2 presents the state-of-the-art and related work regarding virtual platforms within the scope of Fault Injection and safety standards.
Section 3 describes the main features and benchmarks of the proposed QEFIRA tool.
Section 4 exemplifies how QEFIRA can be used to apply ISO 26262-compliant Fault Injection to digital components. Lastly,
Section 5 presents the final remarks and future work.
2. State of the Art
This section presents basic concepts and terminologies related to virtual platforms and Fault Injection. Furthermore, it briefly introduces the ISO 26262 standard and contextualizes Fault Injection within the standard.
2.1. Basic Concepts of Fault Injection
Following the nomenclature provided by Dubrova et al. [
3], a
fault is a physical defect, imperfection, or flaw that occurs in some hardware or software component. Resulting from a fault, an
error is a deviation from the expected computational value. A single error or multiple ones can lead to a system
failure, which translates into severe system degradation. Fault injection is a technique that aims to apply faults directly on hardware, software, or architectural models, to test and assess the effectiveness of fault tolerance or safety mechanisms. In contrast to analytical methods, it experimentally observes the system behaviour when faults are deliberately injected. In safety-critical applications, such as the automotive industry, FI has become a de facto practice to improve safe design and avoid the costs associated with untested safety-critical software [
2].
Depending on the FI method, injection strategies can be mainly classified into
hardware-based,
software-based and
simulation-based [
5]. A hardware-based strategy is performed at physical level, disturbing the physical components with parameters of the environment (heavy ion radiation, electromagnetic interference, etc.), voltage glitching on power rails, or modifying the value of the pins of the circuit. This type of technique requires specialised hardware setups, and only FI via test access ports can be achieved with COTS hardware while retaining the repeatability and controllability necessary for detailed post-injection analysis. Works based on this approach include injection frameworks such as RIFLE [
6], FIST [
7], and MESSALINE [
8].
Software-implemented FI (SWIFI) uses the actual target running the application software with additional injection procedures that modify the contents of registers and memory elements to emulate the effect of real-world hardware faults. It offers less destructiveness than hardware FI, but high intrusiveness and low reachability, as it can only reach internal processor states. Notable SWIFI tools include FERRARI [
9], XCEPTION [
10], and DOCTOR [
11].
Lastly, Simulation-based Fault Injection (SFI) applies faults into a system or hardware model. The injection of faults can be performed by modifying either the state of the hardware components (e.g., flip-flops), the state of the architectural resources (e.g., register file), or the state of the software structures (e.g., variables). This technique offers virtually maximum reachability and traceability, but the model details should be accurate enough for meaningful simulations. Some SFI approaches base the Fault Injection strategy on cycle-accurate models implemented by means of Hardware Description Languages (HDLs) at the Register Transfer Level (RTL), while others use instruction-accurate models of processors and software modules to emulate hardware behaviour.
FI techniques based on HDLs, such as the MEFISTO [
12] and VERIFY [
13] tools, use faulty signals connected to VHDL models to provoke system failures, while the authors of [
14,
15] use Verilog for the same purpose. This approach provides a high degree of controllability and reachability, but its main drawbacks are the large development effort required to model the simulator and poor simulation performance, which degrades as the accuracy of the target model increases. Also, traditional event-driven (gate-level) or cycle-accurate (RTL) simulation is typically orders of magnitude slower than real hardware [
16].
Therefore, higher levels of abstraction are preferred if the underlying simulators are fast and do not compromise the accuracy of the simulation. The fastest solutions are represented by purely functional simulators that can almost reach the speed of the simulated hardware. However, simulating low-level faults could be very misleading when the simulation is only functional. Following these considerations, approaches based on instruction-accurate simulators, which rely on fast virtual platform systems that can perform simulation at a higher level of abstraction, such as at micro-architectural level, seem to be preferable to RTL-based simulators. At this level of abstraction, several studies have been based on simulators such as GEM5 [
17,
18] and QEMU [
16,
19,
20,
21,
22,
23]. Since this type of simulator is a significant part of this manuscript, the next section provides a more in-depth review of the current state of the art of this type of simulation.
2.2. Fault Injection in Virtual Platforms
Micro-architectural FI tools aim to allow designers to emulate faults at processor state level and verify the efficiency of fault tolerance solutions with low overhead, high repeatability and reachability, and low intrusiveness. Although these are desirable characteristics, they depend heavily on the simulator used and on how much the tool internals need to be modified to achieve meaningful injections. The following section provides an overview of tools that were extended with these characteristics in mind, focusing on the GEM5 and QEMU simulators for Fault Injection campaigns.
The authors of [
17] propose a framework, supported by GEM5 and M*DEV, that allows the assessment, flaw identification, hardening, and profiling of the resilience of software architectures against soft errors. The Fault Injection module supports single-bit upset (SBU) and multiple-bit upset (MBU) faults injected in registers and memory addresses during runtime. The fault occurrence, location and injection time are assigned by a random uniform function. The framework also provides an analysis module that retrieves fault campaign results, provides analytics about code execution, and classifies fault run results.
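The random uniform assignment of fault location, bit position and injection time described above can be illustrated with a short Python sketch. It is not the cited framework's code: the names `FaultSpec` and `sample_fault`, the 32-bit width and the nanosecond time base are assumptions made only for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultSpec:
    """Hypothetical description of one injected upset."""
    location: str   # register name or memory address rendered as text
    bitmask: int    # bits to flip
    time_ns: int    # injection time in simulated nanoseconds


def sample_fault(locations, width_bits, duration_ns, n_bits=1, rng=None):
    """Draw location, bit positions and time from uniform distributions.

    n_bits=1 models a single-bit upset (SBU); n_bits>1 models a
    multiple-bit upset (MBU) with distinct bit positions.
    """
    rng = rng or random.Random()
    mask = 0
    for bit in rng.sample(range(width_bits), n_bits):
        mask |= 1 << bit
    return FaultSpec(rng.choice(locations), mask, rng.randrange(duration_ns))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a campaign can be repeated
    print(sample_fault(["R0", "R1", "R2"], 32, 10_000_000, n_bits=1, rng=rng))
    print(sample_fault(["0x20000000"], 32, 10_000_000, n_bits=3, rng=rng))
```

Seeding the generator keeps a campaign repeatable, which matters when a fault run has to be reproduced for analysis.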
The authors of [
18] propose a simulator-agnostic framework, validated using GEM5 and QEMU, to assess the efficacy of SIHFT mechanisms. The fault models are user-defined by a runtime abstraction and need to be implemented for each simulator. Reachability and fault models depend on the verboseness of the controlled back-end system. Injection is made directly during the simulation execution loop, but can also be performed by test-port, such as JTAG probes. Furthermore, the framework performs post-injection analysis in the form of mapping failure to code.
Using QEMU, the authors of [
19] presented QEFI, a framework which aims to assess system behaviour, focusing on simulating hardware faults and testing software reactions to them. It was designed for supporting the ARM architecture for both system-wide and kernel-based FI. Fault models include permanent faults in CPU, RAM, system peripherals and
bit-flips in memory. Injection triggers are made by Program Counter (PC) value, injection by user-defined probability or by external application triggering. The internal source code was modified, mainly the Tiny Code Generator (TCG), to inject faults during runtime. In addition to runtime injection, faults can also be injected while debugging with GDB protocol. The protocol has been enhanced to support not only breakpoints and watch points, but also injection points. Deeper into QEMU, the authors of [
20] proposed FIES, a framework focusing on the implementation of the IEC 61508 standard [
24] fault models. These include register faults in cells and address decoding, faults in CPU instructions and condition flags, and faults in DRAM memory cells and address decoding. Fault models include transient
bit-flips to simulate SEUs and permanent
stuck-at faults. Similarly to [
19], QEMU version 2.1 TCG was modified to allow injection within the simulation execution loop. Architecture support is limited to the ARM architecture, and faults are user-defined through an XML file, passed as argument upon simulation start. Extending the fault triggering capabilities of previous works, the authors of [
16] used QEMU to implement FI at register level. The focus was on both the register operands and the register file status bits, e.g., CPSR in the ARM architecture. The fault models include permanent
stuck-at, permanent transition, and both transient and intermittent
bit-flips. Compared with other works, it provides the most complete set of fault models, at the expense of supporting multiple fault locations. The work focused on the ARM and the x86 architectures.
Contrarily to both works previously mentioned, the authors of [
21] did not change the internal QEMU source code. Instead of modifying the TCG, the authors used the TCG plugin interface for code instrumentation, released with QEMU 4.2, to monitor execution and perform injection. This method allows the injection to be more architecture-agnostic than solutions that require TCG modifications. Supported fault models include transient and permanent faults in CPU instructions, RAM and registers. Injection is triggered on user-defined CPU instructions. Execution results, including the register contents throughout simulation, fault specification, memory dump and target-to-host translation info, are logged in raw format for post-analysis, if necessary. To validate the work, the authors executed a physical experiment using laser Fault Injection to cross-check its results against the simulation. They concluded that the modifications made for FI can be used to predict the timing and location of the fault candidates for a fault attack. Similar work was performed in FIG-QEMU [
22], where the authors adopted a GDB-based approach to evaluate the robustness of application software. It focused on emulating single-event upsets on CPU registers and is heavily dependent on the program debug symbols.
More recently, the authors of [
23] proposed a test bench, based on QEMU, to assess the efficacy of SIHFT methods for fault models proposed in the ISO 26262 standard. The primary modules of the test bench are a GDB-based fault injector and a classifier entity for result analysis. Injection is made by starting and stopping the simulation using the debugger stub, meaning that no changes in the simulator source code are needed. This approach grants high versatility, since it is not bound to any architecture. By monitoring execution, faults can be injected into the program counter (PC), the register file, and system variables and memory. Furthermore, the authors also propose a classification according to the standard, made automatically by the classifier, which receives log data from runtime watches that monitor variables and memory locations. One of the main takeaways from this work is that the fault models provided in the standard can be correctly emulated using QEMU. The authors proposed that fault models supported by the test bench can be mapped into the failure modes reported in Table 30 of the standard, part 11, concerning the central processing unit (CPU), the interrupt handling (INTH), and the interrupt control unit (ICU). Regarding performance, the tests showed an average execution time of about 31 s per injection (comprising the time needed for logging and classification).
To summarise the mentioned works,
Table 1 presents a simplified description of each tool. The table lists the works by year and provides the simulator used and the injection methodology, either by changing the internal simulator software or by using a debug probe attached to the running program. Furthermore, it shows the fault locations and models supported by the tools.
2.3. ISO 26262
The ISO 26262 standard, released in 2011, focuses on the functional safety of electrical and electronic systems in road vehicles. It defines development guidelines to minimise the risk of accidents and ensure that automotive components perform the correct functionality at the correct time. It is divided into twelve parts, but this work focuses only on a subset of them: (i) part 5, product development at hardware level, which focuses on how to prepare the hardware to prevent errors and how to retrieve architectural metrics; (ii) part 6, product development at software level, which focuses on the design and implementation of processing software modules; and (iii) part 11, guidelines on applying ISO 26262 to semiconductors, which focuses on guaranteeing safety levels on digital components.
Throughout the development phases, Fault Injection testing is recommended, either physical or through simulation, to validate the fulfilment of safety goals, i.e., requirements that are assigned to a system with the purpose of reducing the risk of one or more hazardous events. Part 5 of the standard classifies faults that may affect safety goals as single-point, residual, detected multi-point, perceived multi-point, and latent multi-point. Safe faults (SF) do not impact safety-critical logic, either because they lack a physical connection to it or because they are masked by a mitigating mechanism, such as a redundant system. Single-point faults (SPF) can reach safety-critical logic that has no safety mechanism, such as a CRC, to detect or correct them. Residual faults (RF) happen in an area monitored by a safety mechanism, but might not be detected by it. Multi-point faults (MPF) refer to faults that the safety mechanism detects, with the implication that, for these faults to cause harm, there would need to be an additional fault. Detected multi-point faults are detected, within a prescribed time, by a safety mechanism, which prevents them from being latent. Latent multi-point faults are faults whose presence is neither detected by a safety mechanism nor perceived by the driver. Finally, perceived multi-point faults are faults that are not fully detected, but have some noticeable impact on the driving experience.
The fault classification can be used to retrieve metrics relative to the component under FI experiments. Resulting metrics are expressed in terms of the
failure rate of an item,
λ, when exposed to faults [
2], as shown in Equation (
1). This equation reflects the sum of the failure rates associated with the different fault types.
From Fault Injection results, one can retrieve the Single-Point Fault Metric (SPFM), calculated using Equation (
2), which reflects the robustness of the component to single-point and residual faults either by coverage from safety mechanisms or by design (primarily safe faults). A high single-point fault metric implies that the proportion of single-point faults and residual faults in the component is low.
Another relevant metric is the Latent-Fault Metric (LFM) (Equation (
3)) that reflects the component robustness to latent faults either by coverage through safety mechanisms or by the driver recognising that the fault exists before the violation of the safety goal, or by design (primarily safe faults). A high latent-fault metric implies that the proportion of latent faults in the hardware is low.
Diagnostic coverage provided by safety mechanisms can also be retrieved by Fault Injection testing [
2], and can be seen as the ratio, given as a percentage, between the failure rates of detected faults with respect to the failure rates of all faults, as shown in Equation (
4).
The diagnostic coverage expresses the effectiveness of a safety mechanism. Although the standard does not provide an explicit expression, it can be inferred, theoretically, as the fraction of all possible faults leading to unsafe states that a safety mechanism is capable of detecting. Since the metrics are mathematically supported, one can claim that methods that map resulting faults into a standard-compliant characterisation can automate the process of retrieving relevant metrics. This process was previously validated by the authors of [
23].
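As an illustration of how classified fault results could feed these metrics, the following Python sketch computes the SPFM, LFM and diagnostic coverage from per-class failure rates. It assumes the formulations commonly given for ISO 26262 part 5, namely SPFM = 1 - (λ_SPF + λ_RF)/λ and LFM = 1 - λ_MPF,latent/(λ - λ_SPF - λ_RF), and takes the diagnostic coverage as the detected-to-total ratio described above. The class and function names and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureRates:
    """Failure rates per fault class (any consistent unit, e.g. FIT)."""
    spf: float            # single-point faults
    rf: float             # residual faults
    mpf_latent: float     # latent multi-point faults
    mpf_other: float      # detected or perceived multi-point faults
    safe: float           # safe faults

    @property
    def total(self) -> float:
        # Total failure rate as the sum over all fault classes.
        return self.spf + self.rf + self.mpf_latent + self.mpf_other + self.safe


def spfm(r: FailureRates) -> float:
    """Single-Point Fault Metric (assumed ISO 26262-5 formulation)."""
    return 1.0 - (r.spf + r.rf) / r.total


def lfm(r: FailureRates) -> float:
    """Latent-Fault Metric (assumed ISO 26262-5 formulation)."""
    return 1.0 - r.mpf_latent / (r.total - r.spf - r.rf)


def diagnostic_coverage(detected: float, all_faults: float) -> float:
    """Diagnostic coverage as a percentage of detected over all faults."""
    return 100.0 * detected / all_faults


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    rates = FailureRates(spf=1.0, rf=1.0, mpf_latent=2.0, mpf_other=6.0, safe=10.0)
    print(f"SPFM = {spfm(rates):.3f}")
    print(f"LFM  = {lfm(rates):.3f}")
    print(f"DC   = {diagnostic_coverage(90.0, 100.0):.1f} %")
```

In a Fault Injection campaign, the failure rates would typically be estimated from the proportion of runs falling into each class.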
3. QEFIRA: QEMU-Based Fault Injection Framework
The proposed Fault Injection framework, QEmu Fault Injection for Reliability Assessment (QEFIRA), is based on QEMU and extends its run-time environment to provide the ability to modify the target state during simulation. It monitors the execution of an emulated ARM target machine and injects faults according to a user specified fault experiment, as shown in
Figure 1.
The framework receives a target application and user-specified fault experiment files. These files contain the descriptions of the faults to be injected during the simulation, alongside variables for simulation control. They are parsed by the Fault Controller, which creates a virtual representation of each defined fault and enqueues it for activation according to its simulation time and trigger. During run-time, the Fault Injector module is aware of the enqueued faults and continuously monitors memory and register accesses, interrupt calls, and changes to the Program Counter (PC) for fault triggers. Each of these triggers prompts the Fault Injector to verify whether any fault in the list is pending activation or deactivation. Alongside this process, the internal virtual clock provides the current simulation time, so that timed faults can be inserted into and removed from the list according to their start time and duration. When an injection point activates a fault, the Logger saves the resulting system state changes. Logged data includes the fault-affected memory with its pre- and post-injection values, the PC execution flow, and user-defined memory monitor variables, all paired with the current simulation time.
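The interplay between enqueued faults, triggers and the virtual clock can be sketched in Python as follows. This is a simplified model under stated assumptions, not the QEFIRA implementation: the `Fault` fields, the trigger labels ("pc", "mem", "reg", "irq") and the `on_event` method are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fault:
    """Hypothetical in-memory representation of one parsed fault."""
    name: str
    trigger: str                        # "pc", "mem", "reg" or "irq"
    target: int                         # PC value, address, register or IRQ index
    start_ns: int = 0                   # earliest activation time
    duration_ns: Optional[int] = None   # None models a permanent fault


def _expired(fault: Fault, now_ns: int) -> bool:
    return fault.duration_ns is not None and now_ns >= fault.start_ns + fault.duration_ns


class FaultController:
    def __init__(self, faults: List[Fault]):
        self.pending = list(faults)

    def on_event(self, trigger: str, target: int, now_ns: int) -> List[Fault]:
        """Called on every monitored operation; returns the faults to inject."""
        # Timed faults whose window has closed are removed from the list.
        self.pending = [f for f in self.pending if not _expired(f, now_ns)]
        return [
            f for f in self.pending
            if f.trigger == trigger and f.target == target and now_ns >= f.start_ns
        ]


if __name__ == "__main__":
    controller = FaultController([
        Fault("pc_stuck", "pc", 0x66C),
        Fault("ram_seu", "mem", 0x200020A0, start_ns=100, duration_ns=100),
    ])
    for event in [("mem", 0x200020A0, 50), ("mem", 0x200020A0, 150), ("pc", 0x66C, 300)]:
        print(event, [f.name for f in controller.on_event(*event)])
```

Checking the list only when a monitored operation occurs mirrors the trigger-driven design described above, at the cost of not noticing expiry between events.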
A valid fault campaign is composed of a
golden run and, at least, a
fault run. The fault run occurs as previously described, with faults being injected throughout the simulation. The
golden run follows the same execution, but no faults are injected. While the
golden run executes, the system state is logged according to its non-faulty behaviour, allowing for a post-execution comparison between executions. At the end of the campaign, the resulting logs are sent to the Classifier, which compares the golden run with fault runs. It makes a suggestion about the campaign result and provides it to the Data Visualization tool which, in turn, provides a visual representation of the system state throughout the simulation. These two entities are further explained in
Section 3.4 and
Section 3.3, respectively.
During runtime, the framework provides support to inject faults in:
Instruction Execution (CPU_INSN): Changing the currently fetched instruction from the target application. Valid for both Arm and Arm Thumb instruction sets.
Registers (CPU_REG): Modifying register file values, or modifying the register address decoding by altering register operands. The current implementation supports the Arm main register bank from R0 to R12.
Memory: Modifying memory values or blocking read/write operations on memory. Valid for both data and code memory, and for Memory-Mapped IO (MMIO).
Interrupt Handling and Control (ICU): Changing the processor state by either modifying the asserted interrupt index, forcing the interrupt controller to ignore specific interrupt requests, or causing spurious interrupts.
All faults can be defined either as permanent
stuck-at, which reproduces a permanent defect on the underlying emulated hardware, or as transient single event upsets (SEUs), modelling short-lived faults such as the ones caused by cosmic radiation [
25]. As previously mentioned, faults can be triggered either when a specific instruction is ready to be executed, pointed to by the PC, or by read/write operations in a specific resource, e.g., accessing a memory location or when a specific interrupt request is asserted. By monitoring operations, the injector avoids activating latent faults that do not cause change in system state, reducing logging overhead.
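The difference between the two fault models can be made concrete with a small Python sketch operating on a 32-bit word. It assumes that an SEU is applied as an XOR with a bitmask and that a stuck-at fault forces the masked bits to a fixed level; the function names are hypothetical and the way QEFIRA applies its masks internally may differ.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit target word, an assumption for this example


def apply_seu(value: int, bitmask: int) -> int:
    """Transient upset: flip every bit selected by the mask once."""
    return (value ^ bitmask) & WORD_MASK


def apply_stuck_at(value: int, bitmask: int, level: int) -> int:
    """Permanent fault: force the selected bits to 0 or 1 on every access."""
    if level == 1:
        return (value | bitmask) & WORD_MASK
    return value & ~bitmask & WORD_MASK


if __name__ == "__main__":
    original = 0xFFFFFFFF
    print(hex(apply_seu(original, 0x11223344)))          # bits flipped once
    print(hex(apply_stuck_at(original, 0x0000000F, 0)))  # low nibble forced to 0
```

A bit-flip is its own inverse, so applying the same mask twice restores the value, whereas a stuck-at keeps overriding whatever the software writes.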
The faults used on fault experiments are described by an XML file containing the properties of faults to be injected and system variables that should be monitored throughout the campaign. The XML file schema is presented in
Table 2. Multiple specifications can be included in a single fault campaign, inserting multiple faults for injection. Alongside the fault specification, there are two additional parameters: (i)
simulation duration, which specifies how much time the simulation should run, and (ii)
monitor, which specifies a memory address to continuously monitor for changes, e.g., an application level state variable. Various
monitor items can be defined for more information about the system running state.
3.1. QEMU Internal Changes
Integration of the framework was made on QEMU 8.2, focusing on the ARM architecture. The native code translation and IO loops were modified to allow monitoring of target memory and instructions during the course of the simulation. This monitoring is needed, since QEMU caches blocks of executable host code, named Translation Blocks (TBs), to avoid repeatedly translating the same target code, thus accelerating the simulation. QEMU’s execution loop can be seen in
Figure 2, along with a high-level representation of where the monitoring and injection points were inserted within the loop.
As shown in the figure, injection points are inserted either before target code is translated or within the IO loop. The instruction faults are inserted when the target code is fetched from the application binary file. This process runs at least once, with the TCG taking target instructions, translating them into host code in the form of TBs, and caching them for performance. Code fetching is the main entry point for PC-triggered faults. After fetch, instructions are enqueued and checked for register accesses. At this point, register faults come into action, for both register cell value and register address decoding, before all code is translated into host code. During the execution loop, all types of memory access through the software-emulated Memory Management Unit (MMU) are monitored for injection point triggers. Prior to entering the IO loop, the system is able to handle exceptions in the form of interrupt requests or debug handling. At this point, interrupt requests can be contaminated or ignored. After every successful injection, all translation caches are emptied to force instruction re-translation. This is important for timed faults, since cached instructions may bypass subsequent triggers during simulation. All modifications are contained within the QEMU internal translation loops, outside TCG action. Logging points are synchronized with injection points, providing a valid execution flow between the execution and IO loops.
During the simulation, the internal virtual clock provides near-deterministic execution of instructions. This is enforced by the use of QEMU’s icount parameter, which provides a simulation time that is proportional to the number of emulated instructions and is not affected by the host wall-clock time. With this parameter defined, one target instruction counter tick equals 2^N nanoseconds, with N being the user-specified icount value. With a deterministic simulation time, transient faults and delayed injections can be triggered and deactivated with increased time granularity.
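The relationship between executed instructions and simulation time can be illustrated with a short Python sketch. It assumes the behaviour documented for QEMU's `-icount shift=N` option, where each executed instruction advances the virtual clock by 2^N ns; the helper names are hypothetical.

```python
def virtual_time_ns(instructions: int, shift: int) -> int:
    """Virtual time elapsed after a number of executed instructions."""
    return instructions << shift  # instructions * 2**shift


def instructions_until(time_ns: int, shift: int) -> int:
    """Number of whole instructions executed before a given virtual time."""
    return time_ns >> shift  # time_ns // 2**shift


if __name__ == "__main__":
    # With shift=3, each instruction accounts for 8 ns of virtual time.
    print(virtual_time_ns(1000, 3))
    # Instruction count at which a fault scheduled for 22 ms would activate.
    print(instructions_until(22_000_000, 3))
```

A smaller shift gives finer time granularity for timed faults, since fewer nanoseconds elapse per instruction.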
3.2. Benchmarks
One important metric to gather when developing simulation extensions is how the tool performance is affected. With that in mind, two benchmarks were performed to have an overview of how the QEMU runtime is affected. The graph in
Figure 3 shows the wall-clock time comparison, in seconds, between the runtime injection test with QEFIRA and a GDB-based implementation, such as the one proposed in [
23]. The approaches are compared against a baseline execution of two ARMv8 target applications running the ThreadX real-time operating system that: (i) perform a Triple Modular Redundant (TMR) calculation on a block of RAM; and (ii) perform a bubble sort of an array of 300 four-byte words. Both applications run for a total simulation time of ten seconds. The injection tests performed alongside the baseline consist of executing the previously mentioned applications with a fault experiment with the following description:
Three SEUs on a block of RAM which applies a bitmask of 11223344h, from 100 ms to 200 ms simulation time.
A spurious UART interrupt every time the processor reaches a data processing function.
Replacement of register cell R3 contents with a random 32-bit value, triggered by PC.
The testbench was repeated for the QEFIRA runtime injection with and without the Fault Logger. The GDB-based injection was performed by developing a script that sets breakpoints on the needed instructions, performs changes to the variables and logs the results into a text file. Tests were made on an AMD Ryzen 7 six-core PC with 24 GB of RAM, running Ubuntu 22.04.
The tests included a well-defined icount value, and were repeated with and without the inclusion of the singlestep parameter. The latter guarantees maximum granularity on TCG translated blocks, which avoids caching of big blocks of translated code. For all applications, it had no effect on the simulation results, only on the performance. Comparing with the baseline run, the changes made for QEFIRA introduced a slowdown of for singlestep execution, while the GDB-based approach introduced a slowdown up to . By discarding the aforementioned parameter, GDB-based simulations do not have meaningful changes in performance. This is due to the fact that, whilst using the GDB stub, QEMU runs automatically in singlestep mode. On the other hand, QEFIRA had a slowdown of comparing with the baseline. However, its overall speed, compared with singlestep execution, increased 4.8× in the bubble sort benchmark and 3.34× in TMR benchmark. On either application, logging had close to no influence on performance. Comparing both implementations, QEFIRA’s runtime based Fault Injection is, at least, faster than a GDB-based implementation. This value increases when the icount parameter is well-defined, and does not provoke changes in simulation results.
3.3. Logging and Data Visualization
The Data Visualization entity retrieves all data from the Logger and provides a visual summary of the simulation in a
web-based application, as shown in
Figure 4. This extension provides an overview of the program execution flow and corresponding software instructions, fault affected memory, and the resulting fault run classification, which is better explained in
Section 3.4. Execution flow is shown by charting the PC value sequence throughout the simulation of both the
golden run and
fault run, shown as blue and red dot graphs, respectively. Within this graph, the injection points are highlighted according to their affected PC value and simulation time, whilst, on the right side of the graph, the fault-affected instructions are also automatically highlighted. Below the execution flow graph, the log shows the changes in the memory values pre-injection and post-injection.
The example used in the figure refers to a TMR software similar to the one used in the benchmarks of the previous section, showing both a golden run and a fault run. The software loops between three states: (i) the Input state, which waits for three different 256-byte values to be written into a block of RAM starting at 0x200020A0 by an acquisition thread; (ii) the Computation state, which performs a bit-wise triple modular calculation of the acquired values; and (iii) the Output state, which posts the correct value and logs the result to a UART for debug purposes. Throughout the simulation, two faults were injected: (i) a permanent stuck-at fault at PC value 0x66C, skipping half of the computation loop, and (ii) an SEU starting at 22 ms of simulation time at the start of the RAM block. As shown in the figure, the SEU injected in the RAM appears in the affected memory log, and is highlighted both in the binary code and in the fault run graph. The two charts present an offset in execution, highlighted in the figure by a red ∗, resulting from the permanent fault, which exits the computation loop ahead of its termination and offsets the executed instructions.
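The bit-wise triple modular calculation of the Computation state can be illustrated by a majority voter, sketched here in Python. This assumes the usual two-out-of-three voting function; the function names are hypothetical and the target application itself is not reproduced.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bit-wise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_vote_block(copy_a: bytes, copy_b: bytes, copy_c: bytes) -> bytes:
    """Vote byte by byte over three equally sized copies of a block."""
    return bytes(tmr_vote(x, y, z) for x, y, z in zip(copy_a, copy_b, copy_c))


if __name__ == "__main__":
    good = bytes(range(8))
    corrupted = bytes([good[0] ^ 0x44]) + good[1:]  # upset in the first byte of one copy
    print(tmr_vote_block(good, corrupted, good) == good)
```

An upset confined to a single copy is masked by the vote, whereas identical upsets in two copies would be voted through.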
Alongside the visual representation, the tool provides the user with a result according to how the system behaved with the executed fault experiment under the Fault Classification tab. The Classifier provides the user the proposed fault classification, by showing, within the aforementioned tab, the possible classifications and asserting the correct classification label. The automatic classification procedure is further explained in the next section.
The monitored memory addresses are used to automatically generate Finite State Machines (FSMs) from a user-selected address, such as the one shown in Figure 5. This feature is particularly helpful for low-complexity software implementations, such as simple redundancy check algorithms, since they employ a well-defined finite state machine that expresses their functionality. By monitoring the internal state variables, the automata visually demonstrate how the system behaves in each experiment, and the state changes can be used to infer the execution flow. The automaton creation algorithm scans all observed values and assigns a state to each distinct value. Since the FSM in Figure 5 is created from the previous example, the resulting state q0 is the initial state, in which everything is initialized, and the looping states q1, q2, and q3 are, respectively, the Input, Computation, and Output states. At the current stage of development, the tool expresses the automata only in terms of their output states, not their state triggers. For this reason, state changes are always shown as ‘1’ and ‘0’.
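The state assignment described above can be sketched as follows. This is an illustrative reconstruction under the assumption that the monitored address yields an ordered list of sampled values; the function name `build_automaton` and the trace are hypothetical and do not reflect the tool's internal code.

```python
def build_automaton(samples):
    """Assign one state per distinct monitored value, in order of first appearance.

    Returns (states, transitions): states maps each value to a label such as
    'q0', and transitions is the ordered list of distinct (source, target) pairs.
    """
    states = {}
    transitions = []
    previous = None
    for value in samples:
        if value not in states:
            states[value] = f"q{len(states)}"
        current = states[value]
        # Record an edge only when the state actually changes.
        if previous is not None and previous != current:
            edge = (previous, current)
            if edge not in transitions:
                transitions.append(edge)
        previous = current
    return states, transitions


# Hypothetical trace: 0 = initialization, 1 = Input, 2 = Computation, 3 = Output.
trace = [0, 0, 1, 1, 2, 3, 1, 2, 3]
states, transitions = build_automaton(trace)
```

For this trace the sketch yields four states and the loop q1, q2, q3 entered from q0. Comparing the transition lists of a golden run and a fault run is one simple way to spot a divergence in execution flow.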
All data used for the fault experiments and the resulting log files are stored in a database, allowing for later analysis and more in-depth examination. For the presented example, the combined log file size of the fault run and the golden run, for a simulation lasting 10 s of wall-clock time, was approximately 9 megabytes (MB), excluding the application binary file.
3.4. Automatic Fault Classification
With the log files from the Fault Logger, the Classifier provides an automatic classification of the fault runs. Each run is classified using the proposal by the authors of [26], which categorises Fault Injection experiment outcomes into five groups: (i) Vanished, where no fault traces are left; (ii) Application Output Not Affected (ONA), where the resulting instruction flow is not modified, but one or more remaining bits of the architectural state are incorrect; (iii) Application Output Mismatch (OMM), where the application terminates without any error indication, and the resulting memory is affected; (iv) Unexpected Termination (UT), where the application terminates abnormally with an error indication; and (v) Hang, where the application does not finish and requires preemptive removal.
The Classifier entity, previously depicted in Figure 1, takes all of the Logger information and compares it with the baseline golden run system state. The fault classification is based on the following criteria:
Vanished: when the PC execution flow and both the fault-affected and monitored memory of the fault run match those of the golden run, whilst the fault run contains at least one active fault during simulation. Also valid for latent faults.
ONA: when the resulting monitored variable values and PC execution flow are equal, discarding differences in fault-affected memory.
OMM: when the monitored state variables are different, resulting in different instruction flows.
Unexpected termination: when the simulation ends before the expected simulation duration.
Hang: when the internal simulation watchdog triggers at the expected simulation duration.
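The criteria above can be expressed as a small decision procedure. The sketch below is a hedged illustration rather than the Classifier's actual implementation: the `RunLog` structure and its field names are hypothetical, and treating every divergence not covered by the other criteria as OMM is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical summary of one simulation run."""
    pc_trace: list                # executed PC values, in order
    monitored: dict               # monitored address -> final value
    fault_memory: dict            # fault-affected address -> final value
    ended_early: bool = False     # simulation stopped before the expected duration
    watchdog_fired: bool = False  # internal watchdog triggered at the expected duration


def classify(golden: RunLog, fault: RunLog) -> str:
    """Return one of 'Vanished', 'ONA', 'OMM', 'UT' or 'Hang'."""
    if fault.ended_early:
        return "UT"
    if fault.watchdog_fired:
        return "Hang"
    same_flow = fault.pc_trace == golden.pc_trace
    same_monitored = fault.monitored == golden.monitored
    same_fault_mem = fault.fault_memory == golden.fault_memory
    if same_flow and same_monitored:
        # Only the fault-affected memory may still differ at this point.
        return "Vanished" if same_fault_mem else "ONA"
    # Assumption: any remaining divergence is reported as OMM.
    return "OMM"


golden = RunLog([0x660, 0x664, 0x668], {0x20002000: 3}, {0x200020A0: 0x55})
fault = RunLog([0x660, 0x664, 0x668], {0x20002000: 3}, {0x200020A0: 0x54})
outcome = classify(golden, fault)
```

In this example the execution flow and monitored variable match the golden run while the fault-affected byte differs, so the run is labelled ONA.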
With all of QEFIRA’s functionalities presented, the next section provides an overview of how the tool can be used to reach ISO 26262 compliant classifications.
4. ISO 26262 Compliant Fault Models
The latest addition to the ISO 26262 standard, part 11, proposes failure modes for digital memory components, such as Flash and RAM, and non-memory components such as Central Processing Units (CPU), Direct Memory Access (DMA) modules, and interrupt controllers [2]. Furthermore, it also specifies how faults are characterised according to their duration. The standard specifies that a physical fault, represented by its fault model abstraction, can be either Permanent or Transient:
Permanent: (i) stuck-at fault, (ii) open-circuit fault, (iii) bridging fault, and (iv) single-event hard error.
Transient: (i) single-event transient, (ii) single event upset, (iii) single bit upset, (iv) multiple cell upset, and (v) multiple bit upset.
These fault models are supported by QEFIRA, as both timed permanent and transient faults can be specified for injection. Furthermore, the combination of the mask and set_bit parameters allows for single or multiple bit upsets. Further analysis of part 11 of the standard reveals that QEFIRA is able to emulate the faults that provoke the proposed failure modes for digital components. Targets for these failure modes include the CPU instructions, the CPU Interrupt Handler circuit (CPU_INTH), the Interrupt Controller Unit (ICU), the DMA controllers, data memory coherency (DATA), and communication peripherals (COM).
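To make the role of the two parameters concrete, the following minimal sketch shows how a bit mask can express single or multiple bit faults. The exact semantics are an assumption made for illustration: here `mask` selects the affected bits and `set_bit` chooses whether they are forced to 1 or to 0, and the separate bit-flip helper is hypothetical.

```python
def apply_stuck_at(value: int, mask: int, set_bit: int, width: int = 32) -> int:
    """Force the bits selected by mask to 1 (set_bit=1) or to 0 (set_bit=0)."""
    limit = (1 << width) - 1
    if set_bit:
        return (value | mask) & limit
    return value & ~mask & limit


def apply_bit_flip(value: int, mask: int, width: int = 32) -> int:
    """Invert the bits selected by mask, as in a transient upset."""
    return (value ^ mask) & ((1 << width) - 1)


single = apply_bit_flip(0b1010, 0b0001)        # one bit set in the mask: single bit upset
multiple = apply_bit_flip(0b1010, 0b0110)      # several bits set: multiple bit upset
stuck_low = apply_stuck_at(0xFF, 0x0F, set_bit=0)
```

A mask with exactly one bit set models a single bit upset, whereas a mask with several bits set models a multiple bit upset on the same word.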
The proposed failure modes and respective fault models supported by QEFIRA were compiled into Table 3. The first column identifies the targeted failure mode according to its part/subpart, i.e., the affected digital component, and the second column gives its description as per part 11 of the standard. The third column proposes the fault model behaviour that achieves the correct failure mode. The standard does not provide the explicit behaviour to achieve the failure modes, meaning that it is the tool designer’s responsibility to correctly model faults and the user’s responsibility to correctly specify them in the fault run inputs. Supporting a larger fault space therefore makes it possible to reach the same failure mode in different ways, improving coverage at the cost of a higher effort in fault experiment specification. Lastly, the final column provides the target component that should be specified in the fault experiment schema to achieve the failure mode.
ISO Compliant Fault Classification
Fault experiment results can help to reach an ISO 26262 compliant classification by testing critical paths monitored by SMs against well-defined single faults. A compliant fault classification is supported by the proposed confusion matrix in Figure 6, adapted from the classification of analog circuitry in [27]. The classification is made according to: (i) the ability of the SM to detect the fault, and (ii) the efficacy of the SM in mitigating it. This characterisation designates the faults as either Safe or Dangerous (matrix rows), and as either Detected or Undetected (matrix columns) by the SM. The simulation results provided by QEFIRA’s Logger can help determine whether the SM detected the fault, either by analysing the execution flow for a safe state trigger or by monitoring an output variable, such as a detection flag. The efficacy of the SM is indicated by the fault run result provided by the Classifier. Experiments classified as Vanished and ONA imply that the SM has yielded favourable results, whilst other classifications imply failure.
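The mapping from an experiment to a matrix quadrant can be sketched as below. This is an illustrative reading of the paragraph above, not the tool's code: the function name `quadrant`, the two-letter labels, and the boolean `detected` input (for example, a detection flag read from a monitored variable) are assumptions.

```python
# Classifier outcomes that indicate the SM mitigated the fault.
SAFE_OUTCOMES = {"Vanished", "ONA"}


def quadrant(outcome: str, detected: bool) -> str:
    """Map a Classifier outcome and a detection flag to SD, SU, DD or DU."""
    safety = "S" if outcome in SAFE_OUTCOMES else "D"   # Safe / Dangerous
    detection = "D" if detected else "U"                # Detected / Undetected
    return safety + detection


# Hypothetical campaign results: (Classifier outcome, SM detection flag).
campaign = [("Vanished", True), ("ONA", False), ("OMM", False), ("Hang", True)]
labels = [quadrant(outcome, detected) for outcome, detected in campaign]
```

Each label combines the row (Safe or Dangerous) with the column (Detected or Undetected), so the four experiments above fall into four different quadrants.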
Most of the ISO 26262-defined fault classifications presented in Section 2.3 can be mapped onto the matrix. The exception is single point faults which, by definition, assume that no safety mechanism exists to detect and correct them. Notably, latent faults belong on the left side of the diagram, because they are not detected by the SM. Since a latent fault does not cause any function failure by itself, it belongs in the upper quadrants, i.e., Safe. However, this could also imply a fault affecting the SM itself, rendering it Dangerous in combination with a subsequent fault. The rationale for perceived faults being cross-plane in the quadrants is similar to that for latent faults, but pertains to a failure in the SM that asserts a safe state. The fault itself is Dangerous, but is made tolerable by the safe state. Residual faults belong in the lower left quadrant, as they violate safety goals. On the other hand, Safe faults do not affect the system, as they are always mitigated by the safety mechanism. Lastly, detected faults belong in the lower right quadrant, since the SM will be able to detect but not mitigate them.
As mentioned in Section 2.3, fault metrics can be retrieved from fault classifications using the matrix results. From the matrix, the diagnostic coverage metric of an SM can be calculated as the weighted sum of all Dangerous Undetected (DU) faults as a percentage of the sum of all potential faults, as shown in Equation (5).
A more refined version of the previous equation considers only the faults that, according to earlier analysis, are guaranteed to jeopardise the system. This metric, namely DC-Residual, is calculated as the likelihood-weighted percentage of the Dangerous faults (DD and DU) that are Detected (DD), as presented in Equation (6).
Furthermore, the Single Point Fault Metric (SPFM) can be calculated according to Equation (7).
SPFM is calculated as the likelihood-weighted sum of the multi-point and safe faults, i.e., SU, SD and DD, as a percentage of the likelihood-weighted sum of all potential faults. This metric covers all faults that are out of the SM scope.
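As a worked illustration of these ratios, the sketch below computes them from a list of classified faults, each carrying a likelihood weight. The function names, the data layout, and the sample weights are hypothetical; the first function follows the wording given for the DU-based ratio literally, as the share of DU faults among all potential faults.

```python
ALL_QUADRANTS = {"SD", "SU", "DD", "DU"}


def weighted_sum(faults, quadrants):
    """Sum the likelihood weights of the faults that fall in the given quadrants."""
    return sum(weight for label, weight in faults if label in quadrants)


def percentage(faults, numerator, denominator):
    total = weighted_sum(faults, denominator)
    if total == 0:
        raise ValueError("no faults in the denominator set")
    return 100.0 * weighted_sum(faults, numerator) / total


def du_share(faults):
    """DU faults as a percentage of all potential faults."""
    return percentage(faults, {"DU"}, ALL_QUADRANTS)


def dc_residual(faults):
    """Detected share (DD) of the Dangerous faults (DD and DU)."""
    return percentage(faults, {"DD"}, {"DD", "DU"})


def spfm(faults):
    """SU, SD and DD faults as a percentage of all potential faults."""
    return percentage(faults, {"SU", "SD", "DD"}, ALL_QUADRANTS)


# Hypothetical campaign: (quadrant label, likelihood weight).
faults = [("SD", 4.0), ("SU", 1.0), ("DD", 3.0), ("DU", 2.0)]
metrics = (du_share(faults), dc_residual(faults), spfm(faults))
```

With these sample weights, 2 of the 10 weighted faults are DU, 3 of the 5 Dangerous ones are Detected, and 8 of the 10 fall in SU, SD or DD.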
5. Conclusions and Future Work
In this paper, we introduced QEFIRA, a Fault Injection framework built on the QEMU runtime, designed to aid the assessment of SIHFT mechanisms in embedded software. Furthermore, we proposed its usage for modelling faults compliant with the ISO 26262 automotive functional safety standard, enabling developers to better evaluate the efficacy and cost of software-implemented safety mechanisms. The framework generates a comprehensive log detailing execution flow and memory dumps, complemented by automatic classification of fault experiments. This, coupled with the proposed confusion matrix, allows us to gather compliant metrics to characterize and evaluate different designs in the early stages of development, without destroying hardware or requiring physical hardware at all. We also provide an extensive list of proposed fault models compliant with the standard, addressing part location and fault duration. The framework presents high reachability, performance and injection granularity at the cost of portability and source code intrusiveness. We view this as a trade-off favouring more accurate fault behaviour and higher campaign throughput. The Fault Injection overhead introduced into the simulator resulted in a slowdown compared to the native implementation. The post-campaign visual aid provides a quick overview of system behaviour, avoiding the need to analyse verbose log files, thereby reducing generation overhead. Regarding the proposed fault matrix, three architectural metrics can be retrieved to evaluate the efficacy of the SMs.
Our next steps include extending the framework's logging capabilities and injection points to Linux-based applications, and migrating monitoring features and injection hooks into plugins to reduce intrusiveness. These changes aim to improve framework portability without significantly compromising speed. Furthermore, register-level fault models can be extended, allowing injection on the entire register file. Regarding fault classification, we plan to refine the confusion matrix to address cases where safety mechanisms function correctly but safety functions still fail. This should avoid any misleading information regarding latent faults or faults that can occur in the safety mechanism itself.