Article
Peer-Review Record

Virtualized Fault Injection Framework for ISO 26262-Compliant Digital Component Hardware Faults

Electronics 2024, 13(14), 2787; https://doi.org/10.3390/electronics13142787
by Rui Almeida *, Vitor Silva and Jorge Cabral
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 3 June 2024 / Revised: 11 July 2024 / Accepted: 14 July 2024 / Published: 16 July 2024
(This article belongs to the Special Issue Safety of Real-Time and Cyber-Physical Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The presented paper introduces a fault injection framework named QEFIRA, which can achieve the failure modes proposed by Part 11 of the ISO 26262 automotive standard. The proposed tool uses QEMU to inject faults during runtime and provides an automatic framework for post-execution analysis.

 

Although the article has potential, some fundamental aspects must be reviewed to determine its acceptance.

 

  • The state-of-the-art should also include RTL simulation techniques directly implemented with behavioural simulations, UVM or SystemC approaches. A few words should be stated about the differences between these and other simulation approaches, such as GEM5 or QEMU.
  • Lines 223-225. I don't understand how the fault injector decides to inject faults or what it bases its decision on.
  • Lines 227-228. What do you mean by "golden run"? Is it just memory comparisons? Does it also include Registerfile comparisons with the processor's state? Please clarify this section.
  • Line 232. Considering the faults are injected in the micro-op, which registers can be accessed? Just operands or all the registers from Registerfile? Clarify this section.
  • Table 1, <duration> of transient faults. Speaking about SEU, there are no duration constraints. An SEU fault changes the state of a flip-flop until the next rewrite of the flop, so the duration parameter does not make sense. It only counts for stuck-at or SET faults (not considered in the environment according to lines 239-241).   
  • Line 302. What is "icount"? A reader does not necessarily know this.
  • Comparisons are only made with GDB. Seriously consider including further comparisons with the state of the art of fault injection simulators.
  • Figure 4. I need clarification on the contribution of the Analyzer. Does it show only the PC progression, or can it show other signals/registers? The red-circled instruction 6ac is faulted, but why does the memory show just the state of 690? How can the results produced by the Analyzer be used to ensure compliance with ISO 26262?
  • Figure 5. Better explain this figure and its connection to the previous one (Figure 4). Moreover, where are the states q0, q1, q2 and q3 explained?
  • Lines 252-363, already explained in Lines 333-339.
  • Lines 364-365. Better explain this statement.
  • Lines 379-384. "QEFIRA should be able to emulate the faults that provoke the proposed failure modes for digital component". According to the article, it should already comply with ISO26262. Please clarify this statement. 
  • Table 2. Never cited or explained in the text. Please remove it.
  • Section 4.1. It is unclear how this section is connected with the result provided by the tool. Please highlight the connection, especially with the use of the Confusion Matrix. 

 

Overall, the article lacks objective validation of the work. More targeted comparisons with other works in the literature and a more detailed description of the implementation would be needed. A lot of effort goes into the ISO 26262 standard and little into the QEFIRA tool itself.

 

Some minor changes.

 

 

  • Line 17 "electromagnetic radiation". Electromagnetic radiation, as it is, is too generic. Consider changing it to something particular like "ionizing particles".  
  • Line 34, "ECU" was never defined before.
  • Line 143, "TLB" was never defined before.
  • Eq. (1): it should be specified that this equation is the sum of the different failure rates due to the different fault types. 
  • Line 196, Single-Point Fault Metric: consider adding the acronym SPFM, which was never defined before. 
  • Line 204, Latent-Fault Metric: consider adding the acronym LFM, which was never defined before. 
  • Line 282, "performed to to"
  • Figure 3. Consider adding the [s] seconds, on the vertical axis. 
  • Line. 328 "the the pre-injection"
  • Line. 380 "the the proposed"

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections in the revised manuscript.

Comment 1: The state-of-the-art should also include RTL simulation techniques directly implemented with behavioral simulations, UVM or SystemC approaches. A few words should be stated about the differences between these and other simulation approaches, such as GEM5 or QEMU.
Response 1: We provided an overview of RTL-based methods in Section 2.1 and mentioned the advantages of using micro-architectural simulation approaches in contrast to lower-level RTL approaches.

Comment 2: Lines 223-225. I don't understand how the fault injector decides to inject faults or what it bases its decision on.
Response 2: The fault injector monitors both the instruction translation loop and the memory loop, and checks the fault list for faults that should be triggered. The fault list is created by the Fault Controller and is composed of the virtual representations of the faults specified in the XML file, containing all their properties. The Fault Injector module continuously monitors accesses to memory and registers, interrupt assertions and changes to the Program Counter (PC). Each of these triggers prompts the Fault Injector to verify whether any fault within the fault list is pending activation or deactivation. When an injection point activates a fault, the Logger saves the resulting system state changes. During runtime, the simulation time is also monitored to control timed faults, activating them when their activation period equals the simulation time. A further description of this process was added in Section 3.
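For illustration only, the triggering logic described in this response could be sketched as follows (hypothetical Python; the `Fault` record and `check_fault_list` helper are ours for exposition, not QEFIRA's actual implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Fault:
    # Illustrative virtual fault record built from one XML fault entry.
    target: str                         # e.g. "memory", "register"
    trigger_pc: Optional[int] = None    # PC-triggered fault
    trigger_time: Optional[int] = None  # timed fault (ns of simulation time)
    active: bool = False

def check_fault_list(faults, pc, sim_time_ns):
    # Called at each injection point (PC change, memory access, interrupt):
    # activate any fault whose trigger condition is now satisfied.
    activated = []
    for f in faults:
        if f.active:
            continue
        if f.trigger_pc is not None and f.trigger_pc == pc:
            f.active = True
        elif f.trigger_time is not None and sim_time_ns >= f.trigger_time:
            f.active = True
        if f.active:
            activated.append(f)  # here the Logger would record the state change
    return activated
```

In QEFIRA itself this check is dispatched from QEMU's translation and memory loops rather than called explicitly.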

Comment 3: Lines 227-228. What do you mean by "golden run"? Is it just memory comparisons? Does it also include Registerfile comparisons with the processor's state? Please clarify this section.
Response 3: A valid fault campaign is composed of a golden run and at least one fault run. The fault run contains faults being injected throughout the simulation; the golden run follows the same execution, but no faults are injected. While the golden run executes, the system state is logged according to its non-faulty behavior, allowing for a post-execution comparison between executions. Log data includes: fault-affected memory containing prior- and post-injection values, the PC execution flow, and user-defined memory monitor variables, all paired with the current simulation time. Comparisons between the golden run and the fault run are made by the Classifier, which uses all the log data to provide a classification, as shown in Section 3.4. More details about this procedure were added in Sections 3, 3.3 and 3.4.

Comment 4: Line 232. Considering the faults are injected in the micro-op, which registers can be accessed? Just operands or all the registers from Registerfile? Clarify this section.
Response 4: The current version of the tool supports the Arm registers R0 to R12. The R15 register, the PC, is used for fault triggers. Future work aims to provide support for faults targeting the entire register file, including the CPSR status registers. A clarification was added in Section 3.

Comment 5: Table 1, <duration> of transient faults. Speaking about SEU, there are no duration constraints. An SEU fault changes the state of a flip-flop until the next rewrite of the flop, so the duration parameter does not make sense. It only counts for stuck-at or SET faults (not considered in the environment according to lines 239-241).   
Response 5: Transient faults, once triggered, only stay in the system for a single operation. For example, a transient fault on a memory address will apply the specified bit mask for a single read or write operation. The specified duration relates to the time the fault is latent in the system. Further analysis showed that the table was missing important information: <time> pertains to both permanent stuck-at and transient faults, while <duration> pertains only to transient faults. The table was modified to provide the correct information.
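As an illustration of this single-operation behavior, a transient fault on a memory cell could be modeled as below (hypothetical Python sketch; we assume here, purely for the example, that the bit mask is applied as an XOR-style bit flip):

```python
def faulty_access(value: int, mask: int, fault_active: bool) -> int:
    # A transient fault, once triggered, affects a single read or write:
    # modeled here (an assumption) as flipping the masked bits once,
    # while an inactive fault leaves the value untouched.
    return value ^ mask if fault_active else value
```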

Comment 6: Line 302. What is "icount"? A reader does not necessarily know this.
Response 6: Added a definition of “icount” in Section 3.1 after line 279: “This is enforced by the usage of QEMU’s icount parameter, which provides a simulation time that is proportional to the emulated instructions and is not impacted by the host wall-clock time. By defining this parameter, one target instruction counter tick equals 2^N nanoseconds, where N is the user-specified icount value. With a deterministic simulation time, transient faults and delayed injections can be triggered and deactivated with increased time granularity.”
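The resulting tick-to-time relation can be written down directly (illustrative Python; `sim_time_ns` is our name for the conversion, not a QEMU API):

```python
def sim_time_ns(executed_instructions: int, icount_shift: int) -> int:
    # With QEMU's icount parameter N, one instruction-counter tick
    # advances virtual time by 2**N nanoseconds, independently of the
    # host wall clock, so simulation time is fully deterministic.
    return executed_instructions * (2 ** icount_shift)
```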

Comment 7: Comparisons are only made with GDB. Seriously consider including further comparisons with the state of the art of fault injection simulators.
Response 7: The aim for the QEFIRA tool was to use QEMU, as it presents itself as an open-source solution with continuous updates and upgrades regarding platform compatibility. The focus lies on the usage of QEMU and on building a simulation ecosystem that can be extended to more architectures and different FI mechanisms. With that in mind, we have decided to compare only with GDB, as it is a different FI technique implemented in the same tool and in the same software version.

Comment 8:  Figure 4. I need clarification on the contribution of the Analyzer. Does it show only the PC progression, or can it show other signals/registers? The red-circled instruction 6ac is faulted, but why does the memory show just the state of 690? How can the results produced by the Analyzer be used to ensure compliance with ISO 26262?
Response 8: The Data Visualization tool provides a visual representation of the system state throughout the simulation run. Regarding the red-circled instruction, the example shows a fault run with a prior injection on the mentioned PC value. The screenshot does not provide a clear enough example, so it was replaced with an explicit representation of a clearer fault experiment. 
The Data Visualization tool does not produce results. It provides the user with a friendly way to verify the simulation state throughout its duration, by showing: (i) the PC value progression in a chart with both the golden run and the fault run; (ii) the fault-affected memory; (iii) the application assembly instructions; (iv) the resulting fault classification provided by the Classifier; and (v) the monitored variables and the optional FSM compiled from them. 
Compliance with ISO 26262 is reached because the fault models supported by QEFIRA are aligned with the standard, and by complementing the results from the Classifier with the Confusion Matrix proposed in Section 4.1. As per the standard's specification, the architectural metrics can be reached from the fault classification (SF, MPF, …). We have achieved a correct mapping between the fault run classification and the standard-specified fault classification by using the fault models supported by QEFIRA and adopting the failure modes that fit the ones proposed in Table 30 of the standard's Part 11.

Comment 9: Figure 5. Better explain this figure and its connection to the previous one (Figure 4). Moreover, where are the states explained? q0,q1,q2,q3.
Response 9: The FSM presented in Figure 5 results from the progression of a monitored state variable of the example shown in Figure 4. In the previous example, an application variable containing the encoding of the application state machine was monitored, providing information about the internal application state. The state q0 is the initial state, where the platform memory is initialized, and the looping states q1, q2 and q3 are, respectively, the Input, Computation and Output states. Further information about this figure was provided in Section 3.3.

Comment 10: Lines 252-363, already explained in Lines 333-339.
Response 10: The text on lines 252-363 does not correlate with the text on lines 333-339. 

Comment 11: Lines 364-365. Better explain this statement.
Response 11: The statement is as follows: “Since the Classifier only provide with a suggestion, the user, on the Data Visualization tool, can, upon further examination, classify the campaign differently.” This statement regards a specific functionality of the Data Visualization tool that allows the user to change the fault classification prior to saving the log data into the database. The Classifier provides a proposed fault classification, shown by asserting one of the classification labels: Hang, UT, OMM, ONA and Vanished. The user can select another label to provide a different classification, prompting the database to store the record from the fault experiment. The statement was removed in this revision, as it is not significantly relevant to the manuscript.


Comment 12: Lines 379-384. "QEFIRA should be able to emulate the faults that provoke the proposed failure modes for digital component". According to the article, it should already comply with ISO26262. Please clarify this statement. 
Response 12: The QEFIRA tool was developed with safety-critical systems in mind, not exclusively for ISO 26262. In fact, QEFIRA is able to emulate the faults that provoke the proposed failure modes for digital components. The statement aimed to show that QEFIRA is not exclusive to ISO 26262 but is broad enough to be extended to other standards, such as IEC 61508, that may use the supported fault models and benefit from the proposed fault classification. The statement was updated to “Further analysis of Part 11 of the standard reveals that QEFIRA is able to emulate the faults that provoke the proposed failure modes for digital components.”
 
Comment 13: Table 2. Never cited or explained in the text. Please remove it.
Response 13: Table description and reference were added to the manuscript. The text regarding the table was lost during template formatting. The following paragraph was added:
“The proposed failure modes and the respective fault models supported by QEFIRA were compiled into Table 3. The first column provides the identification of the targeted failure mode according to its part/subpart, i.e., the digital component affected, followed by its description as per the standard's Part 11 specification. The third column proposes the fault model behavior that achieves the correct failure mode. The standard does not provide the explicit behavior needed to achieve the failure modes, meaning that it is the tool designer's responsibility to correctly model faults and the user's responsibility to correctly specify them in the fault run inputs. This means that a larger fault space can cover different ways to achieve the same failure mode, improving coverage at the cost of a higher effort in fault experiment specification. Lastly, the final column provides the target component that should be specified in the fault experiment schema to achieve the failure mode.”

Comment 14: Section 4.1. It is unclear how this section is connected with the result provided by the tool. Please highlight the connection, especially with the use of the Confusion Matrix. 
Response 14: Section 4.1 provides the bridge between the ISO 26262 standard and the presented QEFIRA tool. It proposes how to use QEFIRA's abilities to reach both a fault classification and architectural metrics compliant with the standard's Part 11 and Part 5, respectively. The QEFIRA tool is not necessarily bound to ISO 26262, but was designed with safety-critical applications in mind. This section shows how QEFIRA's functionalities align with the standard, proposing a way to reach standard-compliant metrics.
The section proposes both (i) the usage of fault models supported by QEFIRA to reach the failure modes proposed in the standard and (ii) a confusion matrix that maps simulation results to standard-compliant fault classifications. These are the two conditions needed to reach the architectural metrics, since a standard-compliant fault classification is required, as shown in the equations presented in Section 2.3. This classification is the culmination of QEFIRA's fault models, fault experiment classification and application-level monitoring, supported by the confusion matrix. 
A correct usage of the confusion matrix implies a fault experiment that evaluates a safety mechanism against a well-defined fault space. Furthermore, the internal state of the mechanism should be monitored, by analyzing either a logged application-level variable, such as a detection flag, or the execution flow. In the matrix, columns depend on the value of the detection flag, while rows depend on the fault classification given by the Classifier entity.
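One plausible cell assignment for such a matrix is sketched below (hypothetical Python; the actual assignment used in Section 4.1 of the manuscript may differ, and `iso_fault_class` is our illustrative name):

```python
def iso_fault_class(classification: str, detected: bool) -> str:
    # Illustrative mapping from (Classifier result, safety-mechanism
    # detection flag) to the ISO 26262 Part 5 fault classes.
    # The specific cell choices below are assumptions for exposition.
    if classification == "Vanished":
        return "Safe"
    # Runs whose outcome visibly deviates (flow, hang, abort) are
    # treated here as dangerous; data-only deviations as benign-latent.
    dangerous = classification in ("OMM", "UT", "Hang")
    if detected:
        return "Multi-Point"
    return "Residual" if dangerous else "Latent"
```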

Regarding the minor changes, we have made the following changes.

Comment 1: Line 17 "electromagnetic radiation". Electromagnetic radiation, as it is, is too generic. Consider changing it to something particular like "ionizing particles".  
Response 1: Changed to “ionizing radiation”.

Comment 2: Line 34, "ECU" was never defined before.
Response 2: Changed to Electronic Control Unit (ECU).

Comment 3: Line 143, "TLB" was never defined before.
Response 3: Rephrased into “target-to-host translation”.

Comment 4: Eq. (1) should be specified that this equation is the sum of different failure rates due to the different fault types. 
Response 4: Added: “This equation reflects the sum of the different failure rates with respect to the different fault types.”

Comment 5: Line 196, Single-Point Fault Metric: consider adding the acronym SPFM, which was never defined before.
Response 5: Added acronym.

Comment 6: Line 204, Latent-Fault Metric: consider adding the acronym LFM, which was never defined before.
Response 6: Added acronym.

Comment 7:  Line 282, "performed to to"
Response 7: Removed “to”.

Comment 8:  Figure 3. Consider adding the [s] seconds, on the vertical axis. 
Response 8: Added “Time elapsed [s]” as vertical axis title.

Comment 9:  Line. 328 "the the pre-injection"
Response 9: Removed “the”.

Comment 10: Line. 380 "the the proposed"
Response 10: Removed “the”.

Thank you once again for your revision and comments. We hope to have answered all your questions.
Best regards,
The authors.

Reviewer 2 Report

Comments and Suggestions for Authors

Hi Authors. Kindly refer to the attached PDF for comments. Thanks.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

ok

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections to your comments.

Comment 1: How does QEFIRA contribute to validating system behavior in safety-critical applications, particularly in the automotive industry? 
Response 1: The developed tool, QEFIRA, is based on QEMU, an instruction-accurate simulator that allows system behavior to be emulated with the help of software-implemented models. Naturally, the usage of simulation allows system behavior to be validated in a non-destructive testing environment, assuming that the system model is well established and capable of providing sufficiently accurate results. The state of the art shows that simulation is a well-rounded validation technique across several domains within and outside the safety spectrum.
The QEFIRA tool extends the mentioned simulator to focus on validating system behavior by providing faulty stimuli that would be destructive in hardware and are impossible in software-based testing. This type of testing is particularly important in safety-critical applications, since fault-tolerance mechanisms need to be tested for their correctness and evaluated against RHFs. This way, QEFIRA helps developers evaluate safety mechanisms by providing a user-friendly input schema for running fault injection campaigns and retrieving the resulting system output in a graphical manner.
Considering the automotive industry, QEFIRA provides the ability to provoke the failure modes suggested in the ISO 26262 standard. Furthermore, the metrics provided by each fault campaign can be used to retrieve ISO 26262-compliant metrics for safety mechanisms.

Comment 2: In what ways does QEFIRA align with the failure modes proposed by Part 11 of the ISO 26262 standard? 
Response 2: The tool is able to emulate fault behavior that fits the ISO 26262 standard. Regarding fault models, they can be permanent stuck-at or transient. Part 11 of the standard dictates that digital fault models include both permanent (described as stuck-at, open-circuit and bridging) and transient (described as SET, SEU, SBU, MCU and MBU) faults. QEFIRA is able to emulate the SEU transient fault model and the stuck-at fault model for permanent faults. Regarding fault location, the tool can inject faults into common IP blocks, such as the CPU, CPU_INTH, ICU, DMA and memory. These common IP blocks are present in Table 30 of Part 11 of the standard. 
This same table provides a guideline on what to consider a failure mode for each IP. By using QEFIRA's ability to inject faults at runtime, the user can create a fault experiment that addresses the failure modes presented in the table for each IP block. We assist users in reaching the failure modes by providing a table that contains guidelines on how to reach them using the fault models supported by QEFIRA.

Comment 3: How does QEFIRA utilize QEMU to inject faults, and what types of faults can it inject during runtime? 
Response 3: The QEFIRA tool uses QEMU version 8.2 and focuses on extending its runtime to monitor execution and decide when and where to inject faults. The tool receives a user-defined XML schema file that configures a fault campaign, containing the fault specification, such as trigger and location, and runtime variables, such as the simulation duration and the memory addresses to monitor. The Fault Controller module creates virtual representations of the specified faults, organized as a fault list, providing the Fault Injector module with the information needed to decide when and where to inject each fault. As the simulation progresses, the Fault Controller module is also responsible for adding and removing timed faults from the list when the simulation time matches or exceeds the fault trigger time. 
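A fault campaign of this shape could look like the hypothetical snippet below (the element and attribute names are illustrative assumptions, not QEFIRA's actual schema), parsed here into a simple fault list:

```python
import xml.etree.ElementTree as ET

# Hypothetical fault-campaign description; tag and attribute names
# are ours for illustration only.
CAMPAIGN = """
<campaign duration="10s">
  <fault type="transient" target="memory" address="0x20000100"
         mask="0x11223344" timer="100ms" duration="100ms"/>
  <fault type="permanent" target="register" location="R3"
         mask="0xFFFFFFFF" trigger="pc" pc="0x6ac"/>
</campaign>
"""

def build_fault_list(xml_text: str):
    # Mirror of the Fault Controller's role: turn each <fault> entry
    # into a virtual fault record (here, a plain dict of attributes).
    root = ET.fromstring(xml_text)
    return [dict(f.attrib) for f in root.findall("fault")]
```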
During runtime, the Fault Injector module monitors both the instruction translation loop (in micro-ops) and the IO operations. Each fetched instruction is checked for a matching fault in the list and, if one matches, is replaced by its faulty counterpart. For the duration of the fault, the faulty instruction is cached in the Translation Block (TB), speeding up further usages without the need to repeat the process. During execution, the program counter is monitored to inject PC-triggered faults in the CPU, memory or ICU. The same rationale applies to the IO operations, which occur after execution: each memory access is monitored and checked for injection. The overall execution loop maintains its performance since the instruction cache is kept while the fault is active, refreshing its state when faults are removed from the system.
Faults can be injected in components such as the CPU, RAM, FLASH, MMIO, registers and ICU (e.g., the NVIC for the Arm architecture). The faults can be either permanent stuck-at or transient, and both can be triggered by simulation time or by the program counter. At each injection point, the system state is saved into a log file, containing the current PC, the current simulation time and the monitored variables.

Comment 4: What benefits does QEFIRA's automatic post-execution analysis offer for system validation? 
Response 4: QEFIRA generates log files for each fault experiment containing: (i) the list of injected faults; (ii) the system progression regarding the executed instructions, paired with the current simulation timestamp; (iii) the values of fault-affected memory, also paired with the current timestamp; (iv) the monitored application-level variables; and (v) the total simulation time. A fault experiment is composed of a golden run, where no faults are injected, and one or more fault runs, where faults are injected.
From this data, the Classifier entity classifies the fault experiment as follows: (i) Vanished, where the golden run's instruction progression, memory and monitor variables match the fault run; (ii) ONA, where the monitored memory presents differences while the instruction progression remains equal; (iii) OMM, where the instruction flow differs between runs; (iv) Unexpected Termination (UT), where the simulation aborts before the user-defined simulation duration; and (v) Hang, where the simulation enters an infinite loop or an unexpected instruction flow, resulting in a simulation time of at least 2× the expected simulation time.
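The classification rules in this response can be sketched as an ordered series of checks (hypothetical Python; the dictionary layout and rule ordering are our simplification, not QEFIRA's code):

```python
def classify_run(golden: dict, fault: dict, expected_time: float) -> str:
    # Each run is summarized as {'trace': [PCs], 'memory': {...},
    # 'monitors': {...}, 'sim_time': t, 'aborted': bool} (our layout).
    # Timing anomalies are checked first, then flow, then data.
    if fault["sim_time"] >= 2 * expected_time:
        return "Hang"       # at least 2x the expected simulation time
    if fault.get("aborted"):
        return "UT"         # aborted before the user-defined duration
    if fault["trace"] != golden["trace"]:
        return "OMM"        # instruction flow differs between runs
    if (fault["memory"] != golden["memory"]
            or fault["monitors"] != golden["monitors"]):
        return "ONA"        # data differs, flow unchanged
    return "Vanished"       # fault run matches the golden run
```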
By providing an automatic classification, the developer is spared the repetitive task of examining each log individually to obtain information about the resulting system behavior. This automation increases the throughput of tests performed per fault experiment. Furthermore, it can reveal cases where the experiment result differs from the expected one, highlighting occurrences that may have escaped the developer.
All the data is made available to the user through a web page interface provided by the Data Visualization entity. On this interface, the user can view all the data in a simpler manner and verify whether the automatic analysis matches the system behavior. 

Comment 5: How does the integration of a confusion matrix in QEFIRA enhance the evaluation of safety mechanisms? 
Response 5: The proposed confusion matrix aims to characterize faults according to the ISO 26262 standard for fault classification. The matrix uses the result of the fault experiment provided by the Classifier entity and an internal safety mechanism variable, such as a trigger flag, to classify faults as Safe, Multi-Point, Residual or Latent. This classification is presented in Part 5 of the ISO 26262 standard.
From the fault classification, the standard specifies that architectural metrics regarding the safety-related mechanisms can be retrieved from the number of faults that affect the system, assuming all failures are independent. The available metrics are specified in Part 5 of the standard. Using the confusion matrix results, one can retrieve the following metrics: (i) Diagnostic Coverage, which is the weighted sum of all Dangerous Undetected (DU) faults as a percentage of the sum of all potential faults; (ii) Diagnostic Coverage of residual faults, which tackles only faults that are guaranteed, from earlier analysis, to jeopardize the system; and (iii) the Single-Point Fault Metric, which covers all faults that are outside the safety mechanism's (SM) scope. 
Different safety mechanisms can be evaluated by comparing these metrics among them and finding the right trade-off between the coverage needed for the system and its application cost.
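As a worked sketch of how such metrics reduce to failure-rate ratios (illustrative Python; the formulas follow the usual ISO 26262 Part 5 definitions of SPFM and LFM, and the λ inputs are made-up numbers, not results from the paper):

```python
def spfm(l_spf_rf: float, l_total: float) -> float:
    # Single-Point Fault Metric: fraction of the total failure rate
    # NOT consumed by single-point and residual faults.
    return 1.0 - l_spf_rf / l_total

def lfm(l_mpf_latent: float, l_total: float, l_spf_rf: float) -> float:
    # Latent-Fault Metric: fraction of the remaining (non-SPF/RF)
    # failure rate that is not latent multi-point.
    return 1.0 - l_mpf_latent / (l_total - l_spf_rf)
```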

Comment 6: I suggest having a comparison table to compare FOM metrics between this proposed work and previously published work in the same domain. 
Response 6: We have added a table comparing our work to other works in the same domain in section 2.2.

Comment 7: What metrics can QEFIRA estimate to ensure compliance with ISO 26262 standards? 
Response 7: From the fault classification, the standard specifies that architectural metrics regarding the safety-related mechanisms can be retrieved from the number of faults that affect the system, assuming all failures are independent. The available metrics are specified in Part 5 of the standard. Using the confusion matrix results, one can retrieve the following metrics: (i) Diagnostic Coverage, which is the weighted sum of all Dangerous Undetected (DU) faults as a percentage of the sum of all potential faults; (ii) Diagnostic Coverage of residual faults, which tackles only faults that are guaranteed, from earlier analysis, to jeopardize the system; and (iii) the Single-Point Fault Metric, which covers all faults that are outside the safety mechanism's (SM) scope. 
Different safety mechanisms can be evaluated by comparing these metrics among them and finding the right coverage needed for the system.

Comment 8: How does QEFIRA's performance compare to the native QEMU implementation in terms of speed and efficiency? 
Response 8: To test QEFIRA's performance penalty relative to the native QEMU implementation, two benchmark programs were used, running on an ARMv8 processor supported by ThreadX: (i) a TMR calculation over a block of RAM and (ii) a bubble sort of an array of 300 four-byte words. Both programs were executed for a total of ten seconds of simulation time. 
The tests were performed with a fault experiment with the following description:
  • Three SEUs on a block of RAM, applying a bit mask of 11223344h, from 100 ms to 200 ms of simulation time;
  • A spurious UART interrupt every time the processor reaches a data-processing function;
  • Replacement of the contents of register R3 with a random 32-bit value, triggered by the PC.
For the QEFIRA benchmarks the faults were injected; during the native tests, no faults were injected. Compared with the baseline run, the changes made for QEFIRA introduced a worst-case slowdown of 1.4× for singlestep execution. Singlestep execution guarantees maximum injection granularity, since there is no translation block caching. 
By discarding the aforementioned parameter and caching translated instructions, QEFIRA had a slowdown of 3.8× compared with the baseline. However, its overall speed, compared with singlestep execution, increased 4.8× in the bubble sort benchmark and 3.34× in the TMR benchmark.
We consider the slowdown in singlestep execution acceptable, as the trade-off between injection granularity and performance favors simulation granularity for improved results. The usage of instruction caching presents itself as a point of improvement, to further increase performance where injection granularity may not be a concern.

Comment 9: In what ways does QEFIRA support early-stage design evaluation and characterization for safety-critical systems? 
Response 9: The usage of simulators in the early stage of development allows developers to experiment with different system architectures or processor ISAs before committing to a specific hardware platform. As safety-critical platforms may employ significant processing power and several built-in properties for harsh environments, they become costly, and poor platform decisions may have monetary and time consequences. 
Furthermore, the usage of simulation allows developers to deploy software and test its mechanisms without the need for a physical prototype. This also holds for fault-tolerance mechanisms implemented in software, as safety requirements may demand that the software react correctly to the presence of random hardware failures. In this context, QEFIRA provides developers with the tools needed to stimulate the software mechanisms with fault models that may provoke failures in the underlying hardware. The fault models QEFIRA provides can be used to emulate the failure modes presented in the ISO 26262 automotive standard, allowing the retrieval of significant architectural metrics regarding the employed safety mechanisms. Since this can be done without physical hardware, it can be performed in the early stages of development, without compromising decisions on the platform or processor architecture. 

Comment 10: What advantages does QEFIRA offer in assessing software-implemented mechanisms against random hardware failures (RHF)? 
Response 10: QEFIRA offers several advantages when it comes to validating and evaluating SIHFT mechanisms against RHFs. The first is the non-destructiveness of hardware. One way to provoke hardware failures is through faults caused by ionizing particles, which may render the hardware unusable and is costly in terms of setup. Since QEFIRA can emulate failure modes that approximate those caused by electromagnetic interference, tests can be made safely without the need for, and cost of, physical hardware. The second lies in the granularity of injection, which, compared with software-only and hardware-only fault injection, offers a larger spectrum of fault locations and fault models, allowing for more system reachability. This greatly enlarges the fault space, resulting in more testing experiments. Regarding fault injection capabilities, the supported fault models are aligned with Part 11 of the ISO 26262 standard; thus, the tool can be used to validate and evaluate systems that are within the automotive standard's scope.
Compared with other works, QEFIRA provides a novel evaluation method that allows the retrieval of significant architectural metrics compliant with the ISO 26262 standard. Considering that the application of the QEFIRA tool occurs early in the development phase, the assessment of various SIHFT mechanisms can be performed in order to find the optimal solution.

Comment 11: References should be increased to >30 to make this work more convincing to readers.
Response 11: Number of references was increased.


Thank you once again for your revision and comments. We hope to have answered all your questions.
Best regards,
The authors.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Small final changes, review the text in the added parts.

  • Line 99-100: " of of "
  • Table 2: in the comment you said <time> but you wrote <timer>

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections to your comments.

Comment 1: Line 99-100: " of of ".
Response 1: We have removed the grammatical mistake. Line 99 was replaced with "while the authors of [14] and [15] use". 

Comment 2: Table 2: in the comment you said <time> but you wrote <timer>.
Response 2: The first column of Table 2 presents the label used in the XML file to specify the fault properties. The <timer> label specifies the simulation time at which the fault should become latent in the system. An example usage of this property value is "20MS", which means that the fault is inserted into the system (latent) after 20 milliseconds of simulation time. In the comment, we missed an "r"; the correct label is <timer>. We apologize for the confusion.
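For illustration, a value such as "20MS" could be parsed into simulation time as follows (hypothetical Python; the set of accepted unit suffixes is our assumption, since only "20MS" appears in the discussion):

```python
import re

# Assumed unit suffixes, all converted to nanoseconds of simulation time.
_UNITS = {"NS": 1, "US": 1_000, "MS": 1_000_000, "S": 1_000_000_000}

def parse_timer(value: str) -> int:
    # Hypothetical parser for the <timer> property, e.g. "20MS" -> 20 ms,
    # returned as nanoseconds of simulation time.
    m = re.fullmatch(r"(\d+)(NS|US|MS|S)", value.strip().upper())
    if not m:
        raise ValueError(f"bad timer value: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2)]
```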

Thank you once again for your revision and comments. We hope to have answered all your questions.
Best regards,
The authors.

 

 
