Article
Peer-Review Record

Fault-Tolerant FPGA-Based Nanosatellite Balancing High-Performance and Safety for Cryptography Application

Electronics 2021, 10(17), 2148; https://doi.org/10.3390/electronics10172148
by Laurent Gantel 1,*, Quentin Berthet 1, Emna Amri 2,*, Alexandre Karlov 2 and Andres Upegui 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 13 July 2021 / Revised: 4 August 2021 / Accepted: 27 August 2021 / Published: 3 September 2021

Round 1

Reviewer 1 Report

The proposal is really interesting, with high novelty in both the platform (CubeSats) and the application (post-quantum signatures).
However, some important details are missing, most of them related to the reliability of the system and the experimental setup.

1) The authors trust the SEM to estimate the radiation level and move from one configuration to another. The questions are:

1.1) As the SEM is in the configuration memory of the FPGA, what happens when radiation affects the SEM itself or the related routing?

1.2) What is the flow followed by the SEM to move from one configuration to another? Does it change with the orbit? A description of the full routine would be interesting, indicating the parameters that are measured and the flow diagram followed.

2) In addition to Fig. 9, a deeper explanation of how the injection campaign was performed is required.
2.1) The frame addresses are randomly chosen, but how are the errors injected? One by one, or multiple in the same clock cycle?

2.2) Are the errors injected at the beginning of the experiment or at a different time? In the latter case, how is the time chosen?

2.3) Once the frame addresses are randomly chosen, are they the same for all the experiments? Likewise, is the time at which the errors are injected the same for all the experiments?

3) Finally, it is not clear whether the number of experiments is representative for this large design. Could you provide a statistical analysis showing that the sample (number of experiment runs) is unbiased (representative)?

Author Response

Thank you for your review. Please find below our answers to your questions:

Q1: The authors trust the SEM to estimate the radiation level and move from one configuration to another. As the SEM is in the configuration memory of the FPGA, what happens when radiation affects the SEM itself or the related routing?

As stated in the introduction, the fact that the SEM is in the configuration memory is indeed an issue if radiation affects this IP. In this case, soft errors will accumulate in the design and trigger a computation or system failure. Thus, an upper layer is necessary to mitigate these errors. In our platform, the monitoring processor will reset the system and a fallback configuration will be loaded from the Flash memory in order to restore the design. This clarification has been added to the Discussion section.

Q2: What is the flow followed by the SEM to move from one configuration to another? Does it change with the orbit? A description of the full routine would be interesting, indicating the parameters that are measured and the flow diagram followed.

The SEM Controller provides access to a counter indicating how many soft errors have been detected in the design. If this counter value goes above a given threshold, the monitoring processor stops the application, reconfigures the TMR partitions from High-Performance mode to Safe mode, and then resets the counter. For the moment, the refresh period of the counter and the threshold are set arbitrarily. In future work, we intend to integrate a precise model of the radiation environment in Low Earth Orbit and use these values to fix the threshold and the period. The actions performed when an alarm is triggered by the soft-error counter have been added to the monitoring diagram (Figure 6).
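The decision logic of the monitoring processor can be sketched as follows (a minimal illustration only: the type and function names are ours, not the actual firmware API, and the threshold is a placeholder):

```c
typedef enum { P_MODE, S_MODE } exec_mode_t;  /* High-Performance / Safe */

/* Decide the next execution mode from the SEM soft-error counter.
 * If the counter exceeds the threshold while in High-Performance mode,
 * switch to Safe mode and request a reset of the counter. */
exec_mode_t next_mode(exec_mode_t current, unsigned soft_errors,
                      unsigned threshold, unsigned *reset_counter)
{
    *reset_counter = 0;
    if (current == P_MODE && soft_errors > threshold) {
        *reset_counter = 1;  /* clear the SEM counter after the switch */
        return S_MODE;       /* reconfigure the TMR partitions */
    }
    return current;
}
```

In the actual platform, the threshold and the counter refresh period would be derived from the Low Earth Orbit radiation model mentioned above rather than set arbitrarily.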

Q3: In addition to Fig. 9, a deeper explanation of how the injection campaign was performed is required. The frame addresses are randomly chosen, but how are the errors injected? One by one, or multiple in the same clock cycle?

The errors are injected at a period of 100 ms, i.e., one error every 100 ms. In future work, we will use a random distribution to choose when a fault is injected, but in this case our goal was to keep the same injection scheme for every test run. This explanation has been added in sub-section 3.2 (Platform Reliability) and in the Discussion section.
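The fixed-period scheme amounts to a deterministic schedule of injection instants, which can be sketched as follows (an illustrative helper, not code from our platform):

```c
#include <stddef.h>

/* Fill 'times_ms' with the injection instants of the fixed-period
 * scheme: one fault every 'period_ms', the first one occurring one
 * period after the startup phase completes. */
void injection_schedule(unsigned *times_ms, size_t n,
                        unsigned startup_ms, unsigned period_ms)
{
    for (size_t i = 0; i < n; i++)
        times_ms[i] = startup_ms + (unsigned)(i + 1) * period_ms;
}
```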

Q4: Are the errors injected at the beginning of the experiment or at a different time? In the latter case, how is the time chosen?

The errors are injected at the beginning of the experiment, just after the application processor startup, once the reconfigurable partitions have been correctly configured for the execution mode (i.e., with the TMR version of the IP or a single instance of each processing element). This clarification has been added in sub-section 3.2 (Platform Reliability).

Q5: Once the frame addresses are randomly chosen, are they the same for all the experiments? Likewise, is the time at which the errors are injected the same for all the experiments?

Since the application running on the platform is deterministic, a different set of addresses was needed for each run in order to measure the reliability. To do so, a random seed is provided by the PC connected over the UART link at the beginning of each run; this seed is used by the platform to generate the random addresses. The injection period and the startup time are the same for all the experiments. These points have been clarified in sub-section 2.5 (Fault Injection Mechanism) and in sub-section 3.2 (Platform Reliability).
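The seeding scheme can be sketched as follows (a simple xorshift generator stands in for the platform's actual PRNG, which we do not detail here; all names are illustrative). The same seed always reproduces the same frame-address sequence, while a new seed per run yields a fresh set of targets:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal xorshift32 PRNG (illustrative stand-in). */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Generate the pseudo-random frame addresses for one run from the
 * seed sent by the PC over the UART link. */
void frame_addresses(uint32_t seed, uint32_t *addrs, size_t n,
                     uint32_t addr_space)
{
    uint32_t state = seed ? seed : 1;  /* xorshift must not start at 0 */
    for (size_t i = 0; i < n; i++)
        addrs[i] = xorshift32(&state) % addr_space;
}
```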

Q6: Finally, it is not clear whether the number of experiments is representative for this large design. Could you provide a statistical analysis showing that the sample (number of experiment runs) is unbiased (representative)?

This is an interesting question. We chose the experimental parameters (fault injection period, targeted areas, and application) to be constant across modes in order not to favor one mode over another. Moreover, the number of runs was large enough that the count of surviving runs dropped close to zero past a given number of injected faults, giving a complete reliability curve, which was the main criterion for validating the experiment. We will provide a statistical analysis of the data as soon as possible. Additionally, the samples of the measured data will be uploaded to a public storage repository (https://yareta.unige.ch/) and made available very soon.
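For exposition, an empirical reliability curve of this kind can be derived from the run logs as follows (the function and data layout are purely illustrative, not our analysis scripts):

```c
#include <stddef.h>

/* Build an empirical reliability curve from per-run failure points:
 * fails[r] = number of injected faults after which run r failed.
 * rel[k]   = fraction of runs still operational after k faults,
 *            for k = 0 .. max_faults. */
void reliability_curve(const unsigned *fails, size_t runs,
                       double *rel, unsigned max_faults)
{
    for (unsigned k = 0; k <= max_faults; k++) {
        size_t alive = 0;
        for (size_t r = 0; r < runs; r++)
            if (fails[r] > k)
                alive++;
        rel[k] = (double)alive / (double)runs;
    }
}
```

The experiment is considered complete when the tail of this curve approaches zero, i.e., enough faults were injected for (almost) every run to eventually fail.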

Reviewer 2 Report

Minor recommendations for best reading:

-Name coherence: "ARM-A53" is used in one place and "Cortex-A53" is used in the rest of the article

-Abbreviation RP (Reconfigurable Partitions) is not defined at its first occurrence, nor in the final list of abbreviations.

-When introducing the mode names "Basic mode (B_Mode)", "Safe mode (S_Mode)", and "High Performance mode (P_Mode)", the "TMR mode" is not mentioned; it has to be inferred.

-Name coherence: "Safe mode" and "safe-mode"; keep the same name.

Author Response

Thank you for your review. Please find below the answers to your comments:

Q1: Name coherence: "ARM-A53" is used in one place and "Cortex-A53" is used in the rest of the article

"ARM-A53" has been replaced with "Cortex-A53"

Q2: Abbreviation RP (Reconfigurable Partitions) is not defined at its first occurrence, nor in the final list of abbreviations.

The abbreviation has been defined at its first occurrence (Introduction) and added to the final list of abbreviations.

Q3: When introducing the mode names "Basic mode (B_Mode)", "Safe mode (S_Mode)", and "High Performance mode (P_Mode)", the "TMR mode" is not mentioned; it has to be inferred.

The TMR mode is now mentioned before its first use in sub-section 3.2 (Platform Reliability) to avoid the need for inference.

Q4: Name coherence: "Safe mode" and "safe-mode"; keep the same name.

"safe-mode" has been replaced with "Safe mode" in several sections of the document.

Reviewer 3 Report

The topic of the paper is both important and interesting, as both nanosatellites and the application of programmable hardware, specifically FPGAs, are "hot topics" and important for the application of space technologies. The technical content of the paper is fine and the proposed design seems functional and interesting. However, I believe that the paper needs some improvements. It would be nice to address the following:

  • I believe that despite the technical description and reported results, the paper needs to contain some comparison of the results to the state of the art. Please, include some form of comparison to the state of the art (e.g. in figures) or at least some description of "what is beyond the state of the art".
  • I do not see much point in the comparison of the power consumption in safe, high-performance, and low-power modes because the figures are only very marginally different. Does it have any reasonable practical meaning? Or is this to demonstrate that the power consumption increase is small? If so, please tell this explicitly.
  • It is not very clear how/whether reconfiguration can really be exploited on the presented board - the paper, in fact, says that the reconfiguration would be hazardous. If it can be used, please state how and under what conditions; if not, please tell this explicitly.
  • The reliability of the system is evaluated by injecting faults into the configuration memory at a period of 100 ms. Please discuss in the paper i) whether this really resembles the conditions in real usage of the board, and ii) whether this is really the only or the major threat to the system (how about flip-flops, LUTs, RAM)?
  • I believe that the references are really sketchy and must be improved. (I am not sure whether just a web page link is a good reference).
  • Finally, I believe that some means of verification of the results should be provided, e.g. more technical details of the design or, better, some downloadable file with the configurations for FPGAs.

I believe that the changes will improve the information for the readers and impression from the paper.

Author Response

Thank you for your review. Please find below the answers to your questions:

Q1: I believe that despite the technical description and reported results, the paper needs to contain some comparison of the results to the state of the art. Please, include some form of comparison to the state of the art (e.g. in figures) or at least some description of "what is beyond the state of the art".

Information has been added to the Introduction section to explain what goes beyond the state of the art.

Q2: I do not see much point in comparison of the power consumption in safe, high performance, and low power modes because the figures are only very marginally different. Does it have any reasonable practical meaning? Or is this to demonstrate that the power consumption increase is small? If so, please, tell this explicitly.

The point was to know whether switching to the Safe mode would lead to higher power consumption than the High-Performance mode, because of the resource usage of the triplicated IPs. Both modes turn out to be comparable in terms of power consumption, so there is not much penalty in being in one mode or the other. Additional information has been added to the Discussion section to make this explicit.

Q3: It is not very clear how/whether reconfiguration can really be exploited on the presented board - the paper, in fact, says that the reconfiguration would be hazardous. If it can be used, please state how and under what conditions; if not, please tell this explicitly.

Reconfiguration can be hazardous if the count of soft errors is high, but if it is low and increasing, reconfiguration can occur to switch to the Safe mode. When reconfiguring, the monitoring processor performs a readback of the configuration memory to ensure that the partition has been correctly reconfigured. In future work, we expect to couple the information provided by the SEM Controller with a model of the fault rate in Low Earth Orbit. This would allow us to know whether we are entering an area highly exposed to radiation; in that case, as a preventive measure, it will be possible to switch directly to the Safe mode. A paragraph has been added to the Discussion section to clarify this point.
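The guarded mode switch can be sketched as follows (byte buffers model the partition's configuration frames; the function name and retry policy are illustrative, not the actual configuration driver API):

```c
#include <string.h>
#include <stddef.h>

/* Reconfigure a partition, then read back the configuration frames
 * and compare them against the intended bitstream. Returns 1 when the
 * readback matches; after max_retries failed attempts, the caller
 * falls back to the golden configuration stored in Flash. */
int safe_reconfigure(unsigned char *config_mem,
                     const unsigned char *bitstream, size_t len,
                     int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        memcpy(config_mem, bitstream, len);           /* write frames   */
        if (memcmp(config_mem, bitstream, len) == 0)  /* readback check */
            return 1;
    }
    return 0;  /* signal the monitor to load the fallback from Flash */
}
```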

Q4: The reliability of the system is evaluated by injecting faults into the configuration memory at a period of 100 ms. Please discuss in the paper i) whether this really resembles the conditions in real usage of the board, and ii) whether this is really the only or the major threat to the system (how about flip-flops, LUTs, RAM)?

As stated in the Discussion section, the period of 100 ms is very pessimistic compared to reality, so the lifetime of the system before a failure will be better than during our tests. Regarding the configuration memory: on Xilinx FPGAs, this means the active configuration memory of the SRAM-based FPGA, including the flip-flops, the LUTs, and the routing resources. Currently, the internal RAM (Block RAM) is not taken into account by the mitigation mechanisms, but this is planned for future work. This last clarification has been added to the Discussion section.

Q5: I believe that the references are really sketchy and must be improved. (I am not sure whether just a web page link is a good reference).

Web page links are used to reference space-industry companies that were interviewed by the industrial partner of the project to understand the state of the art in industry. The related document is confidential and thus cannot be shared publicly; these links can be removed if preferred. The other web page references point to technical documents useful for understanding the different tools and modules used to build the validation platform.

Q6: Finally, I believe that some means of verification of the results should be provided, e.g. more technical details of the design or, better, some downloadable file with the configurations for FPGAs.

The design is the output of a collaboration with an industrial partner, so we cannot share it, but we can share the output logs of the experiments used to generate the curves, along with the detailed protocol used to obtain these data. The Results section has been completed with this protocol. The data will be uploaded to a public storage repository (https://yareta.unige.ch/) very soon.

Round 2

Reviewer 1 Report

The authors have addressed all my questions.
