Fault-Tolerant Hardware Acceleration for High-Performance Edge-Computing Nodes
Abstract
1. Introduction
- Filling the gap in the literature on fault-tolerant hardware acceleration in edge-computing devices.
- Demonstrating how a full-TMR hardware accelerator works, describing its architecture, and validating its functionality with extensive fault-injection (FI) tests.
- Demonstrating that, thanks to the inherent behavior of an Interleaved-Multi-Threading structure, a full-TMR hardware accelerator can be converted into a single-buffered accelerator without degrading system reliability, while decreasing hardware overhead, cost, and power consumption.
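The triple modular redundancy (TMR) named in these contributions rests on majority voting over three replicated results. The following is a minimal sketch of that idea only, assuming plain integer words and a bitwise 2-of-3 voter; the names `tmr_vote` and `flip_bit` are hypothetical and are not taken from the accelerator described in this article.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three replicated words."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single bit upset by inverting one bit of a word (assumed fault model)."""
    return word ^ (1 << bit)


golden = 0xDEADBEEF
faulty = flip_bit(golden, 7)              # one of the three replicas is corrupted
voted = tmr_vote(golden, faulty, golden)  # the two healthy replicas outvote it
print(hex(golden), hex(faulty), hex(voted))
```

A single corrupted replica is masked by the vote. The same bit corrupted in two replicas is not, which is why accumulated faults matter for any voted structure.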
2. Related Works
- An in-depth analysis of the idea of the replicated VCU, with an expanded and detailed description;
- The introduction of the hybrid-VCU mode, together with an analysis of the impact of memory-scrubbing techniques assumed within the accelerator scratchpad memory;
- A novel, detailed analysis of fault-resilience performance on algorithms of practical interest in hardware acceleration, such as FFT and Matmul.
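Memory scrubbing, in general terms, means periodically re-reading stored words, correcting any that have been corrupted, and writing the corrected value back so that faults do not accumulate. The sketch below illustrates that general idea under a loud assumption: the memory is modeled as three software copies repaired by majority vote. This section does not describe how the hybrid-VCU scratchpad is actually organized or scrubbed, so `ScrubbedMemory`, `inject`, `scrub`, and `run` are hypothetical names for illustration only.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class ScrubbedMemory:
    """Toy memory kept as three copies (an assumption, not the paper's design)."""

    def __init__(self, words):
        self.copies = [list(words) for _ in range(3)]

    def inject(self, copy: int, addr: int, bit: int) -> None:
        """Flip one bit in one copy to model a single upset."""
        self.copies[copy][addr] ^= 1 << bit

    def read(self, addr: int) -> int:
        """Return the voted value of the three copies at this address."""
        return majority(*(c[addr] for c in self.copies))

    def scrub(self) -> int:
        """Rewrite every copy with the voted value; return the number of repairs."""
        repaired = 0
        for addr in range(len(self.copies[0])):
            good = self.read(addr)
            for c in self.copies:
                if c[addr] != good:
                    c[addr] = good
                    repaired += 1
        return repaired


def run(scrub_between_faults: bool) -> int:
    """Hit the same bit of the same word in two different copies, one after the other."""
    mem = ScrubbedMemory([0x11, 0x22, 0x33])
    mem.inject(copy=0, addr=1, bit=4)
    if scrub_between_faults:
        mem.scrub()
    mem.inject(copy=1, addr=1, bit=4)
    return mem.read(1)


print(hex(run(True)), hex(run(False)))
```

In this toy model, a scrub pass between the two upsets repairs the first one, so the second is still outvoted. Without the scrub, the two upsets agree with each other and the vote returns the corrupted word.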
3. Proposed Approach
3.1. Methodology
3.2. Fault-Tolerant Scalar-Core Microarchitecture
3.3. Fault-Tolerant Vector Co-Processor
4. Impact on Hardware Resources
5. Validation
5.1. Failure Probability Estimation with Time Frame Spanning FI
5.2. Monte Carlo FI Simulation to Analyze Memory-Scrubbing Impact
6. Comparison with Existing Fault-Tolerance Techniques
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Panella, M.; Re, M.; Rosato, A.; Spanò, S. A Parallel Hardware Implementation for 2-D Hierarchical Clustering Based on Fuzzy Logic. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 1428–1432.
- Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Ricci, A.; Spanò, S. An FPGA-based multi-agent Reinforcement Learning timing synchronizer. Comput. Electr. Eng. 2022, 99, 107749.
- Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Nannarelli, A.; Re, M.; Spanò, S. A pseudo-softmax function for hardware-based high speed image classification. Sci. Rep. 2021, 11, 15307.
- Barbirotta, M.; Cheikh, A.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. Analysis of a Fault Tolerant Edge-Computing Microarchitecture Exploiting Vector Acceleration. In Proceedings of the 2022 17th Conference on Ph.D Research in Microelectronics and Electronics (PRIME), Villasimius, Italy, 12–15 June 2022; pp. 237–240.
- Barbirotta, M.; Mastrandrea, A.; Cheikh, A.; Menichelli, F.; Olivieri, M. Improving SET Fault Resilience by Exploiting Buffered DMR Microarchitecture. In Proceedings of the SIE 2022: 53rd Annual Meeting of the Italian Electronics Society, Pizzo, Italy, 7–9 September 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 233–238.
- Khalid, U.; Mastrandrea, A.; Olivieri, M. Novel approaches to quantify failure probability due to process variations in nano-scale CMOS logic. In Proceedings of the 2014 29th International Conference on Microelectronics Proceedings-MIEL 2014, Belgrade, Serbia, 12–14 May 2014; pp. 371–374.
- Barbirotta, M.; Cheikh, A.; Mastrandrea, A.; Menichelli, F.; Vigli, F.; Olivieri, M. A Fault Tolerant soft-core obtained from an Interleaved-Multi-Threading RISC-V microprocessor design. In Proceedings of the 2021 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Athens, Greece, 6–8 October 2021; pp. 1–4.
- Barbirotta, M.; Cheikh, A.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. Design and Evaluation of Buffered Triple Modular Redundancy in Interleaved-Multi-Threading Processors. IEEE Access 2022, 10, 126074–126088.
- Barbirotta, M.; Cheikh, A.; Mastrandrea, A.; Menichelli, F.; Ottavi, M.; Olivieri, M. Evaluation of Dynamic Triple Modular Redundancy in an Interleaved-Multi-Threading RISC-V Core. J. Low Power Electron. Appl. 2022, 13, 2.
- Cheikh, A.; Sordillo, S.; Mastrandrea, A.; Menichelli, F.; Olivieri, M. Efficient mathematical accelerator design coupled with an interleaved multi-threading RISC-V microprocessor. In Proceedings of the Applications in Electronics Pervading Industry, Environment and Society: APPLEPIES 2019, Pisa, Italy, 11–13 September 2019; Springer: Cham, Switzerland, 2020; pp. 529–539.
- Cheikh, A.; Sordillo, S.; Mastrandrea, A.; Menichelli, F.; Scotti, G.; Olivieri, M. Klessydra-T: Designing Vector Coprocessors for Multithreaded Edge-Computing Cores. IEEE Micro 2021, 41, 64–71.
- Moghaddam, M.T.; Muccini, H. Fault-tolerant IoT. In Proceedings of the International Workshop on Software Engineering for Resilient Systems, Naples, Italy, 17 September 2019; Springer: Cham, Switzerland, 2019; pp. 67–84.
- Power, A.; Kotonya, G. A Microservices Architecture for Reactive and Proactive Fault Tolerance in IoT Systems. In Proceedings of the 2018 IEEE 19th International Symposium on “A World of Wireless, Mobile and Multimedia Networks” (WoWMoM), Chania, Greece, 12–15 June 2018; pp. 588–599.
- Ibrahim, M.; Baloch, N.K.; Anjum, S.; Zikria, Y.B.; Kim, S.W. An energy efficient and low overhead fault mitigation technique for internet of thing edge devices reliable on-chip communication. Softw. Pract. Exp. 2021, 51, 2393–2410.
- Zielinski, Z.; Wrona, K.; Furtak, J.; Chudzikiewicz, J. Reliability and Fault Tolerance Solutions for MIoT. IEEE Commun. Mag. 2021, 59, 36–42.
- Bertoa, T.G.; Gambardella, G.; Fraser, N.J.; Blott, M.; McAllister, J. Fault Tolerant Neural Network Accelerators with Selective TMR. IEEE Des. Test 2022, 40, 67–74.
- Tuli, S.; Casale, G.; Jennings, N.R. PreGAN: Preemptive Migration Prediction Network for Proactive Fault-Tolerant Edge Computing. In Proceedings of the IEEE INFOCOM, Online, 2–5 May 2022; pp. 670–679.
- Dong, B.; Wang, Z.; Chen, W.; Chen, C.; Yang, Y.; Yu, Z. OR-ML: Enhancing Reliability for Machine Learning Accelerator with Opportunistic Redundancy. In Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 1–5 February 2021; pp. 739–742.
- Zhang, J.J.; Basu, K.; Garg, S. Fault-Tolerant Systolic Array Based Accelerators for Deep Neural Network Execution. IEEE Des. Test 2019, 36, 44–53.
- Zheng, Z.; Zhou, T.C.; Lyu, M.R.; King, I. Component Ranking for Fault-Tolerant Cloud Applications. IEEE Trans. Serv. Comput. 2012, 5, 540–550.
- Javed, A.; Heljanko, K.; Buda, A.; Framling, K. CEFIoT: A fault-tolerant IoT architecture for edge and cloud. In Proceedings of the 2018 IEEE 4th World Forum on Internet of Things (WF-IoT), Singapore, 5–8 February 2018; pp. 813–818.
- Khan, W.Z.; Ahmed, E.; Hakak, S.; Yaqoob, I.; Ahmed, A. Edge computing: A survey. Future Gener. Comput. Syst. 2019, 97, 219–235.
- Rossi, D.; Conti, F.; Marongiu, A.; Pullini, A.; Loi, I.; Gautschi, M.; Tagliavini, G.; Capotondi, A.; Flatresse, P.; Benini, L. PULP: A parallel ultra low power platform for next generation IoT applications. In Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS), Cupertino, CA, USA, 22–25 August 2015; pp. 1–39.
- Barbirotta, M.; Mastrandrea, A.; Menichelli, F.; Vigli, F.; Blasi, L.; Cheikh, A.; Sordillo, S.; Gennaro, F.D.; Olivieri, M. Fault resilience analysis of a RISC-V microprocessor design through a dedicated UVM environment. In Proceedings of the 33rd IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2020, Frascati, Italy, 19–21 October 2020.
- George, N.; Elks, C.R.; Johnson, B.W.; Lach, J. Transient fault models and AVF estimation revisited. In Proceedings of the 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN), Chicago, IL, USA, 28 June–1 July 2010; pp. 477–486.
- Waterman, A.; Lee, Y.; Patterson, D.A.; Asanović, K. The RISC-V Instruction Set Manual. Volume 1: User-Level ISA, Version 2.0; Technical Report; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley: Berkeley, CA, USA, 2014.
- Aranda, L.A.; Wessman, N.J.; Santos, L.; Sánchez-Macián, A.; Andersson, J.; Weigand, R.; Maestro, J.A. Analysis of the critical bits of a RISC-V processor implemented in an SRAM-based FPGA for space applications. Electronics 2020, 9, 175.
- Wilson, A.E.; Wirthlin, M. Neutron radiation testing of fault tolerant RISC-V soft processor on Xilinx SRAM-based FPGAs. In Proceedings of the 2019 IEEE Space Computing Conference (SCC), Pasadena, CA, USA, 30 July–1 August 2019; pp. 25–32.
- Ramos, A.; Toral, R.G.; Reviriego, P.; Maestro, J.A. An ALU protection methodology for soft processors on SRAM-based FPGAs. IEEE Trans. Comput. 2019, 68, 1404–1410.
- Santos, D.A.; Luza, L.M.; Dilillo, L.; Zeferino, C.A.; Melo, D.R. Reliability analysis of a fault-tolerant RISC-V system-on-chip. Microelectron. Reliab. 2021, 125, 114346.
Architecture | | FPGA Synthesis Results | | | | | | Perf |
---|---|---|---|---|---|---|---|---|
Core | HW | FF | LUT | B-RAM | DSP | LUT-RAM | Freq. (MHz) | DLP |
T13 | MIMD + SIMD | 4712 | 15,943 | 18 | 19 | 264 | | 2 |
T13 | MIMD + SIMD | 6753 | 25,089 | 18 | 31 | 264 | 120 | 4 |
T13 | MIMD + SIMD | 10,854 | 43,419 | 36 | 55 | 264 | | 8 |
fT13 repl. | SIMD | 9017 | 20,174 | 48 | 31 | 0 | 108 | 2 |
fT13 repl. | SIMD | 11,671 | 33,250 | 48 | 55 | 0 | 102 | 4 |
fT13 repl. | SIMD | 17,006 | 53,198 | 48 | 103 | 0 | 90 | 8 |
fT13 hyb. | SIMD | 7795 | 20,658 | 48 | 15 | 0 | 105 | 2 |
fT13 hyb. | SIMD | 9177 | 27,230 | 48 | 23 | 0 | 97 | 4 |
fT13 hyb. | SIMD | 12,117 | 50,913 | 48 | 39 | 0 | 88 | 8 |
T03 | - | 1418 | 4281 | 0 | 7 | 176 | | - |
fT03 | - | 4910 | 6670 | 0 | 0 | 0 | 200 | - |
fft | Matmul | ||||
---|---|---|---|---|---|
Core | fT13-hyb | ||||
Total clock cycles | 91182 | 95925 | 144639 | 220490 | 227344 |
# frames | 10 | 10 | |||
Faults/frame | 225 | 237 | 357 | 545 | 561 |
Deterministic fault rate | 1 every 40 cycles | 1 every 40 cycles |
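The faults-per-frame figures in the table above can be roughly cross-checked against the other rows. The sketch below assumes that each run is split into 10 equal-length frames and that one fault is injected every 40 cycles within a frame. The exact accounting is not given in this section, so the function name and the equal-frame assumption are illustrative only. Under these assumptions the estimate lands within about 2% of the reported values, slightly above them in every column.

```python
# Values copied from the table above.
TOTAL_CYCLES = [91182, 95925, 144639, 220490, 227344]
REPORTED_FAULTS_PER_FRAME = [225, 237, 357, 545, 561]
FRAMES = 10            # "# frames" row
CYCLES_PER_FAULT = 40  # "1 every 40 cycles"


def estimated_faults_per_frame(total_cycles: int,
                               frames: int = FRAMES,
                               period: int = CYCLES_PER_FAULT) -> float:
    """Estimate faults per frame, assuming equal-length frames (an assumption)."""
    frame_length = total_cycles / frames
    return frame_length / period


for cycles, reported in zip(TOTAL_CYCLES, REPORTED_FAULTS_PER_FRAME):
    estimate = estimated_faults_per_frame(cycles)
    gap = 100.0 * (estimate - reported) / reported
    print(f"{cycles:>7} cycles: estimated {estimate:6.1f}, reported {reported}, gap {gap:4.1f}%")
```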
Work | Core | FT Techniques | FT Units | Reported FT Results |
---|---|---|---|---|
[28] | Taiga | Distributed TMR | Configuration memory | 4.1% NFP |
[27] | FT Rocket | Distributed TMR | Entire microarchitecture | 2.3% NFP |
[29] | FT lowRISC | Application-tailored TMR | ALU | from 40% to 5.3% |
[30] | Ad hoc core | Partial TMR and Hamming | ALU, PC, Reg.File, control logic | 7.7% NFP |
[8] | Klessydra-fT03 | Buffered TMR | Reg.File, Write-Back Unit, PC, Load–Store Unit | 2.9% NFP |
This work | Klessydra-fT13 | Buffered TMR + scrubbing | Reg.File, Write-Back Unit, PC, Load–Store Unit + VCU | 1.53% NFP |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Barbirotta, M.; Cheikh, A.; Mastrandrea, A.; Menichelli, F.; Angioli, M.; Jamili, S.; Olivieri, M. Fault-Tolerant Hardware Acceleration for High-Performance Edge-Computing Nodes. Electronics 2023, 12, 3574. https://doi.org/10.3390/electronics12173574