Article

ALPRI-FI: A Framework for Early Assessment of Hardware Fault Resiliency of DNN Accelerators

Department of Electrical and Computer Engineering, McMaster University, Hamilton, ON L8S 4L8, Canada
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3243; https://doi.org/10.3390/electronics13163243
Submission received: 10 July 2024 / Revised: 2 August 2024 / Accepted: 11 August 2024 / Published: 15 August 2024

Abstract

Understanding how faulty hardware affects machine learning models is important to both safety-critical systems and the cloud infrastructure. Since most machine learning models, like Deep Neural Networks (DNNs), are highly computationally intensive, specialized hardware accelerators are developed to improve performance and energy efficiency. Evaluating the fault resilience of these DNN accelerators during early design and implementation stages provides timely feedback, making it less costly to revise designs and address potential reliability concerns. To this end, we introduce Architecture-Level Pre-Register-Transfer-Level Implementation Fault Injection (ALPRI-FI), which is a comprehensive framework for assessing the fault resilience of DNN models deployed on hardware accelerators.

1. Introduction

Over the past decade, the field of machine learning has experienced a resurgence, which is primarily driven by the availability of large amounts of training data and computational power. These advancements have enabled the validation and practical application of machine learning models in real-world scenarios that were previously impractical to explore. Deep Neural Networks (DNNs), considered the most significant subfield of machine learning in recent years, have played a crucial role in advancing areas such as computer vision, speech recognition, and natural language processing.
As the adoption of DNNs continues to grow, so does the development of custom hardware accelerators [1,2,3,4]. DNN models often involve billions of Multiply–Accumulate (MAC) operations per inference, making their computations computationally expensive. Additionally, DNNs are now being adopted in form factor-constrained environments, requiring a combination of high throughput, low latency, and energy efficiency. Therefore, relying solely on conventional processors for DNN execution is inadequate, highlighting the need for dedicated hardware accelerators, which are specialized to enhance the performance of the computational patterns inherent to the DNN workloads.
Despite the recent successes in the consumer domain of machine learning, there is still more work to enhance the reliability of DNN-enabled systems, particularly when deployed in safety-critical scenarios, such as health and autonomous driving. An often overlooked aspect is that even robustly validated models may fail to perform correctly on defective hardware. Consequently, understanding the relationship between a new hardware accelerator and the DNN models it is expected to support in the presence of hardware faults is increasingly important alongside other assessment criteria for hardware accelerators, such as performance and energy efficiency. Importantly, developing this understanding during the design space exploration phase can have a significant impact by allowing early design revisions before a full RTL implementation is carried out.
Reliability is particularly important for safety-critical applications and industrial domains where hardware accelerators are designed to run for extended periods, such as aerospace. Recent studies have demonstrated a high sensitivity of DNN inference accuracy to applications running on faulty hardware. The study in [5] revealed that the fault resiliency of the hardware is influenced by both the hardware architecture and its data reuse patterns. Transient bit flips can occur in flip-flops (FFs) and other memory structures, resulting in temporary but detrimental effects [5,6,7,8,9,10,11]. Concurrently, the growing use of machine learning technologies in fields that require hardware with long lifespans has motivated an interest in exploring how permanent hardware faults affect system-level behavior [11,12,13,14]. Although these studies have emphasized the importance of assessing the reliability of DNN hardware, their scope is limited to errors in memory and/or storage elements within the datapath. Due to the inherent trade-off between the types and number of faults that can be evaluated and the length of fault injection experiments, there is an underexplored area in understanding how permanent hardware faults in large-scale arrays of processing engines can impact the accuracy of DNN inference. This challenge is particularly pronounced in designs that feature large-scale arrays of arithmetic units, such as those described in [2], where even register-transfer-level (RTL) simulations are expensive [15]. Additionally, it is important to conduct this type of fault assessment experiment at an early stage of design space exploration to ensure that design revisions are made in a timely manner.
Arithmetic units can require 27–40% of the DNN accelerator hardware area [2,3,12,16]; however, comparatively few works in the literature evaluate, at the DNN application level, the resiliency toward faults that are internal to arithmetic units. The work in [17] established how, under simple DNN workloads, faults in integer multipliers cause deviations in the output of the arithmetic units. However, further investigation is needed to understand how these deviations impact the accuracy of DNN inference at the application level. As shown in Figure 1, there are different stages of the chip development process during which fault resiliency assessment can be performed (a more detailed description of this figure is provided in Section 4). It is important to clarify that the term Early, used in the rest of this work, refers to the system-level design exploration stages before the hardware architecture is fully developed and implemented.
In addition to the motivation provided above, recent works have reported increases in Silent Data Corruption (SDC) [18,19]. The assumption that hardware reliability is safeguarded by post-fabrication testing, which detects and isolates fabrication defects before they can cause impactful data corruption, is now harder to maintain than before. SDC can manifest long after the initial installation of cloud infrastructure; for example, defects in mercurial cores [18] can impact the computed result long after the error first occurred, making it even harder to localize the error in a fleet of compute nodes or in long workloads. Several factors contribute to this: the widespread demand from the DNN community to pack more processing engines onto the same chip, staying on track with Moore's law at smaller transistor feature sizes, the increasing interest in decentralized cloud computing, and the movement toward adopting machine learning technologies in safety-critical applications. For these reasons, there is a higher chance that latent defects reach consumers [20]. Mitigating the impact of faults in functional blocks, such as arithmetic units, is much less developed than dealing with memory faults because of a lack of error correction mechanisms; this is reinforced by the study in [19], which found that permanent faults in functional blocks occur at a higher rate than soft errors. To this end, cloud providers need to better understand the underlying problems and find solutions to deal with SDC [18,19]. A major research direction to address these issues is to use more accurate models during the premanufacturing verification stages. While several works address transient faults from this perspective, in terms of scale, negligible initiatives focus on permanent faults in processing engines. One possible reason for the limited contributions in this niche is the challenge of scaling frameworks for fault resiliency assessment and mitigation to larger hardware systems and workloads. Therefore, in this work, we propose a solution that overcomes these challenges by carrying out fault resiliency assessment early in the design process, when it is less expensive to carry out design revisions.
One key point worth articulating is that accurate models of the arithmetic units are only available during the later stages of the development process. For example, circuit models of arithmetic units are not available at RTL [21]. While gate-level models can replace selected RTL modules [22], the RTL simulation speeds [15] can still pose challenges for carrying out fault injection experiments for very large accelerators. One way to address the above concerns is to use accurate models of the hardware components at very high levels of abstraction, as detailed in this paper, before the RTL description has been developed. This strategy not only accelerates fault injection experiments in arithmetic units but also provides early feedback to system designers about the fault resiliency of their architecture before RTL implementation.
In this work, accurate circuit models of targeted hardware components, i.e., arithmetic units, are utilized to facilitate fault assessment integration within High-Performance Computing (HPC) DNN frameworks, e.g., [23]. This setup provides an abstracted interface for running DNN inference tasks while enabling fault injection configurations across a diverse range of fault models and hardware architectures, as elaborated in Section 5. By increasing the abstraction level for fault injection experiments, there is a limited spectrum of the types of faulty behaviors that can be assessed. However, given that embedded memories and arithmetic units are predominant in DNN accelerators, this focus is justified. While system-level fault injection experiments in memory have been previously investigated, the effects of faults within large arrays of processing engines on DNN inference at the application level remain largely underexplored.
By using precise circuit models for arithmetic units, hardware designers are able to make decisions and revisions before the full implementation is complete. A key challenge that needs to be addressed is how to seamlessly incorporate fault injection and modeling custom hardware accelerator architectures within HPC frameworks, such as PyTorch [23], which have been optimized for peak application performance. The framework must also provide a broad spectrum of fault injection capabilities and remain adaptable to various hardware configurations. To summarize, the main contributions of this study are outlined below:
  • We propose a DNN simulation framework designed to assess the fault resilience of hardware accelerators. This framework utilizes data streaming patterns known prior to RTL implementation and integrates gate-level accurate arithmetic circuit models into our high-level model. The methodology enables designers to evaluate the resilience of new DNN accelerators very early during the design space exploration phase, providing insights that can guide adjustments before more detailed and time-consuming implementations are finalized.
  • To validate the generality of our framework, we evaluate the fault resilience of different hardware accelerators from the literature for different DNN models. We provide a comparative analysis in terms of fault resilience that, to our knowledge, has not been reported in the literature for large-scale designs and models.
The rest of the paper is organized as follows. Section 2 outlines the key differences between our work and recent studies from the literature. Section 3 presents the terminology and core concepts from the literature needed to elaborate our methodology. Section 4 motivates performing fault assessment at a high level of abstraction before the RTL implementation is available. Section 5 describes the proposed fault assessment framework's design stages and structure. The experimental results, including using the framework with different state-of-the-art DNN models and hardware accelerators from the literature, are discussed in Section 6. Section 7 concludes this paper.

2. Related Works

DNN fault simulation frameworks need to support various features, such as modeling the target hardware, mapping DNN operations to hardware, and fault models. Developing a comprehensive simulation framework with such capabilities poses a challenge due to the processing demands of state-of-the-art DNN models and the implementation of accurate hardware models. Such complexities depend on the modeled hardware architecture and the fault scenarios that can be supported by the framework ([10,11,12,24,25,26,27,28,29,30,31]). For example, enabling transient faults in memory during simulation may require the usage of accurate models of the hardware only during the simulation periods when faults are activated ([10,25,31]).
Another type of fault to be considered for hardware resiliency assessment is permanent faults in the datapath ([11,12,24]). Unlike transient faults, modeling faults in the datapath requires more detailed information about the hardware architecture that needs to be captured in the hardware models. For instance, the mapping of operations to the processing engines, the data types used, the memory footprints, and the data-streaming patterns should be considered when modeling the datapath for fault resiliency assessment. A permanent fault in one of the inputs to an arithmetic unit will affect different operations and variables when running the DNN application. This impact is unique to the particular hardware architecture under study and how it is being used to process the different calculations during DNN inference.
The main challenge in modeling faults at this level in a comprehensive simulation framework is providing a flexible and configurable hardware modeling tool to allow hardware designers to assess different architectures while, at the same time, maintaining adequate simulation speeds. One option is using a compiled programming language, such as C++, to build a complete and efficient DNN fault resiliency assessment framework ([24]). However, supporting and extending such frameworks with emerging DNN architectures makes these frameworks less favorable due to the extra effort needed to rebuild them. Another approach is to utilize one of the existing HPC DNN frameworks and incorporate the hardware-specific and fault injection features into it ([24,31]).
In order to support faults in arithmetic units, a more detailed level of hardware modeling with various added complexities is required. Faults in the operands of operations of the datapath can be modeled by modifying the data at the output of tensor operations ([24,29]). In contrast, permanent stuck-at faults in arithmetic circuits cannot be generally represented at the boundaries of tensor operations as they alter the output based on both operand values and the fault site within the arithmetic circuit. Therefore, to support a broader range of fault models and hardware accelerator architectures, the hardware designers should have the ability to introduce faults at the level of logic gates and capture the mapping of operations from the DNN model to the arithmetic units within the accelerator.
To model fault injection in arithmetic units using HPC DNN frameworks, it is essential to consider that the hardware models for arithmetic units need to be integrated within the tensor or kernel routines. However, it should be noted, as further elaborated in Section 5, that modifying highly optimized tensor operations in an HPC DNN framework can lead to a substantial decrease in computational performance. In [27], a gate-level, stuck-at fault-injectable operator for a single neuron increases the simulation time by more than 25%. Replacing all the neurons in the DNN with fault-injectable neuron modules, based on the criteria in [27], is expected to increase the simulation time by several orders of magnitude. In [26], to enable acceleration using parallel-computing Graphics Processing Units (GPUs), DNN models of the gates' logic are used instead of logic operations to model the processing engines' circuitry. Scaling this type of method to larger hardware clusters, DNN models, or a higher number of classes may necessitate several iterations of feature-map extraction, training, and updating of the DNN models.
Some previous studies have aimed to balance simulation time and fault injection accuracy by integrating RTL-accurate models into high-performance DNN frameworks. Verilator [32] can convert RTL descriptions to C++ modules [30]. However, RTL models lack accurate gate-level representations of arithmetic units [25]. Additionally, creating an RTL model of the target architecture is necessary before conducting fault injection experiments, leading to increased development cycle time when design revisions are required. In contrast, this work attempts to bridge the gap between system-level design and gate-level simulation accuracy by using gate-level accurate arithmetic hardware models alongside configurable accelerator architecture models while maintaining manageable simulation speeds.
Figure 2 outlines the key differences between our work and recent studies from the literature, which are categorized by fault injection support, redesign cost, and simulation speed [11,12,25,26,27]. The work in [25] provides fault assessment for transient faults using RTL models of the hardware with more affordable simulation demands compared to studies supporting permanent faults in the datapath [11,12,24]. Although supporting faults within arithmetic units significantly increases the simulation time [26,27], studying this type of fault is important because arithmetic units occupy up to 40% of the silicon area [2,3,16] and, lacking the error correction mechanisms commonly used to protect against memory faults, such faults can lead to Silent Data Corruption [18,19].
Taking the above context into account, we introduce our framework, referred to as Architecture-Level Pre-Register-Transfer-Level Implementation Fault Injection (ALPRI-FI), which operates at a new level of abstraction for carrying out fault injection experiments. This framework allows for the integration of data streaming patterns, known before the RTL implementation, into existing DNN frameworks, such as PyTorch [23], while also providing the capability to inject faults in gate-level accurate models of the arithmetic circuitry. Although the scope of the fault injection experiments is restricted to arithmetic units for different hardware accelerator architectures, the scale of the experiments, as shown in Section 6, is more comprehensive than what has been reported in the literature. This framework also offers additional flexibility, enabling early-stage assessment of how certain architectural choices will impact reliability and allowing for revisions at an early design stage when they are less costly.

3. Background

This section presents the terminology and core concepts from the literature needed to introduce and elaborate our methodology.

3.1. Notation

The purpose of this section is to introduce the terminology used in the presented work. While the findings and approaches developed in this work are devoted to the fundamental operations of convolutional Deep Neural Network (CONV DNN) models, the same methods can be applied to other types of DNN models, such as transformers.
$$
O[i_m][d_h][d_w][c_o] = ACT\left( B[c_o] + \sum_{i_h=0}^{K_h-1}\sum_{i_w=0}^{K_w-1}\sum_{d_c=0}^{D_{in}-1} I[i_m][S_T \times d_h + i_h][S_T \times d_w + i_w][d_c] \times W[c_o][i_h][i_w][d_c] \right),
$$
$$
H_o = \frac{H_i - K_h + S_T}{S_T}, \qquad W_o = \frac{W_i - K_w + S_T}{S_T},
$$
$$
0 \le i_m < N_{Im},\quad 0 \le d_c < D_{in},\quad 0 \le c_o < D_{out},\quad 0 \le d_h < H_o,\quad 0 \le d_w < W_o \tag{1}
$$
Equation (1) evaluates the Output Feature Map (OFmap) O using the bias B, the Input Feature Map (IFmap) I, and the weights W. The OFmap O is a four-dimensional (4D) structure indexed by the image in the batch, the OFmap height and width positions, and the output channel. Zero padding is assumed for simplicity. Table 1 describes the structural parameters, and Figure 3 illustrates the operation. It is important to note that although the discussion in this work focuses on convolutional layers (CONV), the examples capture the intuition behind the operations performed in other DNN model types, such as transformers and MLP models.
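For clarity, the following is a minimal, unoptimized C++ sketch of the computation in Equation (1) for a single image with zero padding; the flattened tensor layouts, the loop order, and the use of ReLU for ACT are illustrative assumptions rather than the schedule of any particular accelerator.

#include <vector>
#include <algorithm>

// Reference for Equation (1): one image, zero padding, stride ST.
// Tensor layouts (flattened 1-D vectors) are illustrative assumptions:
//   I: [Hi][Wi][Din], W: [Dout][Kh][Kw][Din], B: [Dout], O: [Ho][Wo][Dout].
void conv_layer_reference(const std::vector<float>& I, const std::vector<float>& W,
                          const std::vector<float>& B, std::vector<float>& O,
                          int Hi, int Wi, int Din, int Kh, int Kw, int Dout, int ST) {
  const int Ho = (Hi - Kh + ST) / ST;
  const int Wo = (Wi - Kw + ST) / ST;
  O.assign(static_cast<size_t>(Ho) * Wo * Dout, 0.0f);
  for (int dh = 0; dh < Ho; ++dh)
    for (int dw = 0; dw < Wo; ++dw)
      for (int co = 0; co < Dout; ++co) {
        float acc = B[co];
        for (int ih = 0; ih < Kh; ++ih)
          for (int iw = 0; iw < Kw; ++iw)
            for (int dc = 0; dc < Din; ++dc)
              acc += I[((ST * dh + ih) * Wi + (ST * dw + iw)) * Din + dc] *
                     W[((co * Kh + ih) * Kw + iw) * Din + dc];
        // ACT(): ReLU is assumed here as the activation function.
        O[(dh * Wo + dw) * Dout + co] = std::max(acc, 0.0f);
      }
}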

3.2. DNN Acceleration

Various types of hardware accelerators have been introduced for DNN acceleration [2,3,4,16,33]. In the inference mode, data streaming through DNN layers occurs sequentially, with the output of one layer calculated before the next layer is activated. However, the calculations within each layer can often be performed in parallel using multiple data streams. Furthermore, the majority of DNN processing relies on Multiply–Accumulate (MAC) operations. These two characteristics of DNN processing serve as the primary motivation for designing hardware accelerators. Some hardware architectures are tightly optimized for specific tasks, such as convolution acceleration, to enhance the computational efficiency of targeted applications. Conversely, some hardware designers advocate for versatile, general-purpose DNN accelerators, which are engineered to support a wide array of DNN models.

3.3. Accelerating DNN Convolution

Most of the computational time in convolutional DNNs is devoted to processing CONV layers [34]. Consequently, custom accelerators have undergone extensive optimization for CONV layers, particularly in terms of data reuse, which can be classified into three main types:
  • Weight Stationary: Weight-stationary accelerators exploit weight reuse by storing the kernel weights in faster and smaller memories allocated for each PE. Examples of weight-stationary accelerators include refs. [2,35]. Equation (1) shows that processing each kernel necessitates scanning the IFmap along two or more dimensions (in this example, two dimensions). By reusing kernel weights across different scanning positions, the number of weight reloads from the main memory is reduced, resulting in savings in both energy and time.
  • Output Stationary: In accelerators of this type, PEs are allocated to perform operations related to a specific number of outputs. The intermediate partial sums are stored in the PE memory and are updated with additional partial sums as more IFmaps and weights are streamed through the hardware. Output partial sum stationary accelerators leverage input and partial sum data reuse [3,36,37].
  • Weight and Input Reuse: In addition to leveraging weight reuse, input reuse is further employed to minimize memory reads, enhancing the performance of these accelerators. Convolution primitives, as implemented in [4], are processed by streaming the IFmap and kernel weights through distinct sets of PE units, and the resulting partial sums are aggregated to generate the desired OFmap with fewer inter-core memory transactions.
As elaborated in the next section, a key novelty of our methodology is the integration of the data streaming pattern specification into our high-level fault assessment framework. This allows for the evaluation of the fault resiliency of the accelerator at an early stage in the design space exploration phase before the completion of the full implementation of the accelerator.

3.4. General DNN Frameworks

General DNN frameworks are designed to efficiently support various types of DNN layers. One of the reasons is that the processing of each layer can be simplified to a series of tensor operations. For instance, the processing of linear layers can be transformed into a single matrix multiplication task, where the first dimension of the activation matrix corresponds to the batch size and the number of output nodes equals the number of columns in the weights matrix. Similarly, the processing of a convolutional layer can be converted to matrix multiplication by applying convolution unfolding techniques.
An example illustrating the unfolding of CONV layer processing to matrix multiplication is presented in Figure 4. Matrix A represents the input activation for a batch of N_im images. Each group of H_o × W_o rows in matrix A corresponds to one unfolded input image, where each row contains the IFmap pixels corresponding to one OFmap sample. It is assumed that the deepest dimension is the number of input channels, which is denoted as D_in. The weight matrices are represented by matrix B, where each column corresponds to one kernel. The output partial product before activation (assuming activation occurs separately in the subsequent stage) is represented by matrix C.
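As a concrete illustration of the unfolding in Figure 4, the sketch below builds matrix A for a single image, one row per OFmap sample; the row-major layout and single-image scope are simplifying assumptions.

#include <vector>

// Unfolds one image's IFmap (layout [Hi][Wi][Din]) into matrix A with
// Ho*Wo rows and Kh*Kw*Din columns. Zero padding and batching are omitted.
std::vector<float> unfold_ifmap(const std::vector<float>& I,
                                int Hi, int Wi, int Din, int Kh, int Kw, int ST) {
  const int Ho = (Hi - Kh + ST) / ST;
  const int Wo = (Wi - Kw + ST) / ST;
  std::vector<float> A(static_cast<size_t>(Ho) * Wo * Kh * Kw * Din);
  size_t idx = 0;
  for (int dh = 0; dh < Ho; ++dh)        // one row of A per OFmap sample
    for (int dw = 0; dw < Wo; ++dw)
      for (int ih = 0; ih < Kh; ++ih)
        for (int iw = 0; iw < Kw; ++iw)
          for (int dc = 0; dc < Din; ++dc)
            A[idx++] = I[((ST * dh + ih) * Wi + (ST * dw + iw)) * Din + dc];
  return A;
}

Matrix B then holds one kernel per column, so C = A × B reproduces the pre-activation OFmap of Figure 4.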
When presenting our main contribution in Section 5, we will utilize the same terminology and design abstraction outlined in Figure 4. This choice is motivated by the widespread use of this level of abstraction in software frameworks for DNN implementation, such as PyTorch [23]. More importantly, this level of abstraction is not dependent on any specific type of hardware accelerator employed for deploying DNN models. Therefore, by developing a fault injection framework at this level of abstraction, it is possible to evaluate the potential impact of hardware faults on new hardware acceleration proposals early during design space exploration, i.e., during the pre-implementation phase when the design revisions are more easily manageable.

4. Motivation

In Figure 1, we show four different fault resiliency assessment options in four different environments during the design and implementation process.
Using silicon prototypes to emulate hardware faults (silicon environment) can help assess the hardware fault resiliency in a fast operating environment. However, this requires the insertion of hardware structures that facilitate the fault injection, and the results from such experiments can be used for reporting fault resiliency rather than making important decisions for design revisions. A gate-level environment offers an accurate circuit model of the hardware for fault injection experiments. Nonetheless, this model is only available very late in the design and implementation stages, and its simulation speed is known to be very slow, thus making the assessment of DNN application performance at this stage practically infeasible.
While RTL gives a more accurate description of the hardware, the scope of the model is limited to data transfers, and therefore, it does not natively offer accurate circuit models to be used for fault injection. For example, the impact of faults that are internal to the arithmetic units cannot be evaluated unless more accurate circuit models are introduced.
Driven by the computational speed constraints of current fault injection methodologies in large arrays of arithmetic units within DNN accelerators, we propose adapting existing DNN frameworks, such as PyTorch [23], to facilitate fault assessment experiments at the application level.
To emphasize the reasons behind our choice, in Figure 5, we show an example of a MAC operation (hardware unit) in three different simulation environments: system level (System-Env), RTL level (RTL-Env), and gate level (Gate-Env). The figure shows an example of multiplying activation X by weight Y. The resulting partial product Z is added to the DNN node accumulator acc. It is assumed that the hardware is faulty with fault Z[0]/0 (stuck-at-0). In the Gate-Env, a netlist serves as the design model, enabling direct fault injections. However, the simulation times in Gate-Env tend to be excessively long because it involves processing the entire netlist. This level of detail is unnecessary when the goal is specifically to carry out fault injections in the arithmetic units, suggesting a need for more targeted fault injection strategies that balance detail with efficiency.
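To make the faulty behavior of Figure 5 concrete, the following minimal sketch models the Z[0]/0 fault at the system level by masking the least significant bit of the partial product before accumulation; the operand values are hypothetical.

#include <cstdint>
#include <iostream>

int main() {
  int32_t acc = 0;
  int8_t  X = 5, Y = 3;                      // hypothetical activation and weight
  int32_t Z = static_cast<int32_t>(X) * Y;   // fault-free partial product: 15
  int32_t Z_faulty = Z & ~0x1;               // Z[0] stuck-at-0 clears the LSB: 14
  acc += Z_faulty;                           // the corrupted value propagates into acc
  std::cout << "fault-free Z = " << Z << ", faulty Z = " << Z_faulty
            << ", acc = " << acc << "\n";
  return 0;
}

Note that a boundary-level mask of this kind only captures faults visible at the unit output; faults internal to the multiplier require the gate-level accurate model introduced in Section 5.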
The System-Env and RTL-Env environments provide faster simulation speeds compared to Gate-Env with System-Env achieving superior performance. This advantage in System-Env is largely due to its reliance on higher levels of abstraction, which are facilitated by environments like SystemC [38]. Nevertheless, both environments lack accurate circuit models needed for performing fault injection in arithmetic units, necessitating certain design updates to accommodate this requirement. In particular, RTL-Env can incur higher costs, as modifications to the arithmetic units often necessitate updates across other components within the RTL model. For instance, changing the number of bits of a hardware multiplier in RTL-Env requires updates to all associated memory and connections. Additionally, fault assessment using RTL-Env is only viable once the RTL description of the accelerator is fully developed, which delays the possibility of early design feedback.
In contrast to RTL-Env, System-Env leverages High-Performance Computing frameworks designed to accelerate algorithm design and model development. Utilizing widely adopted DNN frameworks such as PyTorch [23] and TensorFlow [39], running inference jobs in System-Env offers several advantages over RTL-Env, which can be summarized as follows:
  • The system environment is readily accessible during the initial phases of development, allowing for early development and validation.
  • It can be used to provide early feedback on fault resiliency, which can reduce the cost and time associated with redesigns.
  • Offers greater reconfigurability, utilizing accurate models only when necessary for the types of components that are assessed, e.g., arithmetic units.
  • Enables experimentation at a wider scale, supporting the assessment of larger DNN models and hardware configurations.
  • Facilitates keeping pace with emerging DNN models by integrating with widely used, open-source DNN frameworks like PyTorch [23] and TensorFlow [39].

5. Fault Assessment Framework

This section describes the proposed fault assessment framework's design stages and structure, including a novel method for modeling the operation-PE mapping consistent with the data streaming patterns of the evaluated HW accelerator; the method leverages the organization of data structures in state-of-the-art HPC-based DNN frameworks, e.g., PyTorch [23]. Note, it is worth recalling from Figure 1 that a key benefit of developing a high-level fault assessment framework for DNN accelerators is that it can be employed not only after an accelerator has been implemented but also during the early design space exploration phase, when the cost of design revisions is low.

5.1. Framework Requirements

The main objective is to build a framework capable of performing logic fault assessment of DNN hardware accelerators. This can be translated into a series of three key requirements, R1, R2, and R3, as shown in Table 2.
The design of the framework is aimed at meeting the above requirements. To support requirements R1 and R3, a high-performance DNN framework (PyTorch) is chosen as the base framework. Note, however, that there is conceptually no reason the methods presented in this work cannot be applied to other base DNN frameworks. The main contributions of this work are providing key support to conduct fault injection in a hardware multiplier model (R2-a) and modeling operation-PE mapping (R2-b), as per the data streaming patterns defined at the architectural level, within an HPC DNN framework. These two modeling features are anticipated to be key characteristics of the hardware utilized for DNN acceleration, and both introduce significant challenges upon their integration within high-performance DNN frameworks, especially when scaled up.
Before discussing the ALPRI-FI structure, it is essential to highlight the challenges of modeling fault injection for a large DNN model on an HW accelerator and quantify the performance overhead the work from this paper addresses. To achieve this, Table 3 summarizes how adding new modeling features gradually introduces a performance penalty, while at the same time, it shows how, by optimizing the implementation of these added features, some of the shortcomings can be mitigated. These optimizations, which are detailed in the following sections, are the main contributing factors that facilitate fault assessment for low-level netlist models of arithmetic units for HW accelerators, executing large DNN models on a scale not reported previously in the literature.
The framework development can be divided into stages S1, S2 and S3 as follows.

5.1.1. (S1) Selection of Base DNN Framework for Injecting Faults to Arithmetic Logic

Simulating logic faults in arithmetic circuits requires developing circuit models of the arithmetic units and injecting faults at the circuit nodes of the model. For example, to model logic faults in HW multipliers, the multiplication operations in DNN inference jobs must be replaced with the circuit model of the multiplier. One option is to build a custom DNN framework around the HW multiplier model. However, this approach does not meet requirements R1 and R3. Another option is integrating the hardware models into a high-performance DNN framework like PyTorch. For example, when benchmarking PyTorch on large matrix multiplication operations, it was found to be 56× faster than a native implementation using loops (stage S1 in Table 3), which justifies integrating the HW models into PyTorch's high-performance DNN framework.
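For reference, the loop-based implementation used as the comparison point in stage S1 is of the following form; this sketch is only illustrative of what "a native implementation using loops" refers to, not the exact benchmark code.

#include <vector>

// Plain triple-loop matrix multiplication, C (MxN) = A (MxK) x B (KxN),
// representative of the loop-based reference against which the optimized
// GEMM path of the base framework was benchmarked in stage S1.
void matmul_loops(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int M, int K, int N) {
  C.assign(static_cast<size_t>(M) * N, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int k = 0; k < K; ++k) {
      const float a = A[m * K + k];
      for (int n = 0; n < N; ++n)
        C[m * N + n] += a * B[k * N + n];
    }
}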

5.1.2. (S2) Using Fault Injectable Arithmetic Unit Model

In order to carry out fault resiliency assessment experiments, it is required to replace the multiplication operations in DNN inference with a netlist multiplier model that supports fault injection, e.g., forcing the target node to 1 or 0. Exploratory experiments have established that the initial implementation of requirement R2-a (specifically, substituting the multiplication operator with a fault-injectable hardware multiplier netlist model) resulted in a substantial increase in simulation time, exceeding 2700x-equivalent compared to the base framework implementation. For example, this increases the inference job simulation time of the ImageNet [41] validation test set using ResNet50 [42] from approximately 220 min (on a reference 6-core desktop machine) to an estimated 412 days. To mitigate the excessive slowdown in runtime, and acknowledging that integrating feature R2-b would further exacerbate this issue, we decided to investigate basic optimization techniques (specific technical details are provided in the following subsections) targeting general simulation platforms. The simulation speed is further boosted by using parallel computing clusters. One reason for this decision is to keep the framework portable, i.e., no machine-specific code is introduced; another reason is to offer a plug-and-play experience for the accelerator design team to try different arithmetic structures. The basic optimizations reduced the simulation time from 2700x-equivalent to 900x-equivalent compared to baseline, as shown in stage S2 of Table 3.
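The 412-day figure follows directly from scaling the baseline runtime by the measured slowdown:
$$
220\ \text{min} \times 2700 \approx 5.9 \times 10^{5}\ \text{min} \approx 412\ \text{days}.
$$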

5.1.3. (S3) Operation-PE Mapping Feature

Moving to requirement R2-b, the first exploration used a list that maps the target operation to the assigned PE, as clarified later in this section. This implementation increased the simulation time by 50% (1400x equivalent in stage S3-a compared to 900x equivalent in stage S2-b of Table 3) for small activation and weight matrices. More importantly, the memory requirements of this initial operation-PE design are so substantial that assessing large DNN network layers was not feasible on one machine. By introducing a novel method, discussed in detail in this section, the operation-PE mapping information is stored and processed in a more compact representation. This approach to supporting different data streaming patterns by modeling different operation-PE mappings offers memory requirements proportional to the data size rather than to the number of operations, and a simulation time overhead of only 20% over not supporting this critical feature needed for high-level fault assessment (1100x equivalent compared to the baseline, as in stage S3-b of Table 3).
Having motivated the need for and summarized the requirements for an HPC-based framework for fault assessment during the early design phase of DNN accelerators, we introduce the general framework structure in the following subsection.

5.2. General Framework Structure

Figure 6 depicts the general diagram of the framework. It is divided into three main parts: (1) top-level application, (2) DNN framework, and (3) General Basic Linear Algebra Subroutine (BLAS) back-end. The processing of one DL layer for a batch of inputs is converted to one matrix multiplication job and assigned to a BLAS General Matrix Multiplication (GEMM) routine. Without loss of generality, in this framework, we use PyTorch [23] as the base framework. Currently, PyTorch uses Facebook General Matrix Multiplication (FBGEMM) [40] as the main GEMM for processing inference jobs on 8-bit quantized DNN models.
In Figure 6, A, B, and C are the activation, weight, and partial output matrices, respectively. The conversion process includes convolution unfolding of the convolution layers and converting the processing of convolution (and linear) layers to matrix multiplication. It also features tiling the input matrices into several smaller matrices with fixed dimensions. The GEMM job is broken down into calls to several machine-coded matrix multiplication kernel routines. The data are re-ordered in memory to increase data cache hits and the efficiency of data streaming to the simulation hardware's CPU cores.
The base framework can be upgraded to support modeling of the various blocks within the DNN accelerator being assessed. One key feature is modeling the behavior of arithmetic units. While the results from this study focus on logic faults in integer multipliers, the framework is expandable to investigate adders, bit shifters, etc., that may exist in the processing units of a DNN accelerator. As detailed in the next subsection, the multiplier model is based on the Baugh–Wooley [43] signed multiplier with added fault injection support. While one signed multiplier circuit model is used in this study, the generic framework can be configured to other multiplier circuit architectures. Another key feature in assessing the impact of logic faults on the DNN accelerator is modeling the operation-PE mapping. In this work, as discussed in detail in the following subsections, we introduce an effective way to manage fault assessments by using two new matrix-like data structures, which are referred to as I_A and I_B. As shown in Figure 6, I_A/I_B are defined at the top level (application) and are used at the low level (kernel). To support these two key fault injection features, the built-in kernel is replaced with a custom kernel that uses the operation-PE mapping information taken from I_A/I_B and the fault-injectable multiplier model.
After introducing the multiplier model (for the sake of completeness when presenting results) in the following subsections, we will discuss how the I_A/I_B mapping works efficiently with different data streaming patterns and how the implementation integrates smoothly with the DNN/GEMM performance layers.

5.3. Multiplier Model

While the main contribution of our work is concerned with how to assess the impact of faults in arithmetic units based on an injection mechanism at the level of vector operations (as detailed in the following subsections), for the sake of technical completeness, we first elaborate on the details that need to be accounted for to model the arithmetic circuit's faulty behavior at a higher level of abstraction. A multiplier model is illustrated in Figure 7, where a 9-bit signed multiplier is built from Ripple Carry Adders (RCAs) [43].
As highlighted in Table 3, the initial benchmarking of the multiplier model showed a drastic increase in runtime. The performance of the model was enhanced by applying several techniques, summarized below. One method to accelerate the evaluation of the faulty output for a pair of inputs and a specific fault is to use look-up tables (LUTs) for the top adder rows (in Figure 7) preceding the row where the fault is being injected. Another technique is implementing multi-bit fault masks to remove the conditional statements for each fault injected into the model's nodes. For each logic operation in the model, one pair of pre-evaluated masks (an AND and an OR mask) is used to inject SA-0 or SA-1 faults. Finally, a header-only implementation is also used (in contrast to a source-file implementation, a header-only implementation places the software logic in header files, allowing better compiler optimization during later application build stages), which overall leads to a runtime that is three times faster compared to the non-optimized faulty model (900x equivalent vs. 2700x equivalent compared to baseline, as shown in stage S2 of Table 3).
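A minimal sketch of the multi-bit fault-mask idea is given below: each injectable net carries a pre-computed AND/OR mask pair so that applying SA-0 or SA-1 reduces to two branch-free bit operations per logic evaluation. The data structures and the full-adder example are illustrative assumptions, not the exact ALPRI-FI implementation.

#include <cstdint>

// Pre-evaluated per-net fault masks: the AND mask clears bits for SA-0,
// the OR mask sets bits for SA-1. A fault-free net uses and_mask = all-ones
// and or_mask = 0, so the masking adds no branches to the evaluation.
struct FaultMask {
  uint64_t and_mask = ~0ull;  // SA-0: clear the faulty bit positions
  uint64_t or_mask  = 0ull;   // SA-1: set the faulty bit positions
};

inline uint64_t apply_fault(uint64_t value, const FaultMask& m) {
  return (value & m.and_mask) | m.or_mask;
}

// Example: one full-adder cell of a Baugh-Wooley-style array with injectable
// sum and carry nets (illustrative; the real model evaluates the netlist).
inline void full_adder(uint64_t a, uint64_t b, uint64_t cin,
                       const FaultMask& sum_fault, const FaultMask& carry_fault,
                       uint64_t& sum, uint64_t& cout) {
  sum  = apply_fault(a ^ b ^ cin, sum_fault);
  cout = apply_fault((a & b) | (cin & (a ^ b)), carry_fault);
}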

5.4. (Baseline) Operation-PE Mapping Using Mapping Lists

The data streaming patterns vary from one DNN accelerator to another; these patterns define how operations map to PE units, which is called operation-PE mapping. To motivate a new technique for managing operation-PE mapping, we will refer to the example of multiplying matrix A of size (2, 4) by matrix B of size (4, 3) shown in Figure 8.
A basic approach to describe operation-PE mapping is to use mapping lists. Figure 9 depicts how operation-PE mapping information stored in a mapping list can be used to process the matrix multiplication example in Figure 8, using a 4 × 3 systolic array with weights-stationary streaming. Referring to Figure 4, the example can be the result of unfolding a convolution operation with K_w = K_h = 2, D_in = 1, and D_out = 3. As described in Section 3, the PE grid must be pre-configured with the weights in weights-stationary systolic arrays. It is also required that the streaming of activation into the grid is synchronized with the accumulation and the streaming of partial products from the grid. Figure 9 shows the weights-stationary configuration using a 4 × 3 systolic-array grid. Assuming a fully pipelined implementation with a PE latency of 1 clock cycle per PE, C_00 and C_12 are ready at the output of the M[3][0] and M[3][2] PE units at the 4th and 7th cycle, respectively. The mapping list can be accessed using the iterators k, n, and m used to access A/B. The multiplication operation corresponding to A[m][k] × B[k][n] is mapped to PE (PE_ID[m][k][n]) as described in Equation (2), where c_t is the number of grid columns (c_t = 3 in this example).
$$
PE\_ID[m][k][n] = MP\_list[k][n + m \times c_t] \tag{2}
$$
It is important to note that the mapping list is larger than the two input matrices combined. This imposes a challenge for larger DNN layers, which require more memory storage and bandwidth. In addition, the memory footprint of the list is different from that of either A or B, which increases the likelihood of cache misses. Consequently, a more effective operation-PE mapping method is needed to assess the impact of logic faults when a data streaming pattern is reassessed during the early stages of designing a new DNN accelerator.

5.5. A New Representation for Operation-PE Mapping

ALPRI-FI stores the operation-PE mapping information in two matrix-like data structures. By introducing the I_A and I_B matrices, the designer can define the number of PEs in the cluster and the operation-PE mapping with minimal impact on the system performance. The contents of I_A and I_B define the mapping of operations to the cluster PE units. The sizes of I_A and I_B match the sizes of the activation matrix A and the weights matrix B, respectively. I_A and I_B are packed and tiled, resulting in a memory footprint similar to A and B, respectively. Each pair of elements taken from I_A and I_B is processed by an operation-PE mapper function to calculate the index of the PE unit. Figure 6 highlights how a pair (I_A[i][k], I_B[k][j]), taken from I_A and I_B, is translated to the ID of the cluster PE, PE_Id. By defining the multiplier model and the I_A/I_B pair, the designers of the hardware accelerator can define the number of PE units and the operation-PE mapping of the cluster under study. In the following subsections, we discuss how the proposed method can be used to describe in detail the operation mapping for a variety of DNN accelerators.
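A minimal sketch of how a custom kernel could consume I_A/I_B is shown below; the mapper is a configurable function of the pair (I_A[i][k], I_B[k][j]), and each multiplication is routed to the faulty multiplier model of its PE. The names and signatures are illustrative assumptions, not the framework's actual API.

#include <cstdint>
#include <functional>
#include <vector>

// Operation-PE mapper: maps a pair of mapping entries (ia, ib) to a PE id.
// Different data streaming patterns correspond to different mappers.
using PeMapper = std::function<int(int ia, int ib)>;

// Illustrative inner loop of a fault-injectable kernel: each multiplication
// A[m][k] * B[k][n] is routed to the faulty multiplier model of its PE.
int32_t dot_with_faults(const std::vector<int8_t>& A_row,   // A[m][*]
                        const std::vector<int8_t>& B_col,   // B[*][n]
                        const std::vector<int>& IA_row,     // I_A[m][*]
                        const std::vector<int>& IB_col,     // I_B[*][n]
                        const PeMapper& map_pe,
                        int32_t (*faulty_mul)(int8_t, int8_t, int pe_id)) {
  int32_t acc = 0;
  for (size_t k = 0; k < A_row.size(); ++k) {
    int pe_id = map_pe(IA_row[k], IB_col[k]);
    acc += faulty_mul(A_row[k], B_col[k], pe_id);  // gate-level accurate model
  }
  return acc;
}

For instance, the weights-stationary and partial results-stationary mappings of Section 5.6 correspond to mappers that use only I_B or that combine both I_A and I_B, respectively.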

5.6. Supporting Data Streams for Generalized Matrix Multiplication Acceleration

In this section, we introduce a more effective and efficient method to describe the operation-PE mapping for a general matrix multiplication accelerator, so that the impact of faults on logic circuits, such as multipliers, can be assessed. As discussed in Section 3, the processing of different DNN layers can be converted to matrix multiplications.
Using the proposed method to describe the operation mapping of weights-stationary data streaming in systolic arrays, Figure 10 shows the general operation-PE mapping configuration for a TPU-like accelerator [2] of size r_t × c_t. Each PE is associated with one weight. I_A is redundant, and only I_B is needed to describe the operation-PE mapping. As shown in the figure, I_B can be defined as tiles of two-dimensional (2-D) blocks of size r_t × c_t. The contents of I_B are simply the IDs of the PE units to be used with the operations associated with the corresponding weights.
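One plausible instantiation of this tiling, assuming the PEs of the r_t × c_t grid are numbered row-major, is sketched below; the exact numbering convention is an assumption made for illustration.

#include <vector>

// Hypothetical fill of I_B for a weights-stationary r_t x c_t grid:
// weight B[k][n] is assumed to be held by the PE at grid position
// (k mod r_t, n mod c_t), so I_B repeats the r_t x c_t block of PE ids
// over the K x N weight matrix.
void fill_IB_weights_stationary(std::vector<int>& IB, int K, int N,
                                int r_t, int c_t) {
  IB.assign(static_cast<size_t>(K) * N, 0);
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      IB[k * N + n] = (k % r_t) * c_t + (n % c_t);
}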
Figure 11 shows how I_A and I_B can be configured to describe the operation-PE mapping for the example from Figure 8, which is processed by weights-stationary systolic arrays. Since only I_B is needed to define the operation mapping, the multiplication operation corresponding to iterators m, k, and n is processed by the PE identified as outlined below:
$$
PE\_ID[m][k][n] = I_B[k][n] \tag{3}
$$
A natural question is whether the I_A/I_B-type matrices can be used for different data streaming patterns in the same cluster. Figure 12 depicts partial results-stationary data streaming into the arrays and how I_A/I_B can describe the mapping. In partial results-stationary arrays, rows of the activation matrix A perform a dot product with columns of the weights matrix B. The same PE performs the accumulation of partial results for the same output pixel. After the final results are ready, the output is streamed out of the processing grid. One way to describe the mapping is to store the row index and column index of the target PE in I_A and I_B, respectively, as shown in the figure. In this case, the operation-PE mapping depends on both the activation and the weights. Consequently, both I_A and I_B are needed to describe this mapping, such that the multiplication operation corresponding to m, k, and n is mapped to PE PE_ID[m][k][n], which is defined as
$$
PE\_ID[m][k][n] = I_A[m][k] \times c_t + I_B[k][n] \tag{4}
$$
It is worth mentioning that changing the mapping from weights-stationary to partial product-stationary reduces the utilization of PE units. For the latter, each PE contributes more toward the result by calculating four multiplications, compared to only two for the former. This utilization affects the susceptibility to logic faults: different DNN workloads lead to different PE utilizations on the respective accelerator. Investigating this relationship between PE utilization and hardware fault resilience within a DNN environment constitutes a key feature of our framework, as elaborated in the results section.

5.7. Supporting Data Streams for Specialized DNN Accelerators

This subsection shows how our method can be adapted to capture custom mappings for specialized DNN accelerators, such as convolution accelerators whose mapping depends on convolution parameters. As an illustrative example, we will discuss the mapping of convolution operations to the Eyeriss [4] convolution accelerator.
Eyeriss [4] consists of a grid of r_e × c_e PE units, where r_e and c_e are the numbers of grid rows and columns, respectively (r_e = 12 and c_e = 14 in [4]). Figure 13 shows the kernel, IFmap, and OFmap data streaming patterns. The rows of the kernels, IFmap, and OFmap are streamed horizontally, diagonally, and vertically, respectively. While the grid size and data streaming can be customized, we will show how I_A/I_B can be configured to describe the operation-PE mapping according to the specifications from [4].
Before drawing a generalized format of I_A/I_B for Eyeriss, it is important to illustrate how structuring the operation-PE mapping in a 2-D matrix format can be used to describe a multi-dimensional tensor operation, e.g., a convolution between the activation in A and the kernels in B. In a similar manner to the convolution unfolding described in Section 3, the operation-PE mapping of convolution jobs can be described by the 2-D matrices I_A/I_B. To help with the illustration, we show I_A/I_B for a 2-D convolution job with an IFmap with (W_i, H_i, D_in) = (3, 2, 1), three kernels with (K_w, K_h, S_T, N_im, D_out) = (2, 2, 1, 1, 3), and an OFmap with (H_o, W_o) = (1, 2). If the same convolution unfolding rules described in Section 3 are followed, the convolution job unfolds to a matrix multiplication job with a format similar to the example in Figure 8. Hence, the same annotation is used.
Figure 14 is sectioned into two rows, illustrating the segments of the data used to produce one value of an OFmap from different views, shown from the left: (i) the convolution kernel scanning diagram, (ii) activation matrix A and kernel matrix B, (iii) the Eyeriss data streaming pattern to the cluster, and (iv) the corresponding operation-PE mapping in the I_A/I_B matrix format. The processing of the first kernel is shown. There are two horizontal kernel scanning points (W_o = 2) and one vertical kernel scanning point (H_o = 1). Due to the nature of the convolution unfolding, some entries of A map to the same IFmap value (e.g., a_10 = a_01).
In the first row of Figure 14, selecting the first row of each of A and I_A corresponds to the kernel's first horizontal scanning position. The next row of A and I_A corresponds to the kernel's second horizontal scanning position. Only two PE units of the cluster are used, since there are two kernel rows and one vertical scanning point (H_o = 1). Configuring I_A/I_B such that I_A holds the cluster column index and I_B holds the cluster row index is sufficient to describe the mapping of operations to the cluster for each pair of values from A and B. Equation (5) can be used to calculate the PE index PE_ID for each pair taken from I_A and I_B.
$$
PE\_ID[m][k][n] = I_A[m][k] + c_e \times I_B[k][n] \tag{5}
$$
Figure 15 shows a configuration of I_A and I_B describing the operation-PE mapping of a general convolution job streamed to Eyeriss. In this format, it is assumed that the Eyeriss grid is evaluating the OFmap for only one input channel and one output kernel at a time. As noted from the figure, this mapping results in the columns of each of I_A and I_B being identical, which allows further memory savings if a single column is used to define each matrix.

5.8. Contrasting the New Methodology to the Baseline

To assess the impact of logic faults on DNN accelerators early in the design and implementation process, one has to capture the relation between the arithmetic operations of the DNN model executed on the accelerator and the accelerator's processing engines (PEs). Describing the operation-PE mapping for the example in Figure 8 highlights a few key insights of the proposed methodology compared to mapping lists.
One key feature is providing a compact representation of the operation-PE mapping. The size of the mapping information in the proposed method matches the size of the inputs, unlike mapping lists (described in Section 5.4), whose mapping information matches the number of operations. In the example from Figure 8, 24 operations must be performed using 24 different operands from A of size (2, 4) and B of size (4, 3) to evaluate C. For larger A and B, the number of operations is generally higher than the combined number of samples and weights. The reduction in memory facilitated by I_A and I_B is defined by M_R, where (M, K) and (K, N) are the dimensions of A and B, respectively, as follows:
$$
M_R = 1 - \frac{MK + KN}{MKN}
$$
For example, for ResNet50 [42], CONV layer 20 corresponds to a kernel of size K_h = K_w = 3, D_out = 128 output channels, output channel dimensions of H_o = W_o = 28, and D_in = 128 input channels. For a batch of N_im = 100 images, the reduction in memory consumption achieved by the proposed method is M_R = 99.2% compared to mapping lists (describing the operation-PE mapping using I_A and I_B requires less than 1.76 GB of memory, whereas utilizing a mapping list for the operation-PE mapping would need 225 GB of memory for this layer alone). This reduction is achieved by breaking down the operation mapping information into two parts: activation-dependent, captured in I_A, and weights-dependent, captured in I_B. These savings in memory storage and bandwidth made it feasible to complete an inference job supporting configurable operation-PE mapping on one simulation machine.
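Following the unfolding described in Section 3.4, the dimensions of this layer and the resulting reduction work out as follows:
$$
M = N_{im} H_o W_o = 100 \times 28 \times 28 = 78{,}400,\qquad
K = K_h K_w D_{in} = 3 \times 3 \times 128 = 1152,\qquad
N = D_{out} = 128,
$$
$$
M_R = 1 - \frac{MK + KN}{MKN} = 1 - \frac{9.05 \times 10^{7}}{1.16 \times 10^{10}} \approx 99.2\%.
$$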
Another key feature of this methodology is related to runtime. The pre-processing of I_A and I_B is fused with the pre-processing of A and B, respectively. This allows the same platform-specific optimization routines to be leveraged. It has been observed that it is faster to pre-process A and I_A together in the same pre-processing routine than to process each separately; one reason is the use of the same memory access iterators and similar memory access patterns. This efficient design and implementation of the I_A/I_B operation-PE mapping technique consumes as little as 20% extra processing compared to the fault-injectable kernel without this feature, which is critical when exploring the impact of hardware faults on new accelerators.
To summarize, in addition to operation-PE mapping support, the framework features different fault injection capabilities, allowing versatile types of experiments to be run. Table 4 summarizes the fault injection support of the framework that will be assessed in Section 6.

6. Experiments

This section shows the benefits of the framework described in this paper for assessing the resiliency of a hardware accelerator running DNN models against logic faults in arithmetic units. As highlighted in Figure 1, the main motivation for the framework is to enable early fault assessment during the design phase of large-scale DNN accelerators.
In each of the following experiments, the changes in inference accuracy are analyzed for a different set of parameters. Each experiment studies the impact of a subset of parameters on the application performance. This is performed by randomizing all the remaining parameters and plotting the average. By calculating the average, the effect of the unwanted parameters is removed (additional averaging jobs were performed to enable clearer data visualization). Different pre-trained DNN networks based on refs. [44,45] are used in the analysis. The framework is implemented in C++ and is integrated into a custom build of the PyTorch C++ open-source library in a Linux environment. The verification was completed in three stages: (i) multiplier model verification in a Hardware Description Language (HDL) and C++ environment; (ii) matrix multiplication kernel verification and benchmarking using the FBGEMM verification and benchmarking library tests on a local machine (6-core Intel Xeon CPU E5-2620 v3 with 48 GB RAM); (iii) DNN inference performance verification using a custom-built PyTorch 1.11 on the Niagara compute cluster (operated by SciNet [46]). Niagara enables large parallel jobs on 1040 cores or more, clustered in nodes of 40 Intel "Skylake" cores and 202 GB of RAM each. While collecting the results in the third stage, each simulation point (e.g., for a specific fault injection rate) is repeated multiple times (30–500), and the average is plotted.
The results from Section 6.1, Section 6.2 and Section 6.3 assess the impact of different types of faults, e.g., stuck-at-0 (SA-0) and stuck-at-1 (SA-1) faults or faults at critical nets in multipliers, on the prediction accuracy of the DNN models whose arithmetic runs on such faulty multipliers. The first set of experiments does not account for operation-PE mapping. The results from Section 6.4 demonstrate how the data streaming patterns used by a specific HW accelerator (as captured by the operation-PE mapping) impact the prediction accuracy of the DNN models executed on them. In Section 6.5, the PE utilization distribution for each accelerator under study is presented. This feature helps system designers optimize resource utilization and hardware resiliency at the system level. This section concludes with a qualitative comparison against related studies in Section 6.6.

6.1. Fault Resilience of Different DNN Models

This experiment injects random logic faults before running inference jobs for an accelerator with a specific number of pre-defined PE units (e.g., 10,000). To assess the impact of random faults, the mapping of operations to PE units is ignored in this experiment by randomizing the operation-PE mapping and averaging the results. A case study sweeping different fault injection rates for nine different convolutional DNN models is shown in Figure 16. The inference accuracy for the CIFAR10 and ImageNet DNN nets is evaluated using top-1 and top-5 accuracy, respectively. For high injection rates, the inference accuracy for the CIFAR10 DNN nets settles at 10%, whereas for the ImageNet DNN nets it settles at 0.1%, reflecting the number of classes in each dataset. The simulation points are manually selected to reduce simulation time and resources; more points are placed within the region of steep accuracy decline.
The simulation results show a significant decline in inference accuracy for fault injection rates as low as 250–2000 faults per million PE. This implies that a DNN model running on an accelerator with the same dimensions as a 256 × 256 TPU [2] may mispredict an object in the input image if 16–130 PE units are faulty. Projecting this range onto Eyeriss with 168 PE units [4], a single fault can cause a similar accuracy decline. The differences among the nine DNN nets shown in the figure do not indicate a significant change in the trend of prediction accuracy degradation.
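The projection above can be checked with simple arithmetic (a worked example based on the quoted rate range; the small difference from the 16–130 figure is rounding):
\[
256 \times 256 = 65{,}536 \ \text{PEs}, \qquad
65{,}536 \times \frac{250}{10^{6}} \approx 16, \qquad
65{,}536 \times \frac{2000}{10^{6}} \approx 131.
\]
\[
\text{Eyeriss: } \frac{1}{168} \approx 5952 \ \text{faults per million PE} \; > \; 2000 \ \text{faults per million PE},
\]
so a single faulty PE in Eyeriss already places the design beyond the fault-rate range at which accuracy starts to collapse.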

6.2. Susceptibility to Different Fault Types

In this experiment, the sensitivity of a DNN model to different fault types is investigated. The objective is to understand the effect of SA-0 and SA-1 logic faults in arithmetic units on different DNN models, e.g., VGG16 and ResNet18. For the results in Figure 17, the simulation experiments were performed with an exact (non-randomized) selection of the PEs and fault sites. The results show a larger decline in inference accuracy when injecting SA-1 faults than SA-0 faults. This is because the ReLU activation function passes positive values unchanged while clamping negative values to zero. An SA-1 fault pushes the kernel output (or node value in linear layers) toward a positive increase; this holds for both positive and negative convolution layer outputs before activation. An SA-0 fault works in the opposite direction: flipping a bit from 1 to 0 reduces the kernel output value. The ReLU activation function shunts changes in negative kernel outputs, whereas faults causing discrepancies in the positive range are likely to affect the output. A similar observation was reported when injecting faults into the datapath [11].
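As a hedged numerical illustration of this asymmetry (the numbers are illustrative, not taken from the experiments): consider a convolution output of $-20$ before activation and a fault site of arithmetic weight $2^{6} = 64$,
\[
\mathrm{ReLU}(-20) = 0, \qquad \mathrm{ReLU}(-20 + 64) = 44, \qquad \mathrm{ReLU}(-20 - 64) = 0.
\]
An SA-1 fault that raises the pre-activation value by 64 produces a visible discrepancy of 44 after ReLU, whereas an SA-0 fault that lowers it by 64 is fully masked, because the fault-free output was already clamped to zero.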

6.3. Critical Faults

This test evaluates the DNN network's sensitivity to faults at different fault sites of the multiplier circuit. The findings of the experiment can be used to guide the design of fault-tolerant techniques that protect against the most critical faults, since mitigating all faults is not feasible. For the multiplier model in Figure 7, there are 876 different faults across 438 nodes (two fault types per node). As a guiding example, we study the general trends in inference accuracy when random faults (SA-0, SA-1) are injected into two node groups located in different areas of the multiplier circuit.
Figure 18 shows the inference accuracy at different fault injection rates for the M and S node groups at the inputs of the full adder blocks $FA(i,j)$, with $i = 2$ and $3 \le j \le 8$ ($M_{23}$:$M_{28}$ and $S_{23}$:$S_{28}$). The same notation as in Figure 7 is used. Faults injected into nodes closer to the most significant bits (MSBs) are more likely to propagate discrepancies to the multiplier output than those injected into nodes closer to the least significant bits (LSBs). Injecting faults at a rate higher than $10^{-3}$ faults/PE into node $M_{28}$ is sufficient to reduce the inference accuracy significantly, whereas a fault injected into $M_{23}$ can be mitigated up to a rate of $10^{-1}$ faults/PE. A similar trend can be observed for the S node group. Comparing the M and S node groups in the same row, both show a comparable decline in inference accuracy for the nodes closer to the LSB. However, moving closer to the MSB, injecting faults into M nodes causes a more significant decline in prediction accuracy. One reason is that M MSB nodes have a higher chance of flipping the sign bit than S MSB nodes.
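As a hedged, first-order illustration of why fault position matters (assuming the faulty net feeds a single column of the adder array), an excited fault perturbs the product by roughly the arithmetic weight of that column,
\[
|\Delta P| \approx 2^{k},
\]
where $k$ is the bit position of the column the faulty node feeds. A node on the MSB side of the array can therefore shift the product by orders of magnitude more than a node on the LSB side, and a perturbation large enough to flip the sign has an outsized effect on the accumulated convolution output.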

6.4. Different Hardware Accelerator Architectures

This experiment studies the impact of logic faults for different hardware accelerator architectures: TPU [2], Eyeriss [4], and Origami [3]. Its primary purpose is to assess the impact of the data streaming patterns represented by different operation-PE mappings in our model. As detailed in Section 5, these mappings are captured by the $I_A$/$I_B$ configuration matrices. Figure 19 shows the trend of inference accuracy decline when random faults are injected into a set of PE units randomly selected from all available PE units in each accelerator, regardless of the utilization of each chosen PE. It is worth noting that PE utilization and workload distribution are not necessarily uniform across all PE units of an accelerator, owing to its unique architecture and data streaming patterns. For example, in Eyeriss, for kernels with a smaller number of rows ($K_h$), some cluster rows may not be utilized. To highlight the impact of each accelerator's unique operation-PE mapping scheme, the experiment is repeated such that faults are injected only into the utilized PE units, and the probability of selecting a PE for injection is proportional to its utilization relative to the other PE units. The results are depicted in Figure 20.
Injecting faults into Eyeriss at rates below $1/N_{PE}^{Eyeriss}$ faults per million PE does not impact the output, where $N_{PE}^{Eyeriss}$ is the number of PEs in Eyeriss (e.g., 168 in [4]). The same observation can be made for Origami. Comparing Figure 19 and Figure 20 indicates that injecting faults into the more highly utilized PE units leads to higher fault resilience. The reason is not that the more utilized PE units contribute less to the output; rather, another factor dominates: the inference workload is not evenly distributed across the accelerator's PE units for the DNN layers closer to the model's inputs, and faults affecting these layers have a higher chance of being mitigated. This uneven distribution of workload is due to the specific structure of the DNN model and the particular operation-PE mapping scheme of an accelerator. For example, the first two layers of VGG16 have 64 kernels vs. 512 kernels in layers 8–13. This results in using only 64 of the 256 columns of the TPU PE grid in the first two layers. The more utilized PE units are therefore more likely to be in the first 64 columns of the grid, and injecting faults into those PE units will mainly impact layers 1–2. For Origami and Eyeriss, the uneven distribution of workloads between layers depends more on the kernel dimensions (for example, VGG16 uses only kernels of size 3 × 3). For this reason, there is a smaller difference in the PE workload distribution between the DNN layers closer to the model's input and the layers closer to the output when compared to the TPU. Overall, comparing the resilience of the three accelerators in Figure 20, it is evident that for higher injection rates, when PE utilization is taken into account, the TPU performs better than Eyeriss and Origami because of its better workload distribution among the PE units. This experiment highlights that the dominating factors in system behavior, whether they originate from the DNN model or from the hardware, can be identified by assessing hardware fault resilience at the level of abstraction offered by our framework.
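A minimal sketch of the utilization-weighted fault-site selection used for the Figure 20 experiment could look as follows; the utilization counts and names are illustrative only, not the framework's API:

#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
  // Per-PE MAC counts, e.g. accumulated from the IA/IB operation-PE mapping.
  std::vector<uint64_t> peUtilization = {0, 120, 400, 400, 80, 0, 950, 950};

  std::mt19937 rng(2024);
  // Probability of picking a PE is proportional to its utilization;
  // unused PEs (count 0) are never selected.
  std::discrete_distribution<int> pickPE(peUtilization.begin(), peUtilization.end());

  for (int i = 0; i < 5; ++i)
    std::cout << "inject fault into PE " << pickPE(rng) << "\n";
}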

6.5. Pre-RTL Accelerator PE Utilization

A key property of our framework, ALPRI-FI, is that it operates at the pre-RTL level. By functioning before RTL development, ALPRI-FI can provide feedback early in the design process, allowing for revisions before a more refined model is developed. Through the abstraction layers of the ALPRI-FI framework, one can obtain, for various types of models, the distribution of the workload across PE units. In Figure 21, the PE workload distributions for TPU [2] (a), Eyeriss [4] (b), and Origami [3] (c) are compared. This analysis identifies which PEs are under heavier loads and helps system designers pinpoint where faults are most likely to have significant impacts at the application level. This type of insight can inform decisions on how to allocate fault-tolerance resources where they are most needed.
It is important to highlight that the PE workload distributions plotted in Figure 21 are based on the basic operation-PE mapping features that are unique to each accelerator, as described in Section 5. For instance, the mapping for Eyeriss shown in the figure initially assumes processing only one kernel at a time. A modification of the operation-PE mapping would alter the workload distribution across the processing grid; therefore, whenever deemed necessary and feasible, multiple kernels can be processed simultaneously to enhance resource utilization. The value of the ALPRI-FI framework is that it allows for updates to the operation-PE mapping and provides estimates of the impact of different workload distributions on hardware fault resilience; thus, designers can make early decisions on the development of fault-tolerant strategies.
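As a minimal, illustrative sketch (not the framework's actual data structures), a per-PE workload histogram of the kind shown in Figure 21 can be accumulated directly from an operation-PE mapping before any RTL exists; the toy mapping below is hypothetical:

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  const int numPE = 6;
  // Toy mapping: element i of each "layer" lists the PE that executes MAC i.
  std::vector<std::vector<int>> layerToPE = {
      {0, 1, 2, 0, 1, 2},        // layer with few kernels: only PEs 0-2 used
      {0, 1, 2, 3, 4, 5, 3, 4},  // deeper layer spreads work across more PEs
  };

  // Count the MAC operations assigned to each PE across all layers.
  std::vector<uint64_t> utilization(numPE, 0);
  for (const auto& layer : layerToPE)
    for (int pe : layer) ++utilization[pe];

  for (int pe = 0; pe < numPE; ++pe)
    std::cout << "PE " << pe << ": " << utilization[pe] << " MACs\n";
}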

6.6. Comparison to Related Studies

Table 5 summarizes the main differences between ALPRI-FI and the studies from [12,24,25,26,27]. In ref. [12], faults in control and management units are emulated by inserting specific instructions into the software kernels of accelerators. Although this method is very fast, since the faults can be assessed using silicon prototypes, it is complementary to our study, which focuses on the effects of permanent faults in arithmetic units on the accuracy of DNN inference at the application level. The impact of such faults depends on the values of the input operands, making it challenging to adapt the approach from [12] to the large-scale arrays of arithmetic units common in DNN accelerators. Additionally, our research aims to enable designers to identify potential reliability concerns due to permanent faults in arithmetic units early in the design process, even before the RTL code has been developed.
The study from [25] utilizes RTL models to assess resiliency against transient faults. This approach achieves fast simulation times (0.178 min per image) by converting RTL models to C/C++ and activating accurate models only when transient faults occur during simulation. However, extending this method to assess permanent faults in arithmetic units presents several challenges. Unlike transient faults, permanent faults require that accurate circuit models of the arithmetic units be active throughout the entire simulation. Adding accurate circuit models for arithmetic units, which are not natively available in RTL models, and keeping these models active during the entire simulation are expected to increase the simulation time significantly. Furthermore, unlike our approach, which can operate at the pre-RTL stage, RTL models only become available during the later design stages.
Using accurate gate-level models of the complete accelerator is another option for evaluating permanent faults. In [27], a complete gate-level hardware model is produced by synthesizing the DNN model operations. Evaluating fault resiliency using such a complete gate-level model of the accelerator is expected to result in an accurate fault assessment. However, the simulation time is very long, and it is infeasible to conduct fault injection campaigns efficiently at the DNN application level. According to [27], replacing each multiplication operation with a fault-injectable hardware model increases the simulation time by 7 s. Scaling this to the number of multiplications in a state-of-the-art DNN model is expected to require an excessive amount of runtime. Therefore, this approach is limited to a final assessment before physical implementation rather than the early stages of design space exploration, which is one of the critical objectives of our work, as highlighted in Figure 1.
One attempt to increase simulation performance using parallel processing is the study in [26], which models the logic gates of the arithmetic circuits using neural-network-based models. Using neural-network-based models facilitates the use of GPU engines. However, aside from the impact of such models on assessment accuracy, this type of modeling of the circuit logic increases the simulation demands because of the large number of models that need to be executed to evaluate the MAC operations. For example, a relatively small LeNet-5 network consumes 1670 processor-minutes for each 28 × 28 MNIST image.
In addition to the simulation time increase from using accurate circuit models of arithmetic units, which can be on the order of 1000x as discussed in Section 5, it is important to highlight another key challenge when adapting a DNN framework to test the resiliency of different hardware architectures: the significant amount of memory needed during simulation to support a configurable mapping of operations to arithmetic units. In [24], the purpose is to assess memory faults, focusing only on faults affecting the operands of arithmetic operations. Hence, the simulation compute requirements are modest enough to simulate larger DNN models, such as ResNet50 performing inference on the ImageNet dataset (34 min per image in Table 5). However, describing the mapping of operations to the processing elements requires considerably more memory than running inference without operation-mapping support. According to [24], the memory of the simulation machine belonging to one DNN layer must be freed before processing the next layer. This can limit performance when more fault assessment features are added, such as an accurate model of the arithmetic units. With the proposed method, the memory required to describe the mapping of operations is proportional to the model size rather than to the number of operations. This helped keep the simulation performance manageable (29 min per image), even with the inclusion of gate-level models of the arithmetic units, as shown in the last row of Table 5.
To summarize, the primary goal of our framework is to assist system and hardware designers in evaluating the resilience of new hardware architectures against permanent faults in arithmetic units, which can occupy up to 40% of the silicon area. This evaluation is made at the DNN application level and is intended to be carried out at the very early stages of the development process, even before the RTL model is developed. For these reasons, we consider ALPRI-FI a suitable candidate for early hardware fault resiliency assessment in DNN accelerators.

7. Conclusions

Hardware accelerators are essential for machine learning workloads due to their performance and energy efficiency in handling the large computational demands of DNN models. Assessing the impact of hardware faults on these accelerators is critical to maintaining reliable operation in safety-critical applications and cloud infrastructure. However, fault assessment is time consuming because it requires extensive analysis of various fault scenarios. Additionally, accurately simulating and evaluating the impact of these faults on DNN models with large datasets is extremely computationally demanding. To address this concern, we introduced ALPRI-FI in this paper, which is a comprehensive framework for evaluating the fault resilience of DNN accelerators during the early design and implementation stages.
While prior work has investigated the impact of transient faults or permanent faults in memory, there is a lack of understanding of how faults in arithmetic units impact the functionality of DNN models, particularly when fault injection experiments need to be carried out at scale. To address this, ALPRI-FI relies on information from data streaming patterns in hardware accelerators, determined prior to the RTL implementation, and gate-level accurate arithmetic circuit models, which are used to perform precise fault injection experiments. It was demonstrated how accurate fault injection experiments in arithmetic units could be incorporated into hardware accelerator models within state-of-the-art frameworks, such as PyTorch, thus reducing both development time and computational runtimes. More importantly, these experiments are conducted during the early design space exploration stage, providing insights that can guide adjustments before more detailed implementations are finalized. To validate the generality of our framework, we evaluated the fault resilience of various hardware accelerators from the literature for different DNN models. We provided a comparative analysis in terms of fault resilience that, to our knowledge, had not been reported in the public domain for large-scale designs and models.

Author Contributions

Conceptualization, K.M. and N.N.; methodology, K.M.; software, K.M.; validation, K.M.; formal analysis, K.M.; investigation, K.M.; resources, K.M.; data curation, K.M.; writing—original draft preparation, K.M.; writing—review and editing, K.M. and N.N.; visualization, K.M.; supervision, N.N.; project administration, K.M.; funding acquisition, K.M. and N.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Sciences and Engineering Research Council (NSERC) of Canada, grant number RGPIN-2020-06884. Computations were performed on the Niagara supercomputer at the SciNet HPC Consortium and were enabled in part by support provided by Compute Ontario (computeontario.ca, accessed on 1 January 2021) and the Digital Research Alliance of Canada (alliancecan.ca, accessed on 1 April 2022). SciNet is funded by Innovation, Science and Economic Development Canada; the Digital Research Alliance of Canada; the Ontario Research Fund: Research Excellence; and the University of Toronto.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, K.M., upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ALPRI-FI: Architecture-Level Pre-RTL Implementation Fault Injection
SDC: Silent Data Corruption
DNN: Deep Neural Network
IFmap/OFmap: Input/Output Feature Map
MAC: Multiply–Accumulate (operation)
FF: Flip-Flops
HPC: High-Performance Computing
GAN: Generative Adversarial Networks
GPU: Graphics Processing Units
RISC: Reduced Instruction Set Computer (Architecture)
CONV: Convolution (DNN layer and tensor operation)
RTL-Env: RTL Environment (Hardware Design Phase)
Gates-Env: Gate/Circuit-Level Environment (Hardware Design Phase)
FBGEMM: Facebook General Matrix Multiplication
BLAS: Basic Linear Algebra Subroutines
RCA: Ripple-Carry Adders
PE: Processing Engine
TPU: Tensor Processing Unit
MM: Matrix Multiplication

References

  1. Capra, M.; Bussolino, B.; Marchisio, A.; Shafique, M.; Masera, G.; Martina, M. An Updated Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Networks. Future Internet 2020, 12, 113. [Google Scholar] [CrossRef]
  2. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, Toronto, ON, Canada, 24–28 June 2017; pp. 1–12. [Google Scholar] [CrossRef]
  3. Cavigelli, L.; Benini, L. Origami: A 803-GOp/s/W Convolutional Network Accelerator. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 2017, 27, 2461–2475. [Google Scholar] [CrossRef]
  4. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits (JSSC) 2017, 52, 127–138. [Google Scholar] [CrossRef]
  5. Li, G.; Hari, S.K.S.; Sullivan, M.; Tsai, T.; Pattabiraman, K.; Emer, J.; Keckler, S.W. Understanding error propagation in Deep Learning Neural Network (DNN) accelerators and applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 12–17 November 2017; pp. 1–12. [Google Scholar] [CrossRef]
  6. Santos, F.F.D.; Pimenta, P.F.; Lunardi, C.; Draghetti, L.; Carro, L.; Kaeli, D.; Rech, P. Analyzing and increasing the reliability of convolutional neural networks on GPUs. IEEE Trans. Reliab. 2019, 68, 663–677. [Google Scholar] [CrossRef]
  7. De Oliveira, D.A.G.; Pilla, L.L.; Santini, T.; Rech, P. Evaluation and mitigation of radiation-induced soft errors in graphics processing units. IEEE Trans. Comput. 2016, 65, 791–804. [Google Scholar] [CrossRef]
  8. Tyagi, A.; Gan, Y.; Perceptin, S.L.; Perceptin, B.Y.; Whatmough, P.; Zhu, Y. Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators. IEEE Int. Symp. High Perform. Comput. Archit. 2023. Available online: https://par.nsf.gov/biblio/10411383-thales-formulating-estimating-architectural-vulnerability-factors-dnn-accelerators (accessed on 1 May 2024).
  9. Schorn, C.; Guntoro, A.; Ascheid, G. An Efficient Bit-Flip Resilience Optimization Method for Deep Neural Networks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019; pp. 1507–1512. [Google Scholar] [CrossRef]
  10. He, Y.; Balaprakash, P.; Li, Y. Fidelity: Efficient resilience analysis framework for deep learning accelerators. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 270–281. [Google Scholar] [CrossRef]
  11. Kundu, S.; Banerjee, S.; Raha, A.; Natarajan, S.; Basu, K. Toward Functional Safety of Systolic Array-Based Deep Learning Hardware Accelerators. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 485–498. [Google Scholar] [CrossRef]
  12. Guerrero-Balaguera, J.D.; Condia, J.E.; Dos Santos, F.F.; Reorda, M.S.; Rech, P. Understanding the Effects of Permanent Faults in GPU’s Parallelism Management and Control Units. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC, Denver, CO, USA, 12–17 November 2023. [Google Scholar] [CrossRef]
  13. Reagen, B.; Gupta, U.; Pentecost, L.; Whatmough, P.; Lee, S.K.; Mulholland, N.; Brooks, D.; Wei, G.Y. Ares: A framework for quantifying the resilience of deep neural networks. In Proceedings of the ACM/ESDA/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–29 June 2018; pp. 1–6. [Google Scholar] [CrossRef]
  14. Kundu, S.; Basu, K.; Sadi, M.; Titirsha, T.; Song, S.; Das, A.; Guin, U. Special session: Reliability analysis for AI/ML hardware. In Proceedings of the IEEE VLSI Test Symposium (VTS), San Diego, CA, USA, 25–28 April 2021; pp. 1–10. [Google Scholar] [CrossRef]
  15. Mao, F.; Guo, Y.; Liao, X.; Jin, H.; Zhang, W.; Liu, H.; Zheng, L.; Liu, X.; Jiang, Z.; Zheng, X. Accelerating Loop-Oriented RTL Simulation With Code Instrumentation. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2023, 42, 4985–4998. [Google Scholar] [CrossRef]
  16. Chen, T.; Du, Z.; Sun, N.; Wang, J.; Wu, C.; Chen, Y.; Temam, O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Comput. Archit. News 2014, 42, 269–284. [Google Scholar] [CrossRef]
  17. Deligiannis, N.I.; Cantoro, R.; Reorda, M.S.; Habib, S.E. Evaluating the Reliability of Integer Multipliers With Respect to Permanent Faults. In Proceedings of the 2024 27th International Symposium on Design and Diagnostics of Electronic Circuits and Systems, DDECS 2024, Kielce, Poland, 3–5 April 2024; pp. 124–129. [Google Scholar] [CrossRef]
  18. Hochschild, P.H.; Turner, P.; Mogul, J.C.; Parthasarathy, R.G.; Google, R.; Culler, D.E.; Vahdat, A.; Govin-Daraju, R.; Ranganathan, P.; Vah, A. Cores that don’t count. In Proceedings of the Workshop on Hot Topics in Operating Systems, Ann Arbor, MI, USA, 1–3 June 2021; pp. 9–16. [Google Scholar] [CrossRef]
  19. Dattatraya Dixit, H.; Pendharkar, S.; Beadon, M.; Mason, C.; Chakravarthy, T.; Muthiah, B.; Sankar, S. Silent Data Corruptions at Scale. arXiv 2021, arXiv:2102.11245v1. [Google Scholar]
  20. Hollister, S. There Is No Fix for Intel’s Crashing 13th and 14th Gen CPUs—Any Damage Is Permanent; The Verge: New York, NY, USA, 2024. [Google Scholar]
  21. Kaja, E.; Gerlin, N.; Bora, M.; Rutsch, G.; Devarajegowda, K.; Stoffel, D.; Kunz, W.; Ecker, W. Fast and Accurate Model-Driven FPGA-based System-Level Fault Emulation. In Proceedings of the IEEE/IFIP International Conference on VLSI and System-on-Chip, VLSI-SoC, Patras, Greece, 3–5 October 2022. [Google Scholar] [CrossRef]
  22. Kaja, E.; Gerlin, N.; Vaddeboina, M.; Rivas, L.; Prebeck, S.; Han, Z.; Devarajegowda, K.; Ecker, W. Towards Fault Simulation at Mixed Register-Transfer/Gate-Level Models. In Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT, Athens, Greece, 6–8 October 2021. [Google Scholar] [CrossRef]
  23. PyTorch. Available online: https://pytorch.org/ (accessed on 3 February 2022).
  24. Hendrik Bahnsen, F.; Klebe, V.; Fey, G. Effect Analysis of Low-Level Hardware Faults on Neural Networks using Emulated Inference. In Proceedings of the IEEE International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 5–7 July 2021. [Google Scholar] [CrossRef]
  25. Hoefer, J.; Kempf, F.; Hotfilter, T.; Kreß, F.; Harbaum, T.; Becker, J. SiFI-AI: A Fast and Flexible RTL Fault Simulation Framework Tailored for AI Models and Accelerators. In Proceedings of the ACM Great Lakes Symposium on VLSI, GLSVLSI, Knoxville, TN, USA, 5–7 June 2023; pp. 287–292. [Google Scholar] [CrossRef]
  26. Chaudhuri, A.; Chen, C.Y.; Talukdar, J.; Madala, S.; Dubey, A.K.; Chakrabarty, K. Efficient Fault-Criticality Analysis for AI Accelerators using a Neural Twin. In Proceedings of the IEEE International Test Conference (ITC), Anaheim, CA, USA, 10–15 October 2021; pp. 73–82. [Google Scholar] [CrossRef]
  27. Karami, M.; Haghbayan, M.H.; Ebrahimi, M.; Miele, A.; Tenhunen, H.; Plosila, J. Hierarchical Fault Simulation of Deep Neural Networks on Multi-Core Systems. In Proceedings of the IEEE European Test Symposium (ETS), Bruges, Belgium, 24–28 May 2021; pp. 1–2. [Google Scholar] [CrossRef]
  28. Chaudhuri, A.; Talukdar, J.; Su, F.; Chakrabarty, K. Functional Criticality Classification of Structural Faults in AI Accelerators. In Proceedings of the IEEE International Test Conference (ITC), Washington, DC, USA, 1–6 November 2020; pp. 1–5. [Google Scholar] [CrossRef]
  29. Chen, C.Y.; Chakrabarty, K. Pruning of Deep Neural Networks for Fault-Tolerant Memristor-based Accelerators. In Proceedings of the ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; pp. 889–894. [Google Scholar] [CrossRef]
  30. Kaja, E.; Leon, N.O.; Werner, M.; Andrei-Tabacaru, B.; Devarajegowda, K.; Ecker, W. Extending Verilator to Enable Fault Simulation. In Proceedings of the MBMV 2021—24th Workshop, Online, 18–19 March 2021; pp. 1–6. [Google Scholar]
  31. Chen, Z.; Narayanan, N.; Fang, B.; Li, G.; Pattabiraman, K.; DeBardeleben, N. Tensorfi: A flexible fault injection framework for tensorflow applications. In Proceedings of the IEEE International Symposium on Software Reliability Engineering (ISSRE), Coimbra, Portugal, 12–15 October 2020; pp. 426–435. [Google Scholar] [CrossRef]
  32. Verilator. Available online: https://veripool.org/ (accessed on 7 July 2024).
  33. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient Inference Engine on Compressed Deep Neural Network. Proc. ACM/IEEE Annu. Int. Symp. Comput. Archit. (ISCA) 2016, 16, 243–254. [Google Scholar] [CrossRef]
  34. Cavigelli, L. Accelerating Real-Time Embedded Scene Labeling with Convolutional Networks. In Proceedings of the ACM/EDAC/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 7–11 June 2015; pp. 1–6. [Google Scholar] [CrossRef]
  35. Gokhale, V.; Jin, J.; Dundar, A.; Martini, B.; Culurciello, E. A 240 G-ops/s mobile coprocessor for deep neural networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 696–701. [Google Scholar] [CrossRef]
  36. Du, Z.; Fasthuber, R.; Chen, T.; Ienne, P.; Li, L.; Luo, T.; Feng, X.; Chen, Y.; Temam, O. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of the ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, USA, 13–17 June 2015; pp. 92–104. [Google Scholar] [CrossRef]
  37. Peemen, M.; Setio, A.A.; Mesman, B.; Corporaal, H. Memory-centric accelerator design for convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Design (ICCD), Asheville, NC, USA, 6–9 October 2013; pp. 13–19. [Google Scholar] [CrossRef]
  38. systemC. Available online: https://systemc.org/ (accessed on 7 July 2024).
  39. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. Available online: https://www.tensorflow.org/ (accessed on 9 July 2024).
  40. Khudia, D.; Huang, J.; Basu, P.; Deng, S.; Liu, H.; Park, J.; Smelyanskiy, M. FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference. arXiv 2021, arXiv:2101.05615. [Google Scholar] [CrossRef]
  41. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. (IJCV) 2015, 115, 211–252. [Google Scholar] [CrossRef]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  43. Baugh, C.R. A Two’s Complement Parallel Array Multiplication Algorithm. IEEE Trans. Comput. C 1973, 22, 1045–1047. [Google Scholar] [CrossRef]
  44. TorchVision Maintainers and Contributors. TorchVision, PyTorch’s Computer Vision Library. Available online: https://github.com/pytorch/vision (accessed on 1 January 2021).
  45. Phan, H. huyvnphan/PyTorch_CIFAR10. Available online: https://github.com/huyvnphan/PyTorch_CIFAR10 (accessed on 1 February 2022).
  46. Ponce, M.; Van Zon, R.; Northrup, S.; Gruner, D.; Chen, J.; Ertinaz, F.; Fedoseev, A.; Groer, L.; Mao, F.; Mundim, B.C.; et al. Deploying a top-100 supercomputer for large parallel workloads: The Niagara supercomputer. In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (Learning), Chicago, IL, USA, 28 July–1 August 2019. [Google Scholar] [CrossRef]
  47. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Technical Report; Toronto, ON, Canada. Available online: https://api.semanticscholar.org/CorpusID:18268744 (accessed on 1 January 2019).
  48. Genc, H.; Kim, S.; Amid, A.; Haj-Ali, A.; Iyer, V.; Prakash, P.; Zhao, J.; Grubb, D.; Liew, H.; Mao, H.; et al. Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration. Proc.—Des. Autom. Conf. 2021, 2021, 769–774. [Google Scholar] [CrossRef]
Figure 1. The main motivation for the proposed framework is to enable early fault assessment during the design phase of new DNN accelerators. The top of the figure highlights four different hardware development stages during which fault assessment can be performed. The rest of the figure compares them according to usage, accuracy, hardware redesign cost, and simulation speed.
Figure 2. Positioning of our framework, called ALPRI-FI, relative to recent studies [11,12,25,26,27]. ALPRI-FI offers gate-level accurate fault assessment in arithmetic circuitry during early design stages using an extendable framework that operates at the application level.
Figure 3. Paper notation: convolution operation for IFmap $I[H_i][W_i][D_{in}]$ using $D_{out}$ kernels $W[D_{out}][k_h][k_w][D_{in}]$.
Figure 4. Converting CONV layer processing to matrix multiplication, known as convolution unfolding. One kernel scan position $(d_w, d_h)$ corresponds to one row of A. The lower part of the figure shows rows from matrix A related to one input image; when multiplied by the weights matrix B, the result corresponds to $O[i_m][d_h][d_w][c_o]$, before activation, where $i_m$ is the image index as per Equation (1).
Figure 5. Comparing fault injection mechanisms using Multiply–Accumulate (MAC) hardware models in System-Env, RTL-Env and Gates-Env simulation environments. The RTL functional description of the target hardware does not offer detailed circuit models, limiting the support for fault resiliency assessment. The colours of the accelerator model (square) and the multiplier model (circle) indicate the simulation speed, with red representing the longest simulation time.
Figure 6. ALPRI-FI framework structure: The left side describes the breakdown of processing the activation matrix A and weights matrix B to produce the pre-activation layer output C through software layers consisting of the DNN application, DNN framework, GEMM subroutine, and low-level MM kernel. The operation-PE mapping description in the $I_A$ and $I_B$ matrices is efficiently propagated from the top-level application to the low-level kernel (Conf Block(im, ik, in)) as shown on the right side. The MM kernel uses the mapping information to model the operation-PE mapping of the target hardware accelerator. The thickened black arrows indicate the transformation of data structures from an upper to a lower software layer.
Figure 7. A 9-bit signed multiplier model based on the Baugh–Wooley model [43].
Figure 8. Operation-PE mapping example for matrix multiplication of A of size (2, 4) and B of size (4, 3) resulting in matrix C of size (2, 3). Referring to Figure 4, these dimensions can be the result of convolution parameters $K_w = K_h = 2$ and $D_{in}, D_{out} = 1, 3$.
Figure 9. Streaming pattern for matrix multiplication from Figure 8 in a 4 × 3 systolic array and the operation-PE mapping description using mapping list MP_list[M][K × N], where M, K and N are 2, 4 and 3, respectively.
Figure 10. Operation-PE mapping for a TPU cluster (weight-stationary systolic arrays) of size $r_t \times c_t$ using the $I_B$ configuration matrix. Each block of $r_t \times c_t$ provides mapping information for one cluster configuration.
Figure 11. Operation-PE mapping description for weights-stationary streaming patterns through a grid of 4 × 3 systolic arrays using the $I_A$ and $I_B$ matrices for the example in Figure 8.
Figure 12. Operation-PE mapping description for partial-results-stationary streaming patterns through a grid of 4 × 3 systolic arrays using the $I_A$ and $I_B$ operation-PE mapping matrices for the example in Figure 8. $I_A$ and $I_B$ hold the row and column ID of the corresponding PE, respectively.
Figure 13. The Eyeriss ([4]) specialized convolution accelerator is a PE grid of size $r_e \times c_e$. In Eyeriss, kernel rows are streamed horizontally while IFmap rows are fed diagonally. The output partial sums are captured vertically.
Figure 14. The operation-PE mapping using the $I_A$/$I_B$ matrix format for a convolution processed by the Eyeriss ([4]) specialized DNN accelerator. In the top row of the figure, for the first kernel, $C_{00}$ is calculated by streaming rows of the IFmap and the kernel to PE units M[0][0] and M[1][0]. Holding the cluster column and row indices, $I_A$ and $I_B$ can be configured to describe the operation-PE mapping.
Figure 15. Operation-PE mapping description for Eyeriss ([4]). Configuring $I_A$/$I_B$ such that $I_A$ holds the cluster column index and $I_B$ holds the cluster row index is sufficient to describe the mapping of operations to the cluster.
Figure 16. The performance of DNN inference for nine different DNN models and two different datasets, CIFAR10 and ImageNet. The performance of the models declines for fault rates as low as 100–1000 faulty PE units per million.
Figure 17. Comparing injection of SA-0 and SA-1 for CIFAR10 [47] on VGG16 and ImageNet [41] on ResNet18.
Figure 18. Inference accuracy for the $M_{23}$:$M_{28}$ and $S_{23}$:$S_{28}$ node groups of the example multiplier from Figure 7. The VGG16 DNN model with the CIFAR10 dataset is used in this experiment.
Figure 19. Inference accuracy for random, grid-wise, logic fault injection with simplified operation-PE mapping for TPU [2], Eyeriss [4] and Origami [3]. VGG16 DNN model for CIFAR10 dataset is used in this experiment.
Figure 20. Inference accuracy for random logic fault injection with simplified operation-PE mapping. A PE unit is selected for fault injection according to its utilization relative to the other used PE units. Results are shown for TPU [2], Eyeriss [4] and Origami [3]. The VGG16 DNN model with the CIFAR10 dataset is used in this experiment.
Figure 21. PE utilization distribution for TPU [2] (a), Eyeriss [4] (b) and Origami [3] (c). The normalized distribution statistics are collected from running the same inference jobs on VGG16. Evaluating the impact of operation-PE mapping on workload distribution and hardware fault resilience can be performed efficiently using our method, thus gaining early feedback for design revisions before developing an RTL model.
Table 1. Convolution parameters of Equation (1).
$D_{out}$: Number of OFmap channels
$D_{in}$: Number of IFmap channels
$K_h$/$K_w$: Convolution kernel height and width
$H_i$/$W_i$: IFmap height and width
$H_o$/$W_o$: OFmap height and width
$S_T$: Scanning stride (default = 1)
$N_{im}$: Number of images/data samples
Table 2. Framework requirements.
R1: Flexible to simulate different DNN networks, able to perform both post- and pre-implementation fault assessment, and upgradable to support future DNN network types and HW features.
R2: Support different hardware modeling features: (a) modeling the arithmetic circuitry of the HW design; (b) modeling the mapping of operations to processing engines (PEs) based on data streaming patterns.
R3: Consume manageable computational resources.
Table 3. Framework performance at three different development stages. At each stage, a framework requirement is designed and verified (a), and then the implementation is further optimized (b). The results are calculated using FBGEMM ([40]) library benchmarking tests of matrix multiplication.
Stage | Feature | Fault-Injectable Multiplier Model (R2-a) | Operation-PE Mapping (R2-b) | Run Time Compared to Baseline
S1 (a) | Native Matrix Multiplication (MM) routine | No | No | 56x
S1 (b) | FBGEMM machine-specific MM | No | No | 1 (Baseline)
S2 (a) | Add initial fault-injectable multiplier model | Yes | No | 2700x
S2 (b) | Optimized fault-injectable multiplier model | Yes | No | 900x
S3 (a) | Add basic operation-PE mapping | Yes | Yes | 1400x
S3 (b) | Operation-PE mapping with the new method | Yes | Yes | 1100x
Table 4. Fault injection features supported by the framework.
Feature | Description
DNN Models | Support configurable DNN models
HW Accelerator | Support modeling of custom HW accelerators
Multiplier Arch | Support customized multiplier circuit designs
Fault Type | Stuck-at-0/1 (SA-0/1), fixed or randomized
Fault Site | Node selection in the multiplier netlist
Injection Rate | Number of faulty PEs with respect to the total number of PE units
DNN Model Fault Region | Selection of a specific DNN model layer
Cluster Fault Location | Selection of a randomized or specific PE set
Table 5. Comparing support and performance of ALPRI-FI against literature studies [12,24,25,26,27].
Study | Simulation Environment | Fault Assessment (Support Arithmetic Permanent Faults?) | HW Architecture Support | Evaluated DNNs/Datasets | Inference Speed (Minutes/Image/Core)
[12] | RTL-Env | Permanent faults to control/management units; faults emulated on silicon (NO) | GPU with microarchitecture ready; tested on NVIDIA G80 | General GPU loads/applications | No DNN inference data
[25] | RTL-Env | Transient faults to activations, weights and control (NO) | Configurable: convert complete HW RTL to C++; tested with systolic arrays by Gemmini [48] | ResNet, GoogLeNet/ImageNet | 0.178
[27] | Gates-Env | After-synthesis fault simulation (YES) | HW = DNN structure | LeNet-5/MNIST | No data (slow)
[26] | Gates-Env | Fault-criticality analysis (YES) | 128 × 128 systolic arrays | LeNet-5/MNIST | 1670
[24] | System-Env | Memory fault assessment, operands only (NO) | General 12-PE grid | VGG19, InceptionV3, ResNet50/ImageNet | 34
ALPRI-FI | System-Env | Wide support, summarized in Table 4 (YES) | Configurable: describe operation mapping using $I_A$/$I_B$; tested with TPU, Eyeriss, Origami | Configurable: tested with 9 models (Figure 16)/CIFAR10, ImageNet | 29
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
