Two-Dimensional Protection Code for Virtual Page Information in Translation Lookaside Buffers

Gao, Xin; Cui, Naiyuan; Nian, Jiawei; Liu, Hongjin; Yang, Mengfei

doi:10.3390/electronics13071320


by Xin Gao ¹,²,*, Naiyuan Cui ³, Jiawei Nian ²,⁴, Hongjin Liu ² and Mengfei Yang ⁵

¹ Beijing Institute of Control Engineering, Beijing 100190, China

² Beijing SunWise Space Technology Ltd., Beijing 100190, China

³ Beijing Institute of Spacecraft Environment Engineering, Beijing 100029, China

⁴ School of Computer Science and Technology, Xidian University, Xi’an 710071, China

⁵ China Academy of Space Technology, Beijing 100094, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(7), 1320; https://doi.org/10.3390/electronics13071320

Submission received: 18 February 2024 / Revised: 20 March 2024 / Accepted: 29 March 2024 / Published: 1 April 2024

(This article belongs to the Section Computer Science & Engineering)


Abstract:

Severe conditions such as high-energy particle strikes may induce soft errors in on-chip memory, like cache and translation lookaside buffers (TLBs). As the key component of virtual-to-physical address translation, TLB directly affects processor performance. To protect the virtual page information stored in TLB, several studies have introduced error detection or correction codes. However, most schemes proposed for data TLB cannot support the detection of multi-bit upsets, which is reported as a serious issue in modern fault-tolerant processors due to the downscaling of CMOS process technology. In this paper, we propose a new two-dimensional protection technique, called the matrix protection tag (MaP-Tag), to provide stronger error detection and correction capability to TLB virtual page information. Our proposal reorganizes virtual page information into a matrix and employs adjacent check bits and interleaved check bits for error detection. The two-dimensional design offers one-bit error detection, burst multi-bit error detection, and even multi-bit error correction in some cases. The simulation results show that our proposal can detect almost all error patterns when injected with up to eight bit-flips. Furthermore, the technique provides a better error correction rate than conventional single error correction (SEC) codes. The reliability calculation shows that our proposal is powerful in both error detection and correction with affordable storage overhead.

Keywords:

microarchitecture; reliability; TLB; error detection; error correction; fault tolerance

1. Introduction

Fault-tolerant architecture design is a major concern in modern, highly reliable processors. Harsh environmental conditions, such as high-energy particle strikes, can inject soft errors into on-chip memory and impact the execution of programs. With the downscaling of MOSFET size in emerging process technologies, multi-bit upset (MBU), especially burst multi-cell bit flips caused by a single high-energy particle strike, has become an essential issue in today’s processors. The burst MBU induced by a single particle strike flips adjacent storage cells and violates data integrity. Several architectural protection schemes [1,2] have been widely studied recently. Error detection and correction code (EDAC) schemes like parity and Hamming code are commonly used solutions for memory protection.

The translation lookaside buffer (TLB), as the critical hardware component of address translation acceleration, caches the virtual page information and associated physical page information that has been recently used. Soft errors in the TLB may alter the cached virtual page number (VPN) or physical page number (PPN), affecting the validity of translation through false positives or false negatives [3]. In extreme cases, soft errors in VPN or PPN can result in serious consequences and even violate system security. Considering the intrinsic nature of virtual page information, a robust error detection scheme is required to protect the VPN array in TLBs.

However, the fault tolerance issue of TLBs has not drawn sufficient attention, and current protection schemes for data TLBs are inadequate against MBU errors [3,4,5]. Conventional check coding schemes can only detect or correct non-adjacent errors, lacking protection capability for burst multi-bit errors.

To enhance the fault-tolerance capability for virtual page information, especially multi-bit error detection, and to distinguish false positive and false negative cases, we propose a new two-dimensional protection technique called the matrix protection tag (MaP-Tag). MaP-Tag architecturally reorganizes data bits into a matrix. The appended check bits are separated into three parts: adjacent check bits, interleaved check bits, and total check bits. Adjacent check bits detect multi-bit errors spread across different data segments, while interleaved check bits enable rapid detection of burst multi-bit errors. Furthermore, because the two dimensions protect each bit independently, MaP-Tag can locate erroneous single bit-flips and correct them, like conventional single error correction (SEC) codes.

In this paper, we take the RISC-V Sv39 virtual memory system used in modern processors [6] as the scenario and demonstrate a practical configuration of the MaP-Tag scheme. To evaluate the reliability, we conduct a fault injection simulation and calculate the detection and correction rates. To understand the protection behaviors under extreme pressure, we perform an exhaustive set of simulations testing the protection capability in all one-to-eight bit-flip combination cases in a valid VPN entry. The simulation results show that MaP-Tag provides a near-perfect error detection rate across all tested cases while holding a surprisingly high correction rate approaching the SEC scheme. Further reliability analysis through the error distribution model used in previous studies [7,8] confirms the strong and stable protection capability of MaP-Tag.

The simulation results and discussions reveal the powerful error detection and suitable error correction capabilities of MaP-Tag with affordable storage and performance overhead. Our MaP-Tag proposal is a promising candidate for future fault-tolerance designs in highly reliable processors.

The rest of this paper is organized as follows. In Section 2, we summarize the background on TLB soft errors and introduce related protection schemes. Section 3 expounds on the architectural design of MaP-Tag, including check information organization and coding schemes. The evaluation methodology is presented in Section 4. The simulation results and discussions on reliability are shown in Section 5. In Section 6, we conclude the paper.

2. Background and Related Work

This section summarizes the background and related work. We first briefly introduce basic concepts and the soft error issue in the TLB. The false negatives and false positives caused by soft errors are discussed. Then, the proposed schemes for TLB protection are analyzed through classification.

2.1. Soft Errors in TLBs

Harsh environment conditions such as high-energy particle strikes may induce single event upset (SEU) and cause bit-flip in single bit (single bit upset, SBU) or multiple bits (multiple bit upset, MBU) [9]. In the worst cases, bit corruption may violate the integrity of the data stored in on-chip memory, like cache and TLBs.

In modern processors, the translation lookaside buffer acts as a key-value cache for virtual-to-physical translation. The virtual page number (VPN) is the lookup key, and the value obtained is the physical page number (PPN) together with the corresponding page information, like access permissions and status flags. To enhance lookup efficiency, high-performance processors usually organize the TLB as a content-addressable memory (CAM) that holds the VPN and an associated random access memory (RAM) that stores the PPN information [3].

For CAM storage, one bit flip in a valid entry can result in a false positive or false negative problem:

False positive. The requested VPN is supposed to be missed in the TLB since it is not cached de facto. However, a bit flip could alter the VPN in one valid entry, resulting in an unexpected false hit. In this case, the wrong PPN associated with this VPN would be obtained, potentially causing data corruption or even a system crash. Figure 1b presents an example of a false positive. Worse still, the incorrect translation may incur data leakage and threaten information security if the obtained physical address belongs to another process. False positives are harmful to processor functionality and dependability.

False negative. Access to the valid entry holding the requested VPN ought to generate a TLB hit. However, a bit flip could change the stored VPN, causing a false negative. The false negative, in this case, only impacts system performance, increasing the TLB miss ratio and requiring more time-consuming page walks. Figure 1c illustrates the false negative case.
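The two failure modes can be illustrated with a toy model of a fully associative lookup. This is only a sketch: the entry values and the helper names `tlb_lookup` and `flip_bit` are hypothetical, and a real TLB also compares valid bits and address-space identifiers.

```python
def tlb_lookup(entries, vpn):
    """Return the PPN of the first entry whose stored VPN matches, else None."""
    for stored_vpn, ppn in entries:
        if stored_vpn == vpn:
            return ppn
    return None


def flip_bit(value, bit):
    """Model a single-bit upset by inverting one bit of a stored value."""
    return value ^ (1 << bit)


# Hypothetical TLB contents: (VPN, PPN) pairs.
entries = [(0x12345, 0xAAAA), (0x0F0F0, 0xBBBB)]

# A soft error flips bit 0 of the first stored VPN: 0x12345 becomes 0x12344.
corrupted = [(flip_bit(0x12345, 0), 0xAAAA), entries[1]]

# False negative: the cached page no longer hits, forcing a page walk.
false_negative = tlb_lookup(corrupted, 0x12345)

# False positive: an uncached page now hits and receives the wrong PPN.
false_positive = tlb_lookup(corrupted, 0x12344)
```

In this toy model the false negative costs only a refill, whereas the false positive silently returns a translation that belongs to a different page.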

To quantify the performance loss due to soft errors in TLBs, we conduct fault injection experiments in a cycle-accurate simulator [10], as detailed in Section 4. We test the performance of an out-of-order superscalar processor model with increasing error rates in TLB storage. Six representative workloads from the SPEC CPU 2006 benchmark suite [11] are selected for evaluation. The overall instructions per cycle (IPC) results are presented in Figure 2. All IPCs are normalized to an ideal TLB with no errors (Base). The results indicate that soft errors in TLBs with a 100% error rate incur approximately 20% performance loss on average. Workloads with good access locality, like perlbench, can see IPC declines exceeding 30%, while memory-intensive workloads with irregular access patterns, like mcf, may be less affected by false negatives.

Furthermore, soft errors in the PPN RAM region can violate the integrity of address translation. PPN information is typically read-only in TLBs and modifiable exclusively by the kernel. An unexpected bit flip could result in erroneous page translation on a TLB hit, akin to false positive cases in VPNs.

In this work, we focus on the protection of VPN information since it has a direct impact on TLB lookup. PPN protection can be accomplished with strong correction codes like SEC-DED and is not discussed here.

2.2. Related Work

To address data corruption induced by soft errors in TLBs, particularly the VPN array, several protection schemes have been proposed. Two major solutions exist: one is to employ extra error detection and correction code (EDAC), including a parity check and error correction codes (ECC) based on information redundancy, and the other is to exploit the locality of virtual page information for cross-protection. Here, we summarize some related proposals.

The simplest and most direct protection for a VPN is appending a single parity check code [12] to each valid entry to detect one-bit errors. The encoding and decoding processes of parity checks are realized by XOR operations and incur limited area and performance overheads. However, parity checks do not support error correction. If a one-bit error is detected, a TLB refill request is immediately sent to the page walker, and the erroneous entry is marked invalid.

To enhance the capability of error correction, ECC is utilized in TLBs for fast in situ data recovery. The proposal in [4] combines the ECC code with the parity code, only invoking ECC correction if errors are detected by the parity code. Additionally, backup mechanisms can be used for data protection. A copy of TLB entries is maintained in backup storage, encoded with ECC [13]. Every TLB miss results in a backup storage lookup to determine whether the TLB miss is normal or not. A modified ECC code with a longer Hamming distance is proposed in [14]. A specific match comparator is adapted to check whether a mismatch is a real miss, so that mismatches induced by a one-bit error can be detected. Another modified ECC with a minimum Hamming distance of 3 was used in [5].

The coding scheme proposed by [3] aims to eliminate false positives in instruction TLBs. Encoded by values stored in upper VPN bits with increased Hamming distance, the most significant bit of the VPN entry is selected to check the false positives caused by one-bit errors. Nevertheless, this scheme is restricted by additional adjacency requirements and may not work well when VPNs stored in TLBs fall into different regions of the virtual space.

In addition, some researchers have explored the locality of virtual page numbers. The proposal in [15] divides the VPN into two parts and studies the similarity of the upper one. By appending extra flags to mark the identification of upper bits, the VPN can be cross-protected by adjacent entries. However, the similarity might not be guaranteed in fully associative CAM memory. Other schemes [16,17] analyze the locality of physical addresses and eliminate the redundant physical address information stored in tag entries of the same set. Although evaluated for data caches, these ideas can also inform TLB design. The concept of division is also utilized in [18].

Other studies exploit techniques for error detection and correction in separate memory chips like caches. Matrix code schemes [8,19] have been proposed to enhance the protection capability. Matrix codes proposed by [19] use check bits with a matrix format to protect 32-bit word length memory and achieve significant reliability gains. However, these methods, which are based on Galois field GF(2ⁿ), cannot be directly implemented into TLBs due to data length differences. The length of data bits in TLBs is usually irregular and cannot be aligned to 2ⁿ. The direct deployment of matrix codes with Hamming checks will induce unexpected problems like the waste of protection coverage due to check-bit truncation. The constraints of timing and storage resources are also required to be carefully considered when designing the fault-tolerant architecture in TLBs. The timing constraints to separate memory chips are much more relaxed than for TLBs. Directly transplanting these schemes would cause unbearable performance deterioration. Furthermore, TLBs are capacity-limited. Modern processors typically arrange tens of entries for L1 TLBs. The usage of storage resources in TLBs is severely restricted. Hence, a novel protection code scheme that considers the intrinsic features of TLBs is required. Other proposals utilize bloom filters to detect errors in CAM memory or tag arrays [20,21,22]. We do not discuss the design details here.

Previous studies either utilize a modified ECC coding scheme with increased Hamming distance for error detection or exploit the locality of virtual addresses stored in adjacent entries for cross-protection. However, few studies address the MBU issue, which has become a vital threat to modern, highly reliable processors. This paper aims to enhance the capability of burst multi-bit error protection for virtual page information.

3. Protection Scheme Design

In this section, we discuss the detailed MaP-Tag architecture design from several aspects. First, we describe the overall protection code organization. Then, we explore the fault-tolerance capability of our proposal in terms of error detection and correction.

3.1. Two-Dimensional Protection Design

In this paper, we take the RISC-V Sv39 virtual memory system in Xuantie C910 [6] as an example. C910 is a modern high-performance processor supporting translation from a three-level 39-bit virtual address to a 40-bit physical address. Excluding the 12-bit page offset, the VPN and PPN are 27-bit and 28-bit, respectively. The following discussion is based on the premise of a 27-bit VPN.

Traditional one-dimensional protection codes have been deeply studied to detect and correct soft errors in TLBs. Before introducing our MaP-Tag proposal, we first discuss the widely used check codes for memory storage.

Parity check codes are among the most efficient error detection schemes, benefiting from concise code generation and low storage overhead. The encoding and decoding processes can be realized by fast XOR operations with negligible impact on execution latency. For 27-bit VPN storage, a single parity check code over all 27 bits (Parity-1) detects any error pattern with an odd number of bit-flips. To enhance the capability of multi-bit error detection, three parity bits (Parity-3) can be employed, each generated from a consecutive 9-bit data segment. Parity-3 additionally detects non-adjacent multi-bit errors that fall into different segments. However, the parity check code is unable to locate detected error bits, precluding fast data recovery. A time-consuming page walk is requested if any error is detected by the parity check code.
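The difference between Parity-1 and Parity-3 can be sketched in a few lines. The code below is an illustration, not the evaluated hardware: it assumes that data bit V_i is bit i of a Python integer, and the helper names and the sample VPN value are hypothetical.

```python
VPN_BITS = 27
SEGMENT_BITS = 9


def parity(value):
    """XOR of all bits of value (1 if the number of set bits is odd)."""
    return bin(value).count("1") & 1


def parity1(vpn):
    """Parity-1: one check bit over all 27 VPN bits."""
    return parity(vpn & ((1 << VPN_BITS) - 1))


def parity3(vpn):
    """Parity-3: one check bit per consecutive 9-bit segment."""
    mask = (1 << SEGMENT_BITS) - 1
    return tuple(parity((vpn >> (SEGMENT_BITS * s)) & mask) for s in range(3))


vpn = 0x2A5C3F1                      # arbitrary 27-bit example value

same_segment = (1 << 3) | (1 << 4)   # two flips inside segment 0
diff_segment = (1 << 3) | (1 << 12)  # one flip in segment 0, one in segment 1

# Two flips in one segment cancel out in both schemes.
missed_by_both = (parity1(vpn ^ same_segment) == parity1(vpn)
                  and parity3(vpn ^ same_segment) == parity3(vpn))

# Two flips in different segments are invisible to Parity-1 only.
missed_by_parity1 = parity1(vpn ^ diff_segment) == parity1(vpn)
caught_by_parity3 = parity3(vpn ^ diff_segment) != parity3(vpn)
```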

For fast error bit correction, a Hamming code can be adopted. The SEC(38, 32) scheme can correct any single error bit in a 32-bit memory word. For the VPN region, SEC(38, 32) can be truncated to SEC(33, 27) (SEC-1) for storage efficiency. SEC-1 provides fast single-bit error correction with a tolerable area overhead. To enhance multi-bit error resistance, a three-segment SEC code (SEC-3) can be employed, akin to the Parity-3 design. SEC-3 offers in situ correction of up to three error bits, provided that each segment contains at most one of them.
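A single error correction code for 27 data bits can be sketched with the textbook Hamming layout, in which check bits sit at power-of-two codeword positions. The source does not give the exact truncated SEC(33, 27) matrix, so the position assignment and the function names below are assumptions made for illustration.

```python
N_CODE = 33                                   # 27 data bits + 6 check bits
N_CHECK = 6
# Data bits occupy every position that is not a power of two (3, 5, 6, 7, 9, ...).
DATA_POS = [p for p in range(1, N_CODE + 1) if p & (p - 1)]


def syndrome(code):
    """XOR of the positions of all set bits; zero for a valid codeword."""
    s = 0
    for pos in range(1, N_CODE + 1):
        if code[pos]:
            s ^= pos
    return s


def encode(vpn):
    """Place the VPN bits, then set the check bits so that the syndrome is zero."""
    code = [0] * (N_CODE + 1)                 # index 0 is unused
    for k, pos in enumerate(DATA_POS):
        code[pos] = (vpn >> k) & 1
    s = syndrome(code)
    for k in range(N_CHECK):
        code[1 << k] = (s >> k) & 1
    return code


def decode(code):
    """Flip the position named by a non-zero syndrome, then extract the VPN."""
    code = list(code)
    s = syndrome(code)
    if 1 <= s <= N_CODE:
        code[s] ^= 1
    vpn = 0
    for k, pos in enumerate(DATA_POS):
        vpn |= code[pos] << k
    return vpn


vpn = 0x2A5C3F1
word = encode(vpn)

one_error = list(word)
one_error[7] ^= 1                             # single upset: corrected
two_errors = list(word)
two_errors[3] ^= 1
two_errors[5] ^= 1                            # syndrome 3 ^ 5 = 6 points at a third bit
```

With two upsets the decoder "corrects" an innocent bit, so the returned VPN ends up with three wrong bits. This is the miscorrection behaviour that limits plain SEC codes under multi-bit upsets.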

However, traditional one-dimensional protection coding schemes cannot address the problems induced by burst multi-bit upsets, which are regarded as a vital threat in modern space processors.

To tackle the problem of data corruption caused by burst MBU, we propose a two-dimensional matrix protection code scheme, called MaP-Tag, to provide resistance to adjacent multi-bit errors without sacrificing the benefits of traditional one-dimensional schemes. Figure 3 demonstrates the conceptual architecture design of MaP-Tag. The codeword of the MaP-Tag protection scheme is comprised of four parts: the original data, the adjacent check, the interleaved check, and the total check.

To explore the optimization space of data protection, the original data are rearranged as a matrix of data bits in an architectural perspective. Each row in this data matrix contains adjacent data bits of a fixed number, while a column holds data bits from the same position across all rows.

The adjacent check bits are generated in the same manner as conventional segment-based protection coding schemes. For instance, adjacent check bits can be obtained by Parity-3 calculation when the data is divided into three rows. Then, these adjacent check bits provide the capability of non-adjacent multi-bit fault tolerance.

To realize data protection against burst multi-bit errors, the interleaved check technique is introduced. The interleaved check bits are generated from a column perspective: each check bit is calculated from all data bits in one data column. In this way, adjacent data bits are protected by different check bits, and a burst multi-bit error is detected because each flipped bit alters a different check bit.

The total check part serves as a check for check bits and can be calculated using adjacent check bits or interleaved check bits, reflecting the overall check of all data bits swiftly.

3.2. Error Detection and Correction

As mentioned previously, the TLB in C910 stores 27-bit VPN information as per the Sv39 virtual memory standard requirement. The VPN can then be organized as a 3 × 9 or 9 × 3 two-dimensional matrix. Here, we take the 3 × 9 matrix as an example and analyze its corresponding fault-tolerance capability.

A practical architectural design for Sv39 is displayed in Figure 4. Considering storage efficiency, parity check coding schemes are employed in both the adjacent check and the interleaved check parts. The one-bit total check is generated by an XOR over one dimension of check bits (Equation (3) uses the interleaved check bits); in the error-free case, the XOR of all adjacent check bits and the XOR of all data bits yield the same value.

In Figure 4, V_i, A_i, and I_i denote the i-th data bit of the VPN, the i-th check bit of the adjacent check part, and the i-th check bit of the interleaved check part, respectively. The total check bit is represented by the symbol T.

According to the generation method of parity coding schemes, all check bits can be calculated as follows.

For the adjacent check part:

A₀ = V₀ ⊕ V₁ ⊕ V₂ ⊕ V₃ ⊕ V₄ ⊕ V₅ ⊕ V₆ ⊕ V₇ ⊕ V₈,
A₁ = V₉ ⊕ V₁₀ ⊕ V₁₁ ⊕ V₁₂ ⊕ V₁₃ ⊕ V₁₄ ⊕ V₁₅ ⊕ V₁₆ ⊕ V₁₇,
A₂ = V₁₈ ⊕ V₁₉ ⊕ V₂₀ ⊕ V₂₁ ⊕ V₂₂ ⊕ V₂₃ ⊕ V₂₄ ⊕ V₂₅ ⊕ V₂₆.

(1)

For the interleaved check part:

I₀ = V₀ ⊕ V₉ ⊕ V₁₈,
I₁ = V₁ ⊕ V₁₀ ⊕ V₁₉,
I₂ = V₂ ⊕ V₁₁ ⊕ V₂₀,
I₃ = V₃ ⊕ V₁₂ ⊕ V₂₁,
I₄ = V₄ ⊕ V₁₃ ⊕ V₂₂,
I₅ = V₅ ⊕ V₁₄ ⊕ V₂₃,
I₆ = V₆ ⊕ V₁₅ ⊕ V₂₄,
I₇ = V₇ ⊕ V₁₆ ⊕ V₂₅,
I₈ = V₈ ⊕ V₁₇ ⊕ V₂₆.

(2)

For the total check part:

T = I₀ ⊕ I₁ ⊕ I₂ ⊕ I₃ ⊕ I₄ ⊕ I₅ ⊕ I₆ ⊕ I₇ ⊕ I₈.

(3)
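Equations (1)–(3) translate directly into a small encoder. The sketch below assumes that data bit V_i is bit i of a Python integer; the function name `encode_map_tag` is hypothetical.

```python
ROWS, COLS = 3, 9        # the 3 x 9 arrangement of the 27-bit VPN


def encode_map_tag(vpn):
    """Return (A, I, T): adjacent, interleaved, and total check bits."""
    adjacent = [0] * ROWS        # A_0..A_2, one parity bit per row
    interleaved = [0] * COLS     # I_0..I_8, one parity bit per column
    for r in range(ROWS):
        for c in range(COLS):
            bit = (vpn >> (r * COLS + c)) & 1    # V_(9r + c)
            adjacent[r] ^= bit                   # Equation (1)
            interleaved[c] ^= bit                # Equation (2)
    total = 0
    for bit in interleaved:                      # Equation (3)
        total ^= bit
    return adjacent, interleaved, total


# A VPN with only V_11 set lands in row 1, column 2.
a, i, t = encode_map_tag(1 << 11)
```

Because every data bit contributes to exactly one row parity and one column parity, the XOR of the three adjacent check bits always equals T for a correctly encoded entry.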

Here, we analyze the error detection and correction capability of the MaP-Tag protection scheme according to the generation method described previously.

Each check bit A_i in the adjacent check part is calculated from nine consecutive data bits and detects any odd number of bit errors within its row. For simplicity, consider at most one error per row. The adjacent check part can then detect up to three errors, provided that no two errors occur in the same row. Hence, the adjacent check part can detect non-adjacent three-bit errors.

Check bit I_i in the interleaved check part is calculated from the data bits in the i-th column. This single parity bit can detect a one-bit error in that column. Through interleaved organization, adjacent data bits are placed into different groups and checked separately. In the worst case, all bits in the first row (V₀, V₁, V₂, …, V₈) experience bit-flips due to the harsh environment. These nine adjacent bit errors are detected because every bit in the interleaved check part changes: the nine data bits are protected by different I_i, which overcomes the weakness of conventional parity checks on adjacent multi-bit errors. The interleaved check part can therefore detect burst errors spanning up to nine adjacent bits.
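The burst argument can be checked mechanically. The following sketch assumes that a burst is a run of consecutive data bits that all flip, that the check bits themselves stay intact, and that V_i is bit i of an integer; the helper names and the sample VPN are hypothetical.

```python
ROWS, COLS, N = 3, 9, 27


def adjacent_check(vpn):
    """Row parities A_0..A_2."""
    return [bin((vpn >> (r * COLS)) & 0x1FF).count("1") & 1 for r in range(ROWS)]


def interleaved_check(vpn):
    """Column parities I_0..I_8."""
    bits = []
    for c in range(COLS):
        p = 0
        for r in range(ROWS):
            p ^= (vpn >> (r * COLS + c)) & 1
        bits.append(p)
    return bits


def burst_mask(start, length):
    """Error mask flipping `length` consecutive bits beginning at bit `start`."""
    return ((1 << length) - 1) << start


vpn = 0x155AA33
stored_a, stored_i = adjacent_check(vpn), interleaved_check(vpn)

# An eight-bit burst inside row 0 leaves the row parity unchanged ...
faulty = vpn ^ burst_mask(0, 8)
missed_by_rows = adjacent_check(faulty) == stored_a
# ... but flips eight different column parities.
caught_by_columns = interleaved_check(faulty) != stored_i

# Every burst of one to nine bits, at every start position, changes some I_i.
all_bursts_detected = all(
    interleaved_check(vpn ^ burst_mask(start, length)) != stored_i
    for length in range(1, COLS + 1)
    for start in range(N - length + 1)
)
```

A run of at most nine consecutive bits never visits the same column twice, which is why each affected column parity sees exactly one flip.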

Furthermore, the combination of the adjacent and interleaved check bits enables additional error correction. The syndrome of each parity check bit in these two parts performs as an error index. Take one-bit errors as an example. The syndrome S_i of the parity check bit can be calculated as follows:

S_i = P_i ⊕ P_i’,

(4)

where P_i and P_i’ denote the i-th parity check code encoded and decoded, respectively. The dependable data bits without errors will generate all-zero syndrome patterns after decoding. If we inject a single error into V₁₁, the adjacent check bit A₁ and the interleaved check bit I₂ will flip when decoding and generate the inverse values A₁’ and I₂’. Then, the syndromes are as follows:

SA₁ = A₁ ⊕ A₁’,
SI₂ = I₂ ⊕ I₂’,

(5)

and will become non-zero. Regarding all non-zero syndromes as coordinates, MaP-Tag can deduce the position of an erroneous bit and correct it through value flipping.
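The coordinate lookup described above can be sketched as follows. This is a minimal illustration rather than the authors' decoder: it assumes the stored check bits are error-free, maps V_i to bit i of an integer, and uses hypothetical function names.

```python
ROWS, COLS = 3, 9


def check_bits(vpn):
    """Recompute the row parities (A) and column parities (I) of a VPN."""
    a = [0] * ROWS
    i = [0] * COLS
    for k in range(ROWS * COLS):
        bit = (vpn >> k) & 1
        a[k // COLS] ^= bit
        i[k % COLS] ^= bit
    return a, i


def correct_single(vpn_read, stored_a, stored_i):
    """Use the non-zero syndromes as (row, column) coordinates of one bad bit."""
    a, i = check_bits(vpn_read)
    rows = [r for r in range(ROWS) if a[r] != stored_a[r]]   # non-zero SA_r
    cols = [c for c in range(COLS) if i[c] != stored_i[c]]   # non-zero SI_c
    if not rows and not cols:
        return vpn_read, "clean"
    if len(rows) == 1 and len(cols) == 1:
        position = rows[0] * COLS + cols[0]
        return vpn_read ^ (1 << position), "corrected"
    return vpn_read, "uncorrectable"


vpn = 0x2A5C3F1
stored_a, stored_i = check_bits(vpn)

# Inject a single error into V_11: syndromes SA_1 and SI_2 become non-zero.
repaired, status = correct_single(vpn ^ (1 << 11), stored_a, stored_i)
```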

When extending to multi-bit scenarios, MaP-Tag can realize error correction in some specific cases as well. Multi-bit errors can be located only when indexing syndromes do not result in ambiguity, for example, when there are any odd-bit errors in the same row or column. Based on the above analysis, our MaP-Tag proposal can achieve the protection capability of one-bit error detection and correction, multi-bit error detection, and even multi-bit error correction in some cases.

Moreover, the total check bit T provides extra protection for the check bits themselves. As mentioned in Section 3.1, T acts as the check bit of the one-dimensional check bits and is calculated by the XOR operation of all bits in the interleaved check region. Imagine that a single-bit error occurs in the adjacent check or interleaved check region rather than the data region. Previous protection schemes would directly regard the TLB entry as erroneous, although the stored data are correct and should be returned to the processor pipeline in time. Our proposal can recognize this case by comparing the value of T with the XOR of the bits in the adjacent check region. The data are then returned to the pipeline in a timely manner, and all check bits are recalculated for correctness.
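One possible reading of this mechanism is sketched below. The decision rules in `classify` are an interpretation rather than the authors' decoder: the sketch assumes at most one upset per entry, does not check T itself, and compares T against both check regions so that an upset in either one can be recognized.

```python
from functools import reduce
from operator import xor

ROWS, COLS = 3, 9


def encode(vpn):
    """Return (A, I, T) for a 27-bit VPN, with V_i taken as bit i."""
    bits = [(vpn >> k) & 1 for k in range(ROWS * COLS)]
    a = [reduce(xor, bits[r * COLS:(r + 1) * COLS]) for r in range(ROWS)]
    i = [reduce(xor, bits[c::COLS]) for c in range(COLS)]
    return a, i, reduce(xor, i)


def classify(vpn, a, i, t):
    """Decide whether a mismatch comes from the data or from a check bit."""
    new_a, new_i, _ = encode(vpn)
    sa = [x ^ y for x, y in zip(a, new_a)]     # adjacent syndromes
    si = [x ^ y for x, y in zip(i, new_i)]     # interleaved syndromes
    if not any(sa) and not any(si):
        return "clean"
    # A lone syndrome in one dimension, with that region disagreeing with T,
    # points at a flipped check bit; the data can be used as-is.
    if sum(sa) == 1 and not any(si) and reduce(xor, a) != t:
        return "adjacent check bit error, data intact"
    if sum(si) == 1 and not any(sa) and reduce(xor, i) != t:
        return "interleaved check bit error, data intact"
    return "data error"


vpn = 0x3C5A99
a, i, t = encode(vpn)

bad_a = list(a)
bad_a[1] ^= 1                                  # upset in check bit A_1
bad_i = list(i)
bad_i[4] ^= 1                                  # upset in check bit I_4
```

A single data-bit upset always raises one syndrome in each dimension, so it never matches the two check-bit cases.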

4. Experimental Methodology

The performance experiment in Figure 2 is conducted using a cycle-accurate architecture simulator [10]. We utilize an out-of-order superscalar processor model to evaluate performance loss. The simulated processor is a 4-issue single-core processor with 3.2 GHz main frequency, 32 KB L1 I/D-Cache, 256 KB L2 cache and 2 MB LLC. To support address translation, a 128-entry L1 TLB is set on the critical path of memory access. All errors are injected into the L1 TLB with given error rates.

To evaluate the reliability of our proposal, we conduct a comprehensive test of its error resistance. Although [23] summarizes the error patterns with the highest occurrence probability according to neutron strike simulation, we choose to take into account the more complex space environment involving proton and electron irradiation. We therefore perform an exhaustive set of simulations to evaluate the protection capability for all possible combinations of one to eight bit-flips in one valid VPN entry. The numbers of detected and corrected error patterns are counted to represent the capability of data protection. We focus on the execution consequences in real processors rather than on the source of errors. This premise is derived from the concept of architecturally correct execution (ACE) [24], which only focuses on execution results.
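A reduced version of such an exhaustive sweep can be sketched as follows. The sketch makes assumptions the source does not state: flips are injected into the 27 data bits only, the check bits stay intact, and only the parity-based schemes are modelled. Because all three codes are linear, detection depends only on the error mask, so no concrete VPN value is needed.

```python
from itertools import combinations

ROWS, COLS, N = 3, 9, 27


def row_syndrome(mask):
    """Parity of the error mask within each 9-bit row."""
    return [bin((mask >> (r * COLS)) & 0x1FF).count("1") & 1 for r in range(ROWS)]


def col_syndrome(mask):
    """Parity of the error mask within each column."""
    return [sum((mask >> (r * COLS + c)) & 1 for r in range(ROWS)) & 1
            for c in range(COLS)]


# Each detector reports whether an error mask changes at least one check bit.
DETECTORS = {
    "Parity-1": lambda m: (bin(m).count("1") & 1) == 1,
    "Parity-3": lambda m: any(row_syndrome(m)),
    "MaP-Tag": lambda m: any(row_syndrome(m)) or any(col_syndrome(m)),
}


def detection_rate(detects, flips):
    """Fraction of all `flips`-bit error patterns that the scheme detects."""
    total = detected = 0
    for positions in combinations(range(N), flips):
        mask = sum(1 << p for p in positions)
        total += 1
        detected += bool(detects(mask))
    return detected / total


rates_two_flips = {name: detection_rate(fn, 2) for name, fn in DETECTORS.items()}
```

Under these assumptions, the only four-bit patterns that escape the two-dimensional parity are the 108 "rectangles" that place two flips in each of two rows on the same two columns.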

In all experiments, the ideal TLB with no protection and no error is marked as Base. Since no existing protection codes are designed specifically for VPN information, we select the Parity-1, Parity-3, SEC-1, and SEC-3 schemes analyzed in Section 3.1 for comparison and discussion.

5. Results and Discussion

This section discusses the reliability of the MaP-Tag scheme. We first focus on the protection capability by error detection and correction analysis. Then, reliability is defined and quantified through a mathematical model. Finally, the overhead induced by MaP-Tag is discussed.

5.1. Protection Capability Analysis

As described in Section 4, we use two parameters, the detection rate and the correction rate, to represent the protection capability. The detection rate is the proportion of detected error patterns among all injected patterns, while the correction rate is the proportion of corrected error patterns among all injected patterns.

Figure 5 exhibits the detection rate changes from one-bit error to eight-bit errors. We evaluate our MaP-Tag design with the comparison of the other four protection schemes. All these five schemes can achieve a detection rate of 100% for one-bit error.

For multi-bit errors, the detection rates change rapidly. Owing to the intrinsic limit of XOR operations, Parity-1 cannot detect any error with an even number of bit-flips, because such flips do not change the value of the parity code. Parity-3 follows a similar trend to Parity-1 but works better in even-bit cases. Parity-3 uses a segment-based design and can detect some multi-bit errors when these errors are non-adjacent and distributed across different segments. For example, in the case of two-bit error patterns, Parity-3 detects the error if the two bit-flips occur in different segments, while Parity-1 cannot detect any two-bit error.

Designed specifically for fast error correction, SEC schemes react poorly in multi-bit cases. All check bits in SEC schemes are used to locate the erroneous bit, and the correct data bit might be modified when mistakenly directed by the syndrome of multi-bit errors. SEC-3 performs slightly better than SEC-1 due to its segment-based design.

MaP-Tag maintains stable error detection performance from single-bit to eight-bit errors. As discussed previously, the two-dimensional check design augments adjacent multi-bit error detection while preserving non-adjacent multi-bit error detection of Parity-3.

Figure 6 shows the correction rate results. Since Parity-1 and Parity-3 cannot correct errors, we only compare the MaP-Tag scheme with SEC-1 and SEC-3. All three schemes achieve an error correction rate of 100% for single-bit errors. SEC-1 cannot locate any errors beyond single-bit error cases and lacks the capability of multi-bit error correction. SEC-3 outperforms the other schemes in error correction, detecting and correcting a few non-adjacent multi-bit errors with no more than three bit-flips.

MaP-Tag performs better than SEC-1 for multi-bit error patterns but cannot achieve the same correction level as SEC-3. Notably, MaP-Tag cannot correct any two-bit errors, even if the two bit-flips fall into different rows and columns. Imagine that bits flip in V₀ and V₁₀ simultaneously. Then, syndromes SA₀, SA₁, SI₀, and SI₁ all become non-zero, which leads to ambiguity, and the two-dimensional check codes cannot locate the errors precisely. In fact, MaP-Tag can locate odd-bit errors only if they fall into the same row or column. Results for error patterns with more than three bit-flips are omitted here, since the correction rate of MaP-Tag is already approximately zero for five-bit errors.
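The ambiguity can be made concrete by comparing the syndromes of two different error patterns. The sketch below works on sets of flipped data-bit indices and assumes intact check bits; the function name `syndromes` is hypothetical.

```python
ROWS, COLS = 3, 9


def syndromes(error_bits):
    """Return (SA, SI) for a set of flipped data-bit indices V_k."""
    rows = [0] * ROWS
    cols = [0] * COLS
    for k in error_bits:
        rows[k // COLS] ^= 1      # V_k lies in row k // 9
        cols[k % COLS] ^= 1       # ... and in column k % 9
    return tuple(rows), tuple(cols)


# Flipping V_0 and V_10 raises SA_0, SA_1, SI_0, and SI_1 ...
pattern_a = syndromes({0, 10})
# ... and so does flipping V_1 and V_9, so the decoder cannot tell them apart.
pattern_b = syndromes({1, 9})
ambiguous = pattern_a == pattern_b

# Three flips in one row raise a single row syndrome and three column
# syndromes, which pins down all three positions.
three_in_row = syndromes({0, 1, 2})
```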

Integrating the detection and correction rates, MaP-Tag is the most efficient protection scheme for VPN information. MaP-Tag achieves the correction level of SEC and holds a steady strong error detection capability. Considering that detection plays a more important role than correction in VPN storage for the sake of information security, MaP-Tag is an appropriate candidate for future TLB fault-tolerance design.

5.2. Reliability Analysis

To rigorously investigate the protection offered by MaP-Tag, we perform a reliability analysis based on assumptions utilized in prior mathematical models [8,25]:

(1) Transient faults occur with a Poisson distribution.

(2) Bit failures are statistically independent.

Assume that the maximum potential error in a codeword is N, which is equal to the length of the codeword. Let FD be the number of faults detectable, and t be the time period after initial usage of the storage, then the fault detection capability F_D(t) can be calculated by Equation (6):

F_{D} (t) = P (FD | MF) = \sum_{i = 1}^{N} P (FD | iF) \times P (iF | MF),

(6)

where MF denotes the faulty memory, and iF is the presence of i faults in the memory. The fault detection capability F_D(t) can be defined as the probability of fault detection given single-bit or multi-bit errors in the memory.

Considering the assumptions mentioned before, the probabilities used in F_D(t) calculation in Equation (6) can be obtained by the combination calculations, and each probability can be calculated by the following equations:

P (iF | MF) = \frac{P (iF)}{P (MF)},

(7)

P (iF) = \binom{N}{i} \times {(1 - e^{- \lambda t})}^{i} \times {(e^{- \lambda t})}^{N - i},

(8)

P (MF) = 1 - {(e^{- \lambda t})}^{N},

(9)

where λ is the failure rate indicating the probability of a bit-flip per bit per unit time. In this paper, λ is set to 10⁻⁵, considering real space irradiation effects. P(FD|iF) denotes the probability of the case that the fault can be detected when i-bit errors exist in the storage. This can be obtained from simulation results of exhaustive error patterns displayed in Figure 5.
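These probabilities are simple to evaluate numerically. The sketch below takes P(MF) to be the probability that at least one of the N bits has failed, that is, 1 − (e^{−λt})^N. N = 40 corresponds to the 27 data bits plus 13 check bits of a MaP-Tag codeword, while the value of t and the function names are illustrative assumptions.

```python
from math import comb, exp


def p_i_faults(i, n, lam, t):
    """P(iF): exactly i of n independent bits have flipped by time t."""
    p_bit = 1.0 - exp(-lam * t)          # per-bit failure probability
    return comb(n, i) * p_bit ** i * (1.0 - p_bit) ** (n - i)


def p_memory_faulty(n, lam, t):
    """P(MF): at least one of the n bits has flipped by time t."""
    return 1.0 - exp(-lam * t) ** n


def p_i_given_faulty(i, n, lam, t):
    """P(iF | MF) as in Equation (7)."""
    return p_i_faults(i, n, lam, t) / p_memory_faulty(n, lam, t)


N = 40            # MaP-Tag codeword length: 27 data bits + 13 check bits
LAMBDA = 1e-5     # failure rate per bit per unit time, as in the text
T = 800.0         # elapsed time in the same unit as LAMBDA (illustrative)

# The conditional probabilities over i = 1..N form a distribution.
total = sum(p_i_given_faulty(i, N, LAMBDA, T) for i in range(1, N + 1))
```

The conditional probabilities summing to one is a useful sanity check that P(iF) and P(MF) are defined consistently.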

Furthermore, the reliability of fault detection R_D(t) in the memory can be calculated using Equation (10) by taking all TLB entries into consideration. In our simulation configuration, the number of TLB entries M is set to 128.

R_{D} (t) = {(1 - P (MF) + \sum_{i = 1}^{N} P (iF) \times P (FD | iF))}^{M} .

(10)
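Equation (10) can be evaluated once the per-pattern detection probabilities P(FD|iF) are known. Since the values measured in Figure 5 are not reproduced here, the sketch below uses three idealized detectors as stand-ins. The function names, the value of t, and these detector profiles are assumptions, and 1 − P(MF) is taken as the probability that no bit has failed.

```python
from math import comb, exp


def reliability(p_handled, n, m, lam, t):
    """Equation (10): probability that all m entries are fault-free or handled.

    p_handled(i) stands in for P(FD | iF), the chance that an i-bit error
    pattern is detected (or corrected, for Equation (11)).
    """
    p_bit = 1.0 - exp(-lam * t)
    p_no_fault = (1.0 - p_bit) ** n                       # 1 - P(MF)
    p_covered = sum(
        comb(n, i) * p_bit ** i * (1.0 - p_bit) ** (n - i) * p_handled(i)
        for i in range(1, n + 1)
    )
    return (p_no_fault + p_covered) ** m


N, M, LAMBDA, T = 40, 128, 1e-5, 800.0

# Idealized stand-ins, not measured data:
r_perfect = reliability(lambda i: 1.0, N, M, LAMBDA, T)               # catches everything
r_none = reliability(lambda i: 0.0, N, M, LAMBDA, T)                  # no protection
r_single = reliability(lambda i: 1.0 if i == 1 else 0.0, N, M, LAMBDA, T)
```

The unprotected case reduces to e^{−λtNM}, which gives a quick sanity check on the implementation.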

Figure 7 depicts the reliability of detection results. Note the reliability curves for Parity-1 and SEC-1 are very similar. Therefore, we highlight the reliability values at 800 days for better comparison. Our MaP-Tag proposal outperforms all other protection schemes in the capability of error detection. It can provide reliability of over 90% in the first 8000 days. As discussed in Section 3.2, MaP-Tag protects the VPN tag array against one-bit, non-adjacent multi-bit, and adjacent multi-bit errors, exhibiting a strong fault detection capability due to the two-dimensional coding design. With powerful detection, MaP-Tag enables more stable and persistent fault tolerance in harsh space environments.

In addition, the reliability of fault correction R_C(t) can be analyzed in the same way, using P(FC|iF) to represent the probability that the fault is corrected when i bit errors are present in the storage. The reliability of correction is then given by Equation (11):

R_{C}(t) = \left(1 - P(MF) + \sum_{i = 1}^{N} P(iF) \times P(FC \mid iF)\right)^{M},

(11)

where FC denotes the event that the faults are corrected by the protection scheme.
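Equations (10) and (11) share the same form and differ only in the conditional probability that is plugged in, so a single routine can cover both. The following is a minimal sketch under stated assumptions: `reliability` and the `coverage` dictionary are hypothetical names, N = 40 bits per entry is an assumption, and the coverage values used in the example are placeholders, not the simulated rates of Figure 5 or Figure 6.

```python
from math import comb, exp


def reliability(t, coverage, n_bits=40, entries=128, lam=1e-5):
    """Evaluate Eq. (10) or Eq. (11) at time t.

    coverage maps a fault count i to P(FD|iF) (detection) or P(FC|iF)
    (correction); fault counts that are absent are treated as 0.
    """
    p_flip = 1.0 - exp(-lam * t)
    fault_free = (1.0 - p_flip) ** n_bits  # equals 1 - P(MF)
    handled = sum(
        comb(n_bits, i) * p_flip**i * (1.0 - p_flip) ** (n_bits - i)
        * coverage.get(i, 0.0)
        for i in range(1, n_bits + 1)
    )
    # Each of the M entries must be fault-free or have its faults handled.
    return (fault_free + handled) ** entries


# Placeholder coverage tables for illustration only.
no_protection = {}
single_bit_only = {1: 1.0}

for days in (100.0, 500.0, 800.0):
    print(days,
          reliability(days, no_protection),
          reliability(days, single_bit_only))
```

To reproduce curves like those in the reliability figures, the placeholder dictionaries would be replaced by the per-scheme detection or correction rates measured in simulation.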

The correction reliability results are shown in Figure 8. MaP-Tag's error correction capability is close to that of the conventional SEC-1 protection scheme; the difference between the two is negligible in Figure 8. For ease of comparison, we extract the reliability values at 500 days, where the reliability of MaP-Tag falls to 35.7% and that of SEC-1 falls to 35.6%. MaP-Tag slightly outperforms SEC-1 because it can correct a few multi-bit errors that occur in the same row or column, which SEC-1 cannot. Nevertheless, SEC-3 is the best protection scheme for error correction due to its ability to locate non-adjacent two-bit errors.

In general, MaP-Tag provides strong error detection and adequate error correction, making it the best overall protection scheme among the check coding schemes tested. We believe that MaP-Tag is a strong candidate for fault-tolerant TLB design in future processors.

5.3. Overhead Analysis

Check coding schemes protect memory storage like TLBs by appending extra check information, inducing storage overhead and extra encoding/decoding latency. Here, we discuss the overhead caused by MaP-Tag.

As Figure 4 shows, the 27-bit VPN information is protected by an extra 3-bit adjacent check code, a 9-bit interleaved check code, and a 1-bit total check code, for a total of 13 check bits. The area of the encoder and decoder is negligible compared to the area of TLB storage, so the storage overhead is approximately 13/27 ≈ 48.15%. The comparison of storage overhead with conventional check schemes is summarized in Table 1. Although the storage overhead of MaP-Tag is greater than that of the conventional parity coding schemes, MaP-Tag provides both non-adjacent multi-bit detection and burst multi-bit detection, a combination that none of the parity or SEC check schemes in Table 1 offers. In space applications, a fault-tolerant design must balance protection capability against storage overhead. Compared to the SEC-3 scheme, MaP-Tag requires less storage overhead (48.15% versus 66.67%), and compared to SEC-1, it adds multi-bit detection while retaining one-bit correction. Hence, MaP-Tag can be a potential candidate for the next-generation fault-tolerant check scheme in data TLBs.
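The overhead figures follow directly from the ratio of check bits to protected VPN bits. As a small sketch, the snippet below recomputes that ratio from the check-bit counts listed in Table 1; the dictionary and function names are hypothetical.

```python
VPN_BITS = 27  # protected VPN width in the Sv39 design

# Check-bit counts as listed in Table 1; MaP-Tag is 3 adjacent
# + 9 interleaved + 1 total check bit.
CHECK_BITS = {
    "Parity-1": 1,
    "Parity-3": 3,
    "SEC-1": 6,
    "SEC-3": 18,
    "MaP-Tag": 3 + 9 + 1,
}


def storage_overhead(scheme):
    """Return the storage overhead of a scheme as a percentage of VPN bits."""
    return 100.0 * CHECK_BITS[scheme] / VPN_BITS


for name in CHECK_BITS:
    print(f"{name}: {CHECK_BITS[name]} check bits, "
          f"{storage_overhead(name):.2f}% overhead")
```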

Moreover, the calculation of check bits in MaP-Tag is implemented by rapid XOR operations. It is acknowledged that the generation of parity codes will not induce an extra cycle penalty into the execution time of the processor [1]. Hence, the performance overhead of MaP-Tag encoding and decoding logic can be neglected.
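To make the XOR-based check-bit generation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the 27 VPN bits are arranged as a 3 × 9 matrix, that each adjacent check bit is the parity of one row of 9 neighbouring bits, that each interleaved check bit is the parity of one column, and that the total check bit is the parity of all 27 data bits. The exact bit assignment is defined by Figure 4, so this layout and all function names are assumptions.

```python
from functools import reduce
from operator import xor

ROWS, COLS = 3, 9  # assumed matrix layout of the 27 VPN bits


def parity(bits):
    """XOR-reduce an iterable of 0/1 values."""
    return reduce(xor, bits, 0)


def encode(vpn_bits):
    """Return (adjacent, interleaved, total) check bits for 27 VPN bits."""
    if len(vpn_bits) != ROWS * COLS:
        raise ValueError("expected 27 VPN bits")
    rows = [vpn_bits[r * COLS:(r + 1) * COLS] for r in range(ROWS)]
    adjacent = [parity(row) for row in rows]               # 3 row parities
    interleaved = [parity(row[c] for row in rows)          # 9 column parities
                   for c in range(COLS)]
    total = parity(vpn_bits)                               # 1 overall parity
    return adjacent, interleaved, total


def mismatches(vpn_bits, stored):
    """Recompute the check bits and report which ones disagree with stored."""
    adjacent, interleaved, total = encode(vpn_bits)
    s_adjacent, s_interleaved, s_total = stored
    bad_rows = [r for r in range(ROWS) if adjacent[r] != s_adjacent[r]]
    bad_cols = [c for c in range(COLS) if interleaved[c] != s_interleaved[c]]
    return bad_rows, bad_cols, total != s_total


# Example: a single flipped data bit is located by one row and one column.
vpn = [int(b) for b in format(0x2A5F3C1, "027b")]
stored = encode(vpn)
corrupted = list(vpn)
corrupted[1 * COLS + 4] ^= 1
print(mismatches(corrupted, stored))
```

Under this assumed layout, a two-bit burst within one row leaves the row parity unchanged but flips two column parities, which illustrates why the second dimension helps detect adjacent multi-bit errors. The sketch only models errors in the data bits, not in the stored check bits.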

6. Conclusions

Fault tolerance is a critical issue for modern, highly reliable processors. To protect on-chip memory against soft errors, several check coding schemes have been proposed. Given the special nature of the VPN information stored in data TLBs, conventional check schemes are insufficient to detect the diverse error patterns induced by single-event upsets (SEUs). In this paper, we propose a new check coding scheme, named MaP-Tag, to provide powerful multi-bit detection and correction capability. MaP-Tag is a two-dimensional coding design in which the check bits are divided into an adjacent check part, an interleaved check part, and a total check part. The scheme can detect both non-adjacent multi-bit errors and burst adjacent multi-bit errors while maintaining a correction capability close to that of the SEC scheme. The simulation results show that MaP-Tag achieves a nearly 100% detection rate for error patterns from one-bit flips to eight-bit flips. The MaP-Tag design provides strong and stable detection of one-bit and multi-bit errors and correction of one-bit errors with affordable storage overhead. Its protection capability is confirmed by the reliability analysis. MaP-Tag is an important candidate for fault tolerance design for future on-chip TLB memory.

Author Contributions

Conceptualization, X.G. and N.C.; methodology, X.G.; validation, J.N. and H.L.; formal analysis, X.G., N.C. and J.N.; investigation, X.G.; resources, H.L. and M.Y.; data curation, X.G.; writing—original draft preparation, X.G.; writing—review and editing, N.C. and M.Y.; supervision, M.Y.; project administration, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank all anonymous reviewers for their careful consideration and valuable feedback. This work was technically supported by the research team of the highly reliable RISC-V processor program at Beijing SunWise Space Technology Ltd.

Conflicts of Interest

Authors Xin Gao, Jiawei Nian, and Hongjin Liu were employed by the company Beijing SunWise Space Technology Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Farbeh, H.; Delshadtehrani, L.; Kim, H.; Kim, S. ECC-united cache: Maximizing efficiency of error detection/correction codes in associative cache memories. IEEE Trans. Comput. 2021, 70, 640–654.
2. Das, A.; Touba, N.A. Low complexity burst error correcting codes to correct MBUs in SRAMs. In Proceedings of the 28th ACM Great Lakes Symposium on VLSI, Chicago, IL, USA, 23–25 May 2018; pp. 219–224.
3. Sanchez-Macian, A.; Aranda, L.A.; Reviriego, P.; Kiani, V.; Maestro, J.A. Enhancing instruction TLB resilience to soft errors. IEEE Trans. Comput. 2019, 68, 214–224.
4. Sanchez-Macian, A.; Reviriego, P.; Maestro, J.A. Combined modular key and data error protection for content-addressable memories. IEEE Trans. Comput. 2017, 66, 1085–1090.
5. Efthymiou, A. An error tolerant CAM with nand match-line organization. In Proceedings of the 23rd ACM Great Lakes Symposium on VLSI, Paris, France, 2–4 May 2013; pp. 257–262.
6. Chen, C.; Xiang, X.; Liu, C.; Shang, Y.; Guo, R.; Liu, D.; Lu, Y.; Hao, Z.; Luo, J.; Chen, Z.; et al. Xuantie-910: A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension. In Proceedings of the 47th International Symposium on Computer Architecture, Valencia, Spain, 30 May–3 June 2020; pp. 52–64.
7. Freitas, D.; Mota, D.; Goerl, R.; Marcon, C.; Vargas, F.; Silveira, J.; Mota, J. PCoSA: A product error correction code for use in memory devices targeting space applications. Integration 2020, 74, 71–80.
8. Argyrides, C.; Zarandi, H.; Pradhan, D. Matrix codes: Multiple bit upsets tolerant method for SRAM memories. In Proceedings of the 22nd IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, Rome, Italy, 26–28 September 2007; pp. 340–348.
9. Yang, M.; Hua, G.; Feng, Y.; Gong, J. Fault-Tolerance Techniques for Spacecraft Control Computers; John Wiley & Sons Inc.: Singapore, 2017; pp. 9–20.
10. Gober, N.; Chacon, G.; Wang, L.; Gratz, P.; Jiménez, D.; Teran, E.; Pugsley, S.; Kim, J. The championship simulator: Architectural simulation for education and competition. arXiv 2022, arXiv:2210.14324.
11. Henning, J.L. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Comput. Archit. News 2006, 34, 1–17.
12. Griffith, T.W., Jr.; Thatcher, L.E. TLB Parity Error Recovery. U.S. Patent No. 6901540, 31 May 2005.
13. Lang, S.M. Processor Fault Tolerance through Translation Lookaside Buffer Refresh. U.S. Patent No. 8429135 B1, 23 April 2013.
14. Pagiamtzis, K.; Azizi, N.; Najm, F.N. A soft-error tolerant content addressable memory (CAM) using an error-correcting-match scheme. In Proceedings of the IEEE Custom Integrated Circuits Conference, San Jose, CA, USA, 10–13 September 2006; pp. 301–304.
15. Yalcin, G.; Ergin, O.; Islek, E.; Unsal, O.; Cristal, A. Exploiting existing comparators for fine-grained low-cost error detection. ACM Trans. Archit. Code Optim. 2014, 11, 1–24.
16. Ghaemi, S.G.; Ahmadpour, I.; Ardebili, M.; Farbeh, H. SMARTag: Error correction in cache tag array by exploiting address locality. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, Dresden, Germany, 19–23 March 2018; pp. 1658–1663.
17. Farbeh, H.; Mozafari, F.; Zabihi, M.; Miremadi, S.G. Raw-tag: Replicating in altered cache ways for correcting multiple-bit errors in tag array. IEEE Trans. Dependable Secure Comput. 2019, 16, 651–664.
18. Hung, L.D.; Goshima, M.; Sakai, S. Mitigating soft errors in highly associative cache with CAM-based tag. In Proceedings of the 23rd International Conference on Computer Design, San Jose, CA, USA, 2–5 October 2005; pp. 342–347.
19. Argyrides, C.; Pradhan, D.; Kocak, T. Matrix codes for reliable and cost efficient memory chips. IEEE Trans. Very Large Scale Integr. Syst. 2011, 19, 420–428.
20. Pontarelli, S.; Ottavi, M. Error detection and correction in content addressable memories by using bloom filters. IEEE Trans. Comput. 2013, 62, 1111–1126.
21. Reviriego, P.; Pontarelli, S.; Ottavi, M.; Maestro, J.A. FastTag: A technique to protect cache tags against soft errors. IEEE Trans. Device Mater. Rel. 2014, 14, 935–937.
22. Hong, J.; Kim, J.; Kim, S. Exploiting same tag bits to improve the reliability of the cache memories. IEEE Trans. Very Large Scale Integr. Syst. 2015, 23, 254–265.
23. Rao, P.; Ebrahimi, M.; Seyyedi, R.; Tahoori, M. Protecting SRAM-based FPGAs against multiple bit upset using erasure codes. In Proceedings of the 51st Design Automation Conference, San Francisco, CA, USA, 1–5 June 2014; pp. 1–6.
24. Mukherjee, S.S.; Weaver, C.T.; Emer, J.; Reinhardt, S.K.; Austin, T. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual International Symposium on Microarchitecture, San Diego, CA, USA, 3–5 December 2003; pp. 29–42.
25. Silva, F.; Silveira, J.; Silveira, J.; Marcon, C.; Vargas, F.; Lima, O. An extensible code for correcting multiple cell upset in memory arrays. J. Electron. Test. 2018, 34, 417–433.

Figure 1. Illustration of soft errors in VPN information stored in TLB. (a) Correct TLB access, (b) false positive case, and (c) false negative case. The bits and corresponding outputs marked by red color exemplify the problems caused by bit flips.

Figure 2. Normalized IPC results with incremental error rates in TLBs. Injected soft errors can incur significant performance loss and even system crashes.

Figure 3. The architectural design of the two-dimensional MaP-Tag scheme.

Figure 4. A practical design of the MaP-Tag scheme for a RISC-V Sv39 virtual memory system.

Figure 5. The detection rate results in single-/multi-bit error patterns.

Figure 6. The correction rate results in single-/multi-bit error patterns.

Figure 7. The reliability of detection results. MaP-Tag has the best capability of fault detection among all tested schemes. Values of Parity-1 and SEC-1 at 800 days are extracted for comparison.

Figure 8. The reliability of correction results. MaP-Tag has an error correction capability close to that of conventional SEC protection schemes. Values of SEC-1 and MaP-Tag at 500 days are extracted for comparison.

Table 1. Comparison of different protection schemes for TLB VPN information.

Schemes	Number of Check Bits ¹	Storage Overhead	Protection Capability
Parity-1	1	3.70%	1-bit detection
Parity-3	3	11.11%	non-adjacent 3-bit detection
SEC-1	6	22.22%	1-bit correction
SEC-3	18	66.67%	non-adjacent 3-bit correction
MaP-Tag in this paper	13	48.15%	non-adjacent 3-bit detection, burst 9-bit detection, 1-bit correction

¹ Total check bits needed for a 27-bit VPN entry.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

