Communication

MDCIM: MRAM-Based Digital Computing-in-Memory Macro for Floating-Point Computation with High Energy Efficiency and Low Area Overhead

1 Beijing Smart-Chip Microelectronics Technology Co., Ltd., Beijing 102200, China
2 School of Integrated Circuit Science and Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(21), 11914; https://doi.org/10.3390/app132111914
Submission received: 20 September 2023 / Revised: 26 October 2023 / Accepted: 26 October 2023 / Published: 31 October 2023
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

Abstract

Computing-in-Memory (CIM) is a novel computing architecture that greatly improves energy efficiency and reduces computing latency by avoiding frequent data movement between the computation and memory units. Digital CIM (DCIM) is currently regarded as more suitable for high-precision operations represented in floating-point arithmetic, because it is not limited by the ADC/DAC bit width of analog CIM. However, the development of DCIM still faces two problems. On the one hand, mainstream SRAM-based DCIM memory cells, which contain at least six transistors per cell, introduce large area overheads. On the other hand, existing DCIM solutions only support computing precision up to FP32, failing to meet the demands of high-accuracy application scenarios. To overcome these problems, this work presents a novel SOT-MRAM-based digital CIM macro (MDCIM) with higher area/energy efficiency that achieves double-precision floating-point (FP64) computation with a modified fused multiply–accumulate (FMA) module. The proposed design is synthesized with a 55 nm CMOS technology node, achieving 0.62 mW power consumption, an energy efficiency of 26.9 GOPS/W, and an area efficiency of 0.322 GOPS/mm2 at 150 MHz with a 1.08 V supply. Circuit-level simulation results show that the MDCIM achieves higher area utilization than previous SRAM-based CIM designs.

1. Introduction

For decades, the von Neumann architecture has been the foundation of modern computing systems. However, with the explosive growth in data scale, the physical separation between the computation and memory units leads to serious energy and latency issues due to the massive data movement between them [1]. This memory wall problem has become a major constraint for developing high-performance processors, especially for data-intensive applications. Recently, computing-in-memory (CIM) technology, which integrates computing logic into the memory array and enables in situ data processing without extra data movement, has garnered interest as a promising way to break the memory bottleneck of the traditional computing architecture [2]. According to previous works, the CIM architecture has demonstrated great advantages in improving energy efficiency, reducing latency, and boosting parallelism [3,4].
Depending on the computing paradigm, CIM falls into two main categories: analog CIM (ACIM) and digital CIM (DCIM) [5,6,7]. ACIM performs computation by applying voltage or current signals directly to the bit cells, offering high on-chip bandwidth and computing parallelism. However, ACIM suffers from limited computing accuracy caused by transistor variation and the analog-to-digital converter (ADC), which also consumes a large area and energy overhead [5]. In contrast, DCIM encodes data as discrete states and executes arithmetic through digital logic units inside the memory array, enabling high computing precision and versatile application scenarios [6,7]. To date, various DCIM implementations have been explored based on different memories. Static random-access memory (SRAM) is one of the most prevalent candidates for DCIM, achieving superior speed and flexibility owing to its mature fabrication in advanced CMOS technology [8]. Nonetheless, a volatile SRAM bit cell suffers from unavoidable static power dissipation caused by leakage current. Furthermore, a basic SRAM cell contains at least six transistors (6T), incurring considerable area overhead as the array size is scaled up. In contrast, nonvolatile memories such as resistive random-access memory (ReRAM), phase-change memory (PCM), and magnetoresistive random-access memory (MRAM) allow CIM arrays to preserve data without static power dissipation. In addition, among these emerging memories, MRAM exhibits excellent performance in terms of fast speed, high endurance, and CMOS compatibility, making it an ideal candidate for DCIM.
MRAM based on new switching mechanisms has made rapid progress in CIM design. On the one hand, spin–transfer–torque MRAM (STT-MRAM) exhibits significant scalability with a bit cell structure of one transistor and one MTJ (1T-1MTJ) and holds vast potential for a wide range of CIM applications. For example, Wang et al. [9] proposed a tandem STT-MRAM-based CIM scheme that performs analog multiply and accumulate (MAC) operations through resistance summation of serial MTJs, acquiring a 2 Kb array size and achieving an energy efficiency of 63.7 TOPS/W at 8-bit precision. On the other hand, spin–orbit–torque MRAM (SOT-MRAM), characterized by a three-terminal device (2T-1MTJ), decouples the read and write paths with a heavy-metal layer and offers better stability, shorter switching latency, and lower write current. Doevenspeck et al. [10] constructed an optimized SOT-MRAM for weight memory and demonstrated the feasibility of implementing DNN inference with an ACIM scheme. Nevertheless, most prior research on MRAM-based CIM focuses on the analog computing paradigm, while investigations into MRAM-based DCIM are still at an exploratory stage. Given the progressively rising demand for computation accuracy in AI applications, and the fact that existing schemes (such as the example above) support precision no higher than 8 bits, research into MRAM-based DCIM circuits for high-precision, especially floating-point (FP), computation has become increasingly critical.
In this work, we propose a digital SOT-MRAM-based CIM macro (MDCIM) that tightly integrates modified digital multiply and accumulate (MAC) units with SOT-MRAM arrays. The primary contributions of this work are summarized as follows:
(1)
Design a digital computing circuit based on SOT-MRAM to reduce the area overhead of MAC units.
(2)
Implement FP64 MAC operations based on FMA instructions to reduce the latency of FP operations.
(3)
Achieve higher calculation frequency based on digital CIM to reduce the energy consumption of the macro.
This paper is organized as follows: In Section 2, we briefly introduce the architecture of digital CIM circuits and floating-point MAC operations. Then, a pipeline MAC array is proposed for computation in Section 3, and this model is validated using EDA tools in Section 4. Finally, this work is summarized in Section 5.

2. Background

2.1. Working Modes of DCIM

As shown in Figure 1, the DCIM macro is primarily constituted by a memory array, a parallel adder tree, and a partial sum accumulator [11]. The memory array is chiefly composed of memory cells and logic operation gates. The memory cells store reusable weights, which are multiplied in parallel with multi-bit input data in the logic operation units. The parallel adder tree is a full adder array that achieves carry addition by expanding the bit width. The partial sum accumulator is made up of shifters and registers. Based on the DCIM macro, matrix vector multiplication (MVM) can be implemented. First, the weight parameters are loaded into the memory cells, and the input data are then fed into the memory array from the input buffer in sequence. Subsequently, the memory array performs parallel multiplication of 1-bit inputs and multi-bit weights in one computational cycle and transfers the products to a multilevel full adder to obtain a partial sum. Finally, the shift accumulator adds the products of the low-order and high-order input bits.
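To make this dataflow concrete, the following C sketch (a software analogy with assumed dimensions, not the macro itself) mimics a DCIM macro computing a dot product: in every cycle, one bit of each input activation gates the stored multi-bit weight, the adder tree sums the gated weights, and the shift accumulator weights the cycle result by its bit position.

```c
/* Minimal software sketch of the bit-serial DCIM dataflow in Figure 1.
   Array size and bit widths are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define ROWS 16   /* number of weight rows / input channels (assumed) */
#define IN_BITS 8 /* input precision, processed one bit per cycle     */

int main(void) {
    uint8_t weight[ROWS], input[ROWS];
    for (int r = 0; r < ROWS; r++) {          /* arbitrary example data */
        weight[r] = (uint8_t)(r * 7 + 3);
        input[r]  = (uint8_t)(r * 5 + 1);
    }

    int64_t acc = 0;
    for (int b = 0; b < IN_BITS; b++) {        /* one computational cycle per input bit   */
        int64_t partial = 0;
        for (int r = 0; r < ROWS; r++)         /* in-array 1b x multi-bit multiply (AND)  */
            partial += ((input[r] >> b) & 1) ? weight[r] : 0;
        acc += partial << b;                   /* shift accumulator: weight by bit index  */
    }

    int64_t ref = 0;                           /* reference dot product */
    for (int r = 0; r < ROWS; r++) ref += (int64_t)input[r] * weight[r];
    printf("match = %d\n", acc == ref);        /* expected: match = 1 */
    return 0;
}
```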
The basic memory cell in a DCIM memory array is principally implemented with SRAM technology, whose elemental structure is depicted in Figure 2a. The 6T-SRAM cell consists of two cross-coupled inverters formed by transistors M1–M4, which store a single bit of data, and two additional access transistors, M5 and M6, controlling the access to the cell during read and write operations. This fundamental structure induces low storage density, incurring substantial area overhead in large-scale integrated arrays. Concurrently, the volatility of SRAM also leads to high static power dissipation. On the contrary, a SOT-MRAM cell merely necessitates two transistors, significantly diminishing the integrated area cost, as shown in Figure 2b. The SOT-MRAM cell contains an MTJ for data storage, consisting of two ferromagnetic layers separated by a thin insulator. The value of ‘0/1’ is represented by the low/high resistance state of the MTJ. Transistors T1 and T2 are utilized to regulate the reading and writing of the stored data, respectively. Moreover, SOT-MRAM also exhibits excellent characteristics including nonvolatility, high endurance, and rapid switching, rendering it an ideal candidate for DCIM [12,13,14,15,16,17].

2.2. Principle of FP64 Computation

According to the definition of the IEEE binary floating-point arithmetic standard (IEEE 754), the FP64 format uses 8 bytes for data encoding and storage, partitioned into three segments. The highest bit expresses the sign and is abbreviated as S. The next 11 bits express the exponent and are abbreviated as E, while the remaining 52 bits express the fraction, approximating the significant digits of the number, and are denoted as M [18,19]. The actual value of a normalized FP64 number N can be expressed as follows:
N = (-1)^S × (1.M) × 2^(E - 1023)
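The field layout can be checked in a few lines of C. The sketch below (an illustration, not the paper's RTL) extracts S, E, and M from an FP64 word and rebuilds its value with the formula above; only normalized numbers are handled, and the operand is taken from Table 1.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void) {
    double n = 42.60383922849208;            /* operand a from Table 1 */
    uint64_t bits;
    memcpy(&bits, &n, sizeof bits);

    uint64_t S = bits >> 63;                 /* 1-bit sign             */
    uint64_t E = (bits >> 52) & 0x7FF;       /* 11-bit biased exponent */
    uint64_t M = bits & 0xFFFFFFFFFFFFFULL;  /* 52-bit mantissa        */

    double significand = 1.0 + (double)M / (double)(1ULL << 52);   /* 1.M */
    double rebuilt = (S ? -1.0 : 1.0) * ldexp(significand, (int)E - 1023);

    printf("S=%llu E=%llu M=0x%013llx rebuilt==n: %d\n",
           (unsigned long long)S, (unsigned long long)E,
           (unsigned long long)M, rebuilt == n);   /* expected: 1 */
    return 0;
}
```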
Based on this storage format, the execution of FP64 multiplication can be split into five steps, as shown in Figure 3 and described as follows (a software sketch of these steps is given after the list):
(1) Unpack the operands into the FP multiplier. The sign, exponent, and mantissa of each operand are separated. Then, the mantissa is converted to the significand by adding the hidden bit 1 at the top digit.
(2) XOR the sign bits of the two operands to obtain the sign bit of the product.
(3) Multiply the significands of the two operands to obtain the product; this is the key step that restricts the overall computational speed.
(4) Add the exponents of the two operands. The sum is adjusted during normalization of the product.
(5) Pack the parts into the product. Remove the hidden bit, and reassemble the sign, exponent, and mantissa of the result.
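The following C sketch mirrors these five steps in software as an illustration only (it is not the macro's datapath): it unpacks two normalized FP64 operands, multiplies their 53-bit significands into a 106-bit product, adds the exponents, normalizes, rounds to nearest-even, and repacks. Special cases (zero, infinity, NaN, subnormals, overflow) are ignored for brevity, and the GCC/Clang __int128 extension stands in for the wide product.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double fp64_mul(double a, double x) {
    uint64_t ba, bx; memcpy(&ba, &a, 8); memcpy(&bx, &x, 8);
    /* Step 1: unpack sign, exponent, mantissa; restore the hidden bit. */
    uint64_t sa = ba >> 63, sx = bx >> 63;
    int64_t  ea = (ba >> 52) & 0x7FF, ex = (bx >> 52) & 0x7FF;
    uint64_t ma = (ba & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);
    uint64_t mx = (bx & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);
    /* Step 2: sign of the product. */
    uint64_t s = sa ^ sx;
    /* Step 3: 53b x 53b -> 106b significand product. */
    unsigned __int128 p = (unsigned __int128)ma * mx;
    /* Step 4: add exponents (one bias removed) and normalize. */
    int64_t e = ea + ex - 1023;
    if (p >> 105) e += 1;  /* product in [2,4): leading 1 already at bit 105 */
    else          p <<= 1; /* product in [1,2): align leading 1 to bit 105   */
    /* Step 5: round the 53 kept bits to nearest-even and repack. */
    uint64_t kept   = (uint64_t)(p >> 53);                    /* leading 1 + 52 bits */
    uint64_t round  = (uint64_t)(p >> 52) & 1;
    uint64_t sticky = ((uint64_t)p & ((1ULL << 52) - 1)) != 0;
    if (round && (sticky || (kept & 1))) kept += 1;
    if (kept >> 53) { kept >>= 1; e += 1; }                   /* rounding overflow */
    uint64_t out = (s << 63) | ((uint64_t)e << 52) | (kept & 0xFFFFFFFFFFFFFULL);
    double d; memcpy(&d, &out, 8); return d;
}

int main(void) {
    double a = 42.60383922849208, x = 36.74428540910062;  /* Table 1 operands */
    printf("%d\n", fp64_mul(a, x) == a * x);               /* expected: 1 */
    return 0;
}
```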

2.3. Process of MAC Operation

The MAC operation is the most common operation in neural network computing, accounting for over 90% of the computational workload [20,21]. One form of this operation is 'multiplier + adder'. However, the complex data format of floating-point numbers results in high area overhead and latency. Therefore, we construct a fused multiply–accumulate (FMA) instruction to implement the MAC operation:
a × x + b = d
As shown in Figure 4, we first use the exponent difference (e_a + e_x - e_b) to align s_b. Then, we add s_b to the partial product of s_a and s_x. Finally, normalization and rounding are performed. Unlike the 'multiplier + adder' approach, the FMA instruction performs only one rounding operation on the sum of the full-precision product (s_a × s_x) and the addend (s_b), allowing for better results in high-precision calculations. At the same time, the instruction merges a portion of the multipliers and adders, reducing area overhead. It also combines multiplication and addition into a one-step operation, reducing latency.
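C99's fma() in <math.h> behaves like the fused datapath described here: the product is kept at full precision and only the final sum is rounded. A small sketch with an illustrative operand (not taken from the paper) shows the difference from a separate multiply-then-add:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + ldexp(1.0, -30);   /* a*a needs more than 53 significand bits  */
    double p = a * a;                   /* separately rounded product               */

    double sep   = a * a - p;           /* multiply-then-add: low bits already lost */
    double fused = fma(a, a, -p);       /* one rounding on the exact a*a - p        */

    printf("separate: %g\n", sep);      /* prints 0                    */
    printf("fused:    %g\n", fused);    /* prints 2^-60, about 8.7e-19 */
    return 0;
}
```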

3. Our MDCIM Work

3.1. Algorithm and Architecture of FP64 Computation

In the process of FP64 MAC operations, the power consumption and latency of multipliers are much higher than those of adders. In traditional FP64 multiplication, the 52-bit mantissa multiplier includes 52 × 52 AND gates, 52 registers of 52 bits each to store column results, and 51 full adders for shift accumulation, with their bit widths increasing from 54 bits to 104 bits. Since the number of AND gates in the multiplier cannot be reduced, and the full adders have the highest bit widths and incur the largest area overhead, we consider reducing the bit width of the shift adder to decrease the area overhead of the entire multiplier.
Segmenting the two high-bit-width multiplicands is an effective way to reduce the bit width of the multiplier, and we can decrease the area overhead required for a single MAC unit by increasing the number of computations. As shown in Figure 5, the 52-bit mantissa is divided into 4 sets of 13-bit inputs, e.g., a_0–a_3 and x_0–x_3, which are then multiplied pairwise to obtain 16 sets of 26-bit partial products: a_0x_0–a_3x_3. In this process, the number of AND gates and registers remains unchanged (13 × 13 × 4 × 4 = 52 × 52), but due to the reduction in input bit width, the bit width of the full adder is greatly reduced, with the largest full adder being 26 bits wide. Then, we accumulate these partial products. As can be seen from Figure 5, partial products with the same shift are accumulated first. For instance, p_3 is the sum of a_3x_0, a_2x_1, a_1x_2, and a_0x_3. Since this is same-position accumulation, we only need thirteen 26-bit full adders and eight sets of 26-bit registers to store the results. Ultimately, we obtain seven sets of 26-bit partial sums, p_0–p_6, and one set of 15-bit partial sum with carry, p_7. These partial sums can be rearranged into two groups of high-bit-width numbers based on the parity of their shift positions, and then, through a 106-bit full adder, the product of the 52-bit mantissas is obtained. This method of mantissa segmentation can greatly reduce the area overhead of high-bit-width multipliers. Under the same process node, the transistor count of the segmented mantissa multiplier is 0.74× that of the traditional multiplier, and it also achieves a reduction in power consumption.
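As a cross-check of the segmentation idea, the C sketch below (a software illustration using the mantissa fields of the Table 1 operands) splits two 52-bit mantissas into 13-bit limbs, forms the 16 partial products, groups those sharing a shift position into p_0 through p_7, and recombines them. Unlike the hardware, the sketch lets each group grow past 26 bits instead of handling carries with CSAs, and it uses the GCC/Clang __int128 extension only to verify the result against a direct multiplication.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t ma = 0x54d4a9a95352aULL;  /* 52-bit mantissa of a = 42.6038... */
    uint64_t mx = 0x25f44be897d13ULL;  /* 52-bit mantissa of x = 36.7442... */

    uint64_t a[4], x[4], p[8] = {0};
    for (int i = 0; i < 4; i++) {              /* split into 13-bit limbs */
        a[i] = (ma >> (13 * i)) & 0x1FFF;
        x[i] = (mx >> (13 * i)) & 0x1FFF;
    }
    for (int i = 0; i < 4; i++)                /* 16 partial products, grouped   */
        for (int j = 0; j < 4; j++)            /* by their shift position i + j  */
            p[i + j] += a[i] * x[j];           /* each product fits in 26 bits   */

    unsigned __int128 prod = 0;
    for (int k = 0; k < 8; k++)                /* recombine: sum of p[k] << 13k  */
        prod += (unsigned __int128)p[k] << (13 * k);

    unsigned __int128 ref = (unsigned __int128)ma * mx;
    printf("match = %d\n", prod == ref);       /* expected: match = 1 */
    return 0;
}
```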
We propose the MDCIM macro by mapping the segmented mantissa multiplication onto digital circuits. Since the algorithm involves a large number of independent MAC operations, we construct a pipelined multiplication array as the overall architecture. As shown in Figure 6, the macro consists of 16 sets of MAC units, 3 sets of adders, and multilayer latches. Each MAC unit group includes a 13-bit MRAM cell, a 13-bit multiplier, a carry–save adder (CSA), and a full adder (FA).
As shown in Figure 6, the MAC unit performs a preliminary MAC operation between the segmented m_a and m_x. The MRAM cell saves the segmented mantissa m_x, i.e., x_0, x_1, x_2, and x_3. The multiplier multiplies two segmented mantissas, e.g., a_0 and x_0. The CSA is a three-input, two-output adder. The first input to the CSA is the 26-bit partial product from the multiplier, e.g., a_0x_0; the second input is the carry from the lower-order MAC unit; and the final input is the sum from the MAC unit at the same position. Taking MAC unit ③ as an example, the inputs of its CSA are a_2x_1, the carry from unit ②, and the sum from unit ①. The CSA outputs a 26-bit sum_mid and carry_mid, which have overlapping bits. Therefore, the two are fed into the full adder to obtain a 26-bit sum and a 1-bit carry that are sent to other MAC units. We construct a 4 × 4 pipelined array multiplier with 16 sets of MAC units and 3 sets of adders to obtain 8 sets of partial sums: p_0–p_7. In addition, because each column undergoes a different number of computational cycles, multilevel latches are added to the array to ensure time synchronization.
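The carry-save step inside each MAC unit can be illustrated with a generic CSA sketch (arbitrary operand values, not the macro's netlist): the CSA compresses three operands into a sum word and a carry word without propagating carries along the 26-bit width, and a single carry-propagate addition at the end resolves the redundant pair.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 0x2A5F11u, b = 0x19C3E4u, c = 0x0F0F0Fu;    /* three 26-bit inputs */

    uint32_t sum_mid   = a ^ b ^ c;                          /* bitwise sum, no carry   */
    uint32_t carry_mid = ((a & b) | (b & c) | (a & c)) << 1; /* carry bits, shifted up  */

    uint32_t total = sum_mid + carry_mid;                    /* final carry-propagate add */
    printf("match = %d\n", total == a + b + c);              /* expected: match = 1 */
    return 0;
}
```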
However, in the multiplier, the mantissa only participates in a portion of the operation; we still need to consider the contribution of the hidden bits. As shown in Figure 5, after adding the hidden bits, the product of the two significands (s_a × s_x) has two additional components: the shifted s_a and m_x. In order to reduce the complexity of the digital circuits, we place these in the adder tree. As shown in Figure 7, the sum of the shifted s_a, m_x, and the partial sums (p_0–p_7) represents the product of the two significands (s_a × s_x), while the fifth set of data is the exponent-aligned addend s_b. After calculation in the multilevel adder tree, the significand of the result d is obtained.

3.2. Array of SOT-MRAM

The memory array we constructed based on SOT-MRAM is part of the CIM macro. Compared to SRAM and STT-MRAM, the advantages of SOT-MRAM are primarily reflected in its higher write speed and storage density. To evaluate their performance, we used 55 nm CMOS technology to separately synthesize three 16 × 13-bit memory arrays based on SRAM, STT-MRAM, and SOT-MRAM. We added peripheral read-write circuits to them and measured the relevant parameters. The STT-MRAM and SOT-MRAM models follow the designs in References [22,23]. We evaluated the write speed of each model by measuring the write latency of the circuits and assessed the storage density by measuring the layout area of the arrays.
As shown in Figure 8, the memory density of SOT-MRAM arrays is not significantly different from that of STT-MRAM arrays, but both are much higher than that of SRAM arrays. Using SOT-MRAM arrays in CIM macros can therefore reduce the area overhead of the chip. Meanwhile, SOT-MRAM resolves the slow-writing issue of STT-MRAM. It can be observed from the figure that the write speed of SOT-MRAM arrays now even surpasses that of SRAM arrays, which can further lower the latency of the calculation process.

4. Simulation and Discussion

The proposed MDCIM macro model is described using the Verilog-A language, and a large amount of test data was used to simulate and validate the functions designed in the model. FP64 can be converted to the scientific-notation representation of decimal numbers, with defined ranges for its exponent and mantissa. Therefore, during the process of generating random inputs, we constrain the range of the data's exponent to cover the decimal exponent range that FP64 can represent, while the mantissa is generated using the rand function defined in the stdlib.h header of C99. In this manner, we tested two million sets of data covering all ranges of FP64. Due to the pseudorandom distribution nature of the rand function, the random test inputs used in this study follow a normal distribution. Table 1 shows some data examples used to test the function of the digital circuits. We use hexadecimal numbers starting with 0x to represent 64-bit operands. In the first set of normalized numbers, a is 0x40454d4a9a95352a, which is approximately 42.60383922849208 in decimal representation; x is 0x40425f44be897d13, approximately 36.74428540910062; b is 0x408073cc6798cf32, approximately 526.4748069704276; and d is 0x40a057d8496a0647, approximately 2091.9224351055777. Tests show that the output of the MDCIM macro is consistent with the standard reference, with code coverage reaching 100%.
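A sketch of this kind of constrained-random stimulus generation is shown below. The field ranges and the use of C99 rand() follow the description above, but the exact generator in the paper's testbench is not published, so details such as the seed and the exponent distribution are assumptions.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assemble a random normalized FP64 value from a random sign, an exponent
   drawn from the normal range, and a 52-bit mantissa built from rand(). */
static double random_fp64(void) {
    uint64_t sign = (uint64_t)(rand() & 1);
    uint64_t exp  = (uint64_t)(1 + rand() % 2046);     /* biased exponents 1..2046 */
    uint64_t man  = 0;
    for (int i = 0; i < 4; i++)                        /* 4 x 13 bits = 52 bits    */
        man = (man << 13) | (uint64_t)(rand() & 0x1FFF);
    uint64_t bits = (sign << 63) | (exp << 52) | (man & 0xFFFFFFFFFFFFFULL);
    double d; memcpy(&d, &bits, sizeof d); return d;
}

int main(void) {
    srand((unsigned)time(NULL));
    for (int i = 0; i < 5; i++) {
        double a = random_fp64(), x = random_fp64(), b = random_fp64();
        printf("a=%.17g x=%.17g b=%.17g expected d=%.17g\n", a, x, b, a * x + b);
    }
    return 0;
}
```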
In order to evaluate the model, we synthesized the MDCIM macro using 55 nm CMOS technology and measured the relevant parameters. At a working frequency of 150 MHz and a supply voltage of 1.08 V, the macro's power consumption is 0.62 mW, and its energy efficiency reaches 26.9 GOPS/W. We also drew and measured the layout of the memory array and peripheral circuits, with an area of 51,814 μm2, achieving an area efficiency of 0.322 GOPS/mm2. As shown in Table 2, we compared MDCIM with other works supporting floating-point calculation. Ref. [24] is quite representative among them, as it designs an architecture capable of executing FP32 MAC operations. Ref. [24] also adopts the FMA computational philosophy: it first performs the multiplication of the mantissas, then identifies the maximum exponent, completes the shifting of both the product and the addend, and finally carries out the accumulation. Since the mantissa of FP32 has only 23 bits, the authors did not segment the mantissa multiplication, but instead allocated sufficient space for both the multiplier and the adder. In order to fairly compare the energy and area efficiency of different works, we normalized the metrics, as shown in Table 2, to make them equivalent to the scenario of processing FP64. Upon comparison, the energy efficiency of our proposed MDCIM macro, 26.9 GOPS/W at FP64, surpasses that of the majority of comparable studies, including Ref. [24], which is based on a 28 nm process for FP32 MAC operations. The advantage primarily comes from the segmentation algorithm, as well as the short write time and low leakage current of SOT-MRAM. During the comparison, we also found works with better performance, such as Ref. [25]. The differences in performance metrics arise from multiple factors. Firstly, due to the constraints of MRAM manufacturing technology, our architecture's performance still lags behind some leading-edge SRAM-based floating-point computing architectures. However, as the back-end integration of MTJ and CMOS technologies advances, the performance gap is expected to narrow. Meanwhile, we noticed that the data format handled by Ref. [25] is BF16, which is different from FP16: the mantissa of BF16 has only 7 bits, while that of FP16 has 10 bits. This means that the bit width of the multiplier, the major bottleneck restricting the area and speed of the compute macro, is reduced. Therefore, when performing BF16 MAC operations, the memory requirement of the compute macro is halved and the throughput is doubled. This provides a significant insight for designing floating-point computing circuits: by lowering the precision requirements of the data, more optimization space can be given to the hardware. In fact, approximate computing also accelerates computation based on this principle. Finally, to our knowledge, this is the first work exploring an MRAM-based DCIM tailored for high-precision FP64 operations, and we believe this is a noteworthy step forward.

5. Conclusions

DCIM has natural advantages in high-precision computing, but challenges remain in terms of power consumption and area overhead. This paper proposed an all-digital CIM macro based on SOT-MRAM (MDCIM), which can perform FP64 FMA instructions, using CSAs and a pipelined MAC array to optimize the digital computing circuits for reduced power consumption and latency. This work is the first to propose the application of digital CIM with SOT-MRAM to FP64 computing. However, the high bit width and large computation volume brought by floating-point calculation still pose a challenge for on-chip deployment. Introducing approximate computing into floating-point arithmetic could significantly enhance energy efficiency [28,29]: approximate computing can substantially accelerate operations and reduce circuit complexity. At the same time, for binary matrices whose elements are only zero and one, the precision loss brought about by approximate matrices is within an acceptable degree. In MAC operations, we can choose suitable approximate computing methods based on the sparsity of the matrices involved, which could reduce the circuit area used for computation in digital CIM macros. Therefore, approximate floating-point computation will become a very important research direction in the near future.

Author Contributions

Conceptualization, L.L., L.T. and B.P.; methodology, L.T. and B.P.; software, L.T. and J.G.; validation, L.L. and L.T.; formal analysis, J.G.; investigation, Z.L. and B.P.; resources, B.P. and L.L.; data curation, L.T. and B.P.; writing—original draft, L.T., L.L. and B.P.; writing—review and editing, L.L., B.P. and J.Z.; visualization, L.T. and J.Z.; supervision, B.P. and Z.L.; project administration, B.P.; funding acquisition, B.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Laboratory Open Fund of Beijing Smart-Chip Microelectronics Technology Co., Ltd., in part by the National Natural Science Foundation of China (62001019), and in part by the Fundamental Research Funds for the Central Universities (YWF-23-L-1241).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We currently have no additional data.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Chen, J.; Li, J.; Li, Y.; Miao, X.-S. Multiply accumulate operations in memristor crossbar arrays for analog computing. J. Semicond. 2021, 42, 013104.
2. Ahn, J.; Yoo, S.; Mutlu, O.; Choi, K. PIM-enabled instructions. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, 13–17 June 2015; pp. 336–348.
3. Chiu, Y.-C.; Yang, C.-S.; Teng, S.-H.; Huang, H.-Y.; Chang, F.-C.; Wu, Y.; Chien, Y.-A.; Hsieh, F.-L.; Li, C.-Y.; Lin, G.-Y.; et al. A 22nm 4Mb STT-MRAM Data-Encrypted Near-Memory Computation Macro with a 192GB/s Read-and-Decryption Bandwidth and 25.1-55.1TOPS/W 8b MAC for AI Operations. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 178–180.
4. Deaville, P.; Zhang, B.; Verma, N. A 22nm 128-kb MRAM Row/Column-Parallel In-Memory Computing Macro with Memory-Resistance Boosting and Multi-Column ADC Readout. In Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Honolulu, HI, USA, 12–17 June 2022; pp. 268–269.
5. Jung, S.; Lee, H.; Myung, S.; Kim, H.; Yoon, S.K.; Kwon, S.W.; Ju, Y.; Kim, M.; Yi, W.; Han, S.; et al. A crossbar array of magnetoresistive memory devices for in-memory computing. Nature 2022, 601, 211–216.
6. Fujiwara, H.; Mori, H.; Zhao, W.-C.; Chuang, M.-C.; Naous, R.; Chuang, C.-K.; Hashizume, T.; Sun, D.; Lee, C.-F.; Akarvardar, K.; et al. A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3.
7. Chi, P.; Li, S.; Xu, C.; Zhang, T.; Zhao, J.; Liu, Y.; Wang, Y.; Xie, Y. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Republic of Korea, 18–22 June 2016; pp. 27–39.
8. Li, S.; Xu, C.; Zou, Q.; Zhao, J.; Lu, Y.; Xie, Y. Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories. In Proceedings of the 53rd Annual Design Automation Conference, Austin, TX, USA, 5–9 June 2016; pp. 1–6.
9. Wang, J.; Gu, Z.; Wang, H.; Hao, Z. TAM: A Computing in Memory based on Tandem Array within STT-MRAM for Energy-Efficient Analog MAC Operation. In Proceedings of the 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), Antwerp, Belgium, 17–19 April 2023; pp. 1–6.
10. Doevenspeck, J.; Garello, K.; Verhoef, B.; Degraeve, R.; Van Beek, S.; Crotti, D.; Yasin, F.; Couet, S.; Jayakumar, G.; Papistas, I.A.; et al. SOT-MRAM Based Analog in-Memory Computing for DNN Inference. In Proceedings of the 2020 IEEE Symposium on VLSI Technology, Honolulu, HI, USA, 16–19 June 2020; pp. 1–2.
11. Tu, F.; Wang, Y.; Wu, Z.; Liang, L.; Ding, Y.; Kim, B.; Liu, L.; Wei, S.; Xie, Y.; Yin, S. A 28 nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration. In Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 20–26 February 2022; pp. 1–3.
12. Wang, Z.; Zhou, H.; Wang, M.; Cai, W.; Zhu, D.; Klein, J.-O.; Zhao, W. Proposal of Toggle Spin Torques Magnetic RAM for Ultrafast Computing. IEEE Electron Device Lett. 2019, 40, 726–729.
13. Wang, Z.; Zhang, L.; Wang, M.; Wang, Z.; Zhu, D.; Zhang, Y.; Zhao, W. High-Density NAND-Like Spin Transfer Torque Memory with Spin Orbit Torque Erase Operation. IEEE Electron Device Lett. 2018, 39, 343–346.
14. Wang, M.; Cai, W.; Zhu, D.; Wang, Z.; Kan, J.; Zhao, Z.; Cao, K.; Wang, Z.; Zhang, Y.; Zhang, T.; et al. Field-free switching of a perpendicular magnetic tunnel junction through the interplay of spin–orbit and spin-transfer torques. Nat. Electron. 2018, 1, 582–588.
15. Wang, M.; Cai, W.; Cao, K.; Zhou, J.; Wrona, J.; Peng, S.; Yang, H.; Wei, J.; Kang, W.; Zhang, Y.; et al. Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance. Nat. Commun. 2018, 9, 671.
16. Peng, S.; Zhu, D.; Zhou, J.; Zhang, B.; Cao, A.; Wang, M.; Cai, W.; Cao, K.; Zhao, W. Modulation of Heavy Metal/Ferromagnetic Metal Interface for High-Performance Spintronic Devices. Adv. Electron. Mater. 2019, 5, 1900134.
17. Peng, S.; Zhao, W.; Qiao, J.; Su, L.; Zhou, J.; Yang, H.; Zhang, Q.; Zhang, Y.; Grezes, C.; Amiri, P.K.; et al. Giant interfacial perpendicular magnetic anisotropy in MgO/CoFe/capping layer structures. Appl. Phys. Lett. 2017, 110, 072403.
18. Whitehead, N.; Fit-Florea, A. Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs; NVIDIA Whitepaper, 2011.
19. Szydzik, T.; Moloney, D. Precision refinement for media-processor SoCs: fp32->fp64 on Myriad. In Proceedings of the 2014 IEEE Hot Chips 26 Symposium (HCS), Las Palmas, Spain, 10–12 August 2014.
20. Zhang, H.; Chen, D.; Ko, S.B. Efficient multiple-precision floating-point fused multiply-add with mixed-precision support. IEEE Trans. Comput. 2019, 68, 1035–1048.
21. Park, J.; Lee, S.; Jeon, D. A neural network training processor with 8-bit shared exponent bias floating point and multiple-way fused multiply-add trees. IEEE J. Solid-State Circuits 2021, 57, 965–977.
22. Rawat, R.M.; Kumar, V. A Comparative Study of 6T and 8T SRAM Cell With Improved Read and Write Margins in 130 nm CMOS Technology. WSEAS Trans. Circuits Syst. 2020, 19, 13–18.
23. Tohoku University. Researchers Develop 128Mb STT-MRAM with World's Fastest Write Speed for Embedded Memory. ScienceDaily. Available online: www.sciencedaily.com/releases/2018/12/181228164841.htm (accessed on 1 September 2023).
24. Jeong, S.; Park, J.; Jeon, D. A 28nm 1.644TFLOPS/W Floating-Point Computation SRAM Macro with Variable Precision for Deep Neural Network Inference and Training. In Proceedings of the ESSCIRC 2022 - IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 19–22 September 2022; pp. 145–148.
25. Lee, J.; Kim, J.; Jo, W.; Kim, S.; Kim, S.; Lee, J.; Yoo, H.-J. A 13.7 TFLOPS/W Floating-point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory. In Proceedings of the 2021 Symposium on VLSI Circuits, Kyoto, Japan, 13–19 June 2021.
26. Wang, J.; Wang, X.; Eckert, C.; Subramaniyan, A.; Das, R.; Blaauw, D.; Sylvester, D. A Compute SRAM with Bit-Serial Integer/Floating-Point Operations for Programmable In-Memory Vector Acceleration. In Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 17–21 February 2019.
27. Wang, J.; Wang, X.; Eckert, C.; Subramaniyan, A.; Das, R.; Blaauw, D.; Sylvester, D. A 28-nm Compute SRAM With Bit-Serial Logic/Arithmetic Operations for Programmable In-Memory Vector Computing. IEEE J. Solid-State Circuits 2020, 55, 76–86.
28. Leon, V.; Paparouni, T.; Petrongonas, E.; Soudris, D.; Pekmestzi, K. Improving Power of DSP and CNN Hardware Accelerators Using Approximate Floating-point Multipliers. ACM Trans. Embed. Comput. Syst. 2021, 20, 1–21.
29. Gustafsson, O.; Hellman, N. Approximate Floating-Point Operations with Integer Units by Processing in the Logarithmic Domain. In Proceedings of the 2021 IEEE 28th Symposium on Computer Arithmetic (ARITH), Lyngby, Denmark, 14–16 June 2021.
Figure 1. The typical architecture of DCIM macro including memory arrays, adder trees, and accumulators.
Figure 2. (a) The structure of an SRAM cell. (b) The structure of an SOT-MRAM cell.
Figure 3. The standard data format of FP64 and the regulation of FP64 multiplication.
Figure 4. Significands need to be normalized and rounded at the end during FP64 MAC operations.
Figure 5. Divide the significands and multiply each part separately.
Figure 6. A pipeline multiplier and accumulator array for multiplication of segmented mantissas. The red numbers indicate MAC units.
Figure 7. The dataflow of using multilevel adder tree to perform addition on partial products and addends.
Figure 8. The bit density and write times of different memory arrays.
Table 1. Some data examples used to test the function of digital circuits.

| a (input) | x (input) | b (input) | d (output) |
| 0x40454d4a9a95352a (42.60) | 0x40425f44be897d13 (36.74) | 0x408073cc6798cf32 (526.47) | 0x40a057d8496a0647 (2091.92) |
| 0x402b9cf739ee73dd (13.81) | 0x4045124e249c4939 (42.14) | 0x4081a05840b08161 (564.04) | 0x4091e7931bfbd1a3 (1145.89) |
| 0x4019d4b3a96752cf (6.46) | 0x404265e8cbd197a3 (36.80) | 0x407edc24b8497093 (493.76) | 0x4086db06849fa665 (731.38) |
| 0x405676f4ede9dbd4 (89.86) | 0x40340aa015402a80 (20.04) | 0x407726f04de09bc1 (370.43) | 0x40a0f6acacac57ff (2171.34) |
| 0x40580ec81d903b20 (96.23) | 0x40585af4b5e96bd3 (97.42) | 0x4077c61f8c3f187e (380.38) | 0x40c30da89eda44c6 (9755.32) |
Table 2. Comparison of the proposed MDCIM with previous works.

|                                 | ISSCC’19 [26] | JSSC’20 [27] | VLSI’21 [25] | ESSCIRC’22 [24] | This Work |
| Memory array                    | 28 nm SRAM | 28 nm SRAM | 28 nm SRAM | 28 nm SRAM | SOT-MRAM |
| Supply voltage (V)              | 0.6–1.1 | 0.6–1.1 | 0.76–1.1 | 0.5–0.9 | 0.9–1.32 |
| Supported floating-point format | FP8 | FP8 | BF16 | FP32 | FP64 |
| Macro size (mm2)                | 2.55 (Chip) | 2.7 (Die) | 5.832 (Die) | 0.0439 | 0.051814 |
| Frequency (MHz)                 | 475 | 114 | 250 | 400 | 150 |
| GOPS/W/FP64                     | 8.59 | 4.86 | 47.5 | 24.75 | 26.9 |
| GOPS/mm2/FP64                   | 0.354 | 0.426 | 1.28 | 1.21 | 0.322 |
