Article

A New Carry Look-Ahead Adder Architecture Optimized for Speed and Energy

by
Padmanabhan Balasubramanian
* and
Douglas L. Maskell
College of Computing and Data Science, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3668; https://doi.org/10.3390/electronics13183668
Submission received: 20 July 2024 / Revised: 12 September 2024 / Accepted: 12 September 2024 / Published: 15 September 2024

Abstract

We introduce a new carry look-ahead adder (NCLA) architecture that employs non-uniform-size carry look-ahead adder (CLA) modules, in contrast to the conventional CLA (CCLA) architecture, which utilizes uniform-size CLA modules. We adopted two strategies for the implementation of the NCLA. Our novel approach enables improved speed and energy efficiency for the NCLA architecture compared to the CCLA architecture without incurring significant area and power penalties. Various adders were implemented to demonstrate the advantages of NCLA, ranging from the slow ripple carry adder to the Kogge–Stone adder, widely regarded as the fastest parallel-prefix adder, and their performance metrics were compared. The 32-bit addition was used as an example, with the adders implemented using a semi-custom design method and a 28 nm CMOS standard cell library. Synthesis results show that the NCLA architecture offers substantial improvements in design metrics compared to its high-speed counterparts. Specifically, an NCLA achieved (i) a 14.7% reduction in delay and a 13.4% reduction in energy compared to an optimized CCLA, while occupying slightly more area; (ii) a 42.1% reduction in delay and a 58.3% reduction in energy compared to a conditional sum adder, with an 8% increase in area; (iii) a 14.7% reduction in delay and a 37.7% reduction in energy compared to an optimized carry select adder, while requiring 37% less area; and (iv) a 20.2% reduction in energy and a 55.4% reduction in area compared to the Kogge–Stone adder.

1. Introduction

High-speed and energy-efficient adders have several practical applications. In digital signal processing applications like audio, image, and video processing, high-speed and energy-efficient adders are crucial where real-time performance is essential [1,2]. Energy-efficient adders are vital for portable and embedded systems, where power constraints are critical [3]. In microprocessors, high-speed adders improve the overall speed of arithmetic operations [4], which is fundamental for CPU performance. In cryptography, high-speed adders accelerate cryptographic algorithms, which often involve numerous arithmetic operations [5,6]. In graphics processing units, high-speed arithmetic units are essential [7] for fast rendering of graphics in gaming, simulations, and virtual reality. For artificial intelligence and machine learning applications, high-speed adders enhance the performance of neural network training and inference [8]. In networking and communication systems, high-speed adders support the fast processing of data packets, essential for high-speed internet and communication systems [9]. Energy-efficient arithmetic circuits are necessary for network devices [10] that operate continuously, to reduce power and operational costs. In consumer electronics, high-speed processing units improve the responsiveness and smoothness of user interfaces [11] in devices like smartphones, tablets, and smart TVs. In the realm of the Internet of Things (IoT), energy-efficient adders are crucial for IoT devices [12], which usually rely on battery power and need to operate for long periods without recharging. In biomedical technology [13], wearables and portable medical equipment require high-speed and energy-efficient processing for applications like health monitoring, to ensure long battery life and fast real-time performance. Thus, the development of high-speed and energy-efficient adders is important to advance the efficiency of modern electronics and computing systems.
Arithmetic operations, such as addition and multiplication, are major contributors to power consumption in computing systems. For example, over 70% of power usage in graphics processing units is attributed to these operations [14]. Similarly, approximately 80% of power consumption in fast Fourier transform processors is linked to adders and multipliers [15]. Adders are vital components in the data paths of digital signal processing units, significantly impacting computer arithmetic. Addition is noted to be the most frequently executed operation in real-time digital signal processing benchmarks [16]. Furthermore, an analysis of an ARM processor’s arithmetic and logic unit showed that additions account for nearly 80% of its workload [17]. Therefore, the design of high-speed and energy-efficient adders is essential for optimizing digital electronic circuits and systems.
Many adder architectures have been described in the literature [18,19], including the ripple carry adder, carry skip adder, conditional sum adder, carry select adder, carry look-ahead adder, and the family of parallel-prefix adders including the Brent–Kung adder, Sklansky adder, and Kogge–Stone adder, among others. Next, each of these adder architectures will be briefly discussed.
The Ripple Carry Adder (RCA) is a fundamental adder architecture comprising a series of one-bit full adders. In an RCA, the carry output from each full adder is fed into the carry input of the subsequent full adder. The addition process begins with the least significant bit (LSB) and progresses linearly to the most significant bit (MSB). While the RCA is advantageous in terms of minimal area usage and low power dissipation, its primary limitation is the propagation delay due to the linear carry propagation, which restricts its speed. An RCA variant introduced in [20] employs a cascade of two-bit full adders instead of the traditional one-bit full adders. This modification results in faster operation compared to the standard RCA but at the cost of increased area and power dissipation. Reference [20] demonstrated that, for an RCA, although the two-bit full adder cascade enhances speed, the conventional one-bit full adder cascade remains more efficient in terms of area and power dissipation.
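As a concrete illustration of the ripple carry structure described above, the following Verilog sketch chains 1-bit full adders so that each carry output feeds the next stage's carry input. The module and signal names are illustrative and do not correspond to any specific design reported in this work.

```verilog
// Illustrative 32-bit ripple carry adder built from a chain of 1-bit full adders.
module full_adder (input a, b, cin, output sum, cout);
  assign sum  = a ^ b ^ cin;
  assign cout = (a & b) | ((a ^ b) & cin);
endmodule

module rca32 (input [31:0] a, b, input cin, output [31:0] sum, output cout);
  wire [32:0] c;                 // carry chain: c[i] feeds stage i
  assign c[0] = cin;
  genvar i;
  generate
    for (i = 0; i < 32; i = i + 1) begin : stage
      full_adder fa (.a(a[i]), .b(b[i]), .cin(c[i]), .sum(sum[i]), .cout(c[i+1]));
    end
  endgenerate
  assign cout = c[32];           // carry ripples linearly from LSB to MSB
endmodule
```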
The Carry Skip Adder (CSKA) [21], also known as the carry bypass adder, improves addition speed by skipping carry propagation through specific groups of bits, thereby reducing overall delay. This is accomplished by predicting whether the carry will propagate through a group or can be bypassed, based on the input bits’ values. The effectiveness of carry skipping relies on the distribution of carry inputs across the input numbers. When carry bits are concentrated within certain groups, the carry–skip mechanism can effectively predict and skip carry propagation, resulting in faster addition. However, if the input patterns cause frequent carry propagation across multiple groups, the CSKA’s performance may resemble that of a conventional RCA, limiting its advantages.
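The carry-skip mechanism can be sketched in Verilog for a single 4-bit group as follows: the group's bitwise propagate signals are ANDed, and a multiplexer forwards the group carry input directly to the group carry output whenever every bit position propagates. This is only an illustrative sketch; the grouping and gate-level details of a practical CSKA may differ.

```verilog
// Illustrative 4-bit carry-skip group: the internal ripple chain is bypassed
// when all bit positions propagate (p = a ^ b for every bit).
module cska_group4 (input [3:0] a, b, input cin, output [3:0] sum, output cout);
  wire [3:0] p = a ^ b;          // per-bit propagate signals
  wire [4:0] c;
  assign c[0] = cin;
  genvar i;
  generate
    for (i = 0; i < 4; i = i + 1) begin : ripple
      assign sum[i]  = p[i] ^ c[i];
      assign c[i+1]  = (a[i] & b[i]) | (p[i] & c[i]);
    end
  endgenerate
  // skip (bypass) multiplexer: forward cin directly when the whole group propagates
  assign cout = (&p) ? cin : c[4];
endmodule
```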
The Conditional Sum Adder (CSA) [22] dynamically adjusts its structure based on the carry input from previous bit positions. If there is no carry input from a previous position, the CSA utilizes a simpler adder structure to compute the sum. Conversely, if there is a carry input, the CSA switches to a more complex adder structure. This internal adjustment based on the presence or absence of carry bits optimizes performance by minimizing propagation delay and improving efficiency, especially for addition operations with varying input conditions. While the CSA offers advantages in terms of speed and efficiency, its dynamic adjustment mechanism, which internally switches between simpler and more complex adder structures based on the carry input, adds significant complexity to the design. The intricate design of the CSA can lead to longer development times. Due to its complex internal structure, the CSA typically requires more hardware resources, resulting in greater area consumption and, in turn, higher power dissipation. The CSA should be carefully designed and optimized to ensure that it functions correctly and efficiently. While the CSA may be efficient for some bit lengths, its complexity can make it less scalable to large bit widths. Typically, as the number of bits increases, the design of the CSA’s internal structures becomes more challenging, potentially limiting its practical application to large and very large bit-width additions.
The Carry Select Adder (CSLA) [23] is a high-speed adder that typically consists of groups of parallel adders, each capable of generating sum and carry outputs for a specific carry input condition. One adder calculates the sum and carry output assuming a carry input of 0, while the other does the same calculation assuming a carry input of 1. These parallel adders are usually implemented using the RCA architecture. An alternative implementation of the CSLA involves using one adder to produce the sum and carry outputs for a carry input of 0 and incrementing these outputs by 1 using an add-one circuit or binary to excess-1 code converter [24]. The final sum and carry outputs of the CSLA are selected from the outputs of the parallel adders using 2-to-1 multiplexers, with the actual carry input serving as the select signal. Despite the significant logic complexity of the CSLA due to the use of parallel adders or an add-one circuit, it offers high-speed performance. However, the increased hardware complexity can increase the area utilization and power dissipation of the CSLA compared to its counterparts.
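A minimal Verilog sketch of one carry-select group is given below: two adders evaluate the group in parallel for the two possible carry inputs, and the actual carry input selects between them. The "+" operators stand in behaviorally for the duplicated ripple carry adders usually used in a CSLA; the module name and the 8-bit group width are illustrative.

```verilog
// Illustrative 8-bit carry-select group: two results computed in parallel,
// one assuming carry-in = 0 and one assuming carry-in = 1, selected by the real carry.
module csla_group8 (input [7:0] a, b, input cin, output [7:0] sum, output cout);
  wire [8:0] r0 = {1'b0, a} + {1'b0, b};           // result assuming carry-in = 0
  wire [8:0] r1 = {1'b0, a} + {1'b0, b} + 9'd1;    // result assuming carry-in = 1
  assign {cout, sum} = cin ? r1 : r0;              // 2-to-1 selection by the actual carry input
endmodule
```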
The Carry Look-Ahead Adder (CLA) [25] reduces the linear propagation delay seen in a traditional RCA by precomputing carry signals for each bit position using sets of logic gates, rather than depending on the carry propagation from previous stages. This approach enables parallel computation of carry signals across multiple-bit positions, allowing for faster addition of large binary numbers. The main disadvantage of the CLA is its increased area overhead and complexity compared to simpler adder architectures like the RCA. This is because the precomputation of carry signals for each bit position requires additional logic gates, leading to higher area usage. Despite this, CLAs are particularly beneficial for applications requiring high-speed arithmetic operations. Additionally, a variant of the CLA, known as the Ling adder, was also presented in the literature [26].
A Parallel Prefix Adder (PPA) [27] computes the sum of multiple binary numbers in parallel using a tree-like structure composed of prefix computation blocks. These blocks perform prefix operations, such as carry generation and propagation in parallel across multiple stages, enabling high-speed addition with reduced propagation delay. Various PPAs have been discussed in the literature, including the Brent–Kung Adder (BKA) [28], Sklansky Adder [29], and Kogge–Stone Adder (KSA) [30]. Each PPA aims to enhance the addition performance via parallel computation of prefix operations, but they differ in implementation, performance, and area overhead. For example, the Brent–Kung Adder (BKA) utilizes a balanced tree structure to efficiently compute prefix operations, offering good performance with relatively low area overhead. The Sklansky Adder employs a recursive structure, known for its simplicity and regularity, although it may not be as efficient in performance or area utilization as other PPAs. The Kogge–Stone Adder (KSA) uses a highly parallel binary tree structure for prefix operations, making it one of the fastest PPAs. It is renowned for its scalability and ability to handle large bit widths effectively, but it typically requires more area than other PPAs.
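The prefix computation underlying these PPAs can be captured by an associative operator on (generate, propagate) pairs, which the tree structures apply repeatedly across their stages. The following Verilog sketch of a prefix ("black") cell is illustrative and is not tied to any particular PPA topology discussed here.

```verilog
// Illustrative parallel-prefix "black cell": combines (G, P) pairs using the operator
// (G_hi, P_hi) o (G_lo, P_lo) = (G_hi | (P_hi & G_lo), P_hi & P_lo).
module prefix_cell (input g_hi, p_hi, g_lo, p_lo, output g, p);
  assign g = g_hi | (p_hi & g_lo);   // group generate
  assign p = p_hi & p_lo;            // group propagate
endmodule
```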
In this article, we introduce a New CLA (NCLA) architecture to primarily enhance the speed compared to the Conventional CLA (CCLA) architecture. Nonetheless, the NCLA was found to achieve improved speed and energy efficiency compared to the CCLA and many other high-speed adders. A preliminary version of this work was presented at the 2024 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM ’24) [31], and this article is an extended version. Compared to [31], the following additional material has been included in this article:
  • Standard design metrics of an assortment of 32-bit NCLAs are given for comparison;
  • Areas of CLA modules comprising different NCLAs are provided;
  • A new theoretical delay model based on synthesis has been developed to reliably predict the critical path delays of NCLAs. A comparison between theoretically calculated and practically estimated critical path delays of various NCLAs is provided;
  • The energy efficiency of different NCLAs, represented by the product of total power dissipation and critical path delay, is provided, indicating which NCLA configuration is better optimized in terms of energy for the example 32-bit addition;
  • The energy metric of diverse adders including the proposed NCLA is provided for comparison;
  • The product of area and delay of various 32-bit adders is portrayed for comparison.
Section 1 introduced high-speed and energy-efficient adders, along with an overview of popular adder architectures. Following this, the rest of the article is structured as follows. Section 2 details the existing CCLA architecture and the proposed NCLA architecture. Section 3 discusses the implementation methodology and presents the design metrics of various adders, highlighting the performance improvements and energy efficiency achieved by the proposed adder architecture. Section 4 offers concluding remarks.

2. CLA Architectures—Conventional and Proposed

In this section, we first discuss the basic principles of carry look-ahead and then describe the conventional and proposed CLA architectures.
The CLA effectively reduces the linear propagation delay of a traditional RCA. It achieves this by precomputing carry signals for each bit position using sets of logic gates instead of relying on linear carry propagation from previous stages. This method enables the parallel computation of carry signals across multiple bit positions, facilitating faster binary addition. Assuming $A_K$ and $B_K$ are two binary inputs and $C_K$ is the carry input from a previous addition stage, the sum $S_K$ and carry output $C_{K+1}$ of the present addition stage are expressed by the following equations, where $P_K$ represents the carry–propagate signal and $G_K$ represents the carry–generate signal.
$S_K = (A_K \oplus B_K) \oplus C_K = P_K \oplus C_K$  (1)
$C_{K+1} = (A_K B_K) + (A_K \oplus B_K) C_K = G_K + P_K C_K$  (2)
In Equation (1), if the addition inputs are mutually exclusive and the carry input is 0, the sum will be 1. Conversely, if the addition inputs are mutually inclusive and the carry input is 1, the sum will be 1. In Equation (2), when $P_K$ is 1, the carry input to an addition stage is forwarded as the carry output, which then becomes the carry input to the next addition stage. Alternatively, when $G_K$ is 1, a carry output is generated from one addition stage and provided as the carry input to the next addition stage. Thus, knowledge of the adder inputs allows the prediction of the carry output for any adder stage based on the carry input.
Equation (2) is inherently recursive, and by leveraging this property, the carry outputs of a 4-bit CLA module can be derived as shown in Equations (3)–(6). Here, $C_0$ represents the carry input to the CLA module, and $C_1$, $C_2$, $C_3$, and $C_4$ represent the carry outputs of the first, second, third, and fourth addition stages of that CLA module. $P_0$ to $P_3$ and $G_0$ to $G_3$ denote the carry–propagate and carry–generate signals for the first through fourth addition stages.
$C_1 = G_0 + P_0 C_0$  (3)
$C_2 = G_1 + P_1 G_0 + P_1 P_0 C_0$  (4)
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$  (5)
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$  (6)
If the carry input to a CLA module, $C_0$, is 0, Equations (3)–(6) reduce to Equations (7)–(10) as given below. These simplified equations show how the look-ahead carry outputs are determined when the initial carry input is zero.
$C_1 = G_0$  (7)
$C_2 = G_1 + P_1 G_0$  (8)
$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0$  (9)
$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0$  (10)
Figure 1 illustrates the gate-level implementation of a 4-bit CLA without a carry input, while Figure 2 presents a delay-optimized gate-level implementation of a 4-bit CLA with a carry input [32].
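For reference, the following Verilog sketch expresses a 4-bit CLA module with a carry input directly from Equations (1)–(6). It is a dataflow illustration; the delay-optimized gate-level structure of Figure 2 [32] may decompose the logic differently, and the module and port names are illustrative.

```verilog
// Dataflow sketch of a 4-bit CLA module with a carry input, written from Equations (1)-(6).
module cla4_cin (input [3:0] a, b, input cin, output [3:0] sum, output cout);
  wire [3:0] p = a ^ b;          // carry-propagate signals
  wire [3:0] g = a & b;          // carry-generate signals
  wire c1 = g[0] | (p[0] & cin);
  wire c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin);
  wire c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin);
  assign cout = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
              | (p[3] & p[2] & p[1] & p[0] & cin);
  assign sum  = p ^ {c3, c2, c1, cin};   // sum bits per Equation (1)
endmodule
```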
An N-bit CLA is typically composed of a cascade of M-bit CLA modules, where both N and M are even numbers, and N is an exact multiple of M. The carry output from one CLA module is passed as the carry input to the next CLA module in the cascade. For instance, a 32-bit CLA can be constructed using sixteen 2-bit CLA modules, eight 4-bit CLA modules, or four 8-bit CLA modules connected in a cascade, as shown in Figure 3. Although it is possible to realize a 32-bit CLA using two 16-bit CLA modules, it is advisable not to use large basic CLA modules. This is because, as Equations (3)–(6) or (7)–(10) indicate, the number and size of the product terms comprising a look-ahead carry output increase linearly with the module size. This would lead to an increase in the number of logic gates and logic levels, resulting in a high critical path delay. Therefore, large-sized CLA modules are not recommended for constructing N-bit CLAs. In Figure 3a–c, the least significant CLA module does not have a carry input, while the other modules have a carry input. For example, the least significant 4-bit CLA module in Figure 3b is represented by Figure 1, and the 4-bit CLA modules with carry input used in Figure 3b are represented by Figure 2. Correlating Figure 3b with Figure 1 and Figure 2, the theoretical critical path of the 32-bit CCLA comprising 4-bit CLA modules can be understood as follows: the critical path of the least significant 4-bit CLA module without a carry input would be as highlighted by the pink dotted line in Figure 1, the critical path of more significant but intermediate 4-bit CLA modules would be as marked by the red dotted line in Figure 2, and the critical path of the most significant 4-bit CLA module would be as indicated by the blue dotted line in Figure 2.
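The cascade of Figure 3b can be sketched structurally in Verilog as follows, assuming a cla4_nocin module that implements Equations (7)–(10) (Figure 1) and the cla4_cin module sketched earlier (Figure 2). These module and port names are placeholders and do not reproduce the exact code synthesized in this work.

```verilog
// Structural sketch of a 32-bit CCLA built from eight 4-bit CLA modules (Figure 3b).
module ccla32_4x8 (input [31:0] a, b, output [31:0] sum, output cout);
  wire [7:0] c;   // carry handed from each CLA module to the next in the cascade
  cla4_nocin m0 (.a(a[3:0]), .b(b[3:0]), .sum(sum[3:0]), .cout(c[0]));
  genvar i;
  generate
    for (i = 1; i < 8; i = i + 1) begin : stage
      cla4_cin mi (.a(a[4*i +: 4]), .b(b[4*i +: 4]), .cin(c[i-1]),
                   .sum(sum[4*i +: 4]), .cout(c[i]));
    end
  endgenerate
  assign cout = c[7];
endmodule
```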
In this work, we investigate N-bit CLA implementations using non-uniform-sized CLA modules, which we refer to as the NCLA architecture, to distinguish it from the CCLA architecture that uses only uniform-size CLA modules. The main motivation behind proposing the NCLA architecture is to enable a reduced critical path delay compared to the existing CCLA architecture. Considering the 32-bit addition as a case study, we designed several 32-bit NCLAs using various combinations of non-uniform size CLA modules arranged in a cascade structure. Among these designs, some interesting configurations from a delay perspective are depicted in Figure 4. Figure 4a shows a 32-bit NCLA consisting of four 6-bit CLA modules and two 4-bit CLA modules. Figure 4b portrays a 32-bit NCLA comprising two 8-bit CLA modules, three 4-bit CLA modules, and two 2-bit CLA modules. Figure 4c illustrates a 32-bit NCLA incorporating two 8-bit CLA modules and four 4-bit CLA modules.
The NCLA architecture has been developed based on the following two strategies:
  • Moderate-size CLA modules are used in the least significant bit positions, while bigger-size CLA modules are used in the more significant bit positions (as seen in Figure 4a,c).
  • Small-size CLA modules are used in the least significant bit positions, moderate-size CLA modules in the more significant bit positions, and bigger-size CLA modules in the most significant bit positions (as depicted in Figure 4b).
The underlying idea is that, in an NCLA, when small- or moderate-size CLA modules are used in the least significant bit positions and are succeeded by bigger-size CLA modules in the more significant bit positions, the delay of the least significant CLA modules tends to be absorbed by the delay of the more significant CLA modules due to their greater size. This tends to lead to a shorter critical path for an NCLA compared to a CCLA. Alternatively, in an NCLA, the number of addition stages involving CLA modules may be reduced when a mix of different-size CLA modules is used compared to the use of same-size CLA modules in a CCLA, and this also tends to optimize the NCLA’s critical path delay relative to the CCLA. However, the delays of different NCLA configurations could vary depending on the choice of CLA modules used for different groups of bit positions.
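Following the two strategies above, a structural sketch of NCLA-8844422 (Figure 4b) is shown below, with the smaller CLA modules placed at the least significant positions and the 8-bit modules at the most significant positions. The CLA module names (cla2_nocin, cla2_cin, cla4_cin, cla8_cin) are assumed placeholders for Figure 1/Figure 2 style modules and do not reproduce the exact code used for the reported results.

```verilog
// Structural sketch of the 32-bit NCLA-8844422: 2-, 4-, and 8-bit CLA modules in cascade,
// ordered from the least significant to the most significant bit positions.
module ncla32_8844422 (input [31:0] a, b, output [31:0] sum, output cout);
  wire c2, c4, c8, c12, c16, c24;   // carries passed between the CLA modules
  cla2_nocin u0 (.a(a[1:0]),   .b(b[1:0]),               .sum(sum[1:0]),   .cout(c2));
  cla2_cin   u1 (.a(a[3:2]),   .b(b[3:2]),   .cin(c2),   .sum(sum[3:2]),   .cout(c4));
  cla4_cin   u2 (.a(a[7:4]),   .b(b[7:4]),   .cin(c4),   .sum(sum[7:4]),   .cout(c8));
  cla4_cin   u3 (.a(a[11:8]),  .b(b[11:8]),  .cin(c8),   .sum(sum[11:8]),  .cout(c12));
  cla4_cin   u4 (.a(a[15:12]), .b(b[15:12]), .cin(c12),  .sum(sum[15:12]), .cout(c16));
  cla8_cin   u5 (.a(a[23:16]), .b(b[23:16]), .cin(c16),  .sum(sum[23:16]), .cout(c24));
  cla8_cin   u6 (.a(a[31:24]), .b(b[31:24]), .cin(c24),  .sum(sum[31:24]), .cout(cout));
endmodule
```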
The sizes of CLA modules vary depending upon their radix. Figure 5 shows the areas of CLA modules ranging from 2 bits to 10 bits, without and with the carry input. The area of a 12-bit CLA module without the carry input is excluded as that has not been used for the NCLAs we considered. All the CLA modules were described structurally using gate primitives in Verilog HDL and synthesized with Synopsys DesignCompiler (version: Q-2019-12-SP5) using a 28 nm CMOS standard cell library [33]. The synthesis details are given in the next section.

3. Implementation and Design Metrics

We focused on a 32-bit addition as a case study for this research, although our approach applies to additions of any size. Initially, we described several 32-bit NCLAs that utilize various combinations of non-uniform size CLA modules at the gate level in Verilog HDL and synthesized them to evaluate their performance metrics. The goal was to determine which NCLA(s) are better optimized for speed. Synopsys EDA tools and a standard cell library were used for synthesis, simulation, and estimation of standard design metrics such as critical path delay, total area, and total power dissipation. DesignCompiler was used for synthesis, targeting the typical-case PVT specification of a low-leakage 28 nm CMOS standard cell library [33], with a supply voltage of 1.05 V and an operating junction temperature of 25 °C. Default wire load models were applied during synthesis, and a fanout-of-4 drive strength was assigned to the adders’ sum bits. NCLAs were synthesized using DesignCompiler through the ‘compile’ command with speed defined as the optimization goal. Following synthesis, the gate-level netlists of NCLAs were subjected to functional simulation using VCS (version: 2020_12_SP2_6). A test bench containing approximately a thousand random inputs was applied to the NCLAs at a latency of 4 ns to accommodate the RCA’s speed for comparison. A virtual clock (with a period of 8 ns) was used merely to constrain the adder inputs and outputs during synthesis. However, since the clock used is virtual, it neither formed a part of the designs nor contributed to the design metrics. PrimeTime (version: vO-2018-06-SP5-2) was used to estimate the critical path delay, and PrimePower was used to estimate total power dissipation based on the switching activity data gathered from functional simulations. DesignCompiler provided total area estimates for the adders, encompassing cell and interconnect areas. Table 1 shows the standard design metrics of different 32-bit NCLAs. The breakdown of the total area of various NCLAs in terms of the areas of their constituent cells and interconnects (net) is also given in Table 1.
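A simplified, self-checking test bench in the spirit of this flow is sketched below. The exact test bench, vector count, and timing used to obtain the reported switching activity may differ; the DUT name and ports are assumptions carried over from the earlier sketches.

```verilog
// Simplified self-checking test bench: random 32-bit operands are applied and the
// DUT's result is checked against the behavioral sum. Names and timing are illustrative.
`timescale 1ns/1ps
module tb_ncla32;
  reg  [31:0] a, b;
  wire [31:0] sum;
  wire        cout;
  integer     i;

  ncla32_8844422 dut (.a(a), .b(b), .sum(sum), .cout(cout));

  initial begin
    for (i = 0; i < 1000; i = i + 1) begin
      a = $random; b = $random;
      #8;  // one random input vector every 8 ns
      if ({cout, sum} !== ({1'b0, a} + {1'b0, b}))
        $display("Mismatch: %h + %h -> %h_%h", a, b, cout, sum);
    end
    $finish;
  end
endmodule
```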
Table 1 uses specific legends to refer to different NCLA configurations for ease of referencing. For instance, NCLA-666644 means a 32-bit NCLA has been constructed using four 6-bit CLA modules and two 4-bit CLA modules; NCLA-1010102 means a 32-bit NCLA has been constructed using three 10-bit CLA modules and one 2-bit CLA module; and NCLA-128444 means a 32-bit NCLA has been constructed using one 12-bit CLA module, one 8-bit CLA module, and three 4-bit CLA modules. Given these, it is assumed that the readers can interpret the other NCLA configurations listed in Table 1.
N-bit NCLAs featuring different combinations of CLA modules would have variations in their area occupancies due to the differences in the size of CLA modules, as portrayed in Figure 5. Hence, in Table 1, slight differences in cell area, interconnect area, and total area are noticed between NCLAs although they all are of the same width.
Among the different NCLAs listed in Table 1, a few exhibit sub-nanosecond delays. NCLA-8844422 and NCLA-8864222 achieve the lowest sub-nanosecond critical path delay of 0.99 ns, outperforming other NCLAs. Both NCLA-8844422 and NCLA-8864222 have almost identical areas and power dissipation, with NCLA-8844422 reporting a slight advantage. Therefore, NCLA-8844422 is preferable to other NCLAs in terms of the critical path delay.
An important takeaway from Table 1 is that among the two strategies used to realize the NCLA architecture (mentioned in the previous section), for a 32-bit addition, it is optimal to treat a 2-bit CLA module as a small-size module, a 4-bit/6-bit CLA module as a moderate-size module, and an 8-bit CLA module as a large-size module. When larger-size CLA modules, such as 10-bit or 12-bit modules, are used to realize an NCLA, a better optimization in delay is not achieved. This is evident from the critical path delays mentioned in the last five rows of Table 1. As shown in Equation (6), the final carry output of an M-bit CLA module has (M + 1) product terms, with the largest product term containing (M + 1) literals. Consequently, the number of logic levels involved in producing the final carry output, after logic decomposition using a synthesis tool, tends to be proportional to M. Therefore, large-size CLA modules experience increased propagation delay, which would negatively impact the critical path delay of an NCLA. Hence, CLA modules should be carefully selected and positioned to effectively optimize the critical path delay of an NCLA.
Given the possibility of forming many NCLA configurations using various combinations of different-size CLA modules, as shown in Table 1, the theoretical modeling of the critical path delay of an NCLA may be useful to gain insight into an optimum arrangement of diverse CLA modules within it. A conventional delay model (CDM) identifies the critical path and the gates present in the critical path, and the propagation delays of gates belonging to a standard cell library are substituted into the CDM to calculate the theoretical critical path delay. However, we did not find the CDM to be appropriate for the theoretical delay estimation of an NCLA; the reason for this is explained next. Therefore, we developed a new synthesis-based delay model (SDM) to provide a useful and reliable theoretical delay estimate of an NCLA. It may be noted here that the CDM is not synthesis-based. Nevertheless, both CDM and SDM are approximate since the delays of interconnects and/or parasitics are not accounted for; only the propagation delays of the gates present in the critical path are considered, taken directly from a cell library datasheet.
In an N-bit NCLA comprising different size CLA modules, the critical path starts from a K-bit CLA module which may or may not have a carry input. Referring to Equation (6), the final carry output of a K-bit CLA module has (K + 1) product terms, with the largest product term containing (K + 1) literals. According to Figure 2, this implies that the gates present in the critical path of a K-bit CLA module with a carry input are a 2-input XOR gate, a K-input AND gate, a K-input OR gate, and an AO21 complex gate. On the other hand, referring to Equation (10), the final carry output of a K-bit CLA module with no carry input has K product terms, with the largest product term containing K literals. Hence, according to Figure 1, the gates present in the critical path of a K-bit CLA module without carry input are a 2-input XOR gate, a K-input AND gate, and a K-input OR gate. Either way, a K-input AND gate, and a K-input OR gate are present in a K-bit CLA module if the critical path originates from it. In modern standard cell libraries, the fan-in of simple logic gates is usually limited to 4. Given this, there arises a need to decompose K-input AND and OR gates when K > 4 to perform a theoretical delay modeling. Though K-input AND and OR gates may be decomposed manually for the sake of theoretical delay modeling, such a manual decomposition may not be the same as the physical decomposition of high fan-in AND and OR gates performed by a synthesis tool. As a result, a proper correlation may not be established between theoretically calculated delays and practically estimated delays thus rendering the theoretical delay modeling unreliable. In other words, the theoretically calculated delay might suggest a particular NCLA configuration to be the fastest which may not be practically true, and we found this to be the case with CDM based on our analysis of different NCLA configurations shown in Table 1.
To elucidate the problem with CDM, we discuss the theoretical delay calculation of some 32-bit NCLAs based on it. Specifically, we consider NCLA-8844422, NCLA-1010102, NCLA-1010444, NCLA-1244444, NCLA-128444, and NCLA-12884 for analysis here. The critical path delay of these NCLAs, based on CDM, is theoretically expressed by the following delay equations.
$D_{\text{NCLA-8844422}} = (D_{\text{XOR2}} + D_{\text{AND2}} + D_{\text{OR2}}) + (5 \times D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (11)
$D_{\text{NCLA-1010102}} = (D_{\text{XOR2}} + 2 \times D_{\text{AND4}} + 2 \times D_{\text{OR4}}) + (D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (12)
$D_{\text{NCLA-1010444}} = (D_{\text{XOR2}} + D_{\text{AND4}} + D_{\text{OR4}}) + (3 \times D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (13)
$D_{\text{NCLA-1244444}} = (D_{\text{XOR2}} + D_{\text{AND4}} + D_{\text{OR4}}) + (4 \times D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (14)
$D_{\text{NCLA-128444}} = (D_{\text{XOR2}} + D_{\text{AND4}} + D_{\text{OR4}}) + (3 \times D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (15)
$D_{\text{NCLA-12884}} = (D_{\text{XOR2}} + D_{\text{AND4}} + D_{\text{AND2}} + D_{\text{OR4}} + D_{\text{OR2}}) + (D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (16)
In Equations (11)–(16), $D_{\text{AND2}}$ and $D_{\text{AND4}}$ represent the propagation delays of 2-input and 4-input AND gates, $D_{\text{OR2}}$ and $D_{\text{OR4}}$ represent the propagation delays of 2-input and 4-input OR gates, $D_{\text{XOR2}}$ represents the propagation delay of a 2-input XOR gate, and $D_{\text{AO21}}$ represents the propagation delay of an AO21 complex gate. In Equations (11)–(16), on the right side, the first term given within brackets denotes the propagation delay of a CLA module from where the critical path originates, the second term given within brackets denotes the (sum of) propagation delay(s) of intermediate CLA module(s), and the third term given within brackets denotes the propagation delay of the final CLA module. For example, in NCLA-1010444, the critical path may originate from the first 4-bit CLA module, traversing three intermediate CLA modules (i.e., two 4-bit CLA modules and one 10-bit CLA module), and finally encounter a 10-bit CLA module.
Since we considered NCLAs comprising 8-bit, 10-bit, and 12-bit CLA modules, 8-input, 10-input, and 12-input AND and OR gate logic would be present in them in theory, which requires decomposition as they are generally not available as cells in a standard cell library. In the case of CDM, an 8-input AND/OR gate is decomposed into two 4-input AND/OR gates in the first level and their outputs are combined using a 2-input AND/OR gate in the second level; a 10-input AND/OR gate is decomposed into two 4-input AND/OR gates in the first level and their outputs along with the remaining two inputs are combined using a 4-input AND/OR gate in the second level; a 12-input AND/OR gate is decomposed into three 4-input AND/OR gates in the first level and their outputs are combined using a 3-input AND/OR gate in the second level. Consequently, the delay of an 8-input AND/OR gate is represented by the sum of the delays of a 4-input AND/OR gate and a 2-input AND/OR gate, as denoted by the first term given within brackets on the right side of Equation (16). The delay of a 10-input AND/OR gate is represented by the sum of the delays of two 4-input AND/OR gates, as denoted by the first term given within brackets on the right side of Equation (12). The delay of a 12-input AND/OR gate is represented by the sum of the delays of a 4-input AND/OR gate and a 3-input AND/OR gate. Leaving aside NCLA-1010102 and NCLA-12884, whose delays are expressed by Equations (12) and (16), in NCLA-8844422, NCLA-1010444, NCLA-1244444, and NCLA-128444, the critical path originates from a 2-bit or a 4-bit CLA module, and the gates present in them do not require decomposition according to CDM.
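For instance, the two-level decomposition of an 8-input AND gate assumed by the CDM can be written in Verilog as follows; this fragment is purely illustrative.

```verilog
// Two-level decomposition of an 8-input AND as assumed by the CDM: two 4-input
// AND gates in the first level, combined by a 2-input AND gate in the second level.
module and8_cdm (input [7:0] x, output y);
  wire t_lo = &x[3:0];     // first-level 4-input AND
  wire t_hi = &x[7:4];     // first-level 4-input AND
  assign y  = t_lo & t_hi; // second-level 2-input AND
endmodule
```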
After substituting the propagation delays of gates belonging to the cell library [33] in Equations (11)–(16), the theoretical critical path delays were calculated based on CDM, given as follows: $D_{\text{NCLA-8844422}} = 0.708$ ns; $D_{\text{NCLA-1010102}} = 0.599$ ns; $D_{\text{NCLA-1010444}} = 0.592$ ns; $D_{\text{NCLA-1244444}} = 0.655$ ns; $D_{\text{NCLA-128444}} = 0.592$ ns; and $D_{\text{NCLA-12884}} = 0.626$ ns. These values suggest the following:
  • NCLA-1010102, NCLA-1010444, NCLA-1244444, NCLA-128444, and NCLA-12884, theoretically, have lesser critical path delay compared to NCLA-8844422 which does not tally with the practical delay estimates given in Table 1.
  • NCLA-1244444, NCLA-128444, and NCLA-12884 have different critical path delays, which also do not tally with the practical delay estimates given in Table 1.
  • NCLA-1010102 and NCLA-1010444 have lesser critical path delay than NCLA-1244444 and NCLA-12884, which again does not tally with the practical delay estimates given in Table 1.
The above observations point to contradictions between the CDM-based theoretically calculated delays and the practically estimated delays (given in Table 1), casting doubts on the usefulness and reliability of CDM for an NCLA. Therefore, as an alternative, we developed SDM to effectively calculate the critical path delays of NCLAs, and it shall be discussed next.
To formulate the SDM, we first described 2-bit, 4-bit, 6-bit, 8-bit, and 10-bit CLA modules without and with the carry input, and a 12-bit CLA module with the carry input, structurally at the gate level in Verilog, based on the example CLA logic shown in Figure 1 and Figure 2. Only 2-bit to 12-bit CLA modules were considered because the NCLAs shown in Table 1 required only those. The CLA modules were then synthesized with DesignCompiler using a standard cell library [33], and their maximum propagation delays were estimated by PrimeTime. We noticed a regularity in the practical timing estimates of K-bit CLA modules based on two conditions: (i) K > 2, and (ii) the critical path originates from the K-bit CLA module present in the N-bit NCLA. When these two conditions are satisfied, the theoretical delays of K-bit CLA modules without and with a carry input may be generalized by Equations (17) and (18). In Equations (17) and (18), $D_{\text{AO22}}$ represents the propagation delay of an AO22 complex gate.
$D_{K\text{-bit CLA (no carry input)}} = D_{\text{XOR2}} + (K - 1) \times D_{\text{AO22}}$  (17)
$D_{K\text{-bit CLA (carry input)}} = D_{\text{XOR2}} + (K - 1) \times D_{\text{AO22}} + D_{\text{AO21}}$  (18)
Supposing a large K-bit CLA module is used to construct an N-bit NCLA, and the critical path of the NCLA is completely dominated by the propagation delay of that K-bit CLA module, the theoretical delay of the NCLA may be generalized by Equation (19).
$D_{N\text{-bit NCLA (with large }K\text{-bit CLA)}} = D_{\text{XOR2}} + (K - 2) \times D_{\text{AO22}} + D_{\text{AO21}} + D_{\text{XOR2}}$  (19)
Based on SDM, the theoretical critical path delays of previously considered example NCLAs are expressed by Equations (20)–(25).
$D_{\text{NCLA-8844422}} = (D_{\text{XOR2}} + D_{\text{AND2}} + D_{\text{AO22}}) + (5 \times D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (20)
$D_{\text{NCLA-1010102}} = (D_{\text{XOR2}} + 9 \times D_{\text{AO22}} + D_{\text{AO21}}) + (D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (21)
$D_{\text{NCLA-1010444}} = (D_{\text{XOR2}} + 9 \times D_{\text{AO22}} + D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (22)
$D_{\text{NCLA-1244444}} = (D_{\text{XOR2}} + 10 \times D_{\text{AO22}} + D_{\text{AO21}} + D_{\text{XOR2}})$  (23)
$D_{\text{NCLA-128444}} = (D_{\text{XOR2}} + 10 \times D_{\text{AO22}} + D_{\text{AO21}} + D_{\text{XOR2}})$  (24)
$D_{\text{NCLA-12884}} = (D_{\text{XOR2}} + 10 \times D_{\text{AO22}} + D_{\text{AO21}} + D_{\text{XOR2}})$  (25)
Let us now compare the CDM-based delay expressions with the SDM-based delay expressions. Comparing Equations (11) and (20), the first term on the right side slightly differs between these two. Comparing Equations (12) and (21), the first term on the right side significantly differs between these two. Equation (13) has three terms on the right side while Equation (22) has only two terms on the right side, and only one term is common between these two. Equations (23)–(25) are the same, and compared to these, the corresponding Equations (14)–(16) are different.
After substituting the propagation delays of gates belonging to the cell library [33] into Equations (20)–(25), the theoretical critical path delays were calculated based on SDM, given as follows: $D_{\text{NCLA-8844422}} = 0.720$ ns; $D_{\text{NCLA-1010102}} = 1.007$ ns; $D_{\text{NCLA-1010444}} = 0.944$ ns; $D_{\text{NCLA-1244444}} = D_{\text{NCLA-128444}} = D_{\text{NCLA-12884}} = 0.953$ ns. These values suggest the following:
  • NCLA-8844422, theoretically, has less delay compared to NCLA-1010102, NCLA-1010444, NCLA-1244444, NCLA-128444, and NCLA-12884, which tallies with the practical delay estimates given in Table 1.
  • NCLA-1244444, NCLA-128444, and NCLA-12884 have the same theoretical delay, which agrees with the corresponding practical delay estimates given in Table 1.
  • NCLA-1010102 and NCLA-1010444 have greater theoretical delay than NCLA-1244444, NCLA-128444, and NCLA-12884, again showing an agreement with the practical delay estimates given in Table 1.
These observations point to a good correlation between the SDM-based theoretically calculated delays and the practically estimated delays (given in Table 1) thus validating the usefulness and reliability of SDM for NCLAs.
Figure 6 shows two plots portraying a comparison between theoretically calculated delays (based on SDM) and practically estimated delays for the NCLAs listed in Table 1. A good correlation is observed between the theoretical and practical delays for almost all NCLAs, with one exception. According to SDM, NCLA-884444 is predicted to be better optimized for delay compared to its counterparts, although the delays of NCLA-8844422 and NCLA-8864222 are very close. However, the practical delay estimates given in Table 1 show that NCLA-8844422 and NCLA-8864222 are better optimized than their counterparts and have a slight edge over NCLA-884444. This small anomaly is due to the approximation inherent in the theoretical delay modeling. Nevertheless, SDM is found useful and reliable overall. Moreover, SDM is scalable and less complex than CDM, and it may be used for the theoretical delay calculation of CCLAs as well.
The power-delay product (PDP) is the product of total power dissipation and critical path delay, which is considered a key metric for assessing the energy efficiency of digital logic designs. Hence, we calculated the PDP for the NCLAs listed in Table 1 and normalized these values. Normalization was achieved by dividing the actual PDP of each NCLA by the highest PDP corresponding to an NCLA (here, NCLA-1010102), and the normalized PDP values are plotted in Figure 7. Since minimizing power and delay is desirable, a minimum PDP is preferable. Therefore, the smallest value of normalized PDP is preferred, which corresponds to NCLA-8844422 in Table 1, highlighted by the dark red bar in Figure 7.
In the existing literature, specific adder designs were presented, yet many have not compared different adder architectures. For example, ref. [34] presented a CSLA design and provided a comparison between the design metrics of just the conventional CSLA and the proposed CSLA, and ref. [35] made a comparison between only CSLAs and CCLAs. On the contrary, we intend to compare adders belonging to various architectures using standard performance metrics. In this context, we described several 32-bit adders corresponding to diverse architectures, including the CSKA, CSA, CSLA, CCLA, and PPAs, all implemented in Verilog HDL, and synthesized following the same method described earlier. The 32-bit addition was described in a data flow style in Verilog HDL using the addition operator (+), and the 32-bit RCA was subsequently synthesized using DesignCompiler by mentioning the ‘compile_ultra’ command. As mentioned earlier, the ‘compile’ command was used to synthesize all the high-speed adders with speed defined as the optimization goal. We utilized the Synopsys DesignWare library, which includes synthesizable models of high-speed adders like the Ling adder, CSA, and PPAs such as the BKA and the Sklansky adder. Additionally, we used the structural description of a 32-bit KSA from [36] for synthesis. Regarding the CSLA, ref. [35] analyzed four popular configurations and determined that the 32-bit CSLA featuring a uniform 8-8-8-8 input partition was better optimized for delay. Hence, we chose this CSLA type for implementation and comparison in this work.
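For reference, the data-flow style description mentioned above amounts to the following minimal Verilog module; the module and port names are illustrative.

```verilog
// Data-flow style 32-bit addition using the "+" operator, as described above; the
// synthesis tool maps this description to an adder implementation (here, the RCA
// reference design obtained with the 'compile_ultra' command).
module adder32_dataflow (input [31:0] a, b, output [31:0] sum, output cout);
  assign {cout, sum} = a + b;
endmodule
```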
Table 2 shows the standard design metrics of 32-bit adders corresponding to diverse architectures including NCLA-8844422 (which was found to be better optimized among the proposed 32-bit NCLAs in Table 1). CCLA-2×16, CCLA-4×8, and CCLA-8×4 in Table 2 denote the CCLAs shown in Figure 3a, Figure 3b, and Figure 3c respectively. The split-up of the total area of different adders in terms of cells and interconnect areas is also given in Table 2.
The total power dissipation is composed of dynamic and leakage power, and the dynamic power comprises cell internal power and net switching power components. The breakdown of the total power dissipation of different adders is shown in Figure 8. As seen in Figure 8, the CSLA dissipates greater dynamic power than the other adders, possibly due to the duplication of adder logic within it. The KSA suffers from greater leakage (static) power than the other adders, which is due to its increased area occupancy in comparison with them.
From area and power dissipation perspectives, the RCA is the most advantageous. The lower area occupancy of the RCA translates into lower power dissipation compared to its counterparts, as seen in Table 2. However, the RCA is slow, and NCLA-8844422 reports a 71% reduction in critical path delay compared to the RCA.
When considering critical path delay, the KSA is the best, but NCLA-8844422 requires 55.4% less area and dissipates 41.2% less power than the KSA. Among the CCLAs, CCLA-8×4 is better optimized compared to CCLA-2×16 and CCLA-4×8. As seen in Table 2, CCLA-8×4 has less critical path delay, area, and power than CCLA-2×16. CCLA-4×8 has the same critical path delay as CCLA-8×4 but the latter requires slightly less area and dissipates slightly less power than the former. Compared to CCLA-8×4, NCLA-8844422 reports a reduction in critical path delay. To comprehend the reason for this, we again resort to SDM-based theoretical delay calculation. Based on SDM, the theoretical delay of CCLA-8×4 is expressed by (26). Note that the first term on the right side of Equation (26) is the same as given by Equation (17), which represents the delay encountered in the first 8-bit CLA module with no carry input. In Equation (26), the second term represents the combined delay of two intermediate 8-bit CLA modules, and the third term represents the delay encountered in the final 8-bit CLA module.
$D_{\text{CCLA-8×4}} = (D_{\text{XOR2}} + 7 \times D_{\text{AO22}}) + (2 \times D_{\text{AO21}}) + (D_{\text{AO21}} + D_{\text{XOR2}})$  (26)
Substituting the propagation delays of gates from the cell library datasheet [33], the theoretical critical path delay of CCLA-8×4 is calculated as 0.863 ns, whereas the theoretical critical path delay of NCLA-8844422 was calculated earlier to be 0.720 ns. Thus, theoretically, NCLA-8844422 has a 16.6% lower critical path delay than CCLA-8×4, while the practical delay estimates given in Table 2 indicate that NCLA-8844422 achieves a 14.7% reduction in critical path delay compared to CCLA-8×4. Hence, the theoretically calculated and practically estimated delay reductions are quite close.
Figure 9 shows the normalized PDP values of various 32-bit adders, given in Table 2. Normalization of PDP was carried out following the same procedure adopted for Figure 7. The least normalized PDP value, which is preferable, is highlighted by the red bar in Figure 9. It was mentioned earlier that the RCA requires less area and dissipates less power than other adders, but NCLA-8844422 achieves a 65.4% reduction in PDP compared to it. Similarly, while the KSA has the least delay, NCLA-8844422 achieves a 20.2% reduction in PDP in comparison. Hence, considering the design metrics given in Table 2 and the normalized PDP shown in Figure 9, it is inferred that the proposed NCLA-8844422 offers a good trade-off between delay, power, and energy compared to its counterparts.
In some of the literature, for example [34], the area-delay product (ADP) has been additionally considered as a figure of merit besides the PDP. The ADP helps to quantify the trade-off between area and delay. Increasing the area might reduce the delay and thus increase the speed, which is found to be true for the KSA, while reducing the area could increase the delay, which is found to be true for the RCA. So, the ADP is a measure of how efficiently a design uses its area to achieve a certain level of performance. Lower ADP values generally indicate more efficient designs, i.e., designs that minimize the area while keeping the delay low. A design with a lower ADP is preferable as it suggests a better balance between area and delay. Given this, the ADP of all the 32-bit adders in Table 2 was calculated and normalized. The actual ADP of each adder was divided by the highest ADP of a specific adder (here, the CSKA) to carry out the normalization. The normalized ADP of various 32-bit adders is depicted in Figure 10, and the preferred value, i.e., the least normalized ADP value, is highlighted by the red bar corresponding to NCLA-8844422. Compared to the RCA, which has the least ADP among the existing adders, NCLA-8844422 achieves a 7% reduction.

4. Conclusions

This article introduced an NCLA architecture that employs non-uniform size CLA modules arranged in a cascade. In contrast, the CCLA architecture utilizes a cascade of uniform-size CLA modules. Two key features of the NCLA are that (i) the delay corresponding to the least significant CLA module(s) is absorbed within the delay of the more significant CLA module(s), or (ii) the number of CLA module stages is reduced; both effects shorten the critical path and thus enable optimization of the critical path delay compared to a CCLA. Two strategies were employed to determine the selection and placement of CLA modules within the NCLA architecture to achieve this. Synthesis results demonstrate that the proposed NCLA architecture achieves a balanced trade-off between critical path delay and power dissipation compared to existing adder architectures. This is evidenced by the improved energy efficiency achieved by the NCLA (NCLA-8844422) compared to other adders for a 32-bit addition. Also, this NCLA achieved a better balance between area and delay compared to its counterparts. Future work might focus on investigating the usefulness of the proposed NCLA architecture for realizing other arithmetic circuits, such as multipliers, with higher speed and improved energy efficiency.

Author Contributions

Conceptualization, P.B.; methodology, P.B.; validation, P.B.; formal analysis, P.B.; investigation, P.B. and D.L.M.; resources, D.L.M.; data curation, P.B.; writing—original draft preparation, P.B.; writing—review and editing, P.B.; visualization, P.B.; supervision, D.L.M.; project administration, P.B. and D.L.M.; funding acquisition, D.L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Ministry of Education (MOE), Singapore Academic Research Fund under grant numbers Tier-1 RG48/21 and Tier-1 RG127/22.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Gupta, V.; Mohapatra, D.; Raghunathan, A.; Roy, K. Low-power digital signal processing using approximate adders. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2013, 32, 124–137. [Google Scholar] [CrossRef]
  2. Pashaeifar, M.; Kamal, M.; Afzali-Kusha, A.; Pedram, M. A theoretical framework for quality estimation and optimization of DSP applications using low-power approximate adders. IEEE Trans. Circuits Syst.—I Regul. Pap. 2019, 66, 327–340. [Google Scholar] [CrossRef]
  3. Geng, H.; Ma, Y.; Xu, Q.; Mia, J.; Roy, S.; Yu, B. High-speed adder design space exploration via graph neural processes. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 2657–2670. [Google Scholar] [CrossRef]
  4. Kim, V.H.; Choi, K.K. A reconfigurable CNN-based accelerator design for fast and energy-efficient object detection system on mobile FPGA. IEEE Access 2023, 11, 59438–59445. [Google Scholar] [CrossRef]
  5. Han, J.; Li, Y.; Yu, Z.; Zeng, X. A 65 nm cryptographic processor for high speed pairing computation. IEEE Trans. Very Large Scale Integr. Syst. 2015, 23, 692–701. [Google Scholar] [CrossRef]
  6. Panda, A.K.; Palisetty, R.; Ray, K.C. High-speed area-efficient VLSI architecture of three-operand binary adder. IEEE Trans. Circuits Syst.—I Regul. Pap. 2020, 67, 3944–3953. [Google Scholar] [CrossRef]
  7. Qoutb, A.G.; El-Gunidy, A.M.; Tolba, M.F.; El-Moursy, M.A. High speed special function unit for graphics processing unit. In Proceedings of the 9th International Design and Test Symposium, Algiers, Algeria, 16–18 December 2014. [Google Scholar]
  8. Han, D.; Lee, J.; Yoo, H.-J. DF-LNPU: A pipelined direct feedback alignment-based deep neural network learning processor for fast online learning. IEEE J. Solid-State Circuits 2021, 56, 1630–1640. [Google Scholar] [CrossRef]
  9. Lo, C.Y.; Sham, C.-W.; Fu, C. Novel CNN accelerator design with dual Benes network architecture. IEEE Access 2023, 11, 59524–59529. [Google Scholar] [CrossRef]
  10. Datta, D.; Dutta, H.S. Design and implementation of digital down converter for WiFi network. IEEE Embed. Syst. Lett. 2024, 16, 122–125. [Google Scholar] [CrossRef]
  11. Yoo, W.; Jung, Y.; Kim, M.Y.; Lee, S. A pipelined 8-bit soft decision Viterbi decoder for IEEE802.11ac WLAN systems. IEEE Trans. Consum. Electron. 2012, 58, 1162–1168. [Google Scholar] [CrossRef]
  12. Osta, M.; Ibrahim, A.; Chible, H.; Valle, M. Inexact arithmetic circuits for energy efficient IoT sensors data processing. In Proceedings of the IEEE International Symposium on Circuits and Systems, Florence, Italy, 27–30 May 2018. [Google Scholar]
  13. Mendez, T.; Parupudi, T.; Vishnumurthy, K.K.; Nayak, S.G. Development of power-delay product optimized ASIC-based computational unit for medical image compression. Technologies 2024, 12, 121. [Google Scholar] [CrossRef]
  14. Zhang, H.; Putic, M.; Lach, J. Low power GPGPU computation with imprecise hardware. In Proceedings of the 51st Design Automation Conference, San Francisco, CA, USA, 1–5 June 2014. [Google Scholar]
  15. Wanhammar, L. DSP Integrated Circuits; Academic Press: Cambridge, MA, USA, 1999. [Google Scholar]
  16. Chen, D.C.; Guerra, L.M.; Ng, E.H.; Potkonjak, M.; Schultz, D.P.; Rabaey, J.M. An integrated system for rapid prototyping of high performance algorithm specific data paths. In Proceedings of the International Conference on Application Specific Array Processors, Berkeley, CA, USA, 4–7 August 1992. [Google Scholar]
  17. Garside, J.D. A CMOS VLSI implementation of an asynchronous ALU. In Proceedings of the IFIP WG10.5 Working Conference on Asynchronous Design Methodologies, Manchester, UK, 31 March–2 April 1993. [Google Scholar]
  18. Omondi, A.R. Computer Arithmetic Systems: Algorithms, Architecture and Implementations; Prentice Hall: New York, NY, USA, 1994. [Google Scholar]
  19. Ercegovac, M.D.; Lang, T. Digital Arithmetic; Morgan Kaufmann Publishers: Burlington, MA, USA, 2004. [Google Scholar]
  20. Balasubramanian, P.; Prasad, K.; Mastorakis, N.E. A standard cell based synchronous dual-bit adder with embedded carry look-ahead. WSEAS Trans. Circuits Syst. 2010, 9, 736–745. [Google Scholar]
  21. Parhami, B. Computer Arithmetic: Algorithms and Hardware Designs, 1st ed.; Oxford University Press: New York, NY, USA, 2000. [Google Scholar]
  22. Sklansky, J. Conditional-sum addition logic. IRE Trans. Electron. Comput. 1960, EC-9, 226–231. [Google Scholar] [CrossRef]
  23. Bedrij, O.J. Carry-select adder. IRE Trans. Electron. Comput. 1962, EC-11, 340–346. [Google Scholar] [CrossRef]
  24. Chang, T.-Y.; Hsiao, M.-J. Carry-select adder using single ripple-carry adder. Electron. Lett. 1998, 34, 2101–2103. [Google Scholar] [CrossRef]
  25. Rosenberger, G.B. Simultaneous Carry Adder. U.S. Patent 2,966,305, 27 December 1960. [Google Scholar]
  26. Ling, H. High-speed binary adder. IBM J. Res. Dev. 1981, 25, 156–166. [Google Scholar] [CrossRef]
  27. Knowles, S. A family of adders. In Proceedings of the 15th IEEE Symposium on Computer Arithmetic, Vail, CO, USA, 11–13 June 2001. [Google Scholar]
  28. Brent, R.P.; Kung, H.T. A regular layout for parallel adders. IEEE Trans. Comput. 1982, C-31, 260–264. [Google Scholar] [CrossRef]
  29. Sklansky, J. An evaluation of several two-summand binary adders. IRE Trans. Electron. Comput. 1960, EC-9, 213–226. [Google Scholar] [CrossRef]
  30. Kogge, P.M.; Stone, H.S. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Comput. 1973, 100, 786–793. [Google Scholar] [CrossRef]
  31. Balasubramanian, P.; Maskell, D. A new carry look-ahead adder architecture enabling improved speed and energy efficiency. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, BC, Canada, 21–23 August 2024. [Google Scholar]
  32. Balasubramanian, P.; Mastorakis, N.E. High-speed and energy-efficient carry look-ahead adder. J. Low Power Electron. Appl. 2022, 12, 46. [Google Scholar] [CrossRef]
  33. Synopsys SAED_EDK32/28_CORE Databook, Revision 1.0.0. January 2012. Available online: https://www.synopsys.com/community/university-program/teaching-resources.html (accessed on 8 May 2024).
  34. Ramkumar, B.; Kittur, H.M. Low-power and area-efficient carry select adder. IEEE Trans. VLSI Syst. 2012, 20, 371–375. [Google Scholar] [CrossRef]
  35. Balasubramanian, P.; Mastorakis, N. Performance comparison of carry-lookahead and carry-select adders based on accurate and approximate additions. Electronics 2018, 7, 369. [Google Scholar] [CrossRef]
  36. Yazdanbakhsh, A.; Mahajan, D.; Esmaeilzadeh, H.; Lofti-Kamran, P. AxBench: A multiplatform benchmark suite for approximate computing. IEEE Des. Test 2017, 34, 60–68. [Google Scholar] [CrossRef]
Figure 1. Conventional gate-level realization of a 4-bit CLA module with no carry input.
Figure 2. Gate-level realization of a delay-optimized 4-bit CLA module with a carry input.
Figure 3. The 32-bit CCLAs realized using uniform-size CLA modules: (a) using 2-bit CLA modules; (b) using 4-bit CLA modules; and (c) using 8-bit CLA modules. Rose, yellow, and green boxes in the figure represent 2-bit, 4-bit, and 8-bit CLA modules respectively.
Figure 4. The 32-bit NCLAs realized using different size CLA modules: (a) using 4-bit and 6-bit CLA modules; (b) using 2-bit, 4-bit, and 8-bit CLA modules; and (c) using 4-bit and 8-bit CLA modules. Rose, yellow, turquoise, and green boxes represent 2-bit, 4-bit, 6-bit, and 8-bit CLA modules.
Figure 5. Area (in µm²) of different CLA modules without and with the carry input. The areas of 2-, 4-, 6-, 8-, 10-, and 12-bit CLA modules are portrayed by purple, yellow, green, black, dark red, and light blue bars respectively.
Figure 6. Theoretically calculated and practically estimated delays of various 32-bit NCLAs (in ns). The red dots in the blue line and the blue dots in the orange line highlight the corresponding theoretical and practical delays of specific NCLAs.
Figure 7. Normalized PDP of various 32-bit NCLAs with the optimized value corresponding to NCLA-8844422 highlighted by the dark red bar.
Figure 8. Power dissipation components (in µW) of 32-bit adders, estimated using PrimePower.
Figure 9. Normalized PDP of 32-bit adders belonging to different architectures.
Figure 10. Normalized ADP of different 32-bit adders.
Table 1. Design parameters of 32-bit NCLAs, synthesized using a 28 nm standard cell library.

| NCLA Configuration | Cell Area (µm²) | Net Area (µm²) | Total Area (µm²) | Critical Path Delay (ns) | Total Power Dissipation (µW) |
|---|---|---|---|---|---|
| NCLA-666644 | 468.39 | 51.78 | 520.17 | 1.02 | 50.00 |
| NCLA-6664442 | 476.52 | 52.85 | 529.37 | 1.05 | 50.70 |
| NCLA-6666422 | 476.52 | 52.96 | 529.48 | 1.02 | 50.38 |
| NCLA-666662 | 479.06 | 53.49 | 532.55 | 1.12 | 50.67 |
| NCLA-664664 | 468.39 | 51.78 | 520.17 | 1.11 | 50.00 |
| NCLA-66644222 | 473.98 | 52.43 | 526.41 | 1.06 | 50.47 |
| NCLA-88844 | 470.93 | 52.80 | 523.73 | 1.09 | 49.78 |
| NCLA-8844422 | 476.52 | 53.21 | 529.73 | 0.99 | 50.00 |
| NCLA-884444 | 468.39 | 52.04 | 520.43 | 1.01 | 49.62 |
| NCLA-8444444 | 465.85 | 51.27 | 517.12 | 1.08 | 49.46 |
| NCLA-888422 | 479.06 | 53.98 | 533.04 | 1.09 | 50.16 |
| NCLA-886622 | 479.06 | 53.84 | 532.90 | 1.04 | 50.35 |
| NCLA-8864222 | 476.52 | 53.32 | 529.84 | 0.99 | 50.32 |
| NCLA-86666 | 461.78 | 51.22 | 513.00 | 1.09 | 49.27 |
| NCLA-866642 | 479.06 | 53.61 | 532.67 | 1.03 | 50.61 |
| NCLA-1010102 | 484.14 | 55.96 | 540.10 | 1.26 | 51.07 |
| NCLA-1010444 | 470.93 | 53.22 | 524.15 | 1.14 | 50.10 |
| NCLA-1244444 | 468.39 | 52.62 | 521.01 | 1.12 | 49.62 |
| NCLA-128444 | 470.93 | 53.39 | 524.32 | 1.12 | 49.78 |
| NCLA-12884 | 473.47 | 54.15 | 527.62 | 1.12 | 49.94 |
Table 2. Design metrics of various 32-bit adders, synthesized using a 28 nm standard cell library.

| Adder | Cell Area (µm²) | Net Area (µm²) | Total Area (µm²) | Critical Path Delay (ns) | Total Power Dissipation (µW) |
|---|---|---|---|---|---|
| RCA | 155.03 | 10.98 | 166.01 | 3.40 | 42.13 |
| CSKA | 407.14 | 45.25 | 452.39 | 2.93 | 44.23 |
| CSA | 412.48 | 77.65 | 490.13 | 1.71 | 69.43 |
| CSLA | 745.15 | 95.86 | 841.01 | 1.16 | 79.45 |
| Ling adder | 392.40 | 75.21 | 467.61 | 2.39 | 67.48 |
| CCLA-2×16 | 453.65 | 48.75 | 502.40 | 1.63 | 48.98 |
| CCLA-4×8 | 463.30 | 50.52 | 513.82 | 1.16 | 49.29 |
| CCLA-8×4 | 455.17 | 50.79 | 505.96 | 1.16 | 48.63 |
| BKA | 419.85 | 64.40 | 484.25 | 2.42 | 56.65 |
| Sklansky adder | 387.06 | 62.75 | 449.81 | 2.74 | 57.10 |
| KSA | 1014.29 | 174.43 | 1188.72 | 0.73 | 84.99 |
| NCLA-8844422 (proposed) | 476.52 | 53.21 | 529.73 | 0.99 | 50.00 |