1. Introduction
Approximate computing has gained prominence as a practical alternative to accurate computing, offering substantial improvements in speed and power efficiency [1,2]. However, these benefits come at the cost of reduced accuracy, necessitating a careful balance between performance enhancement and tolerable accuracy loss. This approach has been successfully investigated across various real-world applications [3], such as multimedia processing [4], digital signal processing [5,6], big data analytics [7], neuromorphic computing [8], neural network implementations for artificial intelligence and machine learning [9,10], software development [11], memory storage [12,13], and low-power graphics processing [14], among others.
This paper explores the application of approximate computing to machine learning, with a particular emphasis on k-means clustering, a widely used unsupervised machine learning technique for partitioning data into clusters based on similarity. K-means does not require labeled data, which makes it suitable for unsupervised learning tasks. Its primary goal is to partition data points into k clusters, where k also denotes the number of centroids, each centroid serving as the central reference point of a cluster. The core idea of k-means is to minimize the sum of squared distances between data points and their assigned cluster centroids, i.e., to minimize the Euclidean distance between data points and their cluster centers. This reduces the variance within each cluster, making the results easy to evaluate and interpret. K-means is particularly valued for its simplicity, efficiency, and ease of implementation in applications such as data compression, market segmentation, anomaly detection, and dimensionality reduction. It works well with large datasets and can handle a considerable amount of data thanks to its linear time complexity, provided the number of clusters is manageable. The time complexity of k-means is typically O(n × k × t), where ‘n’ represents the number of data points, ‘k’ the number of clusters, and ‘t’ the number of iterations required for convergence. The k-means algorithm usually converges to a solution in a finite number of iterations, and it tends to converge quickly, especially when a good initial set of centroids is chosen, which depends on the data.
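The assign-and-update cycle described above can be sketched in a few lines of Python. This is a minimal illustration for orientation only, not the experimental code used in this paper:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means: assign each point to its nearest centroid (squared
    Euclidean distance), then recompute each centroid as the mean of its
    members, repeating until assignments stabilize."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iters):
        new = [min(range(k),
                   key=lambda j, p=p: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[j])))
               for p in points]
        if new == assignments:   # assignments stable -> converged
            break
        assignments = new
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:          # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return centroids, assignments
```

For two well-separated groups of points, any initialization converges to the expected partition within a few iterations.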
There also exist limitations of k-means: it assumes clusters to be roughly spherical, which may not suit all types of data (e.g., data with clusters of varying shapes and densities); the algorithm requires the number of clusters, k, to be specified in advance, which is sometimes not intuitive or easy to determine; and k-means is sensitive to the initial placement of centroids, which can lead to suboptimal results if poor initializations are chosen. Despite these limitations, the simplicity, speed, and ease of use of k-means make it a go-to clustering method for many practical applications.
In k-means, during the clustering process, data points are assigned to the closest centroid to minimize the within-cluster sum of squares (WCSS), with a lower WCSS value indicating an effective clustering. Conventionally, WCSS is computed with high precision using an accurate adder. This study examines the impact of employing various approximate adders for WCSS computation, and compares the results with those obtained using an accurate adder. Further, a novel approximate adder is introduced that demonstrates clustering performance comparable to an accurate adder, while achieving significant improvements in design efficiency and outperforming many existing approximate adders in optimizing key design metrics.
The remainder of this paper is structured as follows: Section 2 discusses the two primary categories of approximate adders—those with fixed approximation and those with variable approximation—before introducing the proposed approximate adder. Section 3 gives a concise overview of the k-means clustering algorithm and compares the clustering results obtained using the accurate adder and the proposed approximate adder; it also discusses the clustering performance of other approximate adders and presents an error analysis of the various approximate adders. Section 4 presents the physical design metrics estimated for the accurate adder and the various approximate adders, including the proposed one, with all the adders implemented using a 28 nm CMOS standard cell library. Finally, Section 5 concludes by summarizing the contributions of this paper.
2. New Approximate Adder
Approximate adders are typically designed by deliberately introducing inaccuracies into an accurate adder. These adders can be broadly classified into two categories: fixed approximation adders (FAAs) and variable approximation adders (VAAs). FAAs maintain a constant level of approximation, generating either an approximate sum with a predefined accuracy or an accurate sum, depending on the given inputs, within one clock cycle. A fixed approximation guarantees savings in design metrics for an approximate adder compared to its accurate counterpart. In contrast, VAAs allow for a dynamic approximation level, enabling them to produce either an approximate or an accurate sum as needed, which may involve one or more clock cycles. VAAs often include an error detection and correction circuit (EDCC) alongside the adder logic to maintain the required output accuracy. While the EDCC is crucial in ensuring accuracy, it tends to introduce a design overhead. A study presented in [15] for a digital video encoding application found that the power saving achieved by a VAA was comparable to that of an FAA, which was attributed to the additional EDCC in VAAs that is absent in FAAs. Given that the proposed approximate adder falls under the FAA category, this paper focuses exclusively on FAAs. Various FAA architectures have been extensively documented and compared in the existing literature [16,17], to which the interested reader is referred for details; they are not discussed here to avoid repetition. Instead, we describe the proposed approximate adder in this section, while citing many FAAs in the next section and considering them for a comparative evaluation in this work.
An FAA is generally composed of two distinct sections: a precise segment, where computations are performed with full accuracy, and an imprecise segment, where intentional errors are introduced into the computation. The lower-order bits of the adder are designated to the imprecise section, while the higher-order bits are assigned to the precise section. As a result, the precise segment plays a more critical role in maintaining overall computational accuracy compared to the imprecise segment.
The proposed approximate adder (NAA), illustrated in Figure 1, is an FAA that is divided into two sections, called the precise and imprecise parts. As depicted in the figure, an N-bit NAA consists of an M-bit imprecise section and an (N−M)-bit precise section, with the latter being more critical in determining the overall accuracy of the computation. The imprecise section is highlighted in red, and the precise section is depicted in blue. The adder inputs are represented by A and B, and the adder output, i.e., the sum, is denoted by S.
In the imprecise section, a subset of the less significant sum bits is fixed at a binary value of 1, while the remaining sum bits are computed using reduced (i.e., approximate) logic. The M-bit imprecise part generates M sum bits, where SM−1, SM−2, and SM−3 employ approximate logic, while the lower-order sum bits from SM−4 to S0 are assigned a constant value of 1. Sum bits SM−2 and SM−3 are produced by the logical OR of corresponding input bit pairs (AM−2, BM−2) and (AM−3, BM−3), respectively. The input bit pair (AM−1, BM−1) is logically XOR-ed, the input bit pair (AM−2, BM−2) is logically AND-ed, and the outputs of these XOR and AND gates are logically OR-ed to produce the sum bit SM−1. Consequently, the less significant input bit pairs (A0, B0) through (AM−4, BM−4) are disregarded. An internal carry signal viz. CT is generated by logically AND-ing the input bit pair (AM−1, BM−1), which is then fed into the precise section as its carry input. The (N–M)-bit precise section performs exact addition for the more significant sum bits, ranging from SM to SN, with SN representing the carry overflow of the addition. The sum bits computed by the precise and imprecise sections are concatenated to generate the final sum output of NAA.
The generalized logic equations for the sum bits of the imprecise part, i.e., S0 up to SM−1, and the precise part, viz. SM up to SN, of the NAA are given below. Equation (5) expresses the internal carry logic that feeds the precise part. Equation (6) refers to an arbitrary Kth bit adder stage in the precise part, whose sum bit depends on the respective adder inputs and the carry output produced by the (K−1)th bit adder stage.
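The equation bodies do not survive in this copy; the following is a reconstruction consistent with the bit-level description of Figure 1 given above, with the equation numbering assumed to match the original:

```latex
\begin{align}
S_j &= 1, \qquad 0 \le j \le M-4 \tag{1}\\
S_{M-3} &= A_{M-3} \lor B_{M-3} \tag{2}\\
S_{M-2} &= A_{M-2} \lor B_{M-2} \tag{3}\\
S_{M-1} &= \left(A_{M-1} \oplus B_{M-1}\right) \lor \left(A_{M-2} \land B_{M-2}\right) \tag{4}\\
C_T &= A_{M-1} \land B_{M-1} \tag{5}\\
S_K &= A_K \oplus B_K \oplus C_{K-1}, \qquad M \le K \le N-1 \tag{6}
\end{align}
```

Here C_{M−1} = C_T, i.e., the internal carry of Equation (5) serves as the carry input of the least significant precise stage.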
Typically, an N-bit NAA has an M-bit imprecise part where M is at least 4, with S0 fixed at 1 and S1, S2, and S3 associated with approximate sum logic. When M ≥ 5, SM−1, SM−2, and SM−3 retain the same logic as shown in Figure 1, and the remaining less significant sum bits are fixed at 1. Further, the sizes of the precise and imprecise sections of an NAA can be adjusted based on specific application requirements, allowing flexibility in balancing accuracy and design efficiency.
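The NAA logic described above can be expressed as a short software model. The sketch below follows the bit-level description of Figure 1; it is a high-level illustration, not the Verilog used for synthesis:

```python
def naa_add(a, b, n=16, m=7):
    """High-level model of an N-bit NAA with an M-bit imprecise part (M >= 4)."""
    abit = lambda i: (a >> i) & 1
    bbit = lambda i: (b >> i) & 1
    # Imprecise part: sum bits S0 .. S(M-4) are tied to logic 1
    s = (1 << (m - 3)) - 1
    s |= (abit(m - 3) | bbit(m - 3)) << (m - 3)                   # S(M-3)
    s |= (abit(m - 2) | bbit(m - 2)) << (m - 2)                   # S(M-2)
    s |= ((abit(m - 1) ^ bbit(m - 1)) |
          (abit(m - 2) & bbit(m - 2))) << (m - 1)                 # S(M-1)
    ct = abit(m - 1) & bbit(m - 1)    # internal carry CT into the precise part
    # Precise part: exact addition of the upper (N - M) bits, giving SM .. SN
    return s | (((a >> m) + (b >> m) + ct) << m)
```

For example, `naa_add(0, 0)` returns 15 because the four least significant sum bits are forced to 1, while the upper bits, including the overflow bit SN, are always computed exactly.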
3. K-Means Clustering Involving Accurate and Approximate Adders
The traditional k-means clustering algorithm [18] has been recognized as a fundamental and elegant approach for dividing numerical data into distinct clusters. It operates through an iterative process, where data points are assigned to clusters based on their closeness to a set of centroids. The algorithm alternates between two key steps: first, each data point is allocated to the nearest centroid using the Euclidean distance; second, the centroids are recalculated based on the average position of the points within their respective clusters. This cycle repeats until the centroids reach a stable state, ultimately forming well-defined clusters of data points [19].
A key characteristic of the k-means algorithm is its iterative expectation-maximization approach, which provides a level of error resilience at each step. Since each iteration incrementally refines the cluster assignments without requiring absolute precision in every update, k-means can effectively mitigate minor inaccuracies, particularly during centroid recalculations. This error tolerance is especially relevant to our case study, which examines the influence of approximate adders on k-means clustering. The algorithm’s convergence remains largely unaffected by small computational errors, making it a suitable framework for evaluating how approximate arithmetic impacts clustering accuracy. By integrating approximate adders into centroid computations, k-means presents a good testbed for analyzing the trade-offs between computational accuracy and clustering performance. In this work, we assess the clustering accuracy of the k-means algorithm by considering the total within-cluster sum of squares (WCSS) as our evaluation metric. WCSS is effective in reflecting how accurately the k-means algorithm has been executed computationally. This enables us to compare the clustering performances of various approximate adders by analyzing their ability to handle k-means despite arithmetic imprecisions. The methodology for calculating WCSS is discussed in the following.
To evaluate the clustering accuracy of different approximate adders, we conducted software-based experiments using publicly available artificial datasets [20]. The primary goal was to analyze the impact of approximate arithmetic on the accuracy of k-means clustering. We implemented the k-means algorithm in Python (version 3.13) and initially simulated an accurate adder in software. Subsequently, we modified the algorithm to integrate approximate adders during two key computations—the Euclidean distance within each dimension between a centroid and a data point, and the total sum of these Euclidean distances across all dimensions.
Consider an adder function AA, which accepts two integers within the specified bit precision and returns their sum through approximate addition. In practical clustering scenarios, these integers may exceed the precision limits; therefore, preprocessing is necessary to ensure accuracy. To perform subtraction using the approximate adder, we define an auxiliary function, given by Equation (7), in which a non-negative offset keeps the adder from dealing with negative numbers. Hence, even if the subtraction itself is accurate, incorporating the AA function introduces the approximate adder into the calculation.
As an illustration, let a data point i and a centroid j be represented as vectors over the dataset's D numerical dimensions, and let d refer to one of these dimensions. The Euclidean distance within the dth dimension is expressed via Equation (8). We omit the square root in this calculation, as we are not concerned with the raw Euclidean distance but rather with a comparison of distances. The sum of the squared Euclidean distances across all dimensions is then represented by Equation (9). Next, we apply this calculation to all data points and centroids to decide the centroid assignment in each iteration. The notation on the right side of Equation (9) indicates that an approximate adder is utilized for each term in the summation.
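A software sketch of this methodology is given below. Since the body of Equation (7) is not reproduced here, the bias-based subtraction helper is one plausible reading of the description above, and the LOA-like `approx_add` is a stand-in for any approximate adder; both are illustrative assumptions, not the paper's exact formulation:

```python
BIAS = 1 << 12   # assumed offset; Equation (7)'s exact form is not shown here

def approx_add(a, b):
    """Stand-in 16-bit approximate adder: exact upper part, OR-ed lower
    4 bits (an LOA-like scheme used purely for illustration)."""
    m = 4
    return (((a >> m) + (b >> m)) << m) | ((a | b) & ((1 << m) - 1))

def approx_sub(x, y):
    """Equation (7)-style helper: route the (exact) difference through the
    approximate adder, with BIAS keeping its operands non-negative."""
    return approx_add(x - y + BIAS, BIAS) - 2 * BIAS

def sq_distance(point, centroid):
    """Squared Euclidean distance (Equations (8)-(9)): per-dimension
    differences via approx_sub, accumulated via approx_add; the square
    root is omitted since only distance comparisons matter."""
    total = 0
    for p, c in zip(point, centroid):
        d = approx_sub(p, c)
        total = approx_add(total, d * d)
    return total
```

Every call to `approx_sub` and every accumulation step exercises the approximate adder, so the adder's error characteristics propagate into every centroid-assignment decision.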
In [21], only the aggregation dataset of [20] was considered for clustering. For this paper, we considered four open-source artificial datasets from [20], namely aggregation, diamond9, DS850, and engytime, which exhibit quite distinct characteristics and pose varied challenges for k-means, allowing us to evaluate the performance of the approximate adders across different clustering tasks. It may be noted that not all datasets of [20] can be clustered using k-means due to its limitations, such as the inability to handle complex, non-linear relationships between data points, high-dimensional data, etc.
Here, besides the proposed approximate adder (NAA), we considered several other approximate adders, namely LOA [22], LOAWA [23], APPROX5 [24], HEAA [25], M-HEAA [26], OLOCA [27], HOERAA [28], LDCA [29], HPETA-II [30], HOAANED [31], HERLOA [32], M-HERLOA [33], COREA [21], DBAA [34], and SAAR [35], to assess their performance in k-means clustering. We set the maximum bit precision of the adders to 16 bits (i.e., N = 16), consistent with the approach followed in [21]. For NAA with N = 16 bits, we varied the size of the imprecise part (i.e., M) from 4 bits to 10 bits, performing clustering individually for each value of M across all four datasets. The clustering iterations were carried out until convergence, after which the WCSS was calculated.
We define WCSS as follows: let the clustering produce a set of clusters, where the kth cluster denotes the group of points assigned to the kth centroid; WCSS is then computed as shown in Equation (10). Interested readers may reproduce the clustering results by using the code and following the instructions provided in our GitHub link [36].
The clustering results obtained using the accurate adder and the proposed approximate adder (NAA) are presented side-by-side in Figure 2. Through trial and error, the optimum sizes of the precise and imprecise parts of NAA were found to be 9 bits and 7 bits, respectively, for all four datasets. These sizes were selected to ensure that the WCSS value obtained using NAA either matched or was very close to the WCSS value obtained with the accurate adder. As shown in Figure 2, the WCSS value for the diamond9 dataset remains identical for both the accurate adder and NAA, while the WCSS values for the aggregation, DS850, and engytime datasets exhibit slight differences between the two.
Table 1 illustrates the variation in WCSS for the four datasets when using a 16-bit NAA, with the size of the imprecise adder part varying from 4 bits to 10 bits. The WCSS values based on the accurate adder for clustering the different datasets, noted from Figure 2, are as follows: aggregation = 11,427.23, diamond9 = 1015.24, DS850 = 413.13, and engytime = 12,002.04. In Table 1, M represents the size of the imprecise part of an N-bit NAA, where N = 16 and M varies from 4 to 10. It may be recalled from the previous discussion that M should be at least 4 for an NAA.
When performing k-means clustering using NAA, our aim was to determine the optimum sizes for its imprecise and precise parts so that a specific configuration suitable for clustering all the considered datasets could be chosen. Generally, it is desirable to maximize the size of the imprecise part of an approximate arithmetic circuit such that it still yields an acceptable output quality while enabling significant savings in design metrics compared to its accurate counterpart. As seen in Table 1, the optimum sizes of the imprecise part (M) of NAA for clustering the different datasets were M = 5 for aggregation, M = 7 for diamond9, M = 6 for DS850, and M = 7 for engytime. However, for aggregation, both M = 5 and M = 7 give nearly identical WCSS values, and for DS850, M = 6 and M = 7 result in almost the same WCSS values. For engytime, M = 7 is optimal. Based on these findings, a configuration of 7 bits for the imprecise part and 9 bits for the precise part of the 16-bit NAA was chosen for clustering all four datasets.
The accurate adder and NAA produced the same WCSS value for clustering the diamond9 dataset, but slightly different WCSS values for the aggregation, DS850, and engytime datasets, as seen in Figure 2. These variations are attributed to differences in the characteristics of the datasets. Overall, the results suggest that NAA is a promising choice for k-means clustering, providing comparable or nearly identical clustering performance relative to the accurate adder. Based on the same procedure, the optimum sizes of the imprecise part of several existing approximate adders for k-means clustering were also determined, as follows: M = 5 for LOA, HEAA, M-HEAA, OLOCA, HOERAA, HOAANED, HERLOA, and M-HERLOA; and M = 4 for LOAWA, APPROX5, LDCA, HPETA-II, COREA, and DBAA. The approximate adder SAAR features a unique architecture that lacks an imprecise part, and a 16-bit SAAR was implemented in software as illustrated in [35].
Figure 3 shows which data points differ between the clustering performed using the accurate adder and that performed using the proposed approximate adder (NAA) for the different datasets. It can be noticed from Figure 3 that two data points (shown in red) were clustered differently by NAA compared to the accurate adder for the aggregation dataset, while one data point (shown in red) was clustered differently for each of the DS850 and engytime datasets. There is no difference in the clustering performed by the accurate adder and NAA for the diamond9 dataset. Thus, the minor (or absent) differences in clustering between the accurate adder and NAA explain the minor (or absent) variations between their corresponding WCSS values across the datasets.
Next, to analyze the error characteristics of the approximate adders, we computed two commonly used error metrics based on the optimum level of inaccuracy for each approximate adder, as discussed earlier. The error metrics considered were the mean error distance (MED), also known as the mean absolute error, and the root mean square error (RMSE). To perform the error analysis, the high-level functionality of the accurate adder and the different approximate adders was modeled in Python. Given that a 16-bit adder has 2^32 possible input combinations, considering all of them is impractical. Therefore, we supplied one million randomly generated input values to the adders and calculated their MED and RMSE relative to the sums produced by the accurate adder. The formulae for MED and RMSE are given by Equations (11) and (12).
In Equations (11) and (12), AccuSum(AL, BL) refers to the sum produced by the accurate adder, while AppxSum(AL, BL) represents the sum generated by an approximate adder. The notation (AL, BL) denotes a specific pair of input values given to the adder. K represents the number of inputs provided to the approximate adders for the calculation of the error metrics, with K set to 1 million. The MED and RMSE values computed for the approximate adders having optimum inaccuracy for clustering are given in Table 2.
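The Monte-Carlo error analysis described above can be sketched as follows; `error_metrics` accepts any adder model, and the sampling-based setup mirrors the paper's use of one million random input pairs in place of all 2^32 combinations (this is an illustrative sketch, not the authors' script):

```python
import random

def error_metrics(appx_add, n_bits=16, k=1_000_000, seed=1):
    """Monte-Carlo MED and RMSE of an approximate adder versus exact
    addition (Equations (11) and (12)), sampling K random input pairs."""
    rng = random.Random(seed)
    abs_err = sq_err = 0
    for _ in range(k):
        a, b = rng.getrandbits(n_bits), rng.getrandbits(n_bits)
        e = appx_add(a, b) - (a + b)   # signed error versus AccuSum
        abs_err += abs(e)
        sq_err += e * e
    return abs_err / k, (sq_err / k) ** 0.5
```

By definition, RMSE is always at least as large as MED, and an exact adder scores zero on both metrics.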
As seen in Table 2, although NAA exhibits numerically higher MED and RMSE values than many other approximate adders, k-means clustering was found to inherently tolerate some inaccuracy, making NAA still effective, as demonstrated by Figure 2 and Figure 3. However, it is important to note that the MED and RMSE values for NAA given in Table 2 correspond to M = 7, which is greater than the optimum value of M = 4 or M = 5 determined for the other approximate adders. Assuming M = 5 for NAA, its MED and RMSE were found to be 3.4515 and 4.7402, respectively, both of which are lower than the MED and RMSE values of many other approximate adders, such as LOA, HEAA, M-HEAA, OLOCA, HOERAA, and HOAANED. Similarly, with M = 4, the MED and RMSE for NAA were found to be 1.6572 and 2.3470, respectively, which are lower than those of approximate adders like LOAWA, APPROX5, COREA, LDCA, and HPETA-II. This indicates that NAA with M = 4 or M = 5 outperforms several approximate adders in terms of error metrics, while remaining suitable for clustering even with M = 7, showcasing its advantage.
Additionally, Table 2 reveals that SAAR has significantly greater MED and RMSE values than all the other approximate adders, which is due to its unique architecture. As a result, the WCSS values obtained by using SAAR for clustering the datasets differ noticeably from those generated by the accurate adder, as follows: aggregation—11,427.23 with the accurate adder versus 13,204.01 with SAAR; diamond9—1015.24 versus 1034.50; DS850—413.13 versus 419.94; and engytime—12,002.04 versus 12,573.49. These results imply that SAAR may not be useful for k-means clustering. In the next section, we discuss the physical synthesis of the accurate and approximate adders.
4. Accurate and Approximate Adders–Synthesis
The accurate 16-bit adder, along with several approximate 16-bit adders, was described in Verilog HDL and synthesized using gates from a 28 nm bulk CMOS standard cell library [37] with Synopsys Design Compiler (DC). For the accurate adder, addition was described in a data-flow style using the arithmetic operator (+), and synthesis was performed using the ‘compile_ultra’ command of DC, resulting in a 16-bit ripple carry adder (RCA). It is well known that the RCA is low-power and occupies less area than other high-speed adders, which is why it was selected as the sample architecture for this work. However, other high-speed adders, such as a carry look-ahead adder, may also be used to implement the accurate adder and the precise parts of the approximate adders. To ensure consistency and a fair comparison with the accurate adder, the RCA topology was also employed to implement the precise parts of all the approximate adders in this work. Accordingly, the precise sections of the approximate adders were described in data-flow style using the addition operator, while the imprecise parts were described structurally. The default wire-load model was included by DC during synthesis to account for interconnect and parasitic effects. Additionally, the sum bits of all the adders were uniformly assigned a fanout-of-4 drive strength.
Following synthesis, the total area occupied by each adder was estimated using DC. Subsequently, the gate-level netlists of all the adders underwent functional verification through simulation using Synopsys VCS. To do this, approximately 1000 randomly generated input values were uniformly supplied to all the adders via a test bench, with a latency of 2 ns to account for the delay of the 16-bit RCA. The switching activity recorded during functional simulation was used to estimate the total power dissipation using Synopsys PrimePower. The critical path delay of each adder was determined using Synopsys PrimeTime. For the timing estimation, a virtual clock was employed to constrain the primary inputs and outputs of the adders; however, it was not part of the physical implementation. The standard design metrics, namely total area, critical path delay, and total power dissipation, estimated for the various adders are presented in Table 3. The split-up of total area into cell area and interconnect area, and of total power dissipation into dynamic power and static (leakage) power, is also given in Table 3. Dynamic power was calculated by summing the cell internal power and the net switching power, as reported by Synopsys PrimePower.
In Table 3, it can be observed that NAA occupies the smallest area among the approximate adders, mainly because NAA has a 9-bit precise part and a 7-bit imprecise part, while the rest of the approximate adders have a 12-bit/11-bit precise part and a corresponding 4-bit/5-bit imprecise part. Since NAA has a relatively larger imprecise part, its logic is reduced, and consequently its area occupancy is lower than that of the other approximate adders. Nonetheless, NAA achieves a clustering quality comparable to the accurate adder despite its large imprecise part, an advantage resulting from its unique architecture. Compared to the accurate adder (i.e., RCA), NAA utilizes 21.6% less area, despite its 9-bit precise section having been implemented as an RCA.
Concerning the critical path delay, NAA achieves a 37.1% reduction compared to the accurate adder. Overall, SAAR has the shortest delay, for the following reasons: a 16-bit SAAR is divided into two more significant 6-bit precise parts and a less significant 4-bit precise part, according to its architecture, as shown in [35]. The carry inputs for the 6-bit precise parts do not come from an accurate carry logic, so the critical path delay of the 16-bit SAAR is determined by the delay of a 6-bit precise adder, which was implemented as an RCA. On the other hand, NAA features a 9-bit precise adder, which explains why its delay is greater than SAAR's. Other approximate adders, such as LOA, HEAA, M-HEAA, OLOCA, HOERAA, HOAANED, HERLOA, and M-HERLOA, use an 11-bit precise adder realized as an RCA, resulting in delays greater than SAAR's. Similarly, LOAWA, APPROX5, LDCA, COREA, and DBAA implement a 12-bit precise section, also realized as an RCA, leading to delays greater than SAAR's. However, SAAR's clustering quality is inferior to that of the other approximate adders, as noted in Section 3; hence, SAAR is not of interest despite being faster than the other approximate adders.
In terms of power, NAA dissipates less dynamic and static power and thus has less total power dissipation than the accurate adder and other approximate adders. This is primarily attributed to the smaller area occupancy of NAA. When compared with the accurate adder, NAA achieves a 31% reduction in power dissipation while facilitating the same or similar clustering quality.
The power-delay product (PDP) is a crucial metric in digital logic design that combines the power dissipation and maximum propagation delay of a circuit. PDP provides insight into the trade-off between the speed performance and energy efficiency of a digital circuit or system. Lower PDP values indicate circuits that are faster and dissipate less power, which is essential for optimizing battery life in portable devices. In high-performance systems, reducing PDP can lead to more efficient processing while minimizing heat generation. Designers aim to balance power and speed to meet specific design goals, whether for low-power applications or high-speed computing. Hence, PDP helps in making informed decisions to optimize a circuit's performance. We calculated the PDP for all the adders listed in Table 3 and subsequently normalized these values by dividing the actual PDP of each adder by the highest PDP value (in this case, that of the RCA). The resulting normalized PDP values are shown in Figure 4a. A lower PDP is more desirable, as both power and delay should be minimized for optimum performance; thus, the smallest normalized PDP value is preferred, which corresponds to NAA, as seen in Figure 4a.
The area-delay product (ADP) is another performance metric used alongside the PDP. ADP assesses the trade-off between area and delay, providing a measure of how effectively a design utilizes its area to achieve a specific performance level. Designs with lower ADP values are considered more efficient, as they minimize area while maintaining a low delay. Accordingly, the ADP for all the adders presented in Table 3 was computed and normalized by dividing each adder's actual ADP by the highest ADP value, which corresponds to the accurate adder (RCA). The normalized ADP values for the various adders are shown in Figure 4b, with the optimal value, i.e., the lowest normalized ADP, corresponding to NAA. NAA has a lower PDP and ADP than many other approximate adders, achieving a 56.7% reduction in PDP and a 50.9% reduction in ADP compared to the accurate adder (RCA) while providing nil or negligible compromise in clustering quality.
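The normalization described above amounts to a simple calculation, sketched below; the numbers in the usage example are placeholders for illustration, not the values of Table 3:

```python
def normalized_metrics(adders):
    """adders: {name: (power, delay, area)}.  Returns normalized PDP and
    ADP dictionaries, each divided by its largest value so the most
    efficient design scores lowest."""
    pdp = {n: p * d for n, (p, d, a) in adders.items()}
    adp = {n: a * d for n, (p, d, a) in adders.items()}
    top_pdp, top_adp = max(pdp.values()), max(adp.values())
    return ({n: v / top_pdp for n, v in pdp.items()},
            {n: v / top_adp for n, v in adp.items()})
```

With placeholder figures such as `{"RCA": (10.0, 2.0, 100.0), "NAA": (5.0, 1.0, 50.0)}`, the reference design normalizes to 1.0 and the more efficient design to a fraction of it, mirroring how Figure 4a,b are read.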