High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data

Mussabayev, Ravil; Mussabayev, Rustam

doi:10.3390/math12131930

Open AccessArticle

High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data

by

Ravil Mussabayev

^1,2,*,†

and

Rustam Mussabayev

^2,3,†

¹

Department of Mathematics, University of Washington, Padelford Hall C-138, Seattle, WA 98195-4350, USA

²

AI Research Lab, Satbayev University, Satbaev Str. 22, Almaty 050013, Kazakhstan

³

Laboratory for Analysis and Modeling of Information Processes, Institute of Information and Computational Technologies, Pushkin Str. 125, Almaty 050010, Kazakhstan

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics 2024, 12(13), 1930; https://doi.org/10.3390/math12131930

Submission received: 28 May 2024 / Revised: 18 June 2024 / Accepted: 19 June 2024 / Published: 21 June 2024

(This article belongs to the Section Mathematics and Computer Science)

Download

Browse Figures

Versions Notes

Abstract

:

This paper introduces a novel formulation of the clustering problem, namely, the minimum sum-of-squares clustering of infinitely tall data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging the multi-strategy competitive–cooperative parallelism and intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel in nature, improving solution quality through increased scalability and parallelism and outperforming even advanced algorithms designed for small- and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods by offering better performance in the key metrics. These results also show that parallel processing not only enhances the clustering efficiency, but the accuracy as well. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on the synthetic data further confirm HPClust’s exceptional scalability and robustness to noise.

Keywords:

HPClust algorithm; clustering; parallel processing; big data; large-scale datasets; minimum sum-of-squares; decomposition; K-means; K-means++; global optimization; hybrid approach; adaptive algorithm; data sampling; multi-strategy optimization; high-performance computing

MSC:

68T09; 68T20

1. Introduction

Clustering is a critical task that involves the identification of similar objects within a given set. As digital data continue to grow at an unprecedented rate, this problem has become increasingly challenging and has applications in diverse domains. For instance, in the biological and medical domains, it has been used for gene expression analysis [1], enhancing medical diagnostics [2], and advancing bioinformatics research [3]. In the realm of technology and data, clustering optimizes vector quantization and data compression techniques [4], identifies anomalies [5], aids in pattern recognition and classification [3], dissects time-series data for forecasting [6], and forms the basis for the finance and blockchain sectors [7,8]. Furthermore, in the context of consumer and media analytics, clustering helps in segmenting customers for targeted marketing [9], analyzing images and videos for content extraction [10], and understanding social media trends [11]. Lastly, in the information sciences, it refines information retrieval systems [12] and processes natural language for better human–computer interaction [13], alongside analyzing network and traffic patterns [14].

The most fundamental and widely studied clustering model is the minimum sum-of-squares clustering (MSSC) [15]. It can be formulated as follows. Consider a set of m data points

X = {x_{1}, \dots, x_{m}}

in the Euclidean space

R^{n}

. Then, MSSC is aimed at finding k cluster centers (centroids)

C = (c_{1}, \dots, c_{k}) \in R^{n \times k}

that minimize the sum of squared distances from each data point

x_{i}

to its nearest cluster center

c_{j}

:

min_{C} f (C, X) = \sum_{i = 1}^{m} min_{j = 1, \dots, k} {∥ x_{i} - c_{j} ∥}^{2}

(1)

where

∥ \cdot ∥

denotes the Euclidean norm. Each collection of centroids C uniquely defines the corresponding partition

X = X_{1} \cup \dots \cup X_{k}

, where each subset (cluster)

X_{j}

consists of the points that are closest to

c_{j}

than to any other centroid. Equation (1) represents the objective function measuring the total squared deviation of data points from their closest centroids. Its global optimization leads to the simultaneous maximization of the similarity between objects within the same cluster and minimization of the similarity between objects in different clusters.

When dealing with big data, where the number of data points is unbounded, i.e.,

| X | = m = \infty

, formulation (1) gives rise to the minimum sum-of-squares clustering of infinitely tall data (MSSC-ITD) problem, which is one of the key contributions of our work. This problem makes traditional clustering methods unfeasible. The MSSC-ITD problem is a novel formulation that we have introduced in this paper, and our proposed algorithm is the first to provide an efficient solution to this challenge. In particular, few clustering algorithms exist that can address this problem, and even fewer can perform a global search in such conditions. Our approach fills this gap, providing a robust and efficient solution to the MSSC-ITD problem.

The research has shown that global minimizers provide the most accurate representation of the clustering structure of a given dataset [16]. However, achieving global minimizers in MSSC is a challenging task due to the highly non-convex nature of the objective function. This non-convexity becomes even more pronounced as the dataset size increases, making the task of finding global minimizers even more complex.

To address this challenge, several approaches have been proposed in the literature to explore the solution space and locate global minimizers, such as gradient-based optimization techniques [17], stochastic optimization algorithms [18], metaheuristic search strategies [16,19], and hybrid approaches [20]. Each of these approaches has its strengths and weaknesses, and there is no all-around solution. As a result, further research is needed to develop more efficient and robust techniques for locating global minimizers in the context of the MSSC-ITD problem.

Apart from the above classification, parallel processing in big data clustering algorithms presents another critical and frequently overlooked aspect. Most approaches that have been discovered in the literature are limited to only data parallelism, which is usually implemented using the MapReduce model. Meanwhile, more sophisticated parallel strategies are either not investigated or not applicable to the big data clustering algorithms available in the literature.

For general k and m, the MSSC algorithms are known to be computationally intensive due to their NP-hard complexity [15]. The NP-hardness of MSSC is heavily exacerbated by big data contexts. High-performance computing (HPC) technologies, including supercomputers and computer clusters, offer a robust platform for tackling such complex problems. By distributing the data across multiple nodes, computers, or processors, parallel processing enables scalable and efficient handling of big data. This approach leverages the combined computational power of multiple computing resources, allowing for faster and more effective execution of MSSC algorithms on massive datasets.

In this work, we propose HPClust, a set of novel parallel approaches for the MSSC-ITD problem. The decomposition principle is at the heart of the HPClust algorithm. This principle not only serves as the algorithm’s cornerstone but also facilitates efficient and effective parallel processing of big data. Parallel processing is one of the core approaches employed for big data clustering. In the current work, we endeavor to comprehensively explore this dimension with the goal of maximizing the performance of the HPClust algorithm in big data contexts.

Four parallel approaches—inner, competitive, cooperative, and hybrid—are proposed to tackle the MSSC-ITD problem. The inner parallel method involves parallelizing distance evaluations in the K-means local search applied within each sequential clustering subproblem, offering scalability in the subproblem size. The competitive strategy implements concurrency at the subproblem level, maximizing diversity in the initial clustering solutions. The cooperative approach simultaneously processes clustering subproblems, maximizing exploration by continuously selecting the best solution and capitalizing on it. The hybrid strategy combines the last two into a multi-strategy competitive–cooperative approach, aiming for an optimal exploration–exploitation trade-off in MSSC-ITD solutions.

The name HPClust can be interpreted in two ways, both reflecting the algorithm’s key strengths. Firstly, “High-Performance Clustering” highlights the algorithm’s computational efficiency, speed, and ability to scale through parallelism, making it a high-performance solution for clustering tasks. Alternatively, “Hybrid Parallel Clustering” emphasizes the innovative combined parallel clustering strategy employed by HPClust, which leverages the strengths of different parallel approaches to achieve superior performance. This hybrid strategy sets HPClust apart as a winning solution in the field of parallel clustering algorithms.

Notably, our algorithm boasts a significant conceptual advantage as one of the few clustering algorithms that is inherently parallel in nature. This allows it to improve solution quality through increased scalability and parallelism, setting it apart from other algorithms that may struggle to scale. Moreover, our algorithm is capable of competing with advanced clustering algorithms designed for small- and medium-sized datasets, demonstrating its versatility and robustness. Unlike other algorithms where parallelism is a forced add-on, our algorithm’s parallel nature is an intrinsic property that enables seamless scalability.

While other approaches to clustering often rely solely on data parallelism, our approach utilizes a combination of more advanced and sophisticated parallelism types. Data parallelism involves dividing the dataset into smaller chunks and processing each chunk simultaneously on different processors but only brings advantages in processing time. In contrast, task parallelism (functional parallelism) enables us to execute different tasks or functions of the clustering algorithm in parallel, allowing for more flexibility and effectiveness when merging their results. Furthermore, hybrid parallelism combines these approaches, allowing us to leverage the strengths of each to achieve better results. Unlike other parallel approaches that only focus on scaling clustering in the data space without guarantees on solution quality, our approach leverages the strengths of different parallelism types by combining data parallelism with task parallelism and hybrid parallelism, achieving better results. This integrated approach sets our method apart from others, which often rely on a single type of parallelism, and enables us to deliver higher-quality clustering solutions and scalability in big data clustering.

Furthermore, we provide a comprehensive review of various parallel and high-performance computing techniques used for big data clustering and indicate their strengths and weaknesses. We pinpoint the intricacies involved in the process of applying these approaches to HPClust, as well as exhibit the obtained insights in the form of a tutorial on applying parallel and high-performance computing technologies to the problem of big data clustering.

Our paper is structured as follows. Section 2 surveys the key developments and strategies in the field of parallel clustering algorithms. Section 3 presents the proposed HPClust algorithm, while Section 4 describes its various parallel strategies. Section 5 provides an overview of modern high-performance techniques for optimizing big data clustering algorithms, highlighting key nuances and considerations in the implementation details of the HPClust algorithm. Section 6 describes our experimental setup and its methodology. Section 7 provides a detailed analysis and interpretation of our experimental findings, along with insights into trade-offs. Section 8 offers practical guidelines for selecting the optimal parallel strategy for HPClust, aimed at big data clustering practitioners. Finally, Section 9 concludes our work and identifies promising future research directions.

2. Related Works

In the field of big data clustering, many methods have been created that work in parallel and distribute the workload to handle the difficulties presented by the large size, complex dimensions, and real-time nature of big data. Parallelism and distributed computing appear as two prominent techniques for big data clustering.

Usually, parallel processing in clustering algorithms involves dividing the data into smaller subsets, clustering them simultaneously on multiple processors, and aggregating these partial results into a global solution. This helps in reducing the computation time and makes the clustering process much more efficient. It is usually used when the data are too large to fit into memory or the computation time is a bottleneck.

Distributed computing, on the other hand, involves the distribution of big data across multiple machines. Clustering is then performed in a distributed manner using frameworks like Apache Hadoop or Apache Spark [21]. By distributing the data and computations, processing time is reduced, and scalability is achieved. This approach is useful when a dataset is unacceptably large to be stored and processed on a single machine.

The K-means [22] algorithm with the Forgy initialization [23] is a commonly used traditional clustering method due to its simplicity and effectiveness. However, its application to big data can pose problems due to its high time complexity, which is

(m \cdot n \cdot k)

for a single iteration, and the need to store all data in memory. The pseudocode of the Forgy K-means clustering method is provided in Algorithm 1.

Algorithm 1: Forgy K-Means Clustering

To circumvent the time complexity limitations of traditional approaches, like Forgy K-means, some parallel and distributed clustering algorithms have been suggested in the literature. The MapReduce framework is by far the most popular approach to scale clustering in the data space [24]. Zhao et al. [25] implemented a distributed version of K-means according to the MapReduce concept that led to a significant speed-up compared to the sequential version without any guarantees on the clustering solution quality.

A widely adopted method to handle large datasets that cannot be accommodated entirely in RAM is the Minibatch K-means algorithm [26]. It is an online version of the K-means algorithm that employs random subsets, or minibatches, of a dataset during each iteration to update the current solution. While this technique significantly accelerates computation time, it sacrifices the clustering quality since it exerts no control over the solution updates across iterations. Moreover, Minibatch K-means is an inherently sequential algorithm, amenable to only data parallelism.

Bahmani et al. [27] developed a scalable version of K-means++ that merges the advantages of K-means++ and Mini-batch K-means. However, our experimental evaluation on large real-world datasets showed that K-means

| |

, while being on par with K-means++ in speed, is significantly worse than K-means++ with respect to solution quality.

Alguliyev et al. proposed an innovative approach in their study, where they introduced the parallel batch K-means for big data clustering (PBK-BDC) algorithm [28]. This algorithm partitions large datasets into smaller segments, clusters them with the help of K-means, and aggregates the resulting cluster centers into a final pool. The algorithm then clusters the pool using K-means again. The pseudocode for the PBK-BDC algorithm can be found in Algorithm 2. Notably, PBK-BDC is one of the most prominent partition-based clustering algorithms. In the original paper, the authors empirically evaluated PBK-BDC and found that it outperformed the classical K-means algorithm [28]. However, this evaluation did not compare PBK-BDC to other advanced algorithms for clustering large datasets, leaving room for further research.

Algorithm 2: PBK-BDC Clustering Method

Mohebi et al. [21] conducted a comprehensive review of various parallel algorithms and concluded that the field of big data clustering using parallel computing is still in its emergent stage and offers significant scope for further research. They observed that parallel data processing can potentially reduce the clustering time of large datasets, but it may also have an adverse impact on the quality and performance of clustering. Thus, the primary objective of research in this area should be to achieve an optimal balance between quality and speed of clustering for big data applications.

Our proposed HPClust algorithm, utilizing advanced parallel processing techniques and intelligent sample selection, seeks to fill the gaps in the field. HPClust proves that advanced parallel strategies and careful algorithm design may optimize both the efficiency and effectiveness of clustering algorithms simultaneously, while maintaining exceptional scalability across various data scales.

3. Proposed Algorithm

We propose HPClust, an array of parallel heuristic approaches for solving the MSSC-ITD problem via high-performance computing techniques. The algorithm’s main idea is to apply the problem decomposition technique, letting each parallel worker iteratively process a sequence of subproblems, and intelligently combine the obtained partial results into a single global clustering solution.

Each parallel worker w operates by sequentially clustering the incoming samples from a large dataset. It begins by randomly selecting a small sample S of size s from X and uses the K-means++ algorithm to obtain the initial configuration of centroids C. The worker then clusters each new incoming sample by the K-means algorithm using the best set of centroids

C_{w}

(or

C_{b e s t}

) obtained from all previously processed samples for the current worker (or among all parallel workers), called the incumbent solution. The incumbent solution is chosen based on the objective function value (1) obtained on a sample. This “keep the best” principle allows the algorithm not to lose information about the best local minimum obtained so far, and more iterations can only lead to further improvements.

HPClust solves the issue of degenerate clusters (also known as empty clusters) by reinitializing them with K-means++ when all data points are reassigned to other clusters. This introduces new cluster centers, enhancing the overall clustering solution and increasing opportunities to minimize the objective function. Furthermore, introducing new samples in each iteration perturbs the incumbent solution, injecting variability into the clustering outcomes.

When a stop condition is reached by any parallel worker (e.g., a time limit or maximum number of processed samples), the algorithm selects the final centroid set C by choosing the solution obtained by the worker with the lowest incumbent sample objective function. Then, HPClust assigns data points of the entire dataset to their closest cluster centers in the final centroid set C. However, this final assignment step may be omitted if only the final centroids or a limited set of assignments are required.

HPClust’s iteration time complexity is

(s \cdot n \cdot k)

(where k is the number of clusters). The algorithm’s scalability can be fine-tuned by selecting suitable sample sizes and counts. By processing smaller subsets of the data in each iteration, the computational demands are substantially reduced. Additionally, employing random subsets of the data during each iteration and periodically re-initializing the centroids of degenerate clusters prevents the algorithm from being trapped in suboptimal solutions. This allows the algorithm to explore different parts of the objective function’s landscape, potentially finding better solutions than a single application of the K-means algorithm.

4. Parallel Strategies for the HPClust Algorithm

The HPClust algorithm is designed to be highly parallel in nature. Four different parallel strategies can be employed:

Inner parallelism (HPClust-inner): Employs parallel clustering at the implementation level of K-means and K-means++, processing individual data samples sequentially while parallelizing the calculation of minimum distances;
Competitive parallelism (HPClust-competitive): Independent workers process individual data samples in parallel, each using its own previous best centroids $C_{w}$ for initialization, and the best solution is selected when the stopping criterion is met. A pseudocode of the HPClust-competitive algorithm is shown in Algorithm 3;
Cooperative parallelism (HPClust-cooperative): Workers share information on best solutions and use the best set of centroids $C_{b e s t}$ obtained from all previous iterations across every worker, initializing each subsequent sample using the cooperative knowledge. A pseudocode of the HPClust-cooperative algorithm is provided in Algorithm 4;
Hybrid or competitive–cooperative parallelism (HPClust-hybrid): Combines competitive and cooperative strategies, initially utilizing diversity through competitive parallelism for a duration of $T_{1}$ seconds or $N_{1}$ iterations and then capitalizing on the most successful evolved form through cooperative parallelism for an additional $T_{2}$ seconds or $N_{2}$ iterations. A pseudocode of the HPClust-cooperative algorithm is presented in Algorithm 5.

The goal of the hybrid mode is to leverage the advantages of both competitive and cooperative approaches, ensuring diversity and exploiting the most successful solutions. Flowcharts for the competitive and cooperative strategies are provided in Figure 1 and Figure 2.

The HPClust algorithm source code, including implementations of various parallel strategies, is available at https://github.com/rmusab/hpclust (accessed on 28 May 2024).

Our study focuses on the efficiency of parallel interaction strategies, assuming equal access to the full-sized dataset and independent sampling, without exploring distributed data storage optimizations, which are left for a separate study.

Algorithm 3: Competitive HPClust Clustering

Algorithm 4: Cooperative HPClust Clustering

Algorithm 5: Hybrid HPClust Clustering

5. High-Performance Techniques in HPClust

5.1. Analytical Optimization

In the analytical optimization of computational algorithms, several high-performance computing techniques are relevant. These techniques represent algorithmic improvements or theoretical advancements applied at the abstract level of the algorithm itself.

Parallel processing of iterations;
Data sampling and partitioning;
Tuning the level of parallelism;
Optimizing inter-process communication.

The parallel processing of iterations allows for the simultaneous processing of multiple iterations. This strategy employs the execution of various instances of the algorithm on different subsets of data, significantly reducing the time required for convergence [29].

In relation to data management, HPClust can operate on subsets of data, allowing for a strategy of data partitioning. The initial dataset can be divided into smaller sections, each to be processed by an individual computing unit. This technique, known as data parallelism, proves particularly useful when handling datasets that exceed the memory capacity of a single machine [30].

The strategy of data sampling, wherein a random sample is selected from the dataset, can also be parallelized [31]. Especially in cases of extensive datasets, scanning the complete dataset becomes time-consuming. By distributing the dataset across multiple processors, each can sample a section of the data independently. Then, the resultant samples can be combined.

Tuning the level of parallelism to the specifics of a dataset can lead to significant performance improvements [32].

Optimizing inter-process communication by designing an algorithm to minimize data transfer between processes can improve performance. Techniques such as compression, delta encoding, or other forms of data reduction can also be utilized [33].

5.2. Nuances of Parallelism in HPClust

The HPClust algorithm, a partitioning-based clustering method, is well-suited for parallelism across its key processes. Within its inner parallel variant, HPClust-inner, two primary operations—initialization and centroid updating—can be executed concurrently. Initially, the algorithm leverages K-means++ on a subset of data, calculating distances from points to centroids, which can be carried out in parallel due to the independent nature of these calculations.

During each K-means iteration, the algorithm updates centroids (denoted as

C_{n e w}

) by measuring distances from all points in the sample to these new centroids, thereby redefining clusters. This centroid update phase shares the parallelizable characteristic of the initialization phase.

Despite the benefits of parallel processing in speeding up these tasks, it introduces certain challenges, such as the need for effective load balancing across cores or processors to avoid inefficiencies like idle processors, especially when the sample size s is much smaller than the number of processors.

Moreover, implementing parallel computation in HPClust requires careful attention to concurrency control to avoid race conditions—scenarios where the outcome depends on the order or timing of thread execution. In HPClust, threads may concurrently modify shared memory, such as updating centroid or data point memberships, potentially leading to inconsistent results.

To address these issues, synchronization mechanisms like locks, semaphores, or atomic operations are essential to ensure single-thread access to shared data, maintaining consistency and integrity. Optimizing the algorithm to reduce shared memory access can also help minimize race conditions. However, over-synchronization should be avoided as it can cause thread contention and decrease parallel efficiency.

For HPClust’s parallel performance, it is important to achieve an optimal balance between data protection and computational speed. The aim is to improve computational speed through parallel processing without altering the clustering outcomes, maintaining consistency in results irrespective of the processor count. However, unlike other parallel clustering algorithms, this is not required for HPClust. Instead, HPClust can achieve higher accuracy by performing more iterations within a fixed time interval. This means that parallelism in HPClust improves not only efficiency but also accuracy.

Furthermore, the robustness of HPClust’s parallel strategies is evident in centroid initialization, where allowing each worker to independently determine the initial centroids helps overcome the challenges of poor initial selections, a known issue in K-means clustering. This feature emphasizes the importance of effective parallel design in maximizing HPClust’s performance and accuracy.

5.3. Implementation-Level Optimization

To technically optimize the performance of HPClust on parallel or distributed computing systems, the following programmatic implementation-related techniques can be employed:

Vectorized operations;
SIMD instructions;
Concurrent data structures;
Distributed computing;
Load balancing;
Parallel random number generation;
Parallel input/output (I/O).

Furthermore, the utilization of vectorized operations also contributes to the optimization process. Libraries such as NumPy in Python and Armadillo in C++ offer the capacity for vectorized operations. The use of these operations across entire arrays, rather than individual elements, can lead to substantial speed increases. This is due to the reduction in loop overhead and more efficient utilization of CPU features [34].

Simultaneously, modern CPUs provide support for single instruction multiple data (SIMD) instructions. With these, the same operation can be performed across multiple data points concurrently [35]. Vectorizing the computations, such as distance calculations in the HPClust algorithm, allows for the exploitation of these instructions, resulting in significant speed gains.

Modern programming languages and libraries offer concurrent data structures, which are designed for safe use across multiple threads or processes. These structures can prevent race conditions and synchronization issues, contributing to the efficiency of parallel algorithms [36].

For extremely large datasets that exceed the memory of a single machine, distributed computing frameworks such as Apache Hadoop or Apache Spark are beneficial. These frameworks facilitate the distribution of data and computation across several nodes in a cluster, accommodating larger datasets than would be possible on a single machine [30].

Load balancing is a strategy to efficiently use computational resources, ensuring an even distribution of work across all threads or processes. This strategy may include the dynamic assignment of tasks to processors based on their current workload. Alternatively, more sophisticated load-balancing algorithms can be employed [37].

The generation of random numbers, a function of the HPClust algorithm, can also be performed in parallel. Several techniques and libraries support parallel random number generation, maintaining independent and identically distributed numbers [38].

Finally, parallel I/O techniques can help alleviate the bottleneck caused by input/output operations such as reading data from a disk or writing results back. A parallel file system or separate threads or processes performing I/O operations can facilitate this [39].

To implement these parallel strategies, various libraries and frameworks can be utilized. OpenMP or MPI in C/C++ and multiprocessing in Python offer traditional approaches. For GPU-accelerated parallel computation, CUDA or OpenCL are typically used. However, for a balance between functionality and simplicity, one might also consider employing modern libraries such as Numba. Numba provides a just-in-time compiler for Python that is easy to use yet powerful. Mojo is another notable option, providing simple and efficient parallelization solutions with a focus on high-level, user-friendly interfaces. To take full advantage of modern hardware architectures, one could use optimized numerical libraries, such as Intel’s Math Kernel Library (MKL) or cuBLAS for GPUs. These libraries provide highly optimized implementations of common mathematical operations, which can lead to significant speedups.

Numba [40,41] is a key instrument in high-performance computing, featuring optimization capabilities such as parallelization, multi-threading, and vectorization. These features are core strategies in performance optimization, transforming the execution speed of Python functions, loops, and numerical computations. Numba’s dynamic generation of optimized machine code for both CPUs and GPUs further contributes to this performance boost, converging Python’s usability and the speed of lower-level languages.

Numba’s proficiency extends to CUDA support, facilitating the optimization of computational procedures through the use of NVIDIA GPUs. Moreover, it showcases seamless integration with Python’s scientific stack, demonstrating compatibility with NumPy, SciPy, and Pandas, thereby optimizing Python’s computational efficiency. In the context of distributed computing, Numba’s interplay with Dask, a parallel computing library in Python, introduces an additional level of optimization, enabling efficient large-scale computations. Therefore, Numba serves as a potent tool in scientific computing, optimizing the bridge between Python’s user-friendly nature and the computational efficiency of lower-level languages.

5.4. Future Optimization Directions

Future optimization of the HPClust algorithm can leverage the following high- performance techniques:

Dynamically adjusting the number of threads;
Reducing communication overhead.

The number of threads can be adjusted dynamically, depending on the current system load and the size of the processed data subset, maximizing the use of CPU cores [32].

The overhead in communication between different threads or processes is a major concern in parallel algorithms [33]. Designing the algorithm to allow each thread or process to operate independently, reducing the need for communication, can address this.

6. Experimental Setup

6.1. Hardware and Software

Our experiments were conducted on an Ubuntu 22.04 64-bit system, equipped with an AMD EPYC 7663 Processor. The machine has 1.46 TB of RAM and runs Python 3.10.11, NumPy 1.24.3, and Numba 0.57.0. We utilizes Numba to accelerate Python code through just-in-time compilation and also to enable parallel processing capabilities.

6.2. Competitive Algorithms

We compare the performance of HPClust, equipped with different parallel strategies, to two benchmark algorithms: Forgy K-means [23] and PBK-BDC [28]. Forgy K-means serves as a basic lower benchmark, representing a simple and straightforward approach. On the other hand, PBK-BDC is an advanced upper benchmark that represents the most optimized big data clustering algorithm available in the literature [28].

6.3. Datasets

The experiments were conducted on 23 datasets: 19 are publicly available (detailed in Table 1 and Table 2), and 4 are normalized. These datasets, which are numerically based and have no missing values, vary significantly in size, from 7797 to 10,500,000 instances, and feature 2 to 5000 attributes. This variety ensures testing of HPClust’s adaptability across different data scales. Additionally, we align our methodology with Karmitsa et al. [17] for comparative analysis.

6.4. Experimental Design and Evaluation Metrics

Each dataset undergoes clustering

n_{e x e c}

times into k clusters of varying sizes. Each execution of an algorithm on some pair

(X, k)

is considered an experiment. The total number of conducted experiments reaches

22, 098

. We assessed each experiment by measuring the resulting relative error (

ε

), CPU time (t), and baseline convergence time (

\bar{t}

). The relative error reveals the performance relative to historical bests:

ε = 100 \cdot (f - f^{*}) / f^{*}

. Sometimes, a relative error may yield a negative value, which actually indicates an even more impressive performance by the algorithm, surpassing expectations.

For HPClust, the clustering time t represents the time until the last solution update of the fastest worker. Also, we employ a special baseline convergence time metric,

\bar{t}

, to more accurately measure clustering time, avoiding bias from minor late-stage improvements. More specifically, for each pair

(X, k)

, the baseline convergence time

\bar{t}

is calculated as the time to achieve a baseline sample objective value

{\bar{f}}_{s}

, which is the maximum (relative to the algorithms) median of the best sample objectives obtained across

n_{e x e c}

runs. Then, the baseline convergence time

\bar{t}

is defined as the time until any worker reaches this baseline sample objective value.

6.5. Hyperparameter Selection

We set a maximum CPU time limit T and stop the K-means clustering process if iterations exceed 300 or the improvement between two consecutive objectives is less than

10^{- 4}

. For K-means++, we consider three candidate points for sampling each new centroid.

Sample sizes are optimized based on preliminary tests to ensure no further adjustments improve performance. The specific values of T and

n_{e x e c}

can be found in the detailed tables of experimental results included in the Supplementary Materials.

6.6. Preliminary Experiments

Preliminary experiments helped establish the baselines and optimize parameters. Initially, we established that having 8 CPUs would be the optimal value for the subsequent experiments. In this context, the optimal selection means that this choice achieves the best balance between the solution quality and execution time simultaneously for all the considered algorithms, allowing for further fair comparison under equal conditions.

The subsequent preliminary experiments involved running parallelized HPClust versions to establish baseline sample objective values

{\bar{f}}_{s}

and fine-tuning the hybrid parallel approach by experimenting with different time splits (

T_{1}

and

T_{2}

).

6.7. Main Experiments

The main experimental results are displayed using a special table format. Each algorithm and pair

(X, k)

originate a series of

n_{e x e c}

experiments. Each series has minimum, median, and maximum resulting values of relative accuracy and time, which are calculated across

n_{e x e c}

executions of the algorithm on configuration

(X, k)

. The means of these metrics across the values of k for each dataset are displayed in the corresponding columns of the presented tables. Table 3 and Table 4 provide a comparison of the proposed HPClust parallel strategies, while Table 5 and Table 6 compare the best HPClust parallel strategy with the selected competitive algorithms.

For instance, for a particular algorithm, we have the following entry in a table: ISOLET #Succ = 6/7; Min = 0.01; Med = 0.24; Max = 0.59. In this case, the ratio 6/7 indicates that for each of the 7 different values of

k \in

{2, 3, 5, 10, 15, 20, 25}, we performed a series of runs for each of the compared algorithms. For each fixed choice of

(X, k)

, the corresponding series consists of

n_{e x e c}

= 15 independent runs of each algorithm. Thus, for each dataset, we have 7 series of runs for each of the compared algorithms, with each series containing 15 independent results. The number 6 in the #Succ ratio 6/7 indicates that the median objective function values for 6 out of 7 series of runs of this algorithm were lower than the mean objective function values in the corresponding series of all other algorithms.

The means in the final rows of these tables highlight overall performance across datasets. The best results for each metric and dataset pair were bolded, indicating top algorithm performance. The highest accuracy values for each dataset are displayed in bold among the algorithm results. Success is indicated when an algorithm’s median performance on a series of executions for a value of k outperforms or matches the best result among all algorithms for this series.

6.8. Scaling Experiment

Additionally, we conducted an experiment to demonstrate the scalability of our proposed HPClust strategies. We generated a synthetic dataset with 10 features comprising 10 Gaussian blobs uniformly distributed within the box

(- 40, 40)

, each with a randomly sampled standard deviation between

(0, 10)

. The number of points was varied according to the law

m = 3^{i + 7}

, where

i = 0, \dots, 8

. For each i, we performed 10 execution repetitions for each algorithm and recorded the results. We employed a sample size of

s = min {5000, m - 1000}

and a processing time limit of

T = 3.0

seconds for the HPClust and PBK-BDC algorithms. For the HPClust-hybrid, we used a naive time split of

T_{1} = T_{2} = T / 2

to avoid additional optimization. To introduce noise, we added 500 random points uniformly distributed within the box

(- 50.0, 50.0)

to each synthetic dataset. This experiment allowed us to assess the scalability of our algorithms under varying dataset sizes.

7. Experimental Results and Discussion

7.1. Performance Evaluation

The results of the first set of preliminary experiments, illustrated in Figure 3a,b, determined the optimal number of CPUs for subsequent experiments, setting the stage for further investigation. As anticipated, the fully sequential strategy (HPClust-sequential) displayed no significant correlation with the number of parallel processors employed. The HPClust version with inner parallelism demonstrated a reduction in processing time with an increase in the number of CPUs, while the accuracy remained independent of the CPU count. In contrast, both the HPClust-competitive and HPClust-cooperative strategies exhibited an improvement in clustering accuracy as the number of CPUs increased. However, this accuracy gain came at the expense of increased processing time for these versions of HPClust. We attribute this observation to the need for coordination among multiple processors and the technical complexities introduced by Numba, such as parallel access to shared memory locations by multiple workers. Upon closer examination of the scores, we determined that utilizing 8 CPUs strikes the optimal balance between processing time and resulting accuracy across all algorithms on our machine. Thus, this choice of the CPU count was used in all the subsequent experiments.

Other preliminary experiments were straightforward. They allowed us to obtain the necessary optimal values of the parameters for the main set of experiments.

A summary of the results of the main experiments are provided in Table 3, Table 4, Table 5 and Table 6. Full details of the results of the main experiments are provided in the Supplementary Materials.

As Table 3 demonstrates, the HPClust-competitive, HPClust-cooperative, and HPClust-hybrid strategies markedly boost overall clustering quality, achieving results that are up to three times better than HPClust-inner.

The HPClust-competitive approach showed a slight edge in average clustering quality compared to HPClust-cooperative, likely due to comprehensive initializations that mitigate K-means’ sensitivity to initial conditions. The analysis highlights a trade-off between extensive local optimization with a single start point and multiple initializations. The experiments suggest that multiple initializations, persistently processed by the competitive method, lead to better outcomes than the cooperative method’s focus on a single initialization. This finding favors exploring diverse K-means++ initializations to select the optimal one in the end.

The HPClust-hybrid exhibited the highest average clustering accuracy among the tested methods. This outcome was anticipated to a certain extent, as the hybrid approach combines the strengths of both regimes. In the initial stage, the competitive strategy enables extensive and rapid exploration of various K-means++ initializations on samples. In the subsequent stage, the cooperative strategy facilitates a thorough exploitation of the best solution obtained from the first stage for the remaining time. However, the hybrid strategy necessitates an additional optimization concerning the parameter

T_{1}

, which determines the split between the competitive and cooperative regimes. This parameter is highly dependent on the specific dataset and the number of clusters. In certain scenarios, particularly when dealing with numerous diverse datasets for clustering, this might pose a significant overhead that could be challenging to handle.

In examining the baseline convergence times among various parallel strategies, it was evident that the HPClust-inner method achieved quicker baseline convergence than the alternatives for the majority of datasets. This disparity was especially notable in larger datasets, as shown at the beginning of Table 4. For some datasets, to maintain high-quality clustering, substantial sample sizes were necessary, which were proportionate to the dataset sizes. The HPClust-inner strategy, by integrating parallelized K-means++ and K-means for each new sample, managed to expedite processing times relative to the sequential version in other parallel HPClust approaches. These findings highlight the crucial impact of algorithm selection and dataset characteristics on the delicate balance between computational efficiency and clustering accuracy. This underscores the importance of thoughtfully balancing sample size (which affects speed) with the quality of resulting clusters, as a careful trade-off is essential for achieving optimal outcomes.

Further analysis of the competitive, cooperative, and hybrid HPClust strategies revealed an intricate interplay between the benefits of parallel processing and the resulting time costs. These methods did improve the solution quality, but the coordination required among multiple processors and the additional complexity from using the Numba library prolonged the convergence process, compared to the HPClust-inner method. Typically, with 8 CPUs, these strategies took up to twice as long to converge as the HPClust-inner method. This observation highlights the need to carefully weigh the trade-offs between exploiting computational resources to accelerate clustering and incurring additional overheads that may impact performance.

Table 5 clearly demonstrates the superiority of the HPClust-hybrid algorithm over its competitors, exhibiting a significant lead in both the number of dominant series and average overall accuracy across all datasets. The HPClust-hybrid algorithm achieves an average accuracy that is a remarkable several orders of magnitude higher than its competitors.

As shown in Table 6, Forgy K-means, with its linear time complexity with respect to m, predictably exhibits a significant increase in time costs for the largest datasets, exceeding the fastest HPClust version by more than 20 times. While PBK-BDC is the quickest for small datasets, its average time costs for the largest datasets are triple those of HPClust, highlighting HPClust’s efficiency advantage for large datasets.

The scaling experiment results are presented in Figure 4a,b. For each x-axis value, the median score across 10 repetitions is displayed. The figures clearly show that all HPClust versions are highly robust and scalable with respect to the number of points, achieving optimal clustering accuracy (within

0.2 %

of ground truth) while keeping clustering time under 3 seconds, regardless of dataset size. In contrast, competitive algorithms Forgy K-means and PBK-BDC exhibited substantially suboptimal clustering quality, with Forgy K-means incurring unacceptable linearly rising time costs with increasing points (e.g., over 2 h for a single execution on a 43 million point dataset). Meanwhile, PBK-BDC failed to provide steadily optimal clustering solutions at any data scale, despite slightly increased time costs for larger datasets. The HPClust versions demonstrated superior performance and scalability. The detailed experimental results, showcasing median values across various data scales and algorithms, are presented in Table 7 and Table 8 for a comprehensive understanding.

Surprisingly, the scaling experiment’s results reveal an additional extraordinary property of HPClust: its iterative sampling processing with small samples renders it robust to noise and outliers, demonstrating a remarkable resilience to data perturbations and anomalies.

7.2. Trade-Offs Analysis

Our experiments with the HPClust algorithm have revealed several key trade-offs. Here, we present an in-depth analysis of these trade-offs, which often involve intricate balancing acts between efficiency, accuracy, computation time, and dataset characteristics. The following are the primary trade-offs that practitioners might have to consider:

Accuracy vs. Computation Time: Our results showed that the choice of strategy significantly influences the balance between computation time and the resulting accuracy. For example, while HPClust-inner demonstrated faster convergence times, especially for large datasets, the HPClust-competitive, HPClust-cooperative, and HPClust-hybrid strategies offered improved clustering quality at the cost of slightly increased computation time. Thus, your choice should weigh the importance of quick results against the necessity of clustering precision;
Parallelism vs. Overhead: The level of parallelism used directly impacts the computation time and the overhead associated with managing multiple processors. While increasing the number of processors generally results in faster computation, it also introduces added overhead in coordinating these processors. This was particularly evident when using HPClust-competitive, HPClust-cooperative, and HPClust-hybrid strategies, which took nearly twice as long to converge as HPClust-inner, despite yielding superior solutions;
Sample Size vs. Quality of Clusters: The size of the sample used in the HPClust algorithm directly impacts the quality of clusters and the computation time. Larger samples often led to better approximations of the overall data distribution and improved final clustering quality. However, these benefits were offset by slower algorithmic performance, which is a crucial aspect to consider when dealing with large datasets;
Strategy Selection vs. Initialization Quality: In the context of HPClust, another critical trade-off lies in the choice of strategy and its influence on the quality of initializations. HPClust-competitive, which applies multiple initializations and continues clustering different K-means++ initializations to select the best one at the end, showed a slightly improved clustering quality over HPClust-cooperative. Meanwhile, the HPClust-hybrid strategy effectively amalgamated the comprehensive exploration capabilities of the competitive approach with the exploitation abilities of the cooperative approach. However, it should be noted that this comes with the requirement of additional optimization for the split parameter $T_{1}$ . Therefore, the sensitivity of K-means to the quality of initial initialization is another critical factor to consider when choosing the strategy.

In navigating these trade-offs, understanding the unique requirements of your task and the nature of your dataset is paramount. Each strategy presents its own advantages and disadvantages, which should be carefully considered in light of these trade-offs. With the correct approach, these trade-offs can be effectively managed to achieve optimal clustering results with the HPClust algorithm.

8. Guidelines for Choosing Parallel Strategy

Considering the outcomes of our research, we propose the following revised guidelines for selecting an appropriate parallel strategy for the HPClust algorithm:

If you are handling large datasets and have concerns over computation time, opt for the HPClust-inner strategy. This variant consistently showed faster convergence to baselines across most datasets, especially larger ones, as evidenced in the first rows of Table 4. The employment of significant sample sizes, relative to the dataset sizes, along with parallelized K-means++ and K-means on each new sample, contributed to its accelerated processing times. However, remember that larger sample sizes often led to slower algorithmic performance, so balancing sample size with the quality of clusters remains crucial;
When computation time is less of a constraint and you aim for better clustering quality, choose between HPClust-competitive and HPClust-cooperative strategies. Both these strategies demonstrated an improved quality of final solutions compared to other versions of HPClust, on average three times better with 8 CPUs. However, due to the additional overhead of coordinating multiple processors and the complexities associated with the Numba library, they also exhibited longer convergence times, nearly twice as long as HPClust-inner with 8 CPUs;
If the clustering quality is your primary focus, HPClust-hybrid or HPClust-competitive should be the preferred choices. Our findings indicated a slightly improved clustering quality with HPClust-competitive compared to HPClust-cooperative. This improvement stems from the application of multiple initializations at the beginning, as K-means is highly sensitive to initial initialization quality. This strategy continues to cluster different K-means++ initializations, eventually selecting the best one at the end, leading to a superior solution. In the meantime, if you aim for superior clustering quality and willing to spend extra time on parameter optimization, opt for the HPClust-hybrid strategy. This choice demonstrated the best resulting clustering quality, while retaining the same degree of time efficiency as the competitive and cooperative approaches.

These guidelines should assist researchers and practitioners in choosing an appropriate parallel strategy for their specific needs. However, keep in mind that these are general guidelines, and the choice of parallel strategy should be adapted to the specific requirements of your task and the nature of your dataset. This research strongly suggests that parallelism, when feasible, offers a significant enhancement in clustering accuracy and convergence time compared to the sequential variant.

Overall, the best strategy is likely to be one that strikes a balance between the need for accuracy, computation time, and the specific characteristics of the dataset at hand. The effectiveness of each strategy will inevitably depend on these factors, and the choice should be made accordingly.

9. Conclusions and Future Works

Our paper introduces the HPClust algorithm and explores its four parallel strategies on diverse datasets, including real-world and synthetic ones. Our comprehensive evaluation focuses on three essential metrics: relative clustering accuracy (

ε

), total runtime (t), and baseline-normalized runtime (

\bar{t}

). These metrics provide a thorough assessment of each strategy’s effectiveness and efficiency, enabling a well-rounded comparison.

The experimental results demonstrate HPClust’s unrivaled effectiveness, efficiency, and scalability compared to baseline algorithms across a vast range of real-world datasets (spanning small to big sizes) and synthetic datasets. HPClust consistently outperforms its competitors, showcasing remarkable robustness to data scale and noise, as well as adaptability in various data settings.

Furthermore, this research demonstrates that no single parallel strategy universally optimizes the HPClust algorithm. Instead, the most effective approach depends on the dataset’s characteristics, emphasizing the need for adaptive techniques that dynamically select the best strategy. However, in most cases, we recommend practitioners to employ either the competitive or hybrid (competitive–cooperative) parallel strategies of HPClust, which have shown superior performance and versatility.

Additionally, our work offers a comprehensive review of the primary high-performance techniques utilized for optimizing data clustering algorithms. We delve into the intricate aspects and nuances of applying parallel techniques, specifically analyzing the challenges and pitfalls associated with the HPClust algorithm. Through a detailed trade-off analysis, we provide practical guidelines to assist in selecting the most suitable parallel strategy for specific use cases. These guidelines aim to facilitate informed decision-making and provide actionable recommendations.

The future research will focus on developing adaptive methods that can intelligently choose the most suitable parallel strategy based on the specific dataset, optimizing performance and accuracy. Additionally, we will conduct a more in-depth analysis of the trade-offs revealed in this study, exploring their nuanced effects on algorithmic performance and accuracy, to uncover actionable insights for further improvement.

Another promising future research direction for the proposed HPClust algorithm is its potential adaptation for clustering streaming datasets or continuously growing datasets. This is particularly relevant in scenarios involving IoT sensors, financial transactions, social media feeds, and other real-time data sources, where data is constantly generated and requires efficient clustering techniques to uncover insights and patterns. By extending HPClust to handle streaming data, researchers can unlock new opportunities for real-time analytics and decision-making in various fields.

This study’s findings and observations lay the groundwork for advancing efficient and adaptive parallel techniques for HPClust and beyond. Our goal is for this research to make a meaningful impact in the fields of data clustering and high-performance computing, driving innovation and improvement in these areas. By shedding light on the complex relationships between parallel strategies, dataset characteristics, and algorithmic performance, we aim to spark further discovery and progress.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math12131930/s1.

Author Contributions

Conceptualization, R.M. (Rustam Mussabayev); Methodology, R.M. (Rustam Mussabayev); Software, R.M. (Ravil Mussabayev); Validation, R.M. (Ravil Mussabayev); Formal analysis, R.M. (Ravil Mussabayev); Investigation, R.M. (Ravil Mussabayev); Resources, R.M. (Rustam Mussabayev); Data curation, R.M. (Rustam Mussabayev); Writing—original draft, R.M. (Ravil Mussabayev); Writing—review & editing, R.M. (Rustam Mussabayev); Visualization, R.M. (Ravil Mussabayev); Supervision, R.M. (Rustam Mussabayev); Project administration, R.M. (Rustam Mussabayev); Funding acquisition, R.M. (Rustam Mussabayev). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (grant no. BR21882268).

Data Availability Statement

The data presented in this study are publicly available. The corresponding URLs to the used datasets are provided in the article [UC Irvine Machine Learning Repository] [https://archive.ics.uci.edu/] [accessed on May 28 2024].

Conflicts of Interest

The authors declare no conflicts of interest.

References

Jiang, D.; Tang, C.; Zhang, A. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. 2004, 16, 1370–1386. [Google Scholar] [CrossRef]
Mittal, H.; Pandey, A.C.; Pal, R.; Tripathi, A. A new clustering method for the diagnosis of CoVID19 using medical images. Appl. Intell. 2021, 51, 2988–3011. [Google Scholar] [CrossRef] [PubMed]
de Ridder, D.; de Ridder, J.; Reinders, M.J.T. Pattern recognition in bioinformatics. Briefings Bioinform. 2013, 14, 633–647. [Google Scholar] [CrossRef]
Yin, Y.; Liu, F.; Zhou, X.; Li, Q. An Efficient Data Compression Model Based on Spatial Clustering and Principal Component Analysis in Wireless Sensor Networks. Sensors 2015, 15, 19443–19465. [Google Scholar] [CrossRef]
Tu, B.; Yang, X.; Li, N.; Zhou, C.; He, D. Hyperspectral anomaly detection via density peak clustering. Pattern Recognit. Lett. 2020, 129, 144–149. [Google Scholar] [CrossRef]
Rakthanmanon, T.; Keogh, E.J.; Lonardi, S.; Evans, S. MDL-based time series clustering. Knowl. Inf. Syst. 2012, 33, 371–399. [Google Scholar] [CrossRef]
Lejun, Z.; Minghui, P.; Shen, S.; Weizheng, W.; Zilong, J.; Yansen, S.; Huiling, C.; Ran, G.; Gataullin, S. Redundant data detection and deletion to meet privacy protection requirements in blockchain-based edge computing environment. China Commun. 2024, 21, 149–159. [Google Scholar] [CrossRef]
Zhang, L.; Zhang, B.; Guo, R.; Wang, Z.; Wang, G.; Qiu, J.; Su, S.; Liu, Y.; Xu, G.; Tian, Z.; et al. Research on Covert Communication Technology Based on Matrix Decomposition of Digital Currency Transaction Amount. KSII Trans. Internet Inf. Syst. 2024, 18, 1020–1041. [Google Scholar] [CrossRef]
Chen, X.; Fang, Y.; Yang, M.; Nie, F.; Zhao, Z.; Huang, J.Z. PurTreeClust: A Clustering Algorithm for Customer Segmentation from Massive Customer Transaction Data. IEEE Trans. Knowl. Data Eng. 2018, 30, 559–572. [Google Scholar] [CrossRef]
Yeung, M.; Yeo, B.; Liu, B. Segmentation of video by clustering and graph analysis. Comput. Vis. Image Underst. 1998, 71, 94–109. [Google Scholar] [CrossRef]
Zhao, P.; Zhang, C.Q. A new clustering method and its application in social networks. Pattern Recognit. Lett. 2011, 32, 2109–2118. [Google Scholar] [CrossRef]
Djenouri, Y.; Belhadi, A.; Djenouri, D.; Lin, J.C.W. Cluster-based information retrieval using pattern mining. Appl. Intell. 2021, 51, 1888–1903. [Google Scholar] [CrossRef]
Alguliyev, R.M.; Aliguliyev, R.M.; Isazade, N.R.; Abdi, A.; Idris, N. COSUM: Text summarization based on clustering and optimization. Expert Syst. 2019, 36, 340. [Google Scholar] [CrossRef]
Depaire, B.; Wets, G.; Vanhoof, K. Traffic accident segmentation by means of latent class clustering. Accid. Anal. Prev. 2008, 40, 1257–1266. [Google Scholar] [CrossRef] [PubMed]
Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 2009. [Google Scholar] [CrossRef]
Gribel, D.; Vidal, T. HG-means: A scalable hybrid genetic algorithm for minimum sum-of-squares clustering. Pattern Recognit. 2019. [Google Scholar] [CrossRef]
Karmitsa, N.; Bagirov, A.M.; Taheri, S. Clustering in large data sets with the limited memory bundle method. Pattern Recognit. 2018. [Google Scholar] [CrossRef]
Franti, P.; Sieranoja, S. How much can k-means be improved by using better initialization and repeats? Pattern Recognit. 2019, 93, 95–112. [Google Scholar] [CrossRef]
Hansen, P.; Mladenovic, N. J-Means: A new local search heuristic for minimum sum of squares clustering. Pattern Recognit. 2001, 34, 405–413. [Google Scholar] [CrossRef]
Mansueto, P.; Schoen, F. Memetic differential evolution methods for clustering problems. Pattern Recognit. 2021, 114, 107849. [Google Scholar] [CrossRef]
Mohebi, A.; Aghabozorgi, S.; Wah, T.Y.; Herawan, T.; Yahyapour, R. Iterative big data clustering algorithms: A review. Softw. Pract. Exp. 2016, 46, 107–129. [Google Scholar] [CrossRef]
Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
Pena, J. An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognit. Lett. 1999, 20, 1027–1040. [Google Scholar] [CrossRef]
Dean, J.; Ghemawat, S. MapReduce: Simplified data processing on large clusters. Proc. Commun. ACM 2008, 51, 107–113. [Google Scholar] [CrossRef]
Zhao, W.; Ma, H.; He, Q. Parallel k-means clustering based on MapReduce. In Proceedings of the Cloud Computing; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5931, pp. 674–679. [Google Scholar]
Sculley, D. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 1177–1178. [Google Scholar]
Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable k-means++. Proc. VLDB Endow. 2012, 5, 622–633. [Google Scholar] [CrossRef]
Alguliyev, R.M.; Aliguliyev, R.M.; Sukhostat, L.V. Parallel batch k-means for Big data clustering. Comput. Ind. Eng. 2021, 152, 107023. [Google Scholar] [CrossRef]
Crawford, I.L.; Wadleigh, K.R. Software Optimization for High Performance Computers; Prentice Hall PTR: Hoboken, NJ, USA, 2000. [Google Scholar]
Dafir, Z.; Lamari, Y.; Slaoui, S.C. A survey on parallel clustering algorithms for Big Data. Artif. Intell. Rev. 2021, 54, 2411–2443. [Google Scholar] [CrossRef]
He, Q.; Wang, H.; Zhuang, F.; Shang, T.; Shi, Z. Parallel sampling from big data with uncertainty distribution: Special issue: Uncertainty in Learning from Big Data. Fuzzy Sets Syst. 2015, 258, 117–133. [Google Scholar] [CrossRef]
Sabne, A.; Sakdhnagool, P.; Eigenmann, R. Scaling large-data computations on multi-GPU accelerators. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, New York, NY, USA, 21–23 June 2013; ICS ’13. pp. 443–454. [Google Scholar] [CrossRef]
Gupta, M.; Schonberg, E.; Srinivasan, H. A unified framework for optimizing communication in data-parallel programs. IEEE Trans. Parallel Distrib. Syst. 1996, 7, 689–704. [Google Scholar] [CrossRef]
Psarras, C.; Barthels, H.; Bientinesi, P. The Linear Algebra Mapping Problem. Current State of Linear Algebra Languages and Libraries. ACM Trans. Math. Softw. 2022, 48, 35. [Google Scholar] [CrossRef]
Chhugani, J.; Nguyen, A.D.; Lee, V.W.; Macy, W.; Hagog, M.; Chen, Y.K.; Baransi, A.; Kumar, S.; Dubey, P. Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture. Proc. VLDB Endow. 2008, 1, 1313–1324. [Google Scholar] [CrossRef]
Saraswat, V.A.; Sarkar, V.; von Praun, C. X10: Concurrent programming for modern architectures. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 29–31 March 2007; PPoPP ’07. p. 271. [Google Scholar] [CrossRef]
Jafarnejad Ghomi, E.; Masoud Rahmani, A.; Nasih Qader, N. Load-balancing algorithms in cloud computing: A survey. J. Netw. Comput. Appl. 2017, 88, 50–71. [Google Scholar] [CrossRef]
Barash, L.; Shchur, L. PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs. Comput. Phys. Commun. 2014, 185, 1343–1353. [Google Scholar] [CrossRef]
May, J.M. Parallel I/O for High Performance Computing; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2001. [Google Scholar]
Lam, S.K.; Pitrou, A.; Seibert, S. Numba: A LLVM-based Python JIT compiler. In Proceedings of the LLVM ’15: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, Austin, TX, USA, 15 November 2015; pp. 1–6. [Google Scholar] [CrossRef]
Marowka, A. Python accelerators for high-performance computing. J. Supercomput. 2018, 74, 1449–1460. [Google Scholar] [CrossRef]

Figure 1. Flowchart of the HPClust algorithm with the competitive parallelism.

Figure 2. Flowchart of the HPClust algorithm using a cooperative parallel strategy.

Figure 3. Comparative results of the algorithms with respect to the number of employed CPUs averaged across all datasets.

Figure 4. Comparative results of the algorithms with respect to the number of points m in a synthetic dataset.

Table 1. Brief description of the datasets.

Datasets	No. Instances	No. Attributes	Size	File Size
Datasets	$m$	$n$	$m \times n$	File Size
CORD-19 Embeddings	599,616	768	460,505,088	8.84 GB
HEPMASS	10,500,000	28	294,000,000	7.5 GB
US Census Data 1990	2,458,285	68	167,163,380	361 MB
Gisette	13,500	5000	67,500,000	152.5 MB
Music Analysis	106,574	518	55,205,332	951 MB
Protein Homology	145,751	74	10,785,574	69.6 MB
MiniBooNE Particle Identification	130,064	50	6,503,200	91.2 MB
MFCCs for Speech Emotion Recognition	85,134	58	4,937,772	95.2 MB
ISOLET	7797	617	4,810,749	40.5 MB
Sensorless Drive Diagnosis	58,509	48	2,808,432	25.6 MB
Online News Popularity	39,644	58	2,299,352	24.3 MB
Gas Sensor Array Drift	13,910	128	1,780,480	23.54 MB
3D Road Network	434,874	3	1,304,622	20.7 MB
KEGG Metabolic Relation Network (Directed)	53,413	20	1,068,260	7.34 MB
Skin Segmentation	245,057	3	735,171	3.4 MB
Shuttle Control	58,000	9	522,000	1.55 MB
EEG Eye State	14,980	14	209,720	1.7 MB
Pla85900	85,900	2	171,800	1.79 MB
D15112	15,112	2	30,224	247 kB

Table 2. URLs for the used datasets (all accessed on 28 May 2024).

Datasets	URLs
CORD-19 Embeddings	https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
HEPMASS	https://archive.ics.uci.edu/ml/datasets/HEPMASS
US Census Data 1990	https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
Gisette	https://archive.ics.uci.edu/ml/datasets/Gisette
Music Analysis	https://archive.ics.uci.edu/ml/datasets/FMA%3A+A+Dataset+For+Music+Analysis
Protein Homology	https://www.kdd.org/kdd-cup/view/kdd-cup-2004/Data
MiniBooNE Particle Identification	https://archive.ics.uci.edu/ml/datasets/MiniBooNE+particle+identification
MFCCs for Speech Emotion Recognition	https://www.kaggle.com/cracc97/features
ISOLET	https://archive.ics.uci.edu/ml/datasets/isolet
Sensorless Drive Diagnosis	https://archive.ics.uci.edu/ml/datasets/dataset+for+sensorless+drive+diagnosis
Online News Popularity	https://archive.ics.uci.edu/ml/datasets/online+news+popularity
Gas Sensor Array Drift	https://archive.ics.uci.edu/ml/datasets/gas+sensor+array+drift+dataset
3D Road Network	https://archive.ics.uci.edu/ml/datasets/3D+Road+Network+(North+Jutland,+Denmark)
KEGG Metabolic Relation Network (Directed)	https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Relation+Network+(Directed)
Skin Segmentation	https://archive.ics.uci.edu/ml/datasets/skin+segmentation
Shuttle Control	https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
Pla85900	http://softlib.rice.edu/pub/tsplib/tsp/pla85900.tsp.gz
D15112	https://github.com/mastqe/tsplib/blob/master/d15112.tsp

Table 3. Resulting relative clustering accuracies

ϵ

(%) for the proposed parallel HPClust strategies.

Table 3. Resulting relative clustering accuracies

ϵ

(%) for the proposed parallel HPClust strategies.

Dataset	HPClust-Inner				HPClust-Competitive
Dataset	#Succ	Min	Median	Max	#Succ	Min	Median	Max
CORD-19 Embeddings	0/7	0.07	0.21	0.34	3/7	0.0	0.07	0.18
HEPMASS	0/7	0.08	0.22	0.66	3/7	0.03	0.07	0.19
US Census Data 1990	2/7	0.92	3.13	35.87	3/7	0.48	1.48	2.89
Gisette	0/7	−0.43	−0.37	−0.19	2/7	−0.44	−0.38	−0.32
Music Analysis	3/7	0.41	0.91	6.24	4/7	0.43	0.74	1.67
Protein Homology	3/7	0.15	0.91	2.32	1/7	0.41	0.88	2.03
MiniBooNE Particle Identification	2/7	−0.03	0.51	402,305.65	1/7	−0.07	−0.0	719,099.04
MiniBooNE Particle Identification (normalized)	1/7	0.2	0.54	101.63	2/7	0.2	0.55	1.1
MFCCs for Speech Emotion Recognition	2/7	0.14	0.64	1.95	1/7	0.11	0.34	0.76
ISOLET	0/7	0.15	0.68	1.72	1/7	0.04	0.23	0.63
Sensorless Drive Diagnosis	1/7	−0.32	1.25	31.06	2/7	−0.43	−0.27	12.2
Sensorless Drive Diagnosis (normalized)	1/7	0.4	3.03	9.69	4/7	0.31	1.06	3.26
Online News Popularity	2/7	0.7	2.36	14.39	2/7	0.69	1.65	3.74
Gas Sensor Array Drift	2/7	0.15	3.24	12.29	2/7	−0.05	1.78	3.77
3D Road Network	2/7	0.04	0.4	1.24	2/7	0.03	0.22	1.06
Skin Segmentation	1/7	0.04	2.91	9.72	2/7	−0.05	1.05	4.36
KEGG Metabolic Relation Network (Directed)	3/7	−0.08	1.55	34.13	2/7	−0.42	0.24	2.5
Shuttle Control	1/8	0.17	5.68	41.76	2/8	−0.01	2.32	12.58
Shuttle Control (normalized)	1/8	0.89	2.81	17.98	0/8	0.69	1.79	4.07
EEG Eye State	3/8	0.54	0.79	7.15	2/8	0.53	0.56	119,444.14
EEG Eye State (normalized)	0/8	−0.06	2.4	31.49	6/8	−0.06	0.01	67.39
Pla85900	0/7	0.07	0.37	1.7	3/7	0.07	0.2	0.73
D15112	2/7	0.1	0.48	1.76	3/7	0.08	0.14	0.4
Overall Results	32/165	0.19	1.51	17,507.42	53/165	0.11	0.64	36,463.84
Dataset	HPClust-cooperative				HPClust-hybrid
Dataset	#Succ	Min	Median	Max	#Succ	Min	Median	Max
CORD-19 Embeddings	3/7	0.04	0.08	0.16	1/7	0.02	0.08	0.24
HEPMASS	0/7	0.04	0.16	0.57	4/7	−0.01	0.08	0.24
US Census Data 1990	1/7	0.45	1.64	4.37	1/7	0.32	1.72	3.17
Gisette	2/7	−0.46	−0.39	−0.32	3/7	−0.44	−0.4	−0.34
Music Analysis	0/7	0.4	0.83	2.68	0/7	0.33	0.85	2.24
Protein Homology	2/7	0.21	0.91	1.81	1/7	0.5	1.05	2.1
MiniBooNE Particle Identification	2/7	−0.08	0.0	0.37	2/7	−0.07	−0.0	0.15
MiniBooNE Particle Identification (normalized)	1/7	0.19	0.56	1.43	3/7	0.23	0.51	1.29
MFCCs for Speech Emotion Recognition	2/7	0.1	0.34	0.94	2/7	0.12	0.33	0.83
ISOLET	2/7	0.03	0.25	0.68	4/7	0.01	0.23	0.59
Sensorless Drive Diagnosis	2/7	−0.41	−0.21	11.82	2/7	−0.42	−0.21	8.18
Sensorless Drive Diagnosis (normalized)	1/7	0.28	1.39	4.0	1/7	0.38	1.34	3.81
Online News Popularity	2/7	0.56	1.6	7.79	1/7	0.47	1.69	7.86
Gas Sensor Array Drift	1/7	−0.04	0.91	4.05	2/7	0.06	0.79	3.99
3D Road Network	2/7	0.04	0.22	1.04	1/7	0.04	0.21	0.88
Skin Segmentation	2/7	−0.22	1.11	5.76	2/7	−0.02	1.02	4.25
KEGG Metabolic Relation Network (Directed)	1/7	−0.3	0.35	6.26	1/7	−0.29	0.25	23.7
Shuttle Control	4/8	−0.14	1.55	4.76	1/8	0.08	1.86	9.13
Shuttle Control (normalized)	2/8	0.81	2.22	4.97	5/8	0.71	1.61	4.49
EEG Eye State	1/8	0.53	0.57	0.76	2/8	0.52	0.55	6.05
EEG Eye State (normalized)	0/8	−0.06	0.01	56.9	2/8	−0.06	0.02	16.2
Pla85900	2/7	0.07	0.22	1.22	2/7	0.07	0.2	0.58
D15112	1/7	0.08	0.29	0.8	1/7	0.07	0.15	0.44
Overall Results	36/165	0.09	0.63	5.34	44/165	0.11	0.61	4.35

Table 4. Baseline convergence times

\bar{t}

(in seconds) of the HPClust parallel strategies.

Table 4. Baseline convergence times

\bar{t}

(in seconds) of the HPClust parallel strategies.

Dataset	HPClust-Inner				HPClust-Competitive
Dataset	#Succ	Min	Median	Max	#Succ	Min	Median	Max
CORD-19 Embeddings	2/7	6.92	16.1	24.82	1/7	12.78	17.45	24.12
HEPMASS	0/7	5.77	8.65	15.59	2/7	2.35	5.24	14.12
US Census Data 1990	0/7	0.24	0.63	2.07	1/7	0.19	0.53	1.63
Gisette	4/7	3.27	4.4	6.38	0/7	17.06	19.38	23.72
Music Analysis	0/7	0.58	3.22	7.19	0/7	1.44	4.02	7.89
Protein Homology	2/7	0.79	1.71	3.18	1/7	1.63	2.45	4.03
MiniBooNE Particle Identification	4/7	0.46	1.05	2.38	1/7	2.24	3.07	4.39
MiniBooNE Particle Identification (normalized)	3/7	0.09	0.4	0.85	1/7	0.28	0.49	0.91
MFCCs for Speech Emotion Recognition	1/7	0.12	0.39	0.83	1/7	0.29	0.57	0.96
ISOLET	0/7	0.38	1.01	2.93	0/7	0.85	1.88	3.84
Sensorless Drive Diagnosis	3/7	0.14	0.29	0.9	0/7	0.72	1.02	2.07
Sensorless Drive Diagnosis (normalized)	0/7	0.02	0.09	0.28	0/7	0.04	0.09	0.26
Online News Popularity	2/7	0.09	0.27	0.59	0/7	0.15	0.29	0.62
Gas Sensor Array Drift	0/7	0.11	0.47	1.68	0/7	0.29	0.69	1.64
3D Road Network	2/7	0.08	0.23	0.49	0/7	0.15	0.35	0.88
Skin Segmentation	0/7	0.03	0.07	0.18	1/7	0.02	0.05	0.12
KEGG Metabolic Relation Network (Directed)	0/7	0.1	0.3	0.82	1/7	0.26	0.46	0.97
Shuttle Control	0/8	0.1	0.32	0.87	0/8	0.09	0.29	0.74
Shuttle Control (normalized)	0/8	0.04	0.15	0.32	3/8	0.02	0.07	0.2
EEG Eye State	0/8	0.13	0.43	1.11	0/8	0.06	0.31	0.78
EEG Eye State (normalized)	0/8	0.04	0.11	0.74	0/8	0.06	0.12	0.34
Pla85900	0/7	0.07	0.64	1.42	1/7	0.05	0.35	1.14
D15112	0/7	0.06	0.42	1.11	3/7	0.06	0.21	0.77
Overall Results	23/165	0.85	1.8	3.34	17/165	1.79	2.58	4.18
Dataset	HPClust-cooperative				HPClust-hybrid
Dataset	#Succ	Min	Median	Max	#Succ	Min	Median	Max
CORD-19 Embeddings	0/7	12.06	18.2	26.11	2/7	11.49	16.03	23.92
HEPMASS	4/7	2.92	4.86	12.03	0/7	2.6	6.69	16.98
US Census Data 1990	1/7	0.14	0.46	1.47	3/7	0.14	0.45	1.41
Gisette	0/7	16.95	18.99	23.07	0/7	16.99	19.33	23.71
Music Analysis	2/7	1.58	3.33	6.99	0/7	1.35	3.88	8.19
Protein Homology	2/7	1.73	2.52	4.23	0/7	1.83	2.91	4.43
MiniBooNE Particle Identification	0/7	2.19	2.83	4.4	1/7	2.05	3.05	4.33
MiniBooNE Particle Identification (normalized)	1/7	0.29	0.51	0.88	0/7	0.25	0.53	1.0
MFCCs for Speech Emotion Recognition	1/7	0.22	0.49	0.99	1/7	0.26	0.55	1.06
ISOLET	3/7	0.87	1.42	2.88	0/7	0.76	1.96	4.07
Sensorless Drive Diagnosis	0/7	0.62	1.05	2.0	1/7	0.72	1.05	1.95
Sensorless Drive Diagnosis (normalized)	3/7	0.04	0.09	0.25	2/7	0.04	0.09	0.27
Online News Popularity	2/7	0.14	0.28	0.56	1/7	0.14	0.29	0.71
Gas Sensor Array Drift	1/7	0.27	0.63	1.62	1/7	0.27	0.73	1.74
3D Road Network	0/7	0.18	0.33	0.87	1/7	0.16	0.37	1.18
Skin Segmentation	6/7	0.02	0.04	0.18	0/7	0.02	0.04	0.16
KEGG Metabolic Relation Network (Directed)	2/7	0.24	0.42	0.98	1/7	0.25	0.44	0.97
Shuttle Control	2/8	0.09	0.21	0.59	2/8	0.08	0.21	0.67
Shuttle Control (normalized)	5/8	0.02	0.05	0.21	0/8	0.02	0.07	0.26
EEG Eye State	2/8	0.07	0.23	0.89	3/8	0.08	0.22	0.77
EEG Eye State (normalized)	2/8	0.06	0.11	0.33	0/8	0.06	0.15	0.46
Pla85900	6/7	0.06	0.24	1.16	0/7	0.06	0.35	1.11
D15112	3/7	0.05	0.24	0.83	1/7	0.04	0.24	0.89
Overall Results	48/165	1.77	2.5	4.07	20/165	1.72	2.59	4.36

Table 5. Relative clustering accuracies

ϵ

(in %) resulting from the comparison of the hybrid HPClust strategy with the competitive algorithms.

Table 5. Relative clustering accuracies

ϵ

(in %) resulting from the comparison of the hybrid HPClust strategy with the competitive algorithms.

Dataset	HPClust-Hybrid				Forgy K-Means				PBK-BDC
Dataset	#Succ	Min	Med	Max	#Succ	Min	Med	Max	#Succ	Min	Med	Max
CORD-19 Embeddings	3/7	0.02	0.08	0.24	4/7	0.01	0.17	1.37	0/7	0.67	1.74	3.28
HEPMASS	5/7	−0.01	0.08	0.24	2/7	0.02	0.18	0.63	0/7	0.63	1.45	3.21
US Census Data 1990	6/7	0.32	1.72	3.17	1/7	2.58	80.73	259.79	0/7	14.86	65.27	279.29
Gisette	0/7	−0.44	−0.4	−0.34	7/7	−0.52	−0.48	−0.39	0/7	−0.47	−0.42	−0.32
Music Analysis	1/7	0.33	0.85	2.24	6/7	−0.01	0.47	6.97	0/7	1.27	4.85	42.27
Protein Homology	4/7	0.5	1.05	2.1	3/7	14.84	14.91	15.09	0/7	4.98	20.63	48.21
MiniBooNE Particle Identification	4/7	−0.07	−0.0	0.15	3/7	2.62	19.52	110,992.52	0/7	2.61	41,006.39	110,992.75
MiniBooNE Particle Identification (normalized)	2/7	0.23	0.51	1.29	5/7	−0.02	1.39	240.25	0/7	2.34	7.75	36.83
MFCCs for Speech Emotion Recognition	4/7	0.12	0.33	0.83	3/7	0.22	1.49	2.92	0/7	1.97	10.1	40.56
ISOLET	6/7	0.01	0.23	0.59	1/7	0.05	0.8	2.78	0/7	0.22	1.03	2.59
Sensorless Drive Diagnosis	7/7	−0.42	−0.21	8.18	0/7	122.75	162.37	183.78	0/7	149.77	162.36	215.62
Sensorless Drive Diagnosis (normalized)	6/7	0.38	1.34	3.81	1/7	1.3	6.21	26.96	0/7	4.49	11.24	48.1
Online News Popularity	5/7	0.47	1.69	7.86	2/7	7.76	14.93	33.83	0/7	15.31	37.76	93.96
Gas Sensor Array Drift	5/7	0.06	0.79	3.99	2/7	10.01	24.31	39.62	0/7	9.52	25.52	39.35
3D Road Network	1/7	0.04	0.21	0.88	6/7	0.0	0.23	0.23	0/7	2.67	40.65	159.28
Skin Segmentation	5/7	−0.02	1.02	4.25	2/7	2.17	9.02	21.32	0/7	7.46	20.55	71.1
KEGG Metabolic Relation Network (Directed)	6/7	−0.29	0.25	23.7	1/7	94.27	95.67	108.63	0/7	94.26	94.92	107.54
Shuttle Control	8/8	0.08	1.86	9.13	0/8	131.85	176.25	243.9	0/8	139.77	174.3	231.7
Shuttle Control (normalized)	6/8	0.71	1.61	4.49	2/8	2.63	16.59	74.13	0/8	8.54	31.94	105.37
EEG Eye State	7/8	0.52	0.55	6.05	1/8	27.46	876,227.79	1,020,813.69	0/8	3.81	803,953.67	1,020,824.57
EEG Eye State (normalized)	8/8	−0.06	0.02	16.2	0/8	100.2	542.0	763.41	0/8	131.31	572.73	758.52
Pla85900	5/7	0.07	0.2	0.58	2/7	−0.02	0.39	1.99	0/7	2.55	10.5	39.62
D15112	4/7	0.07	0.15	0.44	3/7	0.11	1.15	5.82	0/7	0.31	1.41	6.39
Overall Results	108/165	0.11	0.61	4.35	57/165	22.62	38,147.66	49,297.36	0/165	26.04	36,793.75	49,310.86

Table 6. Total clustering times t (in seconds) resulting from the comparison of the hybrid HPClust strategy with the competitive algorithms.

Dataset	HPClust-Hybrid				Forgy K-Means				PBK-BDC
Dataset	#Succ	Min	Med	Max	#Succ	Min	Med	Max	#Succ	Min	Med	Max
CORD-19 Embeddings	4/7	14.71	27.15	36.39	0/7	419.46	704.62	1696.51	3/7	60.31	76.19	105.52
HEPMASS	4/7	6.16	19.99	28.4	0/7	343.81	508.85	865.84	3/7	33.09	35.56	39.44
US Census Data 1990	5/7	0.34	2.08	2.96	0/7	29.55	61.8	120.46	2/7	4.18	4.72	5.46
Gisette	6/7	18.01	21.12	26.19	1/7	28.7	52.93	97.55	0/7	21.53	33.18	63.21
Music Analysis	4/7	1.51	5.61	8.53	0/7	49.88	86.6	145.67	3/7	5.34	7.32	10.42
Protein Homology	4/7	1.82	3.37	5.21	0/7	13.77	19.31	31.43	3/7	5.56	7.9	11.86
MiniBooNE Particle Identification	4/7	2.37	4.33	6.37	2/7	7.64	12.36	17.04	1/7	7.83	11.92	18.68
MiniBooNE Particle Identification (normalized)	4/7	0.34	0.79	1.42	0/7	4.07	7.14	15.28	3/7	0.93	1.21	1.77
MFCCs for Speech Emotion Recognition	3/7	0.28	0.72	1.26	0/7	2.99	4.91	8.07	4/7	0.67	0.94	1.3
ISOLET	0/7	1.03	3.55	4.97	0/7	1.11	1.76	3.52	7/7	0.4	0.76	1.52
Sensorless Drive Diagnosis	3/7	0.78	1.57	2.71	3/7	1.35	2.15	4.06	1/7	1.23	2.08	4.09
Sensorless Drive Diagnosis (normalized)	2/7	0.05	0.22	0.33	0/7	0.4	0.76	1.9	5/7	0.1	0.15	0.21
Online News Popularity	3/7	0.18	0.53	0.87	0/7	0.73	1.99	3.82	4/7	0.41	0.77	1.1
Gas Sensor Array Drift	0/7	0.35	1.48	2.22	0/7	0.43	0.98	2.13	7/7	0.26	0.58	1.2
3D Road Network	4/7	0.15	0.49	1.28	0/7	7.38	9.2	10.56	3/7	1.73	2.31	3.49
Skin Segmentation	1/7	0.04	0.15	0.21	0/7	0.17	0.3	0.64	6/7	0.06	0.08	0.1
KEGG Metabolic Relation Network (Directed)	3/7	0.34	0.85	1.28	0/7	1.14	1.61	2.23	4/7	1.2	1.64	2.09
Shuttle Control	0/8	0.25	0.87	1.45	3/8	0.1	0.19	0.41	5/8	0.11	0.18	0.34
Shuttle Control (normalized)	0/8	0.04	0.26	0.39	0/8	0.04	0.09	0.19	8/8	0.02	0.02	0.03
EEG Eye State	0/8	0.21	0.98	1.43	4/8	0.07	0.13	0.22	4/8	0.08	0.14	0.23
EEG Eye State (normalized)	0/8	0.11	0.66	0.99	2/8	0.06	0.14	0.33	6/8	0.06	0.11	0.23
Pla85900	0/7	0.11	0.93	1.47	0/7	0.13	0.26	0.58	7/7	0.05	0.07	0.14
D15112	0/7	0.2	0.9	1.43	0/7	0.02	0.03	0.06	7/7	0.01	0.01	0.02
Overall Results	54/165	2.15	4.29	5.99	15/165	39.7	64.27	131.67	96/165	6.31	8.17	11.85

Table 7. Resulting relative clustering accuracies

ε

for the scaling experiment in the format

(m e d i a n v a l u e \pm s t a n d a r d d e v i a t i o n)

.

Table 7. Resulting relative clustering accuracies

ε

for the scaling experiment in the format

(m e d i a n v a l u e \pm s t a n d a r d d e v i a t i o n)

.

m	HPClust-Inner	HPClust-Competitive	HPClust-Collective	HPClust-Hybrid	Forgy K-Means	PBK-BDC
$3^{7}$	3.67 (±3.90)	−1.83 (±1.54)	1.31 (±2.13)	−1.84 (±1.30)	27.38 (±16.96)	26.61 (±9.62)
$3^{8}$	17.40 (±13.71)	−0.92 (±0.01)	−0.92 (±0.01)	−0.91 (±0.01)	38.53 (±18.78)	52.35 (±22.51)
$3^{9}$	−0.04 (±18.43)	−0.06 (±0.03)	−0.06 (±0.02)	−0.05 (±0.03)	81.21 (±62.81)	123.71 (±56.46)
$3^{10}$	0.14 (±19.13)	0.14 (±0.04)	0.14 (±0.03)	0.17 (±0.04)	83.56 (±52.48)	82.05 (±51.69)
$3^{11}$	0.19 (±0.05)	0.18 (±0.05)	0.19 (±0.06)	0.19 (±0.04)	141.95 (±118.03)	256.22 (±91.90)
$3^{12}$	10.12 (±10.06)	0.21 (±0.05)	0.20 (±0.03)	0.20 (±0.04)	54.23 (±36.96)	124.13 (±32.12)
$3^{13}$	0.23 (±24.33)	0.21 (±0.03)	0.22 (±14.60)	0.20 (±0.04)	67.99 (±69.14)	134.67 (±40.73)
$3^{14}$	0.18 (±20.97)	0.21 (±0.03)	0.20 (±31.04)	0.22 (±0.04)	165.58 (±94.48)	188.93 (±92.47)
$3^{15}$	0.19 (±8.83)	0.20 (±0.02)	0.20 (±11.76)	0.22 (±0.02)	46.06 (±42.66)	84.34 (±31.01)

Table 8. Resulting clustering times t for the scaling experiment in the format

(m e d i a n v a l u e \pm s t a n d a r d d e v i a t i o n)

.

Table 8. Resulting clustering times t for the scaling experiment in the format

(m e d i a n v a l u e \pm s t a n d a r d d e v i a t i o n)

.

m	HPClust-Inner	HPClust-Competitive	HPClust-Collective	HPClust-Hybrid	Forgy K-Means	PBK-BDC
$3^{7}$	1.03 (±0.86)	0.62 (±0.28)	0.56 (±0.46)	1.84 (±0.59)	0.00 (±0.00)	0.00 (±0.00)
$3^{8}$	1.41 (±0.66)	1.56 (±0.75)	0.68 (±0.70)	1.82 (±0.69)	0.01 (±0.00)	0.01 (±0.00)
$3^{9}$	1.46 (±0.76)	1.99 (±0.92)	1.23 (±0.70)	1.73 (±0.86)	0.03 (±0.02)	0.01 (±0.00)
$3^{10}$	1.48 (±0.80)	1.96 (±0.95)	1.53 (±0.70)	1.52 (±0.78)	0.19 (±0.19)	0.03 (±0.01)
$3^{11}$	1.08 (±1.00)	1.41 (±0.89)	1.74 (±0.78)	1.46 (±0.72)	1.39 (±0.80)	0.06 (±0.01)
$3^{12}$	1.35 (±0.83)	1.72 (±0.93)	2.27 (±0.73)	1.69 (±0.64)	6.96 (±3.34)	0.18 (±0.01)
$3^{13}$	2.46 (±0.97)	1.47 (±0.85)	2.06 (±0.76)	2.70 (±0.81)	37.07 (±24.76)	0.53 (±0.03)
$3^{14}$	2.59 (±0.80)	1.48 (±0.81)	1.73 (±0.74)	2.71 (±0.94)	222.16 (±100.82)	1.62 (±0.10)
$3^{15}$	1.66 (±0.73)	1.64 (±0.82)	2.64 (±0.98)	2.62 (±1.21)	957.92 (±443.64)	4.81 (±0.47)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mussabayev, R.; Mussabayev, R. High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data. Mathematics 2024, 12, 1930. https://doi.org/10.3390/math12131930

AMA Style

Mussabayev R, Mussabayev R. High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data. Mathematics. 2024; 12(13):1930. https://doi.org/10.3390/math12131930

Chicago/Turabian Style

Mussabayev, Ravil, and Rustam Mussabayev. 2024. "High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data" Mathematics 12, no. 13: 1930. https://doi.org/10.3390/math12131930

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data

Abstract

1. Introduction

2. Related Works

3. Proposed Algorithm

4. Parallel Strategies for the HPClust Algorithm

5. High-Performance Techniques in HPClust

5.1. Analytical Optimization

5.2. Nuances of Parallelism in HPClust

5.3. Implementation-Level Optimization

5.4. Future Optimization Directions

6. Experimental Setup

6.1. Hardware and Software

6.2. Competitive Algorithms

6.3. Datasets

6.4. Experimental Design and Evaluation Metrics

6.5. Hyperparameter Selection

6.6. Preliminary Experiments

6.7. Main Experiments

6.8. Scaling Experiment

7. Experimental Results and Discussion

7.1. Performance Evaluation

7.2. Trade-Offs Analysis

8. Guidelines for Choosing Parallel Strategy

9. Conclusions and Future Works

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI