2.1.1. K-Means Clustering

K-means clustering (KC) is a centroid-based unsupervised clustering algorithm, originally designed for signal processing. It is the most widely applied method of cluster analysis in data mining [33]. K-means aims to partition the inputs into *k* clusters. Given a set of observations (*x*1, *x*2, ..., *xn*) for *p* variables, the algorithm runs as follows:

1. Initialize *k* cluster centroids, typically by selecting *k* observations at random.
2. Assign each observation to the cluster whose centroid is nearest, by squared Euclidean distance.
3. Recompute each centroid as the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change or a convergence criterion is met.

The goal then is to minimize the within-cluster sum of squares:

$$\operatorname{argmin}\_{\boldsymbol{\mu}, \mathbb{C}} \sum\_{\ell=1}^{k} \sum\_{\mathbf{x}\_{i} \in \mathbb{C}\_{\ell}} \|\mathbf{x}\_{i} - \boldsymbol{\mu}\_{\ell}\|^2 \tag{1}$$

where *k* is the number of cluster centers and {*μℓ*}, *ℓ* = 1, ..., *k*, are the cluster centroids. The total intra-cluster distance is the total squared Euclidean distance from each point to the center of its cluster, and it is a measure of the variance, or internal coherence, of the clusters [47]. It can also be used to assess the stability of the solution: when the decrease in this quantity between iterations falls below a predefined threshold, the algorithm stops. The algorithm is often run multiple times with different random initializations of the cluster centroids to avoid convergence to sub-optimal local minima. The clustering solution with the lowest sum-of-squares is chosen as the final output.
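The procedure can be sketched in pure Python (a minimal illustration only; the function names, the default parameters such as `n_init=10`, and the empty-cluster handling are our own choices, not part of any particular library):

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_init=10, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm with multiple random restarts; the run with the
    lowest within-cluster sum of squares (WCSS) is returned."""
    rng = random.Random(seed)
    best = (math.inf, None, None)  # (wcss, centroids, labels)
    for _ in range(n_init):
        # Step 1: initialize centroids by sampling k distinct observations.
        centroids = rng.sample(points, k)
        for _ in range(max_iter):
            # Step 2: assign each point to its nearest centroid.
            labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
            # Step 3: move each centroid to the mean of its assigned points.
            new_centroids = []
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:
                    new_centroids.append(tuple(sum(coord) / len(members)
                                               for coord in zip(*members)))
                else:
                    new_centroids.append(centroids[c])  # keep an empty cluster's centroid
            # Step 4: stop once the centroids barely move.
            shift = max(dist2(a, b) for a, b in zip(centroids, new_centroids))
            centroids = new_centroids
            if shift < tol:
                break
        # Keep the restart with the lowest objective (Eq. 1).
        wcss = sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))
        if wcss < best[0]:
            best = (wcss, centroids, labels)
    return best
```

Retaining only the restart with the lowest WCSS mirrors the multiple-initialization strategy described above.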

However, the choice of *k* is challenging when model performance metrics are not available. Often, an initial value of *k* is chosen and the algorithm is then repeated for higher and lower values. To make the search for the best *k* more efficient, many prior studies recommend a score-based performance assessment using indices such as SCI, CHI, and DBI [42].
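As one example of such a score, the following sketch computes the Davies–Bouldin index (here assumed to be the DBI referred to above; lower values indicate more compact, better-separated clusters), given a clustering's labels and centroids:

```python
import math

def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels, centroids):
    """Davies–Bouldin index: the average, over clusters, of the worst-case
    ratio of summed within-cluster scatter to between-centroid separation."""
    k = len(centroids)
    # S_i: mean distance of cluster i's members to its own centroid.
    scatter = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        scatter.append(sum(euclid(p, centroids[i]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        # For each cluster, take the most "similar" (worst) competing cluster.
        total += max((scatter[i] + scatter[j]) / euclid(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

Running k-means for a range of candidate *k* values and picking the one with the lowest index is one common way to operationalize the score-based search described above.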
