## 2.3.3. Davies-Bouldin Index

The DBI can also be used to evaluate the model; a lower DBI indicates better separation between the clusters [59]. The index is defined as the average, taken over all clusters, of the similarity (*Rij*) between each cluster *i* and the cluster most similar to it. The DBI is calculated as in Equation (8):

$$\text{DBI} = \frac{1}{k} \sum\_{i=1}^{k} \max\_{j \neq i} (R\_{ij}) \tag{8}$$

where DBI is the Davies–Bouldin index and *k* is the number of clusters. The lowest possible score is zero, and values closer to zero indicate a better partition. *Rij* is the similarity measure given by Equation (9):

$$R\_{ij} = \frac{s\_i + s\_j}{d\_{ij}}\tag{9}$$

where *si* is the average distance between each point of cluster *i* and the centroid of that cluster, which serves as a measure of the cluster diameter; *dij* is the distance between the centroids of clusters *i* and *j*; *Rij* thus expresses the trade-off between intra-cluster distance and inter-cluster distance. The computation of DBI is simpler than that of SC, since the index is computed only from quantities and features inherent to the dataset [60]. However, a good DBI value does not necessarily imply the best information retrieval [55].
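As an illustration of Equations (8) and (9), the following is a minimal sketch in standard-library Python. It assumes Euclidean distance and clusters supplied as plain lists of coordinate tuples; the function names and the toy data are hypothetical and are not taken from the study.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes Euclidean distance and at least two clusters with distinct
    centroids (coincident centroids would give d_ij = 0).
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("DBI needs at least two clusters")
    centroids = [centroid(c) for c in clusters]
    # s_i: mean distance from each point of cluster i to its centroid
    scatter = [
        sum(math.dist(p, t) for p in c) / len(c)
        for c, t in zip(clusters, centroids)
    ]
    total = 0.0
    for i in range(k):
        # R_ij = (s_i + s_j) / d_ij, Equation (9); keep the largest over j != i
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k  # Equation (8)


# Hypothetical toy data: two clusters of identical shape, placed far apart or close together
far_apart = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
close_together = [[(0.0, 0.0), (0.0, 2.0)], [(3.0, 0.0), (3.0, 2.0)]]

dbi_far = davies_bouldin(far_apart)
dbi_near = davies_bouldin(close_together)
print(dbi_far, dbi_near)
```

In this toy case each cluster has *si* = 1, so the well-separated partition gives 2/10 = 0.2 and the crowded one gives 2/3, consistent with lower values indicating better separation.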

## 2.3.4. Intra-Cluster Distance

Intra-cluster distance (ICD) measures the distance between samples belonging to the same cluster. Three types of intra-cluster distance are popular in prior studies: complete diameter distance, average diameter distance, and centroid diameter distance. As the number of clusters increases, individual clusters become more homogeneous and the ICD decreases. Beyond a certain point, the decrease becomes negligible. Plotting the ICD against *k* usually reveals an inflection point, or elbow point, where this occurs, which can be used to identify the optimal value of *k* [61]. The number of clusters is chosen at this point, hence the "elbow criterion." Here we use the centroid diameter distance to represent ICD, defined as double the average distance between the objects in a cluster and the cluster centroid:

$$\Delta(S) \,= 2 \left\{ \frac{\sum\_{\mathbf{x} \in S} d(\mathbf{x}, T)}{|S|} \right\} \tag{10}$$

$$T = \frac{1}{|S|} \sum\_{x \in S} x$$

where Δ(*S*) is the centroid diameter distance of cluster *S*; *x* is a sample belonging to cluster *S*; *T* is the centroid of cluster *S*; *d*(*x*, *T*) is the distance between *x* and *T*; and |*S*| is the number of objects in cluster *S*.
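The centroid diameter distance of Equation (10) and its behaviour as *k* grows can be sketched as follows. This is an illustrative example only: it assumes Euclidean distance, averages Δ(*S*) over the clusters of a partition (an aggregation choice the text does not specify), and uses hand-made partitions of four hypothetical points rather than the output of a clustering algorithm.

```python
import math


def centroid_diameter(cluster):
    """Equation (10): twice the mean distance from each sample x to the centroid T."""
    n = len(cluster)
    t = tuple(sum(coords) / n for coords in zip(*cluster))  # centroid T
    return 2.0 * sum(math.dist(x, t) for x in cluster) / n


def mean_icd(clusters):
    """Average centroid diameter over all clusters (assumed aggregation)."""
    return sum(centroid_diameter(c) for c in clusters) / len(clusters)


# Hypothetical data: two tight pairs of points, partitioned by hand for k = 1, 2, 4
a, b, c, d = (0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)
partitions = {
    1: [[a, b, c, d]],
    2: [[a, b], [c, d]],
    4: [[a], [b], [c], [d]],
}

icd_by_k = {k: mean_icd(clusters) for k, clusters in partitions.items()}
for k in sorted(icd_by_k):
    print(k, round(icd_by_k[k], 3))
```

Here the ICD drops sharply from *k* = 1 to *k* = 2 and only slightly afterwards, so *k* = 2 would be read off as the elbow for this toy dataset.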
