2.3.2. Calinski-Harabasz Index

The Calinski-Harabasz index (CHI) is the ratio of the between-cluster dispersion to the within-cluster dispersion [58], scaled by a factor that accounts for the number of clusters (*k*). A higher CHI score indicates better-defined clusters (i.e., dense and well separated). For a set of *k* clusters, the CHI is calculated as:

$$\text{CHI} = \frac{\operatorname{Tr}(B_k)}{\operatorname{Tr}(\mathcal{W}_k)} \times \frac{N-k}{k-1} \tag{5}$$

where *N* is the number of points in the data; *k* is the number of clusters; *Tr*(·) denotes the trace of a matrix; *Bk* is the between-cluster dispersion matrix; and *Wk* is the within-cluster dispersion matrix. *Bk* and *Wk* are defined by the following equations:

$$\mathcal{W}_k = \sum_{q=1}^{k} \sum_{\mathbf{x} \in \mathcal{C}_q} \left( \mathbf{x} - c_q \right) \left( \mathbf{x} - c_q \right)^T \tag{6}$$

$$B_k = \sum_{q=1}^{k} n_q (c_q - c)(c_q - c)^T \tag{7}$$

where *Cq* is the set of points in cluster *q*, *cq* is the center of cluster *q*, *c* is the center of the whole data set, which has been clustered into *k* clusters, and *nq* is the number of points in cluster *q*.
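
As an illustration, Equations (5)-(7) can be sketched in a few lines of plain Python. This is a minimal sketch and not the implementation used in this work. It relies on the identity Tr((x − c)(x − c)^T) = ‖x − c‖², so the two traces reduce to sums of squared Euclidean distances and the dispersion matrices never need to be built explicitly. The function names (`centroid`, `sq_dist`, `calinski_harabasz`) and the toy data are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance, i.e. Tr((a - b)(a - b)^T)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def calinski_harabasz(points, labels):
    """CHI as in Eq. (5), with Tr(W_k) and Tr(B_k) from Eqs. (6) and (7)."""
    # Group the points by cluster label (the sets C_q).
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    n, k = len(points), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("CHI requires 2 <= k < N")

    c = centroid(points)  # center of the whole data set
    tr_w = 0.0            # Tr(W_k): within-cluster dispersion
    tr_b = 0.0            # Tr(B_k): between-cluster dispersion
    for members in clusters.values():
        c_q = centroid(members)
        tr_w += sum(sq_dist(x, c_q) for x in members)
        tr_b += len(members) * sq_dist(c_q, c)

    if tr_w == 0.0:
        raise ValueError("within-cluster dispersion is zero; CHI is undefined")
    return (tr_b / tr_w) * (n - k) / (k - 1)


# Toy data (assumed for illustration): two compact, well-separated squares.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(calinski_harabasz(points, labels))  # 600.0
```

For this toy data, Tr(Wk) = 4 and Tr(Bk) = 400, so CHI = (400 / 4) × (8 − 2) / (2 − 1) = 600. Assigning the same points to clusters that mix the two squares yields a much lower score, consistent with the interpretation that higher CHI values indicate denser, better-separated clusters.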
