*3.2. Details of the SKM Algorithm*

Given the image of point cloud *C* under the k-nearest neighbor and local statistical mapping *kC* := Ψ*k*[*C*], which is the parameter point cloud in N*n*, it is natural to cluster the parameter points in order to recover the potential classifications within the original data. The core idea of the SKM algorithm is the application of the K-means algorithm together with non-Euclidean difference functions. The performance of the SKM algorithm depends on the choice of difference function, which makes the algorithm flexible for various tasks.

The specific steps of the SKM algorithm are given in Algorithm 1:

#### **Algorithm 1** Statistical K-Means Cluster Algorithm

**Input:** point cloud *C*, k-nearest neighbor indicator *k*, initial cluster centers *c*<sup>0</sup><sub>1</sub>, ··· , *c*<sup>0</sup><sub>*k*</sub>, threshold *ε*

**Output:** a partition of point cloud *C* into K clusters
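The iteration behind Algorithm 1 can be sketched as standard K-means with a pluggable difference function. The sketch below is an illustrative reconstruction, not the paper's exact implementation; the function names are our own, and Euclidean distance is shown only as the simplest admissible difference function (any symmetric non-Euclidean divergence could be substituted).

```python
import numpy as np

def skm(points, k, diff, init_centers, eps=1e-6, max_iter=100):
    """Statistical K-means sketch: K-means iteration with a pluggable,
    possibly non-Euclidean difference function diff(p, q) -> float."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment step: each parameter point joins its nearest center
        # under the chosen difference function.
        labels = np.array([
            np.argmin([diff(p, c) for c in centers]) for p in points
        ])
        # Update step: recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once every center moves less than the threshold eps.
        if np.max(np.linalg.norm(new_centers - centers, axis=1)) < eps:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers

def euclidean(p, q):
    # Baseline difference function; the flexibility of SKM comes from
    # replacing this with a task-specific divergence.
    return float(np.linalg.norm(p - q))
```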


#### **4. Data Pre-Processing and Preparations**

After the introduction of the SKM algorithm, we prepare the data on which our method will be tested. This section mainly describes the data pre-processing work and the criteria used to assess the clustering results.

#### *4.1. Data Pre-Processing*

Here, the original data of the experiment are selected from the top 20 universities in mainland China in terms of scientific research funding in 2021. A total of 32 types of indicators from 2010 to 2019 are taken into account. The data sources are the WOS and CSSCI databases alongside the analysis platform of CNKI [22–24]. The names of the universities and the statistical indicators are listed in Tables 1 and 2.


**Table 1.** The names of the twenty universities and their abbreviations.


**Table 2.** Selection of thirty-two statistical indicators.

Denoting by *x<sub>i</sub>* the *i*-th indicator, the numerical expression of the academic performance of a university *s* in year *y* is

$$X\_{s,y} = (x\_1, x\_2, \cdots, x\_k)^T.$$

It is natural to assemble a matrix *X*(*s*, *y*) whose elements are the academic performance vectors *X<sub>s,y</sub>*: rows correspond to different universities, and columns correspond to different years. Since the indicators are measured in different dimensions, we apply z-score normalization to the indicators of every column; that is, we normalize the same indicator across different universities within the same year:

$$x\_{nor} = \frac{x - \mathrm{mean}(X)}{\mathrm{std}(X)}, \quad x \in X.$$

The normalization makes indicators among different years comparable, which forms the basis of clustering.
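The column-wise normalization step can be written compactly; the sketch below is illustrative (the function name and the zero-variance guard are our assumptions), with each column holding one indicator measured across the universities in a fixed year.

```python
import numpy as np

def zscore_columns(X):
    """Column-wise z-score normalization: each column is one indicator
    measured across universities in the same year."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard: a constant indicator maps to all zeros
    return (X - mean) / std
```

After this step every column has zero mean and unit standard deviation, so indicators with different units contribute comparably to the distance computations that follow.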

#### *4.2. Clustering Assessment Criteria*

The commonly used clustering assessment criteria can generally be divided into two classes: external assessment and internal assessment. External assessment needs a reference model as the benchmark, while internal assessment measures the clustering results from the perspective of compactness, connectivity and so on. Since there is no state-of-the-art reference model or ranking in this field, it is reasonable to choose proper internal assessment criteria. In this paper, we use the Davies–Bouldin Index (DBI), Dunn Index (DI) and Silhouette Score (SC) as the clustering assessment criteria, which have proved effective for such problems [25,26].

Let *C* = {*C*<sub>1</sub>, *C*<sub>2</sub>, ··· , *C<sub>k</sub>*} be the clustering result, where |*C*| denotes the number of samples in a cluster *C*, dist(*x<sub>i</sub>*, *x<sub>j</sub>*) denotes the distance between samples *x<sub>i</sub>* and *x<sub>j</sub>*, and *μ<sub>i</sub>* denotes the center of cluster *C<sub>i</sub>*. Define

$$\text{avg}(C) = \frac{2}{|C|(|C| - 1)} \sum\_{1 \le i < j \le |C|} \text{dist}(x\_i, x\_j),$$

$$\text{diam}(C) = \max\_{1 \le i < j \le |C|} \{ \text{dist}(x\_i, x\_j) \},$$

$$d\_{\min}(C\_i, C\_j) = \min\_{x\_i \in C\_i,\, x\_j \in C\_j} \{ \text{dist}(x\_i, x\_j) \},$$

$$d\_{\text{cen}}(C\_i, C\_j) = \text{dist}(\mu\_i, \mu\_j).$$

Then, *DBI*, *DI* and *SC* are defined as

$$DBI = \frac{1}{k} \sum\_{i=1}^{k} \max\_{j \neq i} \left( \frac{\text{avg}(C\_i) + \text{avg}(C\_j)}{d\_{\text{cen}}(\mu\_i, \mu\_j)} \right),$$

$$DI = \frac{\min\_{1 \le i < j \le k} \{ d\_{\min}(C\_i, C\_j) \}}{\max\_{1 \le l \le k} \{ \text{diam}(C\_l) \}},$$

$$s(x\_i) = \frac{b - a}{\max(a, b)}, \quad a = \frac{1}{|C\_q| - 1} \sum\_{x\_j \in C\_q} \text{dist}(x\_i, x\_j),$$

$$b = \frac{1}{\sum\_l |C\_l| - |C\_q|} \sum\_{x\_j \notin C\_q} \text{dist}(x\_i, x\_j), \quad SC = \frac{\sum\_i s(x\_i)}{\sum\_l |C\_l|},$$

where *x<sub>i</sub>* belongs to cluster *C<sub>q</sub>*.

The three indicators evaluate the clustering results from different perspectives. *DBI* measures the maximum similarity between clusters, so the smaller *DBI* is, the better the clustering result. *DI* is the ratio of the minimum between-cluster distance to the largest intra-cluster diameter, and a good clustering result should make this value as large as possible. The *SC* value of each sample represents how well the sample matches its own cluster; therefore, the higher the *SC* value in general, the better the clustering result.
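Under the definitions above, the three criteria can be computed directly. The plain-NumPy sketch below is illustrative and assumes Euclidean distance; the paper's non-Euclidean difference function could be substituted for `dist`.

```python
import numpy as np
from itertools import combinations

def dist(p, q):
    # Euclidean distance, assumed here for concreteness.
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))

def avg(c):
    # Mean pairwise distance within a cluster (avg(C) above).
    if len(c) < 2:
        return 0.0
    return 2.0 / (len(c) * (len(c) - 1)) * sum(
        dist(c[i], c[j]) for i, j in combinations(range(len(c)), 2))

def dbi(clusters):
    # Davies-Bouldin: mean over clusters of the worst (most similar) pair.
    mus = [np.asarray(c).mean(axis=0) for c in clusters]
    k = len(clusters)
    return sum(
        max((avg(clusters[i]) + avg(clusters[j])) / dist(mus[i], mus[j])
            for j in range(k) if j != i)
        for i in range(k)) / k

def dunn(clusters):
    # Dunn index: smallest between-cluster gap over largest diameter.
    d_min = min(dist(p, q)
                for a, b in combinations(clusters, 2) for p in a for q in b)
    diam = max(dist(c[i], c[j])
               for c in clusters for i, j in combinations(range(len(c)), 2))
    return d_min / diam

def silhouette(clusters):
    # Mean silhouette value over all samples, per the definitions above.
    scores = []
    for q, c in enumerate(clusters):
        others = [x for r, other in enumerate(clusters) if r != q for x in other]
        for x in c:
            a = sum(dist(x, y) for y in c) / (len(c) - 1)
            b = sum(dist(x, y) for y in others) / len(others)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

For two tight, well-separated clusters these behave as the text describes: *DBI* is small, *DI* is large, and the mean silhouette is close to 1.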

#### **5. Data Cloud Simulation**

In this section, we apply the traditional K-means, GMM and the SKM algorithm to the processed data. By analyzing the clustering results and calculating the assessment criteria scores, we can compare the performance of the different algorithms and estimate the academic levels of the 20 universities. Since all these clustering algorithms involve random processes, the estimate of university academic level is taken from the most reasonable clustering result.
