#### *2.1. SMOTE*

SMOTE is a classic oversampling algorithm commonly used to solve data imbalance problems [45]. Compared with the random oversampling approach, SMOTE is better at preventing overfitting [40], since it adds synthetic ND to balance the distribution with PD rather than duplicating existing samples. The basic idea is to perform linear interpolation between existing ND and their nearest neighbors, synthesizing each new sample by Equation (1):


$$\mathbf{x}\_{\text{new}} = \mathbf{x}\_{i} + rand(0, 1) \times (\mathbf{x}\_{j} - \mathbf{x}\_{i}) \tag{1}$$

where *xj* (*j* = 1, 2, ... , *n*) is one of the nearest neighbors of *xi*, and *rand*(0,1) represents a random number between 0 and 1.
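The interpolation of Equation (1) can be sketched in a few lines of Python (a minimal illustration, assuming NumPy; the function name `smote_sample` and the example vectors are not from the original):

```python
import numpy as np

def smote_sample(x_i, x_j, rng=None):
    """Synthesize one new sample by linear interpolation, Equation (1):
    x_new = x_i + rand(0, 1) * (x_j - x_i)."""
    rng = np.random.default_rng() if rng is None else rng
    # rand(0, 1) draws a single random scalar in [0, 1).
    return x_i + rng.random() * (x_j - x_i)

# Example: interpolate between a minority-class sample and one neighbor.
x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])
x_new = smote_sample(x_i, x_j, np.random.default_rng(0))
```

The new point always lies on the line segment between the original sample and its chosen neighbor, which is what keeps SMOTE's synthetic data inside the minority-class region.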

New data synthesized by SMOTE is shown in Figure 2.

**Figure 2.** New data synthesized by SMOTE.

In Figure 2, *x* is the core data point currently used to construct the new data; *x*1, *x*2, *x*3, *x*4 are the four nearest neighbors of *x*; and *r*1, *r*2, *r*3, *r*4 are the synthetic new data.

#### *2.2. K-Means Clustering Algorithm*

K-means clustering is a widely used algorithm that takes the distance between data points and cluster centers as the optimization objective [46]. The algorithm maximizes the similarity of elements within each cluster while minimizing the similarity between clusters. K-means selects the desired number of clusters, *K*, minimizes the within-cluster variance through continuous iteration and recalculation of the cluster centers, and takes relatively compact, mutually independent clusters as the ultimate goal.

The basic idea of K-means is to determine the number of initial cluster centers, *K*, and randomly select *K* data points from the given dataset *D* as the initial cluster centers. Then, for each remaining data point in *D*, calculate the Euclidean distance to each cluster center, assign it to the cluster of the nearest center, and recalculate the cluster centers. The clustering process converges when the cluster centers no longer change or the number of iterations reaches the preset threshold.

Specific steps are as follows:

(1) Randomly select *K* data points from the dataset *D* as the initial cluster centers.

(2) For each remaining data point, calculate its Euclidean distance to each cluster center by Equation (2) and assign it to the cluster of the nearest center:


$$d(i,j) = \sqrt{(\mathbf{x}\_{i1} - \mathbf{x}\_{j1})^2 + (\mathbf{x}\_{i2} - \mathbf{x}\_{j2})^2 + \dots + (\mathbf{x}\_{in} - \mathbf{x}\_{jn})^2} \tag{2}$$

where *i* = {*xi*1, *xi*2, *xi*3, ... , *xin*} and *j* = {*xj*1, *xj*2, *xj*3, ... , *xjn*} are two *n*-dimensional data points.
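Equation (2) is the familiar straight-line distance in *n* dimensions; a direct Python transcription (the function name is illustrative):

```python
import math

def euclidean_distance(i, j):
    """Equation (2): sqrt of the summed squared per-dimension differences
    between two n-dimensional points i and j."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

# Example: the classic 3-4-5 right triangle.
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])  # → 5.0
```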

(3) After all data points have been assigned, the new cluster center of each class is recalculated by Equation (3):

$$\mu\_{j} = \frac{1}{N\_j} \sum\_{\mathbf{x}\_i \in \mathcal{S}\_j} \mathbf{x}\_i \tag{3}$$

where *Sj* is the set of data points assigned to class *j* and *Nj* represents the number of data points in class *j*.
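The steps above can be combined into a short sketch of the full K-means loop (a minimal illustration in Python with NumPy; `kmeans` and the example data are assumed names, and no handling of empty clusters is included):

```python
import numpy as np

def kmeans(D, K, max_iter=100, seed=0):
    """Minimal K-means sketch following the steps above:
    (1) randomly pick K data points as initial centers;
    (2) assign each point to the nearest center (Equation (2));
    (3) recompute each center as the class mean (Equation (3));
    stop when centers no longer change or the iteration limit is hit."""
    rng = np.random.default_rng(seed)
    centers = D[rng.choice(len(D), K, replace=False)]
    for _ in range(max_iter):
        # Step (2): Euclidean distance from every point to every center.
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): new center = mean of the points assigned to each class.
        new_centers = np.array([D[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            break  # centers no longer change: converged
        centers = new_centers
    return centers, labels

# Example: two well-separated groups of points.
D = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers, labels = kmeans(D, K=2)
```

On this toy data the loop converges in a couple of iterations, with each tight group forming its own cluster.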

