*3.1. Formula of MDSVC*

3.1.1. Preliminary

Let $\phi(\mathbf{x})$ be the mapping function induced by a kernel $k$, i.e., $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. We use the Gaussian kernel, so in the feature space $k(\mathbf{x}, \mathbf{x}) = 1$. The squared distance between $\phi(\mathbf{x})$ and the center $\mathbf{a}$ of the sphere is $\|\phi(\mathbf{x}) - \mathbf{a}\|^2$, where $\|\cdot\|$ is the Euclidean norm. We denote by $\mathbf{X}$ the matrix whose $i$-th column is $\phi(\mathbf{x}_i)$.

In the rest of this subsection, we first define the mean and variance statistics used in clustering; we then present Theorems 1 and 2 to facilitate the formulation of the variance; finally, we employ the mean and variance (Equations (1) and (2)) to obtain the final formulation as a convex quadratic programming problem.

**Definition 1.** *The margin mean is defined as follows.*

$$\bar{\gamma} = \frac{1}{m} \sum_{i=1}^{m} \left\| \phi(\mathbf{x}_i) - \mathbf{a} \right\|^2 = 1 - \frac{2}{m} \mathbf{a}^T \mathbf{X} \mathbf{e} + \|\mathbf{a}\|^2 \tag{1}$$

where $\mathbf{e}$ stands for the all-one column vector of $m$ dimensions. Because we use the Gaussian kernel, $k(\mathbf{x}, \mathbf{x}) = 1$, which simplifies the calculation. We choose this form of the mean because we want the center of MDSVC's sphere to lie close to the denser part of the samples. Next, we define the margin variance.
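To make the quantities concrete, the following minimal sketch (not part of the original derivation) evaluates the margin mean of Equation (1) using only kernel values, assuming the center is parameterized as $\mathbf{a} = \mathbf{X}\boldsymbol{\alpha}$, a representation justified by Theorem 1 below; the toy data, bandwidth `gamma`, and coefficient vector `alpha` are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    """Kernel matrix Q with Q[i, j] = exp(-gamma * ||x_i - x_j||^2); Q[i, i] = 1."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def margin_mean(Q, alpha):
    """Margin mean of Eq. (1): 1 - (2/m) alpha^T Q e + alpha^T Q alpha,
    assuming the sphere center is a = X alpha (see Theorem 1 and Eq. (7))."""
    m = Q.shape[0]
    e = np.ones(m)
    return 1.0 - (2.0 / m) * alpha @ Q @ e + alpha @ Q @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))           # 20 toy samples in R^2 (illustrative)
Q = gaussian_kernel_matrix(X, gamma=0.5)
alpha = np.full(20, 1.0 / 20)          # e.g., center at the kernel mean of the data
print(margin_mean(Q, alpha))
```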

**Definition 2.** *The margin variance is defined as follows.*

$$\begin{aligned} \hat{\gamma} &= \frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( \left\| \phi(\mathbf{x}_{i}) - \mathbf{a} \right\|^{2} - \left\| \phi(\mathbf{x}_{j}) - \mathbf{a} \right\|^{2} \right)^{2} \\ &= \frac{4}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( \mathbf{a}^{T} \phi(\mathbf{x}_{i}) - \mathbf{a}^{T} \phi(\mathbf{x}_{j}) \right)^{2} \\ &= \frac{4}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( \mathbf{a}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{i})^{T} \mathbf{a} - 2\mathbf{a}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{j})^{T} \mathbf{a} + \mathbf{a}^{T} \phi(\mathbf{x}_{j}) \phi(\mathbf{x}_{j})^{T} \mathbf{a} \right) \\ &= \frac{8}{m} \sum_{i=1}^{m} \mathbf{a}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{i})^{T} \mathbf{a} - \frac{8}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathbf{a}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{j})^{T} \mathbf{a} \end{aligned} \tag{2}$$

The variance accounts for the distribution of the overall data rather than only the distribution of the support vectors. Note that if we characterized only the mean in our method, the sphere would incline toward dense clusters, and more support vectors might appear in the high-density regions, resulting in imbalance. The mean is therefore only the first step in adjusting the sphere of MDSVC; we next introduce the variance to adjust the boundary with less volatility, since the variance quantifies the scatter of the clustering. Additionally, we denote the kernel matrix $\mathbf{Q} = \mathbf{X}^T\mathbf{X}$, where $Q_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$. Note that the outer product $\phi(\mathbf{x}_i)\phi(\mathbf{x}_j)^T$, unlike the inner product $\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$, cannot be evaluated directly through the kernel, so we have to address this issue in an alternative way; we therefore use the following Theorem 1. Moreover, the formula of the variance can be further simplified, so we employ Theorem 2 to elucidate and facilitate its form. Finally, we obtain the simplified form of the margin variance in Equation (8).
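For concreteness, the following worked identity (a restatement of the steps carried out in the proofs below, not an additional result) shows how the intractable outer products disappear once the center admits the representation $\mathbf{a} = \mathbf{X}\boldsymbol{\alpha}$ of Theorem 1: every term of the form $\mathbf{a}^T\phi(\mathbf{x}_i)\phi(\mathbf{x}_j)^T\mathbf{a}$ reduces to a product of kernel quantities,

$$\mathbf{a}^{T}\phi(\mathbf{x}_{i})\,\phi(\mathbf{x}_{j})^{T}\mathbf{a} = \boldsymbol{\alpha}^{T}\mathbf{X}^{T}\phi(\mathbf{x}_{i})\,\phi(\mathbf{x}_{j})^{T}\mathbf{X}\boldsymbol{\alpha} = \left(\boldsymbol{\alpha}^{T}\mathbf{Q}_{i}\right)\left(\mathbf{Q}_{j}^{T}\boldsymbol{\alpha}\right)$$

where $\mathbf{Q}_i = \mathbf{X}^T\phi(\mathbf{x}_i)$ is the $i$-th column of $\mathbf{Q}$.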

**Theorem 1.** *The center of the hypersphere **a** can be represented as follows:*

$$\mathbf{a} = \sum_{i=1}^{m} \alpha_i \phi(\mathbf{x}_i) = \mathbf{X}\boldsymbol{\alpha} \tag{3}$$

**Proof of Theorem 1.** Suppose that $\mathbf{a}$ can be decomposed into a component in the span of the $\phi(\mathbf{x}_i)$ plus an orthogonal vector $\mathbf{v}$, that is,

$$\mathbf{a} = \sum_{i=1}^{m} \alpha_i \phi(\mathbf{x}_i) + \mathbf{v} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{v}, \qquad \boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_m]^T \tag{4}$$

where $\mathbf{v}$ satisfies $\phi(\mathbf{x}_i)^T\mathbf{v} = 0$ for all $i$, i.e., $\mathbf{X}^T\mathbf{v} = \mathbf{0}$. Then we have the following formula

$$\|\mathbf{a}\|^2 = \boldsymbol{\alpha}^T \mathbf{X}^T \mathbf{X} \boldsymbol{\alpha} + \mathbf{v}^T \mathbf{v} \ge \boldsymbol{\alpha}^T \mathbf{X}^T \mathbf{X} \boldsymbol{\alpha} \tag{5}$$

Therefore, when minimizing over $\mathbf{a}$, taking $\mathbf{v} = \mathbf{0}$ does not increase the objective. The formula of the mean is then derived as follows

$$\begin{aligned} \bar{\gamma} &= \frac{1}{m} \sum_{i=1}^{m} \left\| \phi(\mathbf{x}_{i}) - \mathbf{a} \right\|^{2} = 1 - \frac{2}{m} \boldsymbol{\alpha}^{T} \mathbf{X}^{T} \mathbf{X} \mathbf{e} + \boldsymbol{\alpha}^{T} \mathbf{X}^{T} \mathbf{X} \boldsymbol{\alpha} + \mathbf{v}^{T} \mathbf{v} \\ &\geq 1 - \frac{2}{m} \boldsymbol{\alpha}^{T} \mathbf{X}^{T} \mathbf{X} \mathbf{e} + \boldsymbol{\alpha}^{T} \mathbf{X}^{T} \mathbf{X} \boldsymbol{\alpha} \end{aligned}$$

From the aforementioned formula, with respect to $\mathbf{v}$ the mean behaves like the squared modulus of $\mathbf{a}$ in the optimization, that is, $\bar{\gamma} \Leftrightarrow \mathbf{a}^T\mathbf{a}$. For the variance, we have the following form

$$\begin{aligned} \hat{\gamma} &= \frac{1}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \left( \left\| \phi(\mathbf{x}_{i}) - \mathbf{a} \right\|^{2} - \left\| \phi(\mathbf{x}_{j}) - \mathbf{a} \right\|^{2} \right)^{2} \\ &= \frac{8}{m} \sum_{i=1}^{m} \mathbf{a}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{i})^{T} \mathbf{a} - \frac{8}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathbf{a}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{j})^{T} \mathbf{a} \\ &= \frac{8}{m} \sum_{i=1}^{m} \boldsymbol{\alpha}^{T} \mathbf{X}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{i})^{T} \mathbf{X} \boldsymbol{\alpha} - \frac{8}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \boldsymbol{\alpha}^{T} \mathbf{X}^{T} \phi(\mathbf{x}_{i}) \phi(\mathbf{x}_{j})^{T} \mathbf{X} \boldsymbol{\alpha} \end{aligned} \tag{6}$$

Thus, the variance is independent of $\mathbf{v}$. The remaining optimization objectives are also independent of $\mathbf{v}$. Based on all of the aforementioned equations, $\mathbf{a}$ can be represented in the form of Equation (3). □

**Theorem 2.** *$\mathbf{Q}_i\mathbf{Q}_i^T$, $\sum_{i=1}^{m}\sum_{j=1}^{m}\mathbf{Q}_i\mathbf{Q}_j^T$, $\mathbf{H}$, $\mathbf{P}$, and $\mathbf{QG}$ are symmetric matrices, where*

$$\mathbf{Q}_{i} = \begin{bmatrix} k(\mathbf{x}_{1}, \mathbf{x}_{i}) \\ \vdots \\ k(\mathbf{x}_{m}, \mathbf{x}_{i}) \end{bmatrix}, \quad \mathbf{H} = \frac{8\lambda_{2}}{m} \sum_{i=1}^{m} \mathbf{Q}_{i} \mathbf{Q}_{i}^{T}$$

$$\mathbf{P} = \frac{8\lambda_{2}}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathbf{Q}_{i} \mathbf{Q}_{j}^{T}, \quad \mathbf{G} = \left( (\lambda_{1} + 1)\mathbf{Q} + \mathbf{H} + \mathbf{P} \right)^{-1} \mathbf{Q}$$

and $\left((\lambda_1+1)\mathbf{Q}+\mathbf{H}+\mathbf{P}\right)^{-1}$ denotes the inverse of $(\lambda_1+1)\mathbf{Q}+\mathbf{H}+\mathbf{P}$.

**Proof of Theorem 2.** $\mathbf{Q}_i \in \mathbb{R}^{m \times 1}$ is the $i$-th column of the kernel matrix $\mathbf{Q}$, with the following form

$$\mathbf{Q}\_{i(m \times 1)} = \begin{bmatrix} k(\mathbf{x}\_1, \mathbf{x}\_i) \\ \vdots \\ k(\mathbf{x}\_m, \mathbf{x}\_i) \end{bmatrix}$$

$$\mathbf{Q}_i \mathbf{Q}_i^T = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_i) \\ \vdots \\ k(\mathbf{x}_m, \mathbf{x}_i) \end{bmatrix} \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_i) & \cdots & k(\mathbf{x}_m, \mathbf{x}_i) \end{bmatrix} = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_i)^2 & \cdots & k(\mathbf{x}_1, \mathbf{x}_i)k(\mathbf{x}_m, \mathbf{x}_i) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_1, \mathbf{x}_i)k(\mathbf{x}_m, \mathbf{x}_i) & \cdots & k(\mathbf{x}_m, \mathbf{x}_i)^2 \end{bmatrix}$$

Note that $\mathbf{Q}_i\mathbf{Q}_i^T$ is a symmetric matrix from the above form. Moreover, $\sum_{i=1}^{m}\sum_{j=1}^{m}\mathbf{Q}_i\mathbf{Q}_j^T$ is symmetric because it equals $\left(\sum_{i=1}^{m}\mathbf{Q}_i\right)\left(\sum_{j=1}^{m}\mathbf{Q}_j\right)^T = (\mathbf{Q}\mathbf{e})(\mathbf{Q}\mathbf{e})^T$. Therefore, $\mathbf{H}$ and $\mathbf{P}$ are both symmetric matrices. We deduce the symmetry of $\mathbf{QG}$ as follows

$$\begin{aligned} \mathbf{Q}\mathbf{G} &= \mathbf{Q}\left((\lambda_1+1)\mathbf{Q}+\mathbf{H}+\mathbf{P}\right)^{-1}\mathbf{Q} \\ \Rightarrow (\mathbf{Q}\mathbf{G})^{T} &= \left(\mathbf{Q}\left((\lambda_1+1)\mathbf{Q}+\mathbf{H}+\mathbf{P}\right)^{-1}\mathbf{Q}\right)^{T} = \mathbf{Q}\left(\left((\lambda_1+1)\mathbf{Q}+\mathbf{H}+\mathbf{P}\right)^{T}\right)^{-1}\mathbf{Q} = \mathbf{Q}\left((\lambda_1+1)\mathbf{Q}+\mathbf{H}+\mathbf{P}\right)^{-1}\mathbf{Q} \\ \Rightarrow (\mathbf{Q}\mathbf{G})^{T} &= \mathbf{G}^{T}\mathbf{Q} = \mathbf{Q}\mathbf{G} \end{aligned}$$

Therefore, $\mathbf{QG}$ is a symmetric matrix. □
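The following sketch (illustrative toy data; $\lambda_1$, $\lambda_2$ are assumed values, not ones prescribed by the paper) builds $\mathbf{H}$, $\mathbf{P}$, and $\mathbf{G}$ from a Gaussian kernel matrix and numerically confirms the symmetry claims of Theorem 2; it uses the identities $\sum_i\mathbf{Q}_i\mathbf{Q}_i^T=\mathbf{Q}\mathbf{Q}^T$ and $\sum_i\sum_j\mathbf{Q}_i\mathbf{Q}_j^T=(\mathbf{Q}\mathbf{e})(\mathbf{Q}\mathbf{e})^T$ to avoid explicit loops.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                 # toy data (illustrative)
m = X.shape[0]
Q = gaussian_kernel_matrix(X, gamma=1.0)
e = np.ones(m)
lam1, lam2 = 1.0, 1.0                        # assumed trade-off parameters

# sum_i Q_i Q_i^T = Q Q^T and sum_i sum_j Q_i Q_j^T = (Q e)(Q e)^T,
# where Q_i denotes the i-th column of Q.
H = (8.0 * lam2 / m) * (Q @ Q.T)
P = (8.0 * lam2 / m**2) * np.outer(Q @ e, Q @ e)
G = np.linalg.solve((lam1 + 1.0) * Q + H + P, Q)   # ((lam1+1)Q + H + P)^{-1} Q

for name, M in [("H", H), ("P", P), ("QG", Q @ G)]:
    print(name, "symmetric:", np.allclose(M, M.T, atol=1e-6))
```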

According to Theorem 1, we have the following forms of the mean and variance

$$\bar{\gamma} = \frac{1}{m} \sum_{i=1}^{m} \left\| \phi(\mathbf{x}_i) - \mathbf{a} \right\|^2 = 1 - \frac{2}{m} \boldsymbol{\alpha}^T \mathbf{Q} \mathbf{e} + \boldsymbol{\alpha}^T \mathbf{Q} \boldsymbol{\alpha} \tag{7}$$

$$\begin{aligned} \hat{\gamma} &= \frac{8}{m} \boldsymbol{\alpha}^{T} \sum_{i=1}^{m} \mathbf{Q}_{i} \mathbf{Q}_{i}^{T} \boldsymbol{\alpha} - \frac{8}{m^{2}} \boldsymbol{\alpha}^{T} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathbf{Q}_{i} \mathbf{Q}_{j}^{T} \boldsymbol{\alpha} \\ &= \boldsymbol{\alpha}^{T} \left( \frac{8}{m} \sum_{i=1}^{m} \mathbf{Q}_{i} \mathbf{Q}_{i}^{T} - \frac{8}{m^{2}} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathbf{Q}_{i} \mathbf{Q}_{j}^{T} \right) \boldsymbol{\alpha} \end{aligned} \tag{8}$$
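As a quick numerical check (toy data and an arbitrary $\boldsymbol{\alpha}$, both assumed purely for illustration), the closed forms (7) and (8) can be compared against Definitions 1 and 2 evaluated directly through kernel expansions, using $\|\phi(\mathbf{x}_i)-\mathbf{a}\|^2 = 1 - 2\boldsymbol{\alpha}^T\mathbf{Q}_i + \boldsymbol{\alpha}^T\mathbf{Q}\boldsymbol{\alpha}$ for $\mathbf{a}=\mathbf{X}\boldsymbol{\alpha}$.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 2))                  # toy samples (illustrative)
m = X.shape[0]
Q = gaussian_kernel_matrix(X, gamma=1.0)
e = np.ones(m)
alpha = rng.normal(size=m)                    # arbitrary coefficients, a = X alpha

# Squared distances ||phi(x_i) - a||^2 = 1 - 2 alpha^T Q_i + alpha^T Q alpha
d2 = 1.0 - 2.0 * (Q @ alpha) + alpha @ Q @ alpha

# Definitions 1 and 2 evaluated directly
mean_def = d2.mean()
var_def = np.mean((d2[:, None] - d2[None, :]) ** 2)

# Closed forms of Equations (7) and (8)
mean_closed = 1.0 - (2.0 / m) * alpha @ Q @ e + alpha @ Q @ alpha
S1 = Q @ Q.T                                   # sum_i Q_i Q_i^T
S2 = np.outer(Q @ e, Q @ e)                    # sum_i sum_j Q_i Q_j^T
var_closed = alpha @ ((8.0 / m) * S1 - (8.0 / m**2) * S2) @ alpha

print(np.isclose(mean_def, mean_closed), np.isclose(var_def, var_closed))
```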
