**Yun-Long Gao <sup>1</sup>, Si-Zhe Luo <sup>1</sup>, Zhi-Hao Wang <sup>1</sup>, Chih-Cheng Chen <sup>2,</sup>\* and Jin-Yan Pan <sup>2,</sup>\***


Received: 15 July 2019; Accepted: 7 August 2019; Published: 12 August 2019

**Abstract:** Graph-based embedding methods receive much attention because they exploit graph and manifold information. However, conventional graph-based embedding methods may not always be effective when the data are high-dimensional and have complex distributions. First, the similarity matrix considers only local distance measurement in the original space, which cannot reflect a wide variety of data structures. Second, the separation of graph construction from dimensionality reduction means the similarity matrix cannot be fully relied upon, because the original data usually contain many noisy samples and features. In this paper, we address these problems by constructing two adjacency graphs that represent the similarity and the diversity structure of the original data, and then imposing a rank constraint on the corresponding Laplacian matrix to build a novel adaptive graph learning method, namely locality sensitive discriminative unsupervised dimensionality reduction (LSDUDR). As a result, the learned graph shows a clear block diagonal structure, so that the clustering structure of the data is preserved. Experimental results on synthetic data sets and real-world benchmark data sets demonstrate the effectiveness of our approach.

**Keywords:** machine learning; graph embedding method; dimensionality reduction; diversity learning; adaptive neighbors

#### **1. Introduction**

Due to the huge volumes of data generated by advances in science and technology, dimensionality reduction has become an important task in data mining and machine learning research, with many applications [1–4]. Such data are characterized by high dimensionality, nonlinearity, and extreme complexity, which complicate subsequent data processing. However, the intrinsic dimensionality of the data is often much lower, owing to the redundant information hidden in the original space [5]. Therefore, revealing the potential low-dimensional representation underlying the high-dimensional structure is an essential preprocessing step for various applications. Against this background, many supervised and unsupervised dimensionality reduction methods have been proposed, such as principal component analysis (PCA) [6], linear discriminant analysis (LDA) [7], Laplacian embedding (LE) [8–10], local linear embedding (LLE) [11], locality preserving projections (LPP) [12], neighborhood minmax projections (NMMP) [13], isometric feature mapping (IsoMAP) [14], discriminant sparsity neighborhood preserving embedding (DSNPE) [15], and multiple empirical kernel learning with locality preserving constraint (MEKL-LPC) [16], etc. Clearly, unsupervised dimensionality reduction is more challenging than the supervised case owing to the lack of label information. Among these methods, graph embedding exhibits notable performance because it captures the structural information of the high-dimensional space. Graph embedding is built on the manifold assumption, which holds that the data are generated according to a certain manifold structure and that nearby data points tend to share the same labels.

The commonly used graph-based algorithms, such as LPP [12], IsoMAP [14], local graph based correlation clustering (LGBACC) [17], and locality weighted sparse representation (LWSR) [18], generally follow the same steps: (1) build an adjacency graph for each neighborhood; (2) construct pairwise feature (similarity) weights for each neighborhood to describe the intrinsic manifold structure; and (3) convert the problem into an eigenvalue problem. Thus, the traditional graph-based algorithms mentioned above all construct the graph independently of the subsequent processes, i.e., cluster indicators must be extracted through post-processing, so the dimensionality reduction results depend heavily on the input pairwise feature matrix [19]. Moreover, accounting only for local distances in the original space cannot adequately eliminate noise or capture the underlying manifold structure [20], because such distances are an insufficient description of data similarity. It is also usually difficult to explicitly capture the intrinsic structure of data by using only pairwise data during graph construction [21]: the similarity of a pair of points depends on the adjacency graph built from that pair alone, without consideration of their local environment. As Figure 1 shows, the distance between *A* and *B* is shorter than that between *A* and *C* in the original space, so the pairwise similarity *S*(*A*, *B*) is larger than *S*(*A*, *C*); traditional methods would therefore group *A* with *B* and assign *C* to the other class. However, in typical classification or clustering tasks, *A* and *C* should be regarded as more similar, because a dense chain of points links *A* and *C* along the same manifold, whereas a large gap separates *A* from *B*, which lie on different manifolds and should consequently be divided into different classes. Therefore, the traditional definition of similarity does not sufficiently describe the structure.

**Figure 1.** A data point map (point *A* and point *B* are closer in Euclidean distance, but point *A* and point *C* are more similar in the two-manifold structure).

In recent years, much research has been devoted to solving these problems. For example, the constrained Laplacian rank (CLR) method [22] learns a block diagonal similarity matrix so that the clustering indicators can be extracted immediately. Cauchy graph embedding [23] proposes a new objective that preserves the similarity between the original data in the embedded space and emphasizes that the more similar two nodes are, the closer they should be in the embedding. Projected clustering with adaptive neighbors (PCAN) [24] learns a similarity matrix by assigning adaptive, optimal neighbors to each data point on the basis of local distances, instead of learning a probabilistic affinity matrix before dimensionality reduction. Stable semi-supervised discriminant learning (SSDL) [25] learns the intrinsic structure of constructed adjacency graphs, extracting the local topology characteristics as well as the geometrical properties. Nonetheless, these methods each address only part of the problems mentioned above, and the challenge of reasonably representing the underlying data structure, or of adaptively adjusting the similarity graph, remains. Consequently, it is both necessary and challenging to develop an algorithm that addresses all of these problems.

In this paper, we propose a novel adaptive graph learning method, namely locality sensitive discriminative unsupervised dimensionality reduction (LSDUDR), which aims to uncover the intrinsic topological structure of data through two objective functions. The first objective function guarantees that nearby points are mapped close to each other in the subspace, while the second prevents points separated by a large distance from being mapped close together in the subspace. Furthermore, a data similarity matrix is learned that adaptively adjusts the initial input graph on the basis of the projected local distances; that is to say, the projection is adjusted jointly with graph learning. Moreover, we impose a rank constraint on the similarity matrix so that it contains more explicit data structure information. It is worthwhile to emphasize the main contributions of our method: (1) LSDUDR constructs a discriminative linear embedded representation that can handle high-dimensional data and characterize the intrinsic geometrical structure of the data; (2) in contrast to traditional two-stage graph embedding methods, which require an independent affinity graph to be constructed in advance, LSDUDR learns a clustering-oriented graph from which the clustering indicators can be extracted without post-processing; (3) comprehensive experiments on both synthetic data sets and real-world benchmark data sets demonstrate the effectiveness of the proposed LSDUDR.

#### **2. Related Work**

#### *2.1. Principal Component Analysis (PCA)*

PCA is one of the most representative unsupervised dimensionality reduction methods. The main idea of PCA is to seek a projection transformation that maximizes the variance of the data. Assume that we have a data matrix $X \in \mathbb{R}^{d\times n}$, where $x_i \in \mathbb{R}^{d\times 1}$ denotes the $i$-th sample. For better generality, the samples in the data set are centralized, i.e., $\sum_{i=1}^{n} x_i = 0$. PCA aims to solve the following problem:

$$\max\_{\mathbf{W}^T \mathbf{W} = I} \sum\_{i,j=1}^n \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2, \tag{1}$$

where $W \in \mathbb{R}^{d\times m}$ is the projection matrix and $m$ is the dimensionality of the linear subspace. When the data points lie in a low-dimensional manifold that is linear or nearly linear, the low-dimensional structure of the data can be effectively captured by the linear subspace spanned by the principal PCA directions; this property provides a basis for using the global scatter of samples as a regularizer in many applications.
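As a minimal illustration (not the authors' code), the PCA solution of Equation (1) can be sketched in NumPy: maximizing the summed pairwise distances of the projections is equivalent to keeping the leading eigenvectors of the centered scatter matrix. The function name is ours.

```python
import numpy as np

def pca(X, m):
    """Project the d x n data matrix X onto its top-m principal directions.

    Maximizing the summed pairwise distances in Equation (1) is equivalent
    to keeping the eigenvectors of the scatter matrix with the largest
    eigenvalues.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center the samples
    C = Xc @ Xc.T                            # d x d scatter matrix
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    W = vecs[:, -m:]                         # top-m eigenvectors
    return W, W.T @ Xc                       # projection matrix and embedding
```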

#### *2.2. Locality Preserving Projections (LPP)*

LPP is a popular alternative in linear manifold learning to methods that project the data along the directions of maximal variance. Instead, LPP employs an adjacency graph to extract the structural properties of the high-dimensional data and transplants these properties into the low-dimensional subspace. The objective function of LPP is

$$\min\_{\mathbf{W}^T \mathbf{W} = I} \sum\_{i,j=1}^n \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 s\_{ij}, \tag{2}$$

where $s_{ij}$ is defined as the similarity between samples $x_i$ and $x_j$. As we can see, LPP is a linear version of Laplacian eigenmaps that uses a linear model to approximate nonlinear dimensionality reduction. Thus, it shares many of the data representation properties of nonlinear techniques such as Laplacian eigenmaps or locally linear embedding.
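Under the orthogonality constraint of Equation (2), the LPP solution can be sketched as follows (a minimal illustration with names of our own; the classical LPP formulation uses a different normalization constraint, but Equation (2) as written leads to a plain eigenvalue problem):

```python
import numpy as np

def lpp(X, S, m):
    """Sketch of LPP as given in Equation (2).

    Minimizing sum_ij ||W^T x_i - W^T x_j||_2^2 s_ij equals minimizing
    2 tr(W^T X L X^T W) with L = D - (S + S^T)/2, so under W^T W = I the
    solution is the m eigenvectors of X L X^T with smallest eigenvalues.
    """
    A = (S + S.T) / 2.0                      # symmetrized similarities
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
    _, vecs = np.linalg.eigh(X @ L @ X.T)    # eigenvalues ascending
    return vecs[:, :m]                       # m smallest eigenvectors
```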

#### *2.3. Clustering and Projected Clustering with Adaptive Neighbors (PCAN)*

The PCAN algorithm performs subspace learning and clustering simultaneously, instead of learning an initial pairwise feature matrix before dimensionality reduction. The goal of PCAN is to assign the optimal, adaptive neighbors to each data point according to the local distances, so that it can learn a new data similarity matrix. Therefore, it can be used both as a clustering method and as an unsupervised dimensionality reduction method. Denote the total scatter matrix by $S_t = XHX^T$, where $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix and $\mathbf{1}$ is a column vector whose elements are all 1. PCAN constrains the subspace with $W^T S_t W = I$ so that the data in the subspace are statistically uncorrelated. In [24], the definition of PCAN is:

$$\begin{aligned} \min\_{\mathbf{S}, \mathbf{W}} & \sum\_{i,j=1}^{n} \left( \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 s\_{ij} + \theta s\_{ij}^2 \right) \\ \text{s.t.} & \forall i, \mathbf{s}\_i^T \mathbf{1} = 1, 0 \le s\_{ij} \le 1, \mathbf{W}^T \mathbf{S}\_t \mathbf{W} = I, \\ & \text{rank}(L) = n - c, \end{aligned} \tag{3}$$

where $L = D - (S + S^T)/2$ is the Laplacian matrix in graph theory, and the $i$-th diagonal element of the degree matrix $D \in \mathbb{R}^{n\times n}$ is $\sum_j (s_{ij} + s_{ji})/2$. Then, by assigning the adaptive neighbors according to the local distances, the neighbor assignment divides the data points into $c$ clusters based on the learned similarity matrix $S$, which can be used directly for clustering without any further post-processing.

#### **3. Locality Sensitive Discriminative Unsupervised Dimensionality Reduction**

#### *3.1. Intrinsic Structure Representation*

The proposed method needs a pre-defined affinity matrix $S$ as the initial graph. While learning the affinity values of $S$, a smaller squared Euclidean distance $\|x_i - x_j\|_2^2$ should correspond to a larger affinity value $s_{ij}$. Thus, determining the value of $s_{ij}$ can be seen as solving the following problem:

$$\min\_{\mathbf{s}\_i^T \mathbf{1} = 1, s\_i \ge 0, s\_{ii} = 0} \sum\_{j=1}^n \left( \left\| \mathbf{x}\_i - \mathbf{x}\_j \right\|\_2^2 \mathbf{s}\_{ij} + \theta \mathbf{s}\_{ij}^2 \right),\tag{4}$$

where $\theta$ is the regularization parameter. The affinities are learned using a suitable $\theta$ in formula (4) so that the optimal solution $s_i$ has exactly $k$ nonzero values, where $k$ is the number of neighbors. Let us define $e_{ij} = \|x_i - x_j\|_2^2$ and denote by $e_i$ the vector whose $j$-th element is $e_{ij}$; formula (4) can then be simplified as

$$\min\_{\mathbf{s}\_i^T \mathbf{1} = 1, s\_i \ge 0, s\_{ii} = 0} \frac{1}{2} \left\| \mathbf{s}\_i + \frac{1}{2\theta} \mathbf{e}\_i \right\|\_2^2. \tag{5}$$

According to [22], assuming that $e_{i1}, e_{i2}, \ldots, e_{in}$ are sorted in ascending order, we can get the optimal affinities $\hat{s}_{ij}$ as follows:

$$\hat{s}\_{ij} = \begin{cases} \dfrac{e\_{i,k+1} - e\_{ij}}{k e\_{i,k+1} - \sum\_{h=1}^{k} e\_{ih}}, & j \le k, \\ 0, & j > k. \end{cases} \tag{6}$$

Next, we define two adjacency graphs $M_s = \{X, S\}$ and $M_d = \{X, V\}$ in order to characterize the intrinsic structure of the data. The elements of matrix $S$ represent the similarity between nearby points, and the elements of matrix $V$ represent the diversity between nearby points. We define the elements $v_{ij}$ of $V$ as follows:

$$v\_{ij} = \begin{cases} 1 - s\_{ij}, & j \le k, \\ 0, & j > k. \end{cases} \tag{7}$$
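The two graph constructions above can be sketched in NumPy as follows (a minimal illustration; the function name and the uniform fallback for tied distances are our own choices):

```python
import numpy as np

def similarity_and_diversity(X, k):
    """Adaptive-neighbor weights of Equations (6) and (7).

    X : d x n data matrix, k : number of neighbors.
    Returns the similarity matrix S and the diversity matrix V,
    with v_ij = 1 - s_ij on each point's k nearest neighbors.
    """
    n = X.shape[1]
    # e_ij = ||x_i - x_j||_2^2, computed for all pairs at once
    E = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    S = np.zeros((n, n))
    V = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(E[i])          # ascending; order[0] is i itself
        nbrs = order[1:k + 2]             # k neighbors plus the (k+1)-th
        e = E[i, nbrs]
        denom = k * e[k] - e[:k].sum()
        # closed form of Equation (6); uniform weights if all gaps vanish
        s = (e[k] - e[:k]) / denom if denom > 0 else np.full(k, 1.0 / k)
        S[i, nbrs[:k]] = s
        V[i, nbrs[:k]] = 1.0 - s          # Equation (7)
    return S, V
```

Each row of `S` sums to one by construction, matching the constraint $s_i^T\mathbf{1} = 1$.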

Even with the above construction, similarity or diversity alone does not yield a clear and simple intrinsic structure. Therefore, two objective functions are proposed simultaneously to emphasize the local intrinsic structure. One objective function guarantees that nearby data points are embedded close to each other in the subspace, and mainly focuses on preserving the similarity relationships among nearby data; the other focuses on the shape of the manifold and guarantees that nearby data separated by a large distance are not embedded too close to each other in the subspace, effectively preserving the diversity relationships of the data. By integrating these two objective functions, the local topology is guaranteed, that is to say, both the similarity and diversity properties of the data can be preserved. Based on the above considerations, we employ the following objective functions to capture the local intrinsic structure:

$$\min\_{\mathbf{W}^T \mathbf{W} = I} \sum\_{i,j=1}^n \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 s\_{ij}, \tag{8}$$

$$\max\_{\mathbf{W}^T \mathbf{W} = I} \sum\_{i,j=1}^n \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 v\_{ij}. \tag{9}$$

By simple algebra, we have:

$$\sum\_{i,j=1}^{n} \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 s\_{ij} = \text{tr}\left(\mathbf{W}^T X L\_S X^T \mathbf{W}\right), \tag{10}$$

$$\sum\_{i,j=1}^{n} \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 v\_{ij} = \text{tr}\left(\mathbf{W}^T X L\_V X^T \mathbf{W}\right), \tag{11}$$

where $L_S = D - (S + S^T)/2$ and $L_V = P - (V + V^T)/2$, and $P \in \mathbb{R}^{n\times n}$ is a diagonal matrix whose entries are the column sums of $V$. Furthermore, in order to take the global geometric structure of the data into account, we introduce a third objective: preserving as much information as possible by maximizing the overall variance of the input data. Then, inspired by LDA, we can construct a concise discriminant criterion by combining the three objective functions, which contains both local and global geometrical structure information for dimensionality reduction:

$$\min\_{\mathbf{W}} \frac{\text{tr}\left(\mathbf{W}^T X \left(L\_S - \beta L\_V\right) X^T \mathbf{W}\right)}{\text{tr}\left(\mathbf{W}^T X H X^T \mathbf{W}\right)}. \tag{12}$$

Substituting the definitions of $L_S$ and $L_V$ into Equation (12), we have:

$$\begin{split} tr\left(\boldsymbol{\mathcal{W}}^{T}\mathbf{X}\left(\boldsymbol{L}\_{\mathcal{S}}-\boldsymbol{\beta}\boldsymbol{L}\_{V}\right)\boldsymbol{X}^{T}\boldsymbol{\mathcal{W}}\right) &= \sum\_{i,j=1}^{n} \left\lVert\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{i}-\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{j}\right\rVert\_{2}^{2}\boldsymbol{s}\_{ij}-\boldsymbol{\beta}\sum\_{i,j=1}^{n} \left\lVert\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{i}-\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{j}\right\rVert\_{2}^{2}\boldsymbol{v}\_{ij} \\ &= \left(1+\boldsymbol{\beta}\right)\sum\_{i,j=1}^{n} \left\lVert\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{i}-\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{j}\right\rVert\_{2}^{2}\boldsymbol{s}\_{ij}-\boldsymbol{\beta}\sum\_{i=1}^{n}\sum\_{j=1}^{k} \left\lVert\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{i}-\boldsymbol{\mathcal{W}}^{T}\mathbf{x}\_{j}\right\rVert\_{2}^{2}. \end{split} \tag{13}$$

According to the definition of $s_{ij}$, when $j > k$ we have $s_{ij} = 0$. Therefore, $\sum_{i,j=1}^{n} \|W^T x_i - W^T x_j\|_2^2 s_{ij}$ models the local geometric structure, while $\sum_{i=1}^{n}\sum_{j=1}^{k} \|W^T x_i - W^T x_j\|_2^2$ represents the total scatter in the local region. Thus, we call this model locality sensitive discriminative unsupervised dimensionality reduction.
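For concreteness, the criterion of Equation (12) can be evaluated for a given projection as follows (a minimal sketch with names of our own; $H$ is the centering matrix from Section 2.3):

```python
import numpy as np

def lsdudr_objective(W, X, S, V, beta):
    """Evaluate the trace-ratio criterion of Equation (12).

    L_S and L_V are the Laplacians of the similarity and diversity graphs;
    the denominator is the total scatter of the projected, centered data.
    """
    n = X.shape[1]
    Ls = np.diag((S + S.T).sum(axis=1) / 2) - (S + S.T) / 2
    Lv = np.diag((V + V.T).sum(axis=1) / 2) - (V + V.T) / 2
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    num = np.trace(W.T @ X @ (Ls - beta * Lv) @ X.T @ W)
    den = np.trace(W.T @ X @ H @ X.T @ W)
    return num / den
```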

#### *3.2. Analysis of Optimal Graph Learning*

When the data contain a large number of noise samples, the similarity matrix $S$ obtained by Equation (6) is virtually never in the ideal state. The desired situation is that we map the data to a low-dimensional subspace in which the elements of the similarity matrix within a cluster are nonzero and evenly distributed, while the elements between clusters are zero. Based on the above considerations, we adopt a novel and feasible way to achieve the desired state:

$$\begin{aligned} \min\_{\mathbf{W},\mathbf{S}} & \frac{\text{tr}\left(\mathbf{W}^T X \left(L\_S - \beta L\_V\right) X^T \mathbf{W}\right)}{\text{tr}\left(\mathbf{W}^T X H X^T \mathbf{W}\right)} + \theta \left\|\mathbf{S}\right\|\_F^2 \\ \text{s.t.} & \ \forall i,\ \mathbf{s}\_i^T \mathbf{1} = 1,\ 0 \le s\_{ij} \le 1,\ \mathbf{W}^T\mathbf{W} = I,\ \text{rank}(L\_S) = n - c. \end{aligned} \tag{14}$$

In order to exclude the trivial solution, we add the regularization term $\theta\|S\|_F^2$. The first and second constraints follow from the definition of graph weights: each row of $S$ sums to one and its entries are non-negative. In addition, we add a rank constraint to the problem. If $S$ is non-negative, the Laplacian matrix has a significant property:

**Theorem 1.** *A graph S with sij* ≥ 0(∀*i*, *j*) *has c connected components if and only if the algebraic multiplicity of eigenvalue 0 for the corresponding Laplacian matrix LS is c [26].*

Theorem 1 reveals that, when $\text{rank}(L_S) = n - c$, the obtained graph divides the data set into exactly $c$ clusters based on the block diagonal structure of the similarity matrix $S$. It is worth mentioning that Equation (14) can simultaneously learn the projection matrix $W$ and the similarity matrix $S$, which differs significantly from previous works. However, the problem is hard to tackle directly, especially with several strict constraints. To solve it, we propose an iterative optimization algorithm.
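Theorem 1 suggests a simple numerical check: count the (near-)zero eigenvalues of the Laplacian. A minimal sketch, with a tolerance `tol` of our own choosing:

```python
import numpy as np

def n_connected_components(S, tol=1e-9):
    """Count the connected components of graph S via Theorem 1:
    the multiplicity of eigenvalue 0 of the Laplacian L_S equals the
    number of components."""
    A = (S + S.T) / 2.0                      # symmetrize the affinities
    L = np.diag(A.sum(axis=1)) - A           # Laplacian L = D - A
    vals = np.linalg.eigvalsh(L)             # real eigenvalues, ascending
    return int((vals < tol).sum())
```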

#### **4. Optimization**

#### *4.1. Determine the Values of W, F, and S*

Without loss of generality, suppose $\sigma_i(L_S)$ is the $i$-th smallest eigenvalue of $L_S$. Clearly $\sigma_i(L_S) \ge 0$, since $L_S$ is positive semi-definite. Then, if $\lambda$ is large enough, Equation (14) can be rewritten as:

$$\begin{aligned} \min\_{\mathbf{W},\mathbf{S}} & \frac{\text{tr}\left(\mathbf{W}^{T} X \left(L\_{S} - \beta L\_{V} \right) X^{T} \mathbf{W}\right)}{\text{tr}\left(\mathbf{W}^{T} X H X^{T} \mathbf{W}\right)} + \theta \left\|\mathbf{S}\right\|\_{F}^{2} + 2\lambda \sum\_{i=1}^{c} \sigma\_{i} \left(L\_{S}\right) \\ \text{s.t.} & \ \forall i,\ \mathbf{s}\_{i}^{T} \mathbf{1} = 1,\ 0 \le s\_{ij} \le 1,\ \mathbf{W}^{T} \mathbf{W} = I. \end{aligned} \tag{15}$$

The hyperparameter $\lambda$ here balances the rank of the graph Laplacian against consistency with the data structure; the rank constraint on the graph Laplacian is usually satisfied when $\lambda$ is large enough. Meanwhile, we introduce a rank-enforcing matrix $F \in \mathbb{R}^{n\times c}$, where node $i$ is assigned a function value $\mathbf{f}_i \in \mathbb{R}^{1\times c}$. According to Ky Fan's theorem [27], the rank constraint term in Equation (15) can be seen as optimizing over the smallest $c$ eigenvalues of the Laplacian matrix. Thus, we can transform Equation (15) into the following form:

$$\min\_{\begin{subarray}{c}W,S,\boldsymbol{F} \\ \boldsymbol{W},\mathbf{s},\boldsymbol{F} \end{subarray}} \frac{\mathrm{tr}\left(\boldsymbol{W}^{T}\boldsymbol{X}\left(\boldsymbol{L}\_{S}-\beta\boldsymbol{L}\_{V}\right)\boldsymbol{X}^{T}\boldsymbol{W}\right)}{\mathrm{tr}\left(\boldsymbol{W}^{T}\boldsymbol{X}\boldsymbol{H}\boldsymbol{X}^{T}\boldsymbol{W}\right)} + \theta\left\|\boldsymbol{S}\right\|\_{F}^{2} + 2\lambda\mathrm{tr}\left(\boldsymbol{F}^{T}\boldsymbol{L}\_{S}\boldsymbol{F}\right),\tag{16}$$
$$\text{s.t.}\ \forall i,\ \mathbf{s}\_{i}^{T}\mathbf{1} = 1,\ 0 \le s\_{ij} \le 1,\ \mathbf{W}^{T}\mathbf{W} = I,\ \mathbf{F}^{T}\mathbf{F} = I.$$

When *S* and *F* are fixed, problem (16) can be rewritten as:

$$\begin{array}{ll}\min\frac{\operatorname{tr}\left(\boldsymbol{W}^{T}\boldsymbol{X}\left(\boldsymbol{L}\_{S}-\beta\boldsymbol{L}\_{V}\right)\boldsymbol{X}^{T}\boldsymbol{W}\right)}{\operatorname{tr}\left(\boldsymbol{W}^{T}\boldsymbol{X}\boldsymbol{H}\boldsymbol{X}^{T}\boldsymbol{W}\right)}\\\text{s.t.}\boldsymbol{W}^{T}\boldsymbol{W}=\boldsymbol{I}.\end{array} \tag{17}$$

We can use the iterative method introduced in [28] to solve for $W$ in Equation (17); the Lagrangian function constructed from Equation (17) is:

$$L\left(\mathcal{W},\eta\right) = \frac{\text{tr}\left(\mathcal{W}^T \mathcal{X} \left(L\_S - \beta L\_V\right) \mathcal{X}^T \mathcal{W}\right)}{\text{tr}\left(\mathcal{W}^T \mathcal{X} H \mathcal{X}^T \mathcal{W}\right)} - \eta \text{tr}\left(\mathcal{W}^T \mathcal{W} - I\right),\tag{18}$$

where $\eta$ is a scalar. Then, taking the derivative with respect to $W$ and setting the result to zero, we have

$$\left(X\left(L\_S - \beta L\_V\right)X^T - \frac{\text{tr}\left(\mathcal{W}^T X \left(L\_S - \beta L\_V\right)X^T \mathcal{W}\right)}{\text{tr}\left(\mathcal{W}^T X H X^T \mathcal{W}\right)} X H X^T\right) \mathcal{W} = \tilde{\eta} \mathcal{W},\tag{19}$$

where $\tilde{\eta} = \eta\,\text{tr}\left(\mathbf{W}^T X H X^T \mathbf{W}\right)$. The optimal solution of $W$ in Equation (19) is formed by the $m$ eigenvectors corresponding to the $m$ smallest eigenvalues of the matrix:

$$X\left(L\_S-\beta L\_V\right)X^T-\frac{\text{tr}\left(\mathbf{W}^TX\left(L\_S-\beta L\_V\right)X^T\mathbf{W}\right)}{\text{tr}\left(\mathbf{W}^TXHX^T\mathbf{W}\right)}XHX^T. \tag{20}$$

When *W* and *S* are fixed, problem (16) becomes

$$\begin{array}{ll}\min\_{F} 2\lambda tr(F^T L\_S F) \\ \text{s.t.} & F^T F = I. \end{array} \tag{21}$$

Since $\lambda$ is a constant, the optimal rank-enforcing matrix $F$ in Equation (21) is composed of the $c$ eigenvectors corresponding to the $c$ smallest eigenvalues of the Laplacian matrix $L_S$.
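The $W$ and $F$ updates above can be sketched as follows (a simplified illustration; the fixed iteration count and the random initialization of $W$ are our own choices, not specified by the paper):

```python
import numpy as np

def update_W(X, Ls, Lv, H, beta, m, n_iter=10):
    """Iterative trace-ratio update of problem (17): W is refreshed as the
    m smallest eigenvectors of the matrix in Equation (20)."""
    d = X.shape[0]
    W = np.linalg.qr(np.random.RandomState(0).randn(d, m))[0]
    A = X @ (Ls - beta * Lv) @ X.T
    B = X @ H @ X.T
    for _ in range(n_iter):
        ratio = np.trace(W.T @ A @ W) / np.trace(W.T @ B @ W)
        _, vecs = np.linalg.eigh(A - ratio * B)
        W = vecs[:, :m]                      # m smallest eigenvectors
    return W

def update_F(Ls, c):
    """F update of Equation (21): the c eigenvectors of L_S with the
    smallest eigenvalues."""
    _, vecs = np.linalg.eigh(Ls)
    return vecs[:, :c]
```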

When we fix $W$ and $F$, problem (16) becomes:

$$\begin{split} \min\_{\mathbf{S}} & \sum\_{i,j=1}^{n} \left( \frac{(1+\beta) \left\| \mathbf{W}^T \mathbf{x}\_i - \mathbf{W}^T \mathbf{x}\_j \right\|\_2^2 s\_{ij}}{\text{tr}\left( \mathbf{W}^T X H X^T \mathbf{W} \right)} + \theta s\_{ij}^2 + \lambda \left\| \mathbf{f}\_i - \mathbf{f}\_j \right\|\_2^2 s\_{ij} \right) \\ \text{s.t.} & \ \forall i,\ \mathbf{s}\_i^T \mathbf{1} = 1,\ 0 \le s\_{ij} \le 1. \end{split} \tag{22}$$

Note that problem (22) can be solved independently for different **s***i*, so that the following problem can be solved separately for each *i*:

$$\begin{aligned} \min\_{\mathbf{s}\_i} & \sum\_{j=1}^n \left( \Gamma\_{ij} s\_{ij} + \theta s\_{ij}^2 + \lambda \Psi\_{ij} s\_{ij} \right) \\ \text{s.t.} & \ \mathbf{s}\_i^T \mathbf{1} = 1,\ 0 \le s\_{ij} \le 1, \end{aligned} \tag{23}$$

where $\Gamma_{ij} = \frac{(1+\beta)\left\|W^T x_i - W^T x_j\right\|_2^2}{\text{tr}\left(W^T X H X^T W\right)}$ and $\Psi_{ij} = \left\|\mathbf{f}_i - \mathbf{f}_j\right\|_2^2$. Then, Equation (23) can be rewritten as:

$$\begin{array}{ll}\min\_{\mathbf{s}\_i} & \left\|\mathbf{s}\_i + \frac{1}{2\theta}(\Gamma\_i + \lambda \Psi\_i)\right\|\_2^2\\ \text{s.t.} & \mathbf{s}\_i^T \mathbf{1} = 1,\ 0 \le s\_{ij} \le 1. \end{array} \tag{24}$$

Thus, Equation (24) can be solved easily in closed form. Denote the vector $\mathbf{d}_i \in \mathbb{R}^{n\times 1}$ with $d_{ij} = \Gamma_{ij} + \lambda\Psi_{ij}$. For each $i$, the Lagrangian function can be written as:

$$L(\mathbf{s}\_i, \varsigma, \boldsymbol{\gamma}\_i) = \frac{1}{2} \left\| \mathbf{s}\_i + \frac{1}{2\theta} \mathbf{d}\_i \right\|\_2^2 - \varsigma(\mathbf{s}\_i^T\mathbf{1} - 1) - \boldsymbol{\gamma}\_i^T\mathbf{s}\_i, \tag{25}$$

where $\varsigma$ and $\boldsymbol{\gamma}_i \ge 0$ are the Lagrangian multipliers. Taking the partial derivative with respect to each $\mathbf{s}_i$ and setting it to zero, the K.K.T. conditions give:

$$\begin{aligned} (\mathbf{s}\_i)\_j + \frac{1}{2\theta}(\mathbf{d}\_i)\_j - \varsigma - (\boldsymbol{\gamma}\_i)\_j &= 0, \\ (\mathbf{s}\_i)\_j &\ge 0, \\ (\boldsymbol{\gamma}\_i)\_j &\ge 0, \\ (\mathbf{s}\_i)\_j (\boldsymbol{\gamma}\_i)\_j &= 0, \\ \mathbf{s}\_i^T \mathbf{1} - 1 &= 0. \end{aligned} \tag{26}$$

Then, we obtain the closed-form solution for $\mathbf{s}_i$:

$$(\mathbf{s}\_i)\_j = \left[-\frac{1}{2\theta}(\mathbf{d}\_i)\_j + \varsigma \right]\_+. \tag{27}$$
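Solving Equation (24) under the simplex constraints amounts to a Euclidean projection onto the probability simplex; below is a minimal sketch using the standard sorting-based projection (the solver choice is ours; the paper only states the closed form of Equation (27)):

```python
import numpy as np

def update_si(d, theta):
    """Row update of Equation (27): find the multiplier varsigma in the
    KKT conditions (26) so that s_i = [-d/(2*theta) + varsigma]_+ sums
    to one, via the standard simplex-projection recipe."""
    u = -d / (2.0 * theta)
    srt = np.sort(u)[::-1]                   # sort descending
    css = np.cumsum(srt)
    # largest rho with srt[rho] + (1 - css[rho]) / (rho + 1) > 0
    rho = np.nonzero(srt + (1.0 - css) / np.arange(1, len(u) + 1) > 0)[0][-1]
    varsigma = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(u + varsigma, 0.0)
```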

#### *4.2. Approach to Determine the Initial Values of θ and λ*

In actual experiments, regularization parameters are difficult to tune because their values may range from zero to infinity. In this section, we propose an efficient way to determine the regularization parameters $\theta$ and $\lambda$ as follows:

$$
\lambda = \theta = \frac{1}{n} \sum\_{i=1}^{n} \left[ \frac{k}{2} d\_{i,k+1} - \frac{1}{2} \sum\_{j=1}^{k} d\_{ij} \right]. \tag{28}
$$

Here, $k$ is a pre-defined parameter. In this way, we only need to set the preferred number of neighbors rather than the two hyper-parameters $\theta$ and $\lambda$; the number of neighbors is usually easy to set according to the number of samples and the locality of the data set. The rationale for deciding $\theta$ and $\lambda$ from the distance gap between the $k$-th and $(k+1)$-th neighbors is that, to achieve a desired similarity in which the top-$k$ neighbor similarities are kept and the rest are set to zero, we should approximately have

$$\frac{k}{2} d\_{ik} - \frac{1}{2} \sum\_{j=1}^k d\_{ij} < \theta\_i \le \frac{k}{2} d\_{i,k+1} - \frac{1}{2} \sum\_{j=1}^k d\_{ij}, \tag{29}$$

where *di*1, *di*2, ..., *din* are sorted in ascending order. If we set the inequality to equality, we can get an estimation of *θ*:

$$
\theta \sim \frac{1}{n} \sum\_{i=1}^{n} \left[ \frac{k}{2} d\_{i,k+1} - \frac{1}{2} \sum\_{j=1}^{k} d\_{ij} \right]. \tag{30}
$$

Similarly, *λ* is set to be equal to *θ* as follows:

$$
\lambda = \theta \sim \frac{1}{n} \sum\_{i=1}^{n} \left[ \frac{k}{2} d\_{i,k+1} - \frac{1}{2} \sum\_{j=1}^{k} d\_{ij} \right]. \tag{31}
$$
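Equation (28) can be sketched as follows (assuming, as in Section 3.1, that $d_{i1} \le \cdots \le d_{in}$ are each sample's sorted squared Euclidean distances to the other samples; the function name is ours):

```python
import numpy as np

def init_theta(X, k):
    """Initialize theta (= lambda) from the k-neighbor distance gaps,
    averaging the per-sample bound of Equation (29) as in Equation (28)."""
    n = X.shape[1]
    # squared Euclidean distances between all pairs of columns of X
    E = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    total = 0.0
    for i in range(n):
        d = np.sort(E[i])[1:]                 # drop the zero self-distance
        total += 0.5 * k * d[k] - 0.5 * d[:k].sum()
    return total / n
```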

Since these two parameters control the regularization strength, we adaptively update them during each iteration, as described in Step 5 of Algorithm 1.


The detailed steps are summarized in Algorithm 1.

#### **Algorithm 1** Framework of the LSDUDR method.

**Require:** Data $X \in \mathbb{R}^{d\times n}$, cluster number $c$, projection dimension $m$.

Initialize *S* and *V* according to Equations (6) and (7). Initialize parameters *θ* and *λ* by Equation (28).

#### **repeat**

1. Construct the Laplacian matrices $L_S = D - (S + S^T)/2$ and $L_V = P - (V + V^T)/2$.

2. Calculate *F*: the columns of *F* are the *c* eigenvectors of $L_S$ corresponding to the *c* smallest eigenvalues.

3. Calculate the projection matrix *W* from the *m* eigenvectors corresponding to the *m* smallest eigenvalues of the matrix:

$$X\left(L\_S-\beta L\_V\right)X^T-\frac{\text{tr}\left(\mathbf{W}^TX\left(L\_S-\beta L\_V\right)X^T\mathbf{W}\right)}{\text{tr}\left(\mathbf{W}^TXHX^T\mathbf{W}\right)}XHX^T.$$

4. Compute *S* by updating each $\mathbf{s}_i$ according to Equation (27).

5. Calculate the number of connected components of the graph; if it is smaller than *c*, multiply *λ* by 2; if it is larger than *c*, divide *λ* by 2.

#### **until Convergence**

#### **return**

Projection matrix $W \in \mathbb{R}^{d\times m}$ and similarity matrix $S \in \mathbb{R}^{n\times n}$.

#### **5. Discussion**

#### *5.1. Analysis*

As previously discussed, LSDUDR represents the local intrinsic structure of data set based on Equations (8) and (9). Then, we integrate the two objective functions as follows:

$$\begin{aligned} &\sum\_{i,j=1}^{n} \left\| \mathbf{W}^{T} \mathbf{x}\_{i} - \mathbf{W}^{T} \mathbf{x}\_{j} \right\|\_{2}^{2} s\_{ij} - \beta \sum\_{i,j=1}^{n} \left\| \mathbf{W}^{T} \mathbf{x}\_{i} - \mathbf{W}^{T} \mathbf{x}\_{j} \right\|\_{2}^{2} v\_{ij} \\ &= \sum\_{i=1}^{n} \sum\_{j=1}^{k} \left\| \mathbf{W}^{T} \mathbf{x}\_{i} - \mathbf{W}^{T} \mathbf{x}\_{j} \right\|\_{2}^{2} \left( (1+\beta) s\_{ij} - \beta \right) \\ &= \sum\_{i,j=1}^{n} \left\| \mathbf{W}^{T} \mathbf{x}\_{i} - \mathbf{W}^{T} \mathbf{x}\_{j} \right\|\_{2}^{2} z\_{ij}, \end{aligned} \tag{32}$$

where the elements *zij* are defined as follows:

$$z\_{ij} = \begin{cases} (1+\beta) s\_{ij} - \beta, & j \le k, \\ 0, & j > k. \end{cases} \tag{33}$$

It is easy to see that Equation (32) is very similar to Equation (8). However, they express the intrinsic geometrical structure of the data quite differently. Without loss of generality, we set the weight elements *sij* in Equation (8) as a heat kernel function. Figure 2 shows how their weights change with the distance between two points *xi* and *xj*.

**Figure 2.** Difference between *sij* and *zij*.

As we know, real-world data are usually unbalanced and complex, so some points may be distributed in sparse areas while other data points are distributed in compact areas. As shown in Figure 2, *zij* is positive for data points in compact regions, thus Equation (32) maps these data points to be very close in the subspace, and mainly preserves the similarity of the data. If data points lie in sparse regions, *zij* is negative, and Equation (32) mainly characterizes the diversity of the data in this case, i.e., the shape of the manifold structure. However, Equation (8) does not consider the differences among points within a neighborhood and always projects neighboring points to be close in the subspace, which ignores the intrinsic geometrical structure of the data.
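The sign behavior of *zij* can be illustrated numerically. A small sketch, assuming a heat-kernel similarity and *zij* = *sij* + *β*(*sij* − 1) as in Equation (33) (the values of *σ* and *β* here are illustrative, not from the paper):

```python
import math

def heat_kernel(dist, sigma=1.0):
    """Heat-kernel similarity s_ij between two points at the given distance."""
    return math.exp(-dist**2 / (2 * sigma**2))

def z_weight(dist, beta=0.5, sigma=1.0):
    """z_ij = s_ij + beta * (s_ij - 1): positive in compact regions
    (small distance, s_ij near 1) and negative in sparse regions
    (large distance, s_ij near 0), matching Figure 2."""
    s = heat_kernel(dist, sigma)
    return s + beta * (s - 1.0)
```

For nearby points `z_weight` stays positive (similarity is preserved), while for distant neighbors it turns negative, so the objective pushes them apart and characterizes diversity instead.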

It is noteworthy that the proposed algorithm involves three updating rules, all of which are computationally efficient. In fact, the convergence of this alternating optimization scheme has already been proven in [29]. In our algorithm, the main cost of each iteration lies in the eigen-decomposition steps for Equations (7) and (21). The computational complexity of the proposed method is *O*((*d*<sup>2</sup>*m* + *n*<sup>2</sup>*c*)*t*), where *t* is the number of iterations.

#### *5.2. Convergence Study*

Algorithm 1 finds a locally optimal solution of problem (14). Its convergence is established by Theorem 2.

**Theorem 2.** *The alternate updating rules in Algorithm 1 monotonically decrease the objective function value of optimization problem (14) in each iteration until convergence.*

**Proof.** In the iterative procedure, we obtain the globally optimal projection matrix *W*<sub>*t*+1</sub> by solving the optimization problem
$$W_{t+1}=\arg\min_{W^TW=I}\frac{\operatorname{tr}\left(W^TX(L_S-\beta L_V)X^TW\right)}{\operatorname{tr}\left(W^TXHX^TW\right)}.$$
As a result, we have the following inequality:

$$\frac{\operatorname{tr}\left(W_{t+1}^{T}X(L_S-\beta L_V)X^{T}W_{t+1}\right)}{\operatorname{tr}\left(W_{t+1}^{T}XHX^{T}W_{t+1}\right)}\leq\frac{\operatorname{tr}\left(W_{t}^{T}X(L_S-\beta L_V)X^{T}W_{t}\right)}{\operatorname{tr}\left(W_{t}^{T}XHX^{T}W_{t}\right)}.\tag{34}$$

Since the variable *F*<sub>*t*+1</sub> is updated by solving the problem
$$F_{t+1}=\arg\min_{F^TF=I}2\lambda\operatorname{tr}\left(F^TLF\right),$$
we obtain the following inequality:

$$\operatorname{tr}\left(F_{t+1}^T L F_{t+1}\right)\le\operatorname{tr}\left(F_t^T L F_t\right).\tag{35}$$

Consequently, we have the following inequality:

$$\frac{\operatorname{tr}\left(W_{t+1}^{T}X(L_S-\beta L_V)X^{T}W_{t+1}\right)}{\operatorname{tr}\left(W_{t+1}^{T}XHX^{T}W_{t+1}\right)}+\operatorname{tr}\left(F_{t+1}^{T}LF_{t+1}\right)\leq\frac{\operatorname{tr}\left(W_{t}^{T}X(L_S-\beta L_V)X^{T}W_{t}\right)}{\operatorname{tr}\left(W_{t}^{T}XHX^{T}W_{t}\right)}+\operatorname{tr}\left(F_{t}^{T}LF_{t}\right).\tag{36}$$

In addition, the *K.K.T.* conditions (26) illustrate that the converged solution of Algorithm 1 is at least a stationary point of Equation (25), because the update of the weight matrix *S*<sub>*t*+1</sub> ∈ R<sup>*n*×*n*</sup> can be divided into *n* independent sub-optimization problems, each with respect to an *n*-dimensional vector. Consequently, the objective function value of optimization problem (14) decreases monotonically in each iteration until the algorithm converges.
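The *F*-update behind inequality (35) can be checked numerically: by the Ky Fan theorem, the *c* eigenvectors of *L* with the smallest eigenvalues minimize tr(*F*<sup>T</sup>*LF*) over all orthonormal *F*. A sketch (the function name is ours):

```python
import numpy as np

def f_update(L, c):
    """F-update: the c eigenvectors of the symmetric matrix L with the
    smallest eigenvalues minimize tr(F^T L F) over orthonormal F
    (Ky Fan theorem). np.linalg.eigh returns eigenvalues in
    ascending order, so the first c columns are the minimizer."""
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :c]
```

Comparing the resulting trace against that of any other orthonormal basis (e.g., a random one obtained via QR) confirms the monotone decrease used in the proof.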

### **6. Experiment**

In the experiment, the following two metrics are used to evaluate the performance of the proposed LSDUDR algorithm: Accuracy (ACC) and Normalized Mutual Information (NMI) [30]. Accuracy is defined as

$$\text{ACC}=\frac{\sum_{i=1}^{n}\delta\left(t_i,\operatorname{map}\left(t_i^{g}\right)\right)}{n},\tag{37}$$

where *t<sub>i</sub>* is the clustering label of *x<sub>i</sub>* and *t<sub>i</sub><sup>g</sup>* is its known ground-truth label. map(·) is the optimal mapping function that permutes the label set of the clustering results to match the known label set of the samples, and *δ*(*a*, *b*) is an indicator function that equals 1 if *a* = *b* and 0 otherwise. Normalized Mutual Information is defined as

$$\text{NMI}=\frac{\sum_{i,j=1}^{c}t_{ij}\log\frac{n\times t_{ij}}{t_i\hat{t}_j}}{\sqrt{\left(\sum_{i=1}^{c}t_i\log\frac{t_i}{n}\right)\left(\sum_{j=1}^{c}\hat{t}_j\log\frac{\hat{t}_j}{n}\right)}},\tag{38}$$

where *t<sub>i</sub>* is the number of samples in the *i*-th cluster *C<sub>i</sub>* according to the clustering results and *t̂<sub>j</sub>* is the number of samples in the *j*-th ground-truth class *G<sub>j</sub>*. *t<sub>ij</sub>* is the number of samples shared by *C<sub>i</sub>* and *G<sub>j</sub>*.
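Both metrics can be sketched in plain Python. The function names are ours; the permutation search in `clustering_accuracy` is a brute-force stand-in for the Hungarian matching usually used to realize map(·), and is feasible only for a small number of clusters:

```python
from itertools import permutations
from math import log, sqrt
from collections import Counter

def clustering_accuracy(labels_true, labels_pred):
    """ACC, Equation (37): search over permutations of the label set for
    the mapping that best aligns predicted labels with the ground truth,
    then return the fraction of matched samples."""
    classes = sorted(set(labels_true) | set(labels_pred))
    best = 0.0
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))
        hits = sum(mapping[p] == t for p, t in zip(labels_pred, labels_true))
        best = max(best, hits / len(labels_true))
    return best

def nmi(labels_true, labels_pred):
    """NMI, Equation (38), in the equivalent form I(C, G) / sqrt(H(C) H(G))."""
    n = len(labels_true)
    t_i = Counter(labels_pred)                     # cluster sizes t_i
    t_j = Counter(labels_true)                     # class sizes   t^_j
    t_ij = Counter(zip(labels_pred, labels_true))  # overlaps      t_ij
    mi = sum(o / n * log(n * o / (t_i[i] * t_j[j]))
             for (i, j), o in t_ij.items())
    h_c = -sum(t / n * log(t / n) for t in t_i.values())
    h_g = -sum(t / n * log(t / n) for t in t_j.values())
    return mi / sqrt(h_c * h_g)
```

Both scores are invariant to relabeling of the clusters, which is why a perfect clustering with permuted labels still scores 1.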

We compare the performance of LSDUDR with the *K*-Means [31], Ratio Cut [32], Normalized Cut [33] and PCAN methods, since they are closely related to LSDUDR, i.e., they use the information contained in the eigenvectors of an affinity matrix to detect similarity. We made comparisons with Ratio Cut and Normalized Cut to show that LSDUDR can effectively mitigate the influence of outliers by inducing robustness and adaptive neighbors. To emphasize the importance of describing the intrinsic manifold structure, we compared PCAN with LSDUDR, which uncovers the intrinsic topological structure of the data through the two proposed objective functions and performs discriminatively embedded *K*-Means clustering.

#### *6.1. Experiment on the Synthetic Data Sets*

To verify the robust performance and strong discriminating power of the proposed LSDUDR, two simple synthetic examples (two-Gaussian and multi-cluster data) are considered in this experiment.

In the first synthetic data set, we deliberately placed a point far from the two-Gaussian distribution as an outlier, and sought a one-dimensional linear manifold representation that clearly divides the two clusters. LSDUDR and PCAN were applied to this synthetic example and the results are shown in Figure 3. It is clear that, in the one-dimensional representation obtained by PCAN, the cluster shown in pink is almost submerged in the blue one, while LSDUDR separates it distinctly, so we can conclude that LSDUDR has more discriminating power than PCAN. Furthermore, LSDUDR is less sensitive to outliers than PCAN because the objective function of LSDUDR heavily penalizes two points that are embedded close together in the subspace despite a large distance in the original space.

**Figure 3.** (**a**) two-Gaussian synthetic data projection results; (**b**) one-dimensional representation obtained by Projected clustering with adaptive neighbors. (PCAN); (**c**) one-dimensional representation obtained by Locality Sensitive Discriminative Unsupervised Dimensionality Reduction. (LSDUDR).

The second synthetic data set is a multi-cluster data set, which contains 196 randomly generated clusters distributed in a spherical manner. We compared LSDUDR with *K*-means and PCAN. Since *K*-means is sensitive to initialization [34], we repeatedly ran *K*-means 100 times and used the minimal *K*-means objective value as the result. To be fair, the parameters of PCAN were adjusted to report its best performance. As for LSDUDR, we ran it once to generate a clustering result, used that result as the initialization for *K*-means, and reported the best performance. Table 1 and Figure 4 show the experimental results of LSDUDR and the other two algorithms on the multi-cluster data. As can be seen from Table 1, LSDUDR obtained better performance than the other methods according to the minimal *K*-means objective value and clustering accuracy. Thus, LSDUDR has stronger discriminating power than PCAN and *K*-means, especially when the data distribution is complex.
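The best-of-many-restarts protocol used for *K*-means above can be sketched as follows (a plain NumPy implementation with Lloyd iterations; the function name and parameter defaults are illustrative):

```python
import numpy as np

def kmeans_best_of(X, c, restarts=100, iters=50, seed=0):
    """Run K-means with `restarts` random initializations and keep the
    run with the minimal objective (sum of squared distances to the
    assigned centers), as done for the multi-cluster experiment."""
    rng = np.random.default_rng(seed)
    best_obj, best_labels = np.inf, None
    for _ in range(restarts):
        # initialize centers with c distinct samples
        centers = X[rng.choice(len(X), c, replace=False)]
        for _ in range(iters):
            # assign each point to its nearest center
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)
            # recompute non-empty cluster centers
            for j in range(c):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(0)
        obj = ((X - centers[labels]) ** 2).sum()
        if obj < best_obj:
            best_obj, best_labels = obj, labels
    return best_labels, best_obj
```

On well-separated clusters, most restarts converge to the same zero-objective partition, which is exactly why the minimal objective over restarts is a reasonable reporting protocol.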

**Table 1.** Compare results on multi-cluster synthetic data sets.


**Figure 4.** Clustering results of three algorithms.

#### *6.2. Experiment on Low-Dimensional Benchmark Data Sets*

In this subsection, we evaluate the performance of the proposed LSDUDR on ten low-dimensional benchmark data sets in comparison with four related methods: *K*-Means, Ratio Cut, Normalized Cut and PCAN. Descriptions of these data sets, comprising four synthetic data sets and six University of California, Irvine (UCI) data sets [35], are summarized in Table 2. For low-dimensional data, we set the projection dimension in PCAN and LSDUDR to *c* − 1. For the methods that require a fixed input data graph, we use the self-tuning Gaussian method [34,36] to build the graph. For the methods involving *K*-Means to extract the clustering labels, we repeatedly ran *K*-Means 100 times with the same settings and chose the best performance. As for PCAN and LSDUDR, we ran each only once and reported the result directly from the learned graph. The experimental results are shown in Tables 3 and 4.
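The self-tuning Gaussian construction [34,36] assigns each sample a local bandwidth before building the affinity graph. A sketch of the idea (the function name and the choice *k* = 3 in the usage below are ours):

```python
import numpy as np

def self_tuning_affinity(X, k=7):
    """Self-tuning Gaussian graph: each sample i gets a local scale
    sigma_i equal to the distance to its k-th nearest neighbor, and
    w_ij = exp(-d_ij^2 / (sigma_i * sigma_j))."""
    # pairwise Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # column 0 of the sorted row is the self-distance 0, so index k
    # picks the distance to the k-th nearest neighbor
    sigma = np.sort(D, axis=1)[:, k]
    W = np.exp(-D**2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(W, 0.0)
    return W
```

Because the bandwidth adapts per sample, dense and sparse regions both receive sensible edge weights, which is why this construction is a common default for fixed-graph spectral methods.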


**Table 2.** Specifications of the data sets.

**Table 3.** ACC(%) on low-dimensional benchmark data sets.


**Table 4.** NMI(%) on low-dimensional benchmark data sets.


In this experiment, we can observe that PCAN and LSDUDR perform much better than the fixed graph-based methods. This observation confirms that separating graph construction from dimensionality reduction means the similarity matrix cannot be fully relied upon, and the experimental results deteriorate seriously. In addition, LSDUDR outperforms the other methods on nine data sets on account of preserving the locality structure among the data.

#### *6.3. Embedding of Noise 3D Manifold Benchmarks*

To confirm the ability of LSDUDR to robustly characterize manifold structure, we use three typical 3D manifold benchmark data sets [37], i.e., Gaussian, Toroidal Helix and Swiss Roll. In this experiment, we map these 3D manifold benchmarks to 2D in order to find a low-dimensional embedding that retains the most manifold structure information. The experimental results are shown in Figure 5.

**Figure 5.** Projection results on 3D manifold benchmarks by the PCAN and LSDUDR methods.

Under the same conditions, the PCAN method is also tested for comparison. Figure 5 shows the 2D embedding results of PCAN and LSDUDR, where each row corresponds to one manifold benchmark. It is obvious that PCAN did not find a suitable projection direction. This is because PCAN only considers the similarity between data points, which is not enough to characterize the intrinsic structure of the data and can even destroy the manifold structure. However, LSDUDR considers both the similarity and the diversity of the data set, and thus is highly sensitive to the local topology of the data.

#### *6.4. Experiment on the Image Data Sets*

#### 6.4.1. Visualization for Handwritten Digits

To further test the low-dimensional embedding applicability of the proposed LSDUDR algorithm, another experiment is carried out on the Binary Alphadigits data set [38], which comprises binary images of digits "0" to "9" and capital letters "A" to "Z", as shown in Figure 6. We select four letters ("C", "P", "X", "Z") and four digits ("0", "3", "6", "9") from this data set. The embedding results are drawn in Figure 7.

**Figure 6.** Some image samples of the handwritten digits.

**Figure 7.** Experiment on the Alphadigits data set.

It can be seen from Figure 7a,b that there are overlaps among the clusters of letters "C", "P" and "Z" and of digits "0" and "6" when we use PCA and LPP. Even worse results are obtained by PCAN: Figure 7c shows that the points of almost all clusters are tangled together. However, for LSDUDR, the results in Figure 7d show that the classes are separated clearly, which reflects that diversity plays an important role in representing the intrinsic structure of data.

#### 6.4.2. Face Benchmark Data Sets

We use four image benchmark data sets in this section for experiments on projection, since these data typically have high dimensionality. We summarize the four face image benchmark data sets in Table 5. To study the data-adaptiveness and noise-robustness of the proposed LSDUDR algorithm, we use a range of data sets contaminated by different kinds of noise based on the face data sets, as shown in Figure 8. Similar to the above-mentioned experiment, three algorithms, including PCAN, PCA and LPP, are used for comparison.


**Table 5.** The description of the face image benchmark data sets.

(**a**) Original image (**b**) Image with Gaussian noise

(**c**) Image with multiplicative noise (**d**) Image with salt-and-pepper noise

**Figure 8.** Some image samples of the data sets with different kinds of noise.

The experimental results on the face benchmark data sets are shown in Figure 9, from which we draw a convincing observation: the results obtained by adaptive graph learning algorithms are usually better, especially as the dimensionality of the projection space increases. This is because adaptive graph learning algorithms can use the embedding information obtained in the previous step to update the similarity matrix, so the dimensionality reduction results are more accurate. In addition, we observe that PCA and LPP are more sensitive to the dimensionality of the embedded space, while the curve of LSDUDR remains basically stable as the dimensionality changes. Furthermore, LSDUDR is capable of projecting the data into a subspace with a relatively small dimension *c* − 1; such a low-dimensional subspace obtained by our method can be even better than the subspaces obtained by PCA and LPP with higher dimensionality. This indicates that LSDUDR takes local topology and geometrical properties into account through the similarity and diversity of the data, and thus performs better and achieves higher accuracy than PCAN when the images retain sufficient spatial information.


**Figure 9.** Projection results on face image benchmark data sets with different kinds of noise.

#### **7. Conclusions**

In this paper, a novel adaptive graph learning method (LSDUDR) is proposed from a new perspective by integrating a similarity graph and a diversity graph to learn a discriminative subspace in which the data can be easily separated. Meanwhile, LSDUDR performs dimensionality reduction and local structure learning simultaneously based on a high-quality Laplacian matrix. Different from previous graph-based models, LSDUDR constructs two adjacency graphs that represent the intrinsic structure of the data well by learning its local sensitivity. Furthermore, LSDUDR does not require other clustering methods to obtain cluster indicators; instead, it extracts label information from the similarity graph or diversity graph, which is adaptively updated in a reconstruction manner. We also discuss the convergence of the proposed algorithm as well as the choice of the trade-off parameters. Experimental results on synthetic data, face image databases and several benchmark data sets illustrate the effectiveness and superiority of the proposed method.

In this paper, we focused on constructing two adjacency graphs to represent the original structure through data similarity and diversity. Our method can be used to remove irrelevant and correlated features involved in high-dimensional feature spaces and to represent the data in lower-dimensional subspaces [39]. In future work, it would be interesting to extend the proposed method to unsupervised feature selection in multiview and multitask settings.

**Author Contributions:** Y-L.G. and S-Z.L. conceived and designed the experiments; S-Z.L. performed the experiments; C-C.C. and Z-H.W. analyzed the data; J-Y.P. contributed analysis tools.

**Funding:** This research was funded by [National Natural Science Foundation of China] grant number [61203176] and [Fujian Provincial Natural Science Foundation] grant number [2013J05098, 2016J01756].

**Conflicts of Interest:** The authors declare no conflicts of interest.

### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
