Article

Adaptive Weighted Multiview Kernel Matrix Factorization and Its Application in Alzheimer’s Disease Analysis

Computer Science Division, School of Computing, College of Engineering, Computing and Applied Sciences, Main Campus, Clemson University, Clemson, SC 29631, USA
* Author to whom correspondence should be addressed.
Analytics 2024, 3(4), 439-448; https://doi.org/10.3390/analytics3040024
Submission received: 17 July 2024 / Revised: 24 October 2024 / Accepted: 28 October 2024 / Published: 4 November 2024

Abstract

Recent advances in technology and equipment provide opportunities to analyze Alzheimer's disease (AD) more thoroughly: data collected from different imaging and genetic modalities can potentially enhance predictive performance. To improve clustering in AD analysis, we propose in this paper a novel model that leverages data from all modalities/views and learns the weight of each view adaptively. Unlike vanilla non-negative matrix factorization, which assumes the data are linearly separable, our simple yet efficient method is based on kernel matrix factorization, which not only handles non-linear data structure but also achieves better prediction accuracy. Experimental results on the ADNI dataset demonstrate the effectiveness of the proposed method and indicate promising prospects for kernel methods in AD analysis.

1. Introduction

Alzheimer’s disease (AD) is a chronic neurodegenerative disease affecting the parts of the brain that control memory, thought, and language. It most often occurs in the elderly, beginning with mild memory loss that can worsen dramatically and lead to further loss of cognitive function. The disease progression is commonly divided into three diagnostic groups: AD, mild cognitive impairment (MCI), and healthy control (HC).
With the development of imaging genetics, exploring genome-wide array data and multimodal brain imaging data may help researchers deepen the understanding of AD, facilitate accurate early detection and diagnosis, and improve treatment. For example, Nazarian et al. [1] found a significant association between certain single-nucleotide polymorphisms (SNPs) and AD. Magnetic resonance imaging (MRI) is an important medical neuroimaging technique used to acquire regional imaging biomarkers, such as voxel-based morphometry (VBM) [2] features, to investigate focal structural abnormalities in grey matter (GM). Fluorodeoxyglucose positron emission tomography (FDG-PET) [3] can distinguish AD from other causes of dementia using the characteristic glucose metabolism patterns. Therefore, multi-view analyses have attracted more attention in recent years, and among these methods, non-negative matrix factorization (NMF)-based approaches are promising and have shown strong interpretability [4,5,6,7].
However, existing multi-view analysis methods treat each view as equally important, which contradicts the fact that some views may play more critical roles than others. Adaptively learning the weights of different views is therefore an important open problem. Moreover, the success of vanilla NMF relies heavily on the assumption that the data lie in a well-separable space with a linear structure; in practice its accuracy is rather low, which motivates a new method that can handle non-linear data structure.
To address the aforementioned problems, in this paper, we propose a new method with an efficient updating algorithm. Our contribution is twofold: first, the new framework can learn each view’s weight and therefore conduct view selection; second, by introducing kernel mapping, it can handle both linear and non-linear cases. Extensive experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate its effectiveness with higher clustering accuracy, which sheds light on the prospect of applying kernel methods to other AD analysis tasks, such as regression and classification.

2. Related Works

The framework in this paper originates from NMF, which is widely used in clustering [8,9] as it can learn features and membership indicators in a very natural and interpretable way. Thus, we first provide a brief overview of it, along with its extensions.

2.1. Single-View NMF with Graph Regularization

Assume we have $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ as input data, and the task is to cluster $X$ into $k$ clusters. Vanilla NMF assumes that each data point $x$ can be represented as a linear combination of features $f$ (columns of $F$): $x_j \approx \sum_i f_i G_{ij} = F g_j$. Therefore, the objective function [10] can be formulated as follows:
$$\min_{F, G} \; \|X - FG\|_F^2 + \theta \, \mathrm{tr}\!\left(G L G^T\right) \quad \text{s.t.} \; F, G \geq 0, \tag{1}$$
where $\|\cdot\|_F^2$ denotes the squared Frobenius norm, $\|Z\|_F^2 = \sum_{ij} z_{ij}^2$, and the second term, ‘graph regularization’ [11], was introduced to promote clustering accuracy. Each column of $F \in \mathbb{R}^{m \times k}$ represents a centroid of the $k$ clusters, while each column of $G \in \mathbb{R}^{k \times n}$ can be taken as the probabilities of data point $x$ belonging to each cluster. $L := D - W$ denotes the Laplacian matrix, where $D$ is a diagonal matrix whose entries are the column sums of $W$, $D_{ii} = \sum_j W_{ij}$, with $W \in \mathbb{R}^{n \times n}$ being the similarity matrix of the data samples. In our paper, we set $W_{ij}^{(a)} = \exp\!\left(-\frac{\|x_i^{(a)} - x_j^{(a)}\|^2}{2\sigma^2}\right)$, where $(a)$ denotes the $a$-th view.
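For illustration, the following minimal Python sketch (our own illustrative code, not the authors’ implementation; it assumes the paper’s convention of one sample per column) builds the Gaussian similarity matrix $W$ and the Laplacian $L = D - W$ used in the regularizer:

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Build the Gaussian similarity matrix W and Laplacian L = D - W.

    X: (m, n) data matrix with one sample per column, as in the paper.
    sigma: bandwidth of the Gaussian similarity (a hyper-parameter).
    """
    # Pairwise squared Euclidean distances between columns of X.
    sq_norms = np.sum(X ** 2, axis=0)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    W = np.exp(-np.maximum(dists, 0.0) / (2.0 * sigma ** 2))
    D = np.diag(W.sum(axis=1))  # degree matrix: D_ii = sum_j W_ij
    return W, D - W
```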

2.2. Multi-View Clustering via NMF

Assume we have $v$ different views of the input data, $\{X^{(1)}, X^{(2)}, \ldots, X^{(v)}\}$, where $X^{(i)}$ denotes the $i$-th view. Each view can be factorized as $X^{(i)} = F^{(i)} G^{(i)}$, where $F^{(i)} \in \mathbb{R}^{m \times k}$ contains the centroids of the $k$ clusters and $G^{(i)} \in \mathbb{R}^{k \times n}$ is the corresponding coefficient (probability) matrix for each view [4]. Figure 1 provides an example of how multi-view data can be factorized using NMF.
As the clustering results may vary across views, a consensus clustering indicator matrix $G^*$ is introduced to promote consistency. Moreover, as in single-view NMF, a graph regularization term is added for each view, and the objective [12] is as follows:
$$\min_{F^{(a)}, G^{(a)}, G^*} \; \sum_{a=1}^{v} \Big\{ \|X^{(a)} - F^{(a)} G^{(a)}\|_F^2 + \lambda_a \|G^{(a)} - G^*\|_F^2 + \theta_a \, \mathrm{tr}\!\left(G^{(a)} L^{(a)} (G^{(a)})^T\right) \Big\} \quad \text{s.t.} \; F^{(a)}, G^{(a)}, G^* \geq 0, \tag{2}$$
where $\lambda, \theta$ are regularization parameters that need tuning, $G^{(a)}$ indicates the clustering result from view $a$, and the final consensus clustering is obtained from $G^*$.

3. Our Methodology

We applied the multi-view NMF clustering presented above to the Alzheimer’s disease dataset and found that the clustering performance was unsatisfactory; similar results were reported in [5]. Therefore, in this section, we propose a novel framework that can not only deal with non-linearly separable cases but also learn the different weights of each view [13].

3.1. Adaptive Weighted Multi-View NMF

Equation (2) utilizes the data from all views with a consensus $G^*$ and treats every view as equally important. In practice, however, different views may deserve different weights, which inspired us to propose a formulation that learns the weights accordingly, as follows:
$$\min_{\beta_a, F^{(a)}, G^{(a)}, G^*} \; \sum_{a=1}^{v} \beta_a^{\gamma} \Big\{ \|X^{(a)} - F^{(a)} G^{(a)}\|_F^2 + \lambda_a \|G^{(a)} - G^*\|_F^2 + \theta_a \, \mathrm{tr}\!\left(G^{(a)} L^{(a)} (G^{(a)})^T\right) \Big\} \quad \text{s.t.} \; F^{(a)}, G^{(a)}, G^* \geq 0, \; \sum_{a=1}^{v} \beta_a = 1, \tag{3}$$
where $\gamma$ is a hyper-parameter denoting the order of $\beta$, an integer by default. One can see that when $\gamma = 0$, Equation (3) degenerates into Equation (2), where every view has the same weight. In Section 4, we will further discuss its impact on modality selection.

3.2. Adaptive Weighted Kernel Multi-View NMF

Most traditional NMF and existing multi-view methods conduct clustering/classification directly on the original data, meaning that their accuracy relies heavily on the assumption that the data are linearly separable in the original space. However, through extensive experiments using either single-view or multi-view analyses, we found that the accuracy is rather low, as reported in Section 6.3. Thus, we propose a kernel version of the multi-view framework to overcome the difficulty of handling non-linearity via NMF. The basic idea is that, instead of clustering in the original space, we first map the data into a higher-dimensional space via $\Phi: \mathbb{R}^m \to \mathbb{R}^d$ (allowing for an infinite-dimensional space, namely $d = \infty$, as with the Gaussian kernel), and then perform the clustering in the mapped space [14,15,16,17].
To make use of the kernel trick, following the idea of K-means, where each centroid can be represented as a linear combination of data points, we set $f^{(a)} = \Phi(X^{(a)}) p$ and accordingly obtain $F^{(a)} = \Phi(X^{(a)}) P$. Also, following the idea of semi-NMF, we remove the non-negative constraint on $P$ so that the learned features become more flexible and are not necessarily non-negative. Finally, we formulate our proposed kernel adaptive multi-view objective as follows:
$$\min_{\beta_a, P^{(a)}, G^{(a)}, G^*} \; \sum_{a=1}^{v} \beta_a^{\gamma} \Big\{ \|\Phi(X^{(a)}) - \Phi(X^{(a)}) P^{(a)} G^{(a)}\|_F^2 + \lambda_a \|G^{(a)} - G^*\|_F^2 + \theta_a \, \mathrm{tr}\!\left(G^{(a)} L^{(a)} (G^{(a)})^T\right) \Big\} \quad \text{s.t.} \; G^{(a)}, G^* \geq 0, \; \sum_{a=1}^{v} \beta_a = 1. \tag{4}$$
Equation (4) not only handles non-linear data across multiple views but also adaptively learns the weight of each view.

4. Optimization

Given the variables to be optimized, we propose an alternating minimization method to iteratively optimize the solution to Equation (4).
Optimizing $P$ in each view: As there is no constraint on $P$, we can simply take the derivative, set it to 0, and obtain $P G G^T = G^T$ (see Appendix A.1); then, we have the following:

$$P = G^T (G G^T)^{-1}. \tag{5}$$
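For concreteness, a minimal numpy sketch of this closed-form step follows (helper names are illustrative; the small ridge term is our own numerical-stability addition, not part of the paper):

```python
import numpy as np

def update_P(G, eps=1e-10):
    """Closed-form P = G^T (G G^T)^{-1}, per Equation (5).

    G: (k, n) membership matrix of the current view.
    A small ridge eps is added for numerical stability when G G^T
    is near-singular (an implementation choice, not from the paper).
    """
    k = G.shape[0]
    # solve (G G^T) X = G, then transpose: X^T = G^T (G G^T)^{-1}
    return np.linalg.solve(G @ G.T + eps * np.eye(k), G).T
```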
Optimizing $G$ in each view: Since $P$ may have mixed signs, the multiplicative updating algorithm cannot guarantee the non-negativity of $G$. Therefore, we adopt a projected gradient descent method that makes the objective (4) monotonically decrease with the update:

$$G^+ = \max\Big\{ G - \tfrac{1}{L_J} \nabla_G J, \; 0 \Big\}, \tag{6}$$

where $L_J$ is the Lipschitz constant of the gradient and $J$ denotes the objective in Equation (4). By definition, we have $L_J = 2\left[\sigma_{\max}(P^T K P) + \lambda + \theta \sigma_{\max}(L)\right]$ and $\nabla_G J = 2\left(P^T K P G - P^T K + \lambda (G - G^*) + \theta G L\right)$, where $K(i,j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. When a different kernel is chosen, $\langle \Phi(x_i), \Phi(x_j) \rangle$ changes but is always computationally economical. For example, for a linear kernel, $\langle \Phi(x_i), \Phi(x_j) \rangle = x_i^T x_j$; for a polynomial kernel, $\langle \Phi(x_i), \Phi(x_j) \rangle = (x_i^T x_j + c)^d$; and for a Gaussian kernel, $\langle \Phi(x_i), \Phi(x_j) \rangle = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, where $c$, $d$, and $\sigma$ are hyper-parameters.
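A sketch of this projected gradient step, together with the three kernels above, might look as follows (Python/numpy; function names are illustrative and per-view superscripts are dropped):

```python
import numpy as np

def linear_kernel(X):
    return X.T @ X

def poly_kernel(X, c=1.0, d=2):
    return (X.T @ X + c) ** d

def gaussian_kernel(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=0)
    return np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / (2.0 * sigma ** 2))

def update_G(G, P, K, Gstar, L, lam, theta):
    """One projected gradient step on G, per Equation (6)."""
    M = P.T @ K @ P
    # Lipschitz constant L_J = 2 [sigma_max(P^T K P) + lambda + theta * sigma_max(L)]
    L_J = 2.0 * (np.linalg.norm(M, 2) + lam + theta * np.linalg.norm(L, 2))
    grad = 2.0 * (M @ G - P.T @ K + lam * (G - Gstar) + theta * G @ L)
    return np.maximum(G - grad / L_J, 0.0)  # project onto the nonnegative orthant
```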
Optimizing $G^*$: By taking the derivative with respect to $G^*$ and setting it to 0, we have:

$$G^* = \frac{\sum_a \beta_a^{\gamma} \lambda_a G^{(a)}}{\sum_a \beta_a^{\gamma} \lambda_a}, \tag{7}$$
which automatically satisfies the non-negative constraint.
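A direct transcription of Equation (7) (illustrative names; `Gs` is the list of per-view $G^{(a)}$ matrices and `lam` a vector of the $\lambda_a$):

```python
import numpy as np

def update_Gstar(Gs, beta, lam, gamma):
    """Consensus update: G* = sum_a beta_a^gamma lam_a G^(a) / sum_a beta_a^gamma lam_a."""
    w = beta ** gamma * lam                         # per-view weights
    num = sum(w_a * G_a for w_a, G_a in zip(w, Gs))
    return num / w.sum()
```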
Optimizing $\beta$: For simplicity, we denote the loss in each view as $q_a$. With the other variables fixed, we optimize $\beta$ using a Lagrange multiplier:

$$\mathcal{L}(\beta, t) = \sum_{a=1}^{v} \beta_a^{\gamma} q_a - \gamma t \left( \sum_{a=1}^{v} \beta_a - 1 \right).$$
By taking the derivative w.r.t. $\beta$ and $t$, with simple algebra we obtain the optimal solution:

$$\beta_a = \frac{q_a^{\frac{1}{1-\gamma}}}{\sum_a q_a^{\frac{1}{1-\gamma}}}. \tag{8}$$
Apparently, $\gamma$ is a key factor in learning the adaptive weight of each view. Specifically, as $\gamma \to 1^+$, only the view with the smallest loss is selected, which can be verified from the solution for $\beta$. For the proof that the objective decreases with these updates, readers can refer to Appendix A. The variables are updated alternately until the objective function converges. The whole process of our method is summarized in Algorithm 1.
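The closed-form weight update of Equation (8) is equally short; the toy check below (hypothetical losses `q`, our own example) illustrates the behavior of $\gamma$ discussed above:

```python
import numpy as np

def update_beta(q, gamma):
    """beta_a = q_a^{1/(1-gamma)} / sum_a q_a^{1/(1-gamma)}, valid for gamma > 1."""
    w = q ** (1.0 / (1.0 - gamma))
    return w / w.sum()

q = np.array([1.0, 2.0, 4.0])       # hypothetical per-view losses
print(update_beta(q, gamma=1.01))    # ~[1, 0, 0]: as gamma -> 1+, the best view dominates
print(update_beta(q, gamma=100.0))   # ~[1/3, 1/3, 1/3]: large gamma spreads weight evenly
```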
Algorithm 1 Adaptive weighted kernel multi-view NMF
Input: Multi-view input data $\{X^{(1)}, X^{(2)}, \ldots, X^{(v)}\}$.
Initialization: Feature matrices $\{P^{(1)}, P^{(2)}, \ldots, P^{(v)}\}$; membership matrices $\{G^{(1)}, G^{(2)}, \ldots, G^{(v)}\}$; consensus matrix $G^*$; $\beta$ (s.t. $\sum_{a=1}^{v} \beta_a = 1$); $\lambda$ and $\theta$.
Output: $G^*$
1: Calculate the Laplacian matrix $L^{(a)} = D^{(a)} - W^{(a)}$.
2: repeat
3:   For each view $a$, update $P^{(a)}$ as in Equation (5);
4:   For each view $a$, update $G^{(a)}$ as in Equation (6);
5:   Update $G^*$ as in Equation (7);
6:   Update $\beta$ as in Equation (8).
7: until convergence
8: return $G^*$
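For readers who prefer code, the sketch below assembles the updates into a driver loop mirroring Algorithm 1. It reuses the illustrative helpers from the earlier snippets (`graph_laplacian`, `update_P`, `update_G`, `update_Gstar`, `update_beta`, `linear_kernel`); sharing scalar $\lambda$, $\theta$ across views and stopping on the relative change of the objective are our own simplifications:

```python
import numpy as np

def view_loss(K, P, G, Gstar, L, lam, theta):
    """Per-view loss q_a, evaluated via the kernel trick (cf. Equation (A2))."""
    PG = P @ G
    rec = np.trace(K) - 2.0 * np.trace(K @ PG) + np.trace(PG.T @ K @ PG)
    return rec + lam * np.linalg.norm(G - Gstar) ** 2 + theta * np.trace(G @ L @ G.T)

def awkmvnmf(Xs, k, lam=1.0, theta=1.0, gamma=2, kernel=linear_kernel,
             n_iter=100, tol=1e-6):
    """Adaptive weighted kernel multi-view NMF (Algorithm 1) -- a sketch."""
    v, n = len(Xs), Xs[0].shape[1]
    rng = np.random.default_rng(0)
    Ks = [kernel(X) for X in Xs]                   # per-view kernel matrices
    Ls = [graph_laplacian(X)[1] for X in Xs]       # step 1: per-view Laplacians
    Gs = [rng.random((k, n)) for _ in range(v)]
    beta = np.full(v, 1.0 / v)
    Gstar = sum(Gs) / v
    prev = np.inf
    for _ in range(n_iter):                        # steps 2-7
        Ps = [update_P(G) for G in Gs]                               # Eq. (5)
        Gs = [update_G(G, P, K, Gstar, L, lam, theta)
              for G, P, K, L in zip(Gs, Ps, Ks, Ls)]                 # Eq. (6)
        Gstar = update_Gstar(Gs, beta, np.full(v, lam), gamma)       # Eq. (7)
        q = np.array([view_loss(K, P, G, Gstar, L, lam, theta)
                      for K, P, G, L in zip(Ks, Ps, Gs, Ls)])
        beta = update_beta(q, gamma)                                 # Eq. (8)
        obj = float(np.sum(beta ** gamma * q))
        if abs(prev - obj) < tol * max(abs(obj), 1.0):
            break
        prev = obj
    return Gstar, beta                             # step 8: return G*
```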

5. Complexity Analysis

The main cost of our method lies in calculating $L^{(a)}$, $P^{(a)}$, $G^{(a)}$, $G^*$, and $\beta$.
The complexity of computing $L^{(a)}$ is $O(n^2)$, which comprises the linear combination and the calculation of each view's diagonal matrix. For $P^{(a)}$, the complexity includes matrix multiplication and matrix inversion, which is $O(k^3 + nk^2)$. The complexity of obtaining each entry of a kernel matrix is $O(m)$, where $m$ is the dimension of the original space, regardless of the kernel used; when the kernel trick is applied to optimize $G^{(a)}$, the overall complexity for the entire matrix with $n^2$ entries becomes $O(mn^2)$. The key computational tasks in optimizing $G^{(a)}$ involve calculating the Lipschitz constant and the gradient, whose complexities are $O(k^3 + n^3 + kn^2 + nk^2 + mn^2)$ and $O(kn^2 + nk^2 + nk + mn^2)$, respectively. Thus, the complexity of updating $G^{(a)}$ is $O(\max\{k^3, n^3\} + mn^2)$. Given the update rule for $G^*$, its complexity is $O(vnk)$. For $\beta$, the computation primarily involves calculating the loss $q_a$ and summing the losses of all views. The complexity of computing the loss is $O(kn^2 + nk^2 + dnk)$, where $d$ is the dimension of the space mapped to by $\Phi$. Therefore, the overall complexity for $\beta$ is $O(vkn \max\{k, n, d\})$.

6. Experiments

In this section, we evaluate the performance of the proposed Algorithm 1 on the ADNI dataset [18]. Experiments were conducted in MATLAB R2021b (64-bit, Windows) on a desktop with an Intel i7-1165G7 CPU and 16 GB of memory.

6.1. ADNI Dataset

The data used to validate our method were obtained from the ADNI database with four views: VBM, FreeSurfer, FDG-PET, and SNPs, with feature dimensions of 86, 56, 26, and 1224, respectively. After removing all incomplete samples, we obtained 88 AD, 174 MCI, and 83 HC subjects in total, whose diagnostic labels were used to compute the clustering accuracy against $G^*$.

6.2. Baseline Methods

To demonstrate the advantage of our proposed method, we compare it with the following methods:
Single View (SV): Vanilla NMF with the objective in Equation (1); in our experiments, we ran the four views separately.
Feature Concatenation (CNMF): A simple and straightforward approach that concatenates the features from all views, $X = [X^{(1)}; \ldots; X^{(v)}]$, and runs the single-view method.
(Adaptive Weighted) Multi-view NMF (AW/MNMF): These methods use the original data without kernel mapping, as in Equations (2) and (3). They can be regarded as special cases of our proposed method with a linear kernel.

6.3. Experiment Setting and Result

We conducted a grid search with cross-validation to determine the two regularization parameters $\lambda$ and $\theta$, each of which was set to 1, with $\gamma = 2$. We experimented with different kernels: for the polynomial kernel, we set $c = 1$ and $d \in \{1, 2, 3\}$; for the Gaussian kernel, we set $\sigma \in \exp\{-2, -1, 0, \ldots, 10\}$. We found that, under these different settings, our proposed method converged in around 15 iterations, which is very efficient. Table 1 shows the clustering results (accuracy, NMI, Rand index, and Mirkin's index) of the different methods on the AD dataset. Our proposed method clearly outperformed its counterparts.
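As a side note on evaluation, clustering accuracy for such methods is typically computed by taking the column-wise argmax of $G^*$ as the predicted label and aligning clusters to classes with the Hungarian algorithm; a minimal sketch follows (assuming scipy; not necessarily the exact protocol used in the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, Gstar):
    """Accuracy after optimally matching cluster indices to class labels."""
    y_pred = np.argmax(Gstar, axis=0)            # hard assignment per sample
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):             # confusion counts
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)    # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)
```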
It is interesting to examine the features learned by our method (for the sake of interpretability, we chose the linear kernel). The magnitude of $F$ represents the importance of a given feature (SNP) to AD. We plotted a heatmap of $F$ sorted by magnitude, as shown in Figure 2; it is easy to see that almost 50% of the SNPs may not be important to AD. We checked the SNPs in the first half of this figure (which are possibly closely related) against existing clinical discoveries and found that many features highlighted by our prediction play an important role in AD; for example, rs3818361, rs10519262, and rs2333227 can be found in SNPedia (https://www.snpedia.com/index.php/Alzheimer%27s_disease (accessed on 16 July 2024)).
We also examined the features learned from the other views (VBM, FreeSurfer, FDG) in the same way as in the SNP analysis. We observed that hippocampal measures (LHippocampus, RHippocampus, LHippVol, and RHippVol) were identified, in accordance with the fact that, in the pathological pathway of AD, the medial temporal lobe, including the hippocampus, is affected first, followed by progressive neocortical damage. The thickness measures of the isthmus cingulate (LIsthmCing and RIsthmCing), frontal pole (LFrontalPole and RFrontalPole), and posterior cingulate gyrus (LPostCingulate and RPostCingulate) were also selected, which is again in accordance with the fact that GM atrophy in these regions is high in AD.

7. Limitations

In this paper, we addressed the multi-view clustering problem on a complete dataset. However, many real-world datasets contain missing or incomplete entries, leading to sparsity issues, and our experiments did not take this scenario into account. In future work, we will address the issue of incomplete data for this problem.

8. Conclusions

In this paper, we focused on solving the multi-view clustering problem. Considering that multiple views may contain more latent information than a single view, we proposed an adaptively weighted kernel multi-view NMF model, solved via alternating minimization, which can not only deal with non-linearly separable cases but also learn the different weights of each view. The experimental results illustrate the superiority of our method over its counterparts.

Author Contributions

Conceptualization, K.L. and Y.C.; methodology, K.L.; validation, Y.C.; formal analysis, Y.C.; resources, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, K.L.; supervision, K.L.; project administration, K.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Data for this experiment was collected from ADNI under authorized access, so ethical review and approval were waived for this study.

Informed Consent Statement

All data for this experiment was obtained from ADNI with authorized access, and therefore, patient consents were waived.

Data Availability Statement

The data used in this experiment was collected from ADNI (https://adni.loni.usc.edu/, accessed on 16 July 2024), and the supporting report is available from (https://www.snpedia.com/index.php/Alzheimer%27s_disease, accessed on 16 July 2024).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

We now provide the details of the derivations in Section 4.

Appendix A.1. Optimizing P in Each View

While fixing $\beta$, $G^*$, and $G^{(a)}$, minimizing:

$$\sum_{a=1}^{v} \beta_a^{\gamma} \Big\{ \|\Phi(X^{(a)}) - \Phi(X^{(a)}) P^{(a)} G^{(a)}\|_F^2 + \lambda_a \|G^{(a)} - G^*\|_F^2 + \theta \, \mathrm{tr}\!\left(G^{(a)} L^{(a)} (G^{(a)})^T\right) \Big\} \quad \text{s.t.} \; G^{(a)}, G^* \geq 0, \; \sum_{a=1}^{v} \beta_a = 1, \tag{A1}$$
is equivalent to minimizing the following:
$$\|\Phi(X^{(a)}) - \Phi(X^{(a)}) P^{(a)} G^{(a)}\|_F^2.$$
For the sake of simplicity, we remove the superscript $(a)$, and we have the following:
$$J = \|\Phi(X) - \Phi(X) P G\|_F^2 = \mathrm{tr}\!\left[ (\Phi(X) - \Phi(X) P G)^T (\Phi(X) - \Phi(X) P G) \right] = \mathrm{tr}\!\left( K - 2 K P G + G^T P^T K P G \right), \tag{A2}$$
where, by the definition $K = \Phi(X)^T \Phi(X)$, $K$ is symmetric and positive semi-definite. Setting the derivative w.r.t. $P$ to 0, we have:
$$\nabla_P J = -2 K G^T + 2 K P G G^T = 2 \left( K P G G^T - K G^T \right) = 0 \;\Rightarrow\; K P G G^T = K G^T \;\Rightarrow\; P = G^T (G G^T)^{-1}.$$

Appendix A.2. Optimizing G in Each View

Following the approach used to optimize $P$, minimizing over $G$ in each view is equivalent to:

$$\min_{G \geq 0} \; \|\Phi(X) - \Phi(X) P G\|_F^2 + \lambda \|G - G^*\|_F^2 + \theta \, \mathrm{tr}\!\left(G L G^T\right).$$

In the following, we show that if $G^+ = \max\{G - t \nabla_G J, 0\}$ with $t = \frac{1}{L_J}$, then the objective is monotonically decreasing. This method is simply projected gradient descent; we refer the reader to the related references [19,20,21] for the standard proof of the objective decrease and omit the details here. To begin with, we determine the Lipschitz constant.
Theorem A1.
A differentiable function $J$ has an $L$-Lipschitz continuous gradient [22] if, for some $L_J > 0$,

$$\|\nabla J(G^+) - \nabla J(G)\|_F \leq L_J \|G^+ - G\|_F, \quad \forall\, G^+, G.$$
According to Equation (A2), we have $\nabla_G J = 2\left(P^T K P G - P^T K + \lambda (G - G^*) + \theta G L\right)$; therefore,

$$\begin{aligned} \|\nabla J(G^+) - \nabla J(G)\|_F &= 2 \left\| P^T K P (G^+ - G) + \lambda (G^+ - G) + \theta (G^+ - G) L \right\|_F \\ &\leq 2 \left( \|P^T K P (G^+ - G)\|_F + \lambda \|G^+ - G\|_F + \theta \|(G^+ - G) L\|_F \right) \\ &\leq 2 \left( \|P^T K P\|_2 + \lambda + \theta \|L\|_2 \right) \|G^+ - G\|_F := L_J \|G^+ - G\|_F, \end{aligned}$$

where $L_J = 2\left[\sigma_{\max}(P^T K P) + \lambda + \theta \sigma_{\max}(L)\right]$. The second and third lines follow from subadditivity and the submultiplicative inequality ($\|A + B\|_F \leq \|A\|_F + \|B\|_F$ and $\|AB\|_F \leq \|A\|_2 \|B\|_F$, respectively; here $\|\cdot\|_2$ denotes the spectral norm, i.e., the largest singular value of the input matrix). Therefore, $G^+ = \max\{G - \frac{1}{L_J} \nabla_G J, 0\}$ makes the objective monotonically non-increasing.
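This bound is easy to sanity-check numerically; the sketch below (random data, illustrative only) verifies that the gradient difference never exceeds $L_J \|G^+ - G\|_F$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 4
lam, theta = 1.0, 1.0
X = rng.random((10, n))
K = X.T @ X                              # a valid (PSD) linear kernel matrix
P = rng.standard_normal((n, k))          # P may have mixed signs
W = rng.random((n, n)); W = (W + W.T) / 2
L = np.diag(W.sum(1)) - W                # graph Laplacian
L_J = 2 * (np.linalg.norm(P.T @ K @ P, 2) + lam + theta * np.linalg.norm(L, 2))

grad = lambda G, Gs: 2 * (P.T @ K @ P @ G - P.T @ K + lam * (G - Gs) + theta * G @ L)
G1, G2, Gs = rng.random((k, n)), rng.random((k, n)), rng.random((k, n))
ratio = np.linalg.norm(grad(G1, Gs) - grad(G2, Gs)) / np.linalg.norm(G1 - G2)
assert ratio <= L_J + 1e-9               # the Lipschitz bound L_J holds
```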

Appendix A.3. Optimizing G* in Each View

Taking the derivative of Equation (A1) w.r.t. $G^*$ and setting it to 0, we have:

$$2 \sum_a \beta_a^{\gamma} \lambda_a \left( G^* - G^{(a)} \right) = 0,$$

and with a simple reformulation, we have

$$G^* = \frac{\sum_a \beta_a^{\gamma} \lambda_a G^{(a)}}{\sum_a \beta_a^{\gamma} \lambda_a}.$$

Given the non-negativity of $G^{(a)}$, $\lambda$, and $\beta$, $G^*$ automatically satisfies the non-negative constraint.

Appendix A.4. Optimizing β in Each View

With all the remaining variables fixed, and denoting the objective in each view as $q_a$, we use a Lagrange multiplier and formulate the objective as follows:

$$\mathcal{L}(\beta, t) = \sum_{a=1}^{v} \beta_a^{\gamma} q_a - \gamma t \left( \sum_{a=1}^{v} \beta_a - 1 \right).$$

By taking the derivative w.r.t. $\beta_a$ ($1 \leq a \leq v$), we have:

$$\gamma q_a \beta_a^{\gamma - 1} - \gamma t = 0,$$

which implies

$$\beta_a = \left( \frac{t}{q_a} \right)^{\frac{1}{\gamma - 1}}. \tag{A3}$$

As $\sum_a \beta_a = 1$, we have:

$$t = \left( \sum_a q_a^{\frac{1}{1 - \gamma}} \right)^{1 - \gamma}.$$

Substituting back into Equation (A3), we have:

$$\beta_a = \frac{q_a^{\frac{1}{1-\gamma}}}{\sum_a q_a^{\frac{1}{1-\gamma}}},$$
which is the weight of each view.

References

  1. Nazarian, A.; Yashin, A.I.; Kulminski, A.M. Genome-wide analysis of genetic predisposition to Alzheimer’s disease and related sex disparities. Alzheimer’s Res. Ther. 2019, 11, 5. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, W.Y.; Yu, J.T.; Liu, Y.; Yin, R.H.; Wang, H.F.; Wang, J.; Tan, L.; Radua, J.; Tan, L. Voxel-based meta-analysis of grey matter changes in Alzheimer’s disease. Transl. Neurodegener. 2015, 4, 6. [Google Scholar] [CrossRef] [PubMed]
  3. Marcus, C.; Mena, E.; Subramaniam, R.M. Brain PET in the diagnosis of Alzheimer’s disease. Clin. Nucl. Med. 2014, 39, e413. [Google Scholar] [CrossRef] [PubMed]
  4. Liu, J.; Wang, C.; Gao, J.; Han, J. Multi-view clustering via joint nonnegative matrix factorization. In Proceedings of the 2013 SIAM International Conference on Data Mining, Snowbird, UT, USA, 19–23 May 2013; pp. 252–260. [Google Scholar]
  5. Liu, K.; Wang, H.; Risacher, S.; Saykin, A.; Shen, L. Multiple incomplete views clustering via non-negative matrix factorization with its application in Alzheimer’s disease analysis. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; pp. 1402–1405. [Google Scholar]
  6. Zong, L.; Zhang, X.; Zhao, L.; Yu, H.; Zhao, Q. Multi-view clustering via multi-manifold regularized non-negative matrix factorization. Neural Netw. 2017, 88, 74–89. [Google Scholar] [CrossRef] [PubMed]
  7. Liu, K. A Study of Non-Negative Matrix Factorizations: Foundations, Methods, Algorithms, and Applications. Ph.D. Thesis, Colorado School of Mines, Mines Institutional Repository, Golden, CO, USA, 2019. [Google Scholar]
  8. Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13, 535–541. [Google Scholar]
  9. Wang, Y.X.; Zhang, Y.J. Nonnegative matrix factorization: A comprehensive review. IEEE Trans. Knowl. Data Eng. 2012, 25, 1336–1353. [Google Scholar] [CrossRef]
  10. Cai, D.; He, X.; Han, J. Graph Regularized Non-Negative Matrix Factorization for Data Representation; University of Illinois Urbana: Champaign, IL, USA, 2008. [Google Scholar]
  11. Cai, D.; He, X.; Han, J.; Huang, T.S. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 1548–1560. [Google Scholar] [PubMed]
  12. Rai, N.; Negi, S.; Chaudhury, S.; Deshmukh, O. Partial Multi-View Clustering using Graph Regularized NMF. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2192–2197. [Google Scholar]
  13. Hou, C.; Nie, F.; Tao, H.; Yi, D. Multi-view unsupervised feature selection with adaptive similarity and view weight. IEEE Trans. Knowl. Data Eng. 2017, 29, 1998–2011. [Google Scholar] [CrossRef]
  14. Mika, S.; Schölkopf, B.; Smola, A.; Müller, K.R.; Scholz, M.; Rätsch, G. Kernel PCA and de-noising in feature spaces. Adv. Neural Inf. Process. Syst. 1998, 11, 536–542. [Google Scholar]
  15. Patle, A.; Chouhan, D.S. SVM kernel functions for classification. In Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE), Mumbai, India, 23–25 January 2013; pp. 1–9. [Google Scholar]
  16. Hofmann, T.; Schölkopf, B.; Smola, A.J. Kernel methods in machine learning. Ann. Stat. 2007, 36, 1171–1220. [Google Scholar] [CrossRef]
  17. Müller, K.R.; Mika, S.; Tsuda, K.; Schölkopf, B. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 2001, 12, 181–201. [Google Scholar] [CrossRef] [PubMed]
  18. Petersen, R.C.; Aisen, P.S.; Beckett, L.A.; Donohue, M.C.; Gamst, A.C.; Harvey, D.J.; Jack, C., Jr.; Jagust, W.J.; Shaw, L.M.; Toga, A.W.; et al. Alzheimer’s disease Neuroimaging Initiative (ADNI) clinical characterization. Neurology 2010, 74, 201–209. [Google Scholar] [CrossRef] [PubMed]
  19. Chen, Y.; Wainwright, M.J. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv 2015, arXiv:1509.03025. [Google Scholar]
  20. Iusem, A.N. On the convergence properties of the projected gradient method for convex optimization. Comput. Appl. Math. 2003, 22, 37–52. [Google Scholar]
  21. Drummond, L.G.; Iusem, A.N. A projected gradient method for vector optimization problems. Comput. Optim. Appl. 2004, 28, 5–29. [Google Scholar] [CrossRef]
  22. Zhou, X. On the fenchel duality between strong convexity and lipschitz continuous gradient. arXiv 2018, arXiv:1803.06573. [Google Scholar]
Figure 1. Multi-view clustering via matrix factorization, which allows for data to be collected from various modalities or views.
Figure 2. The significance of the 1224 SNPs to AD (linear kernel); higher values indicate greater importance to AD. This could help narrow down the search for the SNPs that cause AD.
Table 1. Average clustering performance on the AD dataset (%).

Method   SV      CNMF    MNMF    AWMNMF   Ours
Acc.     43.25   43.65   48.97   50.35    54.78
NMI      50.55   50.55   50.54   50.51    54.20
RI       47.63   47.61   40.11   38.09    55.87
MI       52.37   52.39   59.89   61.91    62.46