2.1. Participants
The data employed in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/, accessed on 15 February 2020). The ADNI was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and AD. All ADNI participants provided written informed consent, and the institutional review board of each ADNI site approved the study protocols.
High-resolution T1-weighted structural MRI (sMRI) data at baseline were collected at multiple ADNI sites on 1.5 Tesla MRI scanners from Siemens (Erlangen, Germany), Philips (Best, The Netherlands), and General Electric Healthcare (Waukesha, WI, USA), using the standard ADNI Phase 1 (ADNI-1) MRI protocol. Each subject was scanned with a sagittal 3D MP-RAGE sequence with the following acquisition parameters: inversion time/repetition time, 1000/2400 ms; flip angle, 8°; field of view, 24 cm; acquisition matrix, 192 × 192 × 166; voxel size, 1.25 × 1.25 × 1.2 mm³. In-plane, zero-filled reconstruction yielded a 256 × 256 matrix for a reconstructed voxel size of 0.9375 × 0.9375 × 1.2 mm³. To assure uniformity among scans obtained at different sites, images were calibrated using phantom-based geometric corrections. Additional image corrections were also applied to adjust for scanner- and session-specific calibration errors. In addition to the original uncorrected image files, images with all these corrections already applied (GradWarp, B1, phantom scaling, and N3) are available to the general scientific community (at www.loni.ucla.edu/ADNI, accessed on 15 February 2020).
The samples included in the ADNI-1 cohort were assigned one of three clinical statuses (CN, MCI, and AD), comprising 187 AD patients, 382 MCI patients, and 229 CNs at baseline. The neuropsychological assessments used in this study can be divided into global cognitive screening tests, the Functional Assessment Questionnaire (FAQ), and ADNI composite scores. The global tests consist of the Mini-Mental State Examination (MMSE), the Clinical Dementia Rating sum of boxes (CDR-SB), and the 11-item AD Assessment Scale-Cognitive (ADAS-Cog11) or its 13-item expansion (ADAS-Cog13). The ADNI composite scores cover four sub-domains: memory, executive function, language, and visuospatial ability. Gibbons et al. derived the composite scores for memory (ADNI-MEM) and executive function (ADNI-EF) from the ADNI neuropsychological battery using item response theory [15,16], and Choi et al. designed the composite scores for language (ADNI-LAN) and visuospatial abilities (ADNI-VS) using similar methods [17]. The demographic details and neuropsychological assessment [18] results for the three groups are provided in Table 1. The dataset was randomly split into 70% training, 10% validation, and 20% test sets. The training set was used to train the algorithm, the validation set was used to find the optimal combination of hyper-parameters, and the test set was used to evaluate the model.
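The 70/10/20 split described above can be sketched as follows; the subject IDs and the random seed are illustrative, not taken from the paper.

```python
import random

def split_dataset(subject_ids, train_frac=0.7, val_frac=0.1, seed=42):
    """Randomly partition subject IDs into train/validation/test sets."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)      # reproducible shuffle
    n = len(ids)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]          # remaining ~20%
    return train, val, test

# The ADNI-1 baseline cohort has 187 AD + 382 MCI + 229 CN = 798 subjects.
train, val, test = split_dataset(range(798))
```

Splitting by subject (rather than by scan) ensures that no individual appears in more than one partition.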
2.3. DenseNet for GMDM Feature Learning
DenseNet, an extension of the ResNet architecture, was proposed by Huang et al. [20]. To maximize the information flow through layers, the DenseNet architecture uses a simple connectivity pattern in which each layer in a dense block obtains the feature maps of all previous layers and passes its own feature maps to all subsequent layers. This architecture gives DenseNet several advantages: it mitigates over-fitting and degradation phenomena, improves the efficiency of feature propagation, encourages feature reuse, and substantially reduces the model’s size.
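The dense connectivity pattern can be illustrated with a minimal sketch that tracks feature-map concatenation: each unit consumes the input plus every previous unit’s output, so with growth rate k the channel count grows linearly. The shapes, random linear maps, and names below are illustrative stand-ins for real 3D convolutions.

```python
import numpy as np

def dense_block(x, num_units=3, growth_rate=12, rng=None):
    """Toy dense block: each unit consumes the concatenation of all
    previous feature maps and contributes `growth_rate` new channels.
    `x` has shape (channels, voxels); a real 3D DenseNet would apply
    convolutions instead of this random linear map."""
    rng = rng or np.random.default_rng(0)
    features = [x]
    for _ in range(num_units):
        inputs = np.concatenate(features, axis=0)   # all prior feature maps
        w = rng.standard_normal((growth_rate, inputs.shape[0]))
        new_maps = np.maximum(w @ inputs, 0.0)      # "conv" + ReLU
        features.append(new_maps)                   # visible to later units
    return np.concatenate(features, axis=0)

x = np.ones((16, 10))      # 16 input channels, 10 flattened voxels
out = dense_block(x)       # channels grow as 16 + 3 * 12 = 52
```

Because every unit adds only a small number of channels while re-reading all earlier ones, the block stays compact yet propagates features efficiently.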
The GMDMs were used as inputs to the model. DenseNet, trained from scratch, was used to investigate a binary problem (AD versus CN). To generate the optimal model for AD versus CN, we empirically tuned DenseNet’s hyper-parameters using a grid-search technique according to the validation results: the learning rate (1 × 10⁻⁶–1 × 10⁻²), the number of dense blocks (2–5), the growth rate (8–24), the compression rate (0.2–0.8), and the batch size (32–128). While varying each hyper-parameter, the mean accuracy (ACC) was calculated for each candidate value. In the cost-function calculation, balanced class weights were used so that classes were weighted inversely proportional to their frequency in the training set. A schematic of the optimized 3D DenseNet architecture is shown in Figure 2. It consists of a 3 × 3 × 3 convolutional layer, followed by three dense blocks with transition layers in between. The output of the last dense block is flattened and passed through two fully connected layers with 512 and 256 units, respectively, before the output layer. Each dense block has three repeating units; each repeating unit has one bottleneck 1 × 1 × 1 convolutional layer with 48 channels, followed by a 3 × 3 × 3 convolutional layer with 12 channels. The loss function was binary cross-entropy. The learned hyper-parameters were as follows: the learning rate, growth rate, compression rate, and batch size were set at 0.0001, 12, 0.5, and 64, respectively. A transfer-learning strategy was applied to this optimized DenseNet architecture to initialize the training of the CNNs for two binary classification problems (AD versus MCI and MCI versus CN) and one multi-class classification problem (CN, MCI, and AD), primarily because these four tasks are highly related and the latter tasks are substantially more demanding. Training was performed using Adam optimization. The model was implemented in Keras with TensorFlow as the backend and trained on an NVIDIA RTX 3090 GPU with 24 GB of memory. After training, the anatomical features of the GMDMs were extracted from the first fully connected layer. The CNN model was trained for a maximum of 200 epochs, with early stopping after 30 epochs without improvement in the validation loss.
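The grid search over the stated hyper-parameter ranges can be sketched as below. The candidate values and the `evaluate` callback are hypothetical stand-ins; in practice `evaluate` would train the network and return its validation accuracy.

```python
import itertools

# Candidate values drawn from the ranges reported above (illustrative grid).
grid = {
    "learning_rate": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2],
    "num_dense_blocks": [2, 3, 4, 5],
    "growth_rate": [8, 12, 16, 24],
    "compression_rate": [0.2, 0.5, 0.8],
    "batch_size": [32, 64, 128],
}

def grid_search(evaluate):
    """Try every combination and keep the one with the best validation ACC."""
    best_acc, best_cfg = -1.0, None
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = evaluate(cfg)              # train + validate (the costly step)
        if acc > best_acc:
            best_acc, best_cfg = acc, cfg
    return best_cfg, best_acc

# Dummy evaluator that scores a configuration by how many of its values
# match the optimum reported in the paper.
target = {"learning_rate": 1e-4, "num_dense_blocks": 3,
          "growth_rate": 12, "compression_rate": 0.5, "batch_size": 64}
best_cfg, _ = grid_search(lambda c: sum(c[k] == target[k] for k in c))
```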
2.4. Population Graph Construction
To consider the correlations among the subjects in a cohort, the population is regarded as a graph: individual subjects are represented by the nodes, each carrying the compact anatomical feature vector extracted by the 3D DenseNet, while the edges encode pairwise similarities based on non-imaging (demographic and/or neuropsychological) and/or imaging data. The population graph is constructed from the full set of CN subjects and patients with MCI and AD. A population graph is built from two key elements: (a) the node feature vector assigned to each node, and (b) the weighted adjacency matrix. More explicitly, we built an undirected weighted graph G(V, E, X), in which the set of nodes V = {v_1, …, v_n} corresponds to the set of subjects. Each node v_i carries the 512-dimensional feature vector x_i described in Section 2.3, and the feature matrix X ∈ R^(n×512) consists of the stacked feature vectors of the n nodes in the graph. The weighted adjacency matrix A is built from the set of edges E ⊆ V × V, which correspond to links between the nodes, where an edge-assigning function assigns a weight S(i, j) to each edge. Constructing a population graph is not straightforward, however, as there are multiple edge-assigning functions that map the data to a graph structure, and the choice of edge-assigning function is critical for capturing the underlying structure of the graph and explaining the similarities between the feature vectors. We computed the similarity between each pair of anatomical feature vectors x_i and x_j of nodes i and j; this similarity index is denoted Sim(x_i, x_j).
A similarity function γ is defined as a Kronecker delta function if the non-imaging feature is categorical (e.g., the subject’s gender):

γ(M_p(i), M_p(j)) = 1 if M_p(i) = M_p(j), and 0 otherwise.

If the non-imaging feature is quantitative (e.g., the subject’s age), the function is specified as a unit-step function with respect to a threshold β:

γ(M_p(i), M_p(j)) = 1 if |M_p(i) − M_p(j)| < β, and 0 otherwise.

In the equations above, M_p(i) and M_p(j) are the values of the p-th non-imaging feature for nodes i and j. The combined similarity index is defined by the equation below:

S(i, j) = Sim(x_i, x_j) · Σ_{p=1}^{P} γ(M_p(i), M_p(j))    (4)

where Sim(x_i, x_j) is the similarity between the anatomical feature vectors of nodes i and j, and P is the number of non-imaging features used to generate the edges. Equation (4) states that S(i, j) increases when there is a high degree of similarity between two subjects’ imaging feature vectors and/or their non-imaging measures; both non-imaging and imaging features are thereby incorporated.
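Under these definitions, the edge weight between two subjects can be computed as in the sketch below. The cosine similarity used for the imaging term and the example feature values are illustrative assumptions; the paper does not spell out the imaging similarity measure at this point.

```python
import numpy as np

def gamma(mi, mj, beta=None):
    """Non-imaging similarity: Kronecker delta for categorical features,
    unit step with threshold `beta` for quantitative ones."""
    if beta is None:                                 # categorical, e.g. gender
        return 1.0 if mi == mj else 0.0
    return 1.0 if abs(mi - mj) < beta else 0.0       # quantitative, e.g. age

def edge_weight(xi, xj, phenotypes_i, phenotypes_j, betas):
    """Combined similarity S(i, j): imaging similarity scaled by the sum
    of non-imaging agreements over the P phenotypic features."""
    sim = float(np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj)))
    agreement = sum(gamma(mi, mj, beta)
                    for mi, mj, beta in zip(phenotypes_i, phenotypes_j, betas))
    return sim * agreement

xi = np.array([1.0, 0.0, 1.0])
xj = np.array([1.0, 0.0, 1.0])
# P = 2 features: gender (categorical) and age (threshold beta = 2 years)
w = edge_weight(xi, xj, ["F", 71.0], ["F", 72.5], betas=[None, 2.0])
# identical feature vectors (cosine = 1) and two agreeing phenotypes: S = 2.0
```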
For clarity, we categorized the resulting graphs into three groups based on their edge-assigning functions:
Baseline graphs: Graphs constructed using the similarity between the imaging feature vectors described in Section 2.3.
Non-imaging graphs: Graphs were constructed using the relationships between non-imaging features.
Combined graphs: Graphs that were constructed using a combination of non-imaging and imaging features.
To examine how the construction of the population graph (edges and features), especially the edge-assigning function, influences AD staging performance, three experiments were implemented in this study. Experiment I was designed to explore the implications of incorporating demographic information into the edge-assigning function for the classification of AD versus CN. Experiment II was designed to investigate the impact of adding various neuropsychological tests to the edge-assigning function on MCI classification. Experiment III was designed to investigate whether the edge-assigning function that produced the best outcomes in Experiment II also performs well on multi-class classification.
Individuals with AD usually demonstrate a high level of heterogeneity [21]. Some atrophic areas affected by one AD subtype may be preserved in another [22,23]. As a result, imaging features and AD risk factors should be combined in the diagnosis of AD. One of the biggest risk factors for AD is aging; more than 13% of people aged 65 and over and 43% of people aged 85 and over have been diagnosed with AD [24]. Genetic factors also play a role: apolipoprotein E (ApoE) is a well-known risk factor for late-onset AD [25,26]. Female birth sex has been linked to an increased risk of developing AD, and two-thirds of older adults with AD are women [27,28]. Therefore, non-imaging information such as age, gender, and ApoE genotype was used to calculate the similarity of the nodes in this investigation. Based on all possible combinations of these three features, seven population graphs were created. A grid search with validation was used to determine the age threshold.
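The seven graphs correspond to all non-empty subsets of the three risk factors, which can be enumerated directly:

```python
from itertools import combinations

risk_factors = ["age", "gender", "ApoE"]

# All non-empty subsets of the three non-imaging features: 2^3 - 1 = 7
subsets = [set(c)
           for r in range(1, len(risk_factors) + 1)
           for c in combinations(risk_factors, r)]
# {age}, {gender}, {ApoE}, {age, gender}, {age, ApoE},
# {gender, ApoE}, {age, gender, ApoE}
```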
Of note, distinguishing MCI patients from CN subjects or AD patients based on neuroimaging data is more difficult than distinguishing between AD and CN, and the results of the former are always less accurate [29]. The criteria for clinically categorizing the ADNI-1 subjects into disease groups are summarized as follows [30]: (a) CN subjects had normal cognition and memory, an MMSE score between 24 and 30, CDR = 0, and no depression; (b) MCI patients had a verified memory complaint, an MMSE score between 24 and 30, CDR = 0.5, objective memory loss measured by education-adjusted scores on the Wechsler Memory Scale Logical Memory II, an absence of significant impairment in other cognitive domains, and essentially preserved activities of daily living; and (c) probable AD patients had a validated memory complaint, an MMSE score in the range of 20–26, CDR ≥ 0.5, and met the NINCDS/ADRDA criteria for probable AD. Because neuropsychological tests, particularly the MMSE and CDR, were employed as major criteria in categorizing participants, they could provide complementary information for MCI classification. Non-imaging information from nine neuropsychological assessments was utilized to compute the similarity of the nodes in the population graph, and 18 population graphs were created: nine with a non-imaging similarity index as edges and nine with a combined similarity index as edges. The optimal threshold β of each neuropsychological assessment for each task was determined through an exhaustive grid search with validation.
Most AD and MCI research simplifies the classification problem to a set of binary classification tasks, such as AD versus CN and/or MCI versus CN. However, AD staging is naturally modeled as a multi-class classification problem, necessitating examination of the entire AD spectrum. The classification of AD, CN, and MCI is difficult because a multi-class model suffers more inter-class interference than a two-class model. In the current study, the edge-assigning function that achieved the best result in the MCI classification was used for multi-class classification.
2.5. GCN
After constructing the population graph described in Section 2.4, we train GCNs to predict the target labels. Various GCN frameworks have been proposed, and one of the most seminal examples was proposed by Kipf and Welling [31] in 2016. The GCN model architecture is composed of stacked graph convolutional layers, with each layer’s propagation rule described as:

H^(l+1) = f( D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l) ),  with Ã = A + I

where D and A are the degree matrix and adjacency matrix, respectively, I is the identity matrix, and D̃ is the diagonal node degree matrix of Ã; W^(l) are the network parameters of the l-th layer to be learned; H^(l) are the node embeddings generated by the previous message-passing step (with H^(0) = X); H^(l+1) are the updated node embeddings; and f is a non-linear activation function. The term Ã = A + I adds a self-connection to each node, and the symmetric normalization by D̃^(−1/2) keeps the scale of the feature vectors. During training, vertices connected by high edge weights become more similar as information passes through multiple layers.
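A minimal numerical sketch of this propagation rule (two stacked layers with ReLU and soft-max, mirroring the model described later in this section); the tiny graph, feature matrix, and weights are fabricated for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    """Compute D̃^(-1/2) (A + I) D̃^(-1/2): add self-loops, then apply
    symmetric degree normalization."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(A, X, W1, W2):
    """Two graph convolutional layers: ReLU after the first,
    row-wise soft-max after the second (per-node class probabilities)."""
    A_hat = normalize_adjacency(A)
    H1 = np.maximum(A_hat @ X @ W1, 0.0)    # layer 1 + ReLU
    logits = A_hat @ H1 @ W2                # layer 2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True) # soft-max

# Toy population graph: 3 subjects, 2 input features, 2 classes.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(0)
probs = gcn_forward(A, X, rng.standard_normal((2, 4)),
                    rng.standard_normal((4, 2)))
```

Each forward pass mixes a node’s features with those of its neighbors through Â, so strongly connected subjects drift toward the same prediction.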
From the perspective of message passing, two steps are performed: (1) producing an intermediate representation for a node by aggregating information from its neighbors; and (2) transforming the aggregated representation with a linear transformation parameterized by W, shared by all nodes, followed by a non-linear activation. In the current study, we built a GCN model (Figure 3) by stacking two graph convolutional layers with the adjacency and node feature matrices as inputs; the activation function of the first convolutional layer is ReLU. The first graph convolutional layer has 32 neurons, and the second has two neurons (for binary classification) or three neurons (for three-class classification), followed by a soft-max activation function. The loss function is defined by the difference between the predicted and actual labels; a cross-entropy loss is used in our implementation. For the GCN, we adopted code from the GCN in PyTorch GitHub repository (https://github.com/tkipf/pygcn, accessed on 20 December 2022). The model was trained using a grid-search technique to find the optimal combination of hyper-parameters (learning rate and dropout ratio) for this architecture; the hyper-parameter ranges were 1 × 10⁻⁶–1 × 10⁻² for the learning rate and 0.3–0.8 for the dropout ratio. Training was conducted using the Adam optimizer implemented in PyTorch. The optimal learning rate was 0.001, 0.0001, and 0.0001 for Experiments I, II, and III, respectively, and the dropout ratio was 0.5. The maximum number of epochs was set at 500 for all tasks, with training stopped early if the accuracy on the validation set did not improve after 20 epochs. During training, we use the entire dataset, including labeled training samples and unlabeled test samples, to construct the whole population graph. The GCNs are trained to minimize the cross-entropy loss over all training samples. After training, the model outputs a prediction for each test sample.