1. Introduction
Alzheimer’s disease (AD) is the most common neurodegenerative disease worldwide, with an estimated global prevalence exceeding 100 million by 2050 [
1]. It is characterized by a progressive decline in cognitive function, beginning with mild memory impairment and eventually leading to severe dysfunction that interferes with daily life. Mild cognitive impairment (MCI) is widely recognized as a transitional stage between normal aging and AD [
2,
3], and is often considered its prodromal phase [
4,
5,
6]. MCI patients face a significantly higher rate of conversion to AD, approximately 15%. In resting-state fMRI analysis, the human brain often exhibits symmetrical patterns of functional connectivity across hemispheres, reflecting coordinated neural processing. Deviations from such symmetry, or the emergence of asymmetrical connectivity patterns, can serve as critical biomarkers for differentiating MCI trajectories and understanding disease progression. Therefore, an early and accurate diagnosis of MCI is essential for timely intervention and to reduce the burden on global healthcare systems [
7].
Recent studies have explored MCI progression using various neuroimaging techniques, including positron emission tomography (PET), structural magnetic resonance imaging (sMRI), and resting-state functional MRI (rs-fMRI). These modalities have provided valuable biomarkers for distinguishing progressive MCI (pMCI) from stable MCI (sMCI) and predicting conversion to AD. For example, integrating structural and metabolic imaging has improved early AD diagnosis, achieving a prediction accuracy of 79.37% for pMCI progression [
8]. Additionally, deep learning frameworks leveraging multimodal and multiscale networks have demonstrated effectiveness in classifying AD and MCI [
9,
10,
11,
12]. Similarly, in domains such as natural language processing, hybrid architectures that integrate BiLSTM, Transformer, and attention mechanisms (e.g., the Hyb-KAN model [
13]) have also demonstrated exceptional accuracy and interpretability in complex classification tasks. In addition, a CNN–BiLSTM hybrid model with attention mechanisms has been proposed for speech emotion recognition, achieving superior accuracy over state-of-the-art methods [
14]. Recent studies have leveraged graph-based deep learning on rs-fMRI for AD diagnosis, including STGTN [
15], which captures spatiotemporal features with adversarial training, and FTF-GNN [
16], which fuses frequency and time domain information to model both synchronous and asynchronous brain region interactions. Graph-theoretic approaches and machine learning methods have also been applied to rs-fMRI data, successfully identifying functional network alterations across healthy controls (HCs), MCI patients, and AD patients [
17,
18,
19,
20]. Resting-state fMRI, combined with graph theory approaches and support vector machines (SVMs), has been employed to predict MCI progression, demonstrating its potential for early AD diagnosis [
21,
22]. Another study utilized rs-fMRI and sMRI features to train and test SVM models for differentiating MCI converters from non-converters [
23]. Notably, many researchers have extracted graph measures from brain networks and classified various states of MCI progression using machine learning classification frameworks. Graph theory has proven to be an efficient approach for identifying alterations in brain networks associated with psychiatric and neurological disorders. Moreover, multimodal neuroimaging, including rs-fMRI and APOE genotype scores, has been integrated to distinguish AD from other groups, such as pMCI, sMCI, and healthy controls [
24]. A 3D convolutional neural network (3D-CNN) was proposed to combine and analyze rs-fMRI, clinical assessments, and demographic information. This method predicted MCI progression to AD over a 5-year interval, achieving, on average, an AUC of 91.72% and an accuracy of 87.59% [
25]. Notably, combining sMRI and rs-fMRI has achieved an accuracy of 89.8% in differentiating pMCI from sMCI [
26]. These findings suggest that functional brain network analysis plays a crucial role in understanding disease progression.
A growing body of research has employed graph theory to study brain network organization in MCI and AD, particularly in functional connectivity (FC) analyses. However, FC alone cannot reveal causal interactions between brain regions [
27]. Effective connectivity (EC), which characterizes the directional influence of one brain region over another, offers a more informative perspective on brain network dynamics [
28]. EC methods such as structural equation modeling, dynamic causal modeling [
29], and Granger causality analysis (GCA) [
30] have been widely applied in neuroimaging research. Among them, GCA, a statistical technique for inferring causal relationships from time series data, has been extensively used in fMRI and EEG studies [
31,
32,
33]. Mohammadian et al. proposed FE-STGNN, a spatiotemporal graph neural network that fuses dynamic functional and effective connectivity to achieve state-of-the-art accuracy in MCI diagnosis using rs-fMRI [
34]. However, most existing studies focus primarily on MCI conversion to AD [
35], while the phenomenon of MCI reversion (rMCI), where some patients recover normal cognitive function, has received limited attention. Reported reversion rates vary widely, ranging from 18% to 50% globally after 1–5 years of follow-up [
6,
36,
37]. In some studies, rMCI rates have even exceeded pMCI rates [
36,
38,
39]. However, rMCI individuals remain at an elevated risk of future cognitive decline, highlighting the importance of early identification and monitoring [
38]. Dang et al. [
40] revealed that gray matter covariance patterns in key brain networks can predict both progression to MCI and reversion to normal cognition (rMCI) with high accuracy (AUC: 0.69–0.81). Qin et al. [
41] developed a functional multistate modeling approach to characterize reversion from mild cognitive impairment (rMCI), which effectively revealed its transition probabilities and highlighted the subsequent risks of stability or deterioration. These findings highlight the complexity of cognitive state transitions and suggest that methodological advances alone are not sufficient. Therefore, there is an urgent need to investigate reliable biomarkers for both conversion and reversion to facilitate more effective diagnostic and prognostic models.
Despite significant advances in neuroimaging and classification frameworks, the early identification of high-risk populations and reliable biomarkers for both MCI conversion and reversion remains a challenge. To address this, our study proposes an effective connectivity-based framework that systematically characterizes MCI progression and reversion, bridging the gap in existing neuroimaging research. Early and accurate diagnosis of MCI is essential for delaying or even preventing its progression to AD [
36,
42]. Although recent studies have extensively explored the application of functional networks derived from rs-fMRI and the integration of multimodal neuroimaging data to predict MCI conversion, the role of effective connectivity in this transformation remains underexplored. Many classification models rely on multimodal approaches to improve performance; however, the simultaneous acquisition of multimodal data in clinical practice is often impractical due to cost, time constraints, and patient burden. Therefore, investigating a single-modality approach that can effectively capture the neural mechanisms underlying MCI progression is of great significance. Furthermore, previous studies have predominantly focused on distinguishing between pMCI and sMCI, with classification accuracy serving as the primary evaluation metric. However, such classifications do not necessarily reflect alterations in brain functional integration and segregation during disease progression. Importantly, while MCI conversion has been widely examined, little attention has been given to MCI reversion to normal cognition from a brain network perspective, particularly in terms of effective connectivity. Given these gaps, our study aims to systematically investigate the role of rs-fMRI-based effective connectivity in both MCI conversion and reversion, providing novel insights into the dynamic plasticity of brain networks.
To this end, we introduce a novel method, large-scale Granger causality modeling integrated with a long short-term memory (LSTM) network (LSTM-lsGC), to comprehensively analyze the brain network alterations underlying MCI progression, with a particular emphasis on identifying biomarkers for cognitive reversion. We construct a disease-specific template by comparing the amplitude of low-frequency fluctuation (ALFF) between healthy controls (HCs) and AD patients. Our LSTM-lsGC approach selectively integrates relevant brain regions into the causality computation, enhancing efficiency and minimizing interference. This enables a refined EC matrix to be derived, from which graph-theoretic features are extracted to characterize brain network organization. By integrating LSTM with large-scale GCA, the method captures nonlinear dynamics in brain connectivity, providing a novel framework to differentiate rMCI, sMCI, and pMCI and facilitating early intervention strategies. To validate our approach, we conducted comparative experiments against other EC estimation methods and network templates. This study represents the first attempt to systematically examine MCI reversion using an EC-based graph-theoretic approach, offering new insights into the neural mechanisms underlying disease progression and potential biomarkers for early intervention.
  2. Materials and Methods
The general framework for identifying MCI subgroups with effective connectivity and network construction is illustrated in 
Figure 1. It consists of template construction, EC calculation with the LSTM-lsGC model, and feature selection and classification. It is summarized as follows. (1) We compared the ALFF between subjects in the HC and AD groups to obtain significant difference voxels and construct the HAD template. The HAD mask was constructed entirely from HC and AD data, with no overlap between these subjects and the MCI dataset used for the classification tasks. (2) For the EC matrix, we calculated the Granger causalities by applying the LSTM-lsGC model. (3) We used the selected features obtained from global and local features to identify MCI in three states. In addition, comparisons were made with brain effective connectivity networks constructed using an automated anatomical labeling (AAL) template and pairwise conditional Granger causality (PGC) from the MVGC toolbox [
43] to assess the validity and advantages of our approach.
  2.1. Participants
All data used in this study were selected from the publicly available Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (
http://adni.loni.usc.edu/ (accessed on 27 April 2024)). Regarding human subjects and the ethical and regulatory considerations of the ADNI, the study was conducted according to Good Clinical Practice guidelines, US 21CFR Part 50 (Protection of Human Subjects), and Part 56 (Institutional Review Boards (IRBs)/Research Ethics Boards (REBs)), as well as under state and federal regulations. Written informed consent and HIPAA authorizations for the study were obtained from all participants or authorized representatives and the study partners before protocol-specific procedures were carried out. The primary goal of the ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological evaluation can be combined to measure the progression of MCI and AD. The patients used in this study had a complete session of rs-fMRI data. In the ADNI project, the diagnostic criteria for MCI were the following: (1) Mini-Mental State Examination (MMSE) scores between 24 and 30; (2) a Clinical Dementia Rating (CDR) of 0.5; (3) memory complaint and objective memory loss measured by education-adjusted scores on the Wechsler Memory Scale’s Logical Memory II; and (4) no observable impairment in other cognitive domains and preserved activities of daily living (no dementia). Twenty-eight HC, 23 AD, 29 sMCI, and 23 MCI patients, including 7 rMCI and 16 pMCI converters, were included in this study. The details of the patients are shown in 
Table 1. The MCI converters changed diagnostic status within 6–36 months, and the sMCI patients had not converted to AD or reverted to normal cognition after 36 months of follow-up. Subjects without an MMSE score, CDR, or FAQ score were excluded, and all MCI patients had a CDR of 0.5 and MMSE scores of 24–30. FAQ scores ranged from 0 to 20, consistent with no significant impairment in other cognitive domains, preserved activities of daily living, and the absence of dementia.
  2.2. Data Acquisition and Preprocessing
According to the ADNI protocol, the participants underwent rs-fMRI scanning on a 3T Philips scanner (Manufacturer: Philips Medical Systems) and were required to stay awake during data acquisition, with eyes open and minimal head movement. The fMRI images were acquired using an echo planar imaging (EPI) sequence with a repetition time (TR) of 3000.0 ms, echo time (TE) of 30.0 ms, and flip angle (FA) of 80.0°, with 48 slices imaged. The pixel spacing in the X and Y dimensions was 3.3 mm, and the slice thickness was 3.3 mm.
DPABI (a toolbox for Data Processing & Analysis of Brain Imaging, Release V4.2-190919), based on the MATLAB 2018b platform, was used to preprocess the fMRI data of the patients [
44]. Preprocessing began with removing the first 10 time points for all participants to allow the magnetization to reach equilibrium. The data were then processed with slice timing correction, head motion correction, spatial normalization using EPI templates and reslicing to 3 mm × 3 mm × 3 mm voxels, smoothing with a Gaussian kernel (FWHM = 4 mm), detrending, and nuisance covariate regression.
  2.3. Template Construction
The ALFF values of the HC and AD groups were calculated with filtered signals in the low-frequency range (0.01–0.08 Hz) without additional filtering. To simplify cross-subject comparisons, various normalization methods were applied to rs-fMRI metrics. The ALFF was transformed into z-scores by subtracting the mean values within the gray matter mask and dividing by the standard deviation within the mask. This type of normalized map is named the zALFF. The zALFF difference between the HC and AD voxels was calculated with a two-sample 
t-test with false discovery rate (FDR) correction. A voxel that survived the correction was taken to differ significantly between the two groups and was defined as a difference voxel. Afterward, the difference voxels were matched with the AAL atlas [
45]. If the difference voxels overlapped with a brain region of the AAL, then they were designated as a new region of interest (ROI). A new template comprising 53 ROIs, named differmask, was obtained. Additionally, the voxels corresponding to the differmask were removed from the AAL, resulting in an AALdel template. The AALdel still contained 116 ROIs, but with fewer voxels than the AAL. Finally, by combining the differmask and the AALdel, we obtained a conversion mask, the HC-AD difference template (HAD), comprising the 169 ROIs used in our work. Unlike most other studies, this study did not exclude the cerebellar ROIs. Changes in cerebellar regions in different states of MCI are also of interest, and our study thus investigated and analyzed the entire brain based on all ROIs of the HAD.
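The two numerical steps of the template construction, z-scoring the ALFF map within a mask and keeping voxels that survive FDR correction, can be sketched as follows. This is a minimal illustration on toy arrays, not the DPABI pipeline used in the study. The function names, the Benjamini-Hochberg variant of FDR, and the level q = 0.05 are assumptions made for the example.

```python
import numpy as np

def zscore_within_mask(alff, mask):
    """Standardize an ALFF map with the mean and SD of the voxels inside the mask."""
    values = alff[mask]
    return (alff - values.mean()) / values.std()

def benjamini_hochberg(p_values, q=0.05):
    """Boolean array marking p-values that survive BH-FDR at level q (q is assumed)."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    ranked = p[order]
    m = p.size
    # Largest rank k with p_(k) <= q * k / m; all smaller p-values survive too.
    below = ranked <= q * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])
        keep[order[: cutoff + 1]] = True
    return keep

# Toy ALFF volume and a toy "gray matter" mask (both hypothetical).
rng = np.random.default_rng(0)
alff = rng.random((4, 4, 4))
mask = np.ones_like(alff, dtype=bool)
mask[0] = False
zalff = zscore_within_mask(alff, mask)

# Toy voxel-wise p-values from a two-sample test.
survivors = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], q=0.05)
```

In the toy p-value list only the two smallest values survive, which shows that the correction is stricter than an uncorrected 0.05 cutoff.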
  2.4. Large-Scale Granger Causality Based on LSTM
Granger causality analysis is a method for calculating effective connectivity in multivariate time series. The principle of GCA is based on the concept of precedence and predictability [
46], where the improvement in predictive quality of a time series in the presence of the past of another time series is evaluated and quantified, revealing the causal influence between the two series [
47]. Granger causality testing with nonlinear neural network-based methods has been developed and integrated into a Python package: nonlincausality (accessed 11 December 2021) [
48,
49]. What sets our study apart is our enhancement of the Granger causality analysis method for high-dimensional feature utilization. We introduced principal component analysis into the Granger causality relationship based on LSTM, resulting in the LSTM-lsGC model. LsGC, an extension of traditional multivariate GC (mvGC) [
50], provides the basis for the entire processing scheme of LSTM-lsGC. Suppose 
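$X$ is a stationary multivariate time series. MvGC was estimated by fitting a multivariate vector autoregressive (MVAR) model to the time series with a time lag of $P$:
$$X(t) = \sum_{p=1}^{P} A_p X(t-p) + \varepsilon(t) \tag{1}$$
where $X(t)$ is the vector of all time series, $A_p$ is the coefficient matrix at lag $p$, and the uncorrelated noise process is $\varepsilon(t)$. For a specific variable $x_i$ (a component of $X$), the model excluding $x_i$ is
$$X_{\setminus i}(t) = \sum_{p=1}^{P} A'_p X_{\setminus i}(t-p) + \varepsilon_{\setminus i}(t) \tag{2}$$
where $X_{\setminus i}$ represents the vector of time series excluding $x_i$, $A'_p$ represents the corresponding coefficient matrices, with $p = 1, \ldots, P$, and $\varepsilon_{\setminus i}(t)$ is the noise vector without $x_i$. Afterward, the Granger causality from $x_i$ to $X_{\setminus i}$ can be defined as follows:
$$F_{x_i \to X_{\setminus i}} = \ln \frac{|\Sigma_{\setminus i}|}{|\Sigma|} \tag{3}$$
Here, $\Sigma_{\setminus i}$ is the covariance matrix of the noise when excluding $x_i$, and $\Sigma$ is the covariance matrix of the noise for the full model. This simplified formula tests whether the residual covariance increases significantly when $x_i$ is excluded from the model, indicating whether $x_i$ Granger causes other variables in the time series vector $X$.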
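To make the comparison between the full and reduced autoregressive models concrete, the sketch below fits both by ordinary least squares and takes the log ratio of the residual covariance determinants. It is a linear illustration only, not the LSTM-based estimator proposed in this work. The function names and toy data are hypothetical, and the full-model residuals are restricted to the variables other than the excluded one so that both determinants have the same dimension, which is an assumption of this example.

```python
import numpy as np

def var_residuals(X, lag):
    """Least-squares VAR(lag) fit; X has shape (T, n). Returns residuals of shape (T - lag, n)."""
    T = X.shape[0]
    # Design matrix [X(t-1), ..., X(t-lag)] for t = lag, ..., T-1.
    Z = np.hstack([X[lag - p : T - p] for p in range(1, lag + 1)])
    Y = X[lag:]
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return Y - Z @ coef

def granger_index(X, i, lag=1):
    """ln(|Sigma_reduced| / |Sigma_full|) for the influence of variable i on the others."""
    others = [k for k in range(X.shape[1]) if k != i]
    res_full = var_residuals(X, lag)[:, others]   # full model, other variables only
    res_red = var_residuals(X[:, others], lag)    # model fitted without variable i
    sigma_full = np.atleast_2d(np.cov(res_full, rowvar=False))
    sigma_red = np.atleast_2d(np.cov(res_red, rowvar=False))
    _, logdet_full = np.linalg.slogdet(sigma_full)
    _, logdet_red = np.linalg.slogdet(sigma_red)
    return logdet_red - logdet_full

# Toy system: variable 0 drives variable 1; variable 2 is independent.
rng = np.random.default_rng(1)
T = 2000
X = np.zeros((T, 3))
noise = rng.standard_normal((T, 3))
for t in range(1, T):
    X[t, 0] = 0.5 * X[t - 1, 0] + noise[t, 0]
    X[t, 1] = 0.8 * X[t - 1, 0] + noise[t, 1]
    X[t, 2] = 0.5 * X[t - 1, 2] + noise[t, 2]

gc_from_0 = granger_index(X, i=0)  # clearly positive: removing variable 0 hurts prediction
gc_from_2 = granger_index(X, i=2)  # close to zero: variable 2 influences nothing
```

The least-squares fit needs at least as many time points as lagged predictors, which is why this direct approach breaks down for high-dimensional data.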
A drawback of the MVAR approach is that model parameter estimation is infeasible for high-dimensional (HD) data. We combined principal component analysis (PCA) and Granger causality to overcome this drawback. The original high-dimensional data were split into two parts: one part consisting of the target regions of interest (ROIs) for which we aimed to explore causality and the other part consisting of the remaining ROIs. For the remaining ROIs, we applied PCA to reduce the dimensionality, transforming the high-dimensional data into a low-dimensional (LD) form, which was then combined with the target ROIs. The predictions were then made using the LSTM model, and Granger causality was calculated thereafter.
Specifically, the time series from the extracted ROIs were divided as follows: $x_i$ and $x_j$ for the target ROIs and $X_r$ for the remaining ROIs. We then performed dimensionality reduction on the remaining ROIs, which can be expressed as follows:
$$Z = W X_r$$
where $W$ is the transformation matrix of a size $c \times (N-2)$, with $N$ the total number of ROIs and $c$ the number of retained principal components. The dimension-reduced time series $Z$ was then merged with $x_i$ and $x_j$.
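The split, reduce, and merge step described above can be sketched with a PCA computed through the singular value decomposition. This is an illustrative sketch under assumptions: the function name, the array layout (time points in rows), and the toy dimensions are hypothetical and do not come from the study's code.

```python
import numpy as np

def reduce_and_merge(ts, i, j, n_components):
    """ts: (T, N) ROI time series. Returns a (T, 2 + n_components) array holding
    the two target series followed by the principal components of the rest,
    plus the fraction of variance those components explain."""
    n_rois = ts.shape[1]
    rest = [k for k in range(n_rois) if k not in (i, j)]
    R = ts[:, rest] - ts[:, rest].mean(axis=0)      # center the remaining ROIs
    # PCA via SVD: the rows of Vt are the principal axes.
    _, s, Vt = np.linalg.svd(R, full_matrices=False)
    W = Vt[:n_components]                           # (c, N - 2) transformation matrix
    Z = R @ W.T                                     # (T, c) reduced time series
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    merged = np.column_stack([ts[:, i], ts[:, j], Z])
    return merged, explained

# Toy data: 130 time points, 20 ROIs (both numbers are placeholders).
rng = np.random.default_rng(2)
ts = rng.standard_normal((130, 20))
merged, explained = reduce_and_merge(ts, i=0, j=1, n_components=10)
```

Returning the explained-variance fraction makes it easy to check a retention criterion such as the 90% threshold mentioned later in this section.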
LSTM networks are particularly suited for estimating brain connectivity because of their ability to model time series with various transmission lags. The inputs include the current time step $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$. The forget gate and memory gate control what information is discarded or retained. After passing through these gates, new information is computed and propagated to the next cell. The sigmoid ($\sigma$) function compresses values to the range [0, 1], while the tanh function maps values to the range [−1, 1]. The flow of data through these gates and the resulting predictions are represented by the orange and green lines in the diagram. To replace the MVAR model, we used the LSTM model to predict the time series:
$$\hat{X}(t) = f_{\mathrm{LSTM}}\big(X(t-1), \ldots, X(t-P)\big), \qquad X(t) = \hat{X}(t) + e(t)$$
Here, $\hat{X}(t)$ is the predicted time series, $f_{\mathrm{LSTM}}$ represents the LSTM model, and $e(t)$ is the estimation error. To compute the Granger causality from $x_i$ to $x_j$, we created two inputs: one combining $x_i$, $x_j$, and the dimension-reduced series $Z$, and the other combining $x_j$ and $Z$ while omitting $x_i$. Using these inputs, the LSTM model produced predictions $\hat{x}_j$, from which we computed the residual errors $e_{\mathrm{full}}$ (with $x_i$ and $x_j$) and $e_{\mathrm{red}}$ (without $x_i$). The LSTM-lsGC index was then calculated as described in Equation (3).
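For reference, the gate operations described verbally above correspond to the standard LSTM cell equations written out below. This is the common textbook formulation rather than a statement of the exact variant used in this study. The weight matrices $W_\ast$, biases $b_\ast$, and the output gate $o_t$ are notation introduced here for illustration, with the input gate $i_t$ playing the role of the memory gate.

```latex
\begin{align}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input or memory gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(updated cell state)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state passed to the next cell)}
\end{align}
```

Because $f_t$ and $i_t$ lie in [0, 1], they act as soft switches on the previous cell state and the candidate update, which is what allows the cell to retain information over different lags.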
This study employed an LSTM network for time series prediction. The model was composed of a single hidden LSTM layer with 10 hidden units, followed by a fully connected output layer. The number of neurons in the output layer corresponds to the dimensionality of the PCA-reduced input features. The PCA reduced the input features to 10 principal components based on an explained variance threshold of over 90% on the training set. To reduce overfitting, L1 regularization was applied to the kernel weights of the LSTM layer, and L2 regularization was applied to the recurrent weights. The model was trained using the RMSprop optimizer (learning rate = 0.005) and the mean squared error (MSE) loss function. The training batch size was set to 12, with a maximum of 200 epochs. An early stopping mechanism was adopted to monitor the validation loss, and training was terminated if no improvement was observed for 5 consecutive epochs.
Hyperparameters were selected according to standard domain criteria and validation performance. The PCA dimension after reduction was chosen to retain over 90% of the explained variance in the input features. The number of hidden units and regularization coefficients were determined to balance model capacity and overfitting. The optimizer learning rate, batch size, and early stopping patience were tuned based on the convergence speed and stability on the validation set. These settings ensured the model could effectively learn temporal patterns while maintaining robustness against overfitting. The above description is summarized in 
Table 2.
Finally, the Granger causality value $F_{i \to j}$ is stored in the $(i, j)$ position of the Granger matrix $G$, which has a size $N \times N$, where $N$ is the number of ROIs, with zeros on the diagonal. Therefore, the matrix contains $N(N-1)$ non-zero LSTM-lsGC values. The overall LSTM-lsGC computation process is illustrated in 
Figure 2.
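The assembly of the Granger matrix amounts to looping over all ordered ROI pairs and leaving the diagonal at zero, as in the sketch below. The function `toy_index` is a hypothetical stand-in for the LSTM-lsGC index of one pair, and the function names are assumptions made for the example.

```python
import numpy as np
from itertools import permutations

def build_granger_matrix(n_rois, pair_index):
    """Fill an n_rois x n_rois matrix with pair_index(i, j) for every i != j."""
    G = np.zeros((n_rois, n_rois))
    for i, j in permutations(range(n_rois), 2):   # ordered pairs, i != j
        G[i, j] = pair_index(i, j)                # influence of ROI i on ROI j
    return G

def toy_index(i, j):
    """Hypothetical placeholder for the causality index of the pair (i, j)."""
    return 1.0 / (1 + abs(i - j))

G = build_granger_matrix(5, toy_index)

# Number of ordered pairs for the 169-ROI HAD template.
n_pairs_had = 169 * 168
```

For the 169-ROI template this gives 28,392 directed pairs, each requiring a full and a reduced model, which illustrates why the per-pair computation needs to be efficient.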
  2.5. Brain Network Features and Feature Selection
The network is composed of nodes and edges, and this is true for brain networks as well. Nodes are defined as ROIs, and edges are defined by the measures of connectivity. We constructed the brain network of each subject following the method described above. Before extracting graph measures, thresholding is usually performed to set edges whose values fall below the threshold to zero. A sparsity threshold was used to define the edges retained in an individual’s brain network. The threshold serves as the cost of network connection, defined as the ratio of suprathreshold connections to the total possible number of connections in the network [
51]. Notably, there is no golden rule for defining a single sparsity threshold, along with the fact that different thresholds cause different results [
23]. Thus, we set the sparsity threshold in the range of 0.05–0.95, with steps of 0.05, so that the graphs would be neither too sparse nor too dense. Afterward, each thresholded connectivity matrix was binarized by setting the non-zero elements to 1 and all others to 0. In our work, the area under the curve of each network graph measure across thresholds was calculated for each subject as a feature to reduce the arbitrariness of the threshold choice [
52,
53]. The global graph measures were small-worldness, network efficiency (global and local efficiency), assortativity, and hierarchy. The local graph measures were the clustering coefficient, shortest path, efficiency, local efficiency, degree centrality, and betweenness centrality. In total, 1019 features, including 5 global features and 1014 local features (6 local measures × 169 ROIs), were obtained for subsequent feature selection.
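The thresholding and area-under-the-curve procedure can be sketched as follows. For brevity, the node-wise out-degree is used as a stand-in for the graph measures listed above. The function names, the toy matrix, and the trapezoidal integration rule are assumptions of this example rather than details confirmed by the study.

```python
import numpy as np

def binarize_at_sparsity(G, sparsity):
    """Keep the strongest off-diagonal edges so that the kept fraction equals `sparsity`."""
    n = G.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    weights = G[off_diag]
    n_keep = int(round(sparsity * weights.size))
    A = np.zeros_like(G, dtype=int)
    if n_keep > 0:
        cutoff = np.sort(weights)[-n_keep]        # weight of the weakest kept edge
        A[(G >= cutoff) & off_diag] = 1
    return A

def out_degree(A):
    """Number of outgoing edges per node (stand-in for a local graph measure)."""
    return A.sum(axis=1)

def auc_over_sparsity(G, measure, sparsities):
    """Trapezoidal area under the measure-versus-sparsity curve, one value per node."""
    values = np.array(
        [measure(binarize_at_sparsity(G, s)) for s in sparsities], dtype=float
    )
    widths = np.diff(sparsities)[:, None]
    return (widths * (values[1:] + values[:-1]) / 2).sum(axis=0)

sparsities = np.arange(1, 20) * 0.05              # 0.05, 0.10, ..., 0.95
rng = np.random.default_rng(3)
G = rng.random((6, 6))                            # toy directed connectivity matrix
np.fill_diagonal(G, 0.0)
degree_auc = auc_over_sparsity(G, out_degree, sparsities)
```

Integrating over the whole sparsity range yields a single feature per node and measure, so no individual threshold has to be justified.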
Feature selection reduces the number of features and the dimensionality by removing redundant features; it can also improve classification performance and reduce overfitting. Feature selection methods are generally divided into three types: filter, wrapper, and embedded. The minimum redundancy maximum relevance (mRMR) algorithm is a filter-based feature selection method proposed by Peng et al. [
54]. This algorithm selects features using mutual information, correlation, or similarity scores, considering both feature-label correlation and feature-feature correlation. The relevance of feature set $F$ and label $c$ is defined by the average of all mutual information values between individual features $f_i$ and the label $c$, and it is as follows:
$$D(F, c) = \frac{1}{|F|} \sum_{f_i \in F} I(f_i; c)$$
The redundancy of all features in the set $F$ is the average of all mutual information values between feature $f_i$ and feature $f_j$:
$$R(F) = \frac{1}{|F|^2} \sum_{f_i, f_j \in F} I(f_i; f_j)$$
The feature selection algorithm used in this study can maintain a strong correlation between the features and class while reducing the number of features.
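A greedy version of the relevance-minus-redundancy criterion can be sketched for discrete features as below. This is a simplified illustration, not the implementation used in the study: the function names and toy data are hypothetical, mutual information is estimated from raw counts, and continuous graph measures would first need to be discretized.

```python
import numpy as np

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_vals.size, b_vals.size))
    np.add.at(joint, (a_idx, b_idx), 1)           # contingency table
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr(features, labels, k):
    """Greedy mRMR: repeatedly add the feature with the highest
    relevance minus mean redundancy with the already selected features."""
    n_features = features.shape[1]
    relevance = [mutual_information(features[:, f], labels) for f in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for f in range(n_features):
            if f in selected:
                continue
            redundancy = np.mean(
                [mutual_information(features[:, f], features[:, s]) for s in selected]
            )
            score = relevance[f] - redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected

labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
features = np.column_stack([
    [0, 0, 0, 1, 1, 1, 1, 1],   # informative
    [0, 0, 0, 1, 1, 1, 1, 1],   # exact duplicate of the first feature
    [0, 0, 1, 0, 1, 1, 0, 1],   # weakly informative, low redundancy
    [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated to the label
])
selected = mrmr(features, labels, k=2)
```

In this toy example the duplicated feature is skipped in favor of the weaker but less redundant one, which is the behavior the criterion is designed to produce.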
  2.6. Feature Ranking and Classification
In this study, the three groups of subjects were divided into a training set and a test set at an 8:2 ratio. The subset of features obtained after feature selection was ranked by the importance scores derived from the random forest (RF) classifier, with higher scores indicating higher rankings. Following this ranking, features were added sequentially to the classification set, and the subset of features corresponding to the maximum accuracy was selected. We applied a grid search algorithm to identify the best parameters, with the number of trees ranging from 1 to 100 at an interval of 10. The accuracy of the RF classifiers based on these subsets of features was calculated using the fivefold cross-validation approach. To avoid feature redundancy and reduce the computational cost, we added a stopping rule: if the accuracy remained unchanged for 10 consecutive iterations, the process was terminated and the current result was retained. Because unequal numbers of subjects in the groups can bias the classification results, the class weights were updated according to the number of subjects in each class. This procedure for determining the maximum accuracy and the corresponding best feature subset was repeated 50 times.
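The sequential addition of ranked features with a stopping rule can be sketched independently of the classifier. In the sketch below, `score_fn` is a hypothetical callable that would return the cross-validated accuracy for a feature subset. The function name and the interpretation of "unchanged" as equal to the previous step's score are assumptions of this example.

```python
def best_prefix(ranked_features, score_fn, patience=10):
    """Add ranked features one at a time and return the best-scoring prefix.
    Stops early once the score has stayed unchanged for `patience` consecutive steps."""
    best_score, best_subset = float("-inf"), []
    previous, unchanged = None, 0
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = score_fn(subset)
        if score > best_score:
            best_score, best_subset = score, list(subset)
        unchanged = unchanged + 1 if score == previous else 0
        previous = score
        if unchanged >= patience:
            break
    return best_subset, best_score

# Toy scoring function: accuracy rises for three features, then plateaus.
calls = []

def toy_score(subset):
    calls.append(len(subset))
    return min(len(subset), 3) / 4

ranked = list("abcdefghijklmnopqrst")
subset, score = best_prefix(ranked, toy_score, patience=10)
```

With the toy scoring function, the search stops after 13 evaluations instead of 20 and returns the first three features, which is the smallest prefix reaching the plateau.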
To ensure the stability and reliability of the classification results, especially given the relatively small number of rMCI subjects (
n = 7), we adopted a fivefold cross-validation strategy repeated 50 times. Fivefold cross-validation evaluates the model with different training and testing splits, capturing variability due to the specific partitioning of the dataset. Repeating the cross-validation 50 times further accounted for variability from random initialization and fold assignment. The performance metrics from all repetitions were visualized using violin plots which display the full distribution, including the quartiles and minimum and maximum values. This combination provides a comprehensive empirical assessment of model uncertainty and robustness. Conceptually, it serves the same purpose as bootstrap or Bayesian uncertainty analysis, capturing variability in model performance while being computationally simpler and more reliable for small sample sizes [
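55]. The accuracy (ACC), sensitivity (SEN), and specificity (SPE) are used to measure the performance and test the stability and robustness of the classification, and their formulas are as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPE} = \frac{TN}{TN + FP}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.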
To ensure consistency for the evaluation metrics (ACC, SEN, and SPE), we adopted a unified procedure across all experiments. For each repetition, predictions from all cross-validation folds were concatenated, and the metrics were calculated on this concatenated set. Sensitivity (SEN) was defined with rMCI as the positive class, specificity (SPE) had sMCI as the negative class, and accuracy (ACC) was defined as usual. Each classification task, including rMCI vs. sMCI, was repeated 50 times with random shuffling, and the reported performance represents the average across these repetitions.
These metrics were calculated for each label (e.g., label 0, label 1, and label 2), and the sensitivity and specificity were then averaged across all labels. All metrics (ACC, SEN, and SPE) were calculated following the same procedure across both triple and binary classifications. The AUC is the area under the ROC curve, and to obtain this value, we first converted both the predicted labels and the original labels into one-hot encoding. The true positive rate (TPR) and false positive rate (FPR) were calculated for each label at various thresholds to construct the receiver operating characteristic (ROC) curve. Finally, we computed the area under the ROC curve (AUC) to evaluate the model’s overall performance.
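The per-label computation followed by averaging can be sketched as below, treating each label in turn as the positive class. The function name and the toy label vectors are hypothetical, and the sketch assumes every class appears at least once among the true labels so that no denominator is zero.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Overall accuracy plus sensitivity and specificity averaged over labels (one-vs-rest)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    sens, spec = [], []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        tn = np.sum((y_true != c) & (y_pred != c))
        fp = np.sum((y_true != c) & (y_pred == c))
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return acc, float(np.mean(sens)), float(np.mean(spec))

# Toy predictions concatenated over cross-validation folds (hypothetical values).
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2]
acc, sen, spe = per_class_metrics(y_true, y_pred, n_classes=3)
```

Averaging over labels weights each class equally, so a small class such as rMCI contributes as much to the reported sensitivity as the larger groups.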
  3. Results
  3.1. Classification Results
In our work, the PGC, LSTM-GC, and LSTM-lsGC models were used to calculate EC matrices based on entire brain regions extracted from the HAD template. The EC of brain regions from the AAL and Brainnetome (246 ROIs) was also obtained with LSTM-lsGC. Then, brain networks were constructed, and graph measures were extracted as features to study the progression of MCI. Aside from that, the mRMR algorithm was applied to reduce feature redundancy and improve the operation speed. The selected features were ranked by the random forest classifier and then used for classification, and the features corresponding to the highest classification accuracy were obtained. The classification performances of three groups with 50 repeats are shown in 
Figure 3. The plots display the distribution of performance indices, with the 25%, 50%, and 75% quartiles, standard deviation (std), and minimum and maximum values annotated. This distribution-based visualization provides information equivalent to bootstrap or confidence interval analysis, thereby illustrating the uncertainty and stability of the model performance in the small rMCI sample.
In the triple classification presented in 
Figure 3, the average accuracy of classification was 84.92% using the HAD template and LSTM-lsGC. The maximum accuracy reached 91.67%, with a corresponding area under the ROC curve (AUC) of 0.9131. The accuracy range (min–max) was the same, 75–87.5%, when the other two methods were used to construct brain networks. However, as can be seen from the figure, the accuracy distributions differed: with our proposed LSTM-lsGC method, the accuracy was mainly distributed above 80%, and the average sensitivity and specificity were also superior.
The average classification results for 50 repetitions, which not only contained the triple classification but also binary classification among three groups, are shown in 
Table 3. In binary classification, we classified each pair of groups, using graph measures of the EC obtained from HAD and LSTM-lsGC as features. In the rMCI vs. sMCI group, the average accuracy of the classification was 89.09%. In the rMCI vs. pMCI group, the optimal accuracy could reach 100%, and the average sensitivity, specificity, and AUC were 85.8%, 85.8%, and 0.949, respectively. In the sMCI vs. pMCI group, the average accuracy and AUC were 86.71% and 0.8871, respectively.
  3.2. Features Used in Classification
In the process of constructing the template, the differmask was constructed by matching the difference voxels between the HC and AD groups to the AAL and redefining them as new ROIs. The differmask added in the HAD (i.e., ordinal numbers 117–169) corresponded to the set of brain regions in the AAL template as {1, 3, 7, 17, 18, 19, 20, 33, 34, 35, 37, 38, 40, 42, 43, 44, 45, 46, 47, 48, 51, 53, 55, 56, 57, 58, 59, 61, 64, 67, 68, 69, 70, 71, 72, 74, 75, 76, 77, 83, 85, 86, 87, 88, 89, 90, 97, 98, 100, 101, 105, 114, 116} (each number represents the ordinal number in the AAL). The subset of features with the highest accuracy was retained in the random forest classifier based on the ranked importance of the features, and the number of feature occurrences in the repeated classification process was counted. The higher the importance rank of a feature and the more often it was used in classification, the more significant the difference between the two groups of subjects in that feature.
There were a total of 25 features that appeared with a frequency of no less than 70% in the subset of features from the repeated experiments, and four of the cerebellar features were of concern: brain regions 103, 112, 116 in the AAL and the corresponding brain region 101 of the AAL in the HAD. The rest of the features are shown in 
Table 4. In the table, the most widely used graph attribute was local efficiency, appearing seven times, followed by the shortest path (five) and clustering coefficient and betweenness centrality (both appearing four times), while the least used ones were the degree centrality and efficiency at three and two times, respectively. Of the 25 features, 12 corresponded to the AAL, while 13 belonged to the features in the differmask. As can be observed in the table, these features contained 15 brain regions, of which 2, 3, 2, 4, and 4 belonged to the prefrontal, frontal, occipital, parietal, and temporal lobes, respectively. The shortest path, betweenness centrality, and degree centrality of the precentral gyrus (PreCG), as well as the efficiency, shortest path, and degree centrality of the middle frontal gyrus (MFG), are all important features displayed in the table. In addition, the supramarginal gyrus (SMG) and lingual gyrus (LING) also have two important features in the table.
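The occurrence counting described in this subsection can be sketched as follows. The example is illustrative only: the function name `stable_features`, the feature labels, and the four toy repetitions are hypothetical, and the random forest ranking that produces each retained subset is not shown.

```python
from collections import Counter

def stable_features(selected_per_run, min_frequency=0.70):
    """Return features retained in at least `min_frequency` of the runs."""
    n_runs = len(selected_per_run)
    # Count each feature at most once per run.
    counts = Counter(f for subset in selected_per_run for f in set(subset))
    return {
        feature: count / n_runs
        for feature, count in counts.items()
        if count / n_runs >= min_frequency
    }

# Hypothetical feature subsets retained in four repetitions.
selected_per_run = [
    ["PreCG_shortest_path", "MFG_efficiency", "LING_clustering"],
    ["PreCG_shortest_path", "MFG_efficiency"],
    ["PreCG_shortest_path", "SMG_local_efficiency"],
    ["PreCG_shortest_path", "MFG_efficiency", "SMG_local_efficiency"],
]
print(stable_features(selected_per_run))
```

With the 70% threshold used here, only features that recur in most repetitions survive, which is how a list such as the 25 features above would be obtained from 50 repetitions.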
  3.3. Significance Test for Effective Connectivity
A Kruskal–Wallis H test and one-way ANOVA tests were performed to test for significant differences in the effective connectivity matrices of subjects with the three MCI states. Connection strengths with monotonic variations (i.e., increasing or decreasing) across rMCI, sMCI, and pMCI were screened before the significance test. Subsequently, multiple comparisons were corrected by controlling the false discovery rate (FDR), using Dunn’s test with two-stage FDR correction [
56,
57]. The results of the significance test for effective connection strength with monotonic variations in three groups are shown in 
Table 5, and the visualized brain connections are shown in 
Figure 4. The figure contains 23 edges, 11 of which involve ROIs in differmask. It contains 39 nodes: 10 ROIs are located in differmask and 30 in AALdel, with two overlapping ROIs constituting a single node. In addition, seven nodes each have two significantly different edges, corresponding to the following seven ROIs: the opercular part of the inferior frontal gyrus (IFGoperc.L, 11), the left and right gyrus rectus (REC.L, 27 and REC.R, 28, respectively), the hippocampus (HIP.R, 38), the precuneus (PCUN.L, 146), the middle temporal gyrus of the temporal pole (TPOmid.R, 88 and 160), and the inferior temporal gyrus (ITG.R, 162).
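The screening and testing steps of this subsection can be outlined in code. The following is a simplified sketch under stated assumptions, not the procedure used in this study: monotonicity is judged on group means, a plain Benjamini–Hochberg adjustment stands in for the two-stage FDR procedure, Dunn's post hoc test and the ANOVA are omitted, and the function names, group sizes, and synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import kruskal

def is_monotonic(group_means):
    """True if the group means strictly increase or strictly decrease."""
    diffs = np.diff(group_means)
    return bool(np.all(diffs > 0) or np.all(diffs < 0))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (stand-in for two-stage FDR)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

def screen_and_test(groups, alpha=0.05):
    """groups: arrays of shape (n_subjects, n_edges), ordered rMCI, sMCI, pMCI."""
    n_edges = groups[0].shape[1]
    # Step 1: keep edges whose mean strength changes monotonically.
    candidates = [
        e for e in range(n_edges)
        if is_monotonic([g[:, e].mean() for g in groups])
    ]
    if not candidates:
        return []
    # Step 2: Kruskal-Wallis H test per retained edge, then FDR adjustment.
    pvals = [kruskal(*(g[:, e] for g in groups)).pvalue for e in candidates]
    adjusted = benjamini_hochberg(pvals)
    return [e for e, p in zip(candidates, adjusted) if p < alpha]

rng = np.random.default_rng(0)
sizes = (7, 20, 20)  # placeholder group sizes
groups = [rng.normal(0.0, 0.5, size=(n, 5)) for n in sizes]
for shift, g in zip((0.0, 2.0, 4.0), groups):
    g[:, 0] += shift  # edge 0 is made to increase across the groups
print(screen_and_test(groups))
```

Screening before testing reduces the number of hypotheses entering the FDR adjustment, which is one reason to apply the monotonicity filter first.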
  4. Discussion
In the present study, based on the resting-state fMRI data, three states of MCI progression (rMCI, sMCI, and pMCI) were investigated by utilizing graph theory, machine learning, and neural networks. The major findings of this study are as follows. (1) The superiority of the proposed classification model based on large-scale nonlinear effective connectivity was demonstrated according to the results of the classification. (2) With the combination of connection matrices and graph theory, the triple classification accuracy could reach 84.61%, with an ROC_AUC of 0.873. We discovered that brain networks and cortical areas changed differently as MCI progressed to different states. Notably, achieving high-accuracy classification and investigating changes in brain regions across rMCI, sMCI, and pMCI using rs-fMRI is challenging, since the changes in resting-state brain networks across these three states are slight.
  4.1. Discussion of Classification Performance
The triple classification performance in 
Table 3 and 
Figure 3 illustrates the power of HAD along with LSTM-lsGC in exploring MCI progression. In terms of the selection of ROIs, HAD has a different number of ROIs, although it shares the same voxels as AAL. We further subdivided the ROIs in the AAL template by comparing the differences in ALFF of the brain voxels between HC and AD groups to derive a more fine-grained brain template relevant to dementia. Using HAD yielded a classification accuracy range from 79.17% to 91.67%, while the accuracy range of AAL was from 75% to 87.5%. HAD thus outperformed AAL in classification, which suggests that the proposed template is effective and reliable in distinguishing the three MCI types. Compared with the Brainnetome atlas, which contains 246 brain regions, the HAD template demonstrated superior performance in the three-class classification task, despite having fewer ROIs. This suggests that the HAD template captures more relevant or distinct features for differentiating between MCI states. Its ability to focus on significant brain regions rather than simply increasing the number of ROIs may contribute to its enhanced diagnostic accuracy. The results highlight the potential advantage of focusing on the quality of region selection and template construction, emphasizing the importance of extracting high-quality brain regions in neuroimaging-based classification. In terms of effective connectivity brain network construction, the average classification accuracies obtained with the feature sets of PGC, LSTM-GC, and LSTM-lsGC were 80.92%, 82.5%, and 84.92%, respectively. These results indicate that the proposed effective connectivity method, LSTM-lsGC, can provide effective and reliable prediction of MCI progression. 
Our proposed method not only overcomes the inability of the traditional Granger causality method to capture nonlinear signals in the brain but also exploits the advantages of LSTM in time series prediction, such as dynamic adaptation to time scales, to further improve prediction precision. The relatively low sensitivity (60.14%) in the three-class classification was primarily due to class imbalance, particularly the small number of rMCI subjects (n = 7). We mitigated this by using stratified train and test splits and adjusting the class weights, yet the limited rMCI sample still constrained the sensitivity. This underscores the difficulty of accurately classifying rare subgroups in small datasets. Future studies with larger, balanced cohorts are needed to improve the sensitivity and better capture the characteristics of rMCI individuals. As can be seen in 
Figure 3, the framework of extracting ROIs’ time series using HAD and calculating effective connections via LSTM-lsGC achieved the best classification performance, with the accuracy mainly distributed above 80%, and the distributions of the sensitivity, specificity, and ROC_AUC were also significantly higher than those of the other two groups. The results of our work confirmed our expectations, not only improving the performance of triple classification but also enabling a multifaceted exploration of the MCI transformation process through self-defined templates and extracted features.
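The imbalance mitigation mentioned above (stratified splits and class weights) can be illustrated with scikit-learn. This is a hedged sketch on synthetic data, not the configuration used in this study: only the rMCI count (n = 7) comes from the text, while the other group sizes, the feature matrix, the test fraction, and the forest settings are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder cohort: only the rMCI count (7) comes from the text.
y = np.array([0] * 7 + [1] * 30 + [2] * 23)      # 0 = rMCI, 1 = sMCI, 2 = pMCI
X = rng.normal(size=(y.size, 10)) + y[:, None]   # synthetic features

# Stratification keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# "balanced" weights follow n_samples / (n_classes * class_count).
counts = np.bincount(y_train)
class_weights = y_train.size / (counts.size * counts)

clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Per-class sensitivity (recall) on the held-out split.
per_class_sensitivity = recall_score(
    y_test, clf.predict(X_test), labels=[0, 1, 2], average=None, zero_division=0
)
print(dict(zip(["rMCI", "sMCI", "pMCI"], class_weights.round(2))))
print(per_class_sensitivity)
```

With so few rMCI subjects, a stratified test split of this size contains only one or two of them, so the per-class sensitivity for rMCI moves in large steps between repetitions, which is consistent with the constrained sensitivity noted above.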
MCI progression has been a focus of research in recent years, but it has not yet been explored thoroughly. Notably, few studies on MCI progression have used rs-fMRI data. The eight articles included in 
Table 6 are studies of MCI conversion based on rs-fMRI data in recent years [
15,
16,
21,
23,
24,
25,
26,
58]. Except for that of Hojjati et al. (2017), these studies are based on a combination of rs-fMRI and other imaging modalities. It is noteworthy that while previous studies have investigated the utility of FC for classifying pMCI versus sMCI, the application of EC to this classification has not been explored thoroughly. Compared with previous studies, we constructed brain networks based on effective connectivity calculated by the LSTM-lsGC model. These networks are directed rather than undirected graphs. The model not only filters valid information over the time series but can also take into account the causal relationships between large-scale ROIs. Moreover, the reliability and validity of the model were demonstrated by the strong and stable classification outcomes. Most importantly, unlike previous studies, our study not only focused on the variability in brain network changes between the two groups of patients (pMCI vs. sMCI) but also investigated differences regarding MCI reversal. 
Table 3 shows the results of binary classification between the three groups of subjects, with average accuracies of 89.09%, 86.71%, and 90.86% for rMCI vs. sMCI, sMCI vs. pMCI, and rMCI vs. pMCI, respectively. As can be observed in 
Table 6, our results exceeded those of most of the previous studies, even though these studies utilized data from multiple modalities. This further validates that the effective connectivity based on fMRI can reveal more potent information in the progression of MCI, and it also indicates that there is still much information worth exploring in fMRI data. Meanwhile, it is noteworthy that the classification performance for the two pairs of adjacent states, namely rMCI vs. sMCI and sMCI vs. pMCI, was lower, whereas the accuracy of rMCI vs. pMCI, the two states with larger differences, was higher. This reveals that sMCI, as a stable state between reversion to normal cognition and progression to AD, differed less from each of the other two states in the rs-fMRI brain network than rMCI and pMCI differed from each other. Moreover, when comparing the classification performance in 
Table 3 and 
Table 6, it can be found that the combination of our proposed large-scale nonlinear effective connectivity model, functional connectivity, and graph theory was notably effective in recognizing the three states of MCI. Compared with [
15,
16], our framework focuses on MCI progression using a smaller and clinically distinct dataset (rMCI, sMCI, and pMCI), whereas their studies primarily investigated EMCI and LMCI or NC and AD groups with larger sample sizes. By integrating the HAD brain template to ensure anatomical consistency, applying PCA for effective dimensionality reduction, employing LSTM-based nonlinear modeling to enhance causal estimation accuracy, and extracting network features from graph attribute curve areas, our approach achieved high classification accuracy (up to 89.09%) even on small and imbalanced MCI datasets. More importantly, it enriches fMRI analytical methodologies and provides a new methodological perspective for understanding the mechanisms underlying MCI progression, thereby deepening insights into the developmental trajectory of MCI.
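The idea of extracting features from graph attribute curve areas can be sketched as follows. The sketch makes several assumptions that this section does not specify: the curve is taken over a range of sparsity thresholds, out-degree centrality stands in for the graph measures used in this study, the area is computed with the trapezoidal rule, and the function names and the random 6 × 6 matrix are hypothetical.

```python
import numpy as np

def binarize_by_sparsity(ec, sparsity):
    """Keep the strongest directed off-diagonal connections.

    `sparsity` is the fraction of possible directed edges that survive.
    """
    n = ec.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    strengths = np.abs(ec[off_diag])
    k = max(1, int(round(sparsity * strengths.size)))
    cutoff = np.sort(strengths)[-k]
    return ((np.abs(ec) >= cutoff) & off_diag).astype(int)

def out_degree_centrality(adj):
    """Fraction of other nodes that each node sends an edge to."""
    n = adj.shape[0]
    return adj.sum(axis=1) / (n - 1)

def curve_area(values, x):
    """Trapezoidal area under each node's attribute curve.

    values: (n_thresholds, n_nodes); x: (n_thresholds,)
    """
    values = np.asarray(values, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.sum((values[1:] + values[:-1]) / 2 * np.diff(x)[:, None], axis=0)

rng = np.random.default_rng(0)
ec = rng.normal(size=(6, 6))  # hypothetical effective connectivity matrix
sparsities = np.array([0.1, 0.2, 0.3, 0.4])
curves = np.array(
    [out_degree_centrality(binarize_by_sparsity(ec, s)) for s in sparsities]
)
areas = curve_area(curves, sparsities)  # one feature per node
print(areas)
```

Integrating over thresholds yields one value per node and measure, so the resulting feature does not depend on any single threshold choice.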
  4.2. Analysis of Brain Region
Graph theory has been previously reported to enhance classification performance in AD, and our findings are consistent with this. As illustrated in 
Table 4, several graph-theoretic measures related to the precentral gyrus (PreCG) appeared in more than 70% of classifications, including the local efficiency in the AALdel network and the shortest path, degree centrality, and betweenness centrality in the differmask network. The PreCG, part of the sensorimotor network, is responsible for somatic sensation and movement. Previous research has suggested that impairments in the sensorimotor system may precede cognitive decline in AD, affecting daily living activities and accelerating cognitive deterioration [
8,
10]. Importantly, in our study, the PreCG, particularly in the differmask network, exhibited significant differences in ALFF between the healthy controls and AD patients. It also played a key role in classifying the three MCI states, underscoring the role of abnormal connectivity in the PreCG as a marker of cognitive decline [
59]. Additionally, the clustering coefficient of the right lingual gyrus (LING.R) in the differmask network and its local efficiency in the AALdel network were critical features in the triple classification task. Previous PET studies have identified significant abnormalities in the lingual gyrus in AD patients [
60,
61], and our findings are in line with these observations. The lingual gyrus appears to be involved in the progression of MCI and early-stage AD, further confirming its importance in understanding cognitive impairment.
During the repeated classification experiments, the clustering coefficients of the inferior temporal gyrus (ITG) appeared in 72% of the classifications, indicating its crucial role in distinguishing between rMCI, sMCI, and pMCI states. Additionally, significant differences were observed in the effective connections from Postcentral_R and Cerebellum_7b_L to the ITG (162, Temporal_Inf_R). The ITG is known to play a key role in verbal fluency and cognitive function, and dysfunction in this region is thought to contribute to early-stage AD-related clinical symptoms. Previous studies have highlighted the ITG’s involvement in the prodromal phase of AD, reinforcing its relevance in early cognitive impairment [
21,
62]. The default-mode network (DMN) plays a central role in cognitive processes such as episodic memory, introspection, and emotional regulation. Research by Ibrahim et al. identified the DMN as a potential biomarker for AD and MCI classification [
19]. The DMN can be subdivided into the pre-DMN and post-DMN [
63], with the hippocampus (HIP) being a key component of the post-DMN. In our study, two effective connections related to the hippocampus—Hippocampus_R → Insula_L and Precentral_L → Hippocampus_R—showed significant differences across the three MCI groups. The enhanced connectivity associated with the hippocampus suggests a potential compensatory mechanism, where the brain increases connectivity in specific regions to compensate for network damage.
Another key region, the precuneus (PCUN), is a vital part of the DMN. Previous studies have reported disproportionate atrophy of the precuneus in early-onset AD [
61,
64], and our results align with this finding. Specifically, we observed significant reductions in the strength of effective connections between the PCUN and both Amygdala_L and Cuneus_R, suggesting a disruption in the DMN’s connectivity during MCI progression. Furthermore, we identified 13 brain regions within the frontal cortex that showed significant differences in connectivity. The frontal lobes are critical for language, decision making, and cognitive tasks [
65], and their involvement in MCI progression is consistent with existing knowledge. Notably, the cerebellum, traditionally associated with motor control, also emerged as a key feature in distinguishing MCI states, further supporting its role in cognitive functions related to MCI progression. These findings suggest that the cerebellum should be considered in future studies exploring MCI progression. Our results, consistent with previous research, highlight the diagnostic value of various brain regions and graph-theoretical measures derived from effective connectivity. These brain networks and connectivity measures offer potential biomarkers for differentiating MCI stages, enhancing our understanding of the neural mechanisms underlying cognitive impairment.
The HAD template showed significant advantages in identifying specific brain network changes, though there is still room for improvement in some binary classification tasks. Additionally, due to the limited sample size, there may be some bias in the study results. While data augmentation, transfer learning, or resampling techniques (e.g., SMOTE) could potentially improve classification for rare classes, the current rs-fMRI dataset does not support the reliable generation of synthetic samples. Moreover, when applying transfer learning, variations across datasets, such as differences in scanner models, acquisition parameters, and scanning environments, can lead to inconsistencies in fMRI signal characteristics [
66]. These inconsistencies may affect the training process and reduce the model’s generalization performance. Meanwhile, compared with pMCI and sMCI, rMCI individuals often exhibit fluctuating cognitive performance and diverse neuropathological changes, resulting in greater intra-group variability. Such heterogeneity may obscure consistent neuroimaging patterns, thereby reducing classification sensitivity even when the overall accuracy is maintained. Future studies with larger and more balanced samples are needed to further validate this finding.
In the future, we plan to incorporate larger datasets to further optimize the HAD template and improve its performance across all classification tasks. We will also explore the combination of different brain region templates to achieve more comprehensive diagnostic capabilities. Although the proposed pipeline demonstrated promising performance in identifying reversion within the current cohort, external validation is crucial to confirm its generalizability. Due to the rarity of rMCI subjects, we currently do not have access to an independent cohort for validation. To partially mitigate this issue, we repeated the classification 50 times and conducted uncertainty analyses (quartiles, standard deviation, and min and max values) to assess model stability, which showed consistent results. Nevertheless, we acknowledge this limitation and emphasize that future work will focus on validating the pipeline using external datasets (e.g., ADNI) and larger multi-center cohorts. Despite this limitation, the research findings demonstrate the advantages of our method compared with others. Our work not only provides new insights and perspectives into the progression of MCI but also holds significant importance for the diagnosis and treatment of AD, given that MCI is a precursor stage of AD. Consequently, our work and findings offer further insights into the AD field, making valuable contributions to future research and clinical diagnosis.