1. Introduction
Alzheimer’s disease (AD), the most common form of dementia among older adults, is characterized by a progressive decline in cognitive function and the pathological accumulation of amyloid-beta plaques and tau tangles [
1]. The disease’s complexity is further compounded by its multifactorial nature, involving genetic, environmental, and molecular factors. Understanding the pathogenesis of AD at a cellular and molecular level has been a significant challenge, largely due to the heterogeneous nature of the brain and the intricate interactions between various cell types [
2]. The considerable interval between the onset of initial pathophysiological changes and the emergence of clinical symptoms suggests an Alzheimer’s disease continuum, encompassing various transitional stages. At the earliest point in this continuum is the preclinical AD phase. Following this is the prodromal stage known as mild cognitive impairment (MCI), characterized by cognitive deficits that do not significantly interfere with daily activities. Beyond MCI lies the dementia phase [
3].
The development of single-cell RNA sequencing (scRNA-seq) has provided an unprecedented opportunity to explore multiple complexities at the resolution of individual cells [
4,
5]. scRNA-seq technology has emerged as the leading method for deciphering the diversity and intricacies of RNA transcripts at the individual cell level. It enables the exploration of the composition, functions, and heterogeneity within various organized tissues, organs, or organisms. The procedures of scRNA-seq primarily involve single-cell isolation and capture, cell lysis, reverse transcription (conversion of RNA into cDNA), cDNA amplification, and library preparation [
6]. This technology allows detailed examination of gene expression patterns within single cells, offering insights into the cellular composition of tissues and the distinct role of different cell types between healthy and diseased conditions. In the context of AD, scRNA-seq has the potential to reveal how different brain regions are uniquely affected by the disease, highlighting variations in cellular responses and molecular pathways [
7,
8]. Studies using scRNA-seq have reported increased amyloid-beta secretion in AD olfactory mucosal cells, together with detailed cell-type-specific gene expression patterns, five distinct cell populations, and 240 disease-associated genes that were differentially expressed relative to cognitively healthy controls [
9]. Specific transcriptional changes in different cell types such as neurons, astrocytes, and microglia from post-mortem human brain tissue of AD patients and control subjects have also been identified, revealing distinct transcriptional alterations in glial cells and suggesting their pivotal roles in the disease’s progression [
10].
In a previous work by our group, gene–gene interaction networks integrated with scRNA-seq expression profiles were constructed, and the most active subnetworks were isolated from the entire network topology [
11]. Moreover, an analysis combining deep learning and machine learning on scRNA-seq data from the peripheral blood of amyloid-positive AD patients and amyloid-negative healthy controls identified differentially expressed genes that were mainly enriched in the regulation of the immune system, interferon-gamma-mediated signalling, and the cellular defence response [
12]. Drawing upon data from a database called scREAD (single-cell RNA-seq Database for Alzheimer’s Disease), another study centred on astrocytes isolated from the entorhinal cortex of both AD patients and healthy individuals. The study identified differentially expressed genes and extracted disease-specific pathways and gene ontologies, along with predicting drugs and natural products capable of regulating AD-specific signatures in astrocytes [
13]. Furthermore, disruptions in synaptic signalling and cell-cycle regulation have been observed across different cell types in the prefrontal cortex of AD patients, offering insights into the mechanisms of neuronal dysfunction and degeneration in the disease and highlighting potential therapeutic targets [
14]. Variations in immune response genes and disruptions in the insulin/IGF1 signalling pathways have also been identified. These changes are crucial for understanding the disease’s early stages and may serve as biomarkers for early detection, intervention, and monitoring of the disease’s progression [
15]. An upregulation of the insulin/IGF1 signalling machinery seems contrary to the notion of central insulin resistance. Alternatively, it might represent a compensatory mechanism that enhances neuroprotection in areas of the Alzheimer’s disease brain that have not yet experienced neuronal loss [
15]. Pathways and neurotransmitters involved in AD are summarized in
Figure 1.
The datasets utilized for the purposes of the present work were obtained from the scREAD database [
16], which encompasses both scRNA-seq and snRNA-seq datasets derived from postmortem human brain tissue exhibiting AD and from animal models with AD pathology. Control datasets sourced from healthy, non-AD samples were also included. By employing advanced computational methods, researchers can integrate scRNA-seq datasets from various brain regions to form a holistic view of the disease. In a recent work by our group, scRNA-seq data from the mouse cortex and hippocampus of healthy and AD samples were compared, and differentially expressed genes were observed, mainly enriched in muscarinic acetylcholine receptors, dopamine receptors, and perisynaptic extracellular matrix [
17]. The present study leverages computational techniques to analyse scRNA-seq data from multiple brain regions impacted by AD and to reinforce existing evidence that AD manifests differently in different brain regions and cell types. RNA-seq can explore differential gene expression across multiple brain regions, which poses the new challenge of identifying key biological processes from a molecular perspective. By synthesizing data across the entorhinal cortex, prefrontal cortex, superior frontal gyrus, and superior parietal lobe, we aimed to build a comprehensive model of AD’s impact on the brain’s cellular and molecular landscape. This allows for the comparison of cellular and molecular profiles across different areas, identifying common and distinct elements of the pathology. Through this approach, we emphasize the potential of computational analyses to deepen our understanding of neurodegenerative diseases such as AD.
3. Results
The findings of our analysis are visualized with a series of tools designed to aid interpretation. These include the ranking of genes associated with each cluster and a heatmap that illustrates the marker genes and their expression patterns across the clusters. Through these visualization techniques, we highlight the biological differences between distinct clusters and point to potentially unique cellular identities or states. This approach enriches our understanding of the dataset’s underlying biology and supports further exploration of the cellular intricacies of the brain. The series of graphs represents a comparative analysis of gene expression distributions within individual clusters against the backdrop of all other cells not included in those clusters (
Figure 2 and
Figure S1). Each plot is labelled with a cell identifier signifying the specific cluster being analysed against the collective remainder of the dataset. In each subplot, the horizontal axis, marked as ‘ranking’, orders the genes from left to right based on their relative importance or impact within the cluster’s gene expression profile. The ‘score’ on the vertical axis quantifies the level of differential expression, with higher scores potentially indicating a greater degree of upregulation within the cluster compared to the rest of the cells.
Points plotted in each graph, as
Figure 2 shows, represent individual genes, with their position reflecting their ranked relevance and expression score within the cluster. For instance, a gene that appears toward the left with a high score is of substantial significance within that cluster and shows a notably different expression level compared to cells outside of the cluster. The discrete distribution of points across the ranking spectrum allows us to discern which genes are most characteristic of each cluster. This comparative graphical approach is invaluable for highlighting the genes that distinguish each cluster from the rest, thereby providing a detailed view into the molecular identity of each cell population. By observing the patterns and positions of these genes across the series of plots, researchers can draw conclusions about the biological processes that may be predominant in each cluster and identify potential targets for further experimental investigation.
Furthermore, gene expression across distinct categories was visualized using heatmaps, as
Figure 3 illustrates. Expression values are systematically arranged and grouped by category, providing a clear differentiation between gene expression levels. Within this matrix plot, each column corresponds to a category or cluster, such as a cell type, tissue, or experimental condition. For each cluster within the plot, the expression of genes is quantified as a fold change, a measurement comparing the expression level of a gene in one condition with its level in a reference condition. A positive fold change value signifies upregulation of the gene within that cluster, suggesting that the gene is more active than in the reference condition. Conversely, a negative fold change indicates downregulation, implying that the gene is less active or potentially repressed in the compared condition. This differential expression analysis, highlighted in
Figure 3, enables us to identify genes that show significant changes in expression across different categories or conditions, facilitating insights into biological processes and disease mechanisms.
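As an illustration of the sign convention described above, the following minimal sketch computes a fold change for one gene on a log2 scale. The log2 scale, the small pseudocount, the function name, and the expression values are assumptions made for this example; they are not taken from the analysis pipeline itself.

```python
import math


def log2_fold_change(cluster_values, reference_values, pseudocount=1e-9):
    """Return log2(mean expression in cluster / mean expression in reference).

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is not detected in the reference cells.
    """
    cluster_mean = sum(cluster_values) / len(cluster_values)
    reference_mean = sum(reference_values) / len(reference_values)
    return math.log2((cluster_mean + pseudocount) / (reference_mean + pseudocount))


# Hypothetical normalized expression values for a single gene.
in_cluster = [4.0, 3.5, 4.5, 4.0]      # mean 4.0
rest_of_cells = [1.0, 1.0, 1.0, 1.0]   # mean 1.0

up = log2_fold_change(in_cluster, rest_of_cells)
down = log2_fold_change(rest_of_cells, in_cluster)

print(round(up, 3))    # 2.0  (positive: upregulated in the cluster)
print(round(down, 3))  # -2.0 (negative: downregulated relative to the reference)
```

On a logarithmic scale, a gene with equal mean expression in both groups has a fold change of zero, which is why the sign alone separates upregulation from downregulation.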
Figure 4 shows a dot plot that summarizes the expression of each gene across all cells within a group and visualizes scRNA-seq expression data across different clusters. Cell groups are shown along the horizontal axis, and genes are arranged along the vertical axis. The dendrogram at the top represents hierarchical clustering based on expression profiles, grouping similar expression patterns together. The size of each dot indicates the fraction of cells within a group expressing the gene marker (with the percentage scale shown in the top right corner), while the colour intensity represents the mean expression level of the gene in that specific group (as indicated by the colour scale at the bottom). This visualization facilitates the assessment of both the prevalence and intensity of gene expression across different groups, providing insights into the dynamics of gene expression and cellular diversity within the sample. In our analysis, the dot plot was standardized by variance to enable comparison across different genes. Additionally, dot plots, as illustrated in
Figures S2 and S3, provide an insightful visualization of scRNA-seq expression data across various clusters, facilitating comparisons between control and disease groups.
Figure S2 displays the expression patterns of key genes across different cell types within the control group, including inhibitory neurons, excitatory neurons, astrocytes, oligodendrocyte precursor cells, oligodendrocytes, and microglial cells.
Figure S3 presents the dot plot for the disease group, highlighting alterations in gene expression patterns due to the pathological condition. Significant alterations in the expression levels and prevalence of specific genes across cell types are evident when compared to the control group. These dot plots are an invaluable tool for visualizing the complexity of gene expression across different cell populations, offering a thorough overview of cellular heterogeneity and the effects of disease on gene expression dynamics.
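The two quantities encoded in each dot can be sketched as follows. This is a simplified illustration, assuming that a cell counts as "expressing" a gene when its value is above zero and that the mean is taken over all cells in the group; the function name and the expression values are hypothetical.

```python
def dot_plot_summary(expression_by_group):
    """Summarize one gene per group: dot size and dot colour inputs.

    expression_by_group maps a group name to a list of per-cell
    expression values for a single gene.
    """
    summary = {}
    for group, values in expression_by_group.items():
        n_expressing = sum(1 for v in values if v > 0)
        summary[group] = {
            # Dot size: fraction of cells in the group with non-zero expression.
            "fraction_expressing": n_expressing / len(values),
            # Dot colour: mean expression over all cells in the group.
            "mean_expression": sum(values) / len(values),
        }
    return summary


# Hypothetical per-cell values for one marker gene in two cell groups.
gene_values = {
    "astrocytes": [0.0, 2.0, 4.0, 0.0],
    "microglia": [1.0, 1.0, 1.0, 1.0],
}

summary = dot_plot_summary(gene_values)
for group, stats in summary.items():
    print(group, stats)
```

In this toy case the gene is detected in half of the astrocytes but at a higher mean level, and in all microglia at a lower mean level, which is exactly the kind of contrast between prevalence and intensity that the dot plot is designed to show.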
Furthermore, the violin plots in
Figure 5 display the distribution of expression levels for the top five differentially expressed genes in each group compared to the rest of the groups in the control dataset. The width of each plot indicates the density of cells at different expression levels, while the split view highlights differences between the two conditions being compared.
Lastly, we merged the newly created AnnData (adata) objects for both the control and disease groups into a single adata object, following the same analytical procedures as previously established (
Figure 6). The integration facilitates a focused comparison of cells present in both the control and AD conditions, allowing us to pinpoint genes that exhibit significant differences between these groups. By concentrating on these particular cells, we aimed to uncover crucial genetic markers that may elucidate the underlying mechanisms of disease progression or resilience. This comparative approach is instrumental in distinguishing the gene expression patterns that are pivotal in disease manifestation from those of normal physiological states.
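The idea of merging two groups while keeping track of each cell's origin can be sketched with plain Python dictionaries. This is not the AnnData-based procedure used in the study; the function name, the barcode suffix scheme, and the toy expression values are assumptions chosen only to show why a condition label and unique cell identifiers matter when datasets are combined.

```python
def merge_groups(groups):
    """Merge per-condition cell tables into one table with unique cell IDs.

    groups maps a condition label (e.g. "control") to a dict of
    {barcode: expression profile}. Barcodes may repeat across conditions,
    so each merged ID is suffixed with its condition label.
    """
    merged = {}
    for condition, cells in groups.items():
        for barcode, expression in cells.items():
            cell_id = f"{barcode}-{condition}"
            if cell_id in merged:
                raise ValueError(f"duplicate cell identifier: {cell_id}")
            merged[cell_id] = {"condition": condition, "expression": expression}
    return merged


# Hypothetical cells; note that the barcode "AAAC" occurs in both groups.
groups = {
    "control": {"AAAC": {"GSN": 0.0, "IL1A": 1.0}},
    "disease": {"AAAC": {"GSN": 3.0, "IL1A": 2.0}},
}

merged = merge_groups(groups)
print(sorted(merged))  # ['AAAC-control', 'AAAC-disease']
```

Storing the condition alongside each cell is what later allows control and disease cells to be compared directly within the single merged object.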
Across various classifiers, including K Neighbors Classifier, Extreme Gradient Boosting, Decision Tree Classifier, and Gradient Boosting Classifier, the analysis reveals consistently high values for accuracy and AUC scores. These metrics indicate robust discriminative capabilities in distinguishing between healthy and disease conditions based on the gene expression data (
Figure 7). Furthermore, these classifiers demonstrate balanced performance in terms of recall, precision, and F1 score, highlighting their ability to effectively identify positive cases while minimizing false positives. Strong agreement metrics such as Kappa and MCC further support the reliability of these classifiers, suggesting consistent and balanced predictions across different models. While training times vary among the models, it is noteworthy that models like Extreme Gradient Boosting exhibit both high accuracy and efficiency, achieving strong results within a reasonable timeframe. Our analysis was facilitated by the PyCaret library (PyCaret version 3.3.2), a versatile and user-friendly tool for streamlined machine learning experimentation. PyCaret automates various aspects of the machine learning workflow, including data preprocessing, feature selection, model training, hyperparameter tuning, and model evaluation. Its intuitive interface and extensive suite of functionalities enable researchers to efficiently explore multiple models, compare their performance, and derive actionable insights from the data.
By leveraging PyCaret’s capabilities, we were able to conduct a thorough evaluation of different classifiers on the gene expression dataset, facilitating informed decision-making and accelerating the research process. The library’s comprehensive documentation and extensive support for a wide range of machine learning tasks make it an invaluable resource for both novice and experienced practitioners in the field. Beyond the quantitative metrics, our computational analysis offers deeper insights into the underlying biological mechanisms.
In addition to these metrics, the confusion matrix offers a detailed view of each classifier’s performance by reporting the numbers of true positives, true negatives, false positives, and false negatives.
Figure 8 shows the confusion matrix for the Custom Probability Threshold Classifier, which highlights its ability to correctly classify instances. This classifier converts predicted probabilities into class labels based on a specified threshold. By default, this threshold is set at 0.5, which means that any predicted probability above 0.5 is classified as positive, while those below 0.5 are classified as negative. In our analysis, we raised this threshold to 0.7, so that only predictions with a probability above 0.7 are classified as positive, making the classification criteria more stringent. The classifier achieved 978 true negatives, indicating high accuracy in identifying healthy samples, and 1726 true positives, demonstrating its robustness in detecting disease conditions. The small numbers of false negatives (28) and false positives (10) underscore the classifier’s precision and recall: most positive identifications are correct, and very few actual positive cases are missed. This combination of high precision and recall corroborates the elevated F1 score observed in the evaluation metrics. The confusion matrix thus provides granular insight into the classifier’s performance, revealing a low false-positive rate, which is crucial for reducing unnecessary follow-up tests and treatments. At the same time, the low false-negative rate ensures that most disease cases are correctly identified, highlighting the model’s reliability in practical applications. These detailed breakdowns support the classifier’s utility in real-world scenarios, where accurate and reliable predictions are paramount.
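The thresholding rule and the metrics derived from the confusion matrix can be reproduced with a short sketch. The counts below are those reported for Figure 8; the function names and the example probabilities are hypothetical, and the code is a plain-Python illustration of the standard formulas rather than the PyCaret implementation.

```python
from math import sqrt


def apply_threshold(probabilities, threshold=0.5):
    """Label a prediction positive (1) only if its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical predicted probabilities: a stricter threshold yields fewer positives.
probs = [0.55, 0.65, 0.75, 0.95]
print(apply_threshold(probs))                  # [1, 1, 1, 1]
print(apply_threshold(probs, threshold=0.7))   # [0, 0, 1, 1]

# Counts reported for the confusion matrix in Figure 8.
metrics = confusion_metrics(tp=1726, tn=978, fp=10, fn=28)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With the reported counts, this gives a precision of about 0.994, a recall of about 0.984, and an F1 score of about 0.989, consistent with the description above.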
We further utilized various libraries to identify pathways associated with upregulated and downregulated genes (
Tables S1 and S2, respectively). Reactome pathway analysis showed that upregulated differentially expressed genes (DEGs) were mainly enriched in pathways such as G alpha (i) signalling events, transcription of neuronal ligands, interleukin-1 processing, caspase-mediated cleavage of cytoskeletal proteins, and GPCR downstream signalling, while WikiPathways analysis showed enrichment in neuroinflammation and glutamatergic signalling, interleukin-1-induced activation of NF-kB, and the IL-10 anti-inflammatory signalling pathway. Gene ontology (GO) enrichment was examined in terms of biological processes (BP), and the results revealed that upregulated DEGs were mainly enriched in BPs including positive regulation of gene expression, response to metal ions, and positive regulation of protein transport. On the other hand, according to Reactome analysis, downregulated genes were mainly enriched in neurotrophin, p75 NTR receptor-mediated signalling, LTC4-CYSLTR mediated IL4 production, and transcription of neuronal ligands, and, according to WikiPathways, in Wnt signalling, the chemokine signalling pathway, and the leukotriene metabolic pathway. BP analysis indicated that these DEGs were significantly enriched in the leukocyte apoptotic process, glutathione catabolic process, and positive regulation of the ERK1 and ERK2 cascades.
This study provides a detailed analysis of AD by identifying key differentially expressed genes across multiple brain regions, shedding light on the intricate molecular dynamics associated with the disease’s pathology. Our findings, particularly the upregulation of genes like GSN and TTN in pathways associated with amyloidosis, muscle stretch, and cardiac muscle contraction, reveal potential therapeutic targets. Additionally, the positive regulation of gene expression by genes such as CCDC88B and IL1A suggests new avenues for modulating gene expression in the context of T-cell maturation and inflammatory function. Notably, the identification of pathways related to neuroinflammation and glutamatergic signalling, featuring genes like IL1A, GRM4, and SST, emphasizes their potential role in AD’s systemic pathological processes. These pathways are consistently altered across the studied regions, underscoring the importance of targeting these molecular mechanisms to mitigate the disease’s progression. The regional expression of genes such as HDAC1 and ARHGEF40, involved in downregulated pathways such as death receptor signalling and p75 NTR receptor-mediated signalling, hints at a complex regulatory mechanism that might confer specific regional vulnerabilities or resilience to AD pathology (
Figure 9).
4. Discussion
The present work includes data integration and analysis of specific scRNA-seq datasets from multiple brain regions associated with AD pathology. This approach seeks to provide a systemic view of the disease’s impact across different human brain regions, leveraging data to uncover insights that have not been previously recognized due to the isolated nature of earlier analyses. It underscores the potential of computational analyses to deepen our understanding of AD from a holistic perspective, providing valuable insights that could lead to the development of targeted molecular interventions. Specifically, the study rests on the following pillars:
1. Data Integration: Combine scRNA-seq datasets from critical brain regions such as the entorhinal cortex, prefrontal cortex, superior frontal gyrus, and superior parietal lobe. This integrated analysis will allow for a comparative assessment of cellular and molecular features across these regions, enhancing our understanding of how AD manifests differently in various parts of the brain.
2. Computational Analysis: Utilize computational methods to analyse these integrated datasets, focusing on identifying common and region-specific molecular signatures that characterize AD. This includes the application of batch effect correction, normalization, dimensionality reduction, and clustering algorithms to synthesize and interpret the complex data.
3. Insight Development: While not generating new experimental results, the study aims to derive novel insights into the pathology of AD by reanalysing existing data. This will include identifying patterns and correlations that may have been overlooked in previous studies that focused on single datasets or regions.
4. Therapeutic Implications: Explore potential therapeutic targets by understanding the molecular mechanisms across the brain’s affected regions. Identifying pathways that are consistently altered in these regions could highlight targets for therapeutic interventions that might be effective across the broader spectrum of AD pathology.
5. Methodological Contribution: Demonstrate the power and utility of computational methods in the integration and analysis of complex and large-scale biological data. The study will showcase how computational approaches can be used to enhance the value of existing datasets, providing a blueprint for similar future studies in neurodegenerative diseases and beyond.
By meeting these objectives, the study will enrich our understanding of AD, offering value for future research into comprehensive and targeted treatments. It seeks to establish a new standard for the effective application of computational analysis in interpreting and integrating diverse biological data, thus paving the way for novel avenues in research and therapeutic advancements.
The Q–Q plot shows significant deviations from the normal distribution, suggesting that the gene expression data are not normally distributed. Additionally, the histogram shows a sharp peak around zero with a rapid drop-off, indicating that most data points are concentrated near this value, which further supports the non-normality of the distribution. The Wilcoxon rank-sum test is a non-parametric method, meaning it does not assume normality in the data. Therefore, it is a suitable and robust choice for our analysis, allowing us to identify differentially expressed genes without being affected by the non-normal distribution of the data. As
Figure S4 shows, the significant deviations from normality observed in both the Q–Q plot and histogram justify the use of the Wilcoxon rank-sum test for differential gene expression analysis in our study.
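To make the choice of test concrete, the following minimal sketch implements a two-sided Wilcoxon rank-sum test for one gene using only the Python standard library. It uses average ranks for ties and a normal approximation without tie correction of the variance, which is a simplification assumed for this example; the function names and the zero-inflated expression values are hypothetical and do not come from the study's data.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using a normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical zero-inflated expression values for one gene.
rest_of_cells = [0, 0, 0, 1, 0, 2, 0, 0]
cluster_cells = [3, 5, 4, 0, 6, 7, 5, 4]

z, p = rank_sum_test(cluster_cells, rest_of_cells)
print(f"z = {z:.3f}, p = {p:.4f}")
```

Because the test operates on ranks rather than raw values, the sharp peak at zero seen in the histogram does not violate any of its assumptions, although many tied zeros do reduce its power.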
While we refrain from making direct assertions about specific genes’ diagnostic potential, our approach sheds light on the computational indicators that might point towards important genetic markers associated with the conditions under study. By leveraging advanced machine learning techniques and comprehensive evaluation strategies, our analysis provides a nuanced understanding of the gene expression patterns characteristic of different conditions. These findings not only contribute to our understanding of the molecular underpinnings of disease, but also offer valuable guidance for future research endeavours. Rather than as a conclusive diagnostic marker, our computational analysis serves as a powerful exploratory tool, indicating potential candidate genes worthy of further investigation. This nuanced approach underscores the importance of integrating computational methods with traditional experimental techniques in unravelling the complexities of disease mechanisms.
AD, the most prevalent cause of dementia among older adults, poses significant challenges due to its intricate and multifactorial nature. With genetic, environmental, and molecular factors contributing to it, unravelling the pathogenesis of AD and developing effective treatments is a persistent and complex endeavour [
18,
19]. The advent of scRNA-seq technology provides a methodology to explore the cellular heterogeneity of the tissue, by profiling tens of thousands of individual cells, and has opened new ways for exploring the molecular details of diseases with unprecedented precision [
6]. More precisely, through scRNA-seq, researchers can probe cellular diversity, offering a comprehensive approach to the specific cellular environmental conditions that contribute to disease progression [
20]. Recent technological advances have particularly enhanced our ability to discern subtle variations in gene expression across individual cells, which is crucial for identifying the molecular signatures associated with AD. However, the use of scRNA-seq in AD has been primarily restricted to isolated analyses of specific brain regions or datasets [
21,
22]. A comprehensive, integrated examination across multiple affected regions remains rare, which limits our understanding of the systemic and regional impacts of the disease across the brain’s complex landscape.
By integrating data, computational analysis revealed novel molecular signatures, supporting the interpretation of observed patterns as authentic biological phenomena rather than artifacts of data manipulation. Keeping each cell’s identity distinct during the merging process is necessary to prevent overlap or confusion and guarantees that each cell, now part of a larger dataset, remains uniquely identifiable. This clear delineation is fundamental for subsequent analyses, ensuring that data from disparate datasets can be accurately compared. This rigorous approach not only bolsters the credibility of our findings but also establishes a methodological blueprint for future studies aiming to decode the complex molecular landscape of AD. By integrating insights from external studies indicating distinct transcriptional networks in AD, particularly within neuronal and glial populations, we place our findings within a broader scientific context [
10]. Furthermore, the dynamic perspective on gene expression, highlighted through RNA velocity studies, complements our static analysis by illustrating the importance of temporal dynamics in understanding cellular responses in AD [
14].
Blood-based biomarkers, especially immune-related ones, could provide a more accessible and cost-effective solution for early AD detection. A recent work discusses advances in understanding brain–immune interactions and how machine learning can combine various biomarkers and demographic information to improve early diagnosis. It also explores mechanistic modelling techniques for analysing cell dynamics, highlighting the potential of immune-related blood biomarkers for early AD diagnosis [
23]. The clinical implications of discovering new diagnostic markers or therapeutic targets are crucial. Baheti et al. highlight the advantages of molecular modelling methods, which offer a faster and more efficient way to design drugs with improved efficacy and ethical considerations compared to traditional approaches. Researchers are increasingly adopting these advanced methods to better address AD and other diseases [
24]. Looking forward, the framework developed in the present work promises to be a robust analytical tool for comparing cellular and molecular changes between AD patients and healthy controls. This comparative analysis can shed light not only on the specific pathological triggers associated with AD but also on potential resilience factors found within the control group. Such insights could inspire the development of focused interventions aimed at replicating these resilience factors in susceptible populations. Moreover, by harnessing this methodology, future research can leverage scRNA-seq data to gain a systemic view of AD’s impact across different brain regions [
25]. This approach will enable a deeper understanding of the disease at a cellular level, paving the way for precision medicine strategies that are fine-tuned to the molecular profiles observed in individual patients [
26]. However, this study does face limitations, primarily due to its reliance on existing datasets, which may not capture the full spectrum of cellular diversity in AD pathology. The analytical methods, while sophisticated, also depend heavily on the quality and completeness of the data integrated into our study. Future research should aim to include more diverse datasets, potentially incorporating longitudinal data to observe the progression of AD over time, which could provide further insights into the dynamics of the disease’s development.