Article

Advances in Inflammatory Bowel Disease Diagnostics: Machine Learning and Genomic Profiling Reveal Key Biomarkers for Early Detection

by Asif Hassan Syed 1,*, Hamza Ali S. Abujabal 2, Shakeel Ahmad 1, Sharaf J. Malebary 3 and Nashwan Alromema 4
1 Department of Computer Science, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, Jeddah 22254, Saudi Arabia
2 Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
3 Department of Information Technology, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, P.O. Box 344, Rabigh 21911, Saudi Arabia
4 Department of Computer Science, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, P.O. Box 344, Rabigh 21911, Saudi Arabia
* Author to whom correspondence should be addressed.
Diagnostics 2024, 14(11), 1182; https://doi.org/10.3390/diagnostics14111182
Submission received: 25 April 2024 / Revised: 25 May 2024 / Accepted: 1 June 2024 / Published: 4 June 2024
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

This study, utilizing high-throughput technologies and Machine Learning (ML), has identified gene biomarkers and molecular signatures in Inflammatory Bowel Disease (IBD). We could identify significant upregulated or downregulated genes in IBD patients by comparing gene expression levels in colonic specimens from 172 IBD patients and 22 healthy individuals using the GSE75214 microarray dataset. Our ML techniques and feature selection methods revealed six Differentially Expressed Gene (DEG) biomarkers (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) with strong diagnostic potential for IBD. The Random Forest (RF) model demonstrated exceptional performance, with accuracy, F1-score, and AUC values exceeding 0.98. Our findings were rigorously validated with independent datasets (GSE36807 and GSE10616), further bolstering their credibility and showing favorable performance metrics (accuracy: 0.841, F1-score: 0.734, AUC: 0.887). Our functional annotation and pathway enrichment analysis provided insights into crucial pathways associated with these dysregulated genes. DENND2B and PANK1 were identified as novel IBD biomarkers, advancing our understanding of the disease. The validation in independent cohorts enhances the reliability of these findings and underscores their potential for early detection and personalized treatment of IBD. Further exploration of these genes is necessary to fully comprehend their roles in IBD pathogenesis and develop improved diagnostic tools and therapies. This study significantly contributes to IBD research with valuable insights, potentially greatly enhancing patient care.

1. Introduction

1.1. Background

Inflammatory Bowel Disease (IBD), which encompasses ulcerative colitis (UC) and Crohn’s disease (CD), is a chronic inflammatory condition that affects the gastrointestinal tract. It has a significant global impact, affecting millions worldwide [1,2]. Early and accurate detection of IBD is crucial for effective disease management and personalized treatment, but the complex and heterogeneous nature of IBD poses diagnostic challenges [3].

1.2. Research Motivation

Advancements in high-throughput transcriptomic microarray technologies have provided opportunities to explore gene expression profiles associated with IBD. These datasets offer insights for identifying diagnostic biomarkers to distinguish IBD patients from healthy individuals [4]. However, analyzing high-dimensional low-sample size (HDLSS) transcriptomic data remains challenging [5,6]. Machine learning (ML) techniques have emerged as powerful tools for analyzing complex biological datasets and discovering predictive patterns [7,8,9].

1.3. An Overview of the Study Objectives and Methodology

This study explores high-throughput technologies and ML to identify molecular signatures associated with IBD, enhancing our understanding of IBD pathogenesis [10,11,12,13,14,15,16]. We aim to employ supervised feature selection (FS) methods to identify informative gene biomarkers for accurately classifying IBD patients and healthy controls, facilitating earlier diagnosis and personalized treatment.
The main goals of this study are as follows:
  • Evaluate the effectiveness of high-throughput technologies and ML in identifying molecular signatures and enhancing our understanding of IBD pathogenesis.
  • Assess the accuracy and reliability of the identified gene biomarkers for diagnosis of IBD.
  • Investigate the impact of the identified gene biomarkers on IBD diagnosis and personalized treatment.
We have devised an ML-based framework to achieve these goals using a publicly available transcriptomic microarray dataset (GSE75214) from the GEO database [17]. This dataset will be used to discover DEGs associated with IBD. We will employ a comprehensive set of supervised FS approaches and various visualization tools to analyze the DEGs and identify the most informative genes associated with IBD. The selected DEGs will be utilized to train a set of supervised-learning classifiers, and their performance will be thoroughly evaluated using relevant metrics such as AUC-ROC and accuracy [18].
Furthermore, we validate the identified gene biomarkers using independent cohorts from the GEO database (GSE10616 and GSE36807). These validation cohorts are used to assess the reliability and applicability of the identified biomarkers. Additionally, we perform gene ontology (GO) and pathway enrichment analysis on the identified DEGs using the Overrepresentation Enrichment Analysis (ORA) method available through the WebGestalt toolkit 2024 [19]. This analysis provides insights into the molecular mechanisms, disrupted biological pathways, and processes associated with IBD.

1.4. Main Contributions

The study contributes to the field of IBD research in the following ways:
  • Discovery of novel gene biomarkers: the study identifies DENND2B and PANK1 as novel biomarkers with strong diagnostic potential for IBD.
  • Validation of biomarkers: the study rigorously validates the identified gene biomarkers using independent datasets, confirming their reliability and generalizability.
  • Enhanced understanding of IBD pathogenesis: the study improves our understanding of the molecular mechanisms and disrupted pathways associated with IBD.
  • Facilitated early detection: the study develops a diagnostic model based on the identified biomarkers, enabling accurate and timely detection of IBD.
  • Personalized treatment approaches: the study’s findings provide insights for tailoring treatment plans based on individual patients’ IBD subtypes and disease severity.
Through these contributions, this study makes significant strides in IBD research. The discovery of novel gene biomarkers, their validation, enhanced understanding of IBD pathogenesis, and the potential for early detection and personalized treatment approaches collectively contribute to the advancement of knowledge and the potential for improved patient care in the field of IBD. Subsequent sections of the research paper present detailed literature reviews, providing comprehensive insights into the existing body of knowledge in the field.
The research article follows a structured outline. Section 1 introduces the research topic, outlines the study’s objectives, and provides an overview of the contributions made by the research work. Section 2 presents a comprehensive review of the existing research on identifying gene biomarkers associated with IBD. In Section 3, the methodology employed in the research paper is described, including the dataset description, explanation of the data preprocessing steps, description of the FS strategy, explanation of the filter, wrapper, and embedded methods of the FS algorithm, explanation of the FS classifiers employed, and discussion of the model performance metrics. Section 4 presents the results of the FS framework and provides metric estimates for the various classifier-based models utilized in classifying IBD patients from healthy controls. The outcomes of the study, including a comparative analysis of the performance against existing gene biomarker-based ML models, are discussed in Section 5. Finally, Section 6 concludes the research article by summarizing the main findings, highlighting limitations, and discussing potential future directions for further research. The ML-based framework, which screens potential gene biomarkers for classifying IBD samples from healthy control samples using gene microarray data, is visually represented in Figure 1.

2. Review of Literature

The literature review is organized into three main categories: (1) studies related to IBD in general, (2) studies specifically focused on UC, and (3) studies specifically focused on CD.

2.1. IBD (CD and UC)-Related Studies

Stemmer et al. [20] conducted a meta-analysis identifying 34 genes, including three novel long non-coding RNAs (lncRNAs), distinguishing inflamed IBD from non-IBD biopsies. They also found that 12 of 29 genes were upregulated in IBD blood, suggesting potential as non-invasive biomarkers. The study further explored potential therapeutic compounds for IBD using the Connectivity Map (CMap) database. Tang et al. [21] discovered Ras homolog family member U (RHOU) as a new IBD biomarker using support vector machine recursive feature elimination (SVM-RFE) and least absolute shrinkage and selection operator (LASSO) regression methods. RHOU was validated through quantitative reverse transcription polymerase chain reaction (qRT-PCR) assays and receiver operating characteristic (ROC) analysis. The study also revealed RHOU correlations with immune cell populations. Yu et al. [22] recognized a 32-gene signature that accurately predicted IBD in an independent cohort with 86.5% accuracy using XGBoost and uniform manifold approximation and projection (UMAP) techniques. Park et al. [23] developed a machine learning model using RNA sequencing data to distinguish inflammatory CD from UC with minimal error, identifying gene signatures that may help differentiate the two conditions. In 2019, Abbas et al. [24] proposed an integrative “Network-Based Biomarker Discovery (NBBD)” approach that combined network analysis and machine learning to identify a classifier with an AUC of 0.82 for distinguishing IBD patients from controls. Smolander et al. [25] compared the performance of support vector machines (SVMs) and deep belief networks (DBNs) in classifying breast cancer and IBD gene expression data. The study provided guidelines for effectively applying DBNs to complex genomics data classification. Biasci et al. [26] developed a 17-gene quantitative polymerase chain reaction (qPCR)-based blood biomarker that could stratify IBD patients into high-risk and low-risk subgroups, predicting disease progression and treatment needs. Han et al. [27] presented a novel pathway-based approach called probabilistic pathway score (PROPS) that outperformed gene-based and alternative pathway-based classifiers in differentiating CD and UC. In 2017, Yuan et al. [28] used a two-step feature selection method and SVM to identify 21 gene biomarkers that could distinguish non-IBD from IBD samples with an accuracy of 0.937. In 2017, Isakov et al. [29] screened 347 potential gene biomarkers using an Elastic Net method and built an IBD risk prediction model with high accuracy and AUC. In 2017, Chen et al. [30] used Bayesian hierarchical clustering on a large IBD cohort to develop a model that could predict IBD risk with AUC values of 0.70 for UC and 0.75 for CD. In 2015, Hubenthal et al. [31] used a penalized SVM method to identify a subset of 16 microRNAs from a pool of 863, which could distinguish individuals with and without disease with AUC values ranging from 0.89 to 0.98. In 2013, Wei et al. [32] implemented a two-step feature selection approach using the IIBCD dataset. They first applied a less strict association significance cutoff (<10^−4) and minor allele frequency (>0.01) to filter genetic variants, then used LASSO (L1) penalization to screen 573 SNPs related to CD and 366 SNPs associated with UC. The resulting SVM-based model classified CD and UC patients from healthy controls with AUC values of 0.83 and 0.86, respectively.

2.2. Ulcerative Colitis (UC)-Related Studies

Qian et al. [33] identified five ferroptosis-related hub genes (LCN2, MUC1, PARP8, PLIN2, TIMP1) and built a high-performing logistic regression model to diagnose UC. Bu et al. [34] found four potential UC biomarkers (HSPB3, ABCG2, VNN1, SLC6A14) confirmed in an independent dataset (AUC = 0.889). They also observed immune cell differences, with UC having more γδ T cells, neutrophils, memory B cells, activated mast cells, and M1 macrophages. Zhang et al. [35] used machine learning methods to analyze microarray data from 387 UC patients and 139 healthy controls. They identified two genes, OLFM4 and C4BPB, that could effectively distinguish UC patients from controls with AUC > 0.8. These genes’ expression correlated with immune cell levels, suggesting involvement in UC pathogenesis. Khorasani et al. [36] developed an SVM model using a subset of 32 genes identified through feature selection. The model achieved high accuracy in detecting active UC and reasonable performance for inactive UC. Li et al. [37] used RF and artificial neural network approaches to develop a predictive model for UC diagnosis based on the expression of 30 differentially expressed genes. The model showed high predictive performance with an ROC-AUC of 0.95. Duttagupta et al. [38] explored circulating microRNAs in peripheral blood as non-invasive biomarkers for UC. They identified a signature of 31 differentially expressed, platelet-derived microRNAs that could distinguish UC patients from controls with 96.2% specificity, 89.5% sensitivity, and 92.8% accuracy.

2.3. Crohn’s Disease (CD)-Related Studies

Raimondi et al. [39] introduced a low-complexity neural network model for in silico CD diagnosis using whole exome sequencing data, outperforming previous approaches and providing interpretable insights. Romagnoni et al. [40] compared machine learning methods for classifying CD patients from controls using genotyping data, finding that non-linear models like gradient-boosted trees and neural networks can provide robust and complementary approaches. Wang et al. [41] developed an Analysis of Variation for Association with Disease (AVADx) method to predict Crohn’s disease (CD) status using exonic variants from genome/exome data. Their model, trained on 111 individuals, identified known CD genes and potential new ones. Bottigliengo et al. [42] investigated using Bayesian machine learning techniques, including Bayesian Network, Naive Bayes, and Bayesian Additive Regression Trees, to predict extra-intestinal manifestations in Crohn’s patients. However, the results showed poor performance compared to classical statistical tools. Daneshjou et al. [43] discussed the Critical Assessment of Genome Interpretation (CAGI) community experiment, which used CD Exomes sequencing data to predict phenotypes, highlighting such predictions’ challenges and potential applications. Pal et al. [44] utilized genotype data from the CAGI Crohn’s Exome challenge to train machine learning models that outperformed other approaches in predicting disease status. The resulting SVM-based model classified CD and UC patients from healthy controls with AUC values of 0.83 and 0.86, respectively. In 2013, Cui et al. [45] used Recursive SVM, a wrapper-based feature selection method, to identify 200 gene biomarkers. Leave-One-Out Cross-Validation (LOOCV) analysis demonstrated 88% accuracy, validated using an independent dataset.
This literature review highlights the diverse applications of machine learning in IBD research. The studies discussed demonstrate the potential of ML techniques to enhance our understanding of IBD pathogenesis and improve clinical management. Table 1 summarizes selected studies using gene selection and microarray datasets to identify diagnostic gene and microRNA biomarkers for IBD.

3. Materials and Methods

This study employed a comprehensive approach to identifying and validating inflammatory bowel disease (IBD) gene biomarkers. We first described the datasets used for biomarker discovery and validation, followed by the data preprocessing steps. Differential gene expression analysis was conducted to identify significant genes differentially expressed between IBD and healthy control samples. Various feature selection methods were applied to select the most informative differentially expressed gene (DEG) biomarkers. We analyzed the expression patterns of these DEGs using Histogram Frequency Curve Plot (HFCP) analysis. The preprocessed microarray data was then split into training and testing sets, with the training set oversampled using SMOTE to address the class imbalance. Supervised machine learning models were trained using the selected DEG biomarkers, and their performance was evaluated using metrics such as accuracy and AUC-ROC. The validated DEG-based machine learning model was further tested on independent cohorts. Finally, gene ontology and pathway enrichment analyses were performed on the selected DEGs to gain insights into their role in IBD pathogenesis.

3.1. Dataset for Gene Biomarker Discovery and Validation

We used microarray data from the Gene Expression Omnibus (GEO) database to identify and validate IBD gene biomarkers. The GSE75214 cohort [17], consisting of 172 IBD patients and 22 healthy controls, was profiled on the Affymetrix Human Gene 1.0 ST Array. This discovery cohort was used to identify differentially expressed gene (DEG) biomarkers with diagnostic potential for IBD. To validate the identified DEG biomarkers, we utilized two additional independent cohorts from GEO: GSE10616 [46] and GSE36807 [47]. During validation, we considered only the DEGs discovered in the original GSE75214 cohort, excluding all other genes in the validation datasets.

3.2. Preprocessing Strategies for GSE75214 Key DEGs Dataset

We preprocessed the GSE75214 dataset using several techniques. Categorical variables were transformed using binary encoding, and quasi-constant features were removed. Outliers were detected and removed using the Interquartile Range (IQR) method [48]. The gene expression values were then rescaled using Min–max normalization [49]. These preprocessing steps prepared the data for further analysis.
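The outlier filtering and rescaling steps can be illustrated with a minimal sketch. It assumes the conventional 1.5 × IQR fences (the multiplier is not stated in the text) and uses hypothetical expression values for a single gene; the helper names `iqr_bounds` and `min_max_scale` are illustrative only and are not taken from the study's code.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences; k=1.5 is the conventional multiplier (assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def min_max_scale(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical expression values for one gene; 9.8 is an obvious outlier.
expression = [5.1, 5.4, 5.0, 5.3, 5.2, 9.8, 5.5, 4.9]
low, high = iqr_bounds(expression)
kept = [v for v in expression if low <= v <= high]
scaled = min_max_scale(kept)
```

After filtering, the remaining values span exactly [0, 1], which puts genes with different expression ranges on a common scale.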

3.3. Differential Gene Expression Analysis Methodology

We used an independent two-sample t-test to identify differentially expressed genes (DEGs) between IBD and control samples. For each gene, we calculated the t-statistic and p-value, which quantify the significance of the difference in mean expression. We adjusted the p-values using the Benjamini–Hochberg method [50] to account for multiple testing. The fold change for each gene was calculated as the ratio of mean expression in IBD to that in controls. The 95th percentile of the fold change distribution for non-DEGs was used as the fold change threshold. Genes were then categorized as upregulated, downregulated, or non-significant based on their adjusted p-values (q-values) and fold change. We created Venn diagrams, heatmaps, and volcano plots to visualize the DEGs. The volcano plot displays the log2 fold change against the negative log10 p-value, with points color-coded by gene category: “Upregulated” (red), “Downregulated” (blue), or “non-significant” (gray).
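The multiple-testing correction and the categorization rule can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the q-value cutoff of 0.05, the symmetric fold change threshold, and the example p-values are all hypothetical, and the function names are chosen here for clarity.

```python
import math

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def categorize(mean_ibd, mean_ctrl, q, fc_threshold, q_cutoff=0.05):
    """Label a gene from its fold change (IBD / control) and q-value."""
    log2_fc = math.log2(mean_ibd / mean_ctrl)
    if q < q_cutoff and log2_fc >= math.log2(fc_threshold):
        return "Upregulated"
    if q < q_cutoff and log2_fc <= -math.log2(fc_threshold):
        return "Downregulated"
    return "non-significant"

pvals = [0.001, 0.008, 0.039, 0.041, 0.60]   # hypothetical raw p-values
qvals = benjamini_hochberg(pvals)
labels = [
    categorize(8.0, 4.0, qvals[0], fc_threshold=1.5),
    categorize(4.0, 8.0, qvals[1], fc_threshold=1.5),
    categorize(8.0, 4.0, qvals[4], fc_threshold=1.5),
]
```

Note that two raw p-values just under 0.05 end up with q-values slightly above 0.05 here, which is the kind of borderline gene the correction is meant to exclude.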

3.4. Feature Selection Approaches for Identification of Informative DEG Biomarkers

After identifying the DEGs, we applied several feature selection methods to select the most informative upregulated and downregulated biomarkers. These included filter-based (e.g., Mutual Information), wrapper-based (e.g., Recursive Feature Elimination), and embedded (e.g., Elastic Net, Gradient Boosting) approaches. We also used a feature complementation approach, selecting features unique to the up-and-down-regulated gene subsets identified using the different methods.

3.4.1. Filter-Based Feature Selection

Filter-based feature selection is a technique for evaluating and ranking features based on their individual properties, such as correlation or mutual information with the target variable. It involves applying a statistical measure to each feature and selecting the top-ranked features for further analysis. Filter-based methods are computationally efficient and can handle high-dimensional datasets, making them popular for initial feature selection.
  • Mutual Information Statistics [51]: Mutual information (MI) measures the mutual dependence between a gene’s expression (X) and the outcome (Y). The MI score is calculated as
$$MI(X, Y) = \sum_{x} \sum_{y} P(x, y) \times \log \frac{P(x, y)}{P(x) \times P(y)}$$
Here, the notation P(x, y) denotes the joint probability distribution of the features X and Y.
The marginal probability distributions of the features X and Y are represented by P(x) and P(y), respectively.
Σ denotes the summation over all possible values of X and Y.
Mutual information measures the decrease in uncertainty about one variable (gene expression) when the value of the other variable (outcome) is known. Higher MI indicates a stronger association between the gene and outcome, suggesting biomarker potential. The scikit-learn parameters for mutual information feature selection are: (a) score_func = mutual_info_classif, (b) k = 2, (c) n_neighbors = 3, (d) random_state = None, (e) discrete_features = ‘auto’.
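As a worked instance of the MI formula, the sketch below computes the plug-in estimate for discrete variables. It is a simplification: scikit-learn's mutual_info_classif estimates MI for continuous features with a nearest-neighbor method (hence the n_neighbors parameter), so this code illustrates the definition rather than that estimator. The toy genes and their "low"/"high" binning are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Plug-in MI (in nats) between two discrete sequences of equal length."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (xv, yv), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[xv] / n) * (py[yv] / n)))
    return mi

# Hypothetical toy data: expression binned into "low"/"high", outcome 1 = IBD.
gene_a = ["high", "high", "high", "low", "low", "low"]   # tracks the outcome exactly
gene_b = ["high", "low", "high", "low", "high", "low"]   # weakly related to the outcome
outcome = [1, 1, 1, 0, 0, 0]

mi_a = mutual_information(gene_a, outcome)
mi_b = mutual_information(gene_b, outcome)
```

Here gene_a reaches the maximum possible MI for a balanced binary outcome (log 2 nats), while gene_b scores much lower, so a top-k filter would rank gene_a first.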

3.4.2. Wrapper-Based Feature Selection

Wrapper methods evaluate feature subsets by training and testing a specific ML algorithm. They create different feature combinations, train models on each, and select the subset with the best performance on a predefined metric. Wrapper methods can capture complex feature interactions missed by filter methods.
  • Recursive Feature Elimination with Cross-Validation (RFECV) [52]: RFECV is a variation of Recursive Feature Elimination that uses cross-validation to determine automatically how many of the most informative genes to retain for IBD classification. The scikit-learn parameters are:
    (a) estimator = RandomForestClassifier, (b) step = 1, (c) min_features_to_select = 10, (d) cv = 5, (e) scoring = ‘roc_auc’.
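A minimal sketch of RFECV with the listed settings is given below. The data are synthetic (generated with make_classification as a stand-in for a samples-by-genes matrix), and the forest size of 50 trees and the random seeds are assumptions made only to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for a samples-by-genes expression matrix (not the study data).
X, y = make_classification(n_samples=120, n_features=20, n_informative=5, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),  # sizes assumed
    step=1,                      # drop one feature per elimination round
    min_features_to_select=10,   # never go below ten features
    cv=5,
    scoring="roc_auc",
)
selector.fit(X, y)

# Column indices of the features retained by cross-validated elimination.
selected = [i for i, keep in enumerate(selector.support_) if keep]
```

The fitted selector exposes the chosen feature count as `n_features_`, which cannot fall below `min_features_to_select`.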

3.4.3. Embedded Feature Selection

Embedded methods integrate feature selection into the learning algorithm itself. They aim to identify the most relevant features during model training by incorporating feature selection as a step within the algorithm. Embedded methods are well-suited for high-dimensional datasets.
  • Elastic Net [53]: Elastic Net is a regularization technique combining L1 (Lasso) and L2 (Ridge) penalties to select features. It shrinks some feature weights to zero, effectively excluding those features. This allows Elastic Net to select groups of highly correlated features, making it effective for high-dimensional, correlated data. The Elastic Net parameters in scikit-learn are as follows: alpha: 1.0, max_iter: 1000, fit_intercept: True, l1_ratio: 1.0, normalize: False, max_features = 2, tol: 1e−4. The Elastic Net objective function is:
Minimize:
$$\frac{1}{2 \times n_{samples}} \times \lVert y - Xw \rVert^2 + \alpha \times \left( \text{l1\_ratio} \times \lVert w \rVert_1 + 0.5 \times (1 - \text{l1\_ratio}) \times \lVert w \rVert_2^2 \right)$$
In the above equation,
y represents the target variable.
X is the feature matrix.
w is the weight vector that indicates the importance of each feature.
$n_{samples}$ is the number of samples in the dataset.
α is a hyperparameter that controls the regularization strength.
l1_ratio is a hyperparameter determining the balance between the L1 and L2 penalties.
The Elastic Net objective function consists of two components:
The squared loss term $\lVert y - Xw \rVert^2$, which measures the deviation between the predicted values and the actual target values.
The regularization term, which consists of two parts:
The L1 penalty $\lVert w \rVert_1$ encourages sparsity in the weight vector, leading to feature selection.
The L2 penalty $\lVert w \rVert_2^2$ encourages small weights, preventing overfitting.
  • Gradient Boosting Classifier Feature Selection: This method uses a gradient-boosting classifier to assess feature importance for classification tasks. Feature importance is determined by measuring the reduction in impurity from splits on each feature during decision tree construction in gradient boosting. The most important features can be identified based on their importance scores [54]. The Gradient Boosting Classifier parameters in scikit-learn are as follows: estimator: GradientBoostingClassifier (), max_features: 2, number of estimators: 100, min_samples_leaf: 1, learning_rate: 0.1, max_depth: 3, min_samples_split: 2, subsample: 1.0. The mathematical equation for feature importance in a gradient-boosting classifier can be expressed as follows:
Importance (feature) = ∑ (gain in impurity due to splits on the feature)/(total gain in impurity)
In this equation:
- “Importance (feature)” represents the importance score of a specific feature.
- “Gain in impurity due to splits on the feature” refers to the reduction in impurity achieved by splitting on that feature during the construction of decision trees in the gradient boosting process.
- “Total gain in impurity” represents the total reduction in impurity across all features.
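Both embedded selectors can be sketched with scikit-learn's SelectFromModel, which is one plausible way to apply the max_features = 2 setting listed above; the text does not confirm that this wrapper was used. The data are synthetic, and alpha = 0.01 with l1_ratio = 0.5 are illustrative values chosen so that the toy model keeps non-zero weights, not the study's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import ElasticNet

# Synthetic stand-in for the DEG expression matrix (not the study data).
X, y = make_classification(n_samples=120, n_features=15, n_informative=4, random_state=0)

# alpha=0.01 and l1_ratio=0.5 are illustrative assumptions, not the study's settings.
enet = SelectFromModel(
    ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=1000, tol=1e-4),
    max_features=2,
    threshold=-np.inf,   # rank by |coefficient| and keep only the top max_features
).fit(X, y)

gbc = SelectFromModel(
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0),
    max_features=2,
    threshold=-np.inf,   # rank by impurity-based importance
).fit(X, y)

enet_idx = np.flatnonzero(enet.get_support())
gbc_idx = np.flatnonzero(gbc.get_support())
importance_total = gbc.estimator_.feature_importances_.sum()
```

The gradient boosting importances are normalized so that they sum to one, which matches the ratio form of the importance equation above.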

3.5. Histogram Frequency Curve Plot (HFCP) Analysis

The HFCP visualizes the distribution of gene expression levels (continuous features) between IBD patients and healthy controls in the GSE75214 dataset. It highlights differences in mean gene expression between the two groups.

3.6. Partitioning of Transformed Microarray Dataset

The dataset was split into 65% training and 35% test sets. The training set had 126 samples (111 IBD, 15 healthy), while the test set had 69 samples (61 IBD, eight healthy).
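A 65/35 partition of this kind can be sketched with scikit-learn's train_test_split. The use of stratification and the random seed are assumptions, since the text does not say how the split was drawn, and the feature matrix is random placeholder data. The resulting counts depend on rounding and need not match the reported ones exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the discovery cohort's class sizes (172 IBD, 22 healthy).
rng = np.random.default_rng(0)
y = np.array([1] * 172 + [0] * 22)       # 1 = IBD, 0 = healthy control
X = rng.normal(size=(y.size, 6))         # six hypothetical features

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.35,
    stratify=y,        # assumption: keep the IBD/healthy ratio in both parts
    random_state=0,    # assumption: fixed seed for reproducibility
)
```

Stratifying matters here because only about 11% of samples are healthy controls, so an unstratified draw could leave very few of them in the test set.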

3.7. Oversampling of Training Data Using SMOTE

To address the class imbalance, the minority class (healthy) in the training set was oversampled using the Synthetic Minority Oversampling Technique (SMOTE) [55]. This generated synthetic samples to balance the class distribution, while the test set remained unchanged. The SMOTE parameters used in imbalanced-learn (a scikit-learn-compatible library) are as follows: (a) sampling_strategy = ‘auto’, (b) random_state = 3, and (c) k_neighbors = 3.
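The idea behind SMOTE can be shown with a simplified NumPy re-implementation: each synthetic sample is placed at a random point on the segment between a minority sample and one of its k nearest minority neighbors. This is a sketch of the technique, not the imbalanced-learn implementation, and the minority-class data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=3):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))         # random minority sample
        j = neighbours[i, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                   # position along the segment i -> j
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

rng = np.random.default_rng(0)
X_healthy = rng.normal(size=(15, 6))   # hypothetical minority-class training rows
n_majority = 111                       # IBD training samples reported in the text
X_new = smote_sketch(X_healthy, n_new=n_majority - len(X_healthy))
X_balanced_minority = np.vstack([X_healthy, X_new])
```

Because every synthetic row is a convex combination of two real rows, the new samples stay inside the range of the observed minority data.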

3.8. Leave-One-Out Cross-Validation (LOOCV)

LOOCV [56] was used to assess model performance on the training set. Each data point served once as the validation set while the model was trained on the remaining points, and the aggregated confusion matrix was used to compute the average accuracy.
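The LOOCV procedure can be sketched with scikit-learn as follows. The data are synthetic and the logistic regression classifier is an arbitrary choice for illustration; any of the classifiers in the next section could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the training set (not the study data).
X, y = make_classification(n_samples=60, n_features=6, n_informative=3, random_state=0)

# Each sample is predicted by a model trained on the remaining n - 1 samples.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

cm = confusion_matrix(y, y_pred)            # aggregated over all n held-out predictions
loocv_accuracy = accuracy_score(y, y_pred)  # equals trace(cm) / n
```

With n samples this fits n models, which is affordable for cohorts of this size but grows costly for large datasets.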

3.9. Training of Supervised Learning Models

The selected gene biomarkers were used to train supervised learning classifiers, including Logistic Regression, K-Nearest Neighbors, Gaussian Naive Bayes, Support Vector Classifier, Random Forest, Multi-Layer Perceptron, and Decision Tree.

Supervised Learning Classifiers

  • Logistic Regression [57]: LR computes the probability of a sample being assigned to a specific class. The probability is obtained using the logistic (sigmoid) function. The coefficients (w0, w1, w2, …, wn) are estimated during the training process. The parameters used for the LR classifier in scikit-learn are as follows: fit_intercept = True, penalty = “l2”, dual = False, intercept_scaling = 1, C = 1, tol = 0.0001, multi_class = “auto”, class_weight = None, verbose = 0, max_iter = 100, solver = “liblinear”, warm_start = False, random_state = 123. The LR equation for classifying IBD and healthy control samples is as follows:
$$P(y = 1 \mid x) = \frac{1}{1 + \exp(-z)}$$
where z is the linear combination of the selected gene biomarkers and their corresponding coefficients:
$$z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
  • Support Vector Classifier (SVC): The SVC identifies the most favorable hyperplane that effectively differentiates the classes within the input domain. The decision function is defined as
$$f(x) = \operatorname{sign}(w \cdot \phi(x) + b)$$
In the equation, the weight vector is denoted as w, ϕ(x) represents the feature transformation (such as mapping gene biomarkers to a higher-dimensional space using kernel functions), and b represents the bias term.
  • Decision Tree (DT) [58]: DTs generate predictions using a hierarchical structure of decision and leaf nodes. At each decision node, a selected gene biomarker is compared to a threshold value and the prediction is made by traversing the tree structure based on the feature values. In binary classification, the classes are usually denoted as “0” and “1”. The DT algorithm calculates the Gini index for each potential split. The Gini index can be calculated using the following equation:
$$\text{Gini Index} = 1 - (p_0^2 + p_1^2)$$
Here,
the symbol $p_0$ denotes the probability associated with an instance in the class labeled as “0”.
$p_1$ represents the probability of an instance being assigned to class “1”.
The parameters used for the DT classifier in scikit-learn are as follows: max_depth = 7, criterion = “gini,” random_state = 1, min_samples_split = 3, min_impurity_decrease = 0, splitter = “best,” max_features = None, min_samples_leaf = 1, max_leaf_nodes = None, class_weight = None, alpha = 0, min_weight_fraction_leaf = 0.
  • Random Forest (RF) [59]: RF combines multiple DTs. The final prediction aggregates the predictions from each tree, e.g., by majority voting. The RF parameters are random_state = 123, n_estimators = 1000, and max_depth = 5.
  • Gaussian Naïve Bayes (GNB) [60]: GNB assumes the gene biomarkers follow a Gaussian distribution. The class probability is calculated using Bayes’ theorem and feature independence. The GNB parameters in scikit-learn are as follows: priors = None, var_smoothing = 1 × 10^−9. For binary classification, the equation simplifies to a ratio of probabilities:
$$P(y \mid x_1, x_2, \dots, x_n) = \frac{P(y) \times \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, x_2, \dots, x_n)}$$
Here,
the posterior probability of class “y”, given the feature values $x_1, x_2, \dots, x_n$, is represented as $P(y \mid x_1, x_2, \dots, x_n)$.
P(y) is the prior probability of class “y”.
$P(x_i \mid y)$ is the likelihood of feature $x_i$ given class “y”.
$P(x_1, x_2, \dots, x_n)$ is the probability of observing the feature values $x_1, x_2, \dots, x_n$.
  • eXtreme Gradient Boosting Classifier (XGBoost) [61]: XGBoost optimizes weak model weights to minimize a loss function using gradient descent and regularization. It offers advanced features like custom loss functions and handling missing values. The mathematical equation for XGBoost can be represented as
$$\hat{y} = \sum w \times h(x)$$
In the given equation,
$\hat{y}$ represents the predicted outcome.
$\sum$ denotes the summation over the models in the ensemble.
w represents the weight associated with each model in the ensemble.
$h(x)$ represents the prediction made by each weak model (e.g., decision tree) in the ensemble.
The parameters used for the XGBoost classifier (through the scikit-learn-compatible interface of the xgboost library) are as follows: n_estimators: 100, max_depth: 3, subsample: 1.0, colsample_bytree: 1.0, reg_alpha: 0.0, reg_lambda: 1.0, min_child_weight: 1, gamma: 0.0, random_state: None, learning_rate: 0.1.
  • Multi-Layer Perceptron (MLP) [62]: MLP involves forwarding gene biomarkers through interconnected neurons and calculating outputs using activation functions. The parameters used for the MLP classifier in scikit-learn are as follows: hidden_layer_sizes = 100, activation = “relu”, solver = “adam”, alpha = 0.0001, shuffle = True, learning_rate_init = 0.001, learning_rate = “constant”, max_iter = 200, power_t = 0.5, tol = 0.0001, batch_size = “auto”, nesterovs_momentum = True, warm_start = False, early_stopping = False, verbose = False, beta_1 = 0.9, n_iter_no_change = 10, validation_fraction = 0.2, epsilon = 1 × 10^−8, beta_2 = 0.999, max_fun = 15000, random_state = 123, momentum = 0.9.
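The classifiers listed above can be instantiated and compared in a few lines. The sketch below uses synthetic data and passes only a subset of the listed parameters; classifiers for which no settings are given in the text (KNN, SVC) use library defaults, and XGBoost is omitted because it lives in the separate xgboost package. The split and seeds for the toy data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for six selected biomarkers (not the study data).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(C=1.0, solver="liblinear", max_iter=100, random_state=123),
    "KNN": KNeighborsClassifier(),   # defaults; the text lists no KNN settings
    "GNB": GaussianNB(var_smoothing=1e-9),
    "SVC": SVC(),                    # defaults; the text lists no SVC settings
    "RF": RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=123),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=200, random_state=123),
    "DT": DecisionTreeClassifier(max_depth=7, min_samples_split=3, random_state=1),
}

# Fit each model on the training part and record its held-out accuracy.
test_accuracy = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Keeping the models in a dictionary makes it straightforward to evaluate all of them with the same split and metrics.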

3.10. Evaluating Model Performance

We assessed the diagnostic classifier’s performance using standard metrics, including the confusion matrix, accuracy, F1-score, and the area under the ROC curve (AUC) [56].
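The metrics named above can be sketched in a few lines of pure Python. This is an illustrative example with made-up labels and scores, not the evaluation code used in the study. The function names are hypothetical, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) with `positive` treated as the IBD class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = IBD, 0 = control) and classifier scores.
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
auc = roc_auc(y_true, scores)
```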

3.11. Validating the DEGs-Based ML Model

We used two independent cohorts, GSE10616 and GSE36807, to validate the gene biomarkers identified from the discovery cohort (GSE75214). This allowed us to assess the reliability and applicability of the selected biomarkers.

3.12. Pathway Analysis of Selected DEGs

We conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis on the six identified DEGs (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) using the over-representation analysis (ORA) method in WebGestalt [19]. We identified significantly enriched terms using a p-value cutoff of 0.05.
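Over-representation analysis is conventionally based on a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation for a single term. It is an assumption-laden illustration rather than WebGestalt's implementation: the function name, the background size, and the term size are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def ora_p_value(n_background, n_in_term, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn without replacement
    from n_background genes, of which n_in_term are annotated to the term."""
    upper = min(n_in_term, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20000 background genes, a term with 150 genes,
# a six-gene query list, two of which fall in the term.
p = ora_p_value(n_background=20000, n_in_term=150, n_query=6, n_overlap=2)
is_enriched = p < 0.05  # the cutoff stated in the text
```

With a query list of only six genes, a term can reach the cutoff with very few overlapping genes, so the choice of background set has a large influence on the result.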

4. Results

The results section presents the key findings of this study in a structured manner. We begin with an overview of the differentially expressed genes (DEGs) identified from the microarray dataset, including upregulated and downregulated DEGs. Visualizations such as heat maps, volcano plots, and Venn diagrams are then used to illustrate gene expression patterns. Next, we describe the feature selection approach employed to filter the DEGs and the statistical analysis used to identify the most significant gene biomarkers. The performance of various supervised machine learning models using these gene biomarkers is evaluated, and the optimal Random Forest (RF) model is selected and tuned. The generalizability and robustness of the RF-based model are then demonstrated through independent validation on external cohorts, and its performance is compared with that of other published models. Finally, we present the gene ontology and pathway enrichment analysis of the key upregulated and downregulated DEGs to gain insights into the underlying biological processes and mechanisms associated with inflammatory bowel disease.

4.1. Identification of DEGs from GSE75214

We obtained gene expression dataset GSE75214 with microarray data from IBD and normal samples. The GPL6244 platform, Affymetrix Human Gene 1.0 ST Array, was used. Comparative analysis identified 2239 significant DEGs between IBD and Normal groups from the GSE75214 cohort.

4.2. Identification of Top DEGs in IBD vs. Control

Comparative analysis identified the top 10 upregulated genes: DENND2B, LCN2, IFITM3, SLC6A14, BACE2, S100A11, PLS3, PARP8, NFKBIZ, DUOX2, ranked by p-value. The top 10 downregulated genes were: CNTFR, RNY1P5, STBD1, HINFP, PAQR5, CNTN4, RNF135, FRMD1, SFXN1, SLC38A4, also ranked by p-value. See Table 2 for the top 10 upregulated and downregulated genes and Supplementary Tables S1 and S2 for the comprehensive lists.

4.3. Visualizing Upregulated and Downregulated DEGs from GSE75214

Figure 2a shows a heatmap of the top three upregulated genes in each main cluster. Cluster 1 has SLC6A14, DUOX2, MMP3; Cluster 2 has DUOXA2, MMP1, LCN2; Cluster 3 has IDO1, S100A8, SAA2, IL1B. Figure 2b shows a heatmap of each main cluster’s top three downregulated genes. Cluster 1 has PRKG2, MT1M, SLC26A2; Cluster 2 has SLC13A1, HMGCS2, UGT2A3; Cluster 3 has CYP2B6, ABCG2, TMIGD1, MEP1B.
The volcano plot in Figure 3a visualizes the statistically significant DEGs in the GSE75214 dataset, selected using a p-value < 0.001 and a fold change > 1.06712. The Venn diagram in Figure 3b shows the overlap of DEGs between IBD and control groups. There were 1422 overlapping upregulated genes and 817 overlapping downregulated genes shared between the two groups. No unique DEGs were detected.
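The thresholds quoted above can be expressed as a simple filter. The sketch below is illustrative only: the gene names and values are made up, and it assumes that the fold-change threshold is applied to the absolute value of a log-scale fold change, with the sign giving the direction of regulation. The text does not state the scale, so this is an assumption.

```python
P_CUTOFF = 0.001      # p-value threshold quoted in the text
FC_CUTOFF = 1.06712   # fold-change threshold quoted in the text

def classify_deg(log_fc, p_value, p_cutoff=P_CUTOFF, fc_cutoff=FC_CUTOFF):
    """Label a gene as 'up', 'down', or 'not significant'.

    Assumes log_fc is a signed log fold change (IBD vs. control).
    """
    if p_value < p_cutoff and abs(log_fc) > fc_cutoff:
        return "up" if log_fc > 0 else "down"
    return "not significant"

# Hypothetical (log fold change, p-value) pairs; not study data.
genes = {
    "GENE_A": (2.3, 1e-6),
    "GENE_B": (-1.5, 5e-4),
    "GENE_C": (0.4, 1e-8),   # significant p-value, but small effect
    "GENE_D": (1.9, 0.02),   # large effect, but p-value too high
}
labels = {gene: classify_deg(fc, p) for gene, (fc, p) in genes.items()}
```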

4.4. Feature Selection for Potential IBD Biomarkers

A feature selection (FS) approach was used to identify potential IBD biomarkers from the DEGs in the GSE75214 dataset. Four different FS techniques were applied (Table 3).
The unique features from the subsets obtained with the four FS methods were then combined into a master feature subset consisting of six gene biomarkers, as shown in Table 4.
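The combination step can be sketched as an order-preserving union of the per-method subsets. This reading (a union with duplicates removed) is an interpretation of the text, and the method names and gene placeholders below are hypothetical, not the contents of Table 3 or Table 4.

```python
def merge_feature_subsets(subsets):
    """Combine per-method feature lists into one list without duplicates,
    keeping the order in which features are first seen."""
    master = []
    seen = set()
    for method in subsets:            # dicts preserve insertion order
        for gene in subsets[method]:
            if gene not in seen:
                seen.add(gene)
                master.append(gene)
    return master

# Placeholder subsets for four hypothetical feature selection methods.
subsets = {
    "fs_method_1": ["GENE_1", "GENE_2", "GENE_3"],
    "fs_method_2": ["GENE_2", "GENE_4"],
    "fs_method_3": ["GENE_1", "GENE_5"],
    "fs_method_4": ["GENE_3", "GENE_6"],
}
master = merge_feature_subsets(subsets)
```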

4.5. Results of Two-Tailed Unpaired T-Test on Potential IBD Biomarkers

A two-tailed unpaired t-test with a significance level of 5% was performed to identify which of the six selected gene biomarkers showed significant differences in mean expression between the IBD and healthy control groups. The results of this t-test analysis are presented in Table 5. All six gene biomarkers (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) had p-values less than 0.05, indicating that their mean expression levels significantly differed between the IBD and control samples.
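For reference, a two-tailed unpaired t-test can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions: it uses the pooled-variance (Student) form, since the text does not say whether equal variances were assumed, and it obtains the p-value by numerically integrating the t density. The function names and sample values are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2.0 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson integration of the density on [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0                 # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * area)

def pooled_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-tailed p-value)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, df, two_tailed_p(t, df)

# Made-up expression values for one gene in two groups (not study data).
t_stat, dof, p_value = pooled_t_test([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
significant = p_value < 0.05   # 5% significance level, as in the text
```

The numerical integration is adequate for deciding significance at the 5% level, but it loses precision for extremely small p-values, where a statistics library would be the better choice.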

4.6. Potential IBD Biomarkers Identified

All six gene biomarkers (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) showed significant differences in mean expression levels between the IBD and healthy control groups, as confirmed by the two-tailed unpaired t-test (p < 0.05). The frequency distribution plots in Figure 4 further illustrate the significant differential expression of these six DEG biomarkers across the IBD and control samples. Based on these findings, the final set of six gene biomarkers (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) was used to build and evaluate supervised classification models and to select the best one for distinguishing IBD patients from healthy controls.

4.7. Screening the Best-Performing Classification Model

The study aimed to identify the most effective classification model for distinguishing IBD from healthy control samples using the set of six potential DEG biomarkers identified earlier. As shown in Figure 5, the supervised classification models were evaluated using the six biomarker features and a baseline set of 33,253 gene features.
The results indicate that the Random Forest (RF) model outperformed the other supervised learning algorithms when utilizing the six selected biomarker features. Based on leave-one-out cross-validation, the RF model achieved the highest aggregated F1 score (0.97628 ± 0.0150), accuracy (0.9767 ± 0.0148), and AUC (0.9767 ± 0.0148) (Table 6). These findings suggest that using the RF classification model, the six DEG biomarkers (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) can effectively distinguish IBD patients from healthy controls. This approach holds promise for earlier, more accurate diagnosis of IBD.
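Leave-one-out cross-validation, as used above, holds out each sample once and trains on all the others. The sketch below shows the loop in pure Python. A nearest-centroid rule stands in for the actual classifiers purely to keep the example self-contained, and the data are made up; neither reflects the study's models or samples.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier: assign x to the class with the closest mean vector."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [r for r, t in zip(train_X, train_y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def leave_one_out_accuracy(X, y, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and average the hits."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy, made-up expression values for two hypothetical genes.
X = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.8], [3.0, 2.0], [3.2, 2.2], [2.9, 1.9]]
y = ["control", "control", "control", "IBD", "IBD", "IBD"]
loo_acc = leave_one_out_accuracy(X, y)
```

Any classifier with the same `predict(train_X, train_y, x)` signature could be passed in place of the stand-in rule.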

4.8. Tuning and Validating the Random Forest Classifier

The optimal hyperparameters for the Random Forest (RF) classifier were determined using the GSE75214 training dataset containing the 6 gene biomarkers: n_estimators: 200, max_depth: None, max_features: ‘sqrt’, min_samples_split: 5, and min_samples_leaf: 2. To validate these hyperparameters, 5-fold cross-validation was performed, yielding: F1 Score: 0.9870 ± 0.013, Accuracy: 0.9855 ± 0.0145, and AUC: 0.992 ± 0.018.
The consolidated confusion matrix in Figure 6 illustrates the optimized RF model’s performance on independent test data. In this matrix, the positive class represents IBD, and the negative class represents healthy controls. The validation process improved the classification of true positives and negatives, as evidenced by the higher average accuracy, F1-score, and AUC values.
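The 5-fold cross-validation mentioned above can be sketched as follows. Because the cohort is imbalanced (172 IBD and 22 control samples according to the abstract), this example stratifies the folds by class. Stratification is an assumption made here for illustration; the text does not state how the folds were built, and the function name and seed are hypothetical.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep roughly
    the same class ratio as the full data set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)      # deal samples round-robin
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx

labels = ["IBD"] * 172 + ["control"] * 22
splits = list(stratified_k_fold(labels))
controls_per_fold = [
    sum(1 for i in test if labels[i] == "control") for _, test in splits
]
```

Each metric would then be computed on every test fold and summarized as mean ± standard deviation across the five folds.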

4.9. Evaluating Model Generalizability and Robustness

The optimized RF model, trained and tested using the 6-gene biomarker dataset, was further evaluated for generalizability and robustness across different IBD gene expression cohorts. As shown in Figure 7, the RF model exhibited strong performance on the GSE10616 cohort (Accuracy: 0.820 [CI: 0.806–0.834], AUC: 0.880 [CI: 0.870–0.890]) and the GSE36807 cohort (Accuracy: 0.850 [CI: 0.842–0.858], AUC: 0.900 [CI: 0.895–0.905]). These results confirm the RF model’s ability to effectively adapt to the GSE36807 and GSE10616 datasets, demonstrating its potential for accurately classifying IBD versus healthy individuals across different cohorts.
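The text reports confidence intervals without describing how they were obtained. One common option is a percentile bootstrap over the validation samples, sketched below. This is an assumption for illustration, not the authors' procedure, and the function name, the number of resamples, and the toy predictions are hypothetical.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [1 if t == p else 0 for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_boot):
        # Resample the per-sample hit indicators with replacement.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, lower, upper

# Made-up predictions: 85 of 100 validation samples classified correctly.
y_true = [1] * 100
y_pred = [1] * 85 + [0] * 15
acc, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
```

The same resampling scheme can be applied to the AUC by recomputing it on each bootstrap sample.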

4.10. Comparative Performance Evaluation

Figure 8 compares the performance of our proposed 6-gene biomarker-based RF classification model against other published models. Our model achieved an accuracy of 0.9855 ± 0.0145 and an AUC of 0.992 ± 0.018, outperforming the other gene biomarker-based models. These results indicate that the 6-gene biomarker-based RF model has superior classification capability compared to previous approaches. This suggests that the 6-gene signature could significantly contribute to earlier IBD diagnosis, improved treatment strategies, and more personalized patient management.

4.11. Gene Ontology and Pathway Enrichment Analysis of the Six Key DEGs

The over-representation analysis (ORA) method available in WebGestalt, developed by Wang et al. in 2017 [19], was used to perform enrichment analysis of GO terms and KEGG pathways for the six DEGs (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1). A significance cutoff of p < 0.05 was used to determine enriched terms. The analysis of GO biological processes revealed that the upregulated genes (VWF, IL1RL1, DENND2B, MMP14) were associated with several biological processes, including cell-substrate adhesion, positive regulation of cell activation, extracellular structure organization, regulation of leukocyte activation, and interleukin-5 production. In addition, the analysis of KEGG and Reactome pathways revealed significant enrichment of the upregulated genes in pathways such as the GnRH signaling pathway, platelet activation, complement and coagulation cascades, ECM-receptor interaction, parathyroid hormone synthesis, secretion and action, the TNF signaling pathway, integrin signaling, and extracellular matrix organization, as presented in Table 7.
The two downregulated DEGs, NAAA and PANK1, were linked to processes like ribose phosphate biosynthesis, cofactor biosynthesis, nucleoside metabolism, neurotransmitter transport, purine biosynthesis, neurotransmitter regulation, nucleoside phosphate biosynthesis, and coenzyme metabolism. Pathway analysis also showed these genes were enriched in pantothenate/CoA biosynthesis, neurotransmitter release, vitamin/cofactor metabolism, and chemical synaptic transmission pathways, as shown in Table 8. These findings provide insights into the biological processes and pathways affected by the DEGs.

5. Discussion

IBD is a chronic inflammatory disorder characterized by persistent symptoms and relatively low mortality. However, the increasing global prevalence of IBD has strained healthcare systems. While the precise cause of IBD is uncertain, understanding the disease’s pathology and molecular mechanisms is crucial for improving diagnosis and treatment. Potential diagnostic biomarkers can be identified by combining gene expression data with bioinformatics and ML analysis. This study used the GSE75214 dataset as the discovery cohort, with GSE36807 and GSE10616 as validation cohorts. We identified 1422 upregulated and 817 downregulated differentially expressed genes (DEGs) in GSE75214. Our analysis uncovered six potential gene biomarkers (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) with strong diagnostic potential. Notably, DENND2B and PANK1 represent novel IBD biomarkers. Integrating these six genes into a Random Forest model achieved exceptional performance, with an AUC of 0.992 ± 0.018 and an accuracy of 0.9855 ± 0.0145. Validation on independent cohorts supported the model’s robustness. Our study provides novel insights into IBD-associated genes, introduces an innovative ML approach, and highlights DENND2B and PANK1 as new biomarker candidates. These findings could significantly impact IBD research and diagnostics.
Our gene ontology and pathway analysis enhanced our understanding of the processes involved in IBD. The analysis revealed that the upregulated genes, namely IL1RL1, MMP14, and VWF, are involved in key cellular processes. The upregulation of IL1RL1 observed here corroborates earlier studies showing that IL1RL1 is upregulated in IBD patients. In the context of IBD, IL1RL1, through its product ST2, may contribute to regulating immune responses in the gut. This finding is significant because IL1RL1 is preferentially expressed on colonic T-regulatory cells, supporting their function and adaptation to the inflammatory environment. This is crucial in preserving gut homeostasis and potentially attenuating the excessive inflammation associated with IBD [63,64].
MMP14, a matrix metalloproteinase, participates in extracellular matrix degradation, which is essential for tissue remodeling and healing. In IBD, excessive MMP activity and insufficient inhibition by tissue inhibitors of metalloproteinases (TIMPs) can contribute to mucosal damage and inflammation [65,66,67]. Our findings corroborate previous studies showing MMP14 upregulation in IBD patients.
VWF is implicated in blood coagulation, platelet adhesion, and wound healing. Elevated VWF levels in active IBD may stem from vascular injury or inflammatory mediator release and contribute to the increased thrombosis risk [68,69]. Monitoring VWF can assist in IBD hemostasis management [70]. Importantly, our findings also show that VWF is upregulated in IBD patients compared to normal samples, further signifying the importance of this gene in our study.
The DENND2B gene, with its predicted guanyl-nucleotide exchange factor activity, could potentially influence MAPK signaling pathways [71,72,73]. DENND2B’s activation of Rab13 enhances the invasive potential of epithelial cancers [74,75]. Conversely, disrupting this DENND2B-Rab13 signaling axis significantly impairs the spread and migratory capacity of highly aggressive epithelial cancer cells in vitro and in vivo [76,77]. Our data revealed DENND2B overexpression in IBD, hinting at its role in inflammation and healing, although its specific function requires further research. These findings open possibilities for therapeutic interventions targeting DENND2B in IBD and cancer.
The gene ontology and enrichment analysis show that the downregulated genes, NAAA and PANK1, are involved in pantothenate/CoA biosynthesis, neurotransmitter regulation, and transport (Table 8). NAAA has limited reported connections to IBD, but studies found decreased PPAR, PPAR, and NAAA, with increased FAAH and iNOS, in colitis mucosa [78]. Another study identified NAAA as a potential UC biomarker [79]. NAAA modulates the endocannabinoid system, which is altered in IBD and influences inflammation and pain [80,81,82]. Therefore, our current study findings suggest that decreased levels of NAAA expression may alter the endocannabinoid signaling pathway, thereby affecting endocannabinoid molecules’ anti-inflammatory effect, leading to inflammation and pain in IBD patients.
PANK1 encodes the rate-limiting enzyme in CoA synthesis from pantothenate [83]. PANK1 is associated with CoA biosynthesis, phosphorylation, and acetyl-CoA regulation. Altered CoA metabolism can affect gut epithelium energy and inflammation in IBD [84]. Moreover, the decreased PANK1 level observed in IBD patients in our gene expression analysis suggests that changes in intracellular CoA may impact the gut epithelium. Research also reveals PANK1’s potential in cancer. PANK1 can inhibit hepatocellular carcinoma by regulating Wnt/β-catenin [85] and modulating the cell cycle [86,87]. Bioinformatic analysis identified PANK1 as differentially expressed between normal and tumor tissues [87]. PANK1 expression correlates with prognosis, tumor immunity, and metabolism in renal cell carcinoma [88]. These findings suggest PANK1’s importance as a therapeutic target and prognostic biomarker in various cancers.
The identified upregulated and downregulated DEGs have significant roles in IBD and cancer, warranting further research on their therapeutic implications. However, the study has limitations that require consideration. The primary findings need validation in larger clinical cohorts. Additionally, immune cell infiltration studies are essential to assess composition and correlations with IBD pathogenesis. Such investigations may yield new insights into the molecular mechanisms underlying IBD.

6. Conclusions

In conclusion, our research identified a six-gene signature, including novel biomarkers DENND2B and PANK1, that effectively distinguished active IBD from healthy controls. Functional analysis revealed the signature genes were associated with key pathways in IBD pathogenesis, such as complement/coagulation, neurotransmitter regulation, and CoA biosynthesis. This six-gene signature demonstrated diagnostic potential beyond IBD, highlighting its versatility.
Future priorities include molecular validation of the biomarkers using qRT-PCR and investigating immune cell infiltration to provide deeper insights into IBD pathogenesis. Overall, our integrative approach of transcriptomics, machine learning, and high-throughput technologies advances the understanding and management of complex diseases like IBD. These findings lay the foundation for further research into genetic biomarkers with diagnostic and therapeutic implications.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics14111182/s1, Table S1: List of all upregulated genes in the comparison between individuals with IBD and healthy controls in the GSE75214 data; Table S2: List of all downregulated genes in the comparison between individuals with IBD and healthy controls in the GSE75214 data.

Author Contributions

A.H.S., H.A.S.A. and N.A. conceived the study design; A.H.S. preprocessed the data; S.A. and A.H.S. performed the research and analyzed the data; A.H.S., N.A. and S.J.M. drafted the materials and methodology and edited the figures; A.H.S. drafted the abstract, introduction, results, and discussion; S.A., H.A.S.A., S.J.M. and N.A. edited and proofread the manuscript. All authors were substantially and intellectually involved in the present study to meet the requirements. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant No. (D-830-714-1435). Therefore, the authors acknowledge DSR’s technical and financial support.

Institutional Review Board Statement

In the present study, we worked with gene expression datasets made publicly available by the authors of [17,46,47]. The gene expression data can be downloaded from https://www.ncbi.nlm.nih.gov/geo/, accessed on 20 September 2023. Therefore, the present study did not directly involve animals or human participants. The relevant local Ethics Committees approved the original retrospective studies [17,46,47].

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets are publicly available at: https://www.ncbi.nlm.nih.gov/geo/, accessed on 1 December 2023.

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant no. (D-830-714-1435). The authors, therefore, acknowledge DSR’s technical and financial support.

Conflicts of Interest

The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Alatab, S.; Sepanlou, S.G.; Ikuta, K.; Vahedi, H.; Bisignano, C.; Safiri, S.; Sadeghi, A.; Nixon, M.R.; Abdoli, A.; Abolhassani, H.; et al. The Global, Regional, and National Burden of Inflammatory Bowel Disease in 195 Countries and Territories, 1990–2017: A Systematic Analysis for the Global Burden of Disease Study 2017. Lancet Gastroenterol. Hepatol. 2020, 5, 17–30. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, R.; Li, Z.; Liu, S.; Zhang, D. Global, Regional and National Burden of Inflammatory Bowel Disease in 204 Countries and Territories from 1990 to 2019: A Systematic Analysis Based on the Global Burden of Disease Study 2019. BMJ Open 2023, 13, e065186. [Google Scholar] [CrossRef]
  3. Bourgonje, A.R.; Van Goor, H.; Faber, K.N.; Dijkstra, G. Clinical Value of Multi-Omics-Based Biomarker Signatures in Inflammatory Bowel Diseases: Challenges and Opportunities. Clin. Transl. Gastroenterol. 2023, 14, e00579. [Google Scholar] [CrossRef] [PubMed]
  4. Seyed Tabib, N.S.; Madgwick, M.; Sudhakar, P.; Verstockt, B.; Korcsmaros, T.; Vermeire, S. Big Data in IBD: Big Progress for Clinical Practice. Gut 2020, 69, 1520–1532. [Google Scholar] [CrossRef]
  5. Dhyani, M.; Joshi, N.; Bemelman, W.A.; Gee, M.S.; Yajnik, V.; D’Hoore, A.; Traverso, G.; Donowitz, M.; Mostoslavsky, G.; Lu, T.K.; et al. Challenges in IBD Research: Novel Technologies. Inflamm. Bowel Dis. 2019, 25, S24–S30. [Google Scholar] [CrossRef] [PubMed]
  6. Alsoud, D.; Vermeire, S.; Verstockt, B. Biomarker Discovery for Personalized Therapy Selection in Inflammatory Bowel Diseases: Challenges and Promises. Curr. Res. Pharmacol. Drug Discov. 2022, 3, 100089. [Google Scholar] [CrossRef] [PubMed]
  7. Xu, C.; Jackson, S.A. Machine Learning and Complex Biological Data. Genome Biol. 2019, 20, 76. [Google Scholar] [CrossRef]
  8. Stańczyk, U. Feature Evaluation by Filter, Wrapper and Embedded Approaches. Stud. Comput. Intell. 2015, 584, 29–44. [Google Scholar] [CrossRef] [PubMed]
  9. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. Feature Selection for High-Dimensional Data. Prog. Artif. Intell. 2016, 5, 65–75. [Google Scholar] [CrossRef]
  10. Nguyen, N.H.; Picetti, D.; Dulai, P.S.; Jairath, V.; Sandborn, W.J.; Ohno-Machado, L.; Chen, P.L.; Singh, S. Machine Learning-Based Prediction Models for Diagnosis and Prognosis in Inflammatory Bowel Diseases: A Systematic Review. J. Crohn’s Colitis 2022, 16, 398–413. [Google Scholar] [CrossRef]
  11. Alghoul, Z.; Yang, C.; Merlin, D. The Current Status of Molecular Biomarkers for Inflammatory Bowel Disease. Biomedicines 2022, 10, 1492. [Google Scholar] [CrossRef] [PubMed]
  12. Nowak, J.K.; Kalla, R.; Satsangi, J. Current and Emerging Biomarkers for Ulcerative Colitis. Expert. Rev. Mol. Diagn. 2023, 23, 1107–1119. [Google Scholar] [CrossRef]
  13. Stafford, I.S.; Gosink, M.M.; Mossotto, E.; Ennis, S.; Hauben, M. A Systematic Review of Artificial Intelligence and Machine Learning Applications to Inflammatory Bowel Disease, with Practical Guidelines for Interpretation. Inflamm. Bowel Dis. 2022, 28, 1573–1583. [Google Scholar] [CrossRef] [PubMed]
  14. Gubatan, J.; Levitte, S.; Patel, A.; Balabanis, T.; Wei, M.T.; Sinha, S.R. Artificial Intelligence Applications in Inflammatory Bowel Disease: Emerging Technologies and Future Directions. World J. Gastroenterol. 2021, 27, 1920–1935. [Google Scholar] [CrossRef]
  15. Stankovic, B.; Kotur, N.; Nikcevic, G.; Gasic, V.; Zukic, B.; Pavlovic, S. Machine Learning Modeling from Omics Data as Prospective Tool for Improvement of Inflammatory Bowel Disease Diagnosis and Clinical Classifications. Genes 2021, 12, 1438. [Google Scholar] [CrossRef]
  16. Metwaly, A.; Haller, D. Multi-Omics in IBD Biomarker Discovery: The Missing Links. Nat. Rev. Gastroenterol. Hepatol. 2019, 16, 587–588. [Google Scholar] [CrossRef]
  17. Vancamelbeke, M.; Vanuytsel, T.; Farré, R.; Verstockt, S.; Ferrante, M.; Van Assche, G.; Rutgeerts, P.; Schuit, F.; Vermeire, S.; Arijs, I.; et al. Genetic and Transcriptomic Bases of Intestinal Epithelial Barrier Dysfunction in Inflammatory Bowel Disease. Inflamm. Bowel Dis. 2017, 23, 1718–1729. [Google Scholar] [CrossRef]
  18. Tharwat, A. Classification Assessment Methods. Appl. Comput. Inform. 2018, 17, 168–192. [Google Scholar] [CrossRef]
  19. Wang, J.; Vasaikar, S.; Shi, Z.; Greer, M.; Zhang, B. WebGestalt 2017: A More Comprehensive, Powerful, Flexible and Interactive Gene Set Enrichment Analysis Toolkit. Nucleic Acids Res. 2017, 45, W130–W137. [Google Scholar] [CrossRef] [PubMed]
  20. Stemmer, E.; Zahavi, T.; Kellerman, M.; Sinberger, L.A.; Shrem, G.; Salmon-Divon, M. Exploring Potential Biomarkers and Therapeutic Targets in Inflammatory Bowel Disease: Insights from a Mega-Analysis Approach. Front. Immunol. 2024, 15, 1353402. [Google Scholar] [CrossRef]
  21. Tang, Q.; Shi, X.; Xu, Y.; Zhou, R.; Zhang, S.; Wang, X.; Zhu, J. Identification and Validation of the Diagnostic Markers for Inflammatory Bowel Disease by Bioinformatics Analysis and Machine Learning. Biochem. Genet. 2023, 62, 371–384. [Google Scholar] [CrossRef] [PubMed]
  22. Yu, S.; Zhang, M.; Ye, Z.; Wang, Y.; Wang, X.; Chen, Y.G. Development of a 32-Gene Signature Using Machine Learning for Accurate Prediction of Inflammatory Bowel Disease. Cell Regen. 2023, 12, 8. [Google Scholar] [CrossRef] [PubMed]
  23. Park, S.K.; Kim, S.; Lee, G.Y.; Kim, S.Y.; Kim, W.; Lee, C.W.; Park, J.L.; Choi, C.H.; Kang, S.B.; Kim, T.O.; et al. Development of a Machine Learning Model to Distinguish between Ulcerative Colitis and Crohn’s Disease Using Rna Sequencing Data. Diagnostics 2021, 11, 2365. [Google Scholar] [CrossRef] [PubMed]
  24. Abbas, M.; Matta, J.; Le, T.; Bensmail, H.; Obafemi-Ajayi, T.; Honavar, V.; EL-Manzalawy, Y. Biomarker Discovery in Inflammatory Bowel Diseases Using Network-Based Feature Selection. PLoS ONE 2019, 14, e0225382. [Google Scholar] [CrossRef] [PubMed]
  25. Smolander, J.; Dehmer, M.; Emmert-Streib, F. Comparing Deep Belief Networks with Support Vector Machines for Classifying Gene Expression Data from Complex Disorders. FEBS Open Bio 2019, 9, 1232–1248. [Google Scholar] [CrossRef] [PubMed]
  26. Biasci, D.; Lee, J.C.; Noor, N.M.; Pombal, D.R.; Hou, M.; Lewis, N.; Ahmad, T.; Hart, A.; Parkes, M.; Mckinney, E.F.; et al. A Blood-Based Prognostic Biomarker in IBD. Gut 2019, 68, 1386–1395. [Google Scholar] [CrossRef] [PubMed]
  27. Han, L.; Maciejewski, M.; Brockel, C.; Gordon, W.; Snapper, S.B.; Korzenik, J.R.; Afzelius, L.; Altman, R.B. A Probabilistic Pathway Score (PROPS) for Classification with Applications to Inflammatory Bowel Disease. Bioinformatics 2018, 34, 985–993. [Google Scholar] [CrossRef] [PubMed]
  28. Yuan, F.; Zhang, Y.H.; Kong, X.Y.; Cai, Y.D. Identification of Candidate Genes Related to Inflammatory Bowel Disease Using Minimum Redundancy Maximum Relevance, Incremental Feature Selection, and the Shortest-Path Approach. Biomed Res. Int. 2017, 2017, 5741948. [Google Scholar] [CrossRef] [PubMed]
  29. Isakov, O.; Dotan, I.; Ben-Shachar, S. Machine Learning-Based Gene Prioritization Identifies Novel Candidate Risk Genes for Inflammatory Bowel Disease. Inflamm. Bowel Dis. 2017, 23, 1516–1523. [Google Scholar] [CrossRef]
  30. Chen, G.B.; Lee, S.H.; Montgomery, G.W.; Wray, N.R.; Visscher, P.M.; Gearry, R.B.; Lawrance, I.C.; Andrews, J.M.; Bampton, P.; Mahy, G.; et al. Performance of Risk Prediction for Inflammatory Bowel Disease Based on Genotyping Platform and Genomic Risk Score Method. BMC Med. Genet. 2017, 18, 94. [Google Scholar] [CrossRef]
  31. Hübenthal, M.; Hemmrich-Stanisak, G.; Degenhardt, F.; Szymczak, S.; Du, Z.; Elsharawy, A.; Keller, A.; Schreiber, S.; Franke, A. Sparse Modeling Reveals MiRNA Signatures for Diagnostics of Inflammatory Bowel Disease. PLoS ONE 2015, 10, e0140155. [Google Scholar] [CrossRef] [PubMed]
  32. Wei, Z.; Wang, W.; Bradfield, J.; Li, J.; Cardinale, C.; Frackelton, E.; Kim, C.; Mentch, F.; Van Steen, K.; Visscher, P.M.; et al. Large Sample Size, Wide Variant Spectrum, and Advanced Machine-Learning Technique Boost Risk Prediction for Inflammatory Bowel Disease. Am. J. Hum. Genet. 2013, 92, 1008–1012. [Google Scholar] [CrossRef] [PubMed]
  33. Qian, R.; Tang, M.; Ouyang, Z.; Cheng, H.; Xing, S. Identification of Ferroptosis-Related Genes in Ulcerative Colitis: A Diagnostic Model with Machine Learning. Ann. Transl. Med. 2023, 11, 177. [Google Scholar] [CrossRef] [PubMed]
  34. Bu, M.; Cao, X.; Zhou, B. Identification of Potential Biomarkers and Immune Infiltration Characteristics in Ulcerative Colitis by Combining Results from Two Machine Learning Algorithms. Comput. Math. Methods Med. 2022, 2022, 5412627. [Google Scholar] [CrossRef] [PubMed]
  35. Zhang, L.; Mao, R.; Lau, C.T.; Chung, W.C.; Chan, J.C.P.; Liang, F.; Zhao, C.; Zhang, X.; Bian, Z. Identification of Useful Genes from Multiple Microarrays for Ulcerative Colitis Diagnosis Based on Machine Learning Methods. Sci. Rep. 2022, 12, 9962. [Google Scholar] [CrossRef] [PubMed]
  36. Khorasani, H.M.; Usefi, H.; Peña-Castillo, L. Detecting Ulcerative Colitis from Colon Samples Using Efficient Feature Selection and Machine Learning. Sci. Rep. 2020, 10, 13744. [Google Scholar] [CrossRef] [PubMed]
  37. Li, H.; Lai, L.; Shen, J. Development of a Susceptibility Gene Based Novel Predictive model for the Diagnosis of Ulcerative Colitis Using Random Forest and Artificial Network. Aging 2020, 12, 20471–20482. [Google Scholar] [CrossRef] [PubMed]
  38. Duttagupta, R.; DiRienzo, S.; Jiang, R.; Bowers, J.; Gollub, J.; Kao, J.; Kearney, K.; Rudolph, D.; Dawany, N.B.; Showe, M.K.; et al. Genome-Wide Maps of Circulating MiRNA Biomarkers for Ulcerative Colitis. PLoS ONE 2012, 7, e0031241. [Google Scholar] [CrossRef]
  39. Raimondi, D.; Simm, J.; Arany, A.; Fariselli, P.; Cleynen, I.; Moreau, Y. An Interpretable Low-Complexity Machine Learning Framework for Robust Exome-Based In-Silico Diagnosis of Crohn’s Disease Patients. NAR Genom. Bioinform. 2020, 2, lqaa011. [Google Scholar] [CrossRef]
  40. Romagnoni, A.; Jégou, S.; Van Steen, K.; Wainrib, G.; Hugot, J.P.; Peyrin-Biroulet, L.; Chamaillard, M.; Colombel, J.F.; Cottone, M.; D’Amato, M.; et al. Comparative Performances of Machine Learning Methods for Classifying Crohn Disease Patients Using Genome-Wide Genotyping Data. Sci. Rep. 2019, 9, 10351. [Google Scholar] [CrossRef]
  41. Wang, Y.; Miller, M.; Astrakhan, Y.; Petersen, B.S.; Schreiber, S.; Franke, A.; Bromberg, Y. Identifying Crohn’s Disease Signal from Variome Analysis. Genome Med. 2019, 11, 59. [Google Scholar] [CrossRef]
  42. Bottigliengo, D.; Berchialla, P.; Lanera, C.; Azzolina, D.; Lorenzoni, G.; Martinato, M.; Giachino, D.; Baldi, I.; Gregori, D. The Role of Genetic Factors in Characterizing Extra-Intestinal Manifestations in Crohn’s Disease Patients: Are Bayesian Machine Learning Methods Improving Outcome Predictions? J. Clin. Med. 2019, 8, 865. [Google Scholar] [CrossRef]
  43. Daneshjou, R.; Wang, Y.; Bromberg, Y.; Bovo, S.; Martelli, P.L.; Babbi, G.; Di Lena, P.; Casadio, R.; Edwards, M.; Gifford, D.; et al. Working toward Precision Medicine: Predicting Phenotypes from Exomes in the Critical Assessment of Genome Interpretation (CAGI) Challenges. Hum. Mutat. 2017, 38, 1182–1192. [Google Scholar] [CrossRef] [PubMed]
  44. Pal, L.R.; Kundu, K.; Yin, Y.; Moult, J. CAGI4 Crohn’s Exome Challenge: Marker SNP versus Exome Variant Models for Assigning Risk of Crohn Disease. Hum. Mutat. 2017, 38, 1225–1234. [Google Scholar] [CrossRef]
  45. Cui, H.; Zhang, X. Alignment-Free Supervised Classification of Metagenomes by Recursive SVM. BMC Genom. 2013, 14, 641. [Google Scholar] [CrossRef] [PubMed]
  46. Kugathasan, S.; Baldassano, R.N.; Bradfield, J.P.; Sleiman, P.M.A.; Imielinski, M.; Guthery, S.L.; Cucchiara, S.; Kim, C.E.; Frackelton, E.C.; Annaiah, K.; et al. Loci on 20q13 and 21q22 Are Associated with Pediatric-Onset Inflammatory Bowel Disease. Nat. Genet. 2008, 40, 1211–1215. [Google Scholar] [CrossRef]
  47. Montero-Meléndez, T.; Llor, X.; García-Planella, E.; Perretti, M.; Suárez, A. Identification of Novel Predictor Classifiers for Inflammatory Bowel Disease by Gene Expression Profiling. PLoS ONE 2013, 8, e0076235. [Google Scholar] [CrossRef]
  48. Faizi, N.; Alvi, Y. Introduction to Biostatistics. In Biostatistics Manual for Health Research; Academic Press: Cambridge, MA, USA, 2023; pp. 1–16. [Google Scholar] [CrossRef]
  49. Han, J.; Kamber, M.; Pei, J. Data Preprocessing. In Data Mining; Morgan Kaufmann: Burlington, MA, USA, 2012; pp. 83–124. [Google Scholar] [CrossRef]
  50. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300. [Google Scholar] [CrossRef]
  51. Ross, B.C. Mutual Information between Discrete and Continuous Data Sets. PLoS ONE 2014, 9, e87357. [Google Scholar] [CrossRef]
  52. Guyon, I.; Weston, J.; Barnhill, S. Gene Selection for Cancer Classification Using Support Vector Machines. Mach. Learn. 2002, 46, 389–422. [Google Scholar] [CrossRef]
  53. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef]
  54. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar]
  55. Blagus, R.; Lusa, L. SMOTE for High-Dimensional Class-Imbalanced Data. BMC Bioinform. 2013, 14, 106. [Google Scholar] [CrossRef]
  56. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2. [Google Scholar]
  57. Sperandei, S. Understanding Logistic Regression Analysis. Biochem. Med. 2014, 24, 12–18. [Google Scholar] [CrossRef]
  58. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  59. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  60. Yang, F.J. An Implementation of Naive Bayes Classifier. In Proceedings of the 2018 International Conference on Computational Science and Computational Intelligence, CSCI 2018, Las Vegas, NV, USA, 12–14 December 2018; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2018; pp. 301–306. [Google Scholar]
  61. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD ’16: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar] [CrossRef]
  62. Taud, H.; Mas, J.F. Multilayer Perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; María Teresa, C.O., Martin, P., Jean-Francois, M., Francisco, E., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 451–455. ISBN 978-3-319-60801-3. [Google Scholar]
  63. Brint, E.K.; Xu, D.; Liu, H.; Dunne, A.; McKenzie, A.N.J.; O’Neill, L.A.J.; Liew, F.Y. ST2 Is an Inhibitor of Interleukin 1 Receptor and Toll-like Receptor 4 Signaling and Maintains Endotoxin Tolerance. Nat. Immunol. 2004, 5, 373–379. [Google Scholar] [CrossRef] [PubMed]
  64. Ding, J.; Liu, Y.; Lai, Y. Identifying MMP14 and COL12A1 as a Potential Combination of Prognostic Biomarkers in Pancreatic Ductal Adenocarcinoma Using Integrated Bioinformatics Analysis. PeerJ 2020, 8, e10419. [Google Scholar] [CrossRef] [PubMed]
  65. Fingleton, B. Matrix Metalloproteinases as Regulators of Inflammatory Processes. Biochim. Biophys. Acta Mol. Cell Res. 2017, 1864, 2036–2042. [Google Scholar] [CrossRef]
  66. O’Sullivan, S.; Gilmer, J.F.; Medina, C. Matrix Metalloproteinases in Inflammatory Bowel Disease: An Update. Mediat. Inflamm. 2015, 2015, 964131. [Google Scholar] [CrossRef]
  67. Marônek, M.; Marafini, I.; Gardlík, R.; Link, R.; Troncone, E.; Monteleone, G. Metalloproteinases in Inflammatory Bowel Diseases. J. Inflamm. Res. 2021, 14, 1029–1041. [Google Scholar] [CrossRef]
  68. Schellenberg, C.; Lagrange, J.; Ahmed, M.U.; Arnone, D.; Campoli, P.; Louis, H.; Touly, N.; Caron, B.; Plénat, F.; Perrin, J. The Role of Platelets and von Willebrand Factor in the Procoagulant Phenotype of Inflammatory Bowel Disease. J. Crohn’s Colitis 2023, 18, 751–761. [Google Scholar] [CrossRef]
  69. Abozied, A.; Ahmed, Y.; Saleh, M.F.M.; Galal, H.; Abbas, W. Assessment of Von Willebrand Factor Antigen and Activity Levels in Inflammatory Bowel Diseases. Egypt. J. Haematol. 2021, 46, 227. [Google Scholar] [CrossRef]
  70. Lagrange, J.; Lacolley, P.; Wahl, D.; Peyrin-Biroulet, L.; Regnault, V. Shedding Light on Hemostasis in Patients with Inflammatory Bowel Diseases. Clin. Gastroenterol. Hepatol. 2021, 19, 1088–1097.e6. [Google Scholar] [CrossRef] [PubMed]
  71. Yoshimura, S.I.; Gerondopoulos, A.; Linford, A.; Rigden, D.J.; Barr, F.A. Family-Wide Characterization of the DENN Domain Rab GDP-GTP Exchange Factors. J. Cell Biol. 2010, 191, 367–381. [Google Scholar] [CrossRef] [PubMed]
  72. Majidi, M.; Hubbs, A.E.; Lichy, J.H. Activation of Extracellular Signal-Regulated Kinase 2 by a Novel Abl-Binding Protein, ST5. J. Biol. Chem. 1998, 273, 16608–16614. [Google Scholar] [CrossRef] [PubMed]
  73. Morrison, D.K. MAP Kinase Pathways. Cold Spring Harb. Perspect. Biol. 2012, 4, a011254. [Google Scholar] [CrossRef]
  74. Tzeng, H.T.; Wang, Y.C. Rab-Mediated Vesicle Trafficking in Cancer. J. Biomed. Sci. 2016, 23, 70. [Google Scholar] [CrossRef]
  75. Ferreira, A.; Castanheira, P.; Escrevente, C.; Barral, D.C.; Barona, T. Membrane Trafficking Alterations in Breast Cancer Progression. Front. Cell Dev. Biol. 2024, 12, 1350097. [Google Scholar] [CrossRef]
  76. Ioannou, M.S.; McPherson, P.S. Regulation of Cancer Cell Behavior by the Small GTPase Rab13. J. Biol. Chem. 2016, 291, 9929–9937. [Google Scholar] [CrossRef]
  77. Ioannou, M.S.; Bell, E.S.; Girard, M.; Chaineau, M.; Hamlin, J.N.R.; Daubaras, M.; Monast, A.; Park, M.; Hodgson, L.; McPherson, P.S. DENND2B Activates Rab13 at the Leading Edge of Migrating Cells and Promotes Metastatic Behavior. J. Cell Biol. 2015, 208, 629–648. [Google Scholar] [CrossRef] [PubMed]
  78. Suárez, J.; Romero-Zerbo, Y.; Márquez, L.; Rivera, P.; Iglesias, M.; Bermúdez-Silva, F.J.; Andreu, M.; Rodríguez de Fonseca, F. Ulcerative Colitis Impairs the Acylethanolamide-Based Anti-Inflammatory System Reversal by 5-Aminosalicylic Acid and Glucocorticoids. PLoS ONE 2012, 7, e0037729. [Google Scholar] [CrossRef]
  79. Chen, Q.; Bei, S.; Zhang, Z.; Wang, X.; Zhu, Y. Identification of Diagnostic Biomarks and Immune Cell Infiltration in Ulcerative Colitis. Sci. Rep. 2023, 13, 6081. [Google Scholar] [CrossRef]
  80. Gorelik, A.; Gebai, A.; Illes, K.; Piomelli, D.; Nagar, B. Molecular Mechanism of Activation of the Immunoregulatory Amidase NAAA. Proc. Natl. Acad. Sci. USA 2018, 115, E10032–E10040. [Google Scholar] [CrossRef] [PubMed]
  81. Malamas, M.S.; Farah, S.I.; Lamani, M.; Pelekoudas, D.N.; Perry, N.T.; Rajarshi, G.; Miyabe, C.Y.; Chandrashekhar, H.; West, J.; Pavlopoulos, S.; et al. Design and Synthesis of Cyanamides as Potent and Selective N-Acylethanolamine Acid Amidase Inhibitors. Bioorg. Med. Chem. 2020, 28, 115195. [Google Scholar] [CrossRef]
  82. Piomelli, D.; Scalvini, L.; Fotio, Y.; Lodola, A.; Spadoni, G.; Tarzia, G.; Mor, M. N-Acylethanolamine Acid Amidase (NAAA): Structure, Function, and Inhibition. J. Med. Chem. 2020, 63, 7475–7490. [Google Scholar] [CrossRef]
  83. Dansie, L.E.; Reeves, S.; Miller, K.; Zano, S.P.; Frank, M.; Pate, C.; Wang, J.; Jackowski, S. Physiological Roles of the Pantothenate Kinases. Biochem. Soc. Trans. 2014, 42, 1033–1036. [Google Scholar] [CrossRef]
  84. Miallot, R.; Millet, V.; Galland, F.; Naquet, P. The Vitamin B5/Coenzyme A Axis: A Target for Immunomodulation? Eur. J. Immunol. 2023, 53, e2350435. [Google Scholar] [CrossRef] [PubMed]
  85. Zi, Y.; Gao, J.; Wang, C.; Guan, Y.; Li, L.; Ren, X.; Zhu, L.; Mu, Y.; Chen, S.H.; Zeng, Z.; et al. Pantothenate Kinase 1 Inhibits the Progression of Hepatocellular Carcinoma by Negatively Regulating Wnt/β-Catenin Signaling. Int. J. Biol. Sci. 2022, 18, 1539–1554. [Google Scholar] [CrossRef]
  86. Böhlig, L.; Friedrich, M.; Engeland, K. P53 Activates the PANK1/MiRNA-107 Gene Leading to Downregulation of CDK6 and P130 Cell Cycle Proteins. Nucleic Acids Res. 2011, 39, 440–453. [Google Scholar] [CrossRef]
  87. Zhang, Y.; Tang, M.; Guo, Q.; Xu, H.; Yang, Z.; Li, D. The Value of Erlotinib Related Target Molecules in Kidney Renal Cell Carcinoma via Bioinformatics Analysis. Gene 2022, 816, 146173. [Google Scholar] [CrossRef] [PubMed]
  88. Wang, B.; Liu, B.; Luo, Q.; Sun, D.; Li, H.; Zhang, J.; Jin, X.; Cheng, X.; Niu, J.; Yuan, Q.; et al. PANK1 Associates with Cancer Metabolism and Immune Infiltration in Clear Cell Renal Cell Carcinoma: A Retrospective Prognostic Study Based on the TCGA Database. Transl. Cancer Res. 2022, 11, 2321–2337. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (A) Framework for selecting and identifying potential DEGs from the GSE75214 gene expression dataset. (B) Framework used to screen for the supervised classification model that best differentiates IBD from healthy control samples. (C) The RF model built using the six DEG biomarkers, applied to independent cohorts.
Figure 2. Differential gene expression patterns between IBD and Normal samples of the GSE75214 cohort. (a) Heatmap of the upregulated genes in IBD versus Normal subjects. (b) Heatmap of the downregulated genes in IBD versus Normal subjects. The color scale ranges from dark blue (low expression) to dark red (high expression), showing the contrasting expression patterns of IBD and Normal subjects.
Figure 3. Analysis of DEGs between IBD and healthy controls from the GSE75214 cohort. (a) Volcano plot of the DEGs observed between IBD and normal individuals. The y-axis shows the negative base-10 logarithm of the p-value, and the x-axis shows the log2 fold change. Significant DEGs, defined by a p-value below 0.001 and a fold change exceeding the threshold of 1.06712, are highlighted. (b) Venn diagram showing the overlap of DEGs (upregulated and downregulated) between the two groups (IBD and Normal) of the GSE75214 cohort.
Figure 4. KDE subplots showing the expression distribution of six genes (VWF, IL1RL1, DENND2B, MMP14, NAAA, and PANK1) across two groups (IBD patients and Normal controls).
Figure 5. Comparison of Accuracy, F1-Score, and AUC Scores between ‘Six Gene Biomarkers’ and ‘Baseline (33,253 Genes)’ based ML models with SMOTE and without SMOTE. Error bars represent the standard deviation values for each performance evaluator.
Figure 6. Confusion matrix showing the performance of the optimized RF-based classification model.
Figure 7. Performance of the optimized RF model based on the six biomarkers on different IBD cohorts (GSE36807 and GSE10616).
Figure 8. Comparative evaluation of accuracy and AUC values between our model and related models [21,22,24,28,29,30,31,32,45].
Table 1. Comparison of various ML-based studies for selecting a subset of potential biomarkers to classify IBD patients from healthy control samples.
Study | Feature Selection Methods | Machine Learning Algorithm(s) | Modality | IBD Type | Performance Measures [Accuracy/AUC] | Outcome | Limitations
IBD (UC/CD) Related Studies
Stemmer, 2024 [20] | F-test in one-way analysis of variance (ANOVA F-test) value | K-nearest neighbor (KNN), Naïve Bayes, Extra Trees, and RF | Gene Expression Profiles | UC/CD | Best model AUC: 0.95 | Diagnosis of IBD | Limited validation cohort; further validation is required.
Tang, 2023 [21] | SVM-RFE and LASSO regression | SVM | Gene Expression Dataset | CD/UC | SVM accuracy: 0.8 | Diagnosis of IBD | Incomplete clinical information; limited validation cohort; further validation is required; need for more personalized studies.
Yu, 2023 [22] | XGBoost and UMAP | XGBoost | Expression data (microarray and RNA-seq) | UC/CD | XGBoost accuracy: 0.865 | Diagnosis of IBD | More features increase complexity and training time; no parameter tuning for the XGBoost algorithm.
Park, 2021 [23] | Sparse partial least-squares discriminant analysis | Partial Least-Squares Discriminant Analysis (PLS-DA) | RNA-seq analysis | UC/CD | Average error rate across classes: 0.155 | Classification of UC and CD | Not validated in an independent cohort.
Abbas, 2019 [24] | The RF feature importance score and NBBD scores | RF classifiers | Large pediatric IBD metagenomics dataset | Peds CD/UC | RF: Accuracy 0.73, AUC 0.82 |  | Lack of biological significance evaluation for identified interactions; no validation cohort.
Smolander, 2019 [25] | DBNs and SVM | DBNs and SVM | Gene Expression datasets | CD/UC | DBNs: UC accuracy 0.9706, CD accuracy 0.9703 | Diagnosis of IBD | Complexity of the models.
Biasci, 2019 [26] | Logistic Regression (LR) with an adaptive Elastic-Net penalty | qPCR classifier | Gene Expression Profiling | CD/UC | 17-gene qPCR classifier: high sensitivity (72.7% CD, 100% UC); high negative predictive value (90.9% CD, 100% UC); hazard ratios 2.65 CD, 3.12 UC | Diagnosis of IBD | Non-interventional, real-world design; unclear performance in patients on induction therapy; an interventional study is needed to confirm clinical utility.
Han, 2018 [27] | - | RF, LR, and conditionally responsive genes (CORG) | Gene Expression dataset | CD/UC | Gene-based feature sets had a validation AUC range of 0.6 to 0.76 | Diagnosis of IBD | Limited validation cohort; further validation is required.
Yuan, 2017 [28] | Minimum Redundancy and Maximum Relevance (mRMR); Incremental Feature Selection (IFS); Shortest-Path (SP) | Sequential minimal optimization | Gene Expression datasets | CD/UC | Highest accuracy: 0.9370 with 21 genes | Risk of IBD | Predictive accuracy of 20-gene biomarkers not estimated; the classification task was computationally expensive; the classification task had limited interpretability.
Isakov, 2017 [29] | Elastic net regularized generalized linear model | Combined model involving RF, SVM, gradient boosting, and elastic net | Expression data (microarray and RNA-seq) | CD/UC | Combined model: Accuracy 0.808, AUC 0.829 | Risk of IBD | Complex decision-making due to increased features.
Chen, 2017 [30] | Bayesian mixture approach | Polygenic score, elastic-net regularization, best linear genomic prediction, and a Bayesian mixture model | GWAS or Immunochip SNP data | CD/UC | CD AUC: 0.75, UC AUC: 0.70 | Risk of IBD | Broader variant inclusion did not improve risk prediction.
Hübenthal, 2015 [31] | Penalized SVM | RF | MicroRNAs | CD/UC | 17-gene classifier holdout AUC: 0.75 to 1.00 (including indeterminate); 0.89 to 0.98 (excluding indeterminate) | Diagnosis of IBD | Small sample sizes; further evaluation in larger, independent cohorts is needed; cohorts should have well-defined clinical characteristics.
Wei, 2013 [32] | Lasso penalization, relaxed significance cutoff (<10^−4) | SVM, gradient boosted tree | Genetics, Immunochip | CD/UC | UC AUC 0.83, CD AUC 0.86 | Risk of IBD | Increased feature screening complexity leads to complex decision-making; no SVM parameter tuning; no gradient-boosted tree parameter tuning.
Ulcerative Colitis (UC)-Related Studies
Qian, 2023 [33] | LASSO | Naïve Bayes, Logistic, IBk, Random Forest | Gene Expression datasets | UC | LR model AUC: Training 1.000, Validation 0.995 | Diagnosis of UC | No experimental validation; a larger sample size is required; need laboratory validation.
Bu, 2022 [34] | LASSO regression model and SVM-RFE | LASSO regression model and SVM-RFE algorithm | Gene Expression Profiles | UC | 4-gene model AUC: 0.977 | Diagnosis of UC | Limited data requires external validation; need prospective studies to evaluate biomarker utility.
Zhang, 2022 [35] | RF, SVM-RFE, Principal component analysis (PCA), Gradient Boosting Machine (GBM) and LASSO regression | RF, SVM, PCA, GBM and LASSO regression | Gene Expression Profiles | UC | SVM model AUC: 0.915 | Diagnosis of UC | Insufficient verification.
Khorasani, 2020 [36] | Dimension reduction through perturbation theory (DRPT) | SVM | Gene Expression dataset | UC | Active UC AUPRC: 1.0; Inactive UC AUPRC: 0.68 | Diagnosis of UC | Validation using blood gene expression profiles is required; restricted data and lack of diverse training/validation sets.
Li, 2020 [37] | - | RF, artificial neural network (ANN) | Gene Expression Profiles | UC | Best ANN model AUC: 0.9506 | Diagnosis of UC | The validation set GSE92415 is relatively small.
Duttagupta, 2012 [38] | Recursive SVM | Recursive SVM | MicroRNAs | UC | SVM classifier accuracy: 0.92 | Diagnosis of UC | RFE-SVM is computationally intensive; RFE was susceptible to overfitting in high-dimensional data.
Crohn’s Disease (CD)-Related Studies
Raimondi, 2020 [39] | - | Neural network | Whole exomes | CD | AUC: 0.74–0.83; AUPRC: 0.81–0.93 | Diagnosis of CD | Small sample size limitation.
Romagnoni, 2019 [40] | ResDN3 with Permutation feature importance (PFI), Lasso with weight as feature importance score, and LightGBM (LGBM) with gain | LR, gradient-boosted trees, neural network, and ensemble method | Genetics, Immunochip | CD | SNP-based model AUC: 0.80 | Risk of CD | Limited to a finite set of algorithms.
Wang, 2019 [41] | - | Analysis of Variation for Association with Disease (AVADx) and two Genome-wide association studies (GWAS)-based CD evaluation methods | Whole Exome or Genome Sequencing Data | CD | AVADx model: identified known, new CD genes; 16% CD detection, 99% precision at strict cutoff; 58% CD detection, 82% precision at default cutoff | Diagnosis of CD | Low overlap between test and training data.
Bottigliengo, 2019 [42] | Bayesian machine learning techniques (BMLTs) | Generalized additive model (GAM), LR, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), projection pursuit regression (PPR), ANN, and BART | Genetics | CD | PPR model with genetics AUC: 0.94 | Diagnosis of CD | Missing data degraded performance; simplistic Bayesian approach needs improvement.
Daneshjou, 2017 [43] | - | Naïve Bayes, neural networks, random forests | Exome Sequencing | CD | Top methods AUC: 0.87 | Diagnosis of CD | Small sample: 50 training, 53 testing.
Pal, 2017 [44] | - | Naïve Bayes | Genotypes from Exome Sequencing Data | CD | SNP model AUC for CD risk: 0.72 | Risk of CD | No validation cohort.
Cui and Zhang, 2013 [45] | Recursive SVM | Recursive SVM | 16S rRNA gene analysis | CD/UC | LOOCV accuracy 0.88 with RSVM model | Diagnosis of CD | More features increase complexity and training time; only one Recursive FS method was used for biomarker selection.
The symbol “-” indicates that data are unavailable.
Table 2. (a) The top 10 upregulated genes in IBD compared with healthy controls and (b) the top 10 downregulated genes in IBD compared with healthy controls.
(a)
Gene | log2 Fold Change | p-Value | q-Value Calc | Mean Fold Change (gene_IBD) | Mean Fold Change (Control) | Fold Change Ratio | Fold Change Threshold | Category
DENND2B | 0.114968 | 2.56 × 10^−23 | 2.84 × 10^−19 | 8.080239 | 7.461316 | 1.082951 | 1.06712 | Upregulated
LCN2 | 0.495031 | 9.78 × 10^−23 | 4.90 × 10^−19 | 11.17161 | 7.926777 | 1.409351 | 1.06712 | Upregulated
IFITM3 | 0.216846 | 2.20 × 10^−22 | 9.14 × 10^−19 | 8.908089 | 7.664914 | 1.16219 | 1.06712 | Upregulated
SLC6A14 | 0.924287 | 9.60 × 10^−21 | 2.46 × 10^−17 | 9.582492 | 5.049408 | 1.897746 | 1.06712 | Upregulated
BACE2 | 0.211402 | 1.49 × 10^−20 | 3.31 × 10^−17 | 10.00574 | 8.641933 | 1.157813 | 1.06712 | Upregulated
S100A11 | 0.170625 | 4.61 × 10^−20 | 9.01 × 10^−17 | 10.38891 | 9.230108 | 1.125546 | 1.06712 | Upregulated
PLS3 | 0.232287 | 2.92 × 10^−19 | 4.86 × 10^−16 | 7.948954 | 6.766822 | 1.174695 | 1.06712 | Upregulated
PARP8 | 0.191193 | 1.81 × 10^−18 | 2.51 × 10^−15 | 8.399721 | 7.357159 | 1.141707 | 1.06712 | Upregulated
NFKBIZ | 0.210711 | 4.67 × 10^−18 | 5.82 × 10^−15 | 9.307241 | 8.042494 | 1.157258 | 1.06712 | Upregulated
DUOX2 | 0.654203 | 4.72 × 10^−18 | 5.82 × 10^−15 | 11.23865 | 7.141333 | 1.573747 | 1.06712 | Upregulated
(b)
Gene | log2 Fold Change | p-Value | q-Value Calc | Mean Fold Change (gene_IBD) | Mean Fold Change (Control) | Fold Change Ratio | Fold Change Threshold | Category
CNTFR | −0.34736 | 1.91 × 10^−27 | 6.35 × 10^−23 | 7.287459 | 9.271327 | 0.786021 | 1.06712 | Downregulated
RNY1P5 | −0.53567 | 2.83 × 10^−24 | 4.71 × 10^−20 | 4.83799 | 7.013241 | 0.689837 | 1.06712 | Downregulated
STBD1 | −0.25467 | 3.97 × 10^−23 | 3.30 × 10^−19 | 6.584684 | 7.855963 | 0.838177 | 1.06712 | Downregulated
HINFP | −0.11007 | 9.25 × 10^−23 | 4.90 × 10^−19 | 7.734588 | 8.34778 | 0.926544 | 1.06712 | Downregulated
PAQR5 | −0.34279 | 1.03 × 10^−22 | 4.90 × 10^−19 | 7.309534 | 9.270027 | 0.788513 | 1.06712 | Downregulated
CNTN4 | −0.28721 | 2.49 × 10^−22 | 9.18 × 10^−19 | 5.392263 | 6.580045 | 0.819487 | 1.06712 | Downregulated
RNF135 | −0.09782 | 7.81 × 10^−22 | 2.60 × 10^−18 | 8.14145 | 8.712626 | 0.934443 | 1.06712 | Downregulated
FRMD1 | −0.24535 | 1.16 × 10^−21 | 3.51 × 10^−18 | 6.332201 | 7.506043 | 0.843614 | 1.06712 | Downregulated
SFXN1 | −0.12522 | 4.85 × 10^−21 | 1.34 × 10^−17 | 8.743877 | 9.536722 | 0.916864 | 1.06712 | Downregulated
SLC38A4 | −0.39528 | 1.35 × 10^−20 | 3.21 × 10^−17 | 4.856832 | 6.387714 | 0.76034 | 1.06712 | Downregulated
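The derived columns of Table 2 can be checked from the two group means. The following is a minimal Python sketch, not the authors' code. It assumes that the "Fold Change Ratio" is the IBD mean divided by the control mean and that the log2 column is the base-2 logarithm of that ratio; both assumptions agree with the tabulated values. The function name `classify_gene` and the symmetric downregulation cutoff (ratio below the reciprocal of 1.06712) are hypothetical choices made for illustration; the table itself only lists the threshold.

```python
import math

FOLD_CHANGE_THRESHOLD = 1.06712  # threshold listed in Table 2


def classify_gene(mean_ibd, mean_control, threshold=FOLD_CHANGE_THRESHOLD):
    """Return (ratio, log2 ratio, category) for one gene from its group means."""
    ratio = mean_ibd / mean_control
    log2_fc = math.log2(ratio)
    if ratio > threshold:
        category = "Upregulated"
    elif ratio < 1.0 / threshold:  # assumed symmetric cutoff, not stated in the source
        category = "Downregulated"
    else:
        category = "Unchanged"
    return ratio, log2_fc, category


# Group means copied from the DENND2B and CNTFR rows of Table 2.
for gene, ibd, ctrl in [("DENND2B", 8.080239, 7.461316),
                        ("CNTFR", 7.287459, 9.271327)]:
    ratio, log2_fc, category = classify_gene(ibd, ctrl)
    print(f"{gene}: ratio={ratio:.6f}, log2={log2_fc:.6f}, {category}")
```

Under these assumptions every row of Table 2 falls on the expected side of the cutoff, including the genes with the smallest changes (HINFP and RNF135).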
Table 3. Upregulated and downregulated features selected using six feature selection (FS) algorithms.
FS Methods | Type of FS Method | Selected Feature(s): Upregulated | Selected Feature(s): Downregulated
Mutual Information Score | Filter | 7960464 (VWF) | 8101086 (NAAA)
RFECV | Wrapper | 7946401 (DENND2B) | 8101086 (NAAA)
Elastic Net | Embedded | 8044021 (IL1RL1) | 7934945 (PANK1)
Gradient Boosting Classifier | Embedded | 7973336 (MMP14) | 8101086 (NAAA)
Table 4. The master subset of gene biomarkers for classifying IBD from non-IBD samples.
Selected Gene Biomarkers
7960464 (VWF), 7946401 (DENND2B), 8044021 (IL1RL1), 7973336 (MMP14), 7934945 (PANK1), 8101086 (NAAA)
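The six entries of Table 4 are what one obtains by pooling the per-method selections of Table 3 and dropping repeats. The sketch below illustrates this in Python under the assumption that the master subset is a plain set union; the aggregation rule is inferred from the two tables and is not stated explicitly, and the dictionary layout is illustrative only.

```python
# Probe IDs and gene symbols copied from Table 3; the dict layout is illustrative.
selected = {
    "Mutual Information Score": {"7960464 (VWF)", "8101086 (NAAA)"},
    "RFECV": {"7946401 (DENND2B)", "8101086 (NAAA)"},
    "Elastic Net": {"8044021 (IL1RL1)", "7934945 (PANK1)"},
    "Gradient Boosting Classifier": {"7973336 (MMP14)", "8101086 (NAAA)"},
}

# The union across methods keeps the NAAA probe, picked by three methods, only once.
master_subset = sorted(set().union(*selected.values()))
print(len(master_subset), master_subset)
```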
Table 5. Results of an unpaired t-test assessing the ability of the potential gene biomarkers to separate the two classes.
Potential Gene Biomarkers | IBD Mean | IBD Std | Normal Mean | Normal Std | t-Statistic | p-Value
VWF | 7.96 | 0.882 | 6.528 | 0.4260 | 7.522 | 2.02 × 10^−12
IL1RL1 | 5.59 | 0.541 | 5.095 | 0.276 | 4.223 | 3.72 × 10^−5
DENND2B | 8.08 | 0.241 | 7.461 | 0.232 | 11.397 | 2.56 × 10^−23
MMP14 | 8.83 | 0.525 | 8.048 | 0.250 | 6.961 | 5.21 × 10^−11
NAAA | 9.09 | 0.686 | 10.384 | 0.241 | −8.718 | 1.32 × 10^−15
PANK1 | 7.95 | 0.485 | 8.563 | 0.253 | −5.758 | 3.32 × 10^−8
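The t-statistics in Table 5 can be approximately recovered from the tabulated means and standard deviations. The following Python sketch assumes the equal-variance (pooled) form of the unpaired t-test and the cohort sizes given in the abstract (172 IBD patients, 22 healthy individuals); the source does not state which variant of the test was used, and the helper name `pooled_t_statistic` is hypothetical. Because the table values are rounded, small differences from the reported statistics are expected.

```python
import math


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Unpaired t statistic (equal-variance form) computed from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err


N_IBD, N_NORMAL = 172, 22  # cohort sizes stated in the abstract

# Means and standard deviations copied from the DENND2B and NAAA rows of Table 5.
t_dennd2b = pooled_t_statistic(8.08, 0.241, N_IBD, 7.461, 0.232, N_NORMAL)
t_naaa = pooled_t_statistic(9.09, 0.686, N_IBD, 10.384, 0.241, N_NORMAL)
print(f"DENND2B t = {t_dennd2b:.2f}, NAAA t = {t_naaa:.2f}")
```

The sign of the statistic follows the direction of regulation: positive for genes with higher mean expression in IBD and negative for NAAA and PANK1.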
Table 6. Performance of seven classification models using the five most informative features compared with baseline models using all features of the test/validation E-GEOD-36807 dataset.
ML Algorithms | Accuracy (SMOTE) | Accuracy (No SMOTE) | F1 Score (SMOTE) | F1 Score (No SMOTE) | AUC (SMOTE) | AUC (No SMOTE) | Baseline Accuracy (No SMOTE) | Baseline F1 Score (No SMOTE) | Baseline AUC (No SMOTE)
GNB | 0.9680 ± 0.0061 | 0.94331 ± 0.01912 | 0.9674 ± 0.0055 | 0.96732 ± 0.01126 | 0.9680 ± 0.0061 | 0.9238 ± 0.0585 | 0.8713 ± 0.0666 | 0.5930 ± 0.1811 | 0.83899 ± 0.1299
RF | 0.9767 ± 0.0148 | 0.9532 ± 0.0454 | 0.97628 ± 0.0150 | 0.9740 ± 0.0250 | 0.9767 ± 0.0148 | 0.86840 ± 0.1460 | 0.9277 ± 0.0305 | 0.5024 ± 0.3075 | 0.7021 ± 0.1356
LR | 0.9667 ± 0.0150 | 0.9584 ± 0.0357 | 0.9663 ± 0.0152 | 0.96706 ± 0.01934 | 0.9768 ± 0.0149 | 0.87134 ± 0.1405 | 0.9432 ± 0.0302 | 0.6458 ± 0.2335 | 0.7771 ± 0.1471
MLP | 0.9360 ± 0.0235 | 0.9380 ± 0.0354 | 0.9324 ± 0.0264 | 0.96626 ± 0.0185 | 0.9362 ± 0.0230 | 0.7693 ± 0.1863 | 0.578947 ± 0.3787 | 0.082663 ± 0.101 | 0.5 ± 0.0
SVC | 0.9738 ± 0.01086 | 0.9584 ± 0.0357 | 0.9731 ± 0.0112 | 0.97706 ± 0.0193 | 0.9739 ± 0.0106 | 0.87134 ± 0.1405 | 0.8866 ± 0.0121 | 0.0 ± 0.0 | 0.5 ± 0.0
DT | 0.9622 ± 0.0217 | 0.9275 ± 0.0351 | 0.9615 ± 0.0220 | 0.9599 ± 0.0191 | 0.9623 ± 0.0218 | 0.7705 ± 0.1265 | 0.8659 ± 0.0679 | 0.450 ± 0.2082 | 0.6846 ± 0.1205
XGB | 0.92766 ± 0.0305 | 0.9277 ± 0.0305 | 0.9596 ± 0.0168 | 0.9596 ± 0.0168 | 0.8026 ± 0.1122 | 0.8026 ± 0.1122 | 0.9381 ± 0.0208 | 0.6554 ± 0.1648 | 0.7813 ± 0.0968
Table 7. Gene ontology and pathway enrichment analysis of the four upregulated DEGs.
Sl. No. | Term/Pathway | p-Value | Genes
Gene Ontology
1 | GO:0050867; positive regulation of cell activation | 0.0013070 | IL1RL1 and MMP14
2 | GO:0002694; regulation of leukocyte activation | 0.0031340 | IL1RL1 and MMP14
3 | GO:0043062; extracellular structure organization | 0.0022000 | VWF and MMP14
4 | GO:0031589; cell-substrate adhesion | 0.0015271 | VWF and MMP14
5 | GO:0032634; interleukin-5 production | 0.0045196 | IL1RL1
KEGG Pathway
1 | hsa04610; Complement and coagulation cascades | 0.032090 | VWF
2 | hsa04512; ECM-receptor interaction | 0.033294 | VWF
3 | hsa04912; GnRH signaling pathway | 0.037303 | MMP14
4 | hsa04928; Parathyroid hormone synthesis, secretion and action | 0.043109 | MMP14
5 | hsa04668; TNF signaling pathway | 0.044093 | MMP14
6 | hsa04611; Platelet activation | 0.049661 | VWF
REACTOME Pathway
1 | R-HSA-430116; GP1b-IX-V activation signaling | 0.0034394 | VWF
2 | R-HSA-140837; Intrinsic Pathway of Fibrin Clot Formation | 0.0062995 | VWF
3 | R-HSA-372708; p130Cas linkage to MAPK signaling for integrins | 0.0042980 | VWF, DENND2B
4 | R-HSA-9006921; Integrin signaling | 0.0077275 | VWF
5 | R-HSA-75892; Platelet Adhesion to Exposed Collagen | 0.0042980 | VWF
6 | R-HSA-354194; GRB2: SOS provides linkage to MAPK signaling for Integrins | 0.0042980 | VWF, DENND2B
7 | R-HSA-1474244; Extracellular matrix organization | 0.0023830 | MMP14 and VWF
8 | R-HSA-1592389; Activation of Matrix Metalloproteinases | 0.0094393 | MMP14
9 | R-HSA-6802948; Signaling by high-kinase activity BRAF mutants | 0.010294 | VWF
Table 8. Gene ontology and pathway enrichment analysis of the two downregulated DEGs.
Sl. No. | Term/Pathway | p-Value | Genes
Gene Ontology
1 | GO:0033865; nucleoside bisphosphate metabolic process | 0.018431 | PANK1
2 | GO:0072522; purine-containing compound biosynthetic process | 0.036017 | PANK1
3 | GO:0046390; ribose phosphate biosynthetic process | 0.034400 | PANK1
4 | GO:0006836; neurotransmitter transport | 0.036286 | NAAA
5 | GO:0051188; cofactor biosynthetic process | 0.037767 | PANK1
6 | GO:1901293; nucleoside phosphate biosynthetic process | 0.046225 | PANK1
7 | GO:0006732; coenzyme metabolic process | 0.047564 | PANK1
8 | GO:0001505; regulation of neurotransmitter levels | 0.045287 | NAAA
KEGG Pathway
1 | hsa00770; Pantothenate and CoA biosynthesis | 0.0026002 | PANK1
REACTOME Pathway
1 | R-HSA-199220; Vitamin B5 (pantothenate) metabolism | 0.0032492 | PANK1
2 | R-HSA-196783; Coenzyme A biosynthesis | 0.0015297 | PANK1
3 | R-HSA-112315; Transmission across Chemical Synapses | 0.042764 | NAAA
4 | R-HSA-112310; Neurotransmitter release cycle | 0.0097318 | NAAA
5 | R-HSA-196854; Metabolism of vitamins and cofactors | 0.035826 | PANK1
6 | R-HSA-196849; Metabolism of water-soluble vitamins and cofactors | 0.023390 | PANK1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Syed, A.H.; Abujabal, H.A.S.; Ahmad, S.; Malebary, S.J.; Alromema, N. Advances in Inflammatory Bowel Disease Diagnostics: Machine Learning and Genomic Profiling Reveal Key Biomarkers for Early Detection. Diagnostics 2024, 14, 1182. https://doi.org/10.3390/diagnostics14111182