Article

Hybrid Predictive Machine Learning Model for the Prediction of Immunodominant Peptides of Respiratory Syncytial Virus

by Syed Nisar Hussain Bukhari 1,* and Kingsley A. Ogudo 2
1 National Institute of Electronics and Information Technology (NIELIT), Ministry of Electronics and Information Technology (MeitY), Government of India, Srinagar 191132, India
2 Department of Electrical & Electronics Engineering, Faculty of Engineering and the Built Environment, University of Johannesburg, Johannesburg 0524, South Africa
* Author to whom correspondence should be addressed.
Bioengineering 2024, 11(8), 791; https://doi.org/10.3390/bioengineering11080791
Submission received: 30 June 2024 / Revised: 26 July 2024 / Accepted: 2 August 2024 / Published: 5 August 2024
(This article belongs to the Special Issue Machine Learning Technology in Predictive Healthcare)

Abstract

Respiratory syncytial virus (RSV) is a common respiratory pathogen that infects the human lungs and respiratory tract, often causing symptoms similar to those of the common cold. Vaccination is the most effective strategy for managing viral outbreaks. Currently, extensive efforts are focused on developing a vaccine for RSV. Traditional vaccine design typically involves using an attenuated form of the pathogen to elicit an immune response. In contrast, peptide-based vaccines (PBVs) aim to identify and chemically synthesize specific immunodominant peptides (IPs), known as T-cell epitopes (TCEs), to induce a targeted immune response. Despite their potential for enhancing vaccine safety and immunogenicity, PBVs have received comparatively less attention. Identifying IPs for PBV design through conventional wet-lab experiments is challenging, costly, and time-consuming. Machine learning (ML) techniques offer a promising alternative, accurately predicting TCEs and significantly reducing the time and cost of vaccine development. This study develops and evaluates eight hybrid ML predictive models, formed from every combination of two classification methods, two feature weighting techniques, and two feature selection algorithms, for predicting the TCEs of RSV. The models were trained using the experimentally determined TCE and non-TCE sequences acquired from the Bacterial and Viral Bioinformatics Resource Center (BV-BRC) repository. The hybrid model composed of the XGBoost (XGB) classifier, the chi-squared (ChST) weighting technique, and backward search (BST) as the feature selection algorithm (ChST-BST-XGB) was identified as the best model, achieving an accuracy, sensitivity, specificity, F1 score, AUC, precision, and MCC of 97.10%, 0.98, 0.97, 0.98, 0.99, 0.99, and 0.96, respectively. Additionally, K-fold cross-validation (KFCV) was performed to check the model’s reliability, and an average accuracy of 97.21% was recorded for the ChST-BST-XGB model. The results indicate that the hybrid XGBoost model consistently outperforms the other hybrid approaches. The epitopes predicted by the proposed model may serve as promising vaccine candidates for RSV, subject to in vitro and in vivo scientific assessments. This model can assist the scientific community in expediting the screening of active TCE candidates for RSV, ultimately saving time and resources in vaccine development.

1. Introduction

Respiratory syncytial virus (RSV) is a prominent cause of lower respiratory tract illness in both young children and the elderly [1]. RSV was first identified in 1955, when it was isolated from chimpanzees displaying respiratory symptoms at the Walter Reed Army Institute of Research in the United States [2]. Over subsequent years, the virus was also discovered in infants who suffered from severe lower respiratory illnesses [3,4]. Since that time, RSV has become known as a widespread pathogen, affecting almost every child by the age of two, with around half of them experiencing two infections within this period [5]. The primary modes of transmission involve respiratory droplets released during coughs or sneezes, as well as direct contact with contaminated surfaces. Infants, young children, and older adults, particularly those with chronic medical conditions, face an elevated risk of severe illness due to RSV infection [6,7]. People infected with RSV are usually contagious for a period ranging from 3 to 8 days, with the potential to spread the virus a day or two before showing any symptoms [8]. However, some infants and individuals with weakened immune systems can remain contagious even after their symptoms have resolved, occasionally for as long as four weeks [8]. The typical symptoms of an RSV infection include a runny nose, decreased appetite, coughing, sneezing, fever, and wheezing, which tend to develop gradually rather than all at once [9]. RSV is the most common cause of bronchiolitis and pneumonia in children under one year old [7]. The CDC estimates that RSV causes approximately 58,000 to 80,000 hospitalizations and 100 to 300 deaths among children under five annually. Additionally, it results in 60,000 to 160,000 hospitalizations and 6000 to 10,000 deaths each year among adults aged 65 and older [7].
RSV, which circulates during the winter months alongside influenza (flu) and other respiratory viruses, is frequently misdiagnosed due to its similar symptoms. Like the flu, its prevalence peaks between November and May [10].
RSV is categorized as a filamentous enveloped virus and is part of the Orthopneumovirus genus within the Pneumoviridae family, under the order Mononegavirales [1]. This virus is characterized by its genetic structure, which consists of a single-stranded RNA genome with a negative sense. This genome includes 11 proteins encoded by a 15.2-kilobase (kb) RSV genome. Unlike influenza, RSV possesses a non-segmented genome, which means it lacks the capacity to re-assort genome segments. As a result, RSV cannot undergo the genetic rearrangements known as antigenic shifts, which can lead to major pandemics [11]. RSV particles come in various shapes, including both spherical and filamentous forms of different sizes [12]. These virions have three surface proteins: F, G, and SH (small hydrophobic), as shown in Figure 1 [13]. The G and F proteins are crucial for virion attachment and fusion, binding to specific carbohydrate structures known as GAGs and RhoA, respectively [14,15]. Once fusion takes place, the virion releases its nucleocapsid into the cytosol, permitting the RNA to enter the host cell. The M2 mRNA contains two overlapping open reading frames (ORFs) that code for M2-1 and M2-2. The M2-2 gene regulates the shift from transcription to genomic RNA production [16]. The large (L) protein functions as a viral RNA-dependent RNA polymerase, encompassing multiple enzyme activities essential for RSV replication. This protein enters the genome, facilitating mRNA transcription. During replication, a complete positive-sense RNA complement of the genome called the antigenome is produced and serves as a template for further replication. Throughout this process, the N protein encapsulates the RNA, protecting it from degradation. The M protein plays a crucial role in coordinating the assembly of envelope proteins with nucleocapsid proteins (N, P, and M2-1). It also aids in the budding of new immature virions, a process that uses the host cell membrane. 
In filamentous virions, a helical arrangement of M (matrix) proteins is present, which is critical for forming infectious filamentous particles [17].
As shown in Figure 1, the RNA genome of RSV includes 10 genes that encode a total of 11 proteins. These proteins include two nonstructural proteins (NS1 and NS2). Additionally, there are four envelope proteins: the attachment glycoprotein (G), the fusion protein (F), the matrix protein (M), and the small hydrophobic protein (SH). Moreover, there are five ribonucleocapsid proteins: the nucleoprotein (N), phosphoprotein (P), large RNA polymerase (L), M2-1 (a transcription antiterminator that binds zinc), and M2-2 (a regulatory factor involved in balancing RNA replication and transcription) [18]. In vaccine development, particular focus is given to the F protein. This is due to its presence on the outer envelope of the RSV virion and its high conservation across different RSV strains, making it a promising target for vaccine development. The F protein exists in two forms, prefusion and postfusion, with the prefusion form being less stable but more immunodominant compared to the postfusion form [18].
To prevent disease outbreaks effectively, there is an urgent need to develop a safe and effective RSV vaccine that can induce immunological memory without causing immune-related complications following natural RSV infections [19]. Current vaccine development efforts have emphasized whole-organism vaccines, including live attenuated and inactivated types. However, these vaccines can be costly to produce, require the cultivation of the infectious agent, and may cause vaccine-related illnesses in recipients [20]. Additionally, they may not be suitable for individuals with compromised immune systems and require precise temperature control for storage [21]. As a result, there has been a shift towards developing peptide-based vaccines (PBVs). PBVs involve identifying and chemically synthesizing immunodominant peptides, known as T-cell epitopes (TCEs), which can elicit specific immune responses against the pathogen [22]. The design of PBVs focuses on removing unnecessary antigenic components, concentrating only on protein sections capable of triggering an immune response [23,24]. PBVs offer several advantages over traditional vaccines, including fewer side effects, simpler manufacturing processes, the absence of whole pathogen elements, increased specificity, greater stability, sustainability, and shorter production timelines [25]. Despite these significant advantages, PBVs have received less attention, and their potential to enhance vaccine safety and immunogenicity remains largely unexplored [26,27].
T cells play a vital role in adaptive immunity, as they aid in various immune system functions and significantly contribute to the control, clearance, and protection against most viral infections [28]. Notably, CD8+ T cells are pivotal in the context of RSV pathogenesis, and it has been suggested that RSV vaccines capable of inducing both antibodies and CD8+ T cells may prove effective [29]. The consideration of vaccines that trigger CD8+ T-cell responses against both cancer and viruses is a promising avenue in vaccine design [30,31], underscoring the idea that T cells are well equipped to address evolving viral variants [32,33]. Identifying these immunodominant TCEs for PBV design through wet-lab experiments is challenging, costly, and time-consuming. However, the application of machine learning (ML) techniques can enable the accurate prediction of these epitopes, expediting vaccine development and making it more cost-effective compared to traditional wet-lab methods [34]. This study presents a novel method for predicting the TCEs of RSV using a hybrid ML technique that leverages the physicochemical properties of peptides. The identified epitopes could be utilized as candidates in the development of PBVs against the RSV pathogen. The proposed model aims to aid the scientific community in identifying new and immunodominant TCEs specific to RSV.

Contributions

This study makes several significant contributions. Firstly, it involved the development and testing of eight hybrid ML predictive models created through various permutations and combinations of two classification techniques, two feature weighting methods, and two feature selection strategies, all aimed at predicting the TCEs of RSV. Secondly, an innovative feature extraction technique was introduced, capable of extracting the physicochemical properties of peptides at the amino acid level. Thirdly, the study employed heuristic and greedy search techniques to identify optimal features for model training after extracting features from peptide sequences. Fourthly, the research primarily focused on achieving high accuracy in TCE prediction, and the proposed hybrid techniques demonstrated promising results in terms of accuracy. These models were thoroughly evaluated using multiple parameters, including area under the curve (AUC), sensitivity, specificity, Gini, F-score, and MCC. The findings indicate that the combination of XGBoost with chi-squared and backward search is the most accurate and reliable predictive method for TCE prediction in the context of RSV. Finally, K-fold cross-validation (KFCV) was performed, demonstrating that the proposed model is reliable and consistent for TCE predictions across all folds.

2. Related Work

Considerable research has been conducted to identify the TCEs of RSV for the design of PBVs. Chen et al. predicted T-cell epitopes in RSV F and G proteins, finding three RSV-A and two RSV-B clusters, indicating diverse immunogenic profiles. Recent epidemic strains conserved more F protein epitopes but reduced G protein epitopes. This study offers a framework for studying RSV T-cell epitope evolution, crucial for vaccine design [35]. The study [36] aimed to identify RSV-specific T-cell epitopes in BALB/c mice. Novel CD8 T-cell epitopes in the F and G proteins and previously unknown CD4 T-cell epitopes in P, L, M2-1, and N proteins were discovered. Longer 17-mer CD4-T-cell epitopes proved more effective in stimulating CD4-T-cell responses compared to 15-mer peptides. This work addresses the lack of defined RSV-specific T-cell epitopes, enhancing our understanding of RSV-induced disease. Another study [37] focused on designing a potential vaccine for RSV. Using reverse vaccinology, researchers predicted 95 cytotoxic T-lymphocyte (CTL) epitopes from the RSV proteome. After extensive screening for antigenicity, allergenicity, and toxicity, 70 epitopes with desirable properties were selected. Molecular docking identified stable binding in four epitopes, validating their potential as T-cell-specific RSV antigens. This approach provides an efficient method for screening immunogenic epitopes, offering promise for vaccine development against RSV. In [38], the authors aimed to identify CD4+ and CD8+ T-cell epitopes in C57BL/6 mice infected with RSV. Using an overlapping peptide library encompassing the RSV proteome, researchers discovered two new CD4+ and three new CD8+ T-cell epitopes within various RSV proteins. Additionally, they characterized these newly identified epitopes, including their TCR Vb expression profiles and MHC restriction. These findings will advance future research on RSV-specific T-cell responses in C57BL/6 mice. Shah et al. 
[39] focused on the potential use of epitope-based vaccines against RSV, which poses a significant threat to infants and the elderly. The study specifically targeted the fusion glycoprotein of RSV (RSV-FP) due to its conservation across strains and its ability to elicit cytotoxic T-cell (CTL) responses, crucial for viral clearance. Using immunoinformatics tools, the researchers identified seven 9-mer peptides within RSV-FP that strongly bind to 17 different HLA types, exhibit 100% sequence conservancy, and are estimated to provide a 76.03% population coverage worldwide. These findings hold promise for the development of effective RSV epitope-based vaccines. In this immunoinformatics study [40], researchers aimed to design a multi-epitope vaccine against RSV. They identified eight CD8-T-cell and three CD4-T-cell epitopes from glycoproteins F and G, considering antigenicity and binding affinity. Molecular docking confirmed strong associations with HLA alleles. Using these epitopes, a stable, non-allergenic, and antigenic multi-epitope vaccine with a cholera toxin-derived adjuvant was designed. Computational simulations indicated the vaccine’s potential to generate antibodies and effector T cells. Codon optimization and in silico cloning ensured enhanced expression in Escherichia coli. Further experimental validation is expected to confirm the vaccine’s effectiveness against RSV infections. A study [41] aimed to examine the role of vaccine-induced CD8+ T cells in protecting against RSV. Using a peptide vaccine (TriVax) in mice, researchers discovered that it induced strong anti-RSV CD8+ cytotoxic T lymphocytes. These vaccinated mice were protected against RSV infection, airway mucin expression, and lung inflammation when challenged six days post-vaccination. 
While effector CD8+ T cells exhibited strong cytokine expression and provided protection, memory CD8+ T cells, elicited 42 days post-vaccination, offered partial protection with lower cytokine expression, suggesting a link between protection and CD8+ T cell cytokine levels. Another study [42] aimed to develop a vaccine against RSV that induces long-lasting immunological memory without causing immunopathology. Researchers used live attenuated influenza vaccine (LAIV) viruses with RSV epitopes integrated into the neuraminidase or NS1 genes. These chimeric vaccines protected against both influenza and RSV without causing harmful effects. The study focused on CD4- and CD8-T-cell responses, particularly lung tissue-resident memory T-cell subsets (TRM). The RSV epitopes did not impact influenza-specific CD4 memory T cells, and both LAIV+NA/RSV and LAIV+NS/RSV vaccines induced strong RSV-specific CD8 TRM cells in the lungs. This research indicates that LAIV-based vaccines can generate robust localized T-cell immunity against foreign pathogens without compromising the vaccine’s immunogenicity. The authors of [43] reviewed computational tools for predicting T-cell epitopes, with a particular focus on neoepitopes relevant to cancer immunotherapy. They assessed various tools based on their methodologies, data utilization, and comparative advantages and disadvantages. The authors of [44] investigated the impact of antigen processing on epitope immunogenicity. They developed an ML model to predict proteasomal degradation scores for peptides and experimentally tested peptides with varying scores. Their findings suggest a correlation between low degradation scores and enhanced T-cell activation, highlighting the potential for improving vaccine efficacy by optimizing antigen processing. The study [45] addressed the challenge of epitope prediction for malaria due to the unique biology and evolving sequences of the parasite. 
The authors proposed an ML approach to develop a Plasmodium-specific epitope predictor. They built models using various ML algorithms trained on epitope data with sequence features and physicochemical properties. Their analysis suggests a model trained with specific classifiers after preprocessing outperforms others. This research represents the first in silico attempt to benchmark Plasmodium epitopes using ML and paves the way for peptide-based predictors in malaria vaccine development. The study [46] reviewed various in silico methods for predicting SARS-CoV-2 T-cell epitopes, highlighting the importance of T-cell responses in COVID-19. The authors compared various ML-based approaches by evaluating their ability to identify experimentally validated immunogenic epitopes. This review provides insights into the performance of different prediction methods and suggests future research directions.

3. Materials and Methods

This section describes the proposed hybrid approach for predicting the IPs of RSV, as outlined in Figure 2, through the following sub-sections.

3.1. Retrieval of Peptide Sequences

The TCE and non-TCE (NTCE) peptide sequences were retrieved as two CSV files, one containing TCE sequences and one containing NTCE sequences, from the publicly available Bacterial and Viral Bioinformatics Resource Center (BV-BRC) repository [47,48]. To support binary classification, a target variable called “Class” was added, taking the value 1 for TCE sequences and 0 for NTCE sequences.
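The labeling step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the column name "Sequence", the helper `label_and_merge`, and the peptide strings are all hypothetical placeholders, and the two CSV files are simulated in memory.

```python
import csv
import io

def label_and_merge(tce_rows, ntce_rows):
    """Attach the binary target "Class" (1 = TCE, 0 = NTCE) and merge the rows."""
    labeled = [dict(row, Class=1) for row in tce_rows]
    labeled += [dict(row, Class=0) for row in ntce_rows]
    return labeled

# In-memory stand-ins for the two downloaded CSV files.
# The sequences are made-up placeholders, not RSV epitopes.
tce_csv = "Sequence\nKLIHLTNAL\nFLDGVYTEA\n"
ntce_csv = "Sequence\nAAAAAAAAA\n"

tce_rows = list(csv.DictReader(io.StringIO(tce_csv)))
ntce_rows = list(csv.DictReader(io.StringIO(ntce_csv)))
dataset = label_and_merge(tce_rows, ntce_rows)
```

Keeping the label in the same table as the sequence means every later step (feature extraction, selection, training) can carry "Class" along without a separate lookup.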

3.2. Feature Extraction

The distinct characteristics of each peptide sequence are determined by the specific arrangement of amino acids and their related physicochemical properties, which are central to this study. To extract these properties from the peptide sequences outlined in Section 3.1, we carried out feature extraction (FE) using the peptides [49] and peptider [50] packages within the R programming environment [51]. Before performing FE, the duplicate peptide sequences were removed. The FE process produced a high-dimensional dataset formatted as a CSV file containing 108 features for each sequence. The details of the physicochemical properties used in this analysis are listed in Table 1, and Table 2 presents a snapshot of the dataset after FE.
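To make the idea of per-sequence physicochemical features concrete, the sketch below removes duplicate peptides and computes two illustrative columns. It is not the 108-feature extraction performed with the R packages: the choice of mean Kyte-Doolittle hydropathy as an example property, the function names, and the toy sequences are assumptions made for illustration only.

```python
# Kyte-Doolittle hydropathy values (a standard published scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def deduplicate(sequences):
    """Drop repeated sequences while keeping first-seen order."""
    return list(dict.fromkeys(sequences))

def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle value over the residues of one peptide."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)

def extract_features(sequences):
    """Return one feature row per unique peptide (illustrative columns only)."""
    return [
        {"sequence": s, "length": len(s), "hydrophobicity": mean_hydrophobicity(s)}
        for s in deduplicate(sequences)
    ]

# Toy input with one duplicate; these are not sequences from the study.
rows = extract_features(["AIL", "AIL", "KDE"])
```

Each additional property would add one more column per peptide, which is how a dataset of this kind grows to many features per sequence.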

3.3. Feature Selection

Feature selection (FS) is a crucial phase in the ML pipeline that focuses on identifying and choosing the most pertinent features from a dataset to develop a predictive model. The aim of FS is to minimize the number of features while preserving those that are most valuable for improving the model’s prediction accuracy [52]. This procedure not only aids in mitigating computational expenses and curbing overfitting but also enhances the model’s ability to generalize. The selected features should exhibit independence to avert redundancy and multicollinearity, factors that could compromise the model’s stability and comprehensibility. Due to the high dimensionality of the dataset, it is important to assign weights to all features and subsequently identify the optimal subset to improve the efficiency of the ML model. The following subsections offer an overview of the process for assigning feature weights and determining the optimal feature subset.

3.4. Assigning Weights to Features: Feature Weighting

The feature weighting technique (FWT) in ML involves assigning weights to features to control their influence on the model’s output. The goal of FWT is to increase the importance of relevant features while reducing the impact of irrelevant ones, thereby improving the model’s accuracy and robustness [53]. In this study, two different FWTs, namely the information gain technique (IGT) and the chi-squared technique (ChST) from the FSelector package in R, were used to assign weights to the features [54]. A brief description of each technique is provided next.

3.4.1. IGT

Information gain (IG) measures the amount of information obtained when a feature is used to split the data. In other words, it measures the reduction in entropy or uncertainty about a random variable after observing another random variable. In the context of feature selection, IG evaluates how well a feature separates the training examples according to their target class. Higher values indicate that a feature is more effective in distinguishing between classes. The function prototype for IG is represented as information.gain(x, y, …), where x and y are the required parameters corresponding to the feature and class variables, respectively. The formula for information gain is as follows:
IG(T,A) = H(T) − H(T∣A)
where
  • IG(T,A) is the information gain of feature A for target T.
  • H(T) is the entropy of the target variable T.
  • H(T∣A) is the conditional entropy of T given feature A.
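The definition above can be checked on a toy example. The following sketch computes IG(T, A) for a discrete feature directly from the entropy formula; it is a from-scratch illustration, not the FSelector implementation, and the function names and sample values are hypothetical. A continuous feature would need to be discretized before being passed in.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(T), in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(T, A) = H(T) - H(T | A) for a discrete feature A."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # H(T | A): entropy within each feature value, weighted by its frequency.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [1, 1, 0, 0]
perfect = information_gain(["high", "high", "low", "low"], labels)
useless = information_gain(["high", "low", "high", "low"], labels)
```

A feature that splits the classes cleanly recovers the full entropy of the target, while a feature that is unrelated to the class yields a gain of zero.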

3.4.2. ChST

The chi-squared technique tests whether a categorical feature and the class variable are independent. It does so by measuring the discrepancy between the observed frequencies of each class across the feature’s categories and the frequencies that would be expected if there were no association. Higher values indicate a stronger association between the feature and the class variable. The function chi.squared(x, y, …) accepts the same parameters as the IGT function. The formula for the chi-squared statistic is as follows:
χ² = ∑ (O_i − E_i)² / E_i
where
  • O_i is the observed frequency for category i.
  • E_i is the expected frequency for category i, assuming no association between the feature and the target.
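As a worked illustration of the statistic, the sketch below builds the feature-by-class contingency table and sums the squared deviations, with each (feature category, class) cell playing the role of category i. It is a from-scratch example under that assumption, not the FSelector code, and the function name and sample data are hypothetical.

```python
from collections import Counter

def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic between a categorical feature and the class."""
    total = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    class_totals = Counter(labels)
    statistic = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in class_totals.items():
            # Expected count if the feature and the class were independent.
            expected = value_count * label_count / total
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic

labels = [1, 1, 0, 0]
associated = chi_squared(["a", "a", "b", "b"], labels)
independent = chi_squared(["a", "b", "a", "b"], labels)
```

In the first call every cell deviates from its expected count of 1 by exactly 1, so the four cells contribute 1 each; in the second call the observed counts match the expected counts and the statistic is zero.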

3.5. Selection of Optimal Subset of Features

Once weights are assigned to the features, identifying the optimal feature subset becomes imperative for constructing a precise and efficient ML model. The optimal feature subset (OFSS) plays a pivotal role in achieving optimal predictive performance [55]. Selecting the OFSS involves choosing a subset of features from a larger set that provides the most accurate predictions for the ML model. This step is especially important in high-dimensional datasets, where an excess of features can negatively affect model performance if not properly managed. To identify the optimal feature subset, two effective techniques are used in this study: the hill climbing search technique (HCST) and the backward search technique (BST). A brief overview of each OFSS technique employed in this study is provided next.

3.5.1. HCST

The HCST is a heuristic optimization method that systematically evaluates the performance of different feature subsets, ultimately selecting the one that achieves the highest accuracy. Starting with an initial subset of features, HCST explores the performance of all possible one-feature additions and identifies the subset that offers the most significant improvement in accuracy [56]. This process continues iteratively until no further enhancements in accuracy can be achieved.
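The search described above can be sketched as a greedy loop. This is a minimal illustration of the one-feature-addition strategy, not the routine used in the study: `toy_score` is a hypothetical stand-in for "train a classifier on this subset and return its accuracy", and the numeric contributions assigned to the feature names are invented.

```python
def hill_climbing_search(features, score, initial=()):
    """Greedy forward search: add the single best feature until the score stops improving."""
    selected = list(initial)
    best_score = score(selected)
    while True:
        candidates = [f for f in features if f not in selected]
        scored = [(score(selected + [f]), f) for f in candidates]
        if not scored:
            break
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no one-feature addition improves the score
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score

# Invented contributions; any other feature slightly lowers the score.
USEFUL = {"F4": 0.30, "F10": 0.20}

def toy_score(subset):
    return 0.5 + sum(USEFUL.get(f, -0.01) for f in subset)

selected, best = hill_climbing_search(["F1", "F4", "F7", "F10"], toy_score)
```

Because the loop only ever accepts the best single addition, it can stop at a local optimum; that trade-off is what makes the method heuristic rather than exhaustive.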

3.5.2. BST

The BST begins with the full set of features and methodically removes those that contribute least to improving accuracy. BST evaluates the impact of removing each feature and selects the subset that offers the greatest increase in accuracy [57]. This iterative process continues until no further gains in accuracy can be achieved.
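Backward search mirrors the forward procedure, starting from all features and removing one at a time. The sketch below is an illustration of that idea under stated assumptions: `toy_score` again stands in for a real train-and-evaluate step, and the contributions attached to the feature names are invented rather than taken from the study.

```python
def backward_search(features, score):
    """Greedy backward elimination: drop one feature at a time while the score improves."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > 1:
        # Score every subset obtained by removing exactly one feature.
        scored = [(score([g for g in selected if g != f]), f) for f in selected]
        top_score, dropped = max(scored)
        if top_score <= best_score:
            break  # no single removal improves the score
        selected.remove(dropped)
        best_score = top_score
    return selected, best_score

# Invented contributions; any other feature slightly lowers the score.
USEFUL = {"F2": 0.20, "F9_6": 0.25}

def toy_score(subset):
    return 0.5 + sum(USEFUL.get(f, -0.02) for f in subset)

selected, best = backward_search(["F2", "F5_2", "F9_6", "F9_8"], toy_score)
```

Starting from the full set lets the search see interactions among all features at the outset, at the price of more expensive early evaluations when the feature count is high.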

3.6. Selection of ML Classifiers

The final hybrid techniques were developed by combining two widely used classification methods: XGBoost (extreme gradient boosting) and random forest (RF). The results from the aforementioned optimal feature selection techniques were used as inputs for these two classification methods. This strategy was employed in the final classification phase to allow for comparative analysis, as each method utilizes different approaches to classify the data. Typically, random forest (RF) consists of an ensemble of decision trees, where multiple trees are built, and the final classification is determined by majority voting [58]. On the other hand, XGBoost is an ML algorithm renowned for its efficiency and effectiveness in classification tasks [59]. It operates by iteratively adding decision trees to minimize the residual errors from previous trees, thereby optimizing a specified loss function to produce a highly accurate ensemble model. Table 3 outlines these ML classifiers alongside their respective tuning parameters.

3.7. Model Building

The proposed models were constructed by combining two FWTs, two OFSS techniques, and two classification methods using permutation and combination approaches. Initially, the peptide sequences of RSV were obtained from BV-BRC in CSV format. Feature extraction (FE) was then performed using the “peptides” and “peptider” packages in the R programming language, resulting in high-dimensional data with 108 features extracted for each peptide sequence. After the FE process, the next step was to assign weights to each feature to assess its relative importance. This was accomplished using two different FWTs. Following this, the OFSS was determined using two distinct techniques. Each FWT output was fed into each OFSS technique separately. The results from each OFSS were then used to train two different classification algorithms. The optimal features identified by various FS methods are listed in Table 4. Equations (1) and (2) illustrate the model formulas with the dependent variable “Class” and its corresponding independent variables for training via the hill climbing and backward search techniques, respectively.
Class ∼ f(F4, F7_18, …, F8_37, F10)    (1)
Class ∼ f(F2, F5_2, …, F9_6, F9_8)    (2)
Next, the dataset with the optimal features from each combination of FWTs and OFSS techniques was divided into training and testing sets. The data were split at a ratio of 70:30, with 70% allocated for training the model and the remaining 30% reserved for testing. After training, the models were validated using the evaluation metrics described in the following section. To ensure the models’ robustness and reliability, K-fold cross-validation was performed.
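The model-building steps can be summarized in a short sketch that enumerates the hybrid pipelines and performs a 70:30 split. The naming pattern, the `train_test_split` helper, and the fixed random seed are assumptions for illustration; the text does not state whether the authors' split was seeded or stratified.

```python
from itertools import product
import random

WEIGHTING = ["IGT", "ChST"]
SUBSET_SEARCH = ["HCST", "BST"]
CLASSIFIERS = ["RF", "XGB"]

# 2 x 2 x 2 = 8 hybrid pipelines, named weighting-search-classifier.
hybrid_models = ["-".join(combo) for combo in product(WEIGHTING, SUBSET_SEARCH, CLASSIFIERS)]

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it at the requested training fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for dataset rows.
train, test = train_test_split(range(100))
```

Enumerating the combinations explicitly makes it easy to confirm that each weighting technique is paired with each search technique and each classifier exactly once.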

4. Model Evaluation

Model evaluation constitutes a crucial aspect of any machine learning (ML) workflow. It is imperative to assess the performance of the ML model to ensure its capability to generalize accurately to new, unseen data [60]. This approach aids in the identification of the optimal model that effectively represents the data and predicts future performance. In this study, a variety of assessment metrics were used, including accuracy, sensitivity, specificity, precision, area under the receiver operating characteristic curve (AUROC), F1 score, and Matthews correlation coefficient (MCC) [61]. Additionally, the performance consistency and robustness of the proposed techniques were evaluated using K-fold cross-validation (KFCV). It is important to note that evaluating ML models is an iterative process. The outcomes of these evaluations can inform adjustments to the model, including adjustments to hyperparameters, feature selection, or data preprocessing. This iterative process continues until the model reaches the desired level of performance. The following section details the metrics used for model evaluation, where TP stands for true positive, TN for true negative, FP for false positive, and FN for false negative.

4.1. Accuracy

The accuracy metric evaluates the proportion of correct predictions made by the model relative to the total number of predictions [62]. It is computed using Equation (3).
Accuracy = (TP + TN)/(TP + TN + FP + FN)    (3)

4.2. Sensitivity

Sensitivity, also referred to as recall or true positive rate (TPR), measures the proportion of actual positives that the model correctly identifies [63]. Sensitivity is calculated using Equation (4).
Sensitivity = TP/(TP + FN)    (4)

4.3. Specificity

Specificity, or true negative rate (TNR), is a metric that gauges the proportion of actual negatives that are accurately identified by the model [63]. Specificity is calculated using Equation (5).
Specificity = TN/(TN + FP)    (5)

4.4. Precision

Precision, also called positive predictive value (PPV), is a metric that assesses the proportion of positive predictions that are genuinely correct [63]. Precision is calculated using Equation (6).
Precision = TP/(TP + FP)    (6)

4.5. F1 Score

The F1 score is a metric used to evaluate ML models by combining both precision and recall into a single measure of performance [63]. It represents the harmonic mean of precision and recall and is calculated using Equation (7).
F1 score = 2 × (precision × recall)/(precision + recall)    (7)
The F1 score takes a value between 0 and 1, with higher values indicating that the model achieves both high precision and high recall on the positive class. Because the formula does not involve TN, the F1 score does not reflect how well the negative class is identified.
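The five metrics defined so far all derive from the same four confusion-matrix counts, which the following sketch makes explicit. The function name is hypothetical and the counts are made-up values chosen for easy arithmetic; they are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts the accuracy is 170/200 = 0.85, sensitivity is 0.90, specificity is 0.80, precision is 90/110, and the F1 score is 180/210.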

4.6. Area under the ROC Curve (AUC-ROC)

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) for various threshold settings [64]. The area under the ROC curve (AUC-ROC) measures the model’s overall ability to differentiate between positive and negative classes. It is a single metric ranging from 0 to 1, where a value of 1 denotes a perfect classifier, and a value of 0.5 represents a random classifier.
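One way to see what the AUC measures is through its rank interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive-negative pair; the function name and the sample scores are hypothetical, and a library routine would normally be used on real data.

```python
def auc_roc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
perfect = auc_roc([0.9, 0.8, 0.3, 0.1], labels)
partial = auc_roc([0.9, 0.3, 0.8, 0.1], labels)
uninformative = auc_roc([0.5, 0.5, 0.5, 0.5], labels)
```

In the second call three of the four positive-negative pairs are ordered correctly, giving 0.75; identical scores for every example give 0.5, matching the random-classifier baseline described above.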

4.7. Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient (MCC) takes into account all four values from the confusion matrix (TP, FP, TN, FN) and provides a score between −1 and 1. A score of 1 represents a perfect prediction, 0 indicates a random prediction, and −1 denotes a prediction that is completely opposite to the true labels [65]. The MCC is calculated using Equation (8).
MCC = (TP × TN − FP × FN)/(Sqrt ((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN)))
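Equation (8) can be sketched as follows. This is an illustrative Python version rather than the study's own code; the function name `mcc` and the example counts are hypothetical, and returning 0 when the denominator is zero is a common convention assumed here, not something stated in the article.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts (Equation (8))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, for illustration only.
print(f"typical:  {mcc(tp=90, tn=85, fp=15, fn=10):.3f}")
print(f"perfect:  {mcc(tp=50, tn=50, fp=0, fn=0):.3f}")
print(f"inverted: {mcc(tp=0, tn=0, fp=50, fn=50):.3f}")
```

The perfect and inverted cases reproduce the two extremes of the range described above.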

4.8. K-Fold Cross Validation (KFCV)

K-fold cross-validation (KFCV) is a method used to evaluate the consistency and robustness of a model by dividing the original dataset into K subsets, or folds, of approximately equal size [66]. The model is trained and tested K times, with each fold serving as the evaluation set once while the remaining K-1 folds are used for training, as illustrated in Figure 3. The typical KFCV process includes several steps: First, the dataset is randomly shuffled to ensure even distribution. Next, the dataset is split into K groups or folds of roughly equal size. Then, the model is trained on K-1 folds and tested on the remaining fold for each iteration. Performance metrics, such as accuracy, are computed for each fold. Finally, the mean and variance of these metrics are calculated across the K folds to provide an overall assessment of the model’s performance. KFCV helps reduce the risk of overfitting and offers a more reliable estimate of the model’s effectiveness on new data. The value of K is usually set to 5 or 10, though it can be adjusted based on the dataset size and model complexity.
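The shuffling and splitting steps described above can be sketched with the Python standard library. The function name `k_fold_indices`, the sample count, and the seed are hypothetical; training and scoring a model on each split is left out, and this is not the code used in the study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # step 1: shuffle the dataset

    # Step 2: split into k folds whose sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size

    # Step 3: each fold is the test set once; the other k-1 folds form the training set.
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


# Hypothetical dataset of 23 samples split into 5 folds.
splits = k_fold_indices(n_samples=23, k=5)
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train={len(train_idx)} test={len(test_idx)}")
```

Each sample appears in exactly one test fold, so every observation is used for evaluation once and for training K-1 times.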

5. Results and Discussion

In this section, we present the results obtained from applying various hybrid techniques to a high-dimensional dataset, which comprises 108 features extracted from RSV peptide sequences. To determine the most effective hybrid technique, a comparative analysis was performed among the different hybrid methods used in this study, based on the evaluation parameters previously outlined. Table 5 presents the accuracies achieved by the various hybrid approaches. It is evident from Table 5 that XGBoost (XGB) demonstrates outstanding results, consistently exceeding 93% accuracy across all scenarios.
Notably, the hybrid approach combining ChST and BST achieves the highest accuracy of 97.10% for XGBoost (XGB) models. Random forest (RF), combined with the various feature weighting (FW) and optimal feature selection techniques, shows accuracies ranging from a low of 79.23% (IGT with BST) to a high of 94.19% (IGT with HCST). However, when evaluating the effectiveness of a hybrid model in a classification problem, accuracy alone does not suffice as the sole determining factor [60]. Other crucial parameters such as recall, specificity, precision, negative predictive value of a particular class, AUROC, and F1 score of the predictive method must also be considered. To this end, Table 6 presents a comprehensive comparison of these parameters for the best hybrid models achieved in this study. As depicted in Table 6, the XGB model (Model 1) in combination with the chi-squared and backward search techniques demonstrates superior results across all parameters, with an F1 score and AUROC value of 0.98 and 0.99, respectively.
Assessing the reliability of the technique is essential to determine whether the model is susceptible to overfitting or underfitting issues. Overfitting occurs when the model excels with training data but fails to generalize to testing data, while underfitting happens when the model performs poorly on both training and testing data. To verify the reliability and consistency of the hybrid techniques used in this study, 5-fold cross-validation (5FCV) was performed on the top two hybrid methods. The accuracies achieved by these top-performing hybrid models across different folds are shown in Table 7, and their accuracy is plotted in Figure 4.
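The per-fold accuracies reported in Table 7 can be summarized by their mean and spread, which is the final step of the KFCV procedure. The sketch below uses the fold values from Table 7; the variable names are hypothetical, and the choice of the sample standard deviation as the spread measure is an assumption made here rather than something the article specifies.

```python
from statistics import mean, stdev

# Per-fold accuracies (%) as reported in Table 7.
fold_accuracy = {
    "Model 1": [97.45, 97.31, 95.78, 97.62, 97.92],
    "Model 2": [92.76, 94.11, 95.19, 94.43, 93.96],
}

# Mean and sample standard deviation across the five folds.
summary = {name: (mean(values), stdev(values)) for name, values in fold_accuracy.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: mean = {avg:.3f}%, sample standard deviation = {spread:.3f}")
```

The means reproduce the average accuracies listed in the last row of Table 7.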

6. Conclusions

In conclusion, RSV poses a significant threat to individuals across all age groups, especially infants and young children, with seasonal outbreaks typically peaking during autumn and winter months. Vaccination remains the most effective strategy for managing viral disease outbreaks [67]. While ongoing efforts aim to develop an RSV vaccine, many current methods involve using weakened forms of the entire pathogen to trigger an immune response. In contrast, the peptide-based vaccine (PBV) concept emphasizes the identification and synthetic creation of specific immunodominant peptides, known as T-cell epitopes (TCEs), as potential components of a vaccine. Despite the many advantages of PBVs, such as enhanced safety, immunogenicity, and cost-effectiveness, they have not received widespread attention [68]. Computational methods provide a quicker and more economical way to identify TCEs compared to traditional laboratory techniques. In this study, we developed and assessed eight hybrid predictive ML models for forecasting the TCEs of RSV [69]. After extracting features from peptide sequences, we used heuristic and greedy search techniques to identify the most effective features for model training. Performance evaluation using various metrics, including accuracy, sensitivity, specificity, and AUROC, showed that the combination of XGBoost with ChST and BST was the most accurate and reliable predictive method. Our model provides deterministic TCE prediction, unlike other methods, such as NetMHC [70] and CTLpred [71], which only estimate binding potential. Furthermore, our model can predict peptides of various lengths, including those longer than 9-mers, addressing a limitation of CTLpred. However, it is crucial to validate model predictions through experimental methods (in vivo and in vitro) before considering them for vaccine development [72].
In summary, the hybrid ML techniques proposed in this study demonstrated exceptional performance and surpassed current ML methods for predicting RSV TCEs. Future research should explore additional physicochemical properties and utilize advanced ML classifiers to further improve accuracy and other metrics. Overall, using computational methods to identify potential vaccine candidates could significantly impact global health by saving lives, preventing future outbreaks, and reducing the virus’s capacity to evade immunity through genetic mutations.

Author Contributions

Conceptualization, S.N.H.B.; methodology, S.N.H.B.; software, S.N.H.B.; validation, K.A.O. and S.N.H.B.; formal analysis, S.N.H.B.; investigation, K.A.O.; resources, K.A.O.; data curation, S.N.H.B.; writing—original draft preparation, S.N.H.B.; writing—review and editing, K.A.O.; visualization, K.A.O.; supervision, K.A.O.; project administration, K.A.O.; funding acquisition, K.A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University of Johannesburg’s University Research Committee (URC) grant for Prof. K.A. Ogudo (2019) and the Department of Electrical and Electronic Engineering Technology’s K.A. Ogudo research costs center. The APC was funded by a grant from the University of Johannesburg Library Research Funds (UJ).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author.

Acknowledgments

The authors acknowledge the support of and express their gratitude to the University of Johannesburg’s University Research Committee (URC) for the grant awarded to Prof. K.A. Ogudo and the Department of Electrical and Electronic Engineering Technology. This work was partly supported by a grant from the University of Johannesburg Library Research Funds (UJ).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Battles, M.B.; McLellan, J.S. Respiratory syncytial virus entry and how to block it. Nat. Rev. Microbiol. 2019, 17, 233–245.
  2. Blount, R.E., Jr.; Morris, J.A.; Savage, R.E. Recovery of cytopathogenic agent from chimpanzees with coryza. Proc. Soc. Exp. Biol. Med. 1956, 92, 544–549.
  3. Chanock, R.; Roizman, B.; Myers, R. Recovery from infants with respiratory illness of a virus related to chimpanzee coryza agent (CCA). I. Isolation, properties and characterization. Am. J. Hyg. 1957, 66, 281–290.
  4. Chanock, R.; Finberg, L. Recovery from infants with respiratory illness of a virus related to chimpanzee coryza agent (CCA). II. Epidemiologic aspects of infection in infants and young children. Am. J. Hyg. 1957, 66, 291–300.
  5. Glezen, W.P.; Taber, L.H.; Frank, A.L.; Kasel, J.A. Risk of primary infection and reinfection with respiratory syncytial virus. Am. J. Dis. Child 1986, 140, 543–546.
  6. Collins, P.L.; McIntosh, K.; Chanock, R.M. Respiratory Syncytial Virus Fields Virology; Fields, B.N., Ed.; Raven Press: New York, NY, USA, 1996; pp. 1313–1351.
  7. Health Alert Network (HAN)—00498. Centers for Disease Control and Prevention. 2023. Available online: https://emergency.cdc.gov/han/2023/han00498.asp (accessed on 19 March 2024).
  8. Transmission of RSV (Respiratory Syncytial Virus). Centers for Disease Control and Prevention. 2023. Available online: https://www.cdc.gov/rsv/causes/index.html (accessed on 19 March 2024).
  9. Symptoms and Care of RSV (Respiratory Syncytial Virus). Centers for Disease Control and Prevention. 2023. Available online: https://www.cdc.gov/rsv/symptoms/?CDC_AAref_Val=https://www.cdc.gov/rsv/about/symptoms.html (accessed on 23 March 2024).
  10. Olson, D. RSV: The Annual Epidemic You May Not Know about (but Should), NFID. 2023. Available online: https://www.nfid.org/rsv-the-annual-epidemic-you-may-not-know-about-but-should/ (accessed on 27 March 2024).
  11. Jha, A.; Jarvis, H.; Fraser, C.; Openshaw, P. Respiratory Syncytial Virus. In SARS, MERS and other Viral Lung Infections; Hui, D.S., Rossi, G.A., Johnston, S.L., Eds.; European Respiratory Society: Sheffield, UK, 2016; Chapter 5. Available online: https://www.ncbi.nlm.nih.gov/books/NBK442240/ (accessed on 24 March 2024).
  12. Bächi, T.; Howe, C. Morphogenesis and ultrastructure of respiratory syncytial virus. J. Virol. 1973, 12, 1173–1180.
  13. Gan, S.W.; Tan, E.; Lin, X.; Yu, D.; Wang, J.; Tan, G.M.Y.; Vararattanavech, A.; Yeo, C.Y.; Soon, C.H.; Soong, T.W.; et al. The small hydrophobic protein of the human respiratory syncytial virus forms pentameric ion channels. J. Biol. Chem. 2012, 287, 24671–24689.
  14. Gower, T.L.; Pastey, M.K.; Peeples, M.E.; Collins, P.L.; McCurdy, L.H.; Hart, T.K.; Guth, A.; Johnson, T.R.; Graham, B.S. RhoA signaling is required for respiratory syncytial virus-induced syncytium formation and filamentous virion morphology. J. Virol. 2005, 79, 5326–5336.
  15. Kwilas, S.; Liesman, R.M.; Zhang, L.; Walsh, E.; Pickles, R.J.; Peeples, M.E. Respiratory syncytial virus grown in Vero cells contains a truncated attachment protein that alters its infectivity and dependence on glycosaminoglycans. J. Virol. 2009, 83, 10710–10718.
  16. Gould, P.S.; Easton, A.J. Coupled translation of the second open reading frame of M2 mRNA is sequence dependent and differs significantly within the subfamily Pneumovirinae. J. Virol. 2007, 81, 8488–8496.
  17. Mitra, R.; Baviskar, P.; Duncan-Decocq, R.R.; Patel, D.; Oomens, A.G.P. The human respiratory syncytial virus matrix protein is required for maturation of viral filaments. J. Virol. 2012, 86, 4432–4443.
  18. Nam, H.H.; Ison, M.G. Respiratory syncytial virus infection in adults. BMJ 2019, 366, l5021.
  19. Kim, H.W.; Canchola, J.G.; Brandt, C.D.; Pyles, G.; Chanock, R.M.; Jensen, K.; Parrott, R.H. Respiratory syncytial virus disease in infants despite prior administration of antigenic inactivated vaccine. Am. J. Epidemiol. 1969, 89, 422–434.
  20. Karch, C.P.; Burkhard, P. Vaccine technologies: From whole organisms to rationally designed protein assemblies. Biochem. Pharmacol. 2016, 120, 1–14.
  21. Bukhari, S.N.H.; Jain, A.; Haq, E.; Mehbodniya, A.; Webber, J. Ensemble machine learning model to predict SARS-CoV-2 t-cell epitopes as potential vaccine targets. Diagnostics 2021, 11, 1990.
  22. Cai, X.; Li, J.J.; Liu, T.; Brian, O.; Li, J. Infectious disease mRNA vaccines and a review on epitope prediction for vaccine design. Brief. Funct. Genom. 2021, 20, 289–303.
  23. Huber, S.R.; van Beek, J.; de Jonge, J.; Luytjes, W.; van Baarle, D. T cell responses to viral infections—opportunities for peptide vaccination. Front. Immunol. 2014, 5, 171.
  24. Bukhari, S.N.H.; Webber, J.; Mehbodniya, A. Decision tree based ensemble machine learning model for the prediction of Zika virus T-cell epitopes as potential vaccine candidates. Sci. Rep. 2022, 12, 7810.
  25. Seder, R.A.; Darrah, P.A.; Roederer, M. T-cell quality in memory and protection: Implications for vaccine design. Nat. Rev. Immunol. 2008, 8, 247–258.
  26. Li, W.; Joshi, M.D.; Singhania, S.; Ramsey, K.H.; Murthy, A.K. Peptide Vaccine: Progress and Challenges. Vaccines 2014, 2, 515–536.
  27. Berger, C.M.; Knutson, K.L.; Salazar, L.G.; Schiffman, K.; Disis, M.L. Peptide-Based Vaccines. In Handbook of Cancer Vaccines. Cancer Drug Discovery and Development; Morse, M.A., Clay, T.M., Lyerly, H.K., Eds.; Humana Press: Totowa, NJ, USA, 2004.
  28. Gilbert, S.C. T-cell-inducing vaccines—What’s the future. Immunology 2012, 135, 19–26.
  29. Graham, B.S. Biological challenges and technological opportunities for respiratory syncytial virus vaccine development. Immunol. Rev. 2011, 239, 149–166.
  30. Cho, H.I.; Celis, E. Optimized peptide vaccines eliciting extensive CD8 T-cell responses with therapeutic antitumor effects. Cancer Res. 2009, 69, 9012–9019.
  31. Uchida, T. Development of a cytotoxic T-lymphocyte-based, broadly protective influenza vaccine. Microbiol. Immunol. 2011, 55, 19–27.
  32. Ura, T.; Takeuchi, M.; Kawagoe, T.; Mizuki, N.; Okuda, K.; Shimada, M. Current Vaccine Platforms in Enhancing T-Cell Response. Vaccines 2022, 10, 1367.
  33. Bukhari, S.N.H.; Jain, A.; Haq, E. A Novel Ensemble Machine Learning Model for Prediction of Zika Virus T-Cell Epitope. In Lecture Notes on Data Engineering and Communications Technologies; Springer: Berlin/Heidelberg, Germany, 2021; Volume 91, pp. 275–292.
  34. Bravi, B. Development and use of machine learning algorithms in vaccine target selection. Vaccines 2024, 9, 15.
  35. Chen, J.; Tan, S.; Avadhanula, V.; Moise, L.; Piedra, P.A.; De Groot, A.S.; Bahl, J. Diversity and evolution of computationally predicted T cell epitopes against human respiratory syncytial virus. PLoS Comput. Biol. 2023, 19, e1010360.
  36. McDermott, D.S.; Knudson, C.J.; Varga, S.M. Determining the breadth of the respiratory syncytial virus-specific T cell response. J. Virol. 2014, 88, 3135–3143.
  37. Anandhan, G.; Narkhede, Y.; Mohan, M.; Paramasivam, P. Immunoinformatics aided approach for predicting potent cytotoxic T cell epitopes of respiratory syncytial virus. J. Biomol. Struct. Dyn. 2023, 41, 12093–12105.
  38. Schmidt, M.E.; Varga, S.M. Identification of Novel Respiratory Syncytial Virus CD4+ and CD8+ T Cell Epitopes in C57BL/6 Mice. Immunohorizons 2019, 3, 1–12.
  39. Shah, M.N.A.; Barua, P.; Khan, M.K. Immunoinformatics Aided Prediction of Cytotoxic T Cell Epitope of Respiratory Syncytial Virus. Biores. Commun. (BRC) 2022, 1, 99–104. Available online: https://www.bioresearchcommunications.com/index.php/brc/article/view/157 (accessed on 19 March 2024).
  40. Dar, H.A.; Almajhdi, F.N.; Aziz, S.; Waheed, Y. Immunoinformatics-Aided Analysis of RSV Fusion and Attachment Glycoproteins to Design a Potent Multi-Epitope Vaccine. Vaccines 2022, 10, 1381.
  41. Lee, S.; Stokes, K.L.; Currier, M.G.; Sakamoto, K.; Lukacs, N.W.; Celis, E.; Moore, M.L. Vaccine-Elicited CD8+ T Cells Protect against Respiratory Syncytial Virus Strain A2-Line19F-Induced Pathogenesis in BALB/c Mice. J. Virol. 2012, 86, 13016–13024.
  42. Matyushenko, V.; Kotomina, T.; Kudryavtsev, I.; Mezhenskaya, D.; Prokopenko, P.; Matushkina, A.; Sivak, K.; Muzhikyan, A.; Rudenko, L.; Isakova-Sivak, I. Conserved T-cell epitopes of respiratory syncytial virus (RSV) delivered by recombinant live attenuated influenza vaccine viruses efficiently induce RSV-specific lung-localized memory T cells and augment influenza-specific resident memory T-cell responses. Antivir. Res. 2020, 182, 104864.
  43. Schaap-Johansen, A.-L.; Vujović, M.; Borch, A.; Hadrup, S.R.; Marcatili, P. T cell epitope prediction and its application to immunotherapy. Front. Immunol. 2021, 12, 712488.
  44. Truex, N.L.; Mohapatra, S.; Melo, M.; Rodriguez, J.; Li, N.; Abraham, W.; Sementa, D.; Touti, F.; Keskin, D.B.; Wu, C.J.; et al. Design of cytotoxic T cell epitopes by machine learning of human degrons. ACS Cent. Sci. 2024, 10, 793–802.
  45. Adiga, R. Benchmarking Datasets from Malaria Cytotoxic T-cell Epitopes Using Machine Learning Approach. Avicenna J. Med. Biotechnol. 2021, 13, 87–91.
  46. Sohail, M.S.; Ahmed, S.F.; Quadeer, A.A.; McKay, M.R. In silico T cell epitope identification for SARS-CoV-2: Progress and perspectives. Adv. Drug Deliv. Rev. 2021, 171, 29–47.
  47. Olson, R.D.; Assaf, R.; Brettin, T.; Conrad, N.; Cucinell, C.; Davis, J.J.; Dempsey, D.M.; Dickerman, A.; Dietrich, E.M.; Kenyon, R.W.; et al. Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): A resource combining PATRIC, IRD and ViPR. Nucleic Acids Res. 2023, 6, D678–D689.
  48. Vita, R.; Mahajan, S.; Overton, J.A.; Dhanda, S.K.; Martini, S.; Cantrell, J.R.; Wheeler, D.K.; Sette, A.; Peters, B. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2018, 47, D339–D343.
  49. Osorio, D.; Rondón-Villarreal, P.; Torres, R. Peptides: A package for data mining of antimicrobial peptides. R J. 2015, 7, 4–14.
  50. Evaluation of Diversity in Nucleotide Libraries [R Package Peptider Version 0.2.2]. September 2015. Available online: https://cran.r-project.org/package=peptider (accessed on 22 March 2023).
  51. R Core Team. R Foundation for Statistical Computing; R Core Team: Vienna, Austria, 2013.
  52. Gupta, V.K.; Rana, P.S. Toxicity prediction of small drug molecules of aryl hydrocarbon receptor using a proposed ensemble model. Turk. J. Electr. Eng. Comput. Sci. 2019, 27, 2833–2849.
  53. Niño-Adan, I.; Manjarres, D.; Landa-Torres, I.; Portillo, E. Feature weighting methods: A review. Expert Syst. Appl. 2021, 184, 115424.
  54. CRAN—Package FSelector. Available online: https://cran.r-project.org/web/packages/FSelector/index.html (accessed on 22 March 2023).
  55. Kang, S.-H.; Kim, K.J. A feature selection approach to find optimal feature subsets for the network intrusion detection system. Cluster Comput. 2016, 19, 325–333.
  56. Więckowski, J.; Kizielewicz, B.; Kołodziejczyk, J. Application of Hill Climbing Algorithm in Determining the Characteristic Objects Preferences Based on the Reference Set of Alternatives. Intell. Decis. Technol. 2020, 193, 341–351.
  57. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
  58. Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2, 18–22.
  59. Tarwidi, D.; Pudjaprasetya, S.R.; Adytia, D.; Apri, M. An optimized XGBoost-based machine learning method for predicting wave run-up on a sloping beach. MethodsX 2023, 10, 102119.
  60. Alpaydin, E. Introduction to Machine Learning, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2010.
  61. Cihan, P.; Ozger, Z.B. A new approach for determining SARS-CoV-2 epitopes using machine learning-based in silico methods. Comput. Biol. Chem. 2022, 98, 107688.
  62. Khanna, D.; Rana, P.S. Multilevel ensemble model for prediction of IgA and IgG antibodies. Immunol. Lett. 2017, 184, 51–60.
  63. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061.
  64. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159.
  65. Zhu, Q. On the performance of Matthews correlation coefficient (MCC) for imbalanced dataset. Pattern Recognit. Lett. 2020, 136, 71–80.
  66. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross-Validation BT—Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA, 2009; pp. 532–538.
  67. Hernández-Rivas, L.; Pedraz, T.; Calvo, C.; San Juan, I.; Mellado, M.ªJ.; Robustillo, A. Respiratory syncytial virus outbreak during the COVID-19 pandemic. How has it changed? Enfermedades Infecc. Y Microbiol. Clin. Engl. Ed. 2023, 41, 352–355.
  68. Yang, H.; Cao, J.; Lin, X.; Yue, J.; Zieneldien, T.; Kim, J.; Wang, L.; Fang, J.; Huang, R.P.; Bai, Y.; et al. Developing an Effective Peptide-Based Vaccine for COVID-19: Preliminary Studies in Mice Models. Viruses 2022, 14, 449.
  69. Sunita, S.A.; Singh, Y.; Shukla, P. Computational tools for modern vaccine development. Hum. Vaccines Immunother. 2020, 16, 723–735.
  70. Nielsen, M.; Lundegaard, C.; Worning, P.; Lauemøller, S.L.; Lamberth, K.; Buus, S.; Brunak, S.; Lund, O. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci. 2003, 12, 1007–1017.
  71. Bhasin, M.; Raghava, G.P.S. Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine 2004, 22, 3195–3204.
  72. Danchin, A. In vivo, in vitro and in silico: An open space for the development of microbe-based applications of synthetic biology. Microb. Biotechnol. 2022, 15, 42–64.
Figure 1. Structure of RSV.
Figure 2. Proposed methodology.
Figure 3. K-fold cross-validation technique.
Figure 4. KFCV results of the hybrid model.
As depicted in Figure 4, it is evident that the hybrid XGBoost model exhibits the most consistent accuracy results compared to the RF hybrid techniques. The results indicate that the proposed model stands out due to its comprehensive hybrid framework that combines multiple feature weighting, selection, and classification techniques, aiming to capture diverse peptide characteristics for improved accuracy. Unlike many existing tools relying on single models or limited feature engineering, the proposed approach leverages the strengths of different algorithms to mitigate potential biases. Moreover, the KFCV technique mitigates the risk of overfitting by dividing the dataset into K equal-sized folds. The model is trained on K-1 folds and tested on the remaining fold iteratively. This process provides a more reliable estimate of model performance on unseen data by exposing the model to different subsets of the data during training.
Table 1. Physicochemical properties.
| Physicochemical Property | Count | Notation |
| --- | --- | --- |
| Aliphatic index | 1 | F1 |
| Boman index | 1 | F2 |
| Insta index | 1 | F3 |
| Probability of detection | 1 | F4 |
| Hmoment index | 2 | F5_1, F5_2 |
| Molecular weight | 2 | F6_1, F6_2 |
| Peptide charge for 45 scales | 45 | F7_1 to F7_45 |
| Hydrophobicity at 44 scales | 44 | F8_1 to F8_44 |
| Isoelectric point for 9 pKscale | 9 | F9_1 to F9_9 |
| Kidera factors | 1 | F10 |
| aaComp | 1 | F11 |
Table 2. Snapshot of the dataset.
| Peptide Sequence | F1 | F2 | ----- | F10 | F11 | Class |
| --- | --- | --- | --- | --- | --- | --- |
| VRSKVF | 28.10 | 0.976 | ----- | −2.765 | 4.101 | 1 |
| SRISKDAT | 34.76 | 0.409 | ----- | −1.98 | −1.342 | 1 |
| KFELRZFIG | 132.60 | 2.157 | ----- | −0.577 | −0.029 | 0 |
| SAVFEKTLS | 97 | −6.98 | ----- | −0.912 | −4.719 | 0 |
Table 3. ML classifiers used.
| Classifier | Method | Package | Tuning Parameter |
| --- | --- | --- | --- |
| XGB | xgb | xgboost | (booster = “gbtree”, objective = “binary:logistic”, max_depth = 6, min_child_weight = 1, subsample = 1) |
| RF | randomForest | randomForest | Ntree = 1500, mtry = 10 |
Table 4. Optimal feature sets by different FS techniques.
| Technique | Optimal Feature Set | No. of Features |
| --- | --- | --- |
| HCST | F7_43, F7_42, F1, F8_43, F7_38, F11_7, F7_28, F8_24, F4, F7_41, F8_13, F7_6, F7_22, F10, F8_23, F7_7, F7_19, F8_16, F9_2, F7_23, F5_2, F11_3 | 22 |
| BST | F6_1, F9_4, F9_7, F7_8, F8_7, F7_37, F7_1, F5_2, F8_10, F3, F7_40, F7_39, F7_24, F8_19, F7_5, F1, F7_33, F8_20, F7_34, F7_38, F2, F9_5, F7_14, F11_13, F6_2 | 25 |
Table 5. Accuracies (%) achieved by different hybrid models.
| CT | FWT | OFSS: BST | OFSS: HCST |
| --- | --- | --- | --- |
| XGB | IGT | 93.65 | 95.12 |
| XGB | ChST | 97.10 | 95.64 |
| RF | IGT | 79.23 | 94.19 |
| RF | ChST | 91.32 | 93.68 |
Table 6. Comparative results of best hybrid models.
| Model | Sensitivity | Specificity | F1 Score | AUC | Precision | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| Model 1: ChST−BST–XGB | 0.98 | 0.97 | 0.98 | 0.99 | 0.99 | 0.96 |
| Model 2: IGT−HCST–RF | 0.92 | 0.93 | 0.90 | 0.96 | 0.92 | 0.94 |
Table 7. Top two hybrid models’ accuracy (%) via 5FCV.
| Run | Model 1 | Model 2 |
| --- | --- | --- |
| 1 | 97.45 | 92.76 |
| 2 | 97.31 | 94.11 |
| 3 | 95.78 | 95.19 |
| 4 | 97.62 | 94.43 |
| 5 | 97.92 | 93.96 |
| Average Accuracy | 97.216 | 94.09 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Bukhari, S.N.H.; Ogudo, K.A. Hybrid Predictive Machine Learning Model for the Prediction of Immunodominant Peptides of Respiratory Syncytial Virus. Bioengineering 2024, 11, 791. https://doi.org/10.3390/bioengineering11080791