3. Results and Discussion
A dataset comprising 212 antigenic and immunogenic peptides and 212 antigenic but non-immunogenic peptides was assembled following the methodology delineated previously. This dataset was partitioned into a training set of 170 immunogenic and 170 non-immunogenic peptides and a test set of 42 immunogenic and 42 non-immunogenic peptides. Each peptide was numerically represented as a string of 5n E-descriptors, where n denotes the number of amino acid residues. To standardize the representation, these strings of varying lengths were subjected to ACC-transformation with a lag of L = 7. The lag value of 7 was chosen because it equals the length of the shortest peptide in the dataset. Consequently, the training set was transformed into a 340 × 175 (7 × 5²) matrix, while the test set was converted into an 84 × 175 matrix. The flowchart of data preprocessing is presented in Figure 1.
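The ACC-transformation described above can be sketched as follows. This is a minimal illustration of one common form of the auto- and cross-covariance transform (mean product of descriptor values a given lag apart); the exact normalization or centering used in the study may differ.

```python
import numpy as np

def acc_transform(peptide_E, max_lag=7):
    """ACC-transform an n x 5 matrix of E-descriptors (one row per residue)
    into a fixed-length vector of 5 * 5 * max_lag auto-/cross-covariances."""
    n, d = peptide_E.shape  # n residues, d = 5 E-descriptors
    feats = []
    for j in range(d):            # E-descriptor of the first residue
        for k in range(d):        # E-descriptor of the second residue
            for lag in range(1, max_lag + 1):
                # mean product of descriptor j and descriptor k,
                # taken over all residue pairs `lag` positions apart
                feats.append(np.dot(peptide_E[:n - lag, j],
                                    peptide_E[lag:, k]) / (n - lag))
    return np.array(feats)
```

With max_lag = 7 and 5 descriptors, every peptide of length ≥ 8 maps to the same 175-dimensional vector, which is what allows peptides of varying lengths to share one feature matrix.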
Six supervised machine learning algorithms were employed to construct classification models for predicting immunogenicity on the training set. To enhance model performance, hyperparameters were tuned via grid search combined with 10-fold cross-validation on the training set. This method uses predefined dictionaries of hyperparameters and their candidate values, evaluating the model across all feasible combinations by cross-validation. Optimal hyperparameters were determined for each model as follows: the kNN model achieved its peak predictive performance at k = 2, the LDA method performed best with solver = svd, and the QDA model achieved its highest performance with reg_param = 0.0. The SVM method reached its best predictive capability with C = 2, gamma = 10, and kernel = rbf. The RF model performed optimally with max_depth = 80, max_features = 2, and n_estimators = 300. Lastly, the XGBoost method showed its highest performance with learning_rate = 0.3, max_depth = 3, and n_estimators = 100.
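The tuning procedure can be sketched with scikit-learn's GridSearchCV, shown here for the SVM. The candidate value ranges in the grid below are illustrative assumptions (the study does not list its full grids), and the stand-in random data replaces the 340 × 175 ACC training matrix.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; the study used the 340 x 175 ACC-transformed training matrix.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 175))
y_train = rng.integers(0, 2, size=60)

# Candidate hyperparameter values (illustrative ranges); every combination
# is scored by 10-fold cross-validation on the training set.
param_grid = {"C": [0.1, 1, 2, 10], "gamma": [0.1, 1, 10], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
best = search.best_params_  # on the real data the optimum was C=2, gamma=10, kernel='rbf'
```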
The performance of the models on the training set is presented in Table 2. The kNN model demonstrated the best recognition of immunogenic peptides (sensitivity = 0.95) but performed very poorly on the non-immunogenic peptides (specificity = 0.29). Both the LDA and QDA models showed poor overall performance on both classes. In contrast, the SVM and XGBoost models delivered balanced performances, with robust predictive capabilities for both immunogenic and non-immunogenic peptides. The RF model exhibited the highest accuracy, primarily attributable to its superior prediction of non-immunogens (specificity = 0.90). While the RF model's ability to predict immunogens was relatively strong (sensitivity = 0.72), it was lower than its performance on non-immunogens.
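The sensitivity and specificity values reported in Tables 2 and 3 follow from the confusion matrix. A minimal sketch, assuming immunogens are coded as class 1 and non-immunogens as class 0:

```python
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, y_pred):
    """Sensitivity = recall on immunogens (class 1);
    specificity = recall on non-immunogens (class 0)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

sens_spec([1, 1, 0, 0], [1, 0, 0, 0])  # one immunogen missed: (0.5, 1.0)
```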
Once trained, the models were evaluated on the hold-out test set to objectively assess their generalization capabilities (Table 3). The observed results align well with those on the training set, indicating a lack of overfitting. Although excelling in the recognition of immunogens, the kNN model performed notably poorly when classifying the non-immunogenic peptides, rendering it unsuitable for effective classification. Both the LDA and QDA models again demonstrated subpar overall performance across the metrics. The SVM, RF, and XGBoost models emerged as the most promising predictive models, displaying balanced and robust performance across all evaluated metrics. Consequently, these models were selected for further validation.
Initially, Y-scrambling was conducted for each of the SVM, RF, and XGBoost models. This process involved randomly shuffling the target labels in the training data, constructing new models, and evaluating them on the test data. The procedure was repeated 100 times and the average accuracy computed. The accuracies for each model hovered around 0.5 (0.5075 for SVM, 0.5182 for RF, and 0.4987 for XGBoost), a marked deterioration to the level of a random classifier (which has an expected accuracy of 0.5 in binary classification). The poor performance on shuffled data, contrasted with the good performance on the original data, demonstrates the robustness of the models.
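The Y-scrambling loop above can be sketched as follows. This is a generic implementation of the described procedure, not the study's exact script:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def y_scramble_accuracy(model, X_tr, y_tr, X_te, y_te, n_rounds=100, seed=0):
    """Mean test accuracy of models retrained on randomly shuffled labels.
    Values near 0.5 indicate the original model did not fit chance patterns."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_rounds):
        y_shuffled = rng.permutation(y_tr)           # break the label-feature link
        fitted = clone(model).fit(X_tr, y_shuffled)  # retrain from scratch
        accs.append(accuracy_score(y_te, fitted.predict(X_te)))
    return float(np.mean(accs))
```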
Next, we analyzed the contribution of each feature to the performance of the chosen models. Using two feature importance techniques on the test set, permutation feature importance and drop-column feature importance, we evaluated the impact of each feature on the accuracy score of the three models (Supplementary File S1). Each feature was assessed by comparing the model's baseline accuracy with its accuracy after the feature was altered. Positive values indicated that altering (permuting or dropping) the feature decreased the model's performance, suggesting the feature's importance. Zero or negative values indicated that altering the feature had no effect or even improved performance. However, the observed values were generally small, precluding definitive conclusions about feature importance. We then examined the features common to the different models.
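Both importance scores reduce to "baseline accuracy minus accuracy after altering one column". A minimal sketch of the two techniques (generic code, not the study's script):

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def permutation_importance_one(fitted, X_te, y_te, col, seed=0):
    """Accuracy drop on the test set when one feature column is shuffled."""
    base = accuracy_score(y_te, fitted.predict(X_te))
    Xp = X_te.copy()
    Xp[:, col] = np.random.default_rng(seed).permutation(Xp[:, col])
    return base - accuracy_score(y_te, fitted.predict(Xp))

def drop_column_importance_one(model, X_tr, y_tr, X_te, y_te, col):
    """Accuracy drop when the model is refit without one feature column."""
    base = accuracy_score(y_te, clone(model).fit(X_tr, y_tr).predict(X_te))
    reduced = clone(model).fit(np.delete(X_tr, col, axis=1), y_tr)
    return base - accuracy_score(y_te, reduced.predict(np.delete(X_te, col, axis=1)))
```

Drop-column importance is the more expensive of the two, since it requires refitting the model once per feature.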
Table 4 delineates the top 10 most important features for each model. The ACC features are named as follows: the first numerical index denotes the E-descriptor of the first amino acid, the second index the E-descriptor of the second amino acid, and the third the lag value.
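The naming convention above is mechanical, so a feature name can be decoded programmatically. A small illustrative helper (the function name is hypothetical):

```python
def decode_acc(name):
    """Decode an ACC feature name such as 'ACC145' into the E-descriptor
    of the first residue, of the second residue, and the lag value."""
    j, k, lag = (int(c) for c in name[len("ACC"):])
    return {"first": f"E{j}", "second": f"E{k}", "lag": lag}

decode_acc("ACC145")  # {'first': 'E1', 'second': 'E4', 'lag': 5}
```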
Both permutation feature importance and drop-column feature importance methods identified four attributes as significant for the SVM model, namely ACC145, ACC147, ACC313, and ACC234. ACC145 and ACC147 denote the cross-covariance between E1 and E4 descriptors at L = 5 and L = 7, respectively, reflecting the relationship between hydrophobicity, partial specific volume, the number of codons, and the relative frequency of amino acids within defined intervals in protein sequences. ACC313 measures the cross-covariance between E3 and E1 descriptors at L = 3, indicating the association between amino acid occurrence in α-helices and hydrophobicity at the specific interval. ACC234 represents the cross-covariance between E2 and E3 descriptors at L = 4, highlighting the relationship between molecular size, steric properties, and amino acid occurrence in α-helices within the defined interval.
Although there were no shared attributes between the permutation feature importance and drop-column feature importance techniques for the RF method, the drop-column feature importance technique highlighted ACC537, ACC536, ACC535, ACC534, ACC533, ACC532, and ACC531 as its most crucial features. This discovery suggests a noteworthy relationship between amino acid propensities for occurrence in α-helices (descriptor E3) and β-strands (descriptor E5) across all possible lag-values (1–7). This indicates that these amino acid properties may exert a significant influence on the immunogenicity of tumor peptides.
ACC441 is the sole attribute shared between the permutation feature importance and drop-column feature importance techniques for the XGBoost model. It quantifies the auto-covariance of E4 descriptors of adjacent amino acids, revealing crucial associations among their partial specific volume, the number of codons, and the relative frequency of amino acids in the protein sequence.
No feature appeared in the top 10 for all the models simultaneously, indicating variability in feature importance across the models. For the drop-column feature importance technique, there were no common features among any of the models.
Figure 2 illustrates the common attributes identified through the permutation feature importance technique for the three models. The SVM and RF models both rank ACC117 and ACC234 among the most significant attributes. ACC117 signifies the auto-covariance of hydrophobicity (E1 descriptor) at L = 7, while ACC234 measures the cross-covariance between molecular size and steric properties (E2 descriptor) and the propensity of amino acids to occur in α-helices (E3 descriptor) at L = 4. Additionally, the SVM and XGBoost models share two important attributes, ACC137 and ACC145. ACC137 represents the cross-covariance between hydrophobicity (E1 descriptor) and the propensity of amino acids to occur in α-helices (E3 descriptor) at L = 7, whereas ACC145 denotes the cross-covariance between the E1 and E4 descriptors at L = 5. Finally, the RF and XGBoost models share ACC254 as a significant attribute; it quantifies the cross-covariance between molecular size and steric properties (E2 descriptor) and the propensity of amino acids to occur in β-strands (E5 descriptor) at L = 4.
The two feature importance techniques utilized aim to identify the most significant features, yet using auto- and cross-covariance feature encoding alone cannot definitively explain why a particular feature holds importance. While it can provide insights into the correlation between specific biological properties encoded with corresponding E-descriptors, it falls short of elucidating causation. Moreover, each identified important feature represents a correlation between different biological properties, lacking consensus among them. Coupled with the generally minimal values regarding feature importance, this limitation prevents us from making conclusive suggestions or drawing conclusions about why one feature might outweigh another in importance.
We conducted a comparative analysis of the performance of our three selected models against other in silico methods for predicting human tumor antigens. These methods are available online and can be applied without specific programming knowledge. To facilitate the comparison, we evaluated their performance on the current test set and contrasted it with the consensus classification obtained by majority voting of our three models: if two or more models classified a peptide as immunogenic, it received a consensus classification as an immunogen. The results demonstrated that our three models exhibited superior performance across all assessed statistical measures (Table 5).
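The majority-voting rule described above can be sketched in a few lines (illustrative code, with immunogens coded as 1):

```python
import numpy as np

def consensus_vote(pred_svm, pred_rf, pred_xgb):
    """Consensus class: immunogenic (1) when two or more models predict 1."""
    votes = np.vstack([pred_svm, pred_rf, pred_xgb])
    return (votes.sum(axis=0) >= 2).astype(int)

consensus_vote([1, 0, 1], [1, 1, 0], [0, 0, 0])  # array([1, 0, 0])
```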
All in silico tools and methods for predicting human tumor antigens exhibited very poor performance on the external test set used in our study. The Matthews correlation coefficient (MCC) values indicated complete disagreement between prediction and observation for TTAgP 1.0, iTTCA-Hybrid, and iTTCA-RF, and predictions no better than random for PSRTTCA. A possible explanation for these poor results across all assessment metrics is a significant disparity between the datasets used to train those models and the training set employed for ours.
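For reference, the MCC interpretation used here can be illustrated directly with scikit-learn (toy labels, not the study's data): the coefficient runs from -1 (complete disagreement) through 0 (no better than random) to +1 (perfect agreement).

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0]
perfect = matthews_corrcoef(y_true, [1, 1, 0, 0])   # +1.0: perfect agreement
inverted = matthews_corrcoef(y_true, [0, 0, 1, 1])  # -1.0: complete disagreement
```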
Given their strong performance, the three selected models emerged as prime candidates for incorporation into the forthcoming third version of the web-based immunogenicity prediction server, VaxiJen. The newly gathered data, spanning 15 years, include previously unknown immunogens that contributed to the updated dataset and models. Consequently, the new dataset, along with the SVM, RF, and XGBoost models, has been selected for integration into the third iteration of the VaxiJen web server (VaxiJen v3.0).