1. Introduction
B-cell epitopes are regions on the surface of antigen molecules that are specifically recognized and bound by antibodies, playing a critical role in immune responses. Accurate prediction of B-cell epitopes (BCEs) is essential for understanding antigen–antibody interactions and holds significant promise for vaccine design, immunodiagnostics, and therapeutic antibody development [1]. However, traditional experimental techniques such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy, which can provide precise three-dimensional structural information, are often time-consuming, expensive, and unsuitable for large-scale applications [2]. Consequently, developing efficient and accurate computational methods for B-cell epitope prediction has become a focal point of research in this field.
Early computational approaches can be broadly categorized into sequence-based and structure-based methods. Sequence-based methods, such as BcePred, rely on physicochemical properties and statistical learning models trained on linear peptide sequences [3]. However, these methods are inherently limited as they fail to account for the three-dimensional structure of proteins, making them insufficient for predicting conformational epitopes. To address this limitation, structure-based methods like ElliPro utilize features such as ellipsoid approximation and protrusion index calculations to identify epitope locations [4]. While these approaches leverage spatial information, they often depend on manually designed heuristics, which restrict their generalizability across diverse antigen structures.
Deep learning has recently driven substantial advancements in biomedical research [5]. Unlike traditional machine learning techniques, deep learning can automatically extract complex, nonlinear features from large datasets, offering distinct advantages for epitope prediction. Early machine learning models, such as XGBoost and logistic regression, were capable of identifying patterns in known antigen–antibody complexes but struggled to capture the intricate spatial and nonlinear interactions between protein residues [6]. Deep learning methods such as convolutional neural networks (CNNs) and graph neural networks (GNNs) are naturally suited to handling spatial relationships in protein 3D structures, with GNNs in particular effectively capturing spatial dependencies and chemical interactions between residues, thus increasing the accuracy and generalizability of epitope prediction [7,8].
The incorporation of deep learning, particularly graph-based models, into epitope prediction has substantially improved model performance by better representing the structural complexity of proteins [9]. Furthermore, generative models such as variational autoencoders (VAEs) have demonstrated significant potential in representation learning. In particular, the vector-quantized VAE (VQ-VAE) is effective at capturing the underlying structural features of proteins by learning discrete representations [10]. A similar idea of discretizing protein structures has also been explored in Foldseek, where the 3Di alphabet encodes local structural motifs to enable rapid structural searches [11].
Building upon these advances, we propose a novel B-cell epitope prediction model, the Graph-based Epitope Prediction Network (GraphEPN). This model combines the strengths of a VQ-VAE variant and a graph transformer, employing a two-stage training strategy to fully harness their capabilities. In the first stage, the VQ-VAE variant is pre-trained on a large-scale protein graph dataset to learn high-quality residue representations, yielding both discrete representations of amino acid microenvironments and continuous structural embeddings that provide a comprehensive feature set for downstream tasks. In the second stage, the graph transformer leverages the pre-trained VQ-VAE by using its fixed encoder and codebook to map protein graph nodes (residues) into continuous and discrete feature representations. These representations, combined with edge features, allow the model to capture long-range interactions and local dependencies among residues [12]. To the best of our knowledge, this is the first time that such a deep learning framework has been applied to B-cell epitope prediction in structural bioinformatics. The model fully integrates node and edge features, exploiting the graph-based information inherent in protein structures and thereby improving feature representation for epitope prediction. Experimental results demonstrate that GraphEPN outperforms existing methods across multiple datasets, excelling in particular at predicting epitopes in complex protein structures with superior accuracy and robustness.
2. Materials and Methods
2.1. Datasets
This study utilizes three independent datasets to pre-train the VQ-VAE, train the graph transformer model, and conduct independent evaluations.
SAbDab_1323 Dataset: SAbDab_1323 is derived from SAbDab [13] and comprises 1323 nonredundant protein chains, each containing at least one epitope residue. This dataset is employed to pre-train the VQ-VAE model, focusing on learning critical residue representations involved in antibody interactions. The resulting representations provide refined feature vectors crucial for downstream tasks, enhancing the model’s ability to capture essential structural information.
SAbDab_665 Dataset: This dataset, also obtained from SAbDab, contains 665 protein chains annotated with BCEs [14]. BCEs are antigen residues that interact with antibodies; here, a residue is labeled as an epitope if any of its heavy atoms lies within 4.0 Å of the antibody. CD-HIT clustering was applied to reduce sequence identity to below 70%, ensuring sequence independence between the pretraining and training datasets. Only antigen chains with at least one epitope residue were retained for model training.
Blind_42 Dataset: The Blind_42 dataset consists of 42 nonredundant protein structures specifically designated for independent testing. Sequence identity between this dataset and SAbDab_1323 and SAbDab_665 was reduced to less than 30% via BLAST+ (version 2.14.0) for sequence alignment and homology filtering. The Blind_42 dataset exhibits significant label imbalance, with only 7.08% of the 12,570 samples labeled as positive (epitope residues), whereas negative samples outnumber positive samples by approximately 13.12 times. This dataset evaluates the model’s generalizability to unseen data and its robustness under highly imbalanced conditions.
2.2. Data Preprocessing and Graph Construction
This study extracts node and edge features from protein residues and constructs 3D graph representations of protein structures based on these features.
2.2.1. Node Feature Extraction
The extracted node features include physicochemical properties, backbone torsion angles (PHI and PSI), relative solvent accessible surface area (rASA), and secondary structure types. These features provide local structural information for each residue.
Given that the Cα coordinate of residue $i$ is $\mathbf{c}_i \in \mathbb{R}^{3}$, its physicochemical properties are represented as follows:
$$\mathbf{p}_i \in \mathbb{R}^{d_p},$$
where $d_p$ denotes the dimension of the physicochemical properties. The secondary structure (SS) types, backbone torsion angles $(\phi_i, \psi_i)$, and rASA values are extracted via the DSSP tool [15,16]. Specifically, SS types are encoded using a one-hot representation, backbone torsion angles $(\phi_i, \psi_i)$ are derived from atomic coordinates, and rASA is computed by normalizing the absolute solvent accessible surface area against standard reference values for each amino acid. The node feature vector for residue $i$ is then represented as follows:
$$\mathbf{x}_i = \left[\, \mathbf{p}_i \,\|\, \mathbf{s}_i \,\|\, \phi_i \,\|\, \psi_i \,\|\, r_i \,\right],$$
where $\mathbf{s}_i$ represents the one-hot encoding of the secondary structure type of residue $i$, $\phi_i$ and $\psi_i$ represent the backbone torsion angles, $r_i$ represents the relative solvent-accessible surface area, and $\|$ denotes concatenation.
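The following is a minimal sketch of how such node features can be assembled with Biopython's DSSP wrapper, assuming an installed mkdssp binary; AAP_TABLE is a hypothetical placeholder for the per-residue physicochemical property vectors and is not part of the published implementation.

```python
import numpy as np
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

SS_TYPES = list("HBEGITS-")        # 8-state DSSP secondary-structure codes
AAP_TABLE = {}                     # hypothetical: one-letter residue code -> property vector

def node_features(pdb_path, chain_id, aap_dim=7):
    model = PDBParser(QUIET=True).get_structure("antigen", pdb_path)[0]
    dssp = DSSP(model, pdb_path)   # runs the external mkdssp program
    feats = []
    for key in dssp.keys():
        chain, _ = key
        if chain != chain_id:
            continue
        rec = dssp[key]            # (index, aa, ss, rASA, phi, psi, ...)
        aa, ss, rasa, phi, psi = rec[1], rec[2], rec[3], rec[4], rec[5]
        rasa = float(rasa) if rasa != "NA" else 0.0
        ss_onehot = np.eye(len(SS_TYPES))[SS_TYPES.index(ss if ss in SS_TYPES else "-")]
        aap = AAP_TABLE.get(aa, np.zeros(aap_dim))
        feats.append(np.concatenate([aap, ss_onehot, [phi, psi, rasa]]))
    return np.asarray(feats, dtype=np.float32)
```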
2.2.2. Edge Feature Extraction
Residue interactions are modeled as edges between nodes [17]. For each pair of residues $i$ and $j$, the Euclidean distance is calculated as follows:
$$d_{ij} = \left\| \mathbf{c}_i - \mathbf{c}_j \right\|_2 .$$
To better capture variations in distances, radial basis functions (RBFs) are used to encode the distances:
$$\mathrm{RBF}_k(d_{ij}) = \exp\!\left( -\frac{\left( d_{ij} - \mu_k \right)^2}{2\sigma_k^2} \right), \quad k = 1, \dots, K,$$
where $\mu_k$ represents the center of the $k$-th RBF, $\sigma_k$ is its standard deviation, and $K$ is the total number of RBFs.
In addition to distance features, the relative orientation and rotation between residues are calculated and represented via quaternions:
$$\mathbf{q}_{ij} = \left( q_w, q_x, q_y, q_z \right),$$
where $q_w$ is the real (scalar) part and $q_x$, $q_y$, $q_z$ are the imaginary parts. Finally, the edge feature vector $\mathbf{e}_{ij}$ is constructed by concatenating the RBF-encoded distance features and the quaternion-based geometric information:
$$\mathbf{e}_{ij} = \left[\, \mathrm{RBF}(d_{ij}) \,\|\, \mathbf{q}_{ij} \,\right].$$
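A minimal sketch of these edge features is shown below. The RBF centers, the shared width, and the construction of a per-residue local frame from the backbone N, Cα, and C atoms are common conventions assumed here for illustration; they are not necessarily the authors' exact choices.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rbf_encode(d, d_min=0.0, d_max=20.0, num_rbf=16, sigma=None):
    centers = np.linspace(d_min, d_max, num_rbf)        # mu_k
    sigma = sigma or (d_max - d_min) / num_rbf          # shared sigma_k
    return np.exp(-((d - centers) ** 2) / (2.0 * sigma ** 2))

def local_frame(n, ca, c):
    """Orthonormal frame of a residue built from its backbone atom coordinates."""
    x = (c - ca) / np.linalg.norm(c - ca)
    v = n - ca
    y = v - np.dot(v, x) * x
    y /= np.linalg.norm(y)
    z = np.cross(x, y)
    return np.stack([x, y, z], axis=1)                  # columns are the frame axes

def edge_features(res_i, res_j):
    """res_* are dicts with 'N', 'CA', 'C' coordinates as numpy arrays."""
    d_ij = np.linalg.norm(res_i["CA"] - res_j["CA"])
    R_i = local_frame(res_i["N"], res_i["CA"], res_i["C"])
    R_j = local_frame(res_j["N"], res_j["CA"], res_j["C"])
    q = Rotation.from_matrix(R_i.T @ R_j).as_quat()     # SciPy order: (x, y, z, w)
    return np.concatenate([rbf_encode(d_ij), q]).astype(np.float32)
```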
2.2.3. Protein 3D Graph Construction
After extracting node and edge features, we construct a graph model based on the 3D structure of the protein [18]. Each residue in the protein is represented as a node in the graph, and interactions between residues are represented as edges. Given the set of residues in a protein $R = \{ r_1, r_2, \dots, r_N \}$, the graph $G = (V, E)$ is constructed, where $V$ denotes the set of nodes and $E$ denotes the set of edges. An edge is added between residues $i$ and $j$ if the Euclidean distance $d_{ij}$ between them is less than or equal to a predefined threshold $\delta$. The formal representation of the graph is as follows:
$$E = \left\{ (i, j) \;\middle|\; d_{ij} \le \delta,\; 1 \le i, j \le N,\; i \ne j \right\}.$$
Here, $N$ represents the number of residues in the protein, and $\delta$ is the distance threshold. Finally, the complete 3D protein graph is constructed via the DGL framework [18]. In DGL, we define nodes based on the residues and add edges between residue pairs that satisfy the distance threshold: we calculate the Euclidean distance between all residue pairs, retain only those pairs that meet the threshold criterion, and store these edges as adjacency relationships in the graph. Each edge is further assigned a set of features, including the distance-based RBF encodings and the quaternion-based orientation features. The final graph representation is implemented in DGL, where node features, edge indices, and edge attributes are defined accordingly.
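A minimal sketch of this construction in DGL is given below, assuming precomputed node features, Cα coordinates, and an edge-feature function such as the one sketched above (returning a 1-D torch tensor); the 10 Å threshold is illustrative, as the paper does not state the value of the cut-off.

```python
import dgl
import torch

def build_protein_graph(ca_coords, node_feats, edge_feat_fn, threshold=10.0):
    # pairwise Cα–Cα distances
    dist = torch.cdist(ca_coords, ca_coords)
    # keep residue pairs within the threshold, excluding self-loops
    src, dst = torch.nonzero((dist <= threshold) & (dist > 0), as_tuple=True)
    g = dgl.graph((src, dst), num_nodes=ca_coords.shape[0])
    g.ndata["feat"] = node_feats
    g.edata["feat"] = torch.stack(
        [edge_feat_fn(int(i), int(j)) for i, j in zip(src, dst)]
    )
    return g
```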
2.3. Model Architecture
2.3.1. Overview
This study proposes a deep learning model that combines a VQ-VAE with a graph transformer to improve B-cell epitope prediction performance. The model operates in two stages. First, a VQ-VAE model is pre-trained on the SAbDab_1323 dataset to learn universal representations of protein residues. In the downstream task, the fixed encoder and codebook are applied to the SAbDab_665 dataset to extract continuous and discrete representations. These features, along with edge features, are further processed by a graph transformer to capture complex interactions between residues and predict whether each residue is an epitope (Figure 1a). This two-stage strategy significantly enhances the prediction accuracy and generalizability.
2.3.2. Pre-Trained VQ-VAE Module
To learn discrete latent representations from protein structures, we employ a VQ-VAE. The VQ-VAE model comprises an encoder, a vector quantization layer, and a decoder (Figure 1b) and is trained in an unsupervised manner. The encoder, which is based on a graph convolutional network (GCN), processes node features from the protein graph [19]. The input for each node is its feature vector $\mathbf{x}_i$, and the output is the latent representation $\mathbf{z}_i$. Each GCN layer updates node features by aggregating information from neighboring nodes:
$$\mathbf{h}_i^{(l+1)} = \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \mathbf{W}^{(l)} \mathbf{h}_j^{(l)} + \mathbf{b}^{(l)} \right),$$
where $\mathbf{h}_i^{(l)}$ is the representation of node $i$ at the $l$-th layer, initialized as $\mathbf{h}_i^{(0)} = \mathbf{x}_i$. Here, $\mathcal{N}(i)$ denotes the set of neighboring nodes of $i$, $\mathbf{W}^{(l)}$ is a trainable weight matrix, $\mathbf{b}^{(l)}$ is a bias term, and $\sigma(\cdot)$ is a nonlinear activation function.
The continuous representation $\mathbf{z}_i$ produced by the encoder is passed to the vector quantization layer, which uses a discrete codebook $\{\mathbf{v}_k\}_{k=1}^{K}$ for quantization. Through nearest-neighbor search, $\mathbf{z}_i$ is mapped to the closest vector in the codebook, expressed as follows:
$$\mathbf{z}_i^{q} = \mathbf{v}_{k^\ast}, \quad k^\ast = \arg\min_{k} \left\| \mathbf{z}_i - \mathbf{v}_k \right\|_2 .$$
The quantization loss $\mathcal{L}_{\mathrm{vq}}$ aligns the codebook vectors with the encoder outputs, while a commitment loss $\mathcal{L}_{\mathrm{commit}}$ keeps the encoder outputs close to their assigned codebook entries, promoting stable codebook updates:
$$\mathcal{L}_{\mathrm{vq}} = \left\| \operatorname{sg}\!\left[ \mathbf{z}_i \right] - \mathbf{z}_i^{q} \right\|_2^2 , \qquad \mathcal{L}_{\mathrm{commit}} = \left\| \mathbf{z}_i - \operatorname{sg}\!\left[ \mathbf{z}_i^{q} \right] \right\|_2^2 ,$$
where $\operatorname{sg}[\cdot]$ denotes the stop-gradient operation, which blocks gradient flow through its argument.
The decoder, which is also based on the GCN architecture, decodes the discrete representation $\mathbf{z}_i^{q}$ back into a reconstruction $\hat{\mathbf{x}}_i$ of the original node features. The reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ is expressed as follows:
$$\mathcal{L}_{\mathrm{rec}} = \left\| \mathbf{x}_i - \hat{\mathbf{x}}_i \right\|_2^2 .$$
To enhance robustness, a masking mechanism is introduced during training, with an associated masking loss $\mathcal{L}_{\mathrm{mask}}$.
Finally, the total loss function of the VQ-VAE model is as follows:
$$\mathcal{L}_{\mathrm{total}} = \alpha \mathcal{L}_{\mathrm{rec}} + \beta \left( \mathcal{L}_{\mathrm{vq}} + \mathcal{L}_{\mathrm{commit}} \right) + \lambda \mathcal{L}_{\mathrm{mask}},$$
where $\alpha$, $\beta$, and $\lambda$ are hyperparameters used to balance the contributions of each loss term.
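The quantization step and its two losses follow the standard VQ-VAE formulation; a minimal PyTorch sketch is shown below, with the straight-through estimator used to pass gradients through the non-differentiable codebook lookup. The codebook size and dimension are illustrative, not values reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                               # z: (num_nodes, code_dim)
        d = torch.cdist(z, self.codebook.weight)        # distances to all code vectors
        idx = d.argmin(dim=1)                           # nearest-neighbour code index
        z_q = self.codebook(idx)
        codebook_loss = F.mse_loss(z_q, z.detach())     # pulls codes toward encoder outputs
        commit_loss = F.mse_loss(z, z_q.detach())       # keeps encoder close to its code
        z_q = z + (z_q - z).detach()                    # straight-through estimator
        return z_q, idx, codebook_loss + commit_loss

# total objective as described above:
# loss = alpha * recon_loss + beta * vq_loss + lam * mask_loss
```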
2.3.3. Graph Attention-Based Epitope Prediction Module
In the downstream task of epitope prediction, the pre-trained VQ-VAE model is utilized with its encoder and codebook parameters kept fixed. On the new dataset, the pre-trained VQ-VAE encodes and quantizes the protein graph to generate, for each node, the continuous representation $\mathbf{z}_i$ and the quantized representation $\mathbf{z}_i^{q}$. These representations are concatenated to form the final node feature $\mathbf{h}_i = [\, \mathbf{z}_i \,\|\, \mathbf{z}_i^{q} \,]$, which is further processed via a graph transformer model to capture the complex relationships between residues (Figure 1c). In each layer, the model updates each node's representation by combining the node features $\mathbf{h}_i$ and edge features $\mathbf{e}_{ij}$ through a multi-head graph attention network (GAT) mechanism [20]:
$$\mathbf{h}_i' = \sigma\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} \, \mathbf{W} \mathbf{h}_j \right),$$
where $\mathbf{h}_i'$ is the updated feature of node $i$, $\mathbf{W}$ is a trainable weight matrix, and $\alpha_{ij}$ is the attention coefficient, which is calculated as follows:
$$\alpha_{ij} = \frac{\exp\!\left( \mathrm{LeakyReLU}\!\left( \mathbf{a}^{\top} \left[ \mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j \,\|\, \mathbf{W}_e \mathbf{e}_{ij} \right] \right) \right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left( \mathrm{LeakyReLU}\!\left( \mathbf{a}^{\top} \left[ \mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_k \,\|\, \mathbf{W}_e \mathbf{e}_{ik} \right] \right) \right)},$$
where $\mathbf{a}$ is a learnable attention vector and $\mathbf{W}_e$ projects the edge features.
GAT dynamically assigns attention coefficients to each neighboring residue, allowing the model to focus on the most informative ones. By incorporating edge features, which encode spatial relationships between residues, the model effectively captures structural dependencies, enhancing epitope prediction accuracy.
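A minimal single-head sketch of such an edge-aware attention layer, written with DGL message passing, is shown below; it mirrors the attention formulation above but is an illustrative implementation under our own naming, not the authors' code (a multi-head layer would run several of these in parallel and concatenate the outputs).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl.function as fn
from dgl.nn.functional import edge_softmax

class EdgeGATLayer(nn.Module):
    def __init__(self, node_dim, edge_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(node_dim, out_dim, bias=False)
        self.We = nn.Linear(edge_dim, out_dim, bias=False)
        self.attn = nn.Linear(3 * out_dim, 1, bias=False)

    def forward(self, g, h, e):
        with g.local_scope():
            g.ndata["z"] = self.W(h)
            g.edata["ez"] = self.We(e)
            # attention logits from source node, destination node, and edge features
            g.apply_edges(lambda edges: {"s": self.attn(torch.cat(
                [edges.src["z"], edges.dst["z"], edges.data["ez"]], dim=-1))})
            # softmax over each node's incoming edges (its neighbourhood)
            g.edata["a"] = edge_softmax(g, F.leaky_relu(g.edata["s"]))
            # attention-weighted aggregation of neighbour features
            g.update_all(fn.u_mul_e("z", "a", "m"), fn.sum("m", "h_new"))
            return F.elu(g.ndata["h_new"])
```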
Using multi-head attention, the model independently computes the relationships between nodes in different representation subspaces. The graph transformer is composed of multiple stacked transformer layers, each consisting of a multi-head GAT mechanism, residual connections, and normalization layers. The output of each layer is combined with its input through residual connections and enhanced by layer normalization to improve training stability. Finally, the model's output layer applies a linear transformation followed by a sigmoid activation to produce the prediction probability $\hat{y}_i$ of each node being an epitope:
$$\hat{y}_i = \mathrm{sigmoid}\!\left( \mathbf{W}_o \mathbf{h}_i + b_o \right),$$
where $\mathbf{W}_o$ is a trainable weight matrix, $b_o$ is a bias term, and $\hat{y}_i$ is the predicted probability of node $i$ being an epitope.
To improve the model's adaptability to imbalanced data, we combine binary cross-entropy (BCE) loss and Dice loss as the training objective. The Dice loss improves performance by emphasizing the prediction of the minority class, which in this case is the epitope residues. The final loss function is as follows:
$$\mathcal{L} = w_{\mathrm{BCE}} \, \mathcal{L}_{\mathrm{BCE}} + w_{\mathrm{Dice}} \, \mathcal{L}_{\mathrm{Dice}},$$
where $w_{\mathrm{BCE}}$ and $w_{\mathrm{Dice}}$ are weighting coefficients for the two terms.
By jointly optimizing these two loss functions, the graph transformer model effectively captures both local and global information in protein structures, achieving superior performance in epitope prediction tasks.
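A minimal PyTorch sketch of this combined objective is given below; the smoothing constant in the Dice term and the default equal weighting are illustrative choices, not values reported by the authors.

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, targets, eps=1.0):
    # soft Dice loss over per-residue probabilities and binary labels
    inter = (probs * targets).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + targets.sum() + eps)

def epitope_loss(logits, targets, w_bce=1.0, w_dice=1.0):
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    return w_bce * bce + w_dice * dice_loss(probs, targets)
```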
2.4. Model Training and Evaluation
This study adopts a two-stage training strategy. First, the VQ-VAE model is trained in an unsupervised manner on the SAbDab_1323 dataset to learn the discrete and continuous representations of protein residues. The trained VQ-VAE model is subsequently utilized for encoding features from the SAbDab_665 dataset, and the concatenated features are employed for supervised epitope prediction via the graph transformer model.
During the VQ-VAE pretraining phase, the model optimizes the learned representations by minimizing both the reconstruction loss and the quantization loss. A masking mechanism is incorporated, where certain node features are randomly masked, compelling the model to recover these features via neighborhood information. The total loss function is controlled by three key parameters, α, β, and λ, which determine the relative contributions of the reconstruction loss, vector quantization loss, and masking loss, respectively. Their values are empirically determined through grid search and set as follows: α = 1.0, β = 0.25, and λ = 1.5, ensuring a balance between feature preservation and generalization. The Adam optimizer is employed for training, with weight decay applied to prevent overfitting [21]. The optimizer parameters are set as follows: learning rate = 0.01, weight decay = 0.0001, betas = (0.9, 0.999). The learning rate follows a cosine annealing schedule to improve convergence stability. The batch size is set to 64, balancing computational efficiency and generalization performance.
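Restated as code, the pretraining configuration looks roughly as follows; the number of epochs, the schedule length T_max, and the model.total_loss helper are assumptions for illustration, since they are not reported in the paper.

```python
import torch

def pretrain_vqvae(model, loader, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                                 weight_decay=1e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for graphs in loader:                   # batches of 64 protein graphs
            # hypothetical helper returning alpha*L_rec + beta*L_vq + lambda*L_mask
            loss = model.total_loss(graphs)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```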
In the training phase of the graph transformer, multiple loss functions, including the binary cross-entropy loss and the Dice loss, are jointly optimized to mitigate the class imbalance inherent in the dataset [22,23]. The Adam optimizer is employed with the same parameter settings as in the VQ-VAE training phase, and a ReduceLROnPlateau scheduler dynamically adjusts the learning rate based on validation loss trends to ensure stable convergence throughout training. Given the underrepresentation of epitope residues in the data, the loss function weights are adjusted to enhance the model's ability to recognize the minority class. The number of hidden units is set to 128, the batch size to 32, and the dropout rate to 0.4 to mitigate overfitting while maintaining expressiveness. To assess the generalization capability of the model, 5-fold cross-validation is employed [24]. In each fold, the dataset is split into a training set, a validation set, and a test set. The model is trained on the training set, the validation set is used for tuning hyperparameters such as the learning rate, batch size, number of attention heads, hidden units, and dropout rate, and the performance is then evaluated on the test set.
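A minimal sketch of the fold rotation and the plateau-based learning-rate schedule is given below; train_one_epoch and evaluate are hypothetical helpers (the inner validation split and hyperparameter tuning are folded into them), and the scheduler's factor and patience are illustrative.

```python
import torch
from sklearn.model_selection import KFold

def cross_validate(dataset, build_model, epochs=50):
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(dataset):
        model = build_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                                     weight_decay=1e-4, betas=(0.9, 0.999))
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", factor=0.5, patience=5)
        for _ in range(epochs):
            val_loss = train_one_epoch(model, dataset, train_idx, optimizer)  # hypothetical
            scheduler.step(val_loss)            # reduce LR when validation loss plateaus
        scores.append(evaluate(model, dataset, test_idx))                     # hypothetical
    return scores
```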
After model training, an evaluation is conducted on the independent blind test dataset, Blind_42. Owing to the class imbalance in the data, multiple evaluation metrics are employed to assess the model's performance, including the F1-score, Matthews correlation coefficient (MCC), balanced accuracy (BACC), area under the ROC curve (AUC), and area under the precision–recall curve (AUPRC) [25]. The F1-score measures the harmonic mean of precision and recall, making it particularly suitable for imbalanced datasets. MCC considers all four elements of the confusion matrix, providing a reliable assessment even when the class distribution is skewed. BACC compensates for class imbalance by computing the mean recall of both classes. AUC quantifies the model's ability to distinguish between positive and negative samples across decision thresholds, while AUPRC focuses on the precision–recall trade-off, which is particularly crucial for epitope prediction given the typically low prevalence of positive samples. As threshold-independent metrics, AUC and AUPRC serve as the primary evaluation criteria.
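All five metrics are available in scikit-learn; the sketch below computes them from binary labels and predicted probabilities. The 0.5 cut-off shown for the threshold-dependent metrics is a placeholder, since in practice the threshold is selected from the precision–recall curve as described in Section 3.3.

```python
import numpy as np
from sklearn.metrics import (f1_score, matthews_corrcoef, balanced_accuracy_score,
                             roc_auc_score, average_precision_score)

def evaluate_predictions(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "F1":    f1_score(y_true, y_pred),
        "MCC":   matthews_corrcoef(y_true, y_pred),
        "BACC":  balanced_accuracy_score(y_true, y_pred),
        "AUC":   roc_auc_score(y_true, y_prob),
        "AUPRC": average_precision_score(y_true, y_prob),
    }
```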
3. Results
3.1. Feature Engineering
In the baseline model of this study, amino acid property (AAP) features, while contributing to epitope prediction, were not the sole critical factor. To comprehensively evaluate the impact of other features, we systematically analyzed the effects of spatial features and DSSP features on epitope prediction via the SAbDab_665 dataset and explored how these feature combinations could enhance model performance.
First, we evaluated the impact of spatial features. These spatial features were utilized as edge features in the graph structure, effectively capturing spatial relationships between amino acid residues, including precise distance and directional information [26]. For the test set, the addition of spatial features alone improved the model's AUC from 0.735 to 0.802 and the AUPRC from 0.312 to 0.389, as shown in Table 1. This substantial performance improvement underscores the critical role of spatial relationships between residues, when represented as edge features, in enhancing the accuracy of epitope prediction.
Next, we examined the contribution of DSSP features. DSSP features provide valuable information regarding protein secondary structure and solvent accessibility [27]. Upon integrating DSSP features into the model, the AUC increased to 0.785, and the AUPRC rose to 0.346. These results emphasize the importance of local structural information in epitope identification.
More importantly, we assessed the model's performance under the combined influence of spatial and DSSP features. The results demonstrate that the multi-feature fusion model substantially outperformed the single-feature models, with the AUC increasing from the baseline value of 0.735 to 0.829 and the AUPRC improving from 0.312 to 0.433. This outcome validates the hypothesis that the synergistic effect of multidimensional features, particularly the incorporation of spatial information as edge features, can more comprehensively capture the complexity of epitopes, thereby enhancing the model's predictive ability [28].
Based on these experimental results, our final model integrates a comprehensive combination of AAP, spatial, and DSSP features. We assert that this multidimensional feature engineering approach, especially the encoding of spatial information as edge features in the graph structure, not only captures the complexity of epitopes more holistically but also significantly improves the accuracy and robustness of the predictions. This integrated model provides a robust foundation for subsequent in-depth analyses and rigorous evaluations, offering a powerful tool for advancing the exploration of cutting-edge questions in the field of epitope prediction.
3.2. Internal Validation
This study conducted a systematic evaluation of the performance of the GraphEPN model through rigorous internal validation. The internal validation employed a five-fold cross-validation strategy, where the dataset was partitioned into five subsets, with one subset used for validation and the remaining four subsets used for training in each iteration. This alternating validation approach ensures the robustness of the model’s performance across different data splits.
To comprehensively assess model performance, we used the AUC and the AUPRC as the primary evaluation metrics. Additionally, to evaluate the relative performance of GraphEPN compared with other commonly used machine learning methods, we compared it with elastic net, gradient boosting, XGBoost, and logistic regression models [29,30,31,32]. Figures S1 and S2 (Additional File) illustrate the performance of each model across the five-fold cross-validation, whereas Figure 2a,b highlights the AUC and AUPRC results for GraphEPN in each fold.
To ensure the robustness of the evaluation, the average values across the five folds were taken as the final evaluation metrics. The results show that GraphEPN outperforms all the other methods in terms of both the average AUC and the average AUPRC, with values of 82.9% and 45.8%, respectively. Compared with the second-best model, XGBoost (with an AUC of 79.2% and an AUPRC of 38.9%), GraphEPN achieves improvements of 3.6% and 6.6% in these two metrics, respectively.
3.3. Comparison with Peer Methods
To evaluate the performance of GraphEPN, we compared it with several mainstream B-cell conformational epitope prediction methods, including SEMA 2.0, BepiPred 3.0, Ellipro, and SEPPA 3.0. These methods are widely used in the field of epitope prediction, each with its strengths, providing valuable benchmarks for comparison. The evaluation metrics included the AUC, AUPRC, F1-score, MCC, and BACC, with the AUC and AUPRC serving as the primary metrics.
SEMA 2.0 is an advanced web platform for predicting conformational B-cell epitopes via large pre-trained protein language models (PLMs). It integrates sequence-based and structure-based prediction, identifies N-glycosylation sites, and compares antigen structures for enhanced immunogenic analysis [33].
BepiPred-3.0 is a sequence-based B-cell epitope prediction tool that uses protein language model embeddings (ESM-2) to improve the prediction accuracy for both linear and conformational epitopes. It also incorporates additional input variables and a refined epitope annotation strategy for enhanced performance [34].
ElliPro is a web tool that predicts antibody epitopes on the basis of the geometric properties of protein structures, combining residue clustering with visualization tools [4].
SEPPA 3.0 is an advanced tool for predicting B-cell epitopes that integrates glycosylation-related features and microenvironmental information to increase the prediction accuracy for glycoprotein antigens [35].
In our comparative analysis, we evaluated the performance of each method using the independent Blind_42 test dataset, calculating the AUC and AUPRC scores. The optimal thresholds were selected based on precision–recall curves, and additional metrics such as the F1-score, MCC, and BACC were computed.
Figure S3 presents a comprehensive heatmap of all evaluation metrics for each method, visualizing the comparative performance and highlighting GraphEPN's consistent superiority across multiple metrics. As shown in Table 2, GraphEPN outperforms the majority of the peer methods across several key evaluation metrics. Specifically, it demonstrates superior performance in AUC (Figure 2c) and AUPRC (Figure 2d), underscoring its exceptional generalization capacity and robustness.
3.4. Case Study
To demonstrate the effectiveness of GraphEPN, we analyzed a representative protein structure from the independent test set, chain A of PDB entry 6AD8 (6ad8_A). The performance of GraphEPN was compared against SEMA 2.0, BepiPred 3.0, ElliPro, and SEPPA 3.0.
As shown in Figure 3, GraphEPN accurately predicted most epitope residues and minimized false positives in non-epitope regions. Its predictions closely matched the annotated ground truth, particularly in regions with complex spatial configurations. In comparison, methods like ElliPro, SEPPA 3.0, and BepiPred 3.0 struggled to achieve the same level of precision, often missing residues in structurally intricate areas or producing higher false positive rates.
To further compare the predictions across different methods at the sequence level, we provide a sequence-based visualization in Figure S4. This additional analysis offers a complementary perspective to Figure 3 by allowing for a direct comparison of predicted epitope positions along the sequence. The integration of sequence- and structure-based visualizations helps to better illustrate the strengths and limitations of each method.
GraphEPN’s superior performance stems from its ability to combine discrete and continuous feature representations with structural information, enabling it to capture long-range dependencies and local interactions effectively. This advantage is evident in its ability to provide more precise and reliable predictions, even for challenging protein structures.
This case highlights GraphEPN’s robustness and accuracy, showcasing its potential as a valuable tool for epitope prediction, with practical applications in vaccine design and therapeutic antibody development.
3.5. Ablation Study
To assess the impact of the VQ-VAE and of different attention mechanisms within the graph transformer architecture, we performed ablation experiments on the SAbDab_665 dataset. The experimental setup consisted of four configurations: (1) the full model, which combines the VQ-VAE and GAT modules to capture discrete features and local dependencies effectively; (2) VQ-VAE + multi-head self-attention (MSA), which replaces GAT with MSA to examine the effect of the attention mechanism while preserving feature quantization; (3) direct node features + GAT, which omits the VQ-VAE while retaining GAT to focus on local relationships learned from the raw node features; and (4) direct node features + MSA, which omits both the VQ-VAE and GAT, relying solely on the self-attention mechanism. The experimental results are summarized in Table 3. The full model, which combines VQ-VAE and GAT, achieves the highest performance, demonstrating that the discrete feature representations and the local dependencies captured by GAT are essential for this task. Configurations without the VQ-VAE or with MSA replacing GAT exhibited performance degradation, further reinforcing the importance of discrete representations and of the GAT module in capturing local interactions.
3.6. Visualization of the Epitope Prediction Results
To conduct a comprehensive analysis of the epitope prediction results from the GraphEPN model, we employed advanced visualization techniques to effectively present the predicted epitope probabilities and integrate them with protein sequences and 3D structures. This integration not only enhances the interpretability of the predictions but also provides clear insights for guiding subsequent experimental investigations. For each protein chain, the predicted epitope probabilities are represented through bar charts (Figure 4b), where the height of each bar corresponds to the prediction score of the respective residue. A default threshold of 0.6 is applied to identify potential epitope regions. Additionally, the prediction scores are embedded within the PDB files by substituting the B-factor field, facilitating direct visualization of the epitope distributions via 3D structure visualization tools (Figure 4a). For each protein chain, two primary output files are generated: one containing the bar chart of epitope scores and the other containing an updated PDB file with the embedded prediction scores.
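A minimal sketch of the B-factor substitution step with Biopython is shown below; the scores argument (a mapping from residue sequence number to predicted probability) is an assumed interface for illustration, not the authors' exact output format.

```python
from Bio.PDB import PDBParser, PDBIO

def write_scores_to_bfactor(pdb_in, pdb_out, chain_id, scores):
    structure = PDBParser(QUIET=True).get_structure("antigen", pdb_in)
    for residue in structure[0][chain_id]:
        score = scores.get(residue.get_id()[1], 0.0)   # residue sequence number
        for atom in residue:
            atom.set_bfactor(score)                    # viewers can then color by B-factor
    io = PDBIO()
    io.set_structure(structure)
    io.save(pdb_out)
```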
4. Conclusions
In this study, we present a novel B-cell epitope prediction model, GraphEPN, which integrates the VQ-VAE and graph transformer to capture both the local and global structural features of proteins effectively. By leveraging VQ-VAE for discrete residue representations and employing a graph transformer to model complex residue interactions, GraphEPN achieves superior predictive performance across multiple datasets, significantly improving the accuracy and generalizability of epitope prediction.
The experimental results indicate that GraphEPN excels across various evaluation metrics, confirming its ability to capture essential protein structural information and model intricate residue interactions. By incorporating multiple features and combining graph neural networks with attention mechanisms, the model builds a comprehensive representation of epitope regions and yields reliable predictions. Furthermore, the two-stage training strategy enables the model to capture both local structural features and long-range dependencies within the protein structure.
Nevertheless, the current predictive accuracy is still limited by the availability of experimental structural data. Given the scarcity of antigen–antibody complex structures, some of the predicted epitopes may not necessarily be false positives but rather unannotated potential true positives. Additionally, the computational cost associated with integrating VQ-VAE and graph transformers remains a challenge, necessitating further optimization to enhance efficiency. Future work will focus on improving feature representations through contrastive learning and exploring lightweight graph neural networks to reduce computational overhead. Moreover, expanding the dataset by incorporating predicted structures from AlphaFold could improve generalizability, particularly for antigens lacking experimental structural data.
In conclusion, GraphEPN offers a robust and effective tool for B-cell epitope prediction, with significant potential for applications in vaccine development and antibody design.