Article

PMSFF: Improved Protein Binding Residues Prediction through Multi-Scale Sequence-Based Feature Fusion Strategy

by Yuguang Li 1, Xiaofei Nan 1, Shoutao Zhang 2,3, Qinglei Zhou 1, Shuai Lu 1,4,* and Zhen Tian 1,5,*

1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2 School of Life Sciences, Zhengzhou University, Zhengzhou 450001, China
3 Longhu Laboratory of Advanced Immunology, Zhengzhou 450001, China
4 National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, China
5 Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
* Authors to whom correspondence should be addressed.
Biomolecules 2024, 14(10), 1220; https://doi.org/10.3390/biom14101220
Submission received: 15 August 2024 / Revised: 22 September 2024 / Accepted: 24 September 2024 / Published: 27 September 2024

Abstract

Proteins perform different biological functions by binding to various molecules, and these interactions are mediated by a few key residues; accurate prediction of such protein binding residues (PBRs) is therefore crucial for understanding cellular processes and for designing new drugs. Many computational approaches have been proposed to identify PBRs from sequence-based features. However, these approaches face two main challenges: (1) they concatenate residue feature vectors with a simple sliding window strategy, and (2) it is difficult to find a uniform sliding window size suitable for learning embeddings across different types of PBRs. In this study, we propose a novel framework for multiple types of PBRs prediction through a Multi-scale Sequence-based Feature Fusion (PMSFF) strategy. First, PMSFF employs a pre-trained language model named ProtT5 to encode the amino acid residues in protein sequences. It then generates multi-scale residue embeddings by applying multi-size windows to capture effective neighboring residues and multi-size kernels to learn information across different scales. Additionally, the proposed model treats a protein sequence as a sentence and employs a bidirectional GRU to learn global context. We also compile benchmark datasets encompassing various PBRs types and evaluate PMSFF on them. Compared with state-of-the-art methods, PMSFF demonstrates superior performance on most PBRs prediction tasks.

1. Introduction

Multitudinous cellular functions, such as immune reaction, signal transduction and transcription, are driven by protein interactions [1,2]. These interactions involve a variety of other molecules, including DNA, RNA, small ligands and other proteins [3,4]. Research has demonstrated that the antibody-antigen interaction, a specific type of protein-protein interaction, plays a crucial role in the immune response [5]. Consequently, accurate detection of protein interactions contributes to the construction of protein interaction networks [6,7,8], the annotation of protein functions [9,10], the understanding of molecular mechanisms of diseases [11], the design of new vaccines [12,13] and the development of novel therapeutic antibodies [14].
Following the success of the Human Genome Project, millions of protein sequences have been determined and stored in public databases [15,16]. Determining protein binding residues through traditional biological experiments is time-consuming and costly [17]. Therefore, a variety of computational methods have been proposed in recent years to address this issue. Computational methods for predicting protein binding residues (PBRs) can be broadly classified into sequence-based and structure-based approaches [18]. Sequence-based methods have a broader application range because structure-based methods rely on the availability of protein structures. With the development of protein structure prediction, exemplified by AlphaFold [19] and AlphaFold DB [20], structure-based methods have benefited from a rapidly growing number of predicted protein structures [21,22]. Although structure-based computational methods can provide better results, we focus on sequence-based methods in this study as they are usually much faster and cheaper. PBRs encompass various interaction types, including protein-protein, protein-nucleotide, protein-small ligand, and antibody-antigen binding residues [2,4,23]. As summarized in Table 1, a diverse range of machine learning and deep learning models, such as Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), have been employed in PBRs prediction.
Generally speaking, sequence-based methods rely solely on protein sequences for feature extraction. One concern of these approaches is introducing or designing more useful features. For example, Li et al. [24] utilize three novel sequence-based features: HSP, position information and ProtVec. Stringer et al. [25] combine traditional sequence-based features with predicted structural features. Hosseini et al. [26] mainly apply multiple sequence alignment features with attention to create a 768-dimensional vector for each amino acid residue. These features are generated by various bioinformatics tools, which incurs considerable extra computation time. Moreover, such sequence-based methods tend to be unstable and generalize poorly [27], because the benchmark datasets used for PBRs prediction are unbalanced; adopting ensemble learning is a popular way to overcome this limitation.
In contrast, structure-based methods typically incorporate both sequence and structural data. These approaches focus on taking full advantage of protein conformation information through graph representation learning and end-to-end deep learning models. Fout et al. [28] introduced Graph Convolutional Networks (GCN) for PPIs prediction, and GCN and its variants have since been widely used [29,30]. End-to-end deep learning models include geometric deep learning [31] and 3D CNN models [32]. The difference between the two families is that graph-based methods need protein embeddings as input, whereas end-to-end deep learning methods do not.
As evident in Table 1, MSA (Multiple Sequence Alignment), predicted structural, and protein language model (PLM) features are commonly employed for protein representation learning. MSAs, derived from tools like PSI-BLAST [33], HHblits [34] or MMseqs2 [35], capture evolutionary information. Predicted structural information, such as secondary structure and solvent accessibility derived from tools like NetSurfP-3.0 [36] or PSIPRED [37], can significantly enrich the representation. Additionally, physicochemical properties and hydrophobicity are often considered. Most PBRs prediction methods concatenate these features directly into a matrix for embedding each input protein and then feed them into machine learning or deep learning models for prediction. Although this operation is simple, it consumes extra time and introduces uncertainty in feature selection. Inspired by natural language processing techniques, recent advances in protein language models (PLMs), exemplified by ProtTrans [38] and ESM [39], have revolutionized protein representation. PLMs have shown utility in diverse biological tasks, including epitope prediction [40], protein-metal ion binding site prediction [41] and ligand binding site prediction [42]. Their integration into PBRs prediction models is increasingly essential.
The predominant approach for aggregating target and neighboring residue features is the sliding window strategy. This involves scanning a protein sequence with a fixed-size window and concatenating residue features within the window. While effective, the optimal window size varies across PBRs types and datasets. Previous studies have explored different window sizes, with 9 being common for protein-protein interactions [43,44]. However, determining the optimal window size is time-consuming and its impact on performance is not always consistent. Furthermore, the influence of neighboring residues on the central residue’s interaction potential is complex and not captured effectively by fixed-size windows. Consequently, there is a compelling need for more sophisticated methods to capture diverse residue dependencies without relying on arbitrary window sizes.
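For concreteness, a minimal sketch of the fixed-size sliding window strategy is given below; the zero-padding at the sequence ends and the array shapes are illustrative assumptions rather than details of any particular predictor.

```python
import numpy as np

def sliding_window_features(X, w):
    """Concatenate each residue's feature vector with its 2*w neighbors,
    zero-padding the sequence ends (the padding convention is an
    assumption). X: (L, d) -> returns (L, (2*w + 1) * d)."""
    L, d = X.shape
    padded = np.vstack([np.zeros((w, d)), X, np.zeros((w, d))])
    return np.hstack([padded[i:i + L] for i in range(2 * w + 1)])

# Example: a 9-residue window (w = 4), as used by several methods in Table 1.
X = np.random.rand(120, 52)            # 120 residues, 52-dimensional features
F = sliding_window_features(X, w=4)    # shape (120, 9 * 52)
```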
Table 1. Summary of recent studies that focus on sequence-based PBRs prediction of different types.

| PBRs Type | Method | Model Type | Features (MSA / Structural / PLM) | Window Size |
|---|---|---|---|---|
| P | SPPIDER [45] | MLP | ×× | 11 |
| P | ISIS [46] | MLP | × | - |
| P | PSIVER [47] | Naïve Bayes | × | 9 |
| P | LORIS [48] | LR | × | 9 |
| P | SPRINGS [49] | MLP | × | 9 |
| P | CRFPPI [50] | RF | × | 9 |
| P | SPRINT [51] | SVM | × | 9 |
| P | SSWRF [52] | SVM and RF | × | 9 |
| P | RFPPI [53] | RF | ×× | 9 |
| P | SeRenDIP [54] | RF | × | 9 |
| P | DLPred [55] | LSTM | ×× | - |
| P | Lu et al. [43] | CNN and Attention | × | 5 |
| N+S+P | SCRIBER [56] | LR | × | 11 |
| N+S+P | ProNA2020 [57] | MLP | ✓✓✓ | 11 |
| N+S+P | PROBselect [58] | LR and SVM | × | - |
| N+S+P | DELPHI [24] | CNN and GRU | ✓✓✓ | 31 |
| N+S+P | PPISP-XGBoost [59] | XGBoost | × | 9 |
| N+S+P | PIPENN [25] | CNN and RNN | × | 9 |
| N+S+P | AttCNNPPI [44] | CNN and Attention | × | 5 |
| N+S+P | PITHIA [26] | CNN and RNN | × | 9 |
| N+S+P | StackingPPInet [60] | DCNN and Attention | ×× | 32 |
| N+S+P | EDLMPPI [27] | GRU | ×× | 25 |
| N+S+P | D-PPIsite [61] | CNN | × | 17 |
| N+S+P | ISPRED-SEQ [62] | CNN | × | 31 |
| N+S+P | EnsemPPIS [63] | CNN | ×× | - |
| N+S+P | DeepBSRPred [64] | MLP | × | 11 |
| N+S+P | DeepPRObind [65] | CNN and Attention | × | 25 |
| Pa | proABC [66] | RF | ××× | - |
| Pa | Parapred [67] | CNN and RNN | ××× | - |
| Pa | Ag-Fast-Parapred [68] | CNN and RNN | ××× | - |
| Pa | proABC-2 [69] | CNN | ×× | - |
| Pa | DeepANIS [70] | BiLSTM and Transformer | × | - |
| Pa | Lu et al. [71] | CNN and BiLSTM | × | - |
| Ep | BepiPred-1.0 [72] | Hidden Markov | ××× | 11 |
| Ep | AAPPred [73] | MLP | ××× | 7 |
| Ep | CBTOPE [74] | SVM | ××× | 13 |
| Ep | BepiPred-2.0 [75] | RF | × | 9 |
| Ep | SeRenDIP-CE [76] | RF | × | 9 |
| Ep | SEMA [77] | MLP | ×× | 9 |

Notes: P stands for methods evaluated on Protein-protein binding residues. N+S+P indicates methods evaluated on datasets containing protein-Nucleotide (DNA, RNA) binding residues, protein-Small ligand binding residues and protein-Protein binding residues. Pa and Ep are short for paratope and epitope, which are binding residues on the antibody and antigen sides of the interaction interface, respectively. MSA features are returned by running multiple sequence alignment tools such as PSI-BLAST, HHblits and MMseqs2; PLM features are generated by protein language models such as ESM and ProtTrans. In the Features column, marks refer to the MSA, predicted structural and PLM feature types: ✓ means a feature is used and × means it is not used.
In this paper, we design PMSFF, a novel sequence-based PBRs prediction framework. The framework of PMSFF is presented in Figure 1. PMSFF improves the traditional sliding window approach and utilizes information at multiple scales through multi-scale feature fusion. PMSFF can handle multiple PBRs types, provides better prediction results and requires less running time than current prediction approaches. The contributions of our work are summarized as follows:
  • We construct multi-scale features through multi-size windows, to capture more key neighboring residues, and multi-size kernels operating at various scales. Inspired by NLP techniques, we treat a protein sequence as a sentence and utilize a bidirectional GRU for multi-scale feature fusion. This allows for better matching of different types of PBRs and obtaining proper residue local context and global semantic information.
  • To leverage transfer learning, PMSFF incorporates ProtT5-XL-UniRef50 (ProtT5), a variant of the ProtTrans model trained on the UniRef50 dataset, for protein feature generation. A comparative analysis between ProtT5 features and traditional concatenated features reveals a substantial reduction in computational time while achieving significantly improved prediction performance.
  • We collect and reorganize benchmark PBRs datasets covering most of the categories, including PBRs from protein interactions with nucleotide (DNA, RNA), small ligand, common protein, heteromeric protein, homomeric protein as well as antibody and antigen interaction. The results on these datasets demonstrate the superior performance of PMSFF compared with the state-of-the-art methods.

2. Materials and Methods

2.1. Benchmark Dataset Construction

To evaluate the performance of PMSFF on different types of PBRs and make a fair comparison with the state-of-the-art methods, we collect various datasets from previous studies and follow the same training and evaluation protocols [25,54,67].
Table 2 shows a summary of all datasets used in this study. Specifically, the NSP6373 and NSP448 datasets contain multiple PBRs from protein interactions with nucleotides (DNA, RNA), small ligands and other proteins. The remaining datasets consist of PBRs from type-specific interactions: common protein-protein, heterodimer, homodimer and antibody-antigen interfaces.
To train our model, we utilize the NSP6373 dataset, derived from the BioDL_A_TR dataset employed in PIPENN [25]. For evaluation, we employ the widely used NSP448 dataset [56] and its subset NSP355, derived from DELPHI [24]. These datasets have been shown to exhibit low sequence similarity (<25%) [25]. To further mitigate sequence homology, we applied ECOD with an E-value cutoff of 0.001 [78,79], resulting in a non-redundant NSP6373 training set of 6373 protein sequences. Note that NSP6373, NSP448 and NSP355 encompass PBRs involved in interactions with nucleotides, small ligands and other proteins; we use the "NSP" prefix to denote this broad PBRs type. Residues are defined as binding if any of their atoms lies within the sum of the Van der Waals radii plus 0.5 Å of any partner protein atom [25]. While the NSP6373 and NSP448 datasets encompass a wide range of protein-binding residues [56], PBRs from specific protein interactions can exhibit unique characteristics. To assess model performance on these specialized PBRs, we incorporate additional datasets focused on particular protein interfaces.
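For concreteness, this distance criterion can be implemented directly from atomic coordinates. The sketch below assumes pre-parsed coordinate and element arrays (e.g., from a PDB parser) and an illustrative Van der Waals radii table; it is not the exact annotation pipeline of [25].

```python
import numpy as np

# Illustrative Van der Waals radii in Angstroms; values and element
# coverage are assumptions for this sketch.
VDW = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def binding_residues(coords, elems, res_ids, p_coords, p_elems, margin=0.5):
    """Flag residues having any atom closer to any partner-protein atom
    than the sum of the two Van der Waals radii plus `margin` (0.5 A)."""
    r1 = np.array([VDW[e] for e in elems])[:, None]
    r2 = np.array([VDW[e] for e in p_elems])[None, :]
    # Pairwise atom-atom distances: (n_atoms, n_partner_atoms)
    dist = np.linalg.norm(coords[:, None, :] - p_coords[None, :, :], axis=-1)
    atom_binds = (dist < r1 + r2 + margin).any(axis=1)
    return {r for r, b in zip(res_ids, atom_binds) if b}
```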
The P346 and P70 datasets are derived from Dset186, Dset72 [47], and PDBset164 [48], which were originally used to train DeepPPISP [80]. Following DeepPPISP’s methodology, we constructed a non-redundant dataset with a sequence identity threshold of 25% and split it into training (346 proteins) and test (72 proteins) sets. This process ensures minimal sequence homology between the two sets. Both P346 and P70 contain PBRs from common heterodimeric interfaces.
The N376, N38, S3874, and S354 datasets are derived from the BioDL_N_TR, ZK448_N_TE, BioDL_S_TR, and ZK448_S_TE datasets, respectively, which were employed for training and testing PIPENN [25]. To ensure dataset independence, homologous proteins between training and test sets were removed using ECOD. The resulting datasets contain type-specific PBRs involved in interactions with nucleotides (DNA, RNA) and small ligands.
The HHC489 dataset, originally used for training RFPPI [53], comprises PBRs from heterodimer and homodimer interfaces. RFPPI constructed a homodimer dataset (HM_479) with a 25% sequence identity threshold and derived a test set (Ho95) by randomly selecting 20% of HM_479. Similarly, a heterodimer training dataset (Dset_119) was constructed with a 25% sequence identity cutoff and tested on Dset_48 (renamed He48). To create a balanced training set, we combined HM_479 and Dset_119, removing any proteins homologous to Ho95 or He48, resulting in the HHC489 dataset containing 489 proteins. Consistent with RFPPI [53], residues in HHC489, Ho95 and He48 are defined as binding if both their pre-association accessible surface area (ASA) and post-association buried surface area (BSA) exceed 6.0 Å².
Antibody-antigen interactions represent a specific type of protein interface [81]. Paratopes and epitopes are the corresponding binding residues on antibodies and antigens, respectively. To evaluate paratope and epitope prediction methods, the Pa277 and Ep280 datasets are commonly employed [67,76]. Liberis et al. curated the Pa277 dataset comprising 277 complexes from the SAbDab database [82], defining paratopes as antibody residues within 4.5 Å of any antigen atom. Similarly, Hou et al. constructed the Ep280 dataset containing 280 complexes from SAbDab, designating residues within 6.0 Å of the antibody as epitopes [76].

2.2. Framework Architecture of the PMSFF

As illustrated in Figure 1A, our proposed PMSFF model comprises three core components: an attention layer, a convolutional neural network (CNN) layer, and a gated recurrent unit (GRU) layer. The model processes an input matrix of residue feature vectors. The attention layer captures contextual information from neighboring residues using a refined sliding window approach across multiple scales. Subsequently, the CNN layer extracts local features through one-dimensional convolutions with varying kernel sizes, followed by max pooling. The bidirectional GRU layer models long-range dependencies among residues. Finally, a fully connected layer predicts the binding probability for each residue.
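For readers who prefer code, a high-level PyTorch skeleton of this pipeline is sketched below. The window sizes, kernel sizes and GRU hidden size follow the values reported in Section 3.1; the channel count and other details are assumptions, and MultiWindowAttention and MultiKernelCNN are sketched in Sections 2.2.2 and 2.2.3.

```python
import torch
import torch.nn as nn

class PMSFF(nn.Module):
    """High-level composition of the pipeline in Figure 1A (a sketch,
    not the authors' released implementation)."""
    def __init__(self, d=1024, windows=(5, 15, 25), kernels=(3, 5, 7),
                 channels=64, hidden=64):
        super().__init__()
        self.attn = MultiWindowAttention(d, windows)
        self.cnn = MultiKernelCNN(d * (len(windows) + 1), channels, kernels)
        self.gru = nn.GRU(channels * len(kernels), hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x):               # x: (B, L, 1024) ProtT5 embeddings
        h = self.cnn(self.attn(x))      # (B, L, channels * len(kernels))
        h, _ = self.gru(h)              # (B, L, 2 * hidden)
        return torch.sigmoid(self.fc(h)).squeeze(-1)   # per-residue probability
```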

2.2.1. The Input Representation of Proteins

As depicted in Figure 1, we employ the ProtT5 protein language model for feature extraction. ProtT5 is pre-trained on the BFD dataset [83] and subsequently fine-tuned on UniRef50 [84]. The resulting 1024-dimensional residue embeddings from the final ProtT5 layer serve as our primary protein representation. To assess the impact of different feature encoding strategies, we compare this approach with a standard concatenation of traditional residue features.
As indicated in Table 1, MSA and predicted structural features are prevalent in PBRs prediction. Our feature set, consistent with previous work [44], comprises one-hot encoded residue types, position-specific scoring matrices (PSSMs) generated by MMseqs2 [35], and predicted structural features including secondary structure, solvent accessibility, and backbone dihedral angles from NetSurfP-2.0 (NSP2) [85] or NetSurfP-3.0 (NSP3) [36]. While NSP2 relies on MMseqs2 for feature extraction, NSP3 incorporates the ESM language model [39]. This results in a 52-dimensional residue representation, aligning with our prior work [43]. Given ProtT5’s demonstrated efficacy in residue-level classification [41], we prioritize its use for feature generation. ProtT5 produces a set of embedding vectors for each protein, forming the input matrix for our PMSFF model, which is formulated as:
$$S = [r_1, r_2, r_3, \ldots, r_i, \ldots, r_l]^T, \quad r_i \in \mathbb{R}^d \tag{1}$$
where $l$ is the protein length, $d$ is the dimension of the residue feature vector, and $r_i$ is the feature vector of the $i$-th residue.
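For illustration, per-residue ProtT5 embeddings can be generated with the Hugging Face transformers library along the following lines; the checkpoint name and preprocessing mirror the public ProtTrans examples and are assumptions here rather than the exact pipeline of this work.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tok = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

def prott5_embed(seq: str) -> torch.Tensor:
    """Return the (l, 1024) matrix S of Equation (1) for one sequence."""
    seq = re.sub(r"[UZOB]", "X", seq.upper())      # map rare amino acids to X
    ids = tok(" ".join(seq), return_tensors="pt")  # ProtT5 expects spaced residues
    with torch.no_grad():
        emb = model(**ids).last_hidden_state[0]    # (l + 1, 1024)
    return emb[:len(seq)]                          # drop the trailing </s> token
```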

2.2.2. Attention Layer

While the sliding window approach effectively captures local residue dependencies [53], its performance is hindered by the complex and varied nature of residue interactions. Optimal window sizes can differ across PBRs types and even within the same PBRs category, as evidenced in Table 1. Our previous work demonstrated the efficacy of attention mechanisms in aggregating target and neighboring residues [43], with optimal performance achieved at a window size of 5. To accommodate the diverse characteristics of PBRs, we introduce a multi-scale attention mechanism that employs multiple window sizes.
As depicted in Figure 1A, our attention layer assigns weights to neighboring residues within a specified window, generating a context vector through weighted averaging. Mathematically, the neighboring residues $r_j$ of the target residue $r_i$ within a window of half-width $w$ are defined as:
$$r_j \in \{r_{i-w}, r_{i-w+1}, \ldots, r_i, \ldots, r_{i+w-1}, r_{i+w}\} \tag{2}$$
To capture complex patterns between the target residue and its neighbors, we improve the sliding window approach by utilizing $n$ windows of different sizes. The neighboring residues $r_j^n$ of target residue $r_i$ in the $n$-th window are:
$$r_j^n \in \{r_{i-w_n}, r_{i-w_n+1}, \ldots, r_i, \ldots, r_{i+w_n-1}, r_{i+w_n}\} \tag{3}$$
As in our previous work [44], we utilize an attention mechanism that enables the target residue to pay different attention to each neighboring residue; the typical additive model is used in this study. In the $n$-th window, the similarity score $\mathrm{score}(r_i, r_j^n)$ between target residue $r_i$ and neighboring residue $r_j^n$ and the attention weight $\alpha_{i,j}^n$ are calculated as:
$$\mathrm{score}(r_i, r_j^n) = v_a^n \tanh\left(W_a^n [r_i \oplus r_j^n]\right) \tag{4}$$
$$\alpha_{i,j}^n = \frac{\exp\left(\mathrm{score}(r_i, r_j^n)\right)}{\sum_{j} \exp\left(\mathrm{score}(r_i, r_j^n)\right)} \tag{5}$$
The residue pair correlation score $\mathrm{score}(r_i, r_j^n)$ is computed by the two-layer neural network described in Equation (4), in which $v_a^n$ and $W_a^n$ are weight matrices learned for the $n$-th window. Neighboring residues $r_j^n$ ($j \neq i$) with larger scores contribute more to the context vector $g_i^n$. The attention weights satisfy $\alpha_{i,j}^n \geq 0$ and $\sum_j \alpha_{i,j}^n = 1$.
The context vector $g_i^n$ of the $n$-th window is constructed by combining the neighboring residue vectors in a weighted sum:
$$g_i^n = \sum_{j \neq i} \alpha_{i,j}^n r_j^n \tag{6}$$
The output of the attention layer is the concatenation of the original residue vector and the context vectors learned from sliding windows of different sizes:
$$r_i' = r_i \oplus g_i^1 \oplus \cdots \oplus g_i^n \tag{7}$$
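A PyTorch sketch of this multi-window additive attention follows. The zero-padding at sequence ends and the attention hidden size are assumptions, and each window size is taken here as the full width (an odd number of residues centered on the target), whereas the equations above index neighbors by half-width $w_n$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWindowAttention(nn.Module):
    """Additive attention over neighboring residues for several window
    sizes (Equations (2)-(7)); a sketch under the stated assumptions."""
    def __init__(self, d, windows=(5, 15, 25), attn_dim=64):
        super().__init__()
        self.windows = windows
        self.W = nn.ModuleList([nn.Linear(2 * d, attn_dim) for _ in windows])
        self.v = nn.ModuleList([nn.Linear(attn_dim, 1, bias=False)
                                for _ in windows])

    def forward(self, x):                         # x: (B, L, d)
        outs = [x]
        for n, w in enumerate(self.windows):
            half = w // 2
            pad = F.pad(x, (0, 0, half, half))    # zero-pad both sequence ends
            nbr = pad.unfold(1, w, 1).permute(0, 1, 3, 2)   # (B, L, w, d)
            tgt = x.unsqueeze(2).expand(-1, -1, w, -1)      # repeat target r_i
            score = self.v[n](torch.tanh(
                self.W[n](torch.cat([tgt, nbr], dim=-1)))).squeeze(-1)  # Eq. (4)
            mask = torch.zeros(w, device=x.device)
            mask[half] = float("-inf")            # exclude j = i, as in Eq. (6)
            alpha = (score + mask).softmax(dim=-1)           # Eq. (5)
            outs.append((alpha.unsqueeze(-1) * nbr).sum(2))  # context g_i^n
        return torch.cat(outs, dim=-1)            # Eq. (7): (B, L, (n + 1) * d)
```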

2.2.3. CNN Layer

Convolutional neural networks (CNN) have been used in various bioinformatics tasks such as protein structure prediction [86] and protein-compound affinity prediction [87]. Using multi-size kernels in a CNN helps extract information at various scales, which has improved performance in text classification [88] and image classification [89]. Inspired by TextCNN [88], we utilize multi-size kernels in the CNN layer of PMSFF. This architecture allows the network to extract information over multiple scales. The input of the CNN layer is the output of the attention layer, which can be written as:
$$S' = [r_1', r_2', r_3', \ldots, r_i', \ldots, r_{l-1}', r_l']^T, \quad r_i' \in \mathbb{R}^{4d} \tag{8}$$
We use $m$ kernels of different sizes in our experiments; the $m$-th convolutional operation is:
$$c_i^m = f_c^m\left(W_c^m r_i' + b_c^m\right) \tag{9}$$
where $f_c^m$ is the non-linear activation function of the $m$-th kernel, $W_c^m$ and $b_c^m$ are its weight matrix and bias, and $r_i'$ is the output of the attention layer.
As illustrated in Figure 1C, the output of the CNN layer is a feature matrix of dimensions $l \times 3k$, where $l$ is the sequence length and $k$ is the number of output channels per convolutional filter. A subsequent max-pooling layer with a filter size of 5 and zero padding downsamples the feature matrix, generating the feature vector $p_i$ as input to the GRU layer.
$$p_i = \mathrm{maxpooling}\left(c_i^1 \oplus \cdots \oplus c_i^m\right) \tag{10}$$
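A minimal PyTorch sketch of the multi-kernel convolution and pooling of Equations (9) and (10) is given below; the ReLU activation and the stride-1 pooling, which keeps one vector per residue, are assumptions.

```python
import torch
import torch.nn as nn

class MultiKernelCNN(nn.Module):
    """Multi-size 1D convolutions plus max pooling (Equations (9)-(10));
    a sketch under the stated assumptions."""
    def __init__(self, in_dim, channels=64, kernels=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim, channels, k, padding=k // 2) for k in kernels])
        self.pool = nn.MaxPool1d(5, stride=1, padding=2)   # filter size 5

    def forward(self, x):                   # x: (B, L, in_dim)
        x = x.transpose(1, 2)               # Conv1d expects (B, C, L)
        c = torch.cat([torch.relu(conv(x)) for conv in self.convs], dim=1)
        return self.pool(c).transpose(1, 2)   # (B, L, channels * len(kernels))
```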

2.2.4. GRU Layer

Recurrent Neural Networks (RNN) excel at processing sequential data [90]. While both Long Short-Term Memory (LSTM) [91] and Gated Recurrent Unit (GRU) [92] are RNN variants, GRU's computational efficiency makes it our preferred choice. To capture intricate residue interactions within the protein sequence, we incorporate a GRU layer in PMSFF. The input to this GRU layer is a matrix of residue feature vectors, formed by concatenating the outputs of all CNN channels. The resulting feature vector for residue $i$ is denoted as $p_i$:
$$p_i = [e_1, e_2, e_3, \ldots, e_{3k}] \tag{11}$$
To capture both forward and backward contextual information within protein sequences, analogous to natural language processing, we employ a bidirectional gated recurrent unit (GRU) with a hidden state size of 64. The GRU calculations are as follows:
$$z_t = \sigma\left(W_z \cdot [h_{t-1}, e_t] + b_z\right) \tag{12}$$
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, e_t] + b_r\right) \tag{13}$$
$$\tilde{h}_t = \tanh\left(W_c \cdot [r_t \odot h_{t-1}, e_t] + b_c\right) \tag{14}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{15}$$
There are two gates: an update gate $z_t$ with weight matrix $W_z$ and bias $b_z$, and a reset gate $r_t$ with weight matrix $W_r$ and bias $b_r$. Here $\tanh$ is the element-wise hyperbolic tangent, $\sigma$ is the logistic sigmoid function, $\odot$ denotes element-wise multiplication, $W_c$ is a weight matrix, $e_t$ and $h_{t-1}$ are the inputs, and $h_t$ is the output.
A bidirectional GRU learns information from the input sequence in both the forward and backward directions. As shown in Figure 1A, the GRU layer in PMSFF contains two sub-networks for the left and right sequence contexts. For the $i$-th residue in the input protein sequence, we combine the forward pass output $\overrightarrow{h_i}$ and the backward pass output $\overleftarrow{h_i}$ by concatenation:
$$h_i = \left[\overrightarrow{h_i} \oplus \overleftarrow{h_i}\right] \tag{16}$$
As shown in Figure 1C, the output $h_i$ is fed to a fully connected layer. The binding probability $o_i$ for each input residue is calculated as:
$$o_i = f_o(W_o h_i + b_o) \tag{17}$$
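To connect Equations (12)-(15) to code, the sketch below implements a single GRU step by hand; in practice torch.nn.GRU performs these updates internally, so this is purely illustrative and the weight shapes are assumptions.

```python
import torch

def gru_step(e_t, h_prev, Wz, bz, Wr, br, Wc, bc):
    """One GRU update implementing Equations (12)-(15); weight matrices
    have shape (hidden, hidden + input) to match the concatenation
    [h_{t-1}, e_t]."""
    x = torch.cat([h_prev, e_t], dim=-1)
    z = torch.sigmoid(x @ Wz.T + bz)           # update gate, Eq. (12)
    r = torch.sigmoid(x @ Wr.T + br)           # reset gate, Eq. (13)
    h_tilde = torch.tanh(
        torch.cat([r * h_prev, e_t], dim=-1) @ Wc.T + bc)   # candidate, Eq. (14)
    return (1 - z) * h_prev + z * h_tilde      # new hidden state, Eq. (15)
```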

3. Results

In this section, we first investigate the effects of the multi-scale and ProtT5 features on the PMSFF model. Then we compare the proposed model with state-of-the-art approaches on different evaluation datasets. The evaluation metrics and implementation details are provided in Appendix A.1 and Appendix A.2, respectively.

3.1. The Effectiveness of Multi-Scale Feature

To enhance PBRs prediction performance, we incorporate multi-scale features by applying multiple window sizes within the attention layer and kernel sizes within the CNN layer. A GRU layer is further employed to capture long-range dependencies. Our PMSFF model utilizes three window sizes (5, 15, 25), three kernel sizes (3, 5, 7), and a single GRU layer.
To assess the impact of these components, we conduct experiments varying the number of windows, kernels, and GRU layers from 0 to 4. A value of 0 indicates the absence of the corresponding layer. Each model configuration is trained and tested five times on the NSP6373 and NSP448 datasets for robust evaluation. Mean performance metrics and standard errors are presented in Table 3.
As shown in Table 3, our proposed PMSFF model consistently outperforms other configurations across all evaluation metrics. Increasing the number of attention windows generally improves performance, but excessive windows lead to overfitting. While the model without a CNN layer yielded reasonable results, employing three kernels was optimal for capturing local features. The GRU layer performed best with a single layer, as additional layers tended to overfit. Overall, these findings suggest that while multi-scale features can enhance model performance, an appropriate balance between complexity and model capacity is essential.
Building upon our previous work [43], we initially selected a window size of 5 for capturing local residue interactions. To incorporate longer-range dependencies, additional window sizes of 15, 25 and 35 were introduced. Our results indicate that employing three window sizes (5, 15, 25) outperforms configurations with fewer windows, suggesting the benefit of capturing information from diverse neighborhood scales. However, excessively large window sizes, as demonstrated by the W4K3GRU1 model, can introduce noise and hinder performance.
Inspired by TextCNN [88] and DeepPPISP [80], we incorporate 1D convolutional layers into PMSFF to extract deeper residue-level features. Given that the attention layer already captures local dependencies, we focus on capturing more complex patterns within the sequence. We employ kernel sizes of 3, 5, and 7, consistent with our previous work [43]. Results in Table 3 indicate that three kernels optimize performance while increasing the number of kernels to four (sizes 3, 5, 7, 9) degrades accuracy, similar to the observations made with multiple windows. This suggests that excessive feature engineering can introduce noise and hinder model performance.
To effectively fuse multi-scale features and capture intricate long-range dependencies within the protein sequence, we incorporate a GRU layer into PMSFF. Table 3 demonstrates that a single-layer GRU outperforms deeper GRU architectures, aligning with our previous findings in structure-based epitope prediction [93]. The combination of CNN and GRU layers, integrating spatial and temporal information processing, has been shown to enhance performance in various classification tasks [89]. This synergistic interaction between convolutional and recurrent layers is crucial for learning complex protein representations.
In summary, our proposed PMSFF model integrates attention, convolutional, and recurrent layers to effectively capture diverse feature representations for PBRs prediction. The model employs three windows in the attention layer, three kernels in the CNN layer, and a single-layer GRU to optimize performance. Each component contributes significantly to the model’s overall predictive power.

3.2. The Evaluation Results of ProtT5 Feature

Traditional PBRs prediction methods often rely on concatenating multiple residue features, including MSA and predicted structural information. This approach is computationally demanding and requires extensive experimentation to determine the optimal feature set. To address these limitations, we propose leveraging ProtT5-generated embeddings as a comprehensive protein representation. To evaluate the effectiveness of ProtT5, we compare PMSFF’s performance using ProtT5 embeddings against models employing concatenated features comprising PSSMs from MMseqs2 and structural information from either NSP2 or NSP3. We conduct five-fold cross-validation on the NSP6373 and NSP448 datasets to assess model robustness.
As illustrated in Figure 2, the computational cost of generating concatenated features, specifically those incorporating NSP2 [85] or NSP3 [36] predictions, is significantly higher than that of generating ProtT5 embeddings. For protein sequences ranging from 200 to 700 residues, NSP2 and NSP3 require approximately 10 s and 3 s, respectively, while ProtT5 completes the process in less than 0.1 s. This substantial time saving enables the application of ProtT5 [38] to larger datasets, potentially enhancing model performance and prediction speed.
ProtT5 embeddings consistently outperform concatenated feature sets, demonstrating the effectiveness of transfer learning. As illustrated in Figure 2B, ProtT5 achieves superior performance across all evaluation metrics, with statistically significant improvements. The ROC and PR curves in Figure 2C,D further emphasize this advantage. Compared to NSP2+MMseqs2 features, ProtT5 exhibits absolute improvements of 0.157 in AUROC and 0.207 in AUPRC. When compared to NSP3+MMseqs2 features, these improvements are 0.132 and 0.187, respectively.

3.3. Performance Comparison with Other Baselines on NSP448 Dataset

To evaluate the performance of PMSFF, we compare it against 19 published sequence-based PBRs prediction methods on the NSP448 dataset. Table 4 presents the dataset-level performance metrics for all compared methods, calculated across all residues in NSP448. Results were obtained from various sources, including the original publications, the SCRIBER study [56], and web server predictions.
As shown in Table 4, PMSFF surpasses all compared methods in terms of F1-score (0.694) and AUPRC (0.514). Because these two metrics are particularly informative under class imbalance, these results indicate that PMSFF is well suited for PBRs prediction. Compared to the second-best performing method, D-PPIsite [61], PMSFF exhibits substantial improvements of 0.038 and 0.214 in F1-score and AUPRC, respectively. While PMSFF's MCC (0.388) is slightly lower than that of the top-performing D-PPIsite (0.399), the overall results highlight the effectiveness of our proposed method. To further illustrate the strengths of PMSFF, we provide the visual prediction results for a representative protein in the Supplementary File. As illustrated by the example shown in Figure S1, PMSFF accurately predicts a greater portion of a typical PBR than 13 other commonly used methods.
To expand the comparative analysis, we constructed the NSP355 dataset, a subset of NSP448 [24]. Performance metrics for additional methods, including DELPHI [24] and results obtained from the PIPENN, DeepBSRPred, and DeepPRObind web servers, are presented in Table 5. Note that DeepBSRPred's binary output format precluded the calculation of the F1 and MCC metrics. PMSFF demonstrates superior performance in terms of F1-score (0.499) and AUPRC (0.502) on the NSP355 dataset. While PMSFF's MCC value of 0.389 closely trails D-PPIsite's top performance of 0.390, the overall results underscore the effectiveness of our proposed method.
Leveraging results available on the DELPHI website (https://delphi.csd.uwo.ca/ (accessed on 26 June 2023)), we calculated AUROC and AUPRC for each protein sequence in the NSP355 dataset across PMSFF and 11 competing methods. Figure 3 depicts these results, highlighting PMSFF’s superior performance with median AUROC (0.833) and median AUPRC (0.567). PIPENN emerges as the second-best performer, achieving median AUROC and AUPRC values of 0.722 and 0.372, respectively. Notably, PMSFF surpasses PIPENN by a substantial margin, demonstrating improvements of 0.111 in median AUROC and 0.195 in median AUPRC.

3.4. Evaluation Results on Type-Specific Datasets

As Table 6 shows, we compare PMSFF with other methods on type-specific PBRs from protein-protein interaction, protein-small ligand interaction, protein-nucleotide interaction, heterodimer interface, homodimer interface and antibody-antigen interaction.
To evaluate PMSFF’s performance on common protein-protein interactions, we trained the model on the P346 dataset and tested it on the P70 dataset, following the protocol of DeepPPISP [80]. Table 6 presents the results of PMSFF and other competing methods, including those reported in DeepPPISP [80] and EnsemPPIS [63]. PMSFF demonstrates superior performance across most metrics, with particularly strong results in AUPRC (0.435), surpassing the second-best performers (EGRET and EnsemPPIS) by 0.03. Notably, DeepPPISP and EGRET are structure-based methods, while PMSFF is solely sequence-based.
To assess PMSFF’s performance on specific PBRs types, we re-trained and evaluated the model on datasets representing interactions with small ligands (S), nucleotides (N), heterodimers (He), homodimers (Ho), paratopes (Pa), and epitopes (Ep). Figure 4 presents a comparative analysis of AUROC values for PMSFF and other methods on the S354, N38, He48, and Ho95 datasets.
As illustrated in Figure 4, PMSFF consistently demonstrates superior AUROC performance across most PBRs types. However, for protein-small ligand binding residues, PIPENN [25] marginally outperforms PMSFF. It is noteworthy that while bindEmbed21(DL) [42] also leverages ProtT5 embeddings, its performance lags behind both PIPENN and PMSFF in this specific task. PMSFF obtains better AUROC than PIPENN [25] when predicting PBRs from protein-nucleotide interaction, heterodimer interface and homodimer interface. It indicates the advantage of PMSFF for predicting type-specific PBRs.
To evaluate paratope prediction performance, PMSFF was subjected to 10-fold cross-validation on the Pa277 dataset. Figure 4 clearly demonstrates PMSFF's superiority over other paratope prediction methods. Similarly, for epitope prediction, 10-fold cross-validation was performed on the Ep280 dataset, with PMSFF achieving the highest AUROC value. It is noteworthy that BepiPred-3.0 [40], a recently proposed method leveraging the ESM language model [39] for antigen representation, serves as a strong baseline.

4. Conclusions

Recent advancements in sequence-based PBRs prediction have encountered challenges in determining optimal feature scales and overcoming the limitations of the sliding window approach. To address these issues, we propose PMSFF, a generalizable framework for multi-PBR prediction. By leveraging the ProtT5 language model [38] for feature extraction, PMSFF efficiently generates comprehensive protein representations without relying on external tools or time-consuming feature engineering.
PMSFF incorporates multi-scale features and employs attention, convolutional, and recurrent layers to capture complex residue interactions. Extensive experiments on diverse benchmark datasets, encompassing various PBRs types, demonstrate PMSFF’s superior performance compared to existing methods.
Although PMSFF achieves the best performance on multiple PBRs prediction tasks, it has the limitation that partner information is not used. Several partner-specific PBRs prediction methods have been proposed that predict binding residues for a particular interacting pair [94]. In future work, we will design a novel deep learning model for partner-specific prediction of multiple PBRs types.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biom14101220/s1, Figure S1: Interaction sites of a representative protein in the NSP448 set (PDB ID: 3ZCB, Chain: B). TP predictions are in yellow, FN predictions are in red, FP predictions are in blue and the background grey represents TN predictions. The threshold of each method is determined by making the number of predicted interaction residues equal that of actual interaction residues.

Author Contributions

Y.L.: Conceptualization, formal analysis, writing. S.L.: visualization, methodology, data curation. X.N. and Z.T.: validation and editing. S.Z.: project administration, investigation. Q.Z.: supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No.62371423), General Project of National Natural Science Foundation of China (61572444), Major Science and Technology Projects in Henan Province (Grant No. 201400210400), Key Research and Development Projects in Henan Province (Grant No. 241111210500), Natural Science Foundation of Henan Province of China (Grant No. 232300421117).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and datasets of our proposed method PMSFF are available at https://github.com/biolushuai/PMSFF-for-multiple-PBRs-prediction (accessed on 23 September 2024).

Acknowledgments

The authors thank the anonymous reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Evaluation Metrics

As in previous studies, prediction performance is evaluated using a series of classical binary classification metrics, including Specificity (SPE), Accuracy (ACC), Precision (PRE), Recall (REC), F1 Score (F1) and Matthews Correlation Coefficient (MCC). The equations of all metrics are as follows:
$$\mathrm{SPE} = \frac{TN}{TN + FP} \tag{A1}$$
$$\mathrm{ACC} = \frac{TP + TN}{TP + FN + TN + FP} \tag{A2}$$
$$\mathrm{PRE} = \frac{TP}{TP + FP} \tag{A3}$$
$$\mathrm{REC} = \frac{TP}{TP + FN} \tag{A4}$$
$$F1 = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}} \tag{A5}$$
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{A6}$$
where TP (True Positive) is the number of interacting residues correctly predicted as interacting, FP (False Positive) is the number of non-interacting residues falsely predicted as interacting, TN (True Negative) is the number of non-interacting residues correctly identified, and FN (False Negative) is the number of interacting residues incorrectly predicted as non-interacting.
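For completeness, a direct implementation of the six metrics above from confusion-matrix counts:

```python
import numpy as np

def threshold_metrics(tp, tn, fp, fn):
    """The six threshold-dependent metrics of Equations (A1)-(A6)."""
    spe = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"SPE": spe, "ACC": acc, "PRE": pre, "REC": rec, "F1": f1, "MCC": mcc}
```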
The metrics above are all threshold-dependent: each relies on a threshold for separating positive from negative predictions. Usually, the threshold is set to 0.5. For a more reasonable judgment, we follow the strategy of previous studies [24,25] and set the threshold so that the number of positive predictions equals the number of real positive examples.
Table 2 in the manuscript shows that the datasets used in this study are highly imbalanced. Therefore, we also utilize the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). AUROC and AUPRC are threshold-independent metrics and evaluate the overall performance of predictors; AUPRC is more sensitive than AUROC on unbalanced datasets [95]. Since protein binding residues account for only about 10% to 20% of residues in the benchmark datasets used in this study, we take AUPRC as the most important metric for framework evaluation and selection.
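A minimal sketch of this evaluation protocol, assuming scikit-learn is available (average_precision_score serves here as the usual approximation of AUPRC):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, y_score):
    """Threshold-independent AUROC/AUPRC plus binary predictions under the
    equal-positives thresholding of [24,25]. Ties at the threshold may
    yield slightly more than k positives."""
    k = int(y_true.sum())                # number of real binding residues
    thr = np.sort(y_score)[-k]           # k-th largest predicted score
    y_pred = (y_score >= thr).astype(int)
    auroc = roc_auc_score(y_true, y_score)
    auprc = average_precision_score(y_true, y_score)
    return y_pred, auroc, auprc
```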

Appendix A.2. Implementation Details

We implement our proposed framework using PyTorch. PMSFF is trained by minimizing a weighted cross-entropy loss with the Adaptive Momentum (Adam) optimizer. The learning rate is tested among 0.1, 0.01 and 0.001, and the training batch size is chosen from 32, 64 and 128. The window sizes used in this study vary from 5 to 35 with a step of 10, and the kernel sizes consist of 3, 5, 7 and 9; when we use more than one window or kernel, we select them starting from the smallest size. To achieve better performance and avoid overfitting, the dropout rate is tested among 0.2, 0.5 and 0.8. We employ early stopping [96]: for each hyper-parameter combination, networks are trained until performance on the validation set stops improving for 10 epochs. All results shown in Table 5 in the manuscript are chosen by maximizing the MCC on the testing data.
The training time of each epoch varies roughly from 5 to 10 min depending on the sliding window length and the number of windows, using a single NVIDIA RTX 2080 GPU. An independent validation set (10% of the training set) is used to tune parameters, and 5-fold cross-validation is employed when predicting paratopes and epitopes to enable a fair comparison with competing methods.
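A compact sketch of this training procedure; the positive-class weight, the epoch cap and the data-loader interface are illustrative assumptions.

```python
import copy
import torch

def weighted_bce(p, y, w_pos):
    """Weighted cross-entropy on predicted probabilities; w_pos up-weights
    the rare binding class (how the weight is set is an assumption)."""
    p = p.clamp(1e-7, 1 - 1e-7)
    return -(w_pos * y * p.log() + (1 - y) * (1 - p).log()).mean()

def train(model, train_loader, val_loader, w_pos=5.0, lr=0.001,
          patience=10, max_epochs=200, device="cuda"):
    """Adam plus early stopping once validation loss stops improving
    for `patience` (10) epochs, as described in Appendix A.2."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, best_state, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:
            loss = weighted_bce(model(x.to(device)), y.to(device), w_pos)
            opt.zero_grad()
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(weighted_bce(model(x.to(device)), y.to(device), w_pos).item()
                      for x, y in val_loader) / len(val_loader)
        if val < best:                       # keep the best checkpoint
            best, best_state, wait = val, copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:             # early stopping
                break
    model.load_state_dict(best_state)
    return model
```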

References

  1. Pairo-Castineira, E.; Rawlik, K.; Bretherick, A.D.; Qi, T.; Wu, Y.; Nassiri, I.; McConkey, G.A.; Zechner, M.; Klaric, L.; Griffiths, F.; et al. GWAS and meta-analysis identifies 49 genetic variants underlying critical COVID-19. Nature 2023, 617, 764–768. [Google Scholar] [CrossRef] [PubMed]
  2. Zhang, J.; Kurgan, L. Review and comparative assessment of sequence-based predictors of protein-binding residues. Briefings Bioinform. 2018, 19, 821–837. [Google Scholar] [CrossRef] [PubMed]
  3. Kundu, S.; Lin, C.; Jaiswal, M.; Mullapudi, V.B.; Craig, K.C.; Chen, S.; Guo, Z. Profiling glycosylphosphatidylinositol (GPI)-interacting proteins in the cell membrane using a bifunctional GPI analogue as the probe. J. Proteome Res. 2023, 22, 919–930. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, J.; Ma, Z.; Kurgan, L. Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Briefings Bioinform. 2019, 20, 1250–1268. [Google Scholar] [CrossRef]
  5. Akbar, R.; Robert, P.A.; Pavlović, M.; Jeliazkov, J.R.; Snapkov, I.; Slabodkin, A.; Weber, C.R.; Scheffer, L.; Miho, E.; Haff, I.H.; et al. A compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding. Cell Rep. 2021, 34, 108856. [Google Scholar] [CrossRef]
  6. Gao, Z.; Jiang, C.; Zhang, J.; Jiang, X.; Li, L.; Zhao, P.; Yang, H.; Huang, Y.; Li, J. Hierarchical graph learning for protein–protein interaction. Nat. Commun. 2023, 14, 1093. [Google Scholar] [CrossRef]
  7. Gao, H.; Chen, C.; Li, S.; Wang, C.; Zhou, W.; Yu, B. Prediction of protein-protein interactions based on ensemble residual convolutional neural network. Comput. Biol. Med. 2023, 152, 106471. [Google Scholar] [CrossRef]
  8. Yu, B.; Chen, C.; Wang, X.; Yu, Z.; Ma, A.; Liu, B. Prediction of protein–protein interactions based on elastic net and deep forest. Expert Syst. Appl. 2021, 176, 114876. [Google Scholar] [CrossRef]
  9. Basu, S.; Zhao, B.; Biró, B.; Faraggi, E.; Gsponer, J.; Hu, G.; Kloczkowski, A.; Malhis, N.; Mirdita, M.; Söding, J.; et al. DescribePROT in 2023: More, higher-quality and experimental annotations and improved data download options. Nucleic Acids Res. 2024, 52, D426–D433. [Google Scholar] [CrossRef]
  10. Orii, N.; Ganapathiraju, M.K. Wiki-Pi: A Web-Server of Annotated Human Protein-Protein Interactions to Aid in Discovery of Protein Function. PLoS ONE 2012, 7, e49029. [Google Scholar] [CrossRef]
  11. Panaitiu, A.E.; Basiashvili, T.; Mierke, D.F.; Pellegrini, M. An engineered construct of cFLIP provides insight into DED1 structure and interactions. Structure 2022, 30, 229–239.e5. [Google Scholar] [CrossRef] [PubMed]
  12. Qiu, J.; Liu, Q.; Li, P.; Jiang, Q.; Chen, W.; Li, D.; Li, G.; Shan, G. Ligand-directed photodegradation of interacting proteins: Oxidative HER2/HER3 heterodimer degradation with a lapatinib-derived photosensitizer. J. Med. Chem. 2023, 66, 10265–10272. [Google Scholar] [CrossRef] [PubMed]
  13. Parvizpour, S.; Pourseif, M.M.; Razmara, J.; Rafi, M.A.; Omidi, Y. Epitope-based vaccine design: A comprehensive overview of bioinformatics approaches. Drug Discov. Today 2020, 25, 1034–1042. [Google Scholar] [CrossRef] [PubMed]
  14. Lu, R.M.; Hwang, Y.C.; Liu, I.J.; Lee, C.C.; Tsai, H.Z.; Li, H.J.; Wu, H.C. Development of therapeutic antibodies for the treatment of diseases. J. Biomed. Sci. 2020, 27, 1. [Google Scholar] [CrossRef] [PubMed]
  15. Yang, J.; Roy, A.; Zhang, Y. BioLiP: A semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013, 41, D1096–D1103. [Google Scholar] [CrossRef] [PubMed]
  16. The UniProt Consortium. UniProt: The Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023, 51, D523–D531. [Google Scholar] [CrossRef]
  17. Tang, T.; Zhang, X.; Liu, Y.; Peng, H.; Zheng, B.; Yin, Y.; Zeng, X. Machine learning on protein–protein interaction prediction: Models, challenges and trends. Briefings Bioinform. 2023, 24, bbad076. [Google Scholar] [CrossRef]
  18. Soleymani, F.; Paquet, E.; Viktor, H.L.; Michalowski, W.; Spinello, D. ProtInteract: A deep learning framework for predicting protein–protein interactions. Comput. Struct. Biotechnol. J. 2023, 21, 1324–1348. [Google Scholar] [CrossRef]
  19. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef]
  20. Varadi, M.; Anyango, S.; Deshpande, M.; Nair, S.; Natassia, C.; Yordanova, G.; Yuan, D.; Stroe, O.; Wood, G.; Laydon, A.; et al. AlphaFold protein structure database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022, 50, D439–D444. [Google Scholar] [CrossRef]
  21. Jakubec, D.; Skoda, P.; Krivak, R.; Novotny, M.; Hoksza, D. PrankWeb 3: Accelerated ligand-binding site predictions for experimental and modelled protein structures. Nucleic Acids Res. 2022, 50, W593–W597. [Google Scholar] [CrossRef] [PubMed]
  22. McGreig, J.E.; Uri, H.; Antczak, M.; Sternberg, M.J.E.; Michaelis, M.; Wass, M. 3DLigandSite: Structure-based prediction of protein–ligand binding sites. Nucleic Acids Res. 2022, 50, W13–W20. [Google Scholar] [CrossRef] [PubMed]
  23. Esmaielbeiki, R.; Krawczyk, K.; Knapp, B.; Nebel, J.C.; Deane, C.M. Progress and challenges in predicting protein interfaces. Briefings Bioinform. 2016, 17, 117–131. [Google Scholar] [CrossRef] [PubMed]
  24. Li, Y.; Golding, G.B.; Ilie, L. DELPHI: Accurate deep ensemble model for protein interaction sites prediction. Bioinformatics 2021, 37, 896–904. [Google Scholar] [CrossRef] [PubMed]
  25. Stringer, B.; de Ferrante, H.; Abeln, S.; Heringa, J.; Feenstra, K.A.; Haydarlou, R. PIPENN: Protein interface prediction from sequence with an ensemble of neural nets. Bioinformatics 2022, 38, 2111–2118. [Google Scholar] [CrossRef]
  26. Hosseini, S.; Ilie, L. PITHIA: Protein Interaction Site Prediction Using Multiple Sequence Alignments and Attention. Int. J. Mol. Sci. 2022, 23, 12814. [Google Scholar] [CrossRef]
  27. Hou, Z.; Yang, Y.; Ma, Z.; Wong, K.c.; Li, X. Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning. Commun. Biol. 2023, 6, 73. [Google Scholar] [CrossRef]
  28. Fout, A.; Byrd, J.; Shariat, B.; Ben-Hur, A. Protein interface prediction using graph convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 6530–6539. [Google Scholar]
  29. Mahbub, S.; Bayzid, M.S. EGRET: Edge aggregated graph attention networks and transfer learning improve protein–protein interaction site prediction. Briefings Bioinform. 2022, 23, bbab578. [Google Scholar] [CrossRef]
  30. Zhou, Y.; Jiang, Y.; Yang, Y. AGAT-PPIS: A novel protein–protein interaction site predictor based on augmented graph attention network with initial residual and identity mapping. Briefings Bioinform. 2023, 24, bbad122. [Google Scholar] [CrossRef]
  31. Tubiana, J.; Schneidman-Duhovny, D.; Wolfson, H.J. ScanNet: An interpretable geometric deep learning model for structure-based protein binding site prediction. Nat. Methods 2022, 19, 730–739. [Google Scholar] [CrossRef]
  32. Jiménez, J.; Doerr, S.; Martínez-Rosell, G.; Rose, A.S.; Fabritiis, G.D. DeepSite: Protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics 2017, 33, 3036–3042. [Google Scholar] [CrossRef] [PubMed]
  33. Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402. [Google Scholar] [CrossRef] [PubMed]
  34. Remmert, M.; Biegert, A.; Hauser, A.; Söding, J. HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 2011, 9, 173–175. [Google Scholar] [CrossRef] [PubMed]
  35. Mirdita, M.; Steinegger, M.; Breitwieser, F.; Söding, J.; Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 2021, 37, 3029–3031. [Google Scholar] [CrossRef]
  36. Høie, M.H.; Kiehl, E.N.; Petersen, B.; Nielsen, M.; Winther, O.; Nielsen, H.; Hallgren, J.; Marcatili, P. NetSurfP-3.0: Accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res. 2022, 50, W510–W515. [Google Scholar] [CrossRef]
  37. Buchan, D.W.A.; Minneci, F.; Nugent, T.C.O.; Bryson, K.; Jones, D.T. Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res. 2013, 41, W349–W357. [Google Scholar] [CrossRef]
  38. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Towards Cracking the Language of Lifes Code Through Self-Supervised Deep Learning and High Performance Computing. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112–7127. [Google Scholar] [CrossRef]
  39. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef]
  40. Clifford, J.N.; Høie, M.H.; Deleuran, S.; Peters, B.; Nielsen, M.; Marcatili, P. BepiPred-3.0: Improved B-cell epitope prediction using protein language models. Protein Sci. 2022, 31, e4497. [Google Scholar] [CrossRef]
  41. Yuan, Q.; Chen, S.; Wang, Y.; Zhao, H.; Yang, Y. Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning. Briefings Bioinform. 2022, 23, bbac444. [Google Scholar] [CrossRef]
  42. Littmann, M.; Heinzinger, M.; Dallago, C.; Weissenow, K.; Rost, B. Protein embeddings and deep learning predict binding residues for various ligand classes. Sci. Rep. 2021, 11, 23916. [Google Scholar] [CrossRef] [PubMed]
  43. Lu, S.; Li, Y.; Nan, X.; Zhang, S. Attention-based Convolutional Neural Networks for Protein-Protein Interaction Site Prediction. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 141–144. [Google Scholar] [CrossRef]
  44. Li, Y.; Lu, S.; Ma, Q.; Nan, X.; Zhang, S. Protein-Protein Interaction Site Prediction Based on Attention Mechanism and Convolutional Neural Networks. IEEE/Acm Trans. Comput. Biol. Bioinform. 2023, 20, 3820–3829. [Google Scholar] [CrossRef] [PubMed]
  45. Porollo, A.; Meller, J. Prediction-based fingerprints of protein-protein interactions. Proteins Struct. Funct. Bioinform. 2006, 66, 630–645. [Google Scholar] [CrossRef]
  46. Ofran, Y.; Rost, B. ISIS: Interaction sites identified from sequence. Bioinformatics 2007, 23, e13–e16. [Google Scholar] [CrossRef]
  47. Murakami, Y.; Mizuguchi, K. Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein–protein interaction sites. Bioinformatics 2010, 26, 1841–1848. [Google Scholar] [CrossRef] [PubMed]
  48. Dhole, K.; Singh, G.; Pai, P.P.; Mondal, S. Sequence-based prediction of protein–protein interaction sites with L1-logreg classifier. J. Theor. Biol. 2014, 348, 47–54. [Google Scholar] [CrossRef] [PubMed]
  49. Singh, G.; Dhole, K.; Pai, P.P.; Mondal, S. SPRINGS: Prediction of Protein-Protein Interaction Sites Using Artificial Neural Networks. J. Proteom. Comput. Biol. 2014, 1, 7. [Google Scholar] [CrossRef]
  50. Wei, Z.S.; Yang, J.Y.; Shen, H.B.; Yu, D.J. A Cascade Random Forests Algorithm for Predicting Protein-Protein Interaction Sites. IEEE Trans. Nanobioscience 2015, 14, 746–760. [Google Scholar] [CrossRef]
  51. Taherzadeh, G.; Yang, Y.; Zhang, T.; Liew, A.W.; Zhou, Y. Sequence-based prediction of protein–peptide binding sites using support vector machine. J. Comput. Chem. 2016, 37, 1223–1229. [Google Scholar] [CrossRef]
  52. Wei, Z.S.; Han, K.; Yang, J.Y.; Shen, H.B.; Yu, D.J. Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 2016, 193, 201–212. [Google Scholar] [CrossRef]
  53. Hou, Q.; De Geest, P.; Vranken, W.F.; Heringa, J.; Feenstra, K.A. Seeing the Trees through the Forest: Sequence-based Homo- and Heteromeric Protein-protein Interaction sites prediction using Random Forest. Bioinformatics 2017, 33, 1479–1487. [Google Scholar] [CrossRef] [PubMed]
  54. Hou, Q.; De Geest, P.F.G.; Griffioen, C.J.; Abeln, S.; Heringa, J.; Feenstra, K.A. SeRenDIP: SEquential REmasteriNg to DerIve profiles for fast and accurate predictions of PPI interface positions. Bioinformatics 2019, 35, 4794–4796. [Google Scholar] [CrossRef] [PubMed]
  55. Zhang, B.; Li, J.; Quan, L.; Chen, Y.; Lü, Q. Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network. Neurocomputing 2019, 357, 86–100. [Google Scholar] [CrossRef]
  56. Zhang, J.; Kurgan, L. SCRIBER: Accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics 2019, 35, i343–i353. [Google Scholar] [CrossRef] [PubMed]
  57. Qiu, J.; Bernhofer, M.; Heinzinger, M.; Kemper, S.; Norambuena, T.; Melo, F.; Rost, B. ProNA2020 predicts protein–DNA, protein–RNA, and protein–protein binding proteins and residues from sequence. J. Mol. Biol. 2020, 432, 2428–2443. [Google Scholar] [CrossRef] [PubMed]
  58. Zhang, F.; Shi, W.; Zhang, J.; Zeng, M.; Li, M.; Kurgan, L. PROBselect: Accurate prediction of protein-binding residues from proteins sequences via dynamic predictor selection. Bioinformatics 2020, 36, i735–i744. [Google Scholar] [CrossRef]
  59. Wang, X.; Zhang, Y.; Yu, B.; Salhi, A.; Chen, R.; Wang, L.; Liu, Z. Prediction of protein-protein interaction sites through eXtreme gradient boosting with kernel principal component analysis. Comput. Biol. Med. 2021, 134, 104516. [Google Scholar] [CrossRef]
  60. Cong, H. Protein–protein interaction site prediction by model ensembling with hybrid feature and self-attention. BMC Bioinform. 2023, 24, 456. [Google Scholar] [CrossRef]
  61. Hu, J.; Dong, M.; Tang, Y.X.; Zhang, G.J. Improving protein-protein interaction site prediction using deep residual neural network. Anal. Biochem. 2023, 670, 115132. [Google Scholar] [CrossRef]
  62. Manfredi, M.; Savojardo, C.; Martelli, P.L.; Casadio, R. ISPRED-SEQ: Deep Neural Networks and Embeddings for Predicting Interaction Sites in Protein Sequences. J. Mol. Biol. 2023, 435, 167963. [Google Scholar] [CrossRef]
  63. Mou, M.; Pan, Z.; Zhou, Z.; Zheng, L.; Zhang, H.; Shi, S.; Li, F.; Sun, X.; Zhu, F. A Transformer-Based Ensemble Framework for the Prediction of Protein–Protein Interaction Sites. Research 2023, 6, 0240. [Google Scholar] [CrossRef] [PubMed]
  64. Nikam, R.; Yugandhar, K.; Gromiha, M.M. DeepBSRPred: Deep learning-based binding site residue prediction for proteins. Amino Acids 2023, 55, 1305–1316. [Google Scholar] [CrossRef] [PubMed]
  65. Zhang, F.; Li, M.; Zhang, J.; Shi, W.; Kurgan, L. DeepPRObind: Modular Deep Learner that Accurately Predicts Structure and Disorder-Annotated Protein Binding Residues. J. Mol. Biol. 2023, 435, 167945. [Google Scholar] [CrossRef] [PubMed]
  66. Olimpieri, P.P.; Chailyan, A.; Tramontano, A.; Marcatili, P. Prediction of site-specific interactions in antibody-antigen complexes: The proABC method and server. Bioinformatics 2013, 29, 2285–2291. [Google Scholar] [CrossRef] [PubMed]
  67. Liberis, E.; Veličković, P.; Sormanni, P.; Vendruscolo, M.; Liò, P. Parapred: Antibody paratope prediction using convolutional and recurrent neural networks. Bioinformatics 2018, 34, 2944–2950. [Google Scholar] [CrossRef]
  68. Deac, A.; Veličković, P.; Sormanni, P. Attentive cross-modal paratope prediction. J. Comput. Biol. 2019, 26, 536–545. [Google Scholar] [CrossRef]
69. Ambrosetti, F.; Olsen, T.H.; Olimpieri, P.P.; Jiménez-García, B.; Milanetti, E.; Marcatili, P.; Bonvin, A.M.J.J. proABC-2: PRediction of AntiBody contacts v2 and its application to information-driven docking. Bioinformatics 2020, 36, 5107–5108. [Google Scholar] [CrossRef]
  70. Zhang, P.; Zheng, S.; Chen, J.; Zhou, Y.; Yang, Y. DeepANIS: Predicting antibody paratope from concatenated CDR sequences by integrating bidirectional long-short-term memory and transformer neural networks. In Proceedings of the 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Houston, TX, USA, 9–12 December 2021; pp. 118–124. [Google Scholar] [CrossRef]
  71. Lu, S.; Li, Y.; Nan, X.; Zhang, S. A Sequence-Based Antibody Paratope Prediction Model Through Combing Local-Global Information and Partner Features. In Proceedings of the 17th International Symposium on Bioinformatics Research and Applications (ISBRA 2021), Shenzhen, China, 26–28 November 2021; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2021; Volume 13064, pp. 179–190. [Google Scholar] [CrossRef]
  72. Larsen, J.; Lund, O.; Nielsen, M. Improved method for predicting linear B-cell epitopes. Immunome Res. 2006, 2, 2. [Google Scholar] [CrossRef]
  73. Davydov, Y.I.; Tonevitsky, A.G. Prediction of linear B-cell epitopes. Mol. Biol. 2009, 43, 150–158. [Google Scholar] [CrossRef]
  74. Ansari, H.; Raghava, G.P. Identification of conformational B-cell Epitopes in an antigen from its primary sequence. Immunome Res. 2010, 6, 6. [Google Scholar] [CrossRef]
  75. Jespersen, M.C.; Peters, B.; Nielsen, M.; Marcatili, P. BepiPred-2.0: Improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 2017, 45, W24–W29. [Google Scholar] [CrossRef] [PubMed]
  76. Hou, Q.; Stringer, B.; Waury, K.; Capel, H.; Haydarlou, R.; Xue, F.; Abeln, S.; Heringa, J.; Feenstra, K.A. SeRenDIP-CE: Sequence-based interface prediction for conformational epitopes. Bioinformatics 2021, 37, 3421–3427. [Google Scholar] [CrossRef] [PubMed]
  77. Shashkova, T.I.; Umerenkov, D.; Salnikov, M.; Strashnov, P.V.; Konstantinova, A.V.; Lebed, I.; Shcherbinin, D.N.; Asatryan, M.N.; Kardymon, O.L.; Ivanisenko, N.V. SEMA: Antigen B-cell conformational epitope prediction using deep transfer learning. Front. Immunol. 2022, 13, 960985. [Google Scholar] [CrossRef] [PubMed]
  78. Schaeffer, R.D.; Liao, Y.; Cheng, H.; Grishin, N.V. ECOD: New developments in the evolutionary classification of domains. Nucleic Acids Res. 2017, 45, D296–D302. [Google Scholar] [CrossRef] [PubMed]
  79. Mirdita, M.; Steinegger, M.; Söding, J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics 2019, 35, 2856–2858. [Google Scholar] [CrossRef]
  80. Zeng, M.; Zhang, F.; Wu, F.X.; Li, Y.; Wang, J.; Li, M. Protein–protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 2019, 36, btz699. [Google Scholar] [CrossRef]
  81. Stave, J.W.; Lindpaintner, K. Antibody and Antigen Contact Residues Define Epitope and Paratope Size and Structure. J. Immunol. 2013, 191, 1428–1435. [Google Scholar] [CrossRef]
  82. Dunbar, J.; Krawczyk, K.; Leem, J.; Baker, T.; Fuchs, A.; Georges, G.; Shi, J.; Deane, C.M. SAbDab: The structural antibody database. Nucleic Acids Res. 2014, 42, D1140–D1146. [Google Scholar] [CrossRef]
  83. Steinegger, M.; Mirdita, M.; Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 2019, 16, 603–606. [Google Scholar] [CrossRef]
  84. Suzek, B.E.; Huang, H.; McGarvey, P.; Mazumder, R.; Wu, C.H. UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007, 23, 1282–1288. [Google Scholar] [CrossRef]
  85. Klausen, M.S.; Jespersen, M.C.; Nielsen, H.; Jensen, K.K.; Jurtz, V.I.; Sønderby, C.K.; Sommer, M.O.A.; Winther, O.; Nielsen, M.; Petersen, B.; et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins Struct. Funct. Bioinform. 2019, 87, 520–527. [Google Scholar] [CrossRef] [PubMed]
  86. Yang, J.; Anishchenko, I.; Park, H.; Peng, Z.; Ovchinnikov, S.; Baker, D. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. USA 2020, 117, 1496–1503. [Google Scholar] [CrossRef] [PubMed]
  87. Karimi, M.; Wu, D.; Wang, Z.; Shen, Y. DeepAffinity: Interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 2019, 35, 3329–3338. [Google Scholar] [CrossRef] [PubMed]
  88. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar] [CrossRef]
  89. Yan, W.; Calhoun, V.; Song, M.; Cui, Y.; Yan, H.; Liu, S.; Fan, L.; Zuo, N.; Yang, Z.; Xu, K.; et al. Discriminating schizophrenia using recurrent neural network applied on time courses of multi-site FMRI data. EBioMedicine 2019, 47, 543–552. [Google Scholar] [CrossRef]
  90. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  91. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  92. Gulcehre, C.; Cho, K.; Pascanu, R.; Bengio, Y. Learned-norm pooling for deep feedforward and recurrent neural networks. In Proceedings of the Machine Learning and Knowledge Discovery in Databases—European Conference, ECML PKDD 2014, Nancy, France, 15–19 September 2014. Proceedings, Part I; Calders, T., Esposito, F., Hüllermeier, E., Meo, R., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8724, pp. 530–546. [Google Scholar] [CrossRef]
  93. Lu, S.; Li, Y.; Ma, Q.; Nan, X.; Zhang, S. A Structure-Based B-cell Epitope Prediction Model Through Combing Local and Global Features. Front. Immunol. 2022, 13, 890943. [Google Scholar] [CrossRef]
94. Pittala, S.; Bailey-Kellogg, C. Learning context-aware structural representations to predict antigen and antibody binding interfaces. Bioinformatics 2020, 36, 3996–4003. [Google Scholar] [CrossRef]
95. Davis, J.; Goadrich, M. The Relationship Between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240. [Google Scholar] [CrossRef]
96. Prechelt, L. Early Stopping-But When? In Neural Networks: Tricks of the Trade; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1524, pp. 55–69. [Google Scholar] [CrossRef]
Figure 1. The framework of PMSFF. (A) Protein embeddings are obtained with the ProtT5 model. (B) Construction of multi-scale features. Left: a context vector is generated by an attention mechanism over a sliding window; multiple window sizes are used to adapt to different types of PBRs and to capture complex patterns between the target residue and its neighboring residues. Right: convolution kernels of multiple sizes are applied to the input residue feature vectors to extract information at additional scales, and the output channels of each kernel are concatenated. (C) Details of the framework architecture. PMSFF mainly consists of three parts: the attention layer, the CNN layer and the GRU layer.
Figure 2. Comparison between concatenated features and ProtT5 features on the NSP448 set. (A) Runtime of NetSurfP-2.0 (NSP2) [85], NetSurfP-3.0 (NSP3) [36] and ProtT5 [38] on protein sequences of various lengths. (B) Prediction performance comparison on SPE, ACC, AUROC, AUPRC, F1 and MCC; PMSFF is trained and tested five times with each of the three kinds of features. ** denotes that concatenated features are significantly worse than ProtT5 features (p < 0.005). (C) Comparison of the ROC curves of the three kinds of features. (D) Comparison of the PR curves of the three kinds of features.
Figure 3. Protein-level performance comparison of PMSFF and other methods on NSP355. (A) Distributions of per-protein AUROC values; the thick vertical line spans the first to the third quartile, the white dot marks the median, and the whiskers denote the minimal and maximal values. (B) Distributions of per-protein AUPRC values, shown in the same way as the AUROC values.
Figure 4. Comparison of PMSFF and other methods on type-specific PBRs. N and S denote protein–nucleotide (DNA, RNA) binding residues and protein–small-ligand binding residues, respectively. He and Ho stand for binding residues from heterodimers and homodimers. Pa and Ep are short for paratope and epitope, the binding residues on the antibody and antigen sides of the interaction interface, respectively.
Table 2. Summary of experimental datasets.
| Dataset | Proteins | Residues | Binding Residues | Evaluation Method | PBRs Type |
|---|---|---|---|---|---|
| NSP6373 | 6373 | 2,068,793 | 203,880 (10.60%) | Hold-out (Training Set) | N+S+P |
| NSP355 | 355 | 95,940 | 11,467 (11.95%) | Hold-out (Test Set) | N+S+P |
| NSP448 | 448 | 116,500 | 22,643 (19.44%) | Hold-out (Test Set) | N+S+P |
| P346 | 346 | 75,204 | 11,027 (14.66%) | Hold-out (Training Set) | P |
| P70 | 70 | 11,791 | 2330 (19.76%) | Hold-out (Test Set) | P |
| N376 | 376 | 124,476 | 15,821 (12.71%) | Hold-out (Training Set) | N |
| N38 | 38 | 10,212 | 1233 (12.07%) | Hold-out (Test Set) | N |
| S3874 | 3874 | 1,246,682 | 129,245 (10.37%) | Hold-out (Training Set) | S |
| S354 | 354 | 99,318 | 7175 (3.20%) | Hold-out (Test Set) | S |
| HHC489 | 489 | 126,997 | 25,564 (20.13%) | Hold-out (Training Set) | He+Ho |
| He48 | 48 | 14,056 | 1313 (9.34%) | Hold-out (Test Set) | He |
| Ho95 | 95 | 24,766 | 5564 (22.47%) | Hold-out (Test Set) | Ho |
| Pa277 | 277 | 115,341 | 6151 (5.33%) | K-fold Cross-validation | Pa |
| Ep280 | 280 | 75,956 | 6971 (9.18%) | K-fold Cross-validation | Ep |
Notes: N+S+P indicates datasets containing protein–nucleotide (DNA, RNA) binding residues, protein–small-ligand binding residues and protein–protein binding residues; P indicates protein–protein binding residues only. He and Ho are short for heterodimer and homodimer, respectively, and HHC means the dataset contains binding residues from both heterodimers and homodimers. Pa and Ep are short for paratope and epitope, the binding residues on the antibody and antigen sides of the interaction interface, respectively. Hold-out and K-fold cross-validation are the two ways model performance is evaluated: for hold-out validation, the data are split into training, validation and test parts; for K-fold cross-validation, the data are split into K subsets, and the model is trained on K−1 subsets and tested on the remaining subset, repeated K times. To make a fair comparison, we use the same datasets and evaluation protocols as the competing models.
Table 3. Impact of multi-scale features on the NSP448 dataset.
| Group | Variant | SPE | ACC | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|
| Attention | W0K3GRU1 | 0.871 (0.001) | 0.792 (0.002) | 0.669 (0.004) | 0.337 (0.008) | 0.764 (0.005) | 0.469 (0.009) |
| | W1K3GRU1 | 0.873 (0.001) | 0.796 (0.001) | 0.674 (0.001) | 0.347 (0.003) | 0.769 (0.002) | 0.481 (0.004) |
| | W2K3GRU1 | 0.874 (0.002) | 0.797 (0.003) | 0.676 (0.006) | 0.351 (0.011) | 0.772 (0.009) | 0.483 (0.009) |
| | W4K3GRU1 | 0.871 (0.001) | 0.793 (0.001) | 0.669 (0.002) | 0.339 (0.005) | 0.765 (0.005) | 0.472 (0.002) |
| CNN | W3K0GRU1 | 0.874 (0.000) | 0.798 (0.000) | 0.677 (0.001) | 0.354 (0.001) | 0.769 (0.001) | 0.482 (0.001) |
| | W3K1GRU1 | 0.869 (0.002) | 0.789 (0.004) | 0.663 (0.006) | 0.326 (0.012) | 0.759 (0.008) | 0.454 (0.010) |
| | W3K2GRU1 | 0.871 (0.002) | 0.792 (0.003) | 0.669 (0.004) | 0.337 (0.008) | 0.764 (0.006) | 0.468 (0.008) |
| | W3K4GRU1 | 0.871 (0.001) | 0.792 (0.002) | 0.669 (0.003) | 0.337 (0.007) | 0.768 (0.009) | 0.465 (0.006) |
| GRU | W3K3GRU0 | 0.874 (0.002) | 0.797 (0.003) | 0.676 (0.005) | 0.351 (0.009) | 0.777 (0.007) | 0.486 (0.008) |
| | W3K3GRU2 | 0.872 (0.002) | 0.794 (0.003) | 0.671 (0.004) | 0.341 (0.009) | 0.769 (0.008) | 0.464 (0.008) |
| | W3K3GRU3 | 0.869 (0.002) | 0.789 (0.003) | 0.663 (0.005) | 0.326 (0.009) | 0.760 (0.005) | 0.453 (0.010) |
| Proposed | W3K3GRU1 (PMSFF) | **0.881 (0.000)** | **0.808 (0.001)** | **0.694 (0.001)** | **0.388 (0.003)** | **0.800 (0.001)** | **0.514 (0.002)** |
Notes: Attention denotes variants in which the number of window sizes in the attention layer varies; CNN denotes variants in which the number of kernel sizes in the CNN layer varies; GRU denotes variants in which the number of GRU layers varies. W indicates window and K indicates kernel; the number after W, K and GRU gives the number of window sizes, kernel sizes and GRU layers, respectively, with the variability across repeated runs given in parentheses. The last row is our proposed method PMSFF. The best results are given in bold.
Table 4. Dataset-level comparison with competing methods on the NSP448 dataset.
| Methods | SPE | ACC | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| SPPIDER a | 0.870 | 0.781 | 0.198 | 0.071 | 0.517 | 0.159 |
| SPRINT a | 0.873 | 0.781 | 0.183 | 0.057 | 0.570 | 0.167 |
| PSIVER a | 0.874 | 0.783 | 0.191 | 0.066 | 0.581 | 0.170 |
| SPRINGS a | 0.882 | 0.796 | 0.229 | 0.111 | 0.625 | 0.201 |
| LORIS a | 0.887 | 0.805 | 0.263 | 0.151 | 0.656 | 0.228 |
| CRFPPI a | 0.887 | 0.805 | 0.266 | 0.154 | 0.681 | 0.238 |
| SSWRF a | 0.891 | 0.811 | 0.287 | 0.178 | 0.687 | 0.256 |
| SCRIBER a | 0.896 | 0.821 | 0.333 | 0.230 | 0.715 | 0.287 |
| DELPHI b | 0.901 | 0.829 | 0.371 | 0.272 | 0.737 | 0.337 |
| PIPENN b | 0.870 | 0.785 | 0.385 | 0.254 | 0.729 | 0.357 |
| AttCNNPPI b | 0.831 | 0.737 | 0.587 | 0.174 | 0.634 | 0.316 |
| PITHIA b | 0.907 | 0.840 | 0.408 | 0.315 | 0.778 | 0.387 |
| StackingPPINet b | NA | NA | 0.387 | 0.129 | 0.593 | 0.406 |
| EDLMPPI b | 0.922 | 0.858 | 0.464 | 0.383 | 0.820 | 0.460 |
| D-PPIsite b | 0.919 | **0.859** | 0.480 | **0.399** | **0.824** | 0.476 |
| ISPRED-SEQ b | NA | NA | 0.470 | 0.390 | 0.820 | NA |
| EnsePPIs b | NA | 0.821 | 0.385 | 0.291 | 0.770 | 0.354 |
| DeepBSRPred b | 0.630 | NA | 0.380 | 0.260 | 0.740 | 0.340 |
| DeepPRObind c | **0.949** | 0.802 | 0.272 | 0.206 | 0.666 | 0.354 |
| PMSFF | 0.881 | 0.808 | **0.694** | 0.388 | 0.800 | **0.514** |
Notes: a Results reported by SCRIBER [56]. b Results taken from the original published studies. c Results obtained using the web server. The best results are given in bold.
Table 5. Dataset-level comparison with competing methods on the NSP355 dataset.
| Methods | SPE | ACC | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| SPPIDER a | 0.889 | 0.804 | 0.180 | 0.068 | 0.515 | 0.138 |
| SPRINT a | 0.886 | 0.801 | 0.168 | 0.054 | 0.571 | 0.150 |
| PSIVER a | 0.888 | 0.803 | 0.177 | 0.065 | 0.583 | 0.155 |
| SPRINGS a | 0.892 | 0.811 | 0.211 | 0.103 | 0.608 | 0.178 |
| LORIS a | 0.896 | 0.818 | 0.241 | 0.137 | 0.637 | 0.203 |
| CRFPPI a | 0.897 | 0.819 | 0.246 | 0.143 | 0.662 | 0.214 |
| SSWRF a | 0.901 | 0.825 | 0.268 | 0.168 | 0.667 | 0.228 |
| SCRIBER a | 0.908 | 0.838 | 0.322 | 0.230 | 0.719 | 0.275 |
| DLPred a | 0.906 | 0.835 | 0.308 | 0.214 | 0.724 | 0.272 |
| DELPHI a | 0.914 | 0.848 | 0.364 | 0.278 | 0.746 | 0.326 |
| PIPENN c | 0.863 | 0.863 | 0.375 | 0.237 | 0.710 | 0.351 |
| PITHIA b | 0.916 | 0.852 | 0.381 | 0.297 | 0.762 | 0.344 |
| D-PPIsite b | 0.927 | **0.871** | 0.460 | 0.387 | **0.822** | 0.448 |
| ISPRED-SEQ b | NA | NA | 0.460 | **0.390** | 0.820 | NA |
| EnsePPIs b | NA | 0.821 | 0.385 | 0.291 | 0.770 | 0.354 |
| DeepBSRPred c | 0.557 | 0.533 | 0.244 | −0.018 | NA | NA |
| DeepPRObind c | **0.957** | 0.815 | 0.248 | 0.196 | 0.662 | 0.331 |
| PMSFF | 0.890 | 0.819 | **0.499** | 0.389 | 0.805 | **0.502** |
Notes: a Results reported by DELPHI [24]. b Results taken from the original published studies. c Results obtained using the web server. The best results are given in bold.
Table 6. Dataset-level comparison with competing methods on the P70 dataset.
| Methods | PRE | SEN | ACC | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|
| PSIVER a | 0.253 | 0.468 | 0.653 | 0.328 | 0.138 | NA | 0.250 |
| SPPIDER a | 0.209 | 0.459 | 0.622 | 0.287 | 0.089 | NA | 0.230 |
| SPPIDER b | 0.240 | 0.315 | 0.667 | 0.273 | 0.063 | 0.518 | 0.235 |
| SPRINGS a | 0.248 | 0.598 | 0.631 | 0.350 | 0.181 | NA | 0.280 |
| ISIS a | 0.211 | 0.362 | 0.622 | 0.267 | 0.097 | NA | 0.240 |
| RFPPI a | 0.173 | 0.512 | 0.598 | 0.258 | 0.118 | NA | 0.210 |
| IntPred a | 0.247 | 0.508 | 0.672 | 0.332 | 0.165 | NA | NA |
| DeepPPISP a | 0.303 | 0.577 | 0.655 | 0.397 | 0.206 | 0.671 | 0.320 |
| DLPred b | 0.325 | 0.577 | 0.680 | 0.416 | 0.235 | 0.697 | 0.380 |
| ProNA2020 b | 0.297 | 0.229 | **0.741** | 0.258 | 0.106 | NA | NA |
| SCRIBER b | 0.274 | 0.569 | 0.616 | 0.370 | 0.159 | 0.635 | 0.307 |
| DELPHI b | 0.319 | 0.604 | 0.667 | 0.418 | 0.236 | 0.690 | 0.360 |
| EGRET b | 0.358 | 0.561 | 0.715 | 0.438 | 0.270 | 0.719 | 0.405 |
| EnsemPPIS b | 0.375 | 0.532 | 0.732 | 0.440 | 0.277 | 0.719 | 0.405 |
| AttCNNPPI c | 0.587 | 0.635 | 0.536 | 0.610 | 0.216 | 0.690 | 0.338 |
| PMSFF | **0.629** | **0.679** | 0.717 | **0.653** | **0.304** | **0.747** | **0.435** |
Notes: a Results reported by DeepPPISP [80]. b Results reported by EnsemPPIS [63]. c Results taken from the original published studies. The best results are given in bold.