Article

LPBERT: A Protein–Protein Interaction Prediction Method Based on a Pre-Trained Language Model

The School of Computer Science, Xiangtan University, Xiangtan 411105, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3283; https://doi.org/10.3390/app15063283
Submission received: 30 January 2025 / Revised: 28 February 2025 / Accepted: 15 March 2025 / Published: 17 March 2025

Abstract

The prediction of protein–protein interactions is a key task in proteomics. Since protein sequences are easily available and understandable, they have become the primary data source for predicting protein–protein interactions. With the development of natural language processing technology, language models have become a research hotspot in recent years, and protein language models have been developed accordingly. Compared with single-encoding methods, such as Word2Vec and one-hot, language models specifically designed for proteins are expected to extract more comprehensive information from sequences, thereby enhancing the performance of protein–protein interaction prediction methods. Inspired by the protein language model ProteinBERT, this study designed the LPBERT framework, a novel end-to-end deep learning architecture. LPBERT, which is based on ProteinBERT, combines Convolutional Neural Networks, Transformer encoders, and Bidirectional Long Short-Term Memory networks to achieve efficient prediction. Upon evaluation on the BioGRID H. sapiens and S. cerevisiae datasets, LPBERT outperformed the other comparison methods, achieving accuracies of 98.93% and 97.94%, respectively. It also demonstrated good performance on multiple other datasets. These experimental results indicate that LPBERT performs excellently in protein–protein interaction prediction tasks, substantiating the effectiveness of introducing protein language models in this field.

1. Introduction

Proteins play a core role within organisms. They participate in numerous important biological processes in complex biological environments and are a crucial class of biomolecules. Proteins perform a variety of functions, including facilitating reactions [1,2], assisting in signal transduction [3], regulating biological activities [4], and transporting key substances [5,6]. These functions manifest externally as interactions between two or more proteins, known as protein–protein interactions (PPIs). Research on PPIs is instrumental in revealing the operational mechanisms within organisms, understanding disease mechanisms, and developing new drugs. Initially, researchers verified interactions between proteins with traditional experimental methods, such as yeast two-hybrid [7,8] and co-immunoprecipitation [9]. These methods can accurately detect interactions between proteins. However, long experimental cycles, low efficiency, and the risk of experimental failure are major drawbacks of traditional methods. These factors have prompted researchers to seek more efficient and reliable alternatives.
Protein sequences, which represent the primary structure of proteins, contain key information about their composition. They are more readily available and understandable than 3D structures of proteins, amino acid physicochemical properties, and other information. The advancement of high-throughput sequencing technologies [10] has provided a wealth of protein sequence data for PPI research, which has driven the development of numerous protein-sequence-based PPI prediction methods. These methods use protein sequences as the entry point and achieve good results in PPI prediction by combining deep learning techniques. PIPR [11] employed residual Recurrent Convolutional Neural Networks (RCNNs) to capture the latent semantic features of protein sequences, then used different loss functions in a Multilayer Perceptron (MLP) to adapt to different prediction tasks. OR-RCNN [12] used the Skip-Gram method for pre-training protein sequences, employed the RCNN module to encode protein embedding representations, and combined it with an ordinal regression-based classifier to calculate interaction scores. DeepTrio [13] utilized multiple parallel Convolutional Neural Networks (CNNs) to capture the contextual information of protein sequences from multiple scales. Notably, DeepTrio also introduced a single-protein class to distinguish the relative and intrinsic properties. RAPPPID [14] utilized the Sentencepiece [15] algorithm to classify protein sequences into common groupings and encoded the grouped sequences with an AWD-LSTM [16] regularization encoder. This method provided a solution to the generalization problem in PPI prediction tasks. The DCSE [17] applied the concepts of natural language processing (NLP). Initially, each amino acid is treated as a word, and the amino acids are mapped to an N-dimensional vector. Subsequently, DCSE employed two different networks to extract global and local features from the protein sequences. ADH-PPI [18] employed two types of dropout strategies to generate corresponding embedding matrices for sequences. It then utilized CNN and Long Short-Term Memory (LSTM) to extract matrix information and employed an attention mechanism to sift important features. The ECA-PHV [19] integrated five sequence-encoding methods: AAC, DDE, MMI, CT, and GTPC. It combined effective channel attention, a Bidirectional Gated Recurrent Unit (BiGRU), and 1D-CNNs to more effectively predict human–virus PPIs. TAGPPI [20] used a CNN to process sequences and applied AlphaFold [21] to predict the corresponding contact maps of sequences. It further processed the contact maps through a graph attention neural network to extract three-dimensional protein structural features. The model then combined sequence and structural features through weighted addition to generate the final fused feature vector. Similarly, SpatialPPI [22] first generated corresponding complex structures for protein sequences via AlphaFold Multimer [23], and then processed the complex structures via one-hot encoding, volume encoding, or distance encoding. In current research, there are many preliminary processing methods for protein sequences, most of which convert character sequences into corresponding vector representations through methods such as one-hot encoding and word embeddings for subsequent studies. Although these methods have made some progress in sequence processing, they still have limitations in capturing complex semantic relationships and long-distance dependencies. 
For example, one-hot encoding cannot express the semantic relationships between words. While word embedding methods improve on this, their fixed window size limits their ability to handle long-distance dependencies.
The success of language models in the NLP field has proven their potential for natural language tasks. The similarity between protein sequences and natural language has likewise promoted the application of language models in proteomics. TAPE [24], ProtTrans [25], ESM [26], and ProteinBERT [27] have become popular protein language models in recent years. By pre-training on large-scale protein sequence datasets, they can effectively capture the characteristic information of proteins from sequences and provide support for downstream protein tasks. xCAPT5 [28] selected the ProtT5-XL-UniRef50 [29] protein language model to generate encoded representations for protein sequences. It used multi-kernel CNNs to capture complex interaction features at both the micro and macro levels and integrated XGBoost [30] to enhance the predictive performance. Liu et al. [31] performed a similar study by training MindSpore ProteinBERT (MP-BERT), a Transformer-based protein language model, and used transfer learning techniques to construct the MPB-PPI architecture for PPI prediction. Protein sequences closely resemble a natural language in form, and can even be viewed as a unique natural language, which makes applying language models to protein sequences a reasonable choice. The successful application of xCAPT5 and MPB-PPI to PPI prediction demonstrates that this research direction is feasible.
This paper introduces the LPBERT deep learning framework. It utilizes the protein language model ProteinBERT [27] as the encoder to obtain embedding representations from protein sequences. By integrating CNNs, Transformer encoders [32], and Bidirectional Long Short-Term Memory (BiLSTM), LPBERT can further process the embedding representations and enhance the predictive ability of the method. In this work, our main contributions are as follows:
  • First, we introduced ProteinBERT in the field of PPI prediction to obtain rich embedding representations of protein sequences.
  • Second, with ProteinBERT as the core, we designed the Local Convolutional Recurrent Neural Network (LCR) module and the Global Convolutional Transformer encoder (GCT) module to process the extracted embedding representations and construct the LPBERT framework, which achieved satisfactory results in the PPI prediction tasks.
  • Third, we conducted extensive experiments and performance evaluations to demonstrate the superior performance of the LPBERT framework on various PPI prediction benchmark datasets. In addition, we conducted ablation experiments to verify the contribution of the LCR and GCT modules to the overall model performance. Through these experiments, we were able to fully understand the superiority and potential application value of LPBERT in the PPI prediction task.

2. Materials and Methods

2.1. Datasets

In this study, we used multiple public high-confidence benchmark datasets for training and testing: the multi-validated physical interaction datasets H. sapiens and S. cerevisiae produced by Hu et al., who used the CD-HIT tool to remove sequences with an identity greater than 40% and thereby reduce redundancy; a Human–virus independent test set; a Multi-species dataset containing data from three species (C. elegans, E. coli, and D. melanogaster); the Pan standard set; and the Guo standard set [13,28]. Except for the Pan dataset, the ratio of positive to negative samples in all datasets was 1:1. Table 1 shows the detailed information of these datasets.

2.2. Model Architecture

This section focuses on the composition of the LPBERT framework and its PPI prediction implementation process. LPBERT consists of four modules: sequence encoding, deep feature extraction, concatenation, and prediction. In the sequence-encoding module, the pre-trained protein language model ProteinBERT encodes protein sequences to obtain their corresponding global and local representations. In the deep feature extraction module, we used two proposed architectures, the GCT and LCR, to extract high-dimensional features from global and local representations. The outputs of these layers are concatenated, and layer normalization is applied to improve the model’s convergence speed and training stability. Finally, the MLP classifier calculates a binary result. Figure 1 shows the detailed architecture of LPBERT.

2.3. Sequence Encoding

ProteinBERT, developed by Brandes et al. [27], consists of six Transformer-like blocks. It was pre-trained on rich protein sequence data from the UniRef90 [33] dataset and contains approximately 16 M trainable parameters in total. Compared with the full series of ProtTrans [25] models and the ESM-1v [34] model (also pre-trained on the UniRef90 dataset), ProteinBERT has significantly fewer parameters. Even so, as demonstrated by Brandes et al., ProteinBERT achieves comparable or better performance relative to existing protein language models in multiple benchmark tests. After a comprehensive evaluation, ProteinBERT was selected as the foundation for this study.
ProteinBERT takes the original protein sequence $Seq = (s_1, s_2, \ldots, s_n)$ as the input. It generates a tag sequence $Seq' = (\text{<START>}, s_1, s_2, \ldots, s_n, \text{<END>})$ by adding the $\text{<START>}$ and $\text{<END>}$ characters at both ends of the sequence. If the length of the original sequence is less than the specified length, the $\text{<PAD>}$ character will be used for padding automatically. The tag sequence $Seq'$ is then mapped to the corresponding integer vector $Vector$ through an amino acid dictionary $Dict$. $Dict$ contains the mapping relationship between all amino acids, special characters, and integer values, and is responsible for the numerical mapping of the tag sequence. Finally, the pre-trained model encodes $Vector$ to generate two embedded representations, which Brandes et al. called the local representation (character level) and the global representation (sequence level). The above process can be expressed as
$Vector = (Dict(\text{<START>}), Dict(s_1), Dict(s_2), \ldots, Dict(s_n), Dict(\text{<END>}))$
$Local, Global = ProteinBERT(Vector)$
where $Local$ refers to the local representation with a shape of $(L + 2) \times 1562$ and $Global$ refers to the global representation with a shape of 15,599, where the shapes are determined by ProteinBERT. Here, $Vector$ represents the integer vector, and $L$ denotes its length.
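For illustration, the tagging, padding, and dictionary-mapping step described above can be sketched in a few lines of Python. This is a minimal sketch of the logic only; the token set and integer assignments below are hypothetical and do not reproduce ProteinBERT's actual vocabulary.

```python
# Minimal sketch of the tagging, padding, and dictionary-mapping step described above.
# The dictionary below is illustrative only; ProteinBERT defines its own vocabulary.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<PAD>", "<START>", "<END>", "<OTHER>"]

# Hypothetical mapping: special tokens first, then the 20 standard amino acids.
DICT = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}

def encode_sequence(seq: str, max_len: int = 1500) -> list[int]:
    """Map a raw protein sequence to the integer vector fed to the encoder."""
    tokens = ["<START>"] + list(seq[:max_len]) + ["<END>"]
    # Pad to a fixed length (max_len residues plus the two boundary tokens).
    tokens += ["<PAD>"] * (max_len + 2 - len(tokens))
    return [DICT.get(t, DICT["<OTHER>"]) for t in tokens]

vector = encode_sequence("MKTAYIAKQR")
print(vector[:13])  # <START>, ten residues, <END>, then the first padding token
```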
Compared with using single methods, such as Word2Vec [35], Seq2Vec [36], and one-hot, to encode sequences and obtain protein-embedding representations, language models that are specifically designed for protein sequence tasks and pre-trained on large-scale protein datasets are more advantageous.

2.4. Deep Feature Extraction

The local representation generated by ProteinBERT for sequences has dimensions of ( L + 2 ) × 1562 . In protein-sequence-based computational methods, input sequences are typically at least 1000 residues [13,14,28], or even longer. Processing large-scale training datasets with such high-dimensional features requires significant storage space and GPU memory. To address this limitation, we applied a global max pooling layer to the local representation, which effectively reduced both the data dimensionality and model parameters. LPBERT extracts high-dimensional features from the local representation using the LCR block, which consists of CNNs, Batch Normalizations (BNs), max poolings, and BiLSTMs. In the LCR architecture, we selected BiLSTM as the core component, which is effective at capturing dependencies in sequences. We combined it with CNNs (kernel size 3, stride 2), BN layers, and max pooling layers (stride 2) to form a fusion module. Additionally, we used LeakyReLU to help the model learn complex features. To alleviate potential gradient vanishing and improve LPBERT’s performance, we established residual connections before and after the BiLSTM layer to enhance the training stability. The functions of each module are as follows:
  • CNN: extracts deep features from the embedding representation and reduces the data dimension.
  • BN: normalizes the output of the previous layer to enhance the generalizability of the model [37].
  • Max pooling: retains the most important features and reduces the data dimension.
  • BiLSTM: a variant of the Recurrent Neural Network (RNN) that captures contextual dependencies and enhances the model’s capability to express sequence data [38].
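To make the LCR description concrete, a minimal Keras sketch is given below (see also Figure 2). The kernel sizes and strides follow the text; the filter counts, the BiLSTM unit size, and the treatment of the pooled local representation as a single-channel 1D signal are our own illustrative assumptions rather than the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def lcr_block(x, filters=64):
    """Sketch of the LCR block: CNN + BN + max pooling feeding a residual BiLSTM."""
    # Convolutional fusion module: extract local features and reduce the length.
    x = layers.Conv1D(filters, kernel_size=3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)

    # Residual connection around the BiLSTM (units = filters // 2, so that the
    # bidirectional output matches the channel dimension of the shortcut).
    shortcut = x
    x = layers.Bidirectional(layers.LSTM(filters // 2, return_sequences=True))(x)
    x = layers.Add()([shortcut, x])
    return layers.Flatten()(x)

# Example: the pooled local representation treated as a 1562-step, single-channel signal
# (an assumption about how the pooled vector is fed to the block).
local_in = layers.Input(shape=(1562, 1))
lcr_out = lcr_block(local_in)
tf.keras.Model(local_in, lcr_out).summary()
```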
LPBERT processes the global representation using the GCT block, which consists of CNNs, BNs, max poolings, and Transformer encoders. In the GCT architecture, due to the relatively large dimension of global representation, we first reduced the dimensionality and computational complexity using a 1D-CNN (kernel size 7, stride 2). This was followed by a BN layer for increased regularization and LeakyReLU activation to enhance the model’s nonlinear expression capability. The GCT then further extracted features through two combined modules, with each consisting of CNNs (kernel size 3, stride 2), max pooling layers (stride 2), and Transformer encoders. The Transformer encoder processes sequence information at different levels through a multi-head attention mechanism, allocating higher attention weights to essential features for comprehensive sequence representation. The functions of the remaining layers are consistent with those described in the LCR section.
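A comparable sketch of the GCT block follows. Because tf.keras has no single built-in Transformer encoder layer, the encoder sub-block is written out with MultiHeadAttention, layer normalization, and a small feed-forward network; the head count, feed-forward width, and filter sizes are illustrative assumptions guided by the description above and Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(x, num_heads=4, ff_dim=512):
    """Minimal Transformer encoder sub-block: self-attention and FFN, each with a residual."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=x.shape[-1])(x, x)
    x = layers.LayerNormalization()(x + attn)
    ffn = layers.Dense(ff_dim, activation="relu")(x)
    ffn = layers.Dense(x.shape[-1])(ffn)
    return layers.LayerNormalization()(x + ffn)

def gct_block(x, filters=64):
    """Sketch of the GCT block: initial reduction, then two CNN + pooling + Transformer stages."""
    # Initial dimensionality reduction: wide kernel, stride 2, BN, and LeakyReLU.
    x = layers.Conv1D(filters, kernel_size=7, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)

    # Two combined stages of Conv1D (kernel 3, stride 2), max pooling (stride 2),
    # and a Transformer encoder.
    for _ in range(2):
        x = layers.Conv1D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
        x = transformer_encoder(x)
    return layers.Flatten()(x)

# Example: the 15,599-dimensional global representation treated as a single-channel signal
# (again an assumption about the input layout).
global_in = layers.Input(shape=(15599, 1))
gct_out = gct_block(global_in)
tf.keras.Model(global_in, gct_out).summary()
```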
The detailed architectures of the GCT and LCR blocks are shown in Figure 2.

2.5. Classifier

In the prediction phase, we first concatenated the GCT and LCR block outputs from both proteins to form the final embedding representation. Layer normalization was then applied to enhance the model training efficiency and stability. Finally, the classifier computed the prediction result. The process is as follows:
$E_{AB} = \mathrm{LayerNorm}(\mathrm{concat}(\mathrm{flat}(G_A), \mathrm{flat}(G_B), \mathrm{flat}(L_A), \mathrm{flat}(L_B)))$
where $E_{AB}$ is the final embedding representation, $G$ represents the global representation, and $L$ represents the local representation.
In this work, we used an MLP as the final classifier, which consisted of multiple dense layers with dropout layers. Dropout was applied after each dense layer (except the final one) to improve model generalization and reduce overfitting. The last dense layer used a softmax function to map values to probabilities between 0 and 1, which output binary probabilities that indicate the likelihood of protein interaction and non-interaction. The prediction process is as follows:
$Output_k = \begin{cases} \hat{P}_k = \mathrm{MLP}_{\mathrm{Dense(softmax)}}(E_k), & label = 0 \\ 1 - \hat{P}_k, & label = 1 \end{cases}$
where $k$ represents the $k$-th protein pair; $E_k$ is the input to the last dense layer; $label$ takes the value of 0 or 1, which represents the negative and positive sample classes, respectively; and $\hat{P}_k$ represents the probability predicted by the model that there is no interaction between the proteins.
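To make the concatenation and classification steps concrete, the sketch below wires four flattened feature vectors (the GCT and LCR outputs of proteins A and B) through layer normalization and a small softmax MLP, mirroring the two equations above. The hidden-layer sizes, the dropout rate, and the placeholder input dimensions are illustrative assumptions.

```python
from tensorflow.keras import layers, Model

def build_classifier(flat_features, hidden_units=(512, 128), dropout_rate=0.1):
    """Sketch of the prediction head: concatenation, layer normalization, softmax MLP."""
    # E_AB = LayerNorm(concat(flat(G_A), flat(G_B), flat(L_A), flat(L_B)))
    x = layers.Concatenate()(flat_features)
    x = layers.LayerNormalization()(x)
    for units in hidden_units:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(dropout_rate)(x)
    # Two-way softmax: probabilities of non-interaction (label 0) and interaction (label 1).
    return layers.Dense(2, activation="softmax")(x)

# Example wiring with placeholder flattened feature sizes for proteins A and B.
g_a, g_b = layers.Input(shape=(1024,)), layers.Input(shape=(1024,))
l_a, l_b = layers.Input(shape=(2048,)), layers.Input(shape=(2048,))
outputs = build_classifier([g_a, g_b, l_a, l_b])
Model([g_a, g_b, l_a, l_b], outputs).summary()
```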

2.6. Hyperparameter Optimization

To optimize the model performance, we conducted experiments with various hyperparameters, including the learning rate, optimizer selection, number of attention heads in the Transformer encoder, number of hidden layers in the Feedforward Neural Network (FNN), and dropout rate. We employed Bayesian optimization to search for optimal parameter combinations within a continuous range. This method identified the most effective parameter values, which were then mapped to practical values based on the nearest neighbor principle. Table 2 shows the search ranges and final values for each hyperparameter.
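The search-then-map procedure can be illustrated as follows. The paper does not name its Bayesian optimization implementation, so scikit-optimize's gp_minimize is used here purely as a stand-in; the candidate grids, the objective, and the call budget are placeholders.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

# Assumed candidate grids for parameters that must take discrete, practical values.
HEAD_CHOICES = [2, 4, 8]
FNN_CHOICES = [64, 128, 256, 512]
OPTIMIZERS = ["Adam", "SGD"]

def nearest(value, choices):
    """Map a continuous suggestion to the nearest practical value."""
    return min(choices, key=lambda c: abs(c - value))

def objective(params):
    lr, opt_code, heads, fnn, dropout = params
    config = {
        "learning_rate": lr,
        "optimizer": OPTIMIZERS[int(round(opt_code))],
        "num_heads": nearest(heads, HEAD_CHOICES),
        "fnn_units": nearest(fnn, FNN_CHOICES),
        "dropout": round(dropout, 1),
    }
    # Placeholder: train LPBERT with `config` and return 1 - validation accuracy.
    return 1.0 - np.random.uniform(0.95, 0.99)

search_space = [
    Real(1e-4, 1e-3, name="learning_rate"),
    Real(0.0, 1.0, name="optimizer"),   # 0 -> Adam, 1 -> SGD
    Real(2.0, 8.0, name="num_heads"),
    Real(64.0, 512.0, name="fnn_units"),
    Real(0.0, 0.3, name="dropout"),
]

result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print("Best continuous point found:", result.x)
```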

2.7. Evaluation Metrics

To objectively evaluate LPBERT’s predictive performance, we employed six evaluation metrics: accuracy, precision, recall, specificity, F1-score, and Matthews correlation coefficient (MCC). Accuracy measures the model’s overall performance, recall indicates its ability to identify positive samples, specificity reflects its ability to identify negative samples, and MCC quantifies its binary classification performance. The formulas for these standard evaluation metrics are as follows:
$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$
$Precision = \frac{TP}{TP + FP}$
$Specificity = \frac{TN}{FP + TN}$
$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
TP (True Positive) denotes the number of correctly predicted positive samples where the true class is positive. FP (False Positive) denotes the number of incorrectly predicted positive samples where the true class is negative. TN (True Negative) denotes the number of correctly predicted negative samples where the true class is negative. FN (False Negative) denotes the number of incorrectly predicted negative samples where the true class is positive.
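For reference, the metrics above can be computed directly from the confusion-matrix counts with a small helper such as the following (plain Python; it assumes all denominators are non-zero).

```python
import math

def ppi_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the evaluation metrics used in this study from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (fp + tn),
        "f1_score": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Example with made-up counts.
print(ppi_metrics(tp=480, fp=20, tn=470, fn=30))
```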

3. Results

3.1. Implementation Details

The proposed method was implemented based on the TensorFlow deep learning framework. All hyperparameter configurations used in the experiments are presented in Table 2. All experiments and evaluations were conducted on a high-performance computing platform equipped with NVIDIA A100-SXM4-80GB GPU resources (NVIDIA, Santa Clara, CA, USA).

3.2. Analysis of Sequence Length Parameters

To evaluate LPBERT’s performance on varying protein sequence lengths, we analyzed the length distribution of sequences from two BioGRID [39] datasets previously used in the DeepTrio study. The H. sapiens dataset comprised 38,869 sequences, and the S. cerevisiae dataset contained 17,015 sequences. The sequence length distributions are shown in Figure 3.
The scatter plot in Figure 3 shows the distribution of the sequence lengths, where most protein sequences fell within the range of 0 to 1500 residues. Based on this distribution, we evaluated LPBERT’s performance using sequence lengths of 500, 1000, and 1500 on both the H. sapiens and S. cerevisiae datasets from BioGRID. The comparative results are presented in Table 3.
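A simple way to reproduce this kind of coverage check is sketched below: it counts how many sequences would fit entirely within each candidate input length. The FASTA file name and the plain-FASTA parsing are hypothetical; the candidate cutoffs follow the text.

```python
# Sketch: what fraction of sequences is fully covered by each candidate input length?
# The file name and plain-FASTA parsing below are illustrative assumptions.

def read_fasta_lengths(path: str) -> list[int]:
    lengths, current = [], 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

lengths = read_fasta_lengths("biogrid_hsapiens.fasta")
for cutoff in (500, 1000, 1500):
    covered = sum(length <= cutoff for length in lengths) / len(lengths)
    print(f"length <= {cutoff}: {covered:.1%} of sequences")
```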
Table 3 shows that LPBERT achieved its best overall performance on both datasets with a sequence length of 1500. Although the precision and specificity peaked at a sequence length of 1000 for the S. cerevisiae dataset, the performance difference compared with the length of 1500 was negligible. Given that the protein sequence length used by RAPPPID and DeepTrio was 1500, we selected this as the input length for LPBERT.

3.3. Comparative Experiment

To evaluate the prediction performance of LPBERT against existing mainstream protein-sequence-based computational methods, this study compared LPBERT with nine protein-sequence-based computational methods that emerged in recent years; the experimental results are discussed in detail in this section.
We reproduced three of these methods: PIPR, RAPPPID, and DeepTrio. To compare their performance with LPBERT, we set aside part of the original dataset as an independent test set and used the remaining data for training and validation. To ensure experimental objectivity and reliability, all methods were trained and tested on identical datasets using the parameters specified in their respective original papers. Table 4 and Table 5 present the comparative performance results on the BioGRID test set.
Table 4 and Table 5 show that all methods performed effectively regarding PPI prediction using the BioGRID dataset, where they achieved high scores across multiple metrics. LPBERT, which leverages a pre-trained protein language model for sequence embedding extraction, outperformed all other methods across all metrics, with DeepTrio ranking second in overall performance. On the H. sapiens dataset, LPBERT achieved an accuracy 3.24% higher than DeepTrio, while on the S. cerevisiae dataset, LPBERT’s accuracy exceeded DeepTrio’s by 5.17%. These results demonstrate LPBERT’s effectiveness in PPI prediction.
At the same time, we conducted a comparative analysis to evaluate the performance differences between the LPBERT and PPI prediction methods based on Word2Vec and Seq2Vec for sequence encoding, as well as multi-feature fusion PPI prediction methods that integrate multi-dimensional information, such as protein sequence physicochemical properties and amino acid sequences. The results of the comparative experiment are shown in Table 6.
The experimental results in Table 6 show that the accuracy of LPBERT was 0.05% higher than that of DeepFE-PPI [40] and 3.57% higher than that of HNSPPI [41]. This result shows that the pre-trained protein language model had a slight advantage over single-encoding methods, such as Word2Vec and Seq2Vec. However, LPBERT showed performance limitations when compared with PPI prediction methods that integrate multiple sequence processing technologies. Both SDNN-PPI [42] and DeepCF-PPI [43], which combine multiple dimensions of sequence information, demonstrated superior performance. SDNN-PPI utilizes traditional sequence feature representations, while DeepCF-PPI enhances this approach by incorporating Word2Vec. In comparative evaluations, LPBERT’s accuracy fell short by 0.65% compared with SDNN-PPI and by 0.56% compared with DeepCF-PPI. We attribute this result to two possible reasons. First, although Brandes et al. pre-trained ProteinBERT on a large-scale protein dataset, it was mainly tested on tasks such as protein structure or post-translational modifications and did not directly involve PPI prediction tasks. Therefore, ProteinBERT may not have fully captured the feature patterns relevant to PPIs. Fine-tuning ProteinBERT for PPI tasks may be an effective strategy to improve performance. Second, after ProteinBERT extracts the feature representations of protein sequences, the GCT and LCR in the deep feature extraction module may not fully express the features that are beneficial to the PPI task, which limits the performance of LPBERT.
We compared LPBERT with two other protein-language-model-based PPI prediction methods: xCAPT5 and MPB-PPI. As shown in Table 7, LPBERT outperformed MPB-PPI across all the evaluation metrics on multiple datasets. When compared with xCAPT5 on the Pan dataset, LPBERT’s accuracy was 1.04% lower, with similar patterns observed across the other metrics. Although LPBERT showed a slightly lower performance than xCAPT5, it offered significant advantages in computational efficiency. xCAPT5 is recommended on platforms with 80 GB GPU memory and 120 GB RAM, whereas LPBERT works efficiently with just 18 GB GPU memory and 3 GB RAM.

3.4. Ablation Experiment

To assess LPBERT’s component effectiveness, we performed ablation experiments on the BioGRID H. sapiens dataset by systematically removing and replacing specific modules. Table 8 presents the detailed experimental results. The model’s performance metrics declined to varying degrees when either the GCT or LCR module was removed, which demonstrated the essential role of both modules in LPBERT’s architecture. In addition, we replaced the GCT with a Transformer and the LCR with CNN blocks in our experiments. This allowed us to compare the performance differences between the original modules (GCT and LCR) and the conventional architectures. The experimental results of ‘rp GCT with Trans’ and ‘rp LCR with CNNs’ in Table 8 show the advantages of the GCT and LCR. To evaluate the contribution of individual components within the GCT and LCR to LPBERT’s performance, we conducted experiments by separately removing the Transformer encoder from the GCT and the BiLSTM from the LCR. While removing either component increased the recall, all other metrics showed degradation. Removing both components simultaneously decreased all performance metrics, which demonstrated the essential roles of both the Transformer encoder and BiLSTM. Further experiments that replaced BiLSTM with an alternative RNN variant also resulted in performance degradation, which confirmed that the Transformer encoder–BiLSTM combination provided optimal performance for LPBERT.
We also examined performance changes when replacing ProteinBERT with other current pre-trained protein language models. After an in-depth analysis of the experimental results from applying different protein language models to the LPBERT model, we observed that LPBERT showed the best performance when using the ProtT5 model. In contrast, ProteinBERT’s performance ranking was relatively low, where it performed only slightly better than the TAPE model. However, considering that ProteinBERT’s parameter size is significantly smaller than other language models (except TAPE), its accuracy only lagged behind by about 1%. This result is basically consistent with the conclusion reached by Brandes et al. in other benchmark tests: ProteinBERT’s performance was comparable with or exceeded that of existing protein language models with larger parameter sizes. The experimental data are presented in Table 9.

3.5. Sequence Similarity Analysis

In this section, we evaluate the performance of LPBERT under different sequence similarity constraints to assess its generalization capability within the same species. First, we selected the BioGRID S. cerevisiae dataset as the raw data and used the CD-HIT tool to process it, generating experimental data with varying sequence similarities. We then used a Python (version: 3.9) script to split the data into a training–validation dataset and an independent test dataset to ensure that the proteins in the test dataset were completely independent of those in the training–validation dataset. During the experiment, we trained LPBERT on the training–validation dataset using five-fold cross-validation and evaluated it on the independent test dataset. We chose the accuracy, precision, and F1-score as the evaluation metrics for the model. The specific test results are shown in Table 10.
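The protein-disjoint split described here can be sketched as follows: proteins are partitioned first, and a pair is kept for the independent test set only if both of its proteins are held out of training. This is our reading of the procedure; the data layout (a list of (protein A, protein B, label) tuples) and the handling of mixed pairs are assumptions.

```python
import random

def protein_disjoint_split(pairs, test_fraction=0.1, seed=42):
    """Split PPI pairs so that test-set proteins never appear in the training set."""
    rng = random.Random(seed)
    proteins = sorted({p for a, b, _ in pairs for p in (a, b)})
    rng.shuffle(proteins)
    n_test = int(len(proteins) * test_fraction)
    test_proteins = set(proteins[:n_test])

    train, test = [], []
    for a, b, label in pairs:
        if a in test_proteins and b in test_proteins:
            test.append((a, b, label))    # both proteins unseen during training
        elif a not in test_proteins and b not in test_proteins:
            train.append((a, b, label))
        # Pairs mixing held-out and training proteins are discarded to keep the sets disjoint.
    return train, test

# Toy example.
pairs = [("P1", "P2", 1), ("P2", "P3", 0), ("P4", "P5", 1), ("P5", "P6", 0)]
train, test = protein_disjoint_split(pairs, test_fraction=0.5)
print(len(train), "training pairs,", len(test), "test pairs")
```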
As shown in Table 10, the datasets with different sequence similarity levels exhibited some variability. This was mainly due to the filtering of sequence data by the CD-HIT tool based on the set sequence similarity thresholds. A closer examination of the data in Table 10 reveals that although LPBERT’s performance had slightly decreased compared with the 97.94% accuracy achieved in the testing environment of Table 5, under the current more stringent testing conditions, its accuracy remained stable around 93% to 94%, with minimal fluctuation. This indicates that LPBERT maintained reliable predictive performance. The results suggest that even under varying sequence similarities, LPBERT retained good generalization capability and stability within the same species.

3.6. Cross-Species Generalization Analysis

The above work evaluated the performance of LPBERT when trained and tested within the same species. To fully explore the abilities of LPBERT, we trained it on the BioGRID H. sapiens dataset and then tested it on datasets from other species. The testing scope included five datasets: BioGRID S. cerevisiae, S. cerevisiae core (DeepFE), Guo, Multi-species, and Human–virus. In all the datasets, the ratio of positive to negative samples was approximately 1:1. By isolating the data by species, we evaluated the predictive ability of the model beyond its training data. This validation provided a reliable reference for analyzing the generalization capability of the model in cross-species prediction and helped assess its ability to make inferences on unknown data from existing data. We selected accuracy, precision, and F1-score as the evaluation metrics. Table 11 presents the specific test results.
The experimental results demonstrate that LPBERT exhibited outstanding performance on both the BioGRID S. cerevisiae and Human–virus datasets, where all metrics achieved high levels. On the BioGRID S. cerevisiae dataset in particular, although there was a slight decline compared with the results obtained from the same-species training and testing, all the metrics remained above 90%. However, on the other three datasets, LPBERT’s performance was less satisfactory, with the highest accuracy reaching only 51.82%. These results indicate that LPBERT encountered difficulties in the cross-species data prediction, as the learned feature patterns failed to support accurate inferences on unknown data. Overall, LPBERT’s cross-species generalization ability appeared to be limited, suggesting room for improvement.

4. Discussion

In this work, we proposed a novel deep learning method, LPBERT, to address the challenges of PPI prediction in proteomics. It is a computational method that relies solely on protein sequences for prediction. As the primary structure of proteins, protein sequences consist of various amino acids arranged in a specific order. These sequences inherently contain rich and critical information about proteins. By utilizing this information, researchers achieved remarkable results in multiple protein-related fields that drove the development of biomedical technology. The DeepMind team developed AlphaFold2 [21], a method based solely on amino acid sequences that achieved the accurate prediction of protein three-dimensional structures at the atomic precision level. Its prediction accuracy led the field in the 14th Critical Assessment of Protein Structure Prediction (CASP14) [44] competition, where it significantly outperformed other methods. This result confirms that deep and interpretable information is indeed embedded within protein sequences.
Earlier research employed various methods to extract features from protein sequences efficiently. For instance, Auto Covariance (AC) [42] and the Position-Specific Scoring Matrix (PSSM) [45] were used to extract information about the physicochemical properties and evolutionary characteristics of proteins. DeepFE-PPI [40] focused on the semantic relationships between amino acids using Word2Vec, while HNSPPI [41] concentrated on extracting structural features from protein sequences using Seq2Vec. While these methods successfully extracted features from protein sequences, they typically focus on only one aspect of the sequence, which leads to a limited understanding of the overall information. In practice, researchers overcame this limitation by combining multiple feature extraction techniques to complement each other and obtain a more comprehensive view of protein sequence information. This strategy is effective, but its technical implementation tends to be more complex. Researchers are required to invest time and effort into selecting suitable feature extraction methods, seeking the optimal combination among various methods, and avoiding redundant extraction of the same feature by different methods. These factors directly affect the final predictive performance of deep learning-based methods. Therefore, exploring a method that can comprehensively capture protein sequence information while simplifying the feature extraction process has become a new research objective. Protein language models, pre-trained on large-scale datasets and specifically designed for protein-related tasks, provide a viable solution. LPBERT is an effort based on this understanding. It utilized the protein language model to directly obtain the initial embedding representations of protein sequence pairs, which laid a solid foundation for subsequent computations within the framework. This feature extraction process is significantly simpler than other methods, and the extracted protein information is relatively rich and comprehensive. The test results of LPBERT on multiple datasets in this study validated this conclusion and demonstrated the superior performance and good generalizability of LPBERT in PPI prediction tasks.
Although LPBERT is at a disadvantage in performance compared with xCAPT5, it surpasses xCAPT5 in computational resource savings. In addition to the RAM resources mentioned above, we also compared the two methods in terms of storage consumption. When processing the 8347 protein sequences of the Pan dataset, the resulting file of xCAPT5 was approximately 17 GB, whereas LPBERT required only approximately 550 MB. Clearly, when facing large datasets, the resource threshold for running xCAPT5 is significantly higher than that of LPBERT. LPBERT consumes several times fewer computational resources than xCAPT5, at the cost of a gap of only 1% to 2% in the performance evaluation metrics. An analysis of this phenomenon shows that the main difference lies in how the protein embedding representations extracted from the pre-trained protein language models are processed. xCAPT5 fully preserves the extracted embedding representations, whereas LPBERT employs global max pooling to reduce the dimensionality of the local embedding representations. Although this operation may lose some feature information, the resulting savings in computing resources make pooling clearly the better choice; it also significantly reduces the number of model parameters and the overall consumption of computational resources.
In this study on PPI prediction methods, LPBERT demonstrated excellent predictive ability and showed outstanding performance on multiple public datasets. However, it still has certain limitations. We observed that the predictive performance of LPBERT was relatively insufficient on small datasets. Taking the S. cerevisiae core dataset in Table 7 as an example, which contains only 10,537 samples, LPBERT’s performance on this dataset showed a noticeable gap compared with the other datasets. This gap indicates that the predictive ability of LPBERT relies to some extent on the size of the dataset. Furthermore, the comparison of LPBERT’s performance with xCAPT5, SDNN-PPI, and DeepCF-PPI indicates that there is still room for improvement. Moreover, when LPBERT is applied to cross-species prediction scenarios, its predictive ability appears somewhat inadequate. This reveals that its ability to infer unknown data from known data needs improvement, which provides a valuable reference direction for future researchers. Although mining the deep information contained in protein sequences alone to achieve higher-performance PPI prediction is a valuable research direction, modeling only from the perspective of sequences is not enough, and multimodal information fusion is a feasible direction for improvement. To express protein characteristics more fully and achieve higher-performance PPI prediction, future work could further integrate protein structure, function, and other information on top of protein sequences to construct a multimodal pre-trained protein language model. In this way, the generalization ability and feature expression ability of the pre-trained protein language model can be further improved. At the same time, minimizing the computational resource demands of the protein language model will also help promote its application in downstream protein tasks and provide more powerful tools for biomedical research.

5. Conclusions

This paper introduces LPBERT, an end-to-end deep learning computational method that efficiently predicts PPIs solely using protein sequences. It centers on the pre-trained protein language model ProteinBERT to capture embedding representations of sequences. The proposed GCT and LCR architectures process the extracted global and local representations, respectively, helping to extract deeper sequence information and thereby enhancing the predictive performance of the method. An evaluation across multiple datasets demonstrated that LPBERT achieves excellent performance in PPI prediction, marking a notable advance in computational methods for PPI prediction using only protein sequences. These findings will contribute to understanding unknown protein functions and aid in the development of new drugs. Additionally, this work serves as a valuable reference for similar computational methods in the future.

Author Contributions

Conceptualization, A.H. and L.K.; investigation, A.H.; data curation, A.H.; methodology, A.H.; software, A.H.; writing—original draft preparation, A.H.; resources, L.K.; writing—review and editing, L.K. and D.Y.; supervision, L.K.; validation, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

LPBERT is available at https://github.com/yelou2022/LPBERT, accessed on 1 March 2025.

Acknowledgments

We are grateful for the technical support from the High Performance Computing Platform of Xiangtan University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, Y.; Zhang, Y.; Dong, Y.; Akakuru, O.U.; Yao, X.; Yi, J.; Li, X.; Wang, L.; Lou, X.; Zhu, B.; et al. Ablation of gap junction protein improves the efficiency of nanozyme-mediated catalytic/starvation/mild-temperature photothermal therapy. Adv. Mater. 2023, 35, 2210464. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, H.; Zhao, S.; Xia, X.; Liu, J.; Sun, F.; Kong, B. Interaction of the extracellular protease from Staphylococcus xylosus with meat proteins elucidated via spectroscopic and molecular docking. Food Chem. X 2024, 21, 101204. [Google Scholar] [CrossRef] [PubMed]
  3. Essandoh, K.; Teuber, J.P.; Brody, M.J. Regulation of cardiomyocyte intracellular trafficking and signal transduction by protein palmitoylation. Biochem. Soc. Trans. 2024, 52, 41–53. [Google Scholar] [CrossRef]
  4. Sun, X.; Xie, Y.; Xu, K.; Li, J. Regulatory networks of the F-box protein FBX206 and OVATE family proteins modulate brassinosteroid biosynthesis to regulate grain size and yield in rice. J. Exp. Bot. 2024, 75, 789–801. [Google Scholar] [CrossRef]
  5. Huang, J.; Ecker, G.F. A structure-based view on ABC-transporter linked to multidrug resistance. Molecules 2023, 28, 495. [Google Scholar] [CrossRef]
  6. Hoogstraten, C.A.; Schirris, T.J.; Russel, F.G. Unlocking mitochondrial drug targets: The importance of mitochondrial transport proteins. Acta Physiol. 2024, 240, e14150. [Google Scholar] [CrossRef]
  7. Sato, T.; Hanada, M.; Bodrug, S.; Irie, S.; Iwama, N.; Boise, L.H.; Thompson, C.B.; Golemis, E.; Fong, L.; Wang, H.G. Interactions among members of the Bcl-2 protein family analyzed with a yeast two-hybrid system. Proc. Natl. Acad. Sci. USA 1994, 91, 9238–9242. [Google Scholar] [CrossRef]
  8. Çağlayan, E.; Turan, K. An in silico prediction of interaction models of influenza A virus PA and human C14orf166 protein from yeast-two-hybrid screening data. Proteins Struct. Funct. Bioinform. 2023, 91, 1235–1244. [Google Scholar] [CrossRef]
  9. Free, R.B.; Hazelwood, L.A.; Sibley, D.R. Identifying novel protein-protein interactions using co-immunoprecipitation and mass spectroscopy. Curr. Protoc. Neurosci. 2009, 46, 5–28. [Google Scholar] [CrossRef]
  10. Floyd, B.M.; Marcotte, E.M. Protein sequencing, one molecule at a time. Annu. Rev. Biophys. 2022, 51, 181–200. [Google Scholar] [CrossRef]
  11. Chen, M.; Ju, C.J.T.; Zhou, G.; Chen, X.; Zhang, T.; Chang, K.W.; Zaniolo, C.; Wang, W. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 2019, 35, i305–i314. [Google Scholar] [CrossRef] [PubMed]
  12. Xu, W.; Gao, Y.; Wang, Y.; Guan, J. Protein–protein interaction prediction based on ordinal regression and recurrent convolutional neural networks. BMC Bioinform. 2021, 22, 485. [Google Scholar] [CrossRef] [PubMed]
  13. Hu, X.; Feng, C.; Zhou, Y.; Harrison, A.; Chen, M. DeepTrio: A ternary prediction system for protein–protein interaction using mask multiple parallel convolutional neural networks. Bioinformatics 2022, 38, 694–702. [Google Scholar] [CrossRef] [PubMed]
  14. Szymborski, J.; Emad, A. RAPPPID: Towards generalizable protein interaction prediction with AWD-LSTM twin networks. Bioinformatics 2022, 38, 3958–3967. [Google Scholar] [CrossRef]
  15. Kudo, T. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
  16. Merity, S.; Keskar, N.S.; Socher, R. Regularizing and optimizing LSTM language models. arXiv 2017, arXiv:1708.02182. [Google Scholar]
  17. Chen, W.; Wang, S.; Song, T.; Li, X.; Han, P.; Gao, C. DCSE: Double-Channel-Siamese-Ensemble model for protein protein interaction prediction. BMC Genom. 2022, 23, 555. [Google Scholar] [CrossRef]
  18. Asim, M.N.; Ibrahim, M.A.; Malik, M.I.; Dengel, A.; Ahmed, S. ADH-PPI: An attention-based deep hybrid model for protein-protein interaction prediction. Iscience 2022, 25, 105169. [Google Scholar] [CrossRef]
  19. Wang, M.; Lai, J.; Jia, J.; Xu, F.; Zhou, H.; Yu, B. ECA-PHV: Predicting human-virus protein-protein interactions through an interpretable model of effective channel attention mechanism. Chemom. Intell. Lab. Syst. 2024, 247, 105103. [Google Scholar] [CrossRef]
  20. Song, B.; Luo, X.; Luo, X.; Liu, Y.; Niu, Z.; Zeng, X. Learning spatial structures of proteins improves protein–protein interaction prediction. Briefings Bioinform. 2022, 23, bbab558. [Google Scholar] [CrossRef]
  21. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
  22. Hu, W.; Ohue, M. SpatialPPI: Three-dimensional space protein-protein interaction prediction with AlphaFold Multimer. Comput. Struct. Biotechnol. J. 2024, 23, 1214–1225. [Google Scholar] [CrossRef] [PubMed]
  23. Evans, R.; O’Neill, M.; Pritzel, A.; Antropova, N.; Senior, A.; Green, T.; Žídek, A.; Bates, R.; Blackwell, S.; Yim, J.; et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv 2021. [Google Scholar] [CrossRef]
  24. Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, P.; Canny, J.; Abbeel, P.; Song, Y. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 2019, 32, 9689–9701. [Google Scholar]
  25. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7112–7127. [Google Scholar] [CrossRef]
  26. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef]
  27. Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; Linial, M. ProteinBERT: A universal deep-learning model of protein sequence and function. Bioinformatics 2022, 38, 2102–2110. [Google Scholar] [CrossRef]
  28. Dang, T.H.; Vu, T.A. xCAPT5: Protein–protein interaction prediction using deep and wide multi-kernel pooling convolutional neural networks with protein language model. BMC Bioinform. 2024, 25, 106. [Google Scholar] [CrossRef]
  29. Elnaggar, A.; Ding, W.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Severini, S.; Matthes, F.; Rost, B. Codetrans: Towards cracking the language of silicon’s code through self-supervised deep learning and high performance computing. arXiv 2021, arXiv:2104.02443. [Google Scholar]
  30. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  31. Liu, T.; Gao, H.; Ren, X.; Xu, G.; Liu, B.; Wu, N.; Luo, H.; Wang, Y.; Tu, T.; Yao, B.; et al. Protein–protein interaction and site prediction using transfer learning. Briefings Bioinform. 2023, 24, bbad376. [Google Scholar] [CrossRef]
  32. Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  33. Boutet, E.; Lieberherr, D.; Tognolli, M.; Schneider, M.; Bansal, P.; Bridge, A.J.; Poux, S.; Bougueleret, L.; Xenarios, I. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: How to use the entry view. In Plant Bioinformatics: Methods and Protocols; Humana Press: New York, NY, USA, 2016; pp. 23–54. [Google Scholar]
  34. Meier, J.; Rao, R.; Verkuil, R.; Liu, J.; Sercu, T.; Rives, A. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 2021, 34, 29287–29303. [Google Scholar]
  35. Mikolov, T. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  36. Kim, H.J.; Hong, S.E.; Cha, K.J. seq2vec: Analyzing sequential data using multi-rank embedding vectors. Electron. Commer. Res. Appl. 2020, 43, 101003. [Google Scholar] [CrossRef]
  37. Bjorck, N.; Gomes, C.P.; Selman, B.; Weinberger, K.Q. Understanding batch normalization. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  38. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 3285–3292. [Google Scholar]
  39. Oughtred, R.; Rust, J.; Chang, C.; Breitkreutz, B.J.; Stark, C.; Willems, A.; Boucher, L.; Leung, G.; Kolas, N.; Zhang, F.; et al. The BioGRID database: A comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021, 30, 187–200. [Google Scholar] [CrossRef]
  40. Yao, Y.; Du, X.; Diao, Y.; Zhu, H. An integration of deep learning with feature embedding for protein–protein interaction prediction. PeerJ 2019, 7, e7126. [Google Scholar] [CrossRef]
  41. Xie, S.; Xie, X.; Zhao, X.; Liu, F.; Wang, Y.; Ping, J.; Ji, Z. HNSPPI: A hybrid computational model combing network and sequence information for predicting protein–protein interaction. Briefings Bioinform. 2023, 24, bbad261. [Google Scholar] [CrossRef]
  42. Li, X.; Han, P.; Wang, G.; Chen, W.; Wang, S.; Song, T. SDNN-PPI: Self-attention with deep neural network effect on protein-protein interaction prediction. BMC Genom. 2022, 23, 474. [Google Scholar] [CrossRef]
  43. Tran, H.N.; Xuan, Q.N.P.; Nguyen, T.T. DeepCF-PPI: Improved prediction of protein-protein interactions by combining learned and handcrafted features based on attention mechanisms. Appl. Intell. 2023, 53, 17887–17902. [Google Scholar] [CrossRef]
  44. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. In Proteins: Structure, Function, and Bioinformatics; Wiley Online Library: Hoboken, NJ, USA, 2021; Volume 89, pp. 1607–1617. [Google Scholar]
  45. Yang, X.; Yang, S.; Lian, X.; Wuchty, S.; Zhang, Z. Transfer learning via multi-scale convolutional neural layers for human–virus protein–protein interaction prediction. Bioinformatics 2021, 37, 4771–4778. [Google Scholar] [CrossRef]
Figure 1. Framework of LPBERT. Using two protein sequences as model inputs and the ProteinBERT model as an encoder, the sequences are initially encoded to generate their corresponding global and local representations. In the deep extraction module, the GCT and LCR modules process these embedding representations and flatten the results via a flatten layer. The GCT consists of a residual stack composed of CNN and Transformer encoders, while the LCR is formed by a residual stack composed of a CNN and BiLSTM. The flattened results from the global and local representations are concatenated and input into the MLP for calculation. In the final dense layer, the softmax function is utilized for activation, yielding probability scores for protein interactions and non-interactions.
Figure 2. Supplementary information on the architectures of the GCT and LCR.
Figure 3. Statistical analysis of the sequence lengths in the BioGRID dataset.
Table 1. Dataset information statistics.

Dataset | Positive Samples | Negative Samples
H. sapiens | 31,164 | 31,164
S. cerevisiae | 13,462 | 13,462
Guo | 5594 | 5594
Pan | 27,593 | 34,298
Multi-species | 32,959 | 32,959
Human–virus | 8929 | 8929
Table 2. Hyperparameter value ranges and final selections.

Hyperparameter | Search Range | Best Value | Selection
Learning rate | [1 × 10⁻⁴, 1 × 10⁻³] | 0.0006 | 0.0006
Optimizer | [0, 1] | 0.0581 | Adam
Head1 | [2, 8] | 2.9361 | 4
Head2 | [2, 8] | 2.9360 | 4
FNN1 | [64, 512] | 489.92 | 512
FNN2 | [64, 512] | 391.93 | 512
Dropout rate | [0, 0.3] | 0.1124 | 0.1
Optimizer: 0 for Adam, 1 for SGD.
Table 3. Performance comparison of LPBERT at various sequence lengths (%).

Species | Length | Accuracy | Precision | Recall | Specificity | MCC
H. sapiens | 500 | 98.50 ± 0.13 | 98.59 ± 0.20 | 98.41 ± 0.09 | 98.59 ± 0.20 | 97.00 ± 0.27
H. sapiens | 1000 | 98.76 ± 0.11 | 99.05 ± 0.21 | 98.47 ± 0.13 | 99.05 ± 0.22 | 97.53 ± 0.22
H. sapiens | 1500 | 98.93 ± 0.13 | 99.23 ± 0.19 | 98.62 ± 0.13 | 99.23 ± 0.19 | 97.85 ± 0.26
S. cerevisiae | 500 | 97.50 ± 0.26 | 98.15 ± 0.55 | 96.83 ± 0.54 | 98.17 ± 0.56 | 95.01 ± 0.53
S. cerevisiae | 1000 | 97.71 ± 0.20 | 98.64 ± 0.31 | 96.76 ± 0.60 | 98.67 ± 0.31 | 95.45 ± 0.38
S. cerevisiae | 1500 | 97.94 ± 0.16 | 98.60 ± 0.42 | 97.27 ± 0.46 | 98.61 ± 0.43 | 95.89 ± 0.31
This table reports the mean and standard deviation of the three test results. The best values are in bold.
Table 4. Comparison of each method on the BioGRID H. sapiens dataset (%).

Method | Accuracy | Precision | Recall | Specificity | F1-Score | MCC
PIPR | 94.50 ± 0.22 | 95.81 ± 0.28 | 93.08 ± 0.20 | 95.93 ± 0.28 | 94.42 ± 0.22 | 89.04 ± 0.43
RAPPPID | 94.16 ± 2.04 | 95.10 ± 2.30 | 93.15 ± 2.23 | 95.17 ± 2.34 | 94.10 ± 2.06 | 88.36 ± 4.08
DeepTrio | 95.69 ± 0.33 | 98.57 ± 0.36 | 92.73 ± 0.37 | 98.65 ± 0.34 | 95.56 ± 0.34 | 91.54 ± 0.65
LPBERT | 98.93 ± 0.13 | 99.23 ± 0.19 | 98.62 ± 0.13 | 99.23 ± 0.19 | 98.93 ± 0.13 | 97.85 ± 0.26
This table reports the mean and standard deviation of the three test results. The best values are in bold.
Table 5. Comparison of each method on the BioGRID S. cerevisiae dataset (%).

Method | Accuracy | Precision | Recall | Specificity | F1-Score | MCC
PIPR | 90.29 ± 1.08 | 92.01 ± 1.84 | 88.31 ± 0.32 | 92.29 ± 1.90 | 90.11 ± 1.02 | 80.68 ± 2.21
RAPPPID | 91.25 ± 0.27 | 93.85 ± 0.51 | 88.28 ± 0.26 | 94.21 ± 0.51 | 90.98 ± 0.26 | 82.64 ± 0.57
DeepTrio | 92.77 ± 0.63 | 93.70 ± 2.01 | 91.77 ± 1.35 | 93.76 ± 2.21 | 92.70 ± 0.55 | 85.60 ± 1.27
LPBERT | 97.94 ± 0.16 | 98.60 ± 0.42 | 97.27 ± 0.46 | 98.61 ± 0.43 | 97.93 ± 0.16 | 95.89 ± 0.31
This table reports the mean and standard deviation of the three test results. The best values are in bold.
Table 6. Comparison of LPBERT and different sequence-encoding PPI methods (%).

Dataset | Method | Sequence Encoding | Accuracy | MCC
S. cerevisiae (DeepFE, 5-fold) | DeepFE-PPI | Word2Vec | 94.78 ± 0.61 | 89.62 ± 1.23
S. cerevisiae (DeepFE, 5-fold) | SDNN-PPI | AAC + CT + AC | 95.48 ± 0.37 | 91.02 ± 0.74
S. cerevisiae (DeepFE, 5-fold) | LPBERT | ProteinBERT | 94.83 ± 0.16 | 89.69 ± 0.30
S. cerevisiae (DeepCF, 5-fold) | DeepCF-PPI | Word2Vec + AAC + PseAAC + APAAC + QSO + DPC | 95.6 ± 0.57 | 91.4 ± 1.13
S. cerevisiae (DeepCF, 5-fold) | LPBERT | ProteinBERT | 95.04 ± 0.48 | 90.1 ± 0.96
Human (10-fold) | HNSPPI | Seq2Vec | 94.92 ± 0.19 | NA
Human (10-fold) | LPBERT | ProteinBERT | 98.49 ± 0.36 | 96.99 ± 0.71
The data of the comparison methods are from the original papers. NA indicates not applicable. The best values are in bold.
Table 7. Performance comparison of LPBERT with other protein-language-model-based PPI prediction methods (%).

Dataset | Method | Accuracy | Precision | Recall | Specificity | F1-Score | MCC
Bio H. sapiens | MPB-PPI | 98.18 ± 0.05 | 97.32 ± 0.19 | 99.10 ± 0.22 | 97.27 ± 0.21 | 98.20 ± 0.05 | 96.39 ± 0.09
Bio H. sapiens | LPBERT | 99.31 ± 0.08 | 99.30 ± 0.13 | 99.32 ± 0.08 | 99.30 ± 0.13 | 99.31 ± 0.08 | 98.62 ± 0.16
S. cerevisiae core (DeepFE) | MPB-PPI | 92.85 ± 0.31 | 93.95 ± 1.25 | 91.51 ± 1.43 | 94.17 ± 1.31 | 92.69 ± 0.39 | 85.77 ± 0.63
S. cerevisiae core (DeepFE) | LPBERT | 94.83 ± 0.16 | 96.06 ± 0.54 | 93.50 ± 0.64 | 96.17 ± 0.45 | 94.76 ± 0.20 | 89.69 ± 0.30
Multi-species | MPB-PPI | 98.33 | 99.30 | 97.36 | 99.31 | 98.32 | 96.69
Multi-species | LPBERT | 98.87 ± 0.12 | 99.56 ± 0.13 | 98.19 ± 0.12 | 99.56 ± 0.14 | 98.87 ± 0.11 | 97.76 ± 0.24
Pan | xCAPT5 | 99.77 ± 0.02 | 99.75 ± 0.03 | 99.75 ± 0.02 | 99.80 ± 0.02 | 99.62 ± 0.06 | 99.55 ± 0.03
Pan | LPBERT | 98.73 ± 0.09 | 98.65 ± 0.32 | 98.51 ± 0.14 | 98.92 ± 0.25 | 98.58 ± 0.11 | 97.44 ± 0.18
Bio: BioGRID. This table reports the mean and variance of the 5-fold cross-validation results. The best values are in bold.
Table 8. Results of the LPBERT ablation experiment (%).

Experiment | Accuracy | Precision | Recall | Specificity | F1-Score | MCC
rm GCT | 97.47 ± 0.01 | 97.48 ± 0.15 | 97.47 ± 0.15 | 97.48 ± 0.16 | 97.47 ± 0.01 | 94.95 ± 0.01
rm LCR | 98.73 ± 0.10 | 99.15 ± 0.06 | 98.32 ± 0.23 | 99.15 ± 0.06 | 98.73 ± 0.10 | 97.47 ± 0.19
rp GCT with Trans | 98.73 ± 0.01 | 98.91 ± 0.16 | 98.55 ± 0.17 | 98.91 ± 0.16 | 98.73 ± 0.01 | 97.46 ± 0.02
rp LCR with CNNs | 98.8 ± 0.07 | 99.01 ± 0.20 | 98.57 ± 0.16 | 99.02 ± 0.20 | 98.8 ± 0.07 | 97.59 ± 0.15
GCT (rm Trans) | 98.89 ± 0.02 | 99.09 ± 0.07 | 98.69 ± 0.04 | 99.09 ± 0.07 | 98.89 ± 0.02 | 97.78 ± 0.03
LCR (rm BiL) | 98.88 ± 0.10 | 99.05 ± 0.21 | 98.71 ± 0.02 | 99.05 ± 0.21 | 98.88 ± 0.10 | 97.75 ± 0.20
rm GCT (Trans) + LCR (BiL) | 98.85 ± 0.16 | 99.14 ± 0.20 | 98.56 ± 0.22 | 99.15 ± 0.19 | 98.85 ± 0.17 | 97.71 ± 0.33
LCR (GRU) | 98.74 ± 0.09 | 98.80 ± 0.16 | 98.68 ± 0.07 | 98.80 ± 0.16 | 98.74 ± 0.09 | 97.48 ± 0.17
LCR (BiGRU) | 98.64 ± 0.10 | 99.09 ± 0.28 | 98.19 ± 0.24 | 99.10 ± 0.28 | 98.64 ± 0.10 | 97.29 ± 0.21
LCR (LSTM) | 98.63 ± 0.12 | 98.76 ± 0.37 | 98.49 ± 0.17 | 98.77 ± 0.38 | 98.63 ± 0.12 | 97.25 ± 0.24
LPBERT (Ours) | 98.93 ± 0.13 | 99.23 ± 0.19 | 98.62 ± 0.13 | 99.23 ± 0.19 | 98.93 ± 0.13 | 97.85 ± 0.26
rm: remove; rp: replace; Trans: Transformer encoder; BiL: BiLSTM. This table reports the mean and standard deviation of the three test results. The best values are in bold.
Table 9. Comparison of different sequence encoding methods (%).

Dataset | Method | Encoding Parameters | Accuracy | MCC
BioGRID H. sapiens (5-fold) | LPBERT (with TAPE) | ~38 M | 97.66 ± 0.23 | 95.32 ± 0.47
BioGRID H. sapiens (5-fold) | LPBERT (with ProtBert) | ~420 M | 99.75 ± 0.06 | 99.49 ± 0.11
BioGRID H. sapiens (5-fold) | LPBERT (with ESM-2) | ~650 M | 99.80 ± 0.02 | 99.61 ± 0.05
BioGRID H. sapiens (5-fold) | LPBERT (with ProtT5) | ~3 B | 99.82 ± 0.02 | 99.65 ± 0.04
BioGRID H. sapiens (5-fold) | LPBERT (Ours) | ~16 M | 98.93 ± 0.13 | 97.85 ± 0.26
This table reports the mean and variance of the 5-fold cross-validation results. The best values are in bold.
Table 10. The sequence similarity analysis of LPBERT (%).

Similarity | Training–Validation Samples | Test Samples | Accuracy | Precision | F1-Score
Any | 12,354 | 2968 | 93.19 | 98.48 | 93.40
≤90% | 12,308 | 2833 | 94.25 | 98.50 | 94.65
≤80% | 13,216 | 2178 | 94.67 | 97.96 | 91.40
≤70% | 13,224 | 2373 | 95.41 | 98.49 | 95.31
≤60% | 12,918 | 2360 | 94.66 | 98.21 | 94.58
≤50% | 12,414 | 2440 | 94.84 | 98.24 | 94.66
Any means the data were not processed by CD-HIT.
Table 11. The cross-species generalization ability validation of LPBERT (%).

Dataset | Positive Samples | Negative Samples | Accuracy | Precision | F1-Score
BioGRID S. cerevisiae | 13,462 | 13,462 | 93.05 ± 0.21 | 98.96 ± 0.06 | 92.61 ± 0.25
S. cerevisiae core | 5271 | 5266 | 51.81 ± 0.55 | 51.10 ± 0.32 | 63.93 ± 0.63
Guo | 5594 | 5594 | 51.82 ± 0.07 | 51.12 ± 0.04 | 63.28 ± 0.13
Multi-species | 32,959 | 32,959 | 47.79 ± 0.36 | 48.74 ± 0.21 | 62.05 ± 0.34
Human–virus | 8929 | 8929 | 86.48 ± 1.07 | 93.84 ± 1.16 | 85.22 ± 1.53
This table reports the mean and standard deviation of the three test results.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
