Review

Advances in Computational Methods for Protein–Protein Interaction Prediction

by Lei Xian and Yansu Wang *
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(6), 1059; https://doi.org/10.3390/electronics13061059
Submission received: 3 February 2024 / Revised: 4 March 2024 / Accepted: 8 March 2024 / Published: 12 March 2024
(This article belongs to the Section Artificial Intelligence)

Abstract:
Protein–protein interactions (PPIs) are pivotal in various physiological processes inside biological entities. Accurate identification of PPIs holds paramount significance for comprehending biological processes, deciphering disease mechanisms, and advancing medical research. Given the costly and labor-intensive nature of experimental approaches, a multitude of computational methods have been devised to enable swift and large-scale PPI prediction. This review offers a thorough examination of recent strides in computational methodologies for PPI prediction, with a particular focus on the utilization of deep learning techniques within this domain. Alongside a systematic classification and discussion of relevant databases, feature extraction strategies, and prominent computational approaches, we conclude with a thorough analysis of current challenges and prospects for the future of this field.

1. Introduction

Proteins, as fundamental constituents of biological cells, have a crucial role in orchestrating and regulating various biological processes within organisms, enabling them to function normally. More than 80% of proteins are thought to interact physically with other proteins rather than acting alone, creating complex “molecular machines” that carry out their biological roles [1,2]. Undoubtedly, these protein–protein interactions (PPIs) play a pivotal role in a multitude of cellular processes, encompassing DNA replication and transcription, signal transduction, immune responses, metabolic regulation, and enzyme catalysis [3,4]. The in-depth exploration and investigation of PPIs yields meaningful insights into the underlying molecular mechanisms and functional aspects of proteins [5,6], thereby facilitating the identification of potential drug targets [7], advancing drug discovery and development [8,9,10], helping research in diagnostic and prognostic markers [11], and enhancing disease management strategies [12].
Over the years, numerous conventional experimental techniques have been widely employed in the identification and characterization of PPIs, including tandem affinity purification (TAP) [13], yeast two-hybrid (Y2H) [14,15], nuclear magnetic resonance (NMR) [16,17], synthetic lethal analysis [18,19], co-immunoprecipitation (Co-IP) [20], and protein microarrays [21]. While these laboratory-based approaches have achieved success, they also possess certain limitations that cannot be overlooked. Firstly, these approaches entail high expenses, demand significant labor and time investments, and are unable to cover the entire spectrum of protein pairs [22]. Secondly, the efficacy of these experimental techniques largely depends on the implementation protocols in the target organisms, alongside potential environmental disturbances and equipment resolution, all of which can impact the final experimental results [23,24]. Additionally, inherent limitations in these experimental procedures can frequently lead to the occurrence of false-positive and false-negative outcomes [25,26,27]. Therefore, reliable and efficient computational methodologies have been devised to probe potential PPIs, combined with experimental procedures to validate the veracity and accuracy of the predicted results, significantly advancing the progress in PPI prediction. Drawing from the finding that interacting proteins in living organisms often possess similar evolutionary histories [28], most computational methodologies assess the possibility of interaction between a given protein pair by learning the various types of features of the interacting and non-interacting protein pairs that have been validated. Due to the swift progress of high-throughput technologies, the fields of genomics and proteomics have amassed vast amounts of biological information. Consequently, the scale of constructed protein interaction networks has expanded, and link prediction algorithms have made significant strides. 
These algorithms rely on the Triadic Closure Principle (TCP) or the L3 framework and exploit patterns of already mapped interactions to identify missing connections [29,30]. However, their performance hinges greatly on the reliability of the underlying PPI network. Docking algorithms do not learn features directly. Instead, they apply principles of surface complementarity and interaction energy to explore potential binding modes, construct 3D models for a given query protein pair, and score the candidate models to identify true interactions. In contrast, homology modeling does not actually construct a three-dimensional model for the input protein pair. Instead, it employs sequence alignment and structural alignment to identify structural representatives and structural neighbors, respectively. The structural representatives are then superimposed onto the interaction templates formed by the structural neighbors to obtain “interaction models”, which are scored to identify real interactions. These algorithms have proven effective in distinguishing genuine interactions from “background” non-interacting protein pairs [31,32,33,34].
Machine learning (ML) techniques have grown in popularity in recent years for PPI prediction. ML is founded on the principle of enabling a computer system to learn patterns and features within vast amounts of input data, so that it can autonomously undertake prediction, identification, and classification tasks. Frequently employed ML algorithms include decision trees, random forests (RF), support vector machines (SVMs), and Naïve Bayes (NB), among others. While ML methods exhibit superior performance compared to conventional PPI prediction techniques, the intensive feature engineering associated with such approaches adds to the arduous nature of the prediction task. Deep learning (DL) has risen to prominence as a formidable subfield of ML and has made remarkable strides in various domains, including bioinformatics. Through its impressive capacity for nonlinear transformations, DL has demonstrated a strong ability to tackle complex problems in areas such as protein structure prediction [35,36] and RNA-binding site identification [37,38]. DL-based PPI prediction models utilize multiple layers of artificial neurons to combine and abstract the various feature data of input protein pairs into higher-level features. This process provides an abstract representation of the data or actual objects, enabling more sophisticated understanding and prediction of such interactions. DL-based models for predicting PPIs have garnered considerable attention from researchers and have exhibited exceptional performance in practical applications.
This review provides a comprehensive and up-to-date overview of the diverse computational methods employed in PPI prediction. This paper is organized as follows: Section 2 outlines widely used biological databases for PPI prediction and explores common approaches for constructing datasets. Section 3 provides an in-depth examination of various methods used for extracting multiple types of features from the input protein pairs. Section 4 offers a detailed summary of various cutting-edge computational methods leveraged for PPI prediction. Finally, Section 5 discusses current challenges and potential future directions.

2. Databases and Dataset Preparation

At present, the field of bioinformatics has yielded a substantial wealth of data due to rapid advancements in high-throughput technologies. Several databases have been established which have played a great role in the research and development of novel methods for predicting PPIs. Among them, the PPI databases are used for the acquisition of interacting protein pairs and the construction of datasets, and the other databases, such as protein sequence and structure databases, are used for the construction of feature vectors for input protein pairs. Typical PPI databases encompass the Database of Interacting Proteins (DIP) [39], the Biological General Repository for Interaction Datasets (BioGRID) [40], IntAct [41], the Search Tool for Retrieval of Interacting Genes/Proteins (STRING) [42], the Human Protein Reference Database (HPRD) [43], the Molecular INTeraction database (MINT) [44], the Human Integrated Protein–Protein Interaction rEference (HIPPIE) [45], and the Biomolecular Interaction Network Database (BIND) [46]. Notably, DIP gathers PPIs that undergo curation via both manual input from biological experts and automatic curation through computational methods, covering a wide range of organisms such as yeast and humans, while HPRD and HIPPIE enjoy widespread usage as human PPI databases. Meanwhile, STRING systematically integrates interactions between proteins, encompassing both physical interactions and functional associations. BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. IntAct provides interaction data taken directly from the literature, submitted by academics and imported from other databases, and MINT mainly provides experimentally validated PPI data. However, BIND databases are seldom employed since they are no longer actively maintained or updated. Moreover, a distinctive database known as Negatome [47] exists, which comprises non-interacting proteins and protein domains. 
This database is compiled through the manual curation of scientific literature and the analysis of the 3D structure of protein complexes.
Computational models commonly learn relationships between PPIs and various types of protein features, such as protein sequences, spatial structures, gene ontology (GO) annotations, and genomic information, to accomplish the prediction task. Databases such as UniProt [48], SWISS-PROT [49], and PIR [50] provide access to protein sequence information. Moreover, SCOP [51] and the RCSB Protein Data Bank (PDB) [52] offer access to higher-level structural knowledge concerning proteins. The GO [53], Clusters of Orthologous Groups (COG) [54], and Kyoto Encyclopedia of Genes and Genomes (KEGG) [55] databases serve as valuable resources providing annotated and categorized information about genes and proteins, and genomic data can be obtained from the Candida Genome Database (CGD) [56]. Relevant information about the aforementioned databases is listed in Table 1.
The construction of a high-quality dataset is crucial for predicting PPIs. To enhance the accuracy and reliability of prediction, several effective strategies are typically employed to preprocess the raw data obtained from databases, such as filtering out proteins shorter than 50 amino acids and generating a non-redundant subset with less than 40% sequence identity using CD-HIT clustering. Equally important is the preparation of negative samples, which must adhere to specific criteria [57]. Firstly, negative samples must not include any protein pairs that actually interact. Secondly, the proteins that make up the negative set should be evenly distributed, avoiding excessive bias toward specific types of proteins. A common approach is to group the data by subcellular localization, such as the mitochondria, nucleus, endoplasmic reticulum, cytoplasm, peroxisomes, vesicles, and Golgi apparatus; proteins with different subcellular localizations are then randomly paired (excluding interacting pairs) to form negative samples. Information on subcellular locations can be obtained from Swiss-Prot [49]. Other approaches include obtaining negative samples directly from the Negatome [47] database or arbitrarily pairing proteins for which no interaction evidence exists. In practical studies, well-established datasets are widely used for training PPI prediction models, such as the Profppikernel dataset [58] and Pan’s dataset [59]. The former originates from HIPPIE [45] V1.2 and DIP [39], comprising 842 human PPIs and 746 yeast PPIs after careful screening. The latter is based on HPRD (2007 version) and serves as a benchmark dataset for PPI prediction.
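The localization-based pairing strategy described above can be sketched as follows. The protein identifiers, localizations, and helper name are illustrative; a real pipeline would draw localizations from Swiss-Prot and would also balance class sizes and check for indirect interaction evidence:

```python
import random

def sample_negatives(loc_of, positives, n_pairs, seed=0):
    """Randomly pair proteins from *different* subcellular localizations,
    excluding any known interacting pair, to form negative examples.
    `loc_of` maps protein id -> localization; `positives` is a set of
    frozensets of interacting pairs. Note: this loops forever if fewer
    than `n_pairs` valid cross-localization pairs exist."""
    rng = random.Random(seed)
    proteins = list(loc_of)
    negatives = set()
    while len(negatives) < n_pairs:
        p, q = rng.sample(proteins, 2)
        pair = frozenset((p, q))
        if loc_of[p] != loc_of[q] and pair not in positives:
            negatives.add(pair)
    return negatives
```

Sampling without replacement into a set ensures no duplicate negative pairs are produced.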

3. Feature Extraction

Data retrieved from PPI databases can be used to construct datasets, which are typically divided into training and testing sets. Most computational models (barring docking and link prediction methods) require learning relevant features of protein pairs from the training set in order to complete PPI prediction tasks on the testing set. Generally, various characteristics of each protein pair in the training set, obtained from other biological databases, are assembled into numerical vectors to serve as input for the computational models.

3.1. Sequence-Based Features

As the primary structure of a protein, each protein sequence is a distinctive arrangement of the 20 fundamental amino acids. These amino acids are linked by peptide bonds, giving rise to a chain of amino acids called a polypeptide. Each position in the polypeptide chain can hold any of the amino acids, leading to substantial variability among protein sequences, and these differing sequences ultimately determine the proteins’ different structures and functions. Hence, researchers can extract information on the composition, order, and physicochemical properties of amino acids from protein sequences to predict the likelihood of protein interactions. Given the inconsistent lengths of protein sequences, efficient encoding strategies are employed to generate fixed-length feature vectors, which serve as inputs for computational models. The main coding schemes are summarized in Table 2. Seven frequently employed ones are discussed in detail below.

3.1.1. Amino Acid Composition (AAC)

AAC, a succinct method for extracting protein features, calculates the percentage of 20 different amino acids within a protein sequence, thus transforming all protein sequences of varying lengths into 20-dimensional feature vectors. Each element is computed as
$$P_r = \frac{n_r}{N}, \quad r = 1, 2, 3, \ldots, 20$$
where $N$ represents the length of the protein sequence, and $n_r$ denotes the number of occurrences of the $r$-th amino acid.
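A minimal sketch of the AAC encoding, assuming the 20 standard one-letter residue codes:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(sequence):
    """Return the 20-dimensional amino acid composition vector P_r = n_r / N."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts[aa] / n for aa in AMINO_ACIDS]
```

For the toy sequence "ACDAAC", the first element (for 'A') is 3/6 = 0.5, and the vector sums to 1 regardless of sequence length.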

3.1.2. Pseudo Amino Acid Composition (PseAAC)

Chou [61] introduced PseAAC as an enhancement of AAC, considering the effects of both sequence order and composition. This yields a $(20 + \lambda)$-dimensional feature vector for each sequence, as shown in Equation (2):
$$X = [x_1, x_2, x_3, \ldots, x_{20}, x_{20+1}, \ldots, x_{20+\lambda}]^T \quad (\lambda < N)$$
The $20 + \lambda$ elements are calculated as shown in Equation (3):
$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\[2ex] \dfrac{\omega\, \theta_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 20 + 1 \le u \le 20 + \lambda \end{cases}$$
where $f_i$ represents the frequency of the different amino acids, $\omega$ is a weighting factor, and $\theta_j$ is the $j$-tier correlation factor, computed from various biochemical quantities, which reflects the sequence correlation between residues in the protein chain:
$$\theta_j = \frac{1}{N - j} \sum_{k=1}^{N-j} \Phi(R_k, R_{k+j}), \quad j < N$$
$$\Phi(R_k, R_{k+j}) = \frac{1}{3} \left\{ [M(R_{k+j}) - M(R_k)]^2 + [H_a(R_{k+j}) - H_a(R_k)]^2 + [H_b(R_{k+j}) - H_b(R_k)]^2 \right\}$$
where $M(R_k)$, $H_a(R_k)$, and $H_b(R_k)$ represent the side chain mass, hydrophobicity value, and hydrophilicity value of amino acid $R_k$, respectively. The value of $\lambda$ must be less than the shortest sequence length in the dataset.
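A toy sketch of the PseAAC computation under simplifying assumptions: a single illustrative property (hypothetical hydrophobicity values for four residues only) stands in for the three normalized properties of the equations above, so $\Phi$ reduces to one squared difference:

```python
# Hypothetical property values for a few residues (illustrative only; the
# real method normalizes published scales and combines three properties).
HYDRO = {"A": 1.8, "C": 2.5, "D": -3.5, "G": -0.4}

def theta(seq, j):
    """j-tier sequence-correlation factor; with a single property,
    Phi(Rk, Rk+j) reduces to a squared difference. Requires j < len(seq)."""
    n = len(seq)
    return sum((HYDRO[seq[k + j]] - HYDRO[seq[k]]) ** 2 for k in range(n - j)) / (n - j)

def pseaac(seq, lam=2, w=0.05):
    """(20 + lam)-dimensional pseudo amino acid composition."""
    aas = "ACDEFGHIKLMNPQRSTVWY"
    freqs = [seq.count(a) for a in aas]
    thetas = [theta(seq, j) for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

By construction the full vector still sums to 1, since the same denominator normalizes both the composition and the correlation terms.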

3.1.3. Dipeptide Composition (DPC)

The DPC [63] descriptor provides a convenient way to compare and analyze the percentage of a certain dipeptide, encapsulating details regarding the local order of amino acids within the protein sequence. Twenty amino acids can yield 20 × 20 = 400 different dipeptide combinations, corresponding to a feature vector with 400 dimensions. The definition of DPC is as follows:
$$D_{s,t} = \frac{n_{s,t}}{N - 1}, \quad s, t = 1, 2, 3, \ldots, 20$$
where $n_{s,t}$ represents the count of dipeptides composed of amino acids of types $s$ and $t$.
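A compact sketch of the DPC encoding; the flat 400-element vector is indexed as $20s + t$ for the dipeptide $(s, t)$:

```python
AAS = "ACDEFGHIKLMNPQRSTVWY"

def dpc(seq):
    """400-dimensional dipeptide composition: D[s,t] = n_{s,t} / (N - 1)."""
    idx = {a: i for i, a in enumerate(AAS)}
    vec = [0.0] * 400
    for k in range(len(seq) - 1):  # count each overlapping dipeptide
        vec[idx[seq[k]] * 20 + idx[seq[k + 1]]] += 1
    return [v / (len(seq) - 1) for v in vec]
```

For "AAC", the two dipeptides "AA" and "AC" each receive frequency 1/2.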

3.1.4. Conjoint Triad (CT)

Based on side chain volume and dipole, CT [57] groups the 20 amino acids into 7 distinct classes: {V, A, G}; {P, F, L, I}; {S, T, M, Y}; {W, Q, N, H}; {K, R}; {E, D}; {C}. Subsequently, an integer value within the range of one to seven is allocated to each amino acid, determined by its corresponding category. In other words, amino acids within the same class are assigned the same number. For instance, “AHHYMLPPKDC” is mapped as “14433222567”. To capture the interplay between adjacent amino acids, a protein sequence is treated as a series of tripeptides, where three consecutive amino acids are considered as a unified entity. This arrangement results in a total of 343 distinct triad types. The occurrence frequency of these triads within the sequence forms a 343-dimensional feature vector, with the k -th element computed as
$$f_k = \frac{n_k}{N - 2}, \quad k = 1, 2, 3, \ldots, 343$$
where $n_k$ denotes the number of occurrences of the $k$-th triad.
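A sketch of the CT encoding using the seven classes listed above; classes are numbered from 0 here (rather than 1–7), and a triad of classes $(a, b, c)$ maps to index $49a + 7b + c$:

```python
# The seven CT classes from the text, in order.
CLASSES = ["VAG", "PFLI", "STMY", "WQNH", "KR", "ED", "C"]
CLASS_OF = {aa: i for i, grp in enumerate(CLASSES) for aa in grp}

def conjoint_triad(seq):
    """343-dimensional conjoint triad vector: f_k = n_k / (N - 2)."""
    codes = [CLASS_OF[a] for a in seq]
    vec = [0.0] * 343
    for k in range(len(codes) - 2):  # each window of three residues is one triad
        vec[codes[k] * 49 + codes[k + 1] * 7 + codes[k + 2]] += 1
    return [v / (len(codes) - 2) for v in vec]
```

For the example "AHHYMLPPKDC" (11 residues, hence 9 triads), the first triad A-H-H has classes (0, 3, 3) and thus frequency 1/9 at index 24.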

3.1.5. Composition, Transition, and Distribution (CTD)

According to their spatial structure or physicochemical properties, amino acids can be categorized into different classes. For example, on the basis of hydrophobicity, they can be divided into three groups: hydrophobic (W, F, M, I, V, L, C), neutral (Y, H, P, T, S, A, G), and polar (N, Q, D, E, K, R). The three categories are represented by “1”, “2”, and “3”, respectively, so that a protein sequence is transformed into a numerical sequence; for example, “GCAPQIMTQE” is represented as “2122311233”. Operating on this numerical sequence, CTD [66] is a multi-part descriptor consisting of three components:
(1)
Composition: This refers to the percentage representation of each kind of amino acid within the protein sequence. According to the example mentioned earlier, the numbers of “1”, “2”, and “3” are 3, 4, and 3, respectively, and their corresponding frequencies are 3 / 10 = 0.3 , 4 / 10 = 0.4 ,   3 / 10 = 0.3 . Thus, the classification based on hydrophobicity leads to a three-dimensional vector, and if more properties of amino acids are combined, a higher-dimensional vector can be obtained.
(2)
Transition: This is the frequency of occurrence of dipeptides (regardless of their order) composed of different classes of amino acids in the sequence. The definition is as follows:
$$T_{i,j} = \frac{n_{i,j} + n_{j,i}}{N - 1}, \quad (i, j) \in \{(1,2), (1,3), (2,3)\}$$
where $n_{i,j}$ and $n_{j,i}$ denote the counts of the dipeptides “$ij$” and “$ji$”, respectively, within the sequence. For the example above, the frequency values are $T_{1,2} = 3/9$, $T_{1,3} = 1/9$, and $T_{2,3} = 2/9$.
(3)
Distribution: This describes the distribution pattern of amino acids. For each category, the positions (as percentages of the whole sequence) of the first, first-quartile, half, third-quartile, and last residues of that group are calculated. In the above example, there are four residues labeled “2”, whose ordinal positions within the group are $1$, $4 \times 25\% = 1$, $4 \times 50\% = 2$, $4 \times 75\% = 3$, and $4 \times 100\% = 4$, respectively. The corresponding positions of “2” in the whole sequence are 1, 1, 3, 4, and 8, so the distribution descriptors for “2” are $1/10 = 0.1$, $1/10 = 0.1$, $3/10 = 0.3$, $4/10 = 0.4$, and $8/10 = 0.8$. The descriptors for “1” and “3” are calculated similarly.
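The three CTD components can be sketched as below for the hydrophobicity grouping; the quartile-rank convention follows the worked example (exact rounding conventions vary between implementations), and the sketch assumes every class occurs at least once in the sequence:

```python
HYDRO_CLASS = {}
for label, residues in (("1", "WFMIVLC"), ("2", "YHPTSAG"), ("3", "NQDEKR")):
    for aa in residues:
        HYDRO_CLASS[aa] = label

def ctd(seq):
    """Composition, transition, and distribution descriptors for one grouping."""
    s = "".join(HYDRO_CLASS[a] for a in seq)
    n = len(s)
    # Composition: frequency of each class.
    comp = [s.count(c) / n for c in "123"]
    # Transition: frequency of class-pair dipeptides, regardless of order.
    pairs = [("1", "2"), ("1", "3"), ("2", "3")]
    trans = [sum(1 for k in range(n - 1) if {s[k], s[k + 1]} == {a, b}) / (n - 1)
             for a, b in pairs]
    # Distribution: sequence positions of the 1st, 25%, 50%, 75%, 100% residues
    # of each class, as fractions of the sequence length.
    dist = []
    for c in "123":
        pos = [i + 1 for i, ch in enumerate(s) if ch == c]  # 1-based positions
        m = len(pos)
        marks = [pos[0]] + [pos[max(1, round(m * q)) - 1] for q in (0.25, 0.5, 0.75, 1.0)]
        dist.extend(p / n for p in marks)
    return comp, trans, dist
```

Running it on the worked example "GCAPQIMTQE" reproduces the values derived in the text: composition (0.3, 0.4, 0.3), transitions 3/9, 1/9, 2/9, and distribution descriptors 0.1, 0.1, 0.3, 0.4, 0.8 for class "2".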

3.1.6. Autocorrelation Descriptor (AD)

An AD [65] is employed to extract valuable physicochemical information from protein sequences, taking into account the effects of both nearby and distant interactions. Three commonly used ADs are as follows.
(1)
Moreau-Broto AD is denoted as:
$$NMBA(l) = \frac{\sum_{k=1}^{N-l} T_m(k)\, T_m(k+l)}{N - l}, \quad l = 1, 2, 3, \ldots, lag$$
where $T_m(k)$ and $T_m(k+l)$ represent the normalized physicochemical properties of the amino acids at positions $k$ and $k+l$, respectively, and the parameter $lag$ must be configured manually.
(2)
Moran AD is denoted as:
$$MA(l) = \frac{\frac{1}{N-l} \sum_{k=1}^{N-l} [T_m(k) - \mu][T_m(k+l) - \mu]}{\frac{1}{N} \sum_{k=1}^{N} [T_m(k) - \mu]^2}, \quad l = 1, 2, 3, \ldots, lag$$
where $\mu$ is the average value of a particular physicochemical attribute, calculated as shown in Equation (9):
$$\mu = \frac{\sum_{j=1}^{N} T_m(j)}{N}$$
(3)
Geary AD is denoted as:
$$GA(l) = \frac{\frac{1}{2(N-l)} \sum_{k=1}^{N-l} [T_m(k) - T_m(k+l)]^2}{\frac{1}{N} \sum_{k=1}^{N} [T_m(k) - \mu]^2}, \quad l = 1, 2, 3, \ldots, lag$$
By combining the aforementioned equations, we can derive a vector of dimension $3 \times n \times lag$, with $n$ denoting the number of physicochemical properties considered.
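The Moreau-Broto term can be sketched as below for one property already mapped onto the sequence as normalized values (the Moran and Geary variants differ only in their centering and normalization terms):

```python
def moreau_broto(props, lag):
    """Normalized Moreau-Broto autocorrelation NMBA(l) for one
    physicochemical property, given as a list of values T_m(k)."""
    n = len(props)
    return [sum(props[k] * props[k + l] for k in range(n - l)) / (n - l)
            for l in range(1, lag + 1)]
```

For a perfectly alternating property profile such as [1, -1, 1, -1], adjacent residues are anti-correlated (NMBA(1) = -1) while residues two apart are fully correlated (NMBA(2) = 1).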

3.1.7. Position-Specific Scoring Matrix (PSSM)

The PSSM, introduced by Gribskov et al. [67], is a potent approach for capturing evolutionary information from biological sequences, originally employed for identifying distantly related proteins. The PSSM, an L × 20 matrix, is derived by performing a multiple sequence alignment against a biological sequence database using the iterative PSI-BLAST tool. In this matrix, rows correspond to the amino acid positions within the protein, and columns represent the 20 fundamental amino acids. Each protein can then be represented as the following matrix:
$$PSSM = \begin{bmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,20} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ P_{L,1} & P_{L,2} & \cdots & P_{L,20} \end{bmatrix}$$
where $P_{i,j}$ denotes the likelihood of the amino acid at position $i$ mutating into the $j$-th amino acid over the course of evolution; a positive score implies a higher probability of mutation, while a negative score indicates relative conservation.
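The log-odds idea behind the matrix can be illustrated with a toy construction from a small ungapped alignment. Real PSSMs come from iterative PSI-BLAST searches with sequence weighting and more elaborate pseudocounts; the uniform background and pseudocount values here are arbitrary:

```python
import math

def toy_pssm(alignment, background=0.05, pseudocount=1.0):
    """Toy L x 20 log-odds matrix from an ungapped alignment of
    equal-length sequences. Positive scores mark residues enriched at a
    position relative to the background; negative scores mark depletion."""
    aas = "ACDEFGHIKLMNPQRSTVWY"
    length, depth = len(alignment[0]), len(alignment)
    pssm = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        row = []
        for a in aas:
            # Pseudocounts keep unseen residues from scoring -infinity.
            freq = (column.count(a) + pseudocount) / (depth + 20 * pseudocount)
            row.append(math.log2(freq / background))
        pssm.append(row)
    return pssm
```

For the alignment ["ACD", "ACD", "AGD"], position 1 is fully conserved as 'A', so its 'A' score is positive and its scores for absent residues are negative.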

3.2. Structure-Based Features

Proteins adopt specific three-dimensional conformations through the folding of their amino acid sequences. Compared to protein sequences, protein structures are more conserved and play a significant role in PPI prediction. Typically, experimental methods such as X-ray crystallography and cryo-electron microscopy are used to determine a protein’s structure. However, recent advances in the field, notably the development of AlphaFold, have revolutionized structure prediction by leveraging multiple sequence alignments. The notable success achieved by AlphaFold in predicting the structures of single-chain proteins can be attributed to the effective utilization of structural and evolutionary information through advanced deep learning models [35]. This breakthrough has led to a substantial expansion in the availability of protein structure data, with AlphaFold DB [70], a freely accessible repository containing highly accurate predictions of protein structures available at https://alphafold.ebi.ac.uk (accessed on 15 July 2023), facilitating the study of structure-based PPI prediction methods.
To convert the spatial information associated with proteins into a format suitable for machine learning models, protein structure maps are used. The map transforms the protein’s spatial arrangement into a protein graph object, denoted as G = V , E . Here, V represents the vertices corresponding to the amino acids, and E denotes the edges connecting these vertices [71]. An edge is present between two amino acids if their distance falls below a specified threshold. This graph-based representation, which omits geometric information such as bond angles and dihedral angles, provides a method to capture significant topological characteristics, e.g., the spatial proximity of the amino acids along the polypeptide chain. The obtained node degree, adjacency matrix, and eigenvector centrality can serve as inputs for ML models. The node degree refers to the number of edges connected to a vertex. The adjacency matrix reflects the connectivity between vertices. If there is an edge between two vertices, the corresponding entry at the intersection of the rows and columns for those vertices is set to 1; otherwise, it is 0. Eigenvector centrality is a metric used to measure the importance of nodes in a network. If a node is connected to other important nodes (i.e., nodes with high centrality scores), then the eigenvector centrality of that node will be high. The eigenvector centrality x i of node i can be computed by the formula A x = λ x , where A is the adjacency matrix of the graph, x is the eigenvector corresponding to the largest eigenvalue λ of matrix A, and x i is the i -th element of x [72].
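The contact-graph construction and eigenvector centrality described above can be sketched as follows. The 8 Å threshold and the toy coordinates are illustrative, and the power iteration adds an A + I shift so that it also converges on bipartite graphs:

```python
import math

def contact_graph(coords, threshold=8.0):
    """Adjacency matrix of a residue contact graph: an edge joins two
    residues whose (toy) coordinates lie within `threshold` of each other."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

def eigenvector_centrality(adj, iters=100):
    """Power iteration for the leading eigenvector of A (A x = lambda x).
    The +x shift (i.e., iterating A + I) avoids oscillation on bipartite graphs."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y)) or 1.0
        x = [v / norm for v in y]
    return x
```

On a three-residue chain, the middle residue has the highest degree and the highest eigenvector centrality, matching the intuition that nodes connected to well-connected nodes score higher.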
Furthermore, protein structure alignment [73] can facilitate the construction of multiple candidate interaction models from interaction templates. During this process, correct and incorrect interactions are distinguished based on features such as interface similarity, structural alignment scores, and the sequence similarity between the two targets and their templates [74].

3.3. GO-Based Features

GO constitutes a fundamental bioinformatics resource comprising structured and rigorously controlled vocabularies, offering standardized functional annotations for an organism’s genes and their corresponding products. Its coverage spans three major domains, namely, the molecular function (MF), cellular component (CC), and biological process (BP) [75]. The organization of GO adheres to the structure of a directed acyclic graph (DAG). Here, nodes represent distinct GO terms, while edges signify the relationships (e.g., is_a, part_of, and regulates) between them. It is generally observed that interacting proteins have a tendency to co-localize inside identical cellular components, possess similar functionalities, or participate in analogous biological processes, consequently exhibiting a remarkable degree of semantic similarity within the ontology [76].
Several approaches have emerged for predicting PPIs by utilizing GO, with the aim of constructing feature vectors for pairs of proteins and subsequently integrating them with traditional classifiers to accomplish the prediction task. In [77], a protein pair is treated as a document with their annotations as constituent words. The eigenvalue of each word is computed by multiplying the information content of the corresponding term with its weight. Another approach, known as HVSM [78], constructs feature vectors based on the presence of GO terms in gene annotations, considering both the parent and child terms linked via the is_a and part_of relationships. Edge-based approaches mainly rely on counting the edges that connect selected GO terms [79]. Additionally, deep learning models have been utilized to generate embeddings of biological entities within GO. For example, Jha et al. [80] harnessed OPA2Vec to translate background information from GO into feature coding. Similarly, Ieremie et al. [81] employed node2vec for the generation of feature vectors associated with GO terms.

3.4. Network-Based Features

The PPI network serves as a graphical representation to elucidate the intricate interplay among various proteins. Within this network, proteins establish complex connections through physical interactions. The nodes symbolize the proteins themselves, while the edges signify the interactions between them. By exploiting the topological layout of the PPI network, alongside node properties and relevant information, it becomes possible to infer potential edges or connections.
The Common Neighbor (CN) algorithm [29] draws inspiration from the social network phenomenon, where individuals possessing more shared friends tend to be acquainted. For a pair of nodes, their CN score is derived from the equation:
$$S_{xy} = |N_x \cap N_y|$$
Here, $N_x$ and $N_y$ represent the sets of neighboring nodes of $x$ and $y$, respectively. Intuitively, this formula assigns higher prediction scores to node pairs with a greater number of common neighbors, indicative of a higher probability of interaction. Moreover, the Resource Allocation (RA) algorithm [82] penalizes high-degree common neighbors, effectively mitigating their influence on the CN metric. The refined formula can be expressed as follows:
$$S_{xy} = \sum_{z \in N_x \cap N_y} \frac{1}{k_z}$$
where $k_z$ denotes the degree of node $z$. Additionally, alternative metrics such as the Jaccard coefficient ($S_{xy} = \frac{|N_x \cap N_y|}{|N_x \cup N_y|}$) [83], the Adamic–Adar index ($S_{xy} = \sum_{z \in N_x \cap N_y} \frac{1}{\log k_z}$) [84], as well as indexes based on paths of length 3 (elaborated upon in Section 4.3), offer further avenues for analysis.
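All four similarity indexes can be computed directly from an adjacency list; the five-edge graph in the usage note is a toy example, and the sketch assumes every common neighbor has degree at least two (which holds by construction, since it neighbors both endpoints):

```python
import math

def neighbors(edges):
    """Build an adjacency-set map from an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs

def scores(nbrs, x, y):
    """CN, RA, Jaccard, and Adamic-Adar similarity for candidate pair (x, y)."""
    common = nbrs[x] & nbrs[y]
    return {
        "CN": len(common),
        "RA": sum(1 / len(nbrs[z]) for z in common),
        "Jaccard": len(common) / len(nbrs[x] | nbrs[y]),
        "AA": sum(1 / math.log(len(nbrs[z])) for z in common),
    }
```

For edges A-B, A-C, B-C, B-D, C-D, the pair (A, D) shares the common neighbors B and C (each of degree 3), giving CN = 2, RA = 2/3, Jaccard = 1, and AA = 2/ln 3.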

4. Computational Methods for PPI Prediction

As a complement to experimental approaches, the utilization of computational models for PPI prediction has undergone rapid advancement in recent years. These models leverage an extensive repository of established biological knowledge to accomplish their prediction task with a remarkable level of accuracy and stability. In this paper, according to the specific types of protein feature information employed, we categorize the existing computational methods for predicting PPIs into sequence-based, structure-based, GO-based, and network-based prediction methods; in addition, we discuss DL-based methods, given the wide application of DL algorithms in PPI prediction. Table 3 briefly describes the specific methods in each category, and the different classes of methods are shown in Figure 1.

4.1. Sequence-Based Methods

Considering the sequence similarity between pairs of proteins, sequence-based PPI prediction methods usually extract composition, physicochemical attributes, evolutionary profiles, or other information from protein sequences, construct feature vectors, and then use traditional classifiers to obtain prediction results.
A prediction approach called LightGBM-PPI, developed by Chen et al. [85], combines four protein sequence coding modes (CT, PseAAC, AD, and CTD) to generate feature representations for protein pairs. The initial feature vector is constructed by concatenating these representations. Through the application of elastic net, redundant and unnecessary features are removed, leaving only the best subset of features. Lastly, the LightGBM classifier is employed to predict PPIs by capturing the non-linear associations between category labels and sequence characteristics using the optimal feature vectors.
GcForest-PPI [86], similar to LightGBM-PPI, also utilizes the elastic net to remove irrelevant information and noise. The difference is that GcForest-PPI extracts multivariate features using PseAAC, CTD, multivariate mutual information (MMI), AD, an amino acid composition PSSM (AAC-PSSM), and a dipeptide composition PSSM (DPC-PSSM). The method integrates random forest (RF), extremely randomized trees, and XGBoost within a cascade architecture to build a deep forest model for PPI prediction. This approach demonstrates strong prediction performance and generalization ability, enabling cross-species prediction.
FWRF [87] is a novel feature weighted rotating forest algorithm. Initially, FWRF converts each amino acid sequence into a PSSM, which is then transformed into a 256-dimensional feature vector using local phase quantization (LPQ). Recognizing that high-dimensional PPI data may contain noise, FWRF assigns weights to different features and removes noisy features with small weights according to a given selection ratio. This improved rotating forest algorithm effectively reduces data dimensionality, eliminates noise, reduces execution time, and improves classification accuracy. It outperforms the original rotating forest classifier, as well as SVM classifiers and other existing methods.
Profppikernel [58] employs support vector machines (SVMs) to generate hyperplanes that efficiently discriminate between two categories of high-dimensional numerical vectors. In this methodology, every sequence is represented as a vector of k-mer frequencies, and the profile kernel is computed as the dot product between two such vectors. The enhancement in prediction is particularly notable for proteins lacking substantial sequence similarity to those with well-established experimental annotations.
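The k-mer spectrum underlying this kernel can be sketched as below. Note that the full profile kernel of Profppikernel also weights k-mers by profile-derived (PSSM) mutation neighborhoods; this minimal sketch shows only the plain k-mer counting and dot product.

```python
# Spectrum kernel sketch: represent each sequence by its k-mer counts
# and take the dot product of the two count vectors.
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every length-k substring (k-mer) of the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    """Dot product of two k-mer frequency vectors; absent k-mers add zero."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1.keys() & c2.keys())
```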
The method proposed by Göktepe et al. [88] adopts a novel strategy called weighted skip-sequential conjoint triads for extracting protein features in addition to using PseAAC and Bi-Gram representations. To reduce feature dimensionality, the eigenvalue one criterion (i.e., Kaiser criterion) from principal component analysis is applied, removing components with eigenvalues less than one. The final prediction of interactions is made using an SVM.
CoFex [107] is a novel method that constructs feature vectors for protein pairs rather than individual proteins and is capable of extracting and identifying the coevolutionary features represented by covariates at coevolutionary positions. CoFex employs a two-step process: identifying the covariates and weighting the validated covariates. Combined with existing classifiers, CoFex can efficiently identify PPIs. However, the algorithm also suffers from some limitations, such as repeated traversal of the protein sequence information when extracting coevolutionary patterns and excessive memory consumption. To overcome these limitations and support large-scale PPI prediction, Hu et al. [89] introduced CoFex+, an improved algorithm derived from the original CoFex and integrated with the MapReduce framework. CoFex+ uses a multiway tree-based structure in which each node defines two items: a group of proteins and the amino acid designated for that node. Each coevolutionary pattern is identified by CoFex+ through two phases: mapping and reduction. CoFex+ can perform tasks in a distributed manner, greatly improving computational efficiency, although data transmission time places an upper limit on the algorithm's efficiency.
SSWRF [90] addresses the class imbalance problem in PPI site prediction by integrating an SVM with sample-weighted random forests (SWRF). The SVM computes sample scores on the training set to estimate training sample weights, which are then used to train the SWRF. As inputs, SSWRF extracts the averaged cumulative hydropathy (ACH), PSSM-derived features, and the averaged cumulative relative solvent accessibility (ACRSA) from protein sequences. By integrating ensemble learning with cost-sensitive learning, the adverse impact of class imbalance can be effectively mitigated.
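The core idea — SVM-derived sample weights steering a forest toward hard examples — can be sketched as follows. The synthetic data and the inverse-margin weighting scheme here are illustrative assumptions, not the exact formulation used by SSWRF.

```python
# Sketch: an SVM's decision values on the training set become per-sample
# weights (low-margin, hard-to-separate samples get larger weight), and
# those weights then train a random forest.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)  # toy labels

svm = SVC(kernel="rbf").fit(X, y)
margin = svm.decision_function(X)
# Hypothetical weighting: inversely proportional to SVM confidence, so
# borderline samples influence the forest most.
weights = 1.0 / (1.0 + np.abs(margin))

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y, sample_weight=weights)
```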

4.2. Structure-Based Methods

Protein structures provide valuable information beyond protein sequences and can be utilized in the development of structure-based prediction methods. Among them, homology modeling and docking algorithms are widely employed.
For a pair of proteins, PrePPI [91] superimposes the structural representation determined using sequence alignment onto the interaction template found using structure alignment, creating a complex model (without building a three-dimensional model). Five empirical scores (SIM, SIZ, COV, OS, and OL) are then combined using Bayesian networks to derive a likelihood ratio (LR) indicating the probability that the proposed complex reflects a real interaction. Finally, for each PPI, the model with the highest LR is selected. Beyond predicting binary interactions, PrePPI holds the potential to unveil hitherto undetected interfaces. Similarly, InterPred [70] also uses structural alignment and modeling techniques, but it applies an RF classifier that integrates multiple structural features to identify the real interaction model and then refines the resulting model. The feature descriptors used in InterPred include interface similarity, structure alignment scores, templates, target lengths, alignments, and sequence similarity from the template modeling of the two targets.
In their study, Dong et al. [32] applied a structure-based docking algorithm to discern Arabidopsis PPIs. This technique incorporated the HEX algorithm, which rotated each protein incrementally by 15 degrees in three-dimensional space, computing docking energies for 191.1 million conformations per protein pair. Using a spherical polar Fourier correlation to expedite the search, HEX sought potential low-energy configurations with a hydrophobic repulsive volume model, and shape complementarity served as a distinctive scoring function.
Bryant et al. [92] combined AlphaFold2 with improved multiple sequence alignments to produce interaction models. A predicted DockQ score, $\mathrm{pDockQ} = \frac{L}{1 + e^{-k(x - x_0)}} + b$, where $x$ is the average interface plDDT multiplied by the logarithm of the number of interface contacts, $L = 0.724$, $x_0 = 152.611$, $k = 0.052$, and $b = 0.018$, was used to evaluate multiple models of a protein pair, distinguishing acceptable models from erroneous ones with high confidence. pDockQ shows consistent accuracy in identifying real PPIs, correctly recognizing 51% of interacting pairs with a low error rate. However, experimental results indicate that its performance is contingent upon the specific interaction.
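The pDockQ sigmoid with the published constants is a direct transcription; only the example input values below are hypothetical.

```python
# pDockQ = L / (1 + exp(-k (x - x0))) + b, where x is the average
# interface plDDT times the log of the number of interface contacts.
import math

def pdockq(avg_if_plddt, n_if_contacts,
           L=0.724, x0=152.611, k=0.052, b=0.018):
    x = avg_if_plddt * math.log(n_if_contacts)
    return L / (1 + math.exp(-k * (x - x0))) + b

# A confident interface (high plDDT, many contacts) scores near L + b;
# a poor one decays toward b.
```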

4.3. Network-Based Methods

As more PPI data emerges, the scale and complexity of PPI networks continue to grow. Network-based computational models aim to predict PPIs by analyzing the connection patterns within existing networks and assessing the similarities and complementarities between the nodes.
The CN [29] algorithm predicts new links from the common neighbors a pair of nodes currently shares; each newly formed link in turn creates additional shared neighbors, i.e., future common neighbors. Building upon this foundation, Li et al. [97] introduced a similarity-based future common neighbors (SFCN) model. By classifying future common neighbors into three categories based on their topological relationships with other vertices, the SFCN model effectively identifies and quantifies their contributions in complex networks with a high degree of accuracy and robustness.
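The CN score itself is simply the size of the shared-neighbor set, as in this minimal sketch over a toy network (the graph is illustrative):

```python
# Common-neighbors (CN) link prediction on a PPI network stored as an
# adjacency dict of neighbor sets: the score of a candidate pair is the
# number of interaction partners they already share.
def cn_score(adj, x, y):
    return len(adj[x] & adj[y])

# Toy network: A and D are not linked but share neighbors B and C, so
# the A-D pair gets the highest CN score among unlinked pairs.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
```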
Lei et al. [99] introduced the random walk with resistance (RWS) algorithm, which hypothesizes that nodes within a PPI network that exhibit comparable “distances” from other nodes are inclined to engage in interactions. First, the “distance” between each protein node and other network nodes is calculated. Then, the topological profile similarity between each pair of nodes is assessed. Finally, topologically similar nodes are connected by edges. This method yields a network with enhanced biological relevance compared to the original network, addressing the issues of high noise, sparsity, and skewed distribution. However, the model’s approach of maintaining the number of edges in the original network through a simple cutoff-based strategy may compromise its robustness.
Kovács et al. [30] challenged the rationality of the CN algorithm, highlighting that common neighbors solely reflect similarities in interaction interfaces among protein pairs and do not necessarily imply direct interaction. They proposed the L3 principle based on protein structure and evolution, which suggests that proteins connected through multiple pathways of length 3 are more likely to have direct connections. To validate this hypothesis and mitigate the influence of high-degree nodes on prediction, they devised a degree-normalized L3 score, as follows:
$$P_{xy} = \sum_{u \in U,\, v \in V} \frac{a_{xu}\, a_{uv}\, a_{vy}}{\sqrt{k_u k_v}}$$
where $U$ and $V$ are the sets of intermediate nodes on paths of length 3, $k_u$ and $k_v$ are the degrees of the corresponding nodes, and $a_{xu} = 1$ if $x$ and $u$ interact and 0 otherwise. The experimental findings demonstrated that the predictive performance of L3 on PPI networks outperforms existing link prediction methods, including CN. However, L3 is unable to identify interacting partners in the absence of known links, and its normalization terms are empirically derived rather than biologically motivated. To overcome these limitations, Yuen et al. [100] proposed a link predictor called NormalizedL3 (L3N), which optimizes the original L3 algorithm from a network modeling perspective. L3N decomposes the L3 principle into a sequence of computations, employing a similarity metric to compare graph neighborhoods. The key formula involved in this approach is as follows:
$$P_{xy} = f(N_x, U)\, f(N_y, V) \sum_{U,V} f(N_u, V)\, f(N_v, U)\, f(N_x, N_v)\, f(N_y, N_u)$$
where $N_x$ represents the set of neighbors of $x$, $u$ and $v$ are nodes in the intermediate node sets $U$ and $V$, respectively, and the $f$ function is either a simple ratio $f_1(A, B) = \frac{|A \cap B|}{|A|}$ or Jaccard's coefficient. L3N addresses some of the shortcomings of other L3 predictors and is capable of accurately forecasting PPIs with significant biological relevance.
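The degree-normalized L3 score of Kovács et al. can be sketched over the same adjacency-dict representation; the path-validity conditions (no revisiting of $x$ or $u$) are a minimal assumption of this sketch:

```python
# Degree-normalized L3 score: sum over all x-u-v-y paths of length 3,
# penalizing hub intermediates by sqrt(k_u * k_v).
import math

def l3_score(adj, x, y):
    score = 0.0
    for u in adj[x]:
        if u == y:
            continue
        for v in adj[u]:
            if v in (x, u) or y not in adj[v]:
                continue
            score += 1.0 / math.sqrt(len(adj[u]) * len(adj[v]))
    return score
```

On a single path x-u-v-y where both intermediates have degree 2, the score is $1/\sqrt{2 \cdot 2} = 0.5$; higher-degree intermediates contribute less.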
Sim [98] is another L3-based PPI network link prediction method, proposed from the perspective of protein complementary interfaces and gene duplications. Sim assumes that a link exists between two nodes if their neighborhoods are very similar, and it uses Jaccard similarity as the similarity metric between nodes because it reflects the interface and evolutionary resemblance between pairs of proteins. The link prediction index is computed as
$$\mathrm{Sim} = AS + SA$$
where $A$ refers to the adjacency matrix and $S$ stands for the similarity matrix. By integrating Sim with L3, the accuracy and robustness of the prediction are greatly improved; however, the method lacks generality.
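A minimal matrix-level sketch of the Sim index, assuming $S$ is built from Jaccard similarity of node neighborhoods as described above (the toy graph is illustrative):

```python
# Sim index: combine the adjacency matrix A with a Jaccard similarity
# matrix S of node neighborhoods via Sim = A @ S + S @ A.
import numpy as np

def sim_matrix(A):
    n = A.shape[0]
    nbrs = [set(np.flatnonzero(A[i])) for i in range(n)]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = nbrs[i] | nbrs[j]
            S[i, j] = len(nbrs[i] & nbrs[j]) / len(union) if union else 0.0
    return A @ S + S @ A
```

Because both $A$ and $S$ are symmetric for an undirected network, the resulting index matrix is symmetric as well.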

4.4. GO-Based Methods

Interactions between proteins are usually indicated by high GO similarity, and the value of this similarity serves as a reliable gauge of interaction confidence.
PPI-MetaGO [101] is a stacked generalization scheme that integrates various learning algorithms and feature representations of protein pairs. It employs a two-stage stacked generalization architecture, with four base classifiers in the bottom layer. First, feature vectors obtained from PPI network topology, physicochemical properties, and GO terms are individually input to each trained classifier. Then, in the top layer, a meta-classifier using a radial basis function (RBF) kernel SVM combines the predictions from foundational classifiers to make the ultimate prediction. Although PPI-MetaGO has demonstrated high feasibility and superiority in PPI prediction, the prediction performance can vary with alterations of training data due to the dependency on features derived from the PPI network and the GO hierarchical clustering.
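The two-stage stacked-generalization design can be sketched as follows. This is a toy sketch: in PPI-MetaGO each base classifier sees a different feature view (network topology, physicochemical properties, GO terms), whereas here all learners share the same synthetic features for brevity; the RBF-kernel SVM meta-classifier matches the description above.

```python
# Stacked generalization: base classifiers in the bottom layer, an
# RBF-kernel SVM meta-classifier in the top layer.
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, :2].sum(axis=1) > 0).astype(int)  # toy interaction labels

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
    ],
    final_estimator=SVC(kernel="rbf"),  # top-layer meta-classifier
)
stack.fit(X, y)
```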
Bandyopadhyay et al. [77] proposed using GO terms to represent protein pairs. In this scheme, a document symbolizes a protein pair, and each word in the document represents annotations about the protein pair. The feature value of each term is determined by multiplying the information content of corresponding terms with a certain weight. This weight is calculated by term specificity and the topological structure of the GO graph. The final prediction results are output from a trained SVM classifier. However, this method has not fully leveraged the information and relationships that exist in the GO graph.

4.5. DL-Based Methods

The concept of DL emerged from the investigation of artificial neural networks. The pivotal step involves the transformation of low-level features into higher-level representations of abstract attributes or features. DL architecture has been widely applied in PPI prediction, where it can accommodate various types of input data, including protein sequence, 3D structure, and network topology.
For instance, Sun et al. [104] applied the DL algorithm to sequence-based PPI prediction. Their method encoded protein sequences using feature extraction techniques such as autocovariance (AC) and CT. They then employed stacked autoencoders (SAE), a type of neural network with multiple layers of autoencoders. These autoencoders were trained layer by layer, with the output of the preceding layer serving as the input for the subsequent layer. Finally, a SoftMax classifier linked to the final output provided the classification prediction results. SAE effectively learned hidden interaction features from the protein sequence inputs, showcasing good generalization ability.
Similarly, DeepPPI [102] utilized a deep neural network to extract protein features from sequences. DeepPPI combined multiple descriptors, including AAC, CTD, DPC, Amphiphilic Pseudo-Amino Acid Composition (APAAC), and Quasi-Sequence Order (QSO), to obtain a fixed-length feature representation. The method used two independent sub-networks, each accepting the descriptors of one protein in the pair; each sub-network comprised three hidden layers, and their outputs were combined by a merging layer. The last layer employed a one-hot coded label to distinguish interactions. DeepPPI effectively learned informative features of protein pairs through hierarchical abstraction, resulting in strong performance, although an overly small dataset may impede the model's performance.
DPPI [106] introduces a DL framework specialized in forecasting direct physical PPIs through protein sequence modeling, hinging on a Siamese-like architecture of convolutional neural networks (CNNs). DPPI constructs a feature representation for every protein sequence leveraging an extensive collection of unsupervised information, which is then passed through a convolutional module, a random projection (RP) module, and a prediction module to determine whether the two input proteins interact. Notably, the RP module explores combinations of protein motifs, and to keep the RP module invariant to input order, DPPI switches stochastic weights so that, regardless of the order of the input proteins, the linear layer of the prediction module processes identical input vectors. Furthermore, DPPI can effectively model binding affinity and is applicable to different biological problems.
To address the PPI prediction problem in a cross-species context, Sledzieski et al. [105] proposed D-SCRIPT, a DL model that combines neural language modeling with structure-driven design. The model employs a pre-trained deep protein language model to generate feature embeddings and uses the low-dimensional information in these representations to compute a protein contact map, which reflects the distance relationships between residues in the protein structure and is finally aggregated into an interaction probability by the interaction module. While D-SCRIPT performs well in cross-species interaction prediction, its simplicity and degree of regularization may limit its predictive ability within species.
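A residue contact map of the kind these models operate on can be sketched directly from Cα coordinates. The 8 Å threshold is a common convention assumed here, not a value taken from the cited papers, and the coordinates are illustrative.

```python
# Binary residue contact map from C-alpha coordinates: residues whose
# pairwise distance falls below a threshold are marked as in contact.
import numpy as np

def contact_map(ca_coords, threshold=8.0):
    # Pairwise Euclidean distances via broadcasting, then thresholding.
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return (dist < threshold).astype(np.int8)  # diagonal is trivially 1
```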
Struct2Graph [71], a method for predicting PPIs using only protein 3D structural data, is a mutual attention classifier based on graph convolutional networks (GCNs). First, two protein graphs are mapped into real-valued embeddings using concurrent GCNs and then combined through a mutual attention network. The resulting output vector is fed into a single-layer, fully connected, feed-forward neural network to generate a two-dimensional vector, and the classification probability is finally obtained through a SoftMax layer. Struct2Graph achieves state-of-the-art performance on both balanced and imbalanced datasets.
Benefiting from AlphaFold2’s precise predictions of protein monomer structures, Huang et al. [103] introduced SGPPI, an efficient model for PPI prediction. The overarching structural features of proteins and the specific structural attributes of patches are taken into account in this model. In their Siamese-like network architecture, each protein is represented individually as a residue contact map, and each residue is annotated by PSSM mapping, local and global geometrical descriptors, and its position in the secondary structure, which are then processed by GCN convolution to give prediction results through the fully connected layer. SGPPI demonstrates high accuracy and robustness on rigorous datasets.
TAGPPI [9] is an end-to-end DL architecture that combines protein sequence and structural features. It applies TextCNN to pre-trained sequence embeddings to extract local features from protein sequences, while protein structural features are derived from contact maps using a stacked module of three graph attention networks. The weighted sum of the two types of features is concatenated with the feature representation of the partner protein before being fed into a binary classifier composed of fully connected layers.

5. Current Challenges and Future Prospects

PPI prediction stands as an ever-evolving and challenging domain of research, holding paramount importance in unraveling biological processes and disease mechanisms. Most of the existing studies use computational methods starting from protein sequences, structures, and other useful biological information, which overcome the limitations of wet lab experiments and boast a high predictive accuracy. However, there are still some problems and challenges that need to be addressed within this research domain.
PPI data are usually limited and incomplete. Although the advent of high-throughput techniques has augmented the PPI network and furnished more data for predictive methodologies, its coverage remains circumscribed. Additionally, interactions between diverse proteins are not uniformly distributed. Certain proteins may involve multiple interactions, while others may involve only one. Furthermore, PPI data encompass false-positive outcomes attributable to experimental conditions and technical biases [25]. When constructing negative datasets, most computational methods use the strategy of randomly pairing proteins with different subcellular localizations [60,69,108,109]. Nevertheless, proteins sharing the same subcellular localization do not invariably interact. Employing datasets founded on biased raw data for model training engenders a decline in its reliability. Therefore, it is necessary to design more rational data preprocessing methods.
PPIs within organisms constitute a dynamic process governed by numerous environmental determinants and regulatory mechanisms. The genuine PPI network in cells undergoes continual changes across various phases of the cell cycle [110], giving rise to an array of dynamic PPI networks. Most existing computational prediction algorithms usually consider only static PPIs, neglecting the diversity and dynamism inherent to PPIs. Gene expression data captured at distinct time points and under varying environmental conditions can unveil dynamic insights concerning proteins [111]. Zhang et al. [112] calculated the likelihood of activity for each protein at different time junctures to construct a dynamic PPI network. Ou-Yang et al. [113] calculated Pearson correlation coefficients based on gene expression profiles, thereby identifying stable and transient interactions. In addition, thermal proximity coaggregation (TPCA) proved to be useful in studying the dynamics of PPIs, as well as the functional characterization of complexes identified by other proteome-wide methods under different cellular states and conditions [114].
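The Pearson-based distinction between stable and transient interactions can be sketched as below. The 0.5 cutoff and the expression profiles are illustrative assumptions, not the values used by Ou-Yang et al. [113].

```python
# Classify an interaction as stable or transient from the Pearson
# correlation of the two proteins' gene expression profiles.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def interaction_type(expr_a, expr_b, cutoff=0.5):
    """Strongly co-expressed pairs are treated as stable interactions."""
    return "stable" if abs(pearson(expr_a, expr_b)) >= cutoff else "transient"
```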
Anticipating PPIs across species constitutes another formidable challenge, as proteins vary to some extent among species. Presently, studies of cross-species PPI prediction are relatively scarce or confined in scope, with the evaluation of predictive performance limited to a small set of frequently utilized datasets. Achieving broader cross-species PPI prediction therefore warrants consideration. Certain evolutionary patterns and protein structural features of PPIs are relatively more conserved across species and thus deserve heightened emphasis in forthcoming studies.
The majority of current computational methodologies center on relatively limited or singular types of information. The development of AlphaFold has made structure prediction possible, concurrent with ongoing progress in genomic and proteomic research. The amalgamation of diverse bioinformatic data types into predictive models represents a pivotal avenue for enhancing predictive accuracy. Techniques rooted in deep learning and artificial intelligence have demonstrated excellent performance in PPI prediction and can be utilized to build more complex and accurate prediction models. Natural language processing (NLP) can be harnessed to acquire feature embeddings of proteins, which, when fused with handcrafted features, markedly elevate generalization capabilities and prediction performance [69]. Hence, there arises a necessity to devise efficient protein-encoding techniques to better encapsulate the array of protein features.

6. Key Points

  • This article presents an exposition on the commonly utilized databases and the methodologies for constructing datasets employed in PPI prediction tasks.
  • The relevant feature extraction strategies and computational methods have been classified and discussed.
  • Deep learning algorithms demonstrate significant advantages in extracting diverse protein features and exhibit strong predictive performance.
  • Three crucial aspects that future research in PPI prediction should prioritize are more accurate and effective datasets, the extraction and integration strategies of multiple kinds of features, as well as more universally applicable prediction methodologies.

Author Contributions

L.X. drafted the manuscript. Y.W. revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 62102269, 62373080).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Berggard, T.; Linse, S.; James, P. Methods for the detection and analysis of protein-protein interactions. Proteomics 2007, 7, 2833–2842. [Google Scholar] [CrossRef]
  2. De Las Rivas, J.; Fontanillo, C. Protein-Protein Interactions Essentials: Key Concepts to Building and Analyzing Interactome Networks. PLoS Comput. Biol. 2010, 6, e1000807. [Google Scholar] [CrossRef] [PubMed]
  3. Zhou, H.X.; Shan, Y.B. Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins Struct. Funct. Genet. 2001, 44, 336–343. [Google Scholar] [CrossRef]
  4. Braun, P.; Gingras, A.-C. History of protein-protein interactions: From egg-white to complex networks. Proteomics 2012, 12, 1478–1498. [Google Scholar] [CrossRef] [PubMed]
  5. De Las Rivas, J.; Fontanillo, C. Protein-protein interaction networks: Unraveling the wiring of molecular machines within the cell. Brief. Funct. Genom. 2012, 11, 489–496. [Google Scholar] [CrossRef] [PubMed]
  6. Wang, R.-S.; Wang, Y.; Wu, L.-Y.; Zhang, X.-S.; Chen, L. Analysis on multi-domain cooperation for predicting protein-protein interactions. BMC Bioinform. 2007, 8, 391. [Google Scholar] [CrossRef] [PubMed]
  7. Yang, X.; Niu, Z.; Liu, Y.; Song, B.; Lu, W.; Zeng, L.; Zeng, X. Modality-DTA: Multimodality Fusion Strategy for Drug-Target Affinity Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1200–1210. [Google Scholar] [CrossRef] [PubMed]
  8. Bakail, M.; Ochsenbein, F. Targeting protein-protein interactions, a wide open field for drug design. Comptes Rendus Chim. 2016, 19, 19–27. [Google Scholar] [CrossRef]
  9. Song, B.; Luo, X.; Luo, X.; Liu, Y.; Niu, Z.; Zeng, X. Learning spatial structures of proteins improves protein-protein interaction prediction. Brief. Bioinform. 2022, 23, bbab558. [Google Scholar] [CrossRef]
  10. Petta, I.; Lievens, S.; Libert, C.; Tavernier, J.; De Bosscher, K. Modulation of Protein-Protein Interactions for the Development of Novel Therapeutics. Mol. Ther. 2016, 24, 707–718. [Google Scholar] [CrossRef]
  11. Zhang, L.; Li, S.; Hao, C.X.; Hong, G.N.; Zou, J.F.; Zhang, Y.N.; Li, P.F.; Guo, Z. Extracting a few functionally reproducible biomarkers to build robust subnetwork-based classifiers for the diagnosis of cancer. Gene 2013, 526, 232–238. [Google Scholar] [CrossRef]
  12. Tian, Y.; Su, X.; Su, Y.; Zhang, X. EMODMI: A Multi-Objective Optimization Based Method to Identify Disease Modules. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 5, 570–582. [Google Scholar] [CrossRef]
  13. Gavin, A.C.; Bosche, M.; Krause, R.; Grandi, P.; Marzioch, M.; Bauer, A.; Schultz, J.; Rick, J.M.; Michon, A.M.; Cruciat, C.M.; et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415, 141–147. [Google Scholar] [CrossRef] [PubMed]
  14. Parrish, J.R.; Gulyas, K.D.; Finley, R.L., Jr. Yeast two-hybrid contributions to interactome mapping. Curr. Opin. Biotechnol. 2006, 17, 387–393. [Google Scholar] [CrossRef]
  15. Ito, T.; Chiba, T.; Ozawa, R.; Yoshida, M.; Hattori, M.; Sakaki, Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 2001, 98, 4569–4574. [Google Scholar] [CrossRef] [PubMed]
  16. Vinogradova, O.; Qin, J. NMR as a Unique Tool in Assessment and Complex Determination of Weak Protein-Protein Interactions. Top Curr. Chem. 2012, 326, 35–45. [Google Scholar] [CrossRef] [PubMed]
  17. O’Connell, M.R.; Gamsjaeger, R.; Mackay, J.P. The structural analysis of protein-protein interactions by NMR spectroscopy. Proteomics 2009, 9, 5224–5232. [Google Scholar] [CrossRef] [PubMed]
  18. Tong, A.H.Y.; Evangelista, M.; Parsons, A.B.; Xu, H.; Bader, G.D.; Page, N.; Robinson, M.; Raghibizadeh, S.; Hogue, C.W.V.; Bussey, H.; et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 2001, 294, 2364–2368. [Google Scholar] [CrossRef] [PubMed]
  19. Ooi, S.L.; Pan, X.W.; Peyser, B.D.; Ye, P.; Meluh, P.B.; Yuan, D.S.; Irizarry, R.A.; Bader, J.S.; Spencer, F.A.; Boeke, J.D. Global synthetic-lethality analysis and yeast functional profiling. Trends Genet. 2006, 22, 56–63. [Google Scholar] [CrossRef] [PubMed]
  20. Foltman, M.; Sanchez-Diaz, A. Studying Protein-Protein Interactions in Budding Yeast Using Co-immunoprecipitation. Methods Mol. Biol. 2016, 1369, 239–256. [Google Scholar] [CrossRef]
  21. Zhu, H.; Bilgin, M.; Bangham, R.; Hall, D.; Casamayor, A.; Bertone, P.; Lan, N.; Jansen, R.; Bidlingmaier, S.; Houfek, T.; et al. Global analysis of protein activities using proteome chips. Science 2001, 293, 2101–2105. [Google Scholar] [CrossRef]
  22. Piehler, J. New methodologies for measuring protein interactions in vivo and in vitro. Curr. Opin. Struct. Biol. 2005, 15, 4–14. [Google Scholar] [CrossRef] [PubMed]
  23. Byron, O.; Vestergaard, B. Protein-protein interactions: A supra-structural phenomenon demanding trans-disciplinary biophysical approaches. Curr. Opin. Struct. Biol. 2015, 35, 76–86. [Google Scholar] [CrossRef] [PubMed]
  24. Collins, S.R.; Kemmeren, P.; Zhao, X.-C.; Greenblatt, J.F.; Spencer, F.; Holstege, F.C.P.; Weissman, J.S.; Krogan, N.J. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell. Proteom. 2007, 6, 439–450. [Google Scholar] [CrossRef] [PubMed]
  25. Huang, H.; Bader, J.S. Precision and recall estimates for two-hybrid screens. Bioinformatics 2009, 25, 372–378. [Google Scholar] [CrossRef]
  26. Ding, Z.; Kihara, D. Computational identification of protein-protein interactions in model plant proteomes. Sci. Rep. 2019, 9, 8740. [Google Scholar] [CrossRef]
  27. Gingras, A.-C.; Gstaiger, M.; Raught, B.; Aebersold, R. Analysis of protein complexes using mass spectrometry. Nat. Rev. Mol. Cell Biol. 2007, 8, 645–654. [Google Scholar] [CrossRef]
  28. Marmier, G.; Weigt, M.; Bitbol, A.-F. Phylogenetic correlations can suffice to infer protein partners from sequences. PLoS Comput. Biol. 2019, 15, e1007179. [Google Scholar] [CrossRef]
  29. Liben-Nowell, D.; Kleinberg, J. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 2007, 58, 1019–1031. [Google Scholar] [CrossRef]
  30. Kovacs, I.A.; Luck, K.; Spirohn, K.; Wang, Y.; Pollis, C.; Schlabach, S.; Bian, W.; Kim, D.-K.; Kishore, N.; Hao, T.; et al. Network-based prediction of protein interactions. Nat. Commun. 2019, 10, 1240. [Google Scholar] [CrossRef]
  31. Wass, M.N.; Fuentes, G.; Pons, C.; Pazos, F.; Valencia, A. Towards the prediction of protein interaction partners using physical docking. Mol. Syst. Biol. 2011, 7, 469. [Google Scholar] [CrossRef]
  32. Dong, S.; Lau, V.; Song, R.; Ierullo, M.; Esteban, E.; Wu, Y.; Sivieng, T.; Nahal, H.; Gaudinier, A.; Pasha, A.; et al. Proteome-wide, Structure-Based Prediction of Protein-Protein Interactions/New Molecular Interactions Viewer. Plant Physiol. 2019, 179, 1893–1907. [Google Scholar] [CrossRef]
  33. Pierce, B.G.; Wiehe, K.; Hwang, H.; Kim, B.-H.; Vreven, T.; Weng, Z. ZDOCK server: Interactive docking prediction of protein-protein complexes and symmetric multimers. Bioinformatics 2014, 30, 1771–1773. [Google Scholar] [CrossRef]
  34. Ohue, M.; Matsuzaki, Y.; Uchikoga, N.; Ishida, T.; Akiyama, Y. MEGADOCK: An All-to-All Protein-Protein Interaction Prediction System Using Tertiary Structure Data. Protein Pept. Lett. 2014, 21, 766–778. [Google Scholar] [CrossRef]
  35. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Zidek, A.; Potapenko, A.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef]
  36. Chowdhury, R.; Bouatta, N.; Biswas, S.; Floristean, C.; Kharkare, A.; Roye, K.; Rochereau, C.; Ahdritz, G.; Zhang, J.; Church, G.M.; et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 2022, 40, 1617–1623. [Google Scholar] [CrossRef]
  37. Li, P.; Liu, Z.-P. PST-PRNA: Prediction of RNA-binding sites using protein surface topography and deep learning. Bioinformatics 2022, 38, 2162–2168. [Google Scholar] [CrossRef]
  38. Zhang, S.; Zhou, J.; Hu, H.; Gong, H.; Chen, L.; Cheng, C.; Zeng, J. A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res. 2016, 44, e32. [Google Scholar] [CrossRef]
  39. Salwinski, L.; Miller, C.S.; Smith, A.J.; Pettit, F.K.; Bowie, J.U.; Eisenberg, D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004, 32, D449–D451. [Google Scholar] [CrossRef]
  40. Oughtred, R.; Stark, C.; Breitkreutz, B.-J.; Rust, J.; Boucher, L.; Chang, C.; Kolas, N.; O’Donnell, L.; Leung, G.; McAdam, R.; et al. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 2019, 47, D529–D541. [Google Scholar] [CrossRef]
  41. Kerrien, S.; Aranda, B.; Breuza, L.; Bridge, A.; Broackes-Carter, F.; Chen, C.; Duesbury, M.; Dumousseau, M.; Feuermann, M.; Hinz, U.; et al. The IntAct molecular interaction database in 2012. Nucleic Acids Res. 2012, 40, D841–D846. [Google Scholar] [CrossRef]
  42. Szklarczyk, D.; Kirsch, R.; Koutrouli, M.; Nastou, K.; Mehryary, F.; Hachilif, R.; Gable, A.L.; Fang, T.; Doncheva, N.T.; Pyysalo, S.; et al. The STRING database in 2023: Protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023, 51, D638–D646. [Google Scholar] [CrossRef]
  43. Prasad, T.S.K.; Goel, R.; Kandasamy, K.; Keerthikumar, S.; Kumar, S.; Mathivanan, S.; Telikicherla, D.; Raju, R.; Shafreen, B.; Venugopal, A.; et al. Human Protein Reference Database-2009 update. Nucleic Acids Res. 2009, 37, D767–D772. [Google Scholar] [CrossRef]
  44. Licata, L.; Briganti, L.; Peluso, D.; Perfetto, L.; Iannuccelli, M.; Galeota, E.; Sacco, F.; Palma, A.; Nardozza, A.P.; Santonico, E.; et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 2012, 40, D857–D861. [Google Scholar] [CrossRef]
  45. Alanis-Lobato, G.; Andrade-Navarro, M.A.; Schaefer, M.H. HIPPIE v2.0: Enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Res. 2017, 45, D408–D414. [Google Scholar] [CrossRef]
  46. Alfarano, C.; Andrade, C.E.; Anthony, K.; Bahroos, N.; Bajec, M.; Bantoft, K.; Betel, D.; Bobechko, B.; Boutilier, K.; Burgess, E.; et al. The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 2005, 33, D418–D424. [Google Scholar] [CrossRef]
  47. Blohm, P.; Frishman, G.; Smialowski, P.; Goebels, F.; Wachinger, B.; Ruepp, A.; Frishman, D. Negatome 2.0: A database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis. Nucleic Acids Res. 2014, 42, D396–D400. [Google Scholar] [CrossRef]
  48. Bateman, A.; Martin, M.-J.; Orchard, S.; Magrane, M.; Ahmad, S.; Alpi, E.; Bowler-Barnett, E.H.; Britto, R.; Cukura, A.; Denny, P.; et al. UniProt: The Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023, 51, D523–D531. [Google Scholar] [CrossRef]
  49. Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.C.; Estreicher, A.; Gasteiger, E.; Martin, M.J.; Michoud, K.; O’Donovan, C.; Phan, I.; et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31, 365–370. [Google Scholar] [CrossRef]
  50. Barker, W.C.; Garavelli, J.S.; McGarvey, P.B.; Marzec, C.R.; Orcutt, B.C.; Srinivasarao, G.Y.; Yeh, L.S.L.; Ledley, R.S.; Mewes, H.W.; Pfeiffer, F.; et al. The PIR-International Protein Sequence Database. Nucleic Acids Res. 1999, 27, 39–43. [Google Scholar] [CrossRef]
  51. Andreeva, A.; Kulesha, E.; Gough, J.; Murzin, A.G. The SCOP database in 2020: Expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 2020, 48, D376–D382. [Google Scholar] [CrossRef]
  52. Bittrich, S.; Rose, Y.; Segura, J.; Lowe, R.; Westbrook, J.D.; Duarte, J.M.; Burley, S.K. RCSB Protein Data Bank: Improved annotation, search and visualization of membrane protein structures archived in the PDB. Bioinformatics 2022, 38, 1452–1454. [Google Scholar] [CrossRef]
  53. Carbon, S.; Dietze, H.; Lewis, S.E.; Mungall, C.J.; Munoz-Torres, M.C.; Basu, S.; Chisholm, R.L.; Dodson, R.J.; Fey, P.; Thomas, P.D.; et al. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 2017, 45, D331–D338. [Google Scholar] [CrossRef]
  54. Galperin, M.Y.; Wolf, Y.I.; Makarova, K.S.; Alvarez, R.V.; Landsman, D.; Koonin, E.V. COG database update: Focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021, 49, D274–D281. [Google Scholar] [CrossRef]
  55. Kanehisa, M.; Furumichi, M.; Sato, Y.; Kawashima, M.; Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023, 51, D587–D592. [Google Scholar] [CrossRef]
  56. Skrzypek, M.S.; Binkley, J.; Binkley, G.; Miyasato, S.R.; Simison, M.; Sherlock, G. The Candida Genome Database (CGD): Incorporation of Assembly 22, systematic identifiers and visualization of high throughput sequencing data. Nucleic Acids Res. 2017, 45, D592–D596. [Google Scholar] [CrossRef]
  57. Shen, J.; Zhang, J.; Luo, X.; Zhu, W.; Yu, K.; Chen, K.; Li, Y.; Jiang, H. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA 2007, 104, 4337–4341. [Google Scholar] [CrossRef]
  58. Hamp, T.; Rost, B. Evolutionary profiles improve protein-protein interaction prediction from sequence. Bioinformatics 2015, 31, 1945–1950. [Google Scholar] [CrossRef]
  59. Pan, X.-Y.; Zhang, Y.-N.; Shen, H.-B. Large-Scale Prediction of Human Protein-Protein Interactions from Amino Acid Sequence Based on Latent Topic Features. J. Proteome Res. 2010, 9, 4992–5001. [Google Scholar] [CrossRef]
  60. Mahapatra, S.; Gupta, V.R.; Sahu, S.S.; Panda, G. Deep Neural Network and Extreme Gradient Boosting Based Hybrid Classifier for Improved Prediction of Protein-Protein Interaction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 155–165. [Google Scholar] [CrossRef]
  61. Chou, K.C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet. 2001, 43, 246–255. [Google Scholar] [CrossRef]
  62. Chou, K.C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005, 21, 10–19. [Google Scholar] [CrossRef]
  63. Saravanan, V.; Gautham, N. Harnessing Computational Biology for Exact Linear B-Cell Epitope Prediction: A Novel Amino Acid Composition-Based Feature Descriptor. Omics A J. Integr. Biol. 2015, 19, 648–658. [Google Scholar] [CrossRef]
  64. Chou, K.C. Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun. 2000, 278, 477–483. [Google Scholar] [CrossRef]
  65. Chen, Z.; Zhao, P.; Li, F.; Leier, A.; Marquez-Lago, T.T.; Wang, Y.; Webb, G.I.; Smith, A.I.; Daly, R.J.; Chou, K.-C.; et al. iFeature: A Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018, 34, 2499–2502. [Google Scholar] [CrossRef]
  66. Dubchak, I.; Muchnik, I.; Holbrook, S.R.; Kim, S.H. Prediction of protein-folding class using global description of amino-acid-sequence. Proc. Natl. Acad. Sci. USA 1995, 92, 8700–8704. [Google Scholar] [CrossRef]
  67. Gribskov, M.; McLachlan, A.D.; Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 1987, 84, 4355–4358. [Google Scholar] [CrossRef]
  68. Ding, Y.; Tang, J.; Guo, F. Predicting protein-protein interactions via multivariate mutual information of protein sequences. BMC Bioinform. 2016, 17, 398. [Google Scholar] [CrossRef]
  69. Tran, H.-N.; Xuan, Q.N.P.; Nguyen, T.-T. DeepCF-PPI: Improved prediction of protein-protein interactions by combining learned and handcrafted features based on attention mechanisms. Appl. Intell. 2023, 53, 17887–17902. [Google Scholar] [CrossRef]
  70. Varadi, M.; Anyango, S.; Deshpande, M.; Nair, S.; Natassia, C.; Yordanova, G.; Yuan, D.; Stroe, O.; Wood, G.; Laydon, A.; et al. AlphaFold Protein Structure Database: Massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022, 50, D439–D444. [Google Scholar] [CrossRef]
  71. Baranwal, M.; Magner, A.; Saldinger, J.; Turali-Emre, E.S.; Elvati, P.; Kozarekar, S.; VanEpps, J.S.; Kotov, N.A.; Violi, A.; Hero, A.O. Struct2Graph: A graph attention network for structure based predictions of protein-protein interactions. BMC Bioinform. 2022, 23, 370. [Google Scholar] [CrossRef]
  72. De Domenico, M.; Sole-Ribalta, A.; Cozzo, E.; Kivelae, M.; Moreno, Y.; Porter, M.A.; Gomez, S.; Arenas, A. Mathematical Formulation of Multilayer Networks. Phys. Rev. X 2013, 3, 041022. [Google Scholar] [CrossRef]
  73. Zhang, C.; Shine, M.; Pyle, A.M.; Zhang, Y. US-align: Universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat. Methods 2022, 19, 1109–1115. [Google Scholar] [CrossRef]
  74. Mirabello, C.; Wallner, B. InterPred: A pipeline to identify and model protein-protein interactions. Proteins Struct. Funct. Bioinform. 2017, 85, 1159–1170. [Google Scholar] [CrossRef]
  75. Harris, M.A.; Clark, J.I.; Ireland, A.; Lomax, J.; Ashburner, M.; Collins, R.; Eilbeck, K.; Lewis, S.; Mungall, C.; Richter, J.; et al. The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 2006, 34, D322–D326. [Google Scholar] [CrossRef]
  76. Wu, X.; Zhu, L.; Guo, J.; Zhang, D.-Y.; Lin, K. Prediction of yeast protein-protein interaction network: Insights from the Gene Ontology and annotations. Nucleic Acids Res. 2006, 34, 2137–2150. [Google Scholar] [CrossRef]
  77. Bandyopadhyay, S.; Mallick, K. A New Feature Vector Based on Gene Ontology Terms for Protein-Protein Interaction Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 762–770. [Google Scholar] [CrossRef]
  78. Zhang, J.; Jia, K.; Jia, J.; Qian, Y. An improved approach to infer protein-protein interaction based on a hierarchical vector space model. BMC Bioinform. 2018, 19, 161. [Google Scholar] [CrossRef]
  79. Wu, H.W.; Su, Z.C.; Mao, F.L.; Olman, V.; Xu, Y. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Res. 2005, 33, 2822–2837. [Google Scholar] [CrossRef]
  80. Jha, K.; Saha, S.; Dutta, P. Incorporation of gene ontology in identification of protein interactions from biomedical corpus: A multi-modal approach. Ann. Oper. Res. 2022, 39, 1–19. [Google Scholar] [CrossRef]
  81. Ieremie, I.; Ewing, R.M.; Niranjan, M. TransformerGO: Predicting protein-protein interactions by modelling the attention between sets of gene ontology terms. Bioinformatics 2022, 38, 2269–2277. [Google Scholar] [CrossRef]
  82. Zhou, T.; Lu, L.; Zhang, Y.-C. Predicting missing links via local information. Eur. Phys. J. B 2009, 71, 623–630. [Google Scholar] [CrossRef]
  83. Samanthula, B.K.; Jiang, W. Secure Multiset Intersection Cardinality and its Application to Jaccard Coefficient. IEEE Trans. Dependable Secur. Comput. 2016, 13, 591–604. [Google Scholar] [CrossRef]
  84. Adamic, L.A.; Adar, E. Friends and neighbors on the Web. Soc. Netw. 2003, 25, 211–230. [Google Scholar] [CrossRef]
  85. Chen, C.; Zhang, Q.; Ma, Q.; Yu, B. LightGBM-PPI: Predicting protein-protein interactions through LightGBM with multi-information fusion. Chemom. Intell. Lab. Syst. 2019, 191, 54–64. [Google Scholar] [CrossRef]
  86. Yu, B.; Chen, C.; Wang, X.; Yu, Z.; Ma, A.; Liu, B. Prediction of protein-protein interactions based on elastic net and deep forest. Expert Syst. Appl. 2021, 176, 114876. [Google Scholar] [CrossRef]
  87. Wang, L.; You, Z.-H.; Xia, S.-X.; Chen, X.; Yan, X.; Zhou, Y.; Liu, F. An improved efficient rotation forest algorithm to predict the interactions among proteins. Soft Comput. 2018, 22, 3373–3381. [Google Scholar] [CrossRef]
  88. Goktepe, Y.E.; Kodaz, H. Prediction of Protein-Protein Interactions Using An Effective Sequence Based Combined Method. Neurocomputing 2018, 303, 68–74. [Google Scholar] [CrossRef]
  89. Hu, L.; Yang, S.; Luo, X.; Yuan, H.; Sedraoui, K.; Zhou, M. A Distributed Framework for Large-scale Protein-protein Interaction Data Analysis and Prediction Using MapReduce. IEEE-CAA J. Autom. Sin. 2022, 9, 160–172. [Google Scholar] [CrossRef]
  90. Wei, Z.-S.; Han, K.; Yang, J.-Y.; Shen, H.-B.; Yu, D.-J. Protein-protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 2016, 193, 201–212. [Google Scholar] [CrossRef]
  91. Zhang, Q.C.; Petrey, D.; Deng, L.; Qiang, L.; Shi, Y.; Thu, C.A.; Bisikirska, B.; Lefebvre, C.; Accili, D.; Hunter, T.; et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 2012, 490, 556–560. [Google Scholar] [CrossRef]
  92. Bryant, P.; Pozzati, G.; Elofsson, A. Improved prediction of protein-protein interactions using AlphaFold2. Nat. Commun. 2022, 13, 1265. [Google Scholar] [CrossRef]
  93. Comeau, S.R.; Gatchell, D.W.; Vajda, S.; Camacho, C.J. ClusPro: An automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 2004, 20, 45–50. [Google Scholar] [CrossRef]
  94. De Vries, S.J.; van Dijk, M.; Bonvin, A.M.J.J. The HADDOCK web server for data-driven biomolecular docking. Nat. Protoc. 2010, 5, 883–897. [Google Scholar] [CrossRef] [PubMed]
  95. Xue, L.C.; Rodrigues, J.; Kastritis, P.L.; Bonvin, A.; Vangone, A. PRODIGY: A web server for predicting the binding affinity of protein-protein complexes. Bioinformatics 2016, 32, 3676–3678. [Google Scholar] [CrossRef] [PubMed]
  96. Schneidman-Duhovny, D.; Inbar, Y.; Nussinov, R.; Wolfson, H.J. PatchDock and SymmDock: Servers for rigid and symmetric docking. Nucleic Acids Res. 2005, 33, W363–W367. [Google Scholar] [CrossRef] [PubMed]
  97. Li, S.; Huang, J.; Zhang, Z.; Liu, J.; Huang, T.; Chen, H. Similarity-based future common neighbors model for link prediction in complex networks. Sci. Rep. 2018, 8, 17014. [Google Scholar] [CrossRef]
  98. Chen, Y.; Wang, W.; Liu, J.; Feng, J.; Gong, X. Protein Interface Complementarity and Gene Duplication Improve Link Prediction of Protein-Protein Interaction Network. Front. Genet. 2020, 11, 291. [Google Scholar] [CrossRef]
  99. Lei, C.; Ruan, J. A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity. Bioinformatics 2013, 29, 355–364. [Google Scholar] [CrossRef]
  100. Yuen, H.Y.; Jansson, J. Normalized L3-based link prediction in protein-protein interaction networks. BMC Bioinform. 2023, 24, 59. [Google Scholar] [CrossRef]
  101. Chen, K.-H.; Wang, T.-F.; Hu, Y.-J. Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme. BMC Bioinform. 2019, 20, 308. [Google Scholar] [CrossRef]
  102. Du, X.; Sun, S.; Hu, C.; Yao, Y.; Yan, Y.; Zhang, Y. DeepPPI: Boosting Prediction of Protein-Protein Interactions with Deep Neural Networks. J. Chem. Inf. Model. 2017, 57, 1499–1510. [Google Scholar] [CrossRef]
  103. Huang, Y.; Wuchty, S.; Zhou, Y.; Zhang, Z. SGPPI: Structure-aware prediction of protein-protein interactions in rigorous conditions with graph convolutional network. Brief. Bioinform. 2023, 24, bbad020. [Google Scholar] [CrossRef]
  104. Sun, T.; Zhou, B.; Lai, L.; Pei, J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinform. 2017, 18, 277. [Google Scholar] [CrossRef] [PubMed]
  105. Sledzieski, S.; Singh, R.; Cowen, L.; Berger, B. D-SCRIPT translates genome to phenome with sequence-based, structure-aware, genome-scale predictions of protein-protein interactions. Cell Syst. 2021, 12, 969–982.e6. [Google Scholar] [CrossRef]
  106. Hashemifar, S.; Neyshabur, B.; Khan, A.A.; Xu, J. Predicting protein-protein interactions through sequence-based deep learning. Bioinformatics 2018, 34, 802–810. [Google Scholar] [CrossRef] [PubMed]
  107. Hu, L.; Chan, K.C.C. Extracting Coevolutionary Features from Protein Sequences for Predicting Protein-Protein Interactions. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 155–166. [Google Scholar] [CrossRef] [PubMed]
  108. Sharma, A.; Singh, B. AE-LGBM: Sequence-based novel approach to detect interacting protein pairs via ensemble of autoencoder and LightGBM. Comput. Biol. Med. 2020, 125, 103964. [Google Scholar] [CrossRef] [PubMed]
  109. Yu, B.; Chen, C.; Zhou, H.; Liu, B.; Ma, Q. GTB-PPI: Predict Protein-protein Interactions Based on L1-regularized Logistic Regression and Gradient Tree Boosting. Genom. Proteom. Bioinform. 2020, 18, 582–592. [Google Scholar] [CrossRef]
  110. Przytycka, T.M.; Singh, M.; Slonim, D.K. Toward the dynamic interactome: It’s about time. Brief. Bioinform. 2010, 11, 15–29. [Google Scholar] [CrossRef]
  111. Jenghara, M.M.; Ebrahimpour-Komleh, H.; Parvin, H. Dynamic protein-protein interaction networks construction using firefly algorithm. Pattern Anal. Appl. 2018, 21, 1067–1081. [Google Scholar] [CrossRef]
  112. Zhang, Y.; Lin, H.; Yang, Z.; Wang, J. Construction of dynamic probabilistic protein interaction networks for protein complex identification. BMC Bioinform. 2016, 17, 186. [Google Scholar] [CrossRef] [PubMed]
  113. Ou-Yang, L.; Dai, D.-Q.; Li, X.-L.; Wu, M.; Zhang, X.-F.; Yang, P. Detecting temporal protein complexes from dynamic protein-protein interaction networks. BMC Bioinform. 2014, 15, 335. [Google Scholar] [CrossRef] [PubMed]
  114. Tan, C.S.H.; Go, K.D.; Bisteau, X.; Dai, L.; Yong, C.H.; Prabhu, N.; Ozturk, M.B.; Lim, Y.T.; Sreekumar, L.; Lengqvist, J.; et al. Thermal proximity coaggregation for system-wide profiling of protein complex dynamics in cells. Science 2018, 359, 1170–1176. [Google Scholar] [CrossRef]
Figure 1. An illustration of computational methods for PPI prediction using various biological information.
Table 1. Summary of databases for PPI prediction.
| Category | Database | Description | URL |
|---|---|---|---|
| PPI networks | DIP [39] | A database documenting experimentally determined PPIs, with the majority of data sourced from yeast, Helicobacter pylori, and humans. | https://dip.doe-mbi.ucla.edu/dip/Main.cgi (accessed on 24 June 2023) |
| | BioGRID [40] | A repository containing data on post-translational modifications (PTMs), chemical interactions, and protein and gene interactions. | http://www.thebiogrid.org/ (accessed on 24 June 2023) |
| | IntAct [41] | A user-driven open-source database and analytical tools for molecular interaction data, gathered from both direct user submissions and literature curation. | http://www.ebi.ac.uk/intact (accessed on 24 June 2023) |
| | STRING [42] | Covering 67,592,464 proteins from 14,094 organisms and 20,052,394,041 total interactions. | https://string-db.org/ (accessed on 24 June 2023) |
| | HPRD [43] | Composed of 41,327 PPIs, 93,710 PTMs, 22,490 subcellular localizations, and 112,158 protein expressions. | http://www.hprd.org/ (accessed on 24 June 2023) |
| | MINT [44] | Housing experimentally confirmed PPIs and various other forms of functional interactions. | https://mint.bio.uniroma2.it (accessed on 24 June 2023) |
| | HIPPIE [45] | Providing confidence-scored and functionally annotated human PPIs. | http://cbdm.uni-mainz.de/hippie/ (accessed on 24 June 2023) |
| | BIND [46] | Containing over 200,000 molecular interactions and over 3750 biological complexes involving over 1000 species. | http://download.baderlab.org/BINDTranslation/ (accessed on 24 June 2023) |
| Protein sequences | UniProt [48] | Consisting of UniProtKB, Proteomes, UniRef, and UniParc. UniProtKB covers 569,793 reviewed (Swiss-Prot) and 248,272,897 unreviewed (TrEMBL) protein sequences. | http://www.uniprot.org/ (accessed on 25 June 2023) |
| | SWISS-PROT [49] | Containing protein sequences along with comprehensive annotations; it has been integrated into UniProt. | http://www.expasy.org/sprot/ (accessed on 25 June 2023) |
| | PIR [50] | A comprehensive collection of protein data encompassing protein sequences and annotations. | http://pir.georgetown.edu/ (accessed on 25 June 2023) |
| Protein structures | SCOP [51] | Providing a detailed account of the evolutionary and structural connections within the entire set of proteins with established structures, covering 72,544 non-redundant domains. | http://scop.mrc-lmb.cam.ac.uk/scop (accessed on 25 June 2023) |
| | RCSB PDB [52] | Including 207,791 structures, 63,708 structures of human sequences, and 16,385 nucleic acid-containing structures. | https://www.rcsb.org/ (accessed on 25 June 2023) |
| Other databases | GO [53] | The primary repository of knowledge regarding gene functionalities, containing 42,950 GO terms, 7,453,079 annotations, and 1,504,969 gene products. | http://geneontology.org/ (accessed on 25 June 2023) |
| | CGD [56] | A repository housing protein and gene data for Candida albicans and its related species. | http://www.candidagenome.org/ (accessed on 25 June 2023) |
| | Negatome [47] | A resource composed of protein and domain pairs that are improbable to interact directly. | http://mips.helmholtz-muenchen.de/proj/ppi/negatome/ (accessed on 25 June 2023) |
| | KEGG [55] | A repository that integrates 18 databases arranged into categories such as systems, chemicals, genomic, and health. | https://www.kegg.jp/kegg/ (accessed on 25 June 2023) |
| | COG [54] | Covering proteins and genes across a broad spectrum of biological fields and providing tools for their functional annotation and evolutionary analysis. | https://www.ncbi.nlm.nih.gov/research/cog/ (accessed on 25 June 2023) |
Table 2. Details of various sequence encoding schemes.
| Type | Encoding Method | Description | Vector Dimension | Reference |
|---|---|---|---|---|
| Handcrafted | Amino acid composition | The proportion of each type of amino acid in a protein. | 20 | [60] |
| | Pseudo amino acid composition | Extracting the physicochemical and composition information. | 20 + λ | [61] |
| | Amphiphilic pseudo amino acid composition | The first 20 numbers represent the classical amino acid composition, followed by 2λ discrete numbers representing amphiphilic sequence correlations along the protein chain. | 20 + 2λ | [62] |
| | Dipeptide composition | The ratio of all possible pairs of amino acids in a protein. | 400 | [63] |
| | Conjoint triad | The frequency at which triads, comprising three adjacent amino acids along with their respective three-digit mapped numbers, occur in a protein sequence. | 343 | [57] |
| | Quasi-sequence-order descriptors | Representing the distribution of amino acids along the protein sequence based on the sequence-order effect and physicochemical properties. | 20 + φ | [64] |
| | Autocorrelation descriptors | Extracting the physicochemical properties of proteins to complete the encoding, including the Moreau–Broto, Moran, and Geary descriptors. | 3 × n × lag | [65] |
| | Composition | The proportion of every category of amino acids after division by attributes. | 39 | [66] |
| | Transition | The ratio of dipeptides consisting of different classes of amino acids. | 39 | [66] |
| | Distribution | Describing the distribution of amino acids in each attribute group throughout the sequence. | 195 | [66] |
| | Position-specific scoring matrix | Reflecting the probability of each amino acid occurring at a particular position. | — | [67] |
| | Multivariate mutual information | Combined information regarding amino acids obtained by retrieving group-specific characteristics and information entropy. | 119 | [68] |
| Learned | Word2vec | Converting protein sequences into numerical feature representations by exploiting the semantic relationships between amino acids, after training on a corpus generated from the protein sequences. | — | [69] |
| | TextCNN | Consisting of stacked CNNs and max-pooling layers, capturing local features of protein sequences. | 128 | [9] |
Table 3. Overview of computational methods for PPI prediction.
| Category | Name | Details | URL |
|---|---|---|---|
| Sequence-based | LightGBM-PPI [85] | Extracts composition and physicochemical information from protein sequences, uses an elastic net for feature selection to obtain the best subset, and employs LightGBM to perform the classification for PPI prediction. | https://github.com/QUST-AIBBDRC/LightGBM-PPI/ (accessed on 24 July 2023) |
| | GcForest-PPI [86] | Constructs deep forest models through a cascade architecture, where every level of the cascade (except the last) consists of 4 XGBoosts, 4 RFs, and 4 Extra-Trees. | https://github.com/QUST-AIBBDRC/GcForest-PPI (accessed on 24 July 2023) |
| | FWRF [87] | An improved rotation forest algorithm that calculates feature weights by means of χ² statistics and removes low-weight features based on the selection rate. | http://202.119.201.126:8888/FWRF/ (accessed on 24 July 2023) |
| | Profppikernel [58] | Uses evolutionary profiles as features and profile-kernel SVMs as a classifier to compute hyperplanes that optimally separate the two classes of data points. | https://rostlab.org/owiki/index.php/Profppikernel (accessed on 24 July 2023) |
| | Goktepe et al. [88] | Utilizes the Kaiser criterion in principal component analysis (PCA) as a component selection criterion for lowering the dimensionality of feature vectors, and an SVM to complete the classification task. | N/A |
| | Hu et al. [89] | Proposes CoFex+, a distributed framework integrated with MapReduce, to achieve large-scale PPI prediction. | N/A |
| | SSWRF [90] | An integrated approach that combines sample-weighted random forest with SVM. | http://csbio.njust.edu.cn/bioinf/SSWRF (accessed on 25 July 2023) |
| Structure-based | PrePPI [91] | Combines structural information with additional functional features in a naïve Bayesian network to predict PPIs. | http://bhapp.c2b2.columbia.edu/PrePPI/ (accessed on 25 July 2023) |
| | InterPred [74] | A fully automated computational pipeline that integrates structural modeling, large-scale structural alignment, and molecular docking techniques to predict and model PPIs. | http://wallnerlab.org/InterPred/ (accessed on 25 July 2023) |
| | Dong et al. [32] | Uses structural modeling and a docking algorithm to predict Arabidopsis PPIs. | http://bar.utoronto.ca/interactions2/ (accessed on 25 July 2023) |
| | Bryant et al. [92] | Employs the AlphaFold2 protocol with optimized multiple sequence alignments for modeling, and designs the pDockQ score to distinguish acceptable models from incorrect ones. | https://gitlab.com/ElofssonLab/FoldDock (accessed on 25 July 2023) |
| | ClusPro [93] | An automated rigid-body docking and discrimination algorithm that quickly filters docked conformations and ranks them according to their clustering properties. | http://structure.bu.edu (accessed on 4 March 2024) |
| | HADDOCK [94] | A docking technique guided by (experimental) knowledge regarding the molecular interface and relative orientations. | http://haddock.chem.uu.nl/Haddock (accessed on 4 March 2024) |
| | PRODIGY [95] | Predicts binding affinities based on a 3D structure. | http://milou.science.uu.nl/services/PRODIGY (accessed on 4 March 2024) |
| | PatchDock [96] | A molecular docking algorithm based on the principle of shape complementarity. | http://bioinfo3d.cs.tau.ac.il (accessed on 4 March 2024) |
| Network-based | L3 [30] | The L3 principle holds that protein pairs are likely to interact if connected through numerous paths of length l = 3 in the network. | N/A |
| | SFCN [97] | Accurately identifies all future common neighbors and measures their contributions using only existing similarity indicators. | N/A |
| | Sim [98] | A link prediction method that relies on gene duplication and the complementarity of protein–protein interfaces. | https://github.com/wingroy001/L3Sim (accessed on 26 July 2023) |
| | Lei et al. [99] | Introduces a new random walk algorithm with resistance, predicting PPIs by measuring the higher-order topological similarity of two proteins. | www.cs.utsa.edu/jruan/RWS/ (accessed on 26 July 2023) |
| | L3N [100] | Addresses certain missing elements in the L3 predictor from a network modeling perspective. | https://github.com/andy897221/BMC_PPI_L3N (accessed on 26 July 2023) |
| GO-based | PPI-MetaGO [101] | An ensemble meta-learning approach that leverages GO semantic similarity, other feature representations, and multiple ML algorithms to predict PPIs. | https://github.com/mlbioinfolab/ppi-metago (accessed on 26 July 2023) |
| | Bandyopadhyay et al. [77] | Represents protein pairs by weighted eigenvectors based on GO terms and uses an SVM to predict new PPIs. | N/A |
| Deep learning-based | DeepPPI [102] | Uses two separate neural networks, each receiving raw input from one of the two proteins, to predict PPIs. | http://ailab.ahu.edu.cn:8087/DeepPPI/index.html (accessed on 26 July 2023) |
| | Struct2Graph [71] | A GCN-based mutual attention classifier specialized in predicting PPIs from 3D structural data. | https://github.com/baranwa2/Struct2Graph (accessed on 26 July 2023) |
| | TAGPPI [9] | Extracts multidimensional features from protein sequences and contact maps created by AlphaFold. | https://github.com/xzenglab/TAGPPI (accessed on 26 July 2023) |
| | SGPPI [103] | Employs graph convolutional neural networks as part of its structure-based DL framework. | https://github.com/emerson106/SGPPI (accessed on 26 July 2023) |
| | Sun et al. [104] | Applies stacked autoencoders to predict human PPIs. | http://repharma.pku.edu.cn/ppi (accessed on 27 July 2023) |
| | D-SCRIPT [105] | A DL model that utilizes the binding compatibility of protein structures to predict PPIs. | https://cb.csail.mit.edu/cb/dscript/ (accessed on 27 July 2023) |
| | DPPI [106] | Employs a convolutional neural network architecture that combines stochastic projection and data augmentation to accomplish the prediction task. | https://github.com/hashemifar/DPPI/ (accessed on 27 July 2023) |
N/A, not available.
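The network-based L3 principle can be made concrete with a short sketch. The degree-normalized variant below is our own illustrative implementation (not the code of the cited predictors), assuming a toy graph stored as a dictionary of neighbor sets; each length-3 path u–a–b–v contributes a hub-penalized term 1/√(deg(a)·deg(b)) to the candidate pair's score:

```python
import math

def l3_score(adj: dict[str, set[str]], u: str, v: str) -> float:
    """Degree-normalized count of length-3 paths u-a-b-v (L3 heuristic)."""
    score = 0.0
    for a in adj[u]:
        if a == v:
            continue  # skip the direct edge; only length-3 paths count
        for b in adj[a]:
            if b == u:
                continue  # do not walk straight back to u
            if v in adj[b]:
                # penalize paths through high-degree intermediates
                score += 1.0 / math.sqrt(len(adj[a]) * len(adj[b]))
    return score

# Toy PPI network (hypothetical): u and v share no edge but are linked
# by two length-3 paths, u-a1-b-v and u-a2-b-v.
adj = {
    "u": {"a1", "a2"},
    "a1": {"u", "b"},
    "a2": {"u", "b"},
    "b": {"a1", "a2", "v"},
    "v": {"b"},
}
print(l3_score(adj, "u", "v"))  # 2 / sqrt(2 * 3) ≈ 0.816
```

Candidate pairs are then ranked by this score, and the top-ranked non-edges are proposed as missing interactions; normalized refinements of this idea are what L3N [100] explores.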
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xian, L.; Wang, Y. Advances in Computational Methods for Protein–Protein Interaction Prediction. Electronics 2024, 13, 1059. https://doi.org/10.3390/electronics13061059
