1. Introduction
Most of the human genome is transcribed into non-coding RNAs (ncRNAs), which are not translated into proteins; only about 2% of the genome encodes proteins [1]. Historically, ncRNAs were considered non-functional and biologically insignificant. However, recent discoveries have identified a variety of functional ncRNAs, including long non-coding RNAs (lncRNAs), small nuclear RNAs (snRNAs), transfer RNAs (tRNAs), and small RNAs (miRNAs and siRNAs), which play significant roles in gene regulation, chromatin remodeling, and essential cellular processes [2,3,4,5]. These ncRNAs interact with other RNAs, proteins, and DNA, influencing numerous molecular functions, and their role in diseases such as cancer highlights the need for accurate prediction of ncRNA–protein interactions (ncRPIs) [3,6,7,8,9].
Experimental techniques, such as PAR-CLIP [10], RNAcompete [11], and HITS-CLIP [12], have traditionally been used to study ncRPIs. These methods measure interactions directly in the laboratory and are effective, but they are costly and time-consuming, which has motivated computational approaches that predict ncRPIs efficiently. Computational techniques include network-based analysis, traditional machine learning, deep learning, integrated machine learning-deep learning methods, and graph neural network (GNN)-based approaches. Each offers distinct advantages, with GNNs particularly suited to capturing the complex network structures inherent in biological interactions.
Network-based methods utilize the topological properties of biological networks to infer potential interactions. For example, LPIHN constructs a heterogeneous network from lncRNA and protein data, using random walk algorithms to predict new interactions [
13]. Another approach, LPI-IBNRA, uses a bipartite network and manages second-order correlations to predict ncRPIs [
14]. While network-based methods focus primarily on topological features, traditional machine learning models, such as support vector machines (SVM) and random forests (RF), use sequence-based features to train classifiers that predict interactions [
15,
16].
More recently, deep learning has been applied to ncRPI prediction, leveraging neural networks to automatically extract high-level features from sequence and structural data. IPMiner, for instance, uses stacked ensembling to predict ncRPIs from sequences [
17], while RPIFSE combines CNN and extreme learning machine (ELM) classifiers [
18].
Among GNN-based methods, the NPI-GNN model [
19] represents a pioneering approach for ncRPI prediction. NPI-GNN transforms the NPI prediction task into a graph link prediction problem, where ncRNAs and proteins are represented as nodes and potential interactions as edges. The authors employed a graph convolutional network with three graphSAGE layers, enabling message passing between nodes to non-linearly transform feature vectors and learn low-dimensional embeddings. These embeddings are based on extracted protein and RNA sequence features, allowing the model to predict interactions and even reconstruct ncRNA–protein networks under different conditions. The use of RNA-seq data to identify abundant RNAs and highly probable negative interactions further enhances the network’s representation, while CLIP-seq data are employed to construct an RNA–protein interaction network specific to cell lines.
Despite these advances, most existing methods either focus solely on topological information or rely exclusively on sequence-based data. This highlights the need for a comprehensive approach that integrates both types of information. Graph neural networks (GNNs) have emerged as a powerful solution to this problem, combining network topology and node-specific features in a unified framework. GNNs, which are designed to operate on graph-structured data, excel in tasks such as link prediction, including the prediction of ncRNA–protein interactions [20].
Figure 1 provides a comprehensive categorization of the various features employed in NPI prediction.
In this work, we propose NPI-WGNN, a novel weighted graph neural network model designed to enhance ncRPI prediction by incorporating both topological and node-specific features. We introduce a bipartite version of the high-order common neighbor (HOCN) similarity measure to calculate edge weights, which are then used in a weighted node2vec embedding. Additionally, various centrality measures, such as degree, betweenness, and Katz centralities, are integrated into the embeddings to capture the hierarchical organization and information flow within the network.
Our GNN architecture includes three types of graph convolutional layers (GCNConv, GATConv, and GraphSAGE) and top-k pooling layers, which capture both local and global graph features. An ablation study confirms that each GNN layer contributes uniquely to model performance: GraphSAGE enhances scalability by sampling from large neighborhoods, GCNConv captures global structure effectively for strongly connected graphs, and GATConv dynamically weights neighbors, improving accuracy on heterogeneous graphs. These features are further processed by a recurrent neural network (RNN) to integrate spatial and temporal information, allowing NPI-WGNN to adapt to diverse graph complexities.
The experimental results, obtained from three benchmark datasets, demonstrate that NPI-WGNN consistently outperforms existing methods, achieving high accuracy, sensitivity, and MCC scores. Our approach, which integrates topological insights and node-specific data into a weighted GNN framework, offers a robust and accurate solution for predicting ncRNA–protein interactions, providing new opportunities for understanding regulatory functions and therapeutic interventions.
2. Algorithm and Model Architecture
To improve the ability of graph neural networks (GNNs) to predict ncRNA–protein interactions (NPIs), we introduce a method called NPI-WGNN. The flowchart of our method is shown in Figure 2. NPI-WGNN consists of four main phases: (1) construction of a weighted ncRNA–protein bipartite graph; (2) extraction of enclosing subgraphs for every observed positive link and unobserved negative link; (3) construction of a node information matrix including structural labels, weighted node2vec embeddings, and centrality measures; and (4) a graph neural network (GNN), comprising graph convolutional layers, pooling layers, fully connected layers, and recurrent neural network (RNN) layers, that learns from the extracted features and performs the classification task.
2.1. Construction of Weighted ncRNA–Protein Bipartite Network
After reviewing the previous studies, we realized that the topological and structural features extracted from the ncRNA–protein network significantly influence the prediction of relationships between ncRNAs and proteins. Therefore, in this section, we use a criterion that assigns weight to each edge depending on how similar two nodes are. According to this criterion, the nodes exhibiting greater similarity should carry more weight. To determine the weight of both positive and negative links, we utilize similarity-based metrics of link prediction [
21]. This allows us to accurately calculate the weight of each link and analyze the network structure comprehensively.
In a previous study, Wang et al. [22] proposed a weighting technique known as high-order common neighbor (HOCN) for unipartite graphs. This method was effective in identifying protein complexes and establishing reliable networks of protein–protein interactions (PPIs). Building on this foundation, we modified the HOCN weighting method to fit our bipartite ncRNA–protein network, with the aim of increasing the accuracy and reliability of identifying ncRNA–protein interactions. The high-order common neighbor (HOCN) weight between nodes v and u is defined by Equation (1), in which the similarity term is Jaccard's coefficient similarity [23], defined by Equation (2). These definitions use the set of all neighbors of node v, the common neighbors of nodes v and u, and the set of all neighbors of neighbors of node u. CNS refers to the common neighbors support, that is, the likelihood that the common neighbors of u and v support the edge (v, u).
The algorithm for constructing a weighted ncRNA–protein network is presented in Algorithm 1, providing a more comprehensive depiction of the construction process.
Algorithm 1 Building a weighted ncRNA–protein graph
Input: The ncRNA–protein graph with edge set E.
Output: The weighted ncRNA–protein graph.
1: for each edge (v, u) in E
2:     calculate the Jaccard similarity of edge (v, u) according to Equation (2);
3:     if the similarity condition holds then
4:         calculate the weight of edge (v, u) according to Equation (1);
5:     else
6:         remove edge (v, u) from E;
7: return the weighted ncRNA–protein network.
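The control flow of Algorithm 1 can be illustrated with a short Python sketch. It is not the authors' implementation: it uses a plain Jaccard score between the neighbors of v and the second-order neighbors of u as a stand-in for the HOCN weight of Equation (1), and it assumes that an edge is retained when this score is positive. Both choices, and all function names, are assumptions made for illustration only.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Adjacency sets of an undirected bipartite graph from (ncRNA, protein) pairs."""
    adj = defaultdict(set)
    for rna, protein in edges:
        adj[rna].add(protein)
        adj[protein].add(rna)
    return adj


def second_order_neighbors(adj, node):
    """Neighbors of the neighbors of `node`, excluding `node` itself."""
    result = set()
    for neighbor in adj[node]:
        result |= adj[neighbor]
    result.discard(node)
    return result


def jaccard_score(adj, v, u):
    """ASSUMPTION: Jaccard overlap of N(v) with the second-order neighbors of u."""
    n_v = adj[v]
    nn_u = second_order_neighbors(adj, u)
    union = n_v | nn_u
    return len(n_v & nn_u) / len(union) if union else 0.0


def weight_edges(edges):
    """Score every edge; keep and weight supported edges, drop the rest."""
    adj = build_adjacency(edges)
    weighted = {}
    for v, u in edges:
        score = jaccard_score(adj, v, u)
        if score > 0:  # ASSUMPTION: retention rule, not taken from the paper
            weighted[(v, u)] = score  # placeholder for the HOCN weight
    return weighted


toy_edges = [("r1", "p1"), ("r1", "p2"), ("r2", "p2"), ("r3", "p3")]
toy_weights = weight_edges(toy_edges)
```

In this toy graph, the two edges of r1 are kept because each is supported by another protein that shares r1, while the edges of r2 and r3 have no such support and are dropped.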
2.2. Extracting Enclosing Subgraphs
The SEAL framework [24] demonstrates that the information obtained from global link prediction approaches, such as the Katz index [25] and PageRank [26], can be approximated well from local enclosing subgraphs. Therefore, for each positive (existing) and negative (non-existing) link, we extract a 1-hop local enclosing subgraph. As shown in Figure 3, the enclosing subgraph for a pair of nodes (x, y) is the subgraph formed by the neighbors of both x and y up to a specified number of hops, denoted as h. This subgraph is obtained by taking the union of the neighbors within the specified hop distance.
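A minimal sketch of enclosing-subgraph extraction is given below, assuming the graph is stored as a dictionary of neighbor sets. The function names are hypothetical, and the `drop_target_link` option (removing the link being predicted from its own subgraph, as is common in SEAL-style pipelines) is an assumption rather than a detail stated in the text.

```python
from collections import deque


def nodes_within_hops(adj, source, hops):
    """Breadth-first search returning all nodes at most `hops` steps from `source`."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)


def enclosing_subgraph(adj, x, y, hops=1, drop_target_link=True):
    """Subgraph induced by the union of the h-hop neighborhoods of x and y."""
    nodes = nodes_within_hops(adj, x, hops) | nodes_within_hops(adj, y, hops)
    sub = {node: set(adj[node]) & nodes for node in nodes}
    if drop_target_link:  # ASSUMPTION: hide the link that is being predicted
        sub[x].discard(y)
        sub[y].discard(x)
    return sub


toy_adj = {
    "r1": {"p1", "p2"}, "r2": {"p2", "p3"}, "r3": {"p3"},
    "p1": {"r1"}, "p2": {"r1", "r2"}, "p3": {"r2", "r3"},
}
toy_sub = enclosing_subgraph(toy_adj, "r1", "p2", hops=1)
```

For the target pair (r1, p2), the 1-hop subgraph contains r1, p1, p2, and r2; the nodes r3 and p3 lie outside both neighborhoods and are excluded.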
In a typical graph neural network (GNN) architecture, two matrices are commonly utilized as input. The first matrix is the enclosing subgraph’s adjacency matrix (A), which captures the relationships between nodes in the graph. The adjacency matrix represents the connectivity pattern of the graph, indicating which nodes are connected to each other. The second matrix is the node information matrix (X), which provides details about the properties or attributes of each node in the graph. This matrix describes specific characteristics or features associated with individual nodes, including structural node labels, node embeddings, and node attributes. By combining the local enclosing subgraphs extracted from the adjacency matrix (A) and the node information matrix (X), the GNN can effectively process and analyze the graph data, leveraging both the structural relationships between nodes and the node-specific information. This integration enables the GNN to learn and make predictions based on the collective knowledge encoded within the graph.
2.3. Building Node Information Matrix
The construction of the node information matrix involves several components, including structural labels, node embeddings, and node attributes.
2.3.1. Structural Label
The node information matrix X begins with the graph’s structural label. To determine the structural label for a node v, which could represent either a protein or a ncRNA, we first label the two target nodes x and y as 0. For any node v, its structural label is derived by finding the minimum of the shortest path distances from v to x and from v to y. This approach ensures that the structural label reflects the proximity of v to the target nodes within the graph.
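The labeling rule can be sketched as follows, assuming the enclosing subgraph is given as a dictionary of neighbor sets. The helper names are hypothetical, and returning `None` for a node that cannot reach either target is an illustrative choice that the text does not specify.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distance (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def structural_labels(adj, x, y):
    """Label each node with min(d(v, x), d(v, y)); the targets x and y get 0."""
    dist_x = bfs_distances(adj, x)
    dist_y = bfs_distances(adj, y)
    labels = {}
    for node in adj:
        reachable = [d[node] for d in (dist_x, dist_y) if node in d]
        labels[node] = min(reachable) if reachable else None
    return labels


toy_adj = {
    "r1": {"p1", "p2"}, "p1": {"r1"}, "p2": {"r1", "r2"},
    "r2": {"p2", "p3"}, "p3": {"r2"},
}
toy_labels = structural_labels(toy_adj, "r1", "p3")
```

With targets r1 and p3 on this path-like graph, both targets receive label 0 and the three intermediate nodes receive label 1.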
2.3.2. Weighted node2vec
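The second component of the node information matrix (X) is the node embedding. In our implementation, we used the weighted node2vec algorithm [27] as the node embedding method. This technique creates a vector representation for each node. The node2vec algorithm uses a second-order random walk and controls the walk process with two parameters, p and q. This process is outlined as follows:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & \text{if } (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$

$$\pi_{vx} = \alpha_{pq}(t, x)\, w_{vx}, \qquad \alpha_{pq}(t, x) = \begin{cases} 1/p, & \text{if } d_{tx} = 0 \\ 1, & \text{if } d_{tx} = 1 \\ 1/q, & \text{if } d_{tx} = 2 \end{cases} \tag{6}$$

where $c_i$ is the i-th node of the walk, $\pi_{vx}$ is the un-normalized transition probability between nodes v and x, Z is the normalizing constant, and E is the set of edges. In Equation (6), t is the node that is traversed before v; $d_{tx}$ indicates the shortest-path distance between nodes t and x, and $w_{vx}$ is the weight of edge (v, x).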
The weighted node2vec algorithm enhances link prediction by incorporating edge weights, resulting in a more detailed and accurate representation of the strengths of connections between nodes. This method effectively captures the complexities of network relationships, thereby achieving higher prediction accuracy compared with the traditional node2vec algorithm, which assumes uniform edge weights.
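The biased, weighted transition step of node2vec can be sketched in a few lines. The code assumes the weighted graph is stored as nested dictionaries (`adj_w[v][x]` is the weight of edge (v, x)); the function name and the toy weights are hypothetical. In a bipartite graph the previous node t and a candidate node x always lie in the same node set, so only the distances 0 and 2 occur in practice.

```python
def transition_probabilities(adj_w, t, v, p=1.0, q=1.0):
    """Normalized probabilities of stepping from v to each neighbor x,
    given that the walk arrived at v from t (second-order random walk)."""
    unnormalized = {}
    for x, w_vx in adj_w[v].items():
        if x == t:             # d_tx = 0: step back to the previous node
            alpha = 1.0 / p
        elif x in adj_w[t]:    # d_tx = 1: x is also a neighbor of t
            alpha = 1.0
        else:                  # d_tx = 2: move away from t
            alpha = 1.0 / q
        unnormalized[x] = alpha * w_vx
    z = sum(unnormalized.values())  # normalizing constant Z
    return {x: value / z for x, value in unnormalized.items()}


toy_adj_w = {
    "r1": {"p1": 0.5, "p2": 1.0},
    "r2": {"p2": 0.5},
    "p1": {"r1": 0.5},
    "p2": {"r1": 1.0, "r2": 0.5},
}
# The walk moved r1 -> p2; where does it go next?
toy_probs = transition_probabilities(toy_adj_w, t="r1", v="p2", p=2.0, q=0.5)
```

With p = 2 and q = 0.5, returning to r1 has un-normalized value 0.5 and moving on to r2 has value 1.0, so the walk continues to r2 with probability 2/3 even though that edge has the smaller weight.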
2.3.3. Centrality Measures
The node attributes make up the third element of the node information matrix X and carry additional information about each node. In this work, we used centrality measures as the node attributes. In complex networks, each node has properties that determine its significance within an application-specific context, and centrality measures quantify these properties as numerical values that reflect a node's importance [28]. We employed bipartite versions of degree, closeness, betweenness, Katz, and propagation entropy (PE) centralities for our ncRNA–protein bipartite networks.
Degree Centrality
In unipartite networks, degree centrality values are typically normalized by dividing by the maximum possible degree, which is the total node count of the network minus one. In bipartite networks, however, a node's maximum degree is determined by the number of nodes in the opposite node set [29]. For a bipartite network in which node set U has n nodes and node set V has m nodes, the degree centrality of a node v is given by the following formula:

$$C_D(v) = \frac{\deg(v)}{m} \ \text{for } v \in U, \qquad C_D(v) = \frac{\deg(v)}{n} \ \text{for } v \in V$$

where n and m are the numbers of nodes in the two node sets of the graph, and $\deg(v)$ is the degree of node v.
Closeness Centrality
Closeness centrality quantifies how near a node is to the other nodes in a network. It is usually normalized by the minimum possible total distance. In a bipartite network, a node must be at least one step away from every node in the other node set and two steps away from every other node in its own set [29]. As a result, the closeness centrality of a node v in a bipartite network with two node sets U and V, comprising n and m nodes, respectively, can be expressed as follows:

$$C_C(v) = \frac{m + 2(n - 1)}{d} \ \text{for } v \in U, \qquad C_C(v) = \frac{n + 2(m - 1)}{d} \ \text{for } v \in V$$

where d is the total distance from node v to all other nodes.
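The bipartite normalizations of degree and closeness centrality described above can be checked on a toy graph with the following sketch. It assumes a connected, unweighted graph stored as a dictionary of neighbor sets, with `top_nodes` naming the node set U; the function names are hypothetical, and library implementations may add corrections for disconnected graphs that are omitted here.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Hop distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def bipartite_degree_centrality(adj, top_nodes):
    """deg(v) divided by the size of the opposite node set."""
    top = set(top_nodes)
    n, m = len(top), len(adj) - len(top)
    return {v: len(adj[v]) / (m if v in top else n) for v in adj}


def bipartite_closeness_centrality(adj, top_nodes):
    """Minimum possible total distance divided by the actual total distance d."""
    top = set(top_nodes)
    n, m = len(top), len(adj) - len(top)
    result = {}
    for v in adj:
        d = sum(shortest_path_lengths(adj, v).values())
        best = m + 2 * (n - 1) if v in top else n + 2 * (m - 1)
        result[v] = best / d if d > 0 else 0.0
    return result


toy_adj = {
    "r1": {"p1", "p2"}, "r2": {"p2", "p3"},
    "p1": {"r1"}, "p2": {"r1", "r2"}, "p3": {"r2"},
}
toy_degree = bipartite_degree_centrality(toy_adj, ["r1", "r2"])
toy_closeness = bipartite_closeness_centrality(toy_adj, ["r1", "r2"])
```

Here protein p2 binds both ncRNAs, so it reaches the maximum value of 1.0 for both measures, while r1 has degree centrality 2/3 and closeness centrality 5/7.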
Betweenness Centrality
Betweenness centrality measures a node's importance by the number of shortest paths that pass through it. It is computed by counting, for every pair of nodes, the shortest paths between them that pass through the given node and then normalizing this count. High betweenness centrality is thought to be crucial for preserving the network's ability to transfer resources or information. For a given node v, the betweenness centrality can be written as follows:

$$C_B(v) = \frac{1}{N} \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths between nodes s and t, $\sigma_{st}(v)$ is the number of those paths on which node v serves as an intermediary node, and N is the normalization factor [28]. Considering node v as an intermediate point gives insight into the influence and connectivity it contributes to the overall network structure. The highest betweenness value attainable in a bipartite graph is determined by the relative sizes of the two node sets [29]. Let n be the number of nodes in set U and m the number of nodes in set V. Nodes in U are normalized by dividing by

$$\frac{1}{2}\left[m^2 (s + 1)^2 + m (s + 1)(2t - s - 1) - t (2s - t + 3)\right]$$

where s and t are determined as follows:

$$s = (n - 1) \div m, \qquad t = (n - 1) \bmod m$$

and nodes in V are normalized by dividing by

$$\frac{1}{2}\left[n^2 (p + 1)^2 + n (p + 1)(2r - p - 1) - r (2p - r + 3)\right]$$

where

$$p = (m - 1) \div n, \qquad r = (m - 1) \bmod n$$
Katz Centrality
Katz centrality [25] is a measure of node influence in a network that takes into account both direct and indirect connections. It calculates a node's centrality score by considering all paths leading to it, giving more weight to shorter paths. This is done using an adjacency matrix A and iteratively updating the centrality score $x_i$ for each node. The formula $x_i = \alpha \sum_j A_{ij} x_j + \beta$ is used, where $\alpha$ is a damping factor (less than the inverse of the largest eigenvalue of A) and $\beta$ is a constant for baseline centrality. This process continues until the scores stabilize, providing a detailed measure of influence across the network. Note that in our weighted graph, $A_{ij}$ represents the weight of the edge between nodes i and j.
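The iterative Katz update can be sketched as below for a weighted graph stored as nested dictionaries. The function name, default parameters, and convergence test are illustrative assumptions, and the scores are returned without the final rescaling that some libraries apply.

```python
def katz_centrality(weights, alpha=0.1, beta=1.0, tol=1e-10, max_iter=1000):
    """Fixed-point iteration x_i = alpha * sum_j A_ij * x_j + beta.
    `weights[i][j]` holds the edge weight A_ij of a symmetric weighted graph."""
    scores = {node: 0.0 for node in weights}
    for _ in range(max_iter):
        updated = {
            i: alpha * sum(w * scores[j] for j, w in weights[i].items()) + beta
            for i in weights
        }
        change = max(abs(updated[node] - scores[node]) for node in weights)
        scores = updated
        if change < tol:
            return scores
    raise RuntimeError("Katz iteration did not converge; alpha may be too large")


toy_weights = {
    "r1": {"p1": 0.5},
    "r2": {"p1": 1.0},
    "p1": {"r1": 0.5, "r2": 1.0},
}
toy_katz = katz_centrality(toy_weights, alpha=0.1, beta=1.0)
```

In this example p1 receives the highest score, and r2 ranks above r1 because its edge to p1 carries the larger weight.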
Propagation Entropy Centrality
Node propagation entropy [30] is a metric that evaluates the importance of nodes in a network by combining local and global factors. It integrates the local clustering coefficient, which indicates how clustered a node's neighbors are, with the node's influence through its first- and second-order neighbors. The resulting clustering coefficient and neighbors metric measures a node's ability to disseminate information based on these factors. To reflect the global network context, the node propagation entropy (PE) of a node is derived from the entropy of the normalized propagation capacity of its neighbors, where the propagation capacity is given by the clustering coefficient and neighbors metric. The local clustering coefficient $c_i$ of a node i in a bipartite graph is as follows:

$$c_i = \frac{\sum_{j \in N(N(i))} c_{ij}}{|N(N(i))|}$$

where $N(N(i))$ are the second-order neighbors of i, excluding i itself, and $c_{ij}$ is the pairwise clustering coefficient between nodes i and j.
The full node feature vector F is the concatenation of all the components listed above: the structural label, the node2vec embedding, and the centrality vector c. The vector c holds the five centrality measures for each ncRNA and protein node. We kept the default dimension option for building the node2vec embedding, which produced a 64-D vector. After the 1-D structural label is integrated, the full node feature vector F has 75 dimensions.
2.4. Model Structure
Our NPI-WGNN model comprises three graph convolutional-based modules, three global pooling modules, one additive module, two fully connected layers, and a two-layer recurrent neural network (RNN). The graph convolutional-based modules include three types of graph neural network layers: a graph convolutional layer (GCNConv), a graph attention layer (GATConv), and a graphSAGE layer (SAGEConv), each followed by a Rectified Linear Unit (ReLU) activation and a top-k pooling layer, arranged sequentially. The global pooling modules utilize global average pooling and global max pooling. The outputs from these three global pooling modules are combined using the additive module. Subsequently, two fully connected layers with 128 and 64 neurons and a two-layer recurrent neural network with two neurons, respectively, process the output from the additive module. The final layer's output is then processed by a log-softmax function to produce a 2-D vector, representing the likelihood of positive and negative samples.
Integrating GCNConv [
31], GATConv [
32], and SAGEConv [
33] creates a powerful and well-rounded framework for graph representation learning. GCNConv effectively captures local graph structures and node features, forming a solid foundation for learning. GATConv introduces attention mechanisms that differentiate the importance of various nodes, allowing the model to focus on the most crucial interactions. SAGEConv utilizes sampling techniques to aggregate information from node neighborhoods, enabling scalability to larger graphs while maintaining diverse information. The combination of these methods ensures comprehensive feature extraction, attention-based prioritization, and efficient neighborhood aggregation, resulting in improved accuracy and robustness in predictive tasks.
In the upper part of our model, there is a two-layer recurrent network [
34] to combine information. The combined information is used to make predictions. Combining RNN layers with GNNs frequently enhances performance on tasks involving graphs [
20]. By utilizing the strengths of both architectures, the model is able to capture spatial and temporal aspects of the data, resulting in more robust and accurate predictions. The experimental results indicate that combining RNN with GNN has significantly improved the performance of the network.
2.4.1. Graph Convolutional Layer
Graph convolutional layer (GCNConv) [
31] is a fundamental component within the domain of graph neural networks (GNNs), essential for understanding the complex relationships and structures inherent in graph-based data. By leveraging feature data alongside local network connections, GCNConv aims to develop meaningful node representations by integrating information from neighboring nodes. It employs the adjacency matrix to map node connections, ensuring that nodes with similar structures produce comparable embeddings. With adjustable parameters and non-linear activation functions, GCNConv refines these representations, enabling the model to identify subtle patterns and complex dependencies within the graph. As a key component of GNN architectures, GCNConv plays a crucial role in link prediction tasks, providing a robust framework for extracting insights from complex graph-structured data.
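Given X as the node information matrix and A as the adjacency matrix, the node representation at layer $l + 1$ is as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

where $H^{(l)}$ represents the node representation at layer l (with $H^{(0)} = X$), $\sigma$ is the activation function, $\tilde{A} = A + I$ is the adjacency matrix of the graph with added self-loops, which is normalized symmetrically by $\tilde{D}$, the degree matrix of $\tilde{A}$, and $W^{(l)}$ is the weight matrix for layer l.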
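A dependency-free sketch of one graph convolution step is shown below, using the standard symmetric normalization with self-loops and ReLU as the activation. It works on small dense lists for readability; the helper names are hypothetical, and a real model would use a library layer such as GCNConv with sparse tensors instead.

```python
import math


def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency, features, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so every node keeps part of its own representation.
    a_tilde = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_tilde]
    # Symmetric normalization by the degrees of both endpoints.
    a_hat = [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    propagated = matmul(a_hat, matmul(features, weight))
    return [[max(0.0, value) for value in row] for row in propagated]


toy_adjacency = [[0.0, 1.0], [1.0, 0.0]]   # two connected nodes
toy_features = [[1.0, 2.0], [3.0, 4.0]]    # H: one row per node
toy_weight = [[1.0], [1.0]]                # W: maps 2 features to 1
toy_output = gcn_layer(toy_adjacency, toy_features, toy_weight)
```

Because the two nodes are connected and each has a self-loop, both rows of the output equal the average of the two transformed feature values. Edge weights can be used in place of the 0/1 entries of the adjacency matrix.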
2.4.2. Graph Attention Layer
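The graph attention layer (GATConv) [32] dynamically allocates attention weights to neighboring nodes, allowing each node to emphasize informative neighbors while downplaying less relevant ones. By adaptively assigning attention weights based on the features of neighboring nodes, GATConv learns richer and more discriminative representations for each node. This flexibility enables the model to effectively leverage both local and global information, thereby enhancing performance. The node features are updated by the GATConv layer, after information aggregation and transformation, as follows:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_j\right]\right)\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(\text{LeakyReLU}\left(a^{T}\left[W h_i \,\|\, W h_k\right]\right)\right)}, \qquad h_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\right)$$

where $\mathcal{N}(i)$ represents the set of neighboring nodes of node i, $h_i$ and $h_j$ are the feature vectors of nodes i and j, respectively, $\alpha_{ij}$ is the attention coefficient of node j for node i, $\|$ denotes concatenation, $\sigma$ is the activation function, a is a learnable parameter vector, and W is a learnable weight matrix.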
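The attention computation can be sketched with plain Python lists. The code assumes the features have already been multiplied by W (stored in `wh`), uses a single attention head, and omits the output activation; all names and the toy values are hypothetical.

```python
import math


def leaky_relu(value, slope=0.2):
    """LeakyReLU with the negative slope commonly used for graph attention."""
    return value if value > 0 else slope * value


def attention_coefficients(wh, neighbors, a, i):
    """Softmax over the neighbors of i of LeakyReLU(a^T [W h_i || W h_j])."""
    scores = {}
    for j in neighbors[i]:
        concatenated = wh[i] + wh[j]  # list concatenation [W h_i || W h_j]
        scores[j] = leaky_relu(sum(x * y for x, y in zip(a, concatenated)))
    highest = max(scores.values())    # subtract the maximum for numerical stability
    exponentials = {j: math.exp(s - highest) for j, s in scores.items()}
    total = sum(exponentials.values())
    return {j: value / total for j, value in exponentials.items()}


def gat_aggregate(wh, neighbors, a, i):
    """Attention-weighted sum of the transformed neighbor features of node i."""
    alpha = attention_coefficients(wh, neighbors, a, i)
    size = len(wh[i])
    return [sum(alpha[j] * wh[j][d] for j in neighbors[i]) for d in range(size)]


toy_wh = {"r1": [0.0], "p1": [0.0], "p2": [math.log(3.0)]}
toy_neighbors = {"r1": ["p1", "p2"]}
toy_a = [0.0, 1.0]  # attention vector of length 2 * feature size
toy_alpha = attention_coefficients(toy_wh, toy_neighbors, toy_a, "r1")
```

In this example the raw scores of p1 and p2 are 0 and log 3, so the softmax assigns them attention weights of 0.25 and 0.75.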
2.4.3. GraphSAGE Layer
GraphSAGE, or Graph Sample and Aggregation, represents a significant advancement in graph representation learning, particularly suited for large-scale graphs with diverse node features. Its primary objective is to capture the structural information and dependencies inherent in graphs in order to generate meaningful node embeddings [33]. It samples a fixed-size neighborhood for each node, capturing local graph structure. Then, it aggregates information from the sampled neighborhood to generate a representation for the target node. The aggregation can be performed with various methods, such as the mean aggregator, the LSTM aggregator, and the pooling aggregator. In this paper, the mean aggregator was chosen for implementation. The aggregated features at layer $l + 1$ are as follows:

$$h_{\mathcal{N}(v)}^{(l)} = \text{AGGREGATE}\left(\left\{h_u^{(l)} : u \in \mathcal{N}(v)\right\}\right), \qquad h_v^{(l+1)} = \sigma\left(W \cdot \left[h_v^{(l)} \,\|\, h_{\mathcal{N}(v)}^{(l)}\right]\right)$$

where $h_{\mathcal{N}(v)}^{(l)}$ represents the aggregated features from sampled neighboring nodes at layer l, $h_v^{(l)}$ is the feature vector of the central node, $\sigma$ is the activation function, $\|$ denotes concatenation, W is a learnable weight matrix, and AGGREGATE denotes the aggregation function.
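The mean-aggregator update can be sketched as follows. For brevity the sketch aggregates over the full neighborhood instead of a fixed-size sample, uses ReLU as the activation, and represents W as a list of rows; these simplifications and all names are assumptions for illustration.

```python
def mean_aggregate(h, neighbors, v):
    """Element-wise mean of the feature vectors of the neighbors of v."""
    size = len(h[v])
    nbrs = neighbors[v]
    return [sum(h[u][d] for u in nbrs) / len(nbrs) for d in range(size)]


def sage_update(h, neighbors, v, weight):
    """ReLU(W [h_v || mean of neighbor features]); W has one row per output unit."""
    concatenated = h[v] + mean_aggregate(h, neighbors, v)
    linear = [sum(w * x for w, x in zip(row, concatenated)) for row in weight]
    return [max(0.0, value) for value in linear]


toy_h = {"r1": [1.0, 0.0], "p1": [0.0, 2.0], "p2": [4.0, 2.0]}
toy_neighbors = {"r1": ["p1", "p2"]}
toy_weight = [
    [1.0, 0.0, 1.0, 0.0],   # first output unit
    [0.0, 1.0, 0.0, -1.0],  # second output unit
]
toy_updated = sage_update(toy_h, toy_neighbors, "r1", toy_weight)
```

The neighbor mean of r1 is [2.0, 2.0], so the concatenated input is [1.0, 0.0, 2.0, 2.0] and the updated representation is [3.0, 0.0] after the ReLU.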
2.4.4. Top-k Pooling Layer
NPI-WGNN uses top-k pooling, which adapts to the data and reduces the graph size as the GNN becomes deeper. The top-k pooling method introduces a parameter k, ranging from 0 to 1, that determines the fraction of nodes to be kept [35].
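The node-selection step of top-k pooling can be sketched as follows. The scoring rule (projection of each node's features onto a learnable vector, followed by tanh gating of the kept nodes) follows the common formulation of this layer; the function name, the use of `ratio` for the parameter k, and the toy data are assumptions for illustration.

```python
import math


def top_k_pool(features, adj, projection, ratio):
    """Keep the ceil(ratio * N) highest-scoring nodes and the edges among them."""
    norm = math.sqrt(sum(value * value for value in projection))
    scores = {
        node: sum(x * w for x, w in zip(vector, projection)) / norm
        for node, vector in features.items()
    }
    keep_count = math.ceil(ratio * len(features))
    kept = sorted(scores, key=scores.get, reverse=True)[:keep_count]
    kept_set = set(kept)
    # Gate the surviving features by tanh(score) so the projection stays trainable.
    pooled_features = {
        node: [x * math.tanh(scores[node]) for x in features[node]] for node in kept
    }
    pooled_adj = {node: adj[node] & kept_set for node in kept}
    return pooled_features, pooled_adj


toy_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [2.0, 2.0], "d": [-1.0, 0.0]}
toy_adj = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
toy_pooled_features, toy_pooled_adj = top_k_pool(
    toy_features, toy_adj, projection=[1.0, 0.0], ratio=0.5
)
```

With a ratio of 0.5, the two highest-scoring nodes (c and a) survive, and only the edge between them remains in the pooled graph.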
3. Datasets
The datasets used in this study include RPI2241, NPInter2, and RPI7317, each encompassing interactions between various types of ncRNAs and RNA-binding proteins (RBPs). The ncRNAs in these datasets include ribosomal RNA (rRNA), microRNA (miRNA), small nuclear RNA (snRNA), transfer RNA (tRNA), and long non-coding RNAs (lncRNAs). The RNAs exhibit diverse molecular sizes, with miRNAs typically ranging from 19 to 25 nucleotides, snRNAs averaging around 150 nucleotides, tRNAs between 76 and 90 nucleotides, and lncRNAs often exceeding 200 nucleotides and extending up to several thousand nucleotides. RNA-binding proteins (RBPs), essential for RNA metabolism, vary in molecular size from approximately 20 kDa to over 150 kDa depending on their function and structure.
3.1. RPI2241
The RPI2241 dataset was constructed by inferring ncRNA–protein interactions from structural data on protein–ncRNA complexes in the Protein–RNA Interface Database (PRIDB). This dataset emphasizes structural interactions derived from computational modeling, providing a high-confidence set of predicted interactions based on known RNA-binding motifs and structural interfaces.
3.2. NPInter2
The NPInter2 dataset was produced through a combination of high-throughput experimental techniques, capturing a broad spectrum of RNA–protein interactions across different species. This dataset integrates data from multiple experimental sources, including RNA immunoprecipitation (RIP), cross-linking immunoprecipitation (CLIP), and yeast two-hybrid assays, to enhance interaction coverage and diversity. This multi-source approach provides a comprehensive dataset of ncRNA–protein interactions that is well-suited for evaluating prediction models across varied interaction types.
3.3. RPI7317
The RPI7317 dataset primarily relies on CLIP-seq (Cross-Linking and Immunoprecipitation coupled with high-throughput sequencing) to identify direct interactions between RNAs and proteins. By employing stringent filtering processes to reduce false positives, this dataset ensures a high-confidence set of RNA–protein interactions. CLIP-seq’s precise detection of interaction sites at nucleotide-level resolution allows for reliable data that is instrumental for training and evaluating interaction prediction models.
Each dataset offers distinct characteristics: RPI2241 is structurally inferred, NPInter2 integrates multiple experimental techniques, and RPI7317 focuses on high-confidence direct interactions via CLIP-seq. Together, these datasets provide a comprehensive basis for evaluating the predictive capabilities of our proposed NPI-WGNN model across different RNA types, protein sizes, and interaction verification methods.
4. Experimental Results
4.1. Datasets and Evaluation Strategies
As shown in Table 1, we collected three ncRNA–protein datasets to assess the effectiveness of our technique. RPI2241 [16] was inferred computationally from structural information, whereas NPInter2.0 [36] and RPI7317 [37] were confirmed experimentally.
We built negative samples by randomly selecting non-interacting pairings of ncRNA and protein, as NPInter2.0 and RPI7317 only include positive samples. The negative samples were constructed to be equivalent in number to the positive samples.
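Balanced negative sampling of this kind can be sketched as below. The sketch enumerates every non-interacting pair before sampling, which is only practical for small graphs (rejection sampling would be preferable at scale); the function name and the fixed seed are assumptions for illustration.

```python
import random


def sample_negative_pairs(positive_pairs, seed=0):
    """Draw as many non-interacting (ncRNA, protein) pairs as there are positives."""
    positives = set(positive_pairs)
    rnas = sorted({rna for rna, _ in positives})
    proteins = sorted({protein for _, protein in positives})
    candidates = [
        (rna, protein)
        for rna in rnas
        for protein in proteins
        if (rna, protein) not in positives
    ]
    if len(candidates) < len(positives):
        raise ValueError("not enough non-interacting pairs for a balanced sample")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return rng.sample(candidates, len(positives))


toy_positives = [("r1", "p1"), ("r2", "p2"), ("r3", "p3")]
toy_negatives = sample_negative_pairs(toy_positives, seed=1)
```

The three positive pairs leave six candidate negatives, from which three distinct pairs are drawn so that the two classes are balanced.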
Since the GNN layers extract features for each node based on its neighboring nodes, the number of neighboring nodes plays a crucial role in effectively capturing the characteristics of nodes. Therefore, the prediction performance of our method is directly influenced by the number of nodes within the local graph structures. As shown in Table 1, the mean number of nodes in the 1-hop enclosing subgraphs is substantially lower for RPI2241, which suggests that prediction performance on this dataset is likely to be lower than on the others.
We used the Matthews correlation coefficient (MCC), accuracy, sensitivity (recall), F1-score, specificity, and precision to assess the effectiveness of our techniques. These measures are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Five-fold cross-validation is used to optimize the model parameters. We implemented NPI-WGNN in Python, using PyTorch 2.2.1, PyTorch Geometric 2.5.2, and networkx 3.3. The input channels of GCNConv are 75, representing the feature vector size, while GCNConv's output channels are 128. The input and output channels of GATConv and SAGEConv are both 128. The other parameters remain the same as those in NPI-GNN.
4.2. Prediction Performance of NPI-WGNN
We calculated the prediction performance of NPI-WGNN on the NPInter2.0, RPI7317, and RPI2241 datasets using five-fold cross-validation. The comparison of our method with NPI-GNN [
19], RPISeq-RF [
16], EDLMC [
38], and IPMiner [
17] is shown in
Table 2. These findings indicate that our method achieves significant improvements in accuracy, sensitivity, specificity, precision, Matthews correlation coefficient (MCC), and F1-score compared with state-of-the-art methods.
As shown in
Table 2, NPI-WGNN achieves the highest performance on NPInter2.0 and RPI7317, demonstrating its effectiveness across various evaluation metrics. However, the prediction performance on RPI2241 is comparatively lower, which may be attributed to several characteristics inherent to this dataset as mentioned in [
19].
Unlike NPInter2 and RPI7317, which are built on experimentally verified interactions obtained through high-throughput techniques such as CLIP-seq and yeast two-hybrid assays, the RPI2241 dataset is derived from inferred structural information on protein–ncRNA complexes in the PRIDB database. This reliance on structural inference introduces specific limitations. First, inferred datasets often have lower diversity compared with datasets constructed from direct experimental interactions, as they represent interactions within a specific structural context. This potentially limits the generalizability of interactions, resulting in lower diversity within RPI2241 and impacting the ability of our model to capture the full spectrum of ncRNA–protein interactions.
Additionally, RPI2241 has a lower average node density in the enclosing subgraphs, which may contribute to decreased performance by limiting the local neighborhood information that NPI-WGNN can leverage during the learning process. This sparsity, combined with lower diversity, affects the F1-score and other metrics on RPI2241, as seen in
Table 2. Notably, while NPI-WGNN generally outperforms other methods across datasets, its F1-score on RPI2241 is slightly lower than that of the IPMiner method, suggesting that IPMiner’s approach might be less affected by sparsity in this particular dataset.
In summary, while NPI-WGNN demonstrates robust improvements over other methods on NPInter2.0 and RPI7317, the inherent limitations of RPI2241—particularly data sparsity and its structural inference basis—highlight the importance of dataset characteristics in ncRNA–protein interaction prediction. This underscores the potential for further improvements in model adaptation across datasets with varying levels of diversity and structural context.
Furthermore, we plotted precision–recall and ROC curves to compare the proposed method with NPI-GNN on NPInter2.0. As shown in Figure 4, the AUPR values of NPI-GNN and NPI-WGNN are 0.960 and 0.977, respectively. According to Figure 5, the AUROC values of NPI-GNN and NPI-WGNN are 0.969 and 0.982, respectively.
4.3. The Effect of Different GNN Layers on NPI-WGNN Performance
In this section, we report an ablation study that investigates the impact of each GNN layer on our model's performance. Table 3 presents the results obtained when each of the GCN, GAT, and GraphSAGE layers is replaced with a GraphSAGE layer (as in the NPI-GNN model), alongside the results for the combination of all three layers.
The impact of GCN, GAT, and GraphSAGE layers on model performance reveals unique advantages when applied individually or in combination. Analyzing each layer separately, GraphSAGE excels in scalability due to its sampling-based aggregation, which suits large graphs but may overlook intricate local connections in dense areas, thereby affecting localized accuracy. GCN effectively captures global graph structure by leveraging full adjacency matrices, achieving high accuracy in strongly connected graphs; however, this approach can limit scalability for extensive datasets. GAT’s attention mechanism enables dynamic neighbor weighting, which enhances accuracy on heterogeneous graphs but increases computational demand. When combined, these layers complement each other, integrating GraphSAGE’s scalability, GCN’s global perspective, and GAT’s adaptive attention, collectively enhancing predictive accuracy and robustness across varying graph complexities. The empirical results confirm this complementary effect, underscoring the advantage of hybrid architectures in diverse graph-based applications.
4.4. Sequence Information Versus Network Topological Data
To further investigate the impact of additional information, we compared the results obtained by using k-mer features as additional information in the NPI-GNN method with those obtained by incorporating centrality measures as additional node attributes. The results presented in Figure 6 indicate that the NPI-GNN model depends more on the network structure than on sequence information. In other words, topological attributes contribute more to its predictions than k-mer features do.
Because this research extracts extensive structural and topological information, such as network edge weights and centrality measures, the model’s reliance on topological data further supports our claim regarding the effectiveness of the NPI-WGNN approach.
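To illustrate the two kinds of node attributes being compared, the sketch below computes a k-mer frequency vector from a sequence and two centrality measures from a small bipartite interaction graph. It is a minimal sketch under stated assumptions: the sequence, graph, alphabet, and normalization choices are hypothetical and may not match the feature extraction used in this study.

```python
from collections import deque
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Sequence feature: relative frequency of every possible k-mer,
    in lexicographic order of the given alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing unknown symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def degree_centrality(graph):
    """Topological feature: node degree divided by (n - 1), an assumed
    normalization that treats the graph as a general network."""
    n = len(graph)
    return {v: len(neighbors) / (n - 1) for v, neighbors in graph.items()}


def closeness_centrality(graph, v):
    """Topological feature: (reachable nodes) / (sum of shortest-path
    distances from v), with distances found by breadth-first search."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical ncRNA (r*) and protein (p*) interaction graph.
toy_graph = {
    "r1": ["p1", "p2"],
    "r2": ["p2", "p3"],
    "p1": ["r1"],
    "p2": ["r1", "r2"],
    "p3": ["r2"],
}
print(kmer_frequencies("ACGUACGU", k=2)[:4])
print(degree_centrality(toy_graph))
print(closeness_centrality(toy_graph, "p2"))
```

The k-mer vector depends only on a node's own sequence, whereas the centrality values change whenever edges are added or removed elsewhere in the network, which is the distinction examined in this section.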
5. Conclusions
This study presents NPI-WGNN, an approach for predicting ncRNA–protein interactions using weighted graph neural networks with a focus on network topological features. Our method improves on existing techniques, particularly the base NPI-GNN model, across multiple datasets. The key contributions and findings of this work are:
Enhanced graph representation: by incorporating edge weights derived from a bipartite high-order common neighbor (HOCN) similarity measure, we created a more informative graph structure. This weighting allows the model to better capture the strength of relationships between nodes, leading to more accurate predictions.
Improved node embeddings: weighted node2vec, which takes edge weights into account during the random walk process, enables a more nuanced exploration of the graph structure. This results in richer and more informative node representations compared with traditional node2vec.
Incorporation of centrality measures: by including various centrality measures (degree, closeness, betweenness, etc.) as node attributes, we provide the model with additional topological information, which enhances its ability to capture the importance and roles of different nodes within the network.
Advanced GNN architecture: the model combines multiple graph convolutional layers (GCNConv, GATConv, and SAGEConv) with top-k pooling and recurrent neural network layers. This architecture allows for effective feature extraction, aggregation, and processing of graph data, leading to superior predictive performance.
Experimental evaluation: extensive experiments on different benchmark datasets show that NPI-WGNN outperforms the base NPI-GNN model, achieving higher accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient (MCC).
Effect of data sparsity: data sparsity and the average node density of the enclosing subgraphs affect model performance, as evidenced by the reduced performance on the RPI2241 dataset.
Furthermore, our experiments reinforce the significance of network topological information over sequence-based data in predicting ncRNA–protein interactions. This finding supports our focus on enhancing and leveraging topological features in the NPI-WGNN model. In conclusion, NPI-WGNN represents a significant advancement in computational methods for predicting ncRNA–protein interactions. By effectively leveraging graph structure, topological features, and advanced neural network architectures, our approach provides more accurate and reliable predictions. This can aid researchers in understanding the complex regulatory functions of ncRNAs and their roles in various biological processes and diseases.
While NPI-WGNN demonstrates promising results, there are several avenues for future research and improvement:
Handling data sparsity: developing techniques to improve performance on sparse datasets like RPI2241 is crucial. This could involve exploring methods for data augmentation or developing models that can better handle limited node neighborhoods.
Incorporating additional biological information: while our model focuses on topological features, integrating other types of biological data (e.g., evolutionary conservation, structural information) could potentially enhance predictive power further.
Explainable AI techniques: developing methods to interpret the decisions made by the NPI-WGNN model could provide valuable insights into the biological mechanisms underlying ncRNA–protein interactions.
Large-scale application: applying and validating the model on larger, more comprehensive datasets of ncRNA–protein interactions as they become available.
Transfer learning: investigating the potential of transfer learning approaches to improve performance on smaller or domain-specific datasets by leveraging knowledge from larger, more general datasets.
Dynamic interaction prediction: extending the model to predict dynamic changes in ncRNA–protein interactions under different cellular conditions or in response to various stimuli.
Integration with other omics data: exploring ways to integrate NPI-WGNN predictions with other omics data (e.g., transcriptomics, proteomics) to provide a more comprehensive understanding of cellular regulatory networks.
Optimization and scalability: further optimizing the model architecture and implementation to improve computational efficiency and scalability for very large datasets.
Experimental validation: collaborating with experimental biologists to validate novel predictions made by NPI-WGNN and refine the model based on new experimental data.
Application to drug discovery: investigating the potential of NPI-WGNN in identifying novel therapeutic targets or predicting drug-target interactions involving ncRNAs.
By pursuing these future directions, we can continue to improve our understanding of ncRNA–protein interactions and their roles in cellular processes, ultimately contributing to advancements in fields such as systems biology, personalized medicine, and drug discovery.