NPI-WGNN: A Weighted Graph Neural Network Leveraging Centrality Measures and High-Order Common Neighbor Similarity for Accurate ncRNA–Protein Interaction Prediction

Khoushehgir, Fatemeh; Noshad, Zahra; Noshad, Morteza; Sulaimany, Sadegh

doi:10.3390/analytics3040027



by Fatemeh Khoushehgir ¹, Zahra Noshad ¹, Morteza Noshad ²,* and Sadegh Sulaimany ³

¹ Department of IT and Computer Engineering, Azarbaijan Shahid Madani University, Tabriz P.O. Box 53714-161, Iran

² Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA

³ Social and Biological Network Analysis Laboratory (SBNA), Department of Computer Engineering, University of Kurdistan, Sanandaj P.O. Box 416, Iran

* Author to whom correspondence should be addressed.

Analytics 2024, 3(4), 476-492; https://doi.org/10.3390/analytics3040027

Submission received: 7 October 2024 / Revised: 31 October 2024 / Accepted: 12 November 2024 / Published: 2 December 2024


Abstract:

Predicting ncRNA–protein interactions (NPIs) is essential for understanding regulatory roles in cellular processes and disease mechanisms, yet experimental methods are costly and time-consuming. In this study, we propose NPI-WGNN, a novel weighted graph neural network model designed to enhance NPI prediction by incorporating topological insights from graph structures. Our approach introduces a bipartite version of the high-order common neighbor (HOCN) similarity metric to assign edge weights in an ncRNA–protein network, refining node embeddings via weighted node2vec. We further enrich these embeddings with centrality measures, such as degree and Katz centralities, to capture network hierarchy and connectivity. To optimize prediction accuracy, we employ a hybrid GNN architecture that combines graph convolutional network (GCN), graph attention network (GAT), and GraphSAGE layers, each contributing unique advantages: GraphSAGE offers scalability, GCN provides a global structural perspective, and GAT applies dynamic neighbor weighting. An ablation study confirms the complementary strengths of these layers, showing that their integration improves predictive accuracy and robustness across varied graph complexities. Experimental results on three benchmark datasets demonstrate that NPI-WGNN outperforms state-of-the-art methods, achieving up to 96.1% accuracy, 97.5% sensitivity, and an F1-score of 0.96, positioning it as a robust and accurate framework for ncRNA–protein interaction prediction.

Keywords:

ncRNA–protein interaction; weighted graph neural network; link prediction; topological features

1. Introduction

The majority of genes in the human genome are classified as non-coding RNAs (ncRNAs) because they do not engage in protein synthesis, while only 2% of genes are responsible for encoding proteins [1]. Historically, ncRNAs were considered biologically insignificant and categorized as non-functional. However, recent discoveries have identified a variety of functional ncRNAs, including long non-coding RNAs (lncRNAs), small nuclear RNAs (snRNAs), transfer RNAs (tRNAs), and small RNAs (miRNAs and siRNAs), which play significant roles in gene regulation, chromatin remodeling, and essential cellular processes [2,3,4,5]. These ncRNAs interact with other RNAs, proteins, and DNA, influencing numerous molecular functions, and their role in diseases such as cancer further highlights the need for accurate prediction of ncRNA–protein interactions (ncRPIs) [3,6,7,8,9].

Experimental techniques, such as PAR-CLIP [10], RNAcompete [11], and HITS-CLIP [12], have traditionally been used to study ncRPIs. These methods measure interactions directly in laboratory settings and, although effective, are costly and time-consuming, which has led to the development of computational approaches that predict ncRPIs algorithmically. Computational techniques include network-based analysis, traditional machine learning, deep learning, integrated machine learning-deep learning methods, and Graph Neural Network (GNN)-based approaches. Each computational method offers distinct advantages, with GNNs particularly suited for capturing complex network structures inherent in biological interactions.

Network-based methods utilize the topological properties of biological networks to infer potential interactions. For example, LPIHN constructs a heterogeneous network from lncRNA and protein data, using random walk algorithms to predict new interactions [13]. Another approach, LPI-IBNRA, uses a bipartite network and manages second-order correlations to predict ncRPIs [14]. While network-based methods focus primarily on topological features, traditional machine learning models, such as support vector machines (SVM) and random forests (RF), use sequence-based features to train classifiers that predict interactions [15,16].

More recently, deep learning has been applied to ncRPI prediction, leveraging neural networks to automatically extract high-level features from sequence and structural data. IPMiner, for instance, uses stacked ensembling to predict ncRPIs from sequences [17], while RPIFSE combines CNN and extreme learning machine (ELM) classifiers [18].

Among GNN-based methods, the NPI-GNN model [19] represents a pioneering approach for ncRPI prediction. NPI-GNN transforms the NPI prediction task into a graph link prediction problem, where ncRNAs and proteins are represented as nodes and potential interactions as edges. The authors employed a graph convolutional network with three graphSAGE layers, enabling message passing between nodes to non-linearly transform feature vectors and learn low-dimensional embeddings. These embeddings are based on extracted protein and RNA sequence features, allowing the model to predict interactions and even reconstruct ncRNA–protein networks under different conditions. The use of RNA-seq data to identify abundant RNAs and highly probable negative interactions further enhances the network’s representation, while CLIP-seq data are employed to construct an RNA–protein interaction network specific to cell lines.

Despite these advances, most existing methods either focus solely on topological information or rely exclusively on sequence-based data. This highlights the need for a comprehensive approach that integrates both types of information. Graph neural networks (GNNs) have emerged as a powerful solution for this problem, allowing for the combination of network topology and node-specific features in a unified framework. GNNs, originally designed to operate on graph-structured data, excel in tasks such as link prediction, including predicting ncRNA–protein interactions [20]. Figure 1 provides a comprehensive categorization of the various features employed in NPIs.

In this work, we propose NPI-WGNN, a novel weighted graph neural network model designed to enhance ncRPI prediction by incorporating both topological and node-specific features. We introduce a bipartite version of the high-order common neighbor (HOCN) similarity measure to calculate edge weights, which are then used in a weighted node2vec embedding. Additionally, various centrality measures, such as degree, betweenness, and Katz centralities, are integrated into the embeddings to capture the hierarchical organization and information flow within the network.

Our GNN architecture includes three types of graph convolutional layers (GCNConv, GATConv, and GraphSAGE) and top-k pooling layers, which capture both local and global graph features. An ablation study confirms that each GNN layer contributes uniquely to model performance: GraphSAGE enhances scalability by sampling from large neighborhoods, GCNConv captures global structure effectively for strongly connected graphs, and GATConv dynamically weights neighbors, improving accuracy on heterogeneous graphs. These features are further processed by a recurrent neural network (RNN) to integrate spatial and temporal information, allowing NPI-WGNN to adapt to diverse graph complexities.

The experimental results, obtained from three benchmark datasets, demonstrate that NPI-WGNN consistently outperforms existing methods, achieving high accuracy, sensitivity, and MCC scores. Our approach, which integrates topological insights and node-specific data into a weighted GNN framework, offers a robust and accurate solution for predicting ncRNA–protein interactions, providing new opportunities for understanding regulatory functions and therapeutic interventions.

2. Algorithm and Model Architecture

To improve the ability of graph neural networks (GNNs) to predict ncRNA–protein interactions (NPIs), we introduce a method called NPI-WGNN. As shown in the flowchart in Figure 2, NPI-WGNN consists of four main phases: (1) construction of a weighted ncRNA–protein bipartite graph; (2) extraction of enclosing subgraphs for every observed positive link and unobserved negative link; (3) construction of a node information matrix including structural labels, weighted node2vec embeddings, and centrality measures; and (4) a graph neural network (GNN) comprising graph convolutional layers, pooling layers, fully connected layers, and recurrent neural network (RNN) layers that learns from the extracted features and performs the classification task.

2.1. Construction of Weighted ncRNA–Protein Bipartite Network

After reviewing the previous studies, we realized that the topological and structural features extracted from the ncRNA–protein network significantly influence the prediction of relationships between ncRNAs and proteins. Therefore, in this section, we use a criterion that assigns weight to each edge depending on how similar two nodes are. According to this criterion, the nodes exhibiting greater similarity should carry more weight. To determine the weight of both positive and negative links, we utilize similarity-based metrics of link prediction [21]. This allows us to accurately calculate the weight of each link and analyze the network structure comprehensively.

In a previous study, Wang et al. [22] proposed a novel weighting technique known as high-order common neighbor (HOCN) for unipartite graphs. This method was effective in identifying protein complexes and establishing reliable networks of protein-protein interactions (PPIs). Building on this foundation, we have modified the HOCN weighting method to fit our specific bipartite ncRNA–protein network. By customizing this method to our network structure, we aim to increase the accuracy and reliability of identifying ncRNA–protein interactions. The high-order common neighbor (HOCN) between v and u is defined as Equation (1):

$$\mathrm{HOCN}(v, u) = \frac{\mathrm{JC}(v, u) + \mathrm{CNS}(v, u)}{|\mathrm{CN}(v, u)| + 1} \tag{1}$$

where $\mathrm{JC}(v, u)$ is the Jaccard coefficient similarity [23], defined by Equations (2) and (3):

$$\mathrm{JC}(v, u) = \frac{|\mathrm{CN}(v, u)|}{|N(v) \cup N'(u)|} \tag{2}$$

$$\mathrm{CN}(v, u) = N(v) \cap N'(u) \tag{3}$$

Here, $N(v)$ is the set of all neighbors of node v, $\mathrm{CN}(v, u)$ is the set of common neighbors of nodes v and u, and $N'(u)$ is the set of all neighbors of neighbors of node u. CNS refers to the common neighbors support, that is, the likelihood that the common neighbors of u and v support the edge $(v, u)$:

$$\mathrm{CNS}(v, u) = \sum_{w \in \mathrm{CN}(v, u)} \mathrm{JC}(v, w) \cdot \mathrm{JC}(w, u) \tag{4}$$

The algorithm for constructing a weighted ncRNA–protein network is presented in Algorithm 1, providing a more comprehensive depiction of the construction process.

Algorithm 1 Building a weighted ncRNA–protein graph

Input: The ncRNA–protein graph, $G = (V, E)$.
Output: The weighted ncRNA–protein graph, $G = (V, E, W)$.

1: for each edge $(v, u)$ in E do
2:     calculate $\mathrm{JC}(v, u)$ according to Equation (2);
3:     if $|N(v) \cap N'(u)| \geq 1$ then
4:         calculate the weight of edge $(v, u)$, $\mathrm{HOCN}(v, u) = \frac{\mathrm{JC}(v, u) + \mathrm{CNS}(v, u)}{|\mathrm{CN}(v, u)| + 1}$, according to Equation (1);
5:     else
6:         remove edge $(v, u)$ from E;
7: return the weighted ncRNA–protein network $G = (V, E, W)$.
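Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation. The toy network and all function names are hypothetical, and the graph is stored as a dictionary of neighbor sets. Two assumptions are made where the text leaves room for interpretation: $N'(u)$ is taken to exclude u itself, and the factor $\mathrm{JC}(w, u)$ in Equation (4), which involves two nodes of the same type, is taken to be the ordinary Jaccard coefficient on direct neighbors.

```python
from itertools import chain


def second_order_neighbors(adj, u):
    """N'(u): neighbors of neighbors of u (assumption: u itself is excluded)."""
    return set(chain.from_iterable(adj[w] for w in adj[u])) - {u}


def common_neighbors(adj, v, u):
    """Equation (3): CN(v, u) = N(v) & N'(u)."""
    return adj[v] & second_order_neighbors(adj, u)


def jaccard_bipartite(adj, v, u):
    """Equation (2): |CN(v, u)| / |N(v) | N'(u)| for a cross-type pair."""
    union = adj[v] | second_order_neighbors(adj, u)
    return len(common_neighbors(adj, v, u)) / len(union) if union else 0.0


def jaccard_same_side(adj, w, u):
    """Assumption: plain Jaccard on direct neighbors for a same-type pair."""
    union = adj[w] | adj[u]
    return len(adj[w] & adj[u]) / len(union) if union else 0.0


def hocn_weights(adj, edges):
    """Return {(v, u): HOCN(v, u)}; edges without common neighbors are dropped."""
    weights = {}
    for v, u in edges:
        cn = common_neighbors(adj, v, u)
        if not cn:  # Algorithm 1, steps 5-6: remove the edge
            continue
        jc = jaccard_bipartite(adj, v, u)
        # Equation (4): support contributed by each common neighbor w
        cns = sum(jaccard_bipartite(adj, v, w) * jaccard_same_side(adj, w, u)
                  for w in cn)
        weights[(v, u)] = (jc + cns) / (len(cn) + 1)  # Equation (1)
    return weights


# Hypothetical toy network: ncRNAs r1-r3 and proteins p1-p2.
edges = [("r1", "p1"), ("r1", "p2"), ("r2", "p1"), ("r2", "p2"), ("r3", "p2")]
adj = {}
for v, u in edges:
    adj.setdefault(v, set()).add(u)
    adj.setdefault(u, set()).add(v)

weights = hocn_weights(adj, edges)
```

In this toy network the edge (r3, p2) is removed because r3 has no neighbor other than p2, while the remaining four edges receive a weight.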

2.2. Extracting Enclosing Subgraphs

The SEAL framework [24] demonstrates that the information obtained from global link prediction heuristics, such as the Katz index [25] and PageRank [26], can be approximated from local enclosing subgraphs. Therefore, for each positive (existing) and negative (non-existing) connection, we extract a 1-hop local enclosing subgraph. As shown in Figure 3, the enclosing subgraph for a pair of nodes $(x, y)$ is the subgraph induced by the neighbors of both x and y up to a specified number of hops, denoted as h. It is obtained by taking the union of the neighbors of x and y within that hop distance.
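The extraction step can be illustrated with a short breadth-first search. The sketch below is only an illustration under simple assumptions: the graph is a hypothetical dictionary of neighbor sets, and the function names are not taken from the paper.

```python
from collections import deque


def nodes_within(adj, source, h):
    """Return all nodes at most h hops away from source (source included)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == h:  # do not expand beyond h hops
            continue
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


def enclosing_subgraph(adj, x, y, h=1):
    """Subgraph induced by the union of the h-hop neighborhoods of x and y."""
    nodes = nodes_within(adj, x, h) | nodes_within(adj, y, h)
    return {n: adj[n] & nodes for n in nodes}


# Hypothetical path-shaped bipartite graph: r1 - p1 - r2 - p2 - r3
adj = {
    "r1": {"p1"},
    "p1": {"r1", "r2"},
    "r2": {"p1", "p2"},
    "p2": {"r2", "r3"},
    "r3": {"p2"},
}
sub = enclosing_subgraph(adj, "r1", "p1", h=1)
```

Here the 1-hop enclosing subgraph of the pair (r1, p1) contains r1, p1, and r2, and only the edges among them.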

In a typical graph neural network (GNN) architecture, two matrices are commonly utilized as input. The first matrix is the enclosing subgraph’s adjacency matrix (A), which captures the relationships between nodes in the graph. The adjacency matrix represents the connectivity pattern of the graph, indicating which nodes are connected to each other. The second matrix is the node information matrix (X), which provides details about the properties or attributes of each node in the graph. This matrix describes specific characteristics or features associated with individual nodes, including structural node labels, node embeddings, and node attributes. By combining the local enclosing subgraphs extracted from the adjacency matrix (A) and the node information matrix (X), the GNN can effectively process and analyze the graph data, leveraging both the structural relationships between nodes and the node-specific information. This integration enables the GNN to learn and make predictions based on the collective knowledge encoded within the graph.

2.3. Building Node Information Matrix

The construction of the node information matrix involves several components, including structural labels, node embeddings, and node attributes.

2.3.1. Structural Label

The node information matrix X begins with the graph’s structural label. To determine the structural label for a node v, which could represent either a protein or a ncRNA, we first label the two target nodes x and y as 0. For any node v, its structural label is derived by finding the minimum of the shortest path distances from v to x and from v to y. This approach ensures that the structural label reflects the proximity of v to the target nodes within the graph.
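A minimal sketch of this labeling, assuming the enclosing subgraph is given as a hypothetical dictionary of neighbor sets, runs one breadth-first search started from both target nodes at once. The first time a node is reached, its distance is the minimum of its distances to x and y.

```python
from collections import deque


def structural_labels(adj, x, y):
    """Label each node with min(dist(v, x), dist(v, y)); x and y get label 0."""
    labels = {x: 0, y: 0}
    queue = deque([x, y])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in labels:
                labels[nb] = labels[node] + 1
                queue.append(nb)
    return labels


# Hypothetical enclosing subgraph around the target pair (r1, p1)
adj = {
    "r1": {"p1"},
    "p1": {"r1", "r2"},
    "r2": {"p1", "p2"},
    "p2": {"r2"},
}
labels = structural_labels(adj, "r1", "p1")
```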

2.3.2. Weighted node2vec

The second component of the node information matrix (X) is the node embedding. In our implementation, we used the weighted node2vec algorithm [27] as the node embedding method. This technique allows us to create a vector representation, called $f_{ne}$, for each node. The node2vec algorithm uses a second-order random walk and controls the walk process with two parameters, p and q. This process is outlined as follows:

$$p(x \mid v) = \begin{cases} \frac{1}{Z} \pi_{vx}, & \text{if } (v, x) \in E \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes v and x, Z is the normalizing constant, and E is the set of edges. In Equation (6), t is the node traversed before v, $d_{tx}$ is the shortest-path distance between nodes t and x, and $W_{vx}$ is the weight of edge $(v, x)$.

$$\pi_{vx} = \begin{cases} \frac{W_{vx}}{p}, & \text{if } d_{tx} = 0 \\ W_{vx}, & \text{if } d_{tx} = 1 \\ \frac{W_{vx}}{q}, & \text{if } d_{tx} = 2 \end{cases} \tag{6}$$

The weighted node2vec algorithm enhances link prediction by incorporating edge weights, resulting in a more detailed and accurate representation of the strengths of connections between nodes. This method effectively captures the complexities of network relationships, thereby achieving higher prediction accuracy compared with the traditional node2vec algorithm, which assumes uniform edge weights.
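One step of the weighted second-order walk in Equations (5) and (6) can be sketched as below. The function name and the small weighted graph are hypothetical, and the graph is stored as nested dictionaries mapping each node to its neighbors and edge weights.

```python
def transition_probs(adj_w, t, v, p, q):
    """Probabilities of moving from v to each neighbor x, having arrived from t."""
    unnorm = {}
    for x, w in adj_w[v].items():
        if x == t:               # distance 0 from t: return to the previous node
            unnorm[x] = w / p
        elif x in adj_w[t]:      # distance 1 from t: x is also a neighbor of t
            unnorm[x] = w
        else:                    # distance 2 from t: move away from t
            unnorm[x] = w / q
    z = sum(unnorm.values())     # normalizing constant Z in Equation (5)
    return {x: val / z for x, val in unnorm.items()}


# Hypothetical weighted graph used only to exercise the three cases.
adj_w = {
    "t": {"v": 1.0, "a": 1.0},
    "v": {"t": 1.0, "a": 2.0, "b": 1.0},
    "a": {"t": 1.0, "v": 2.0},
    "b": {"v": 1.0},
}
probs = transition_probs(adj_w, t="t", v="v", p=2.0, q=0.5)
```

With p = 2 and q = 0.5, the unnormalized values are 0.5 for returning to t, 2 for a, and 2 for b, so the walk is biased away from the node it just left.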

2.3.3. Centrality Measures

The node attributes make up the third element of the node information matrix X. Extra details about nodes may be found in their characteristics. In this work, we used centrality measures as the node attributes. In the realm of complex networks, it is crucial to acknowledge that each node has some properties that determine its significance within a given application-specific context. These characteristics can be identified using centrality measures. These measures provide numerical values that reflect a node’s importance [28]. In this research, we employed bipartite versions of degree, closeness, betweenness, Katz, and propagation entropy (PE) centralities for our ncRNA–protein bipartite networks.

Degree Centrality

Typically, the degree centrality values are divided by the maximum possible degree to normalize them when working with unipartite networks. In such cases, the maximum possible degree is the total node count of the network minus one, $(n - 1)$. However, in the case of bipartite networks, the maximum degree of a node is the number of nodes in the opposite node set [29]. For a bipartite network with node sets U and V containing n and m nodes, respectively, the degree centrality of a node v is given by:

$$d_v = \frac{\deg(v)}{m} \quad \text{for } v \in U \tag{7}$$

$$d_v = \frac{\deg(v)}{n} \quad \text{for } v \in V \tag{8}$$

where $\deg(v)$ is the degree of node v.
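As a small illustration of Equations (7) and (8), the following sketch normalizes each degree by the size of the opposite node set. The toy graph and the function name are hypothetical.

```python
def bipartite_degree_centrality(adj, U, V):
    """Degree of each node divided by the size of the opposite node set."""
    centrality = {}
    for v in U:
        centrality[v] = len(adj[v]) / len(V)  # Equation (7)
    for v in V:
        centrality[v] = len(adj[v]) / len(U)  # Equation (8)
    return centrality


# Hypothetical toy network: ncRNAs in U, proteins in V.
U = ["r1", "r2", "r3"]
V = ["p1", "p2"]
adj = {
    "r1": {"p1", "p2"},
    "r2": {"p1", "p2"},
    "r3": {"p2"},
    "p1": {"r1", "r2"},
    "p2": {"r1", "r2", "r3"},
}
degree = bipartite_degree_centrality(adj, U, V)
```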

Closeness Centrality

Closeness centrality quantifies how near a node is to the other nodes in a network. It is usually normalized by the minimum possible total distance. In a bipartite network, a node must be at least one step away from every node in the opposite node set and at least two steps away from every other node in its own set [29]. As a result, the closeness centrality of a node v in a bipartite network with two node sets U and V, comprising n and m nodes, respectively, can be expressed as follows:

$$C_v = \frac{m + 2(n - 1)}{d} \quad \text{for } v \in U \tag{9}$$

$$C_v = \frac{n + 2(m - 1)}{d} \quad \text{for } v \in V \tag{10}$$

where d is the total distance from node v to all other nodes.
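Equations (9) and (10) can be sketched with a breadth-first search that sums the distances from each node. The sketch assumes a connected, unweighted graph with at least two nodes. The toy graph and function names are hypothetical.

```python
from collections import deque


def total_distance(adj, source):
    """Sum of shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return sum(dist.values())


def bipartite_closeness(adj, U, V):
    """Closeness normalized by the minimum possible total distance."""
    n, m = len(U), len(V)
    centrality = {}
    for v in U:
        centrality[v] = (m + 2 * (n - 1)) / total_distance(adj, v)  # Eq. (9)
    for v in V:
        centrality[v] = (n + 2 * (m - 1)) / total_distance(adj, v)  # Eq. (10)
    return centrality


# Hypothetical connected toy network: ncRNAs in U, proteins in V.
U = ["r1", "r2", "r3"]
V = ["p1", "p2"]
adj = {
    "r1": {"p1", "p2"},
    "r2": {"p1", "p2"},
    "r3": {"p2"},
    "p1": {"r1", "r2"},
    "p2": {"r1", "r2", "r3"},
}
closeness = bipartite_closeness(adj, U, V)
```

A node that is adjacent to every node of the opposite set, such as r1 or p2 here, attains the maximum value of 1.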

Betweenness Centrality

Betweenness centrality measures a node’s importance by the number of shortest paths that pass through it. It is computed by counting, for every pair of nodes, the shortest paths between them that pass through the given node, and normalizing this count by the total number of shortest paths between that pair. High betweenness centrality is thought to be crucial for preserving the network’s ability to transfer resources or information. For a given node v, the betweenness centrality can be written as follows:

$$b_v = \frac{1}{N} \sum_{s \neq v \neq t} \frac{\partial_{st}(v)}{\partial_{st}} \tag{11}$$

where $\partial_{st}$ is the number of shortest paths between nodes s and t, $\partial_{st}(v)$ is the number of those paths on which node v is an intermediate node, and N is the normalization factor [28]. By considering the role of node v as an intermediate point, we gain insights into the influence and connectivity it contributes to the overall network structure. The maximum attainable betweenness value in a bipartite graph is determined by the relative sizes of the two node sets [29]. Nodes in U are normalized by dividing by

$$\frac{1}{2} \left[ m^2 (s + 1)^2 + m (s + 1)(2t - s - 1) - t (2s - t + 3) \right] \tag{12}$$

where n is the number of nodes in U, m is the number of nodes in V, and s and t (auxiliary quantities here, not the path endpoints of Equation (11)) are determined as follows:

$$s = \left\lfloor \frac{n - 1}{m} \right\rfloor, \quad t = (n - 1) \bmod m \tag{13}$$

and nodes in V are normalized by dividing by

$$\frac{1}{2} \left[ n^2 (p + 1)^2 + n (p + 1)(2r - p - 1) - r (2p - r + 3) \right] \tag{14}$$

where

$$p = \left\lfloor \frac{m - 1}{n} \right\rfloor, \quad r = (m - 1) \bmod n \tag{15}$$
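The two normalization factors in Equations (12) to (15) are easy to evaluate directly. The sketch below assumes that s and p are integer quotients, which is what the accompanying remainders t and r suggest. The function name is hypothetical.

```python
def bipartite_betweenness_max(n, m):
    """Maximum betweenness for nodes in U and in V, with |U| = n and |V| = m."""
    s, t = divmod(n - 1, m)  # Equation (13), assuming integer division
    max_u = (m**2 * (s + 1)**2
             + m * (s + 1) * (2 * t - s - 1)
             - t * (2 * s - t + 3)) / 2       # Equation (12)
    p, r = divmod(m - 1, n)  # Equation (15), assuming integer division
    max_v = (n**2 * (p + 1)**2
             + n * (p + 1) * (2 * r - p - 1)
             - r * (2 * p - r + 3)) / 2       # Equation (14)
    return max_u, max_v


max_u, max_v = bipartite_betweenness_max(3, 2)
```

As a sanity check, a star with one center in U and two leaves in V gives a maximum of 1 for the center (the single shortest path between the leaves) and 0 for the leaves.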

Katz Centrality

Katz centrality [25] is a measure of node influence in a network that takes into account both direct and indirect connections. It calculates a node’s centrality score by considering all paths leading to it, giving more weight to shorter paths. This is done using an adjacency matrix A and iteratively updating the centrality score $C_i$ for each node. The formula $C_i = \alpha \sum_{j} A_{ij} C_j + \beta$ is used, where $\alpha$ is a damping factor (less than the inverse of the largest eigenvalue of A) and $\beta$ is a constant for baseline centrality. This process continues until the scores stabilize, providing a detailed measure of influence across the network. Note that in our weighted graph, $A_{ij}$ represents the weight of the edge between nodes i and j.
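The iterative update described above can be sketched as a fixed-point iteration. This is an illustration only: the function name, the default values of the damping factor and baseline constant, and the stopping rule are assumptions, and the weighted graph is a hypothetical nested dictionary.

```python
def katz_centrality(adj_w, alpha=0.1, beta=1.0, tol=1e-12, max_iter=1000):
    """Iterate C_i = alpha * sum_j A_ij * C_j + beta until the scores stabilize."""
    scores = {node: 0.0 for node in adj_w}
    for _ in range(max_iter):
        updated = {
            i: alpha * sum(w * scores[j] for j, w in adj_w[i].items()) + beta
            for i in adj_w
        }
        if max(abs(updated[i] - scores[i]) for i in adj_w) < tol:
            return updated
        scores = updated
    raise RuntimeError("Katz iteration did not converge; try a smaller alpha")


# Hypothetical two-node weighted graph: the largest eigenvalue of A is 1,
# so any alpha below 1 converges.
adj_w = {"r1": {"p1": 1.0}, "p1": {"r1": 1.0}}
katz = katz_centrality(adj_w, alpha=0.5, beta=1.0)
```

For this symmetric example the fixed point satisfies C = 0.5 C + 1, so both scores converge to 2.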

Propagation Entropy Centrality

Node propagation entropy [30] is a metric that evaluates the importance of nodes in a network by combining both local and global factors. It integrates the local clustering coefficient, which indicates how clustered a node’s neighbors are, with the node’s influence through its first- and second-order neighbors. The clustering coefficient and neighbors ($cn$) metric measures a node’s ability to disseminate information based on these factors. To reflect global network context, node propagation entropy (PE) is derived from the entropy of the normalized propagation capacity of a node’s neighbors. The node propagation entropy is calculated as follows:

$$PE_i = - \sum_{j \in N(i)} I_j \ln I_j \tag{16}$$

where $I_i = \frac{cn_i}{\sum_{j=1}^{n} cn_j}$ and $cn_i = (N_2(i) + N(i)) / (1 + c_i)$ is the clustering coefficient and neighbors metric. The local clustering coefficient $c_i$ for a bipartite graph is as follows:

$$c_i = \frac{\sum_{j \in N(N(i))} c_{ij}}{|N(N(i))|} \tag{17}$$

where $N(N(i))$ are the second-order neighbors of i, excluding i itself, and $c_{ij} = \frac{|N(i) \cap N(j)|}{|N(i) \cup N(j)|}$.

The complete node feature vector F is the concatenation of all the components listed above:

$$F = [l_s * f_{ne} * c]^T \tag{18}$$

The vector c holds the five centrality measures for each ncRNA and protein node. We kept the default dimension option for building the node2vec embedding ($f_{ne}$), which produced a 64-D vector. Additionally, the whole node property F becomes a vector in 75 dimensions by integrating the 1-D structural label ($l_s$).

2.4. Model Structure

Our NPI-WGNN model comprises three graph convolutional-based modules, three global pooling modules, one additive module, two fully connected layers, and a two-layer recurrent neural network (RNN). The graph convolutional-based modules include three types of graph neural network layers: a graph convolutional layer (GCNConv), a graph attention layer (GATConv), and a GraphSAGE layer (SAGEConv), each followed by a Rectified Linear Unit (ReLU) activation and top-k pooling layers, arranged sequentially. The global pooling modules utilize global average pooling and global max pooling. The outputs from these three global pooling modules are combined using the additive module. Subsequently, two fully connected layers with 128 and 64 neurons and a two-layer recurrent neural network with two neurons, respectively, process the output from the additive module. The final layer’s output is then processed by a log-softmax function to produce a 2-D vector, representing the likelihood of positive and negative samples.

Integrating GCNConv [31], GATConv [32], and SAGEConv [33] creates a powerful and well-rounded framework for graph representation learning. GCNConv effectively captures local graph structures and node features, forming a solid foundation for learning. GATConv introduces attention mechanisms that differentiate the importance of various nodes, allowing the model to focus on the most crucial interactions. SAGEConv utilizes sampling techniques to aggregate information from node neighborhoods, enabling scalability to larger graphs while maintaining diverse information. The combination of these methods ensures comprehensive feature extraction, attention-based prioritization, and efficient neighborhood aggregation, resulting in improved accuracy and robustness in predictive tasks.

In the upper part of our model, there is a two-layer recurrent network [34] to combine information. The combined information is used to make predictions. Combining RNN layers with GNNs frequently enhances performance on tasks involving graphs [20]. By utilizing the strengths of both architectures, the model is able to capture spatial and temporal aspects of the data, resulting in more robust and accurate predictions. The experimental results indicate that combining RNN with GNN has significantly improved the performance of the network.

2.4.1. Graph Convolutional Layer

Graph convolutional layer (GCNConv) [31] is a fundamental component within the domain of graph neural networks (GNNs), essential for understanding the complex relationships and structures inherent in graph-based data. By leveraging feature data alongside local network connections, GCNConv aims to develop meaningful node representations by integrating information from neighboring nodes. It employs the adjacency matrix to map node connections, ensuring that nodes with similar structures produce comparable embeddings. With adjustable parameters and non-linear activation functions, GCNConv refines these representations, enabling the model to identify subtle patterns and complex dependencies within the graph. As a key component of GNN architectures, GCNConv plays a crucial role in link prediction tasks, providing a robust framework for extracting insights from complex graph-structured data.

Given X as the node information matrix and A as the adjacency matrix, the node representation at layer $l + 1$ is as follows:

$$H^{(l+1)} = \sigma\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \right) \tag{19}$$

where $H^{(l)}$ represents the node representation at layer l (with $H^{(0)} = X$), $\sigma$ is the activation function, $\hat{A} = A + I$ is the adjacency matrix with added self-loops, $\hat{D}$ is the degree matrix of $\hat{A}$, and $W^{(l)}$ is the weight matrix for layer l.
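To make Equation (19) concrete, the following sketch applies one propagation step to a tiny graph using plain Python lists. It assumes the usual GCN convention in which self-loops are added to A before the symmetric normalization, and it uses ReLU as the activation. The helper names and the example matrices are hypothetical.

```python
import math


def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]


def gcn_layer(A, H, W):
    """One step of Equation (19): ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    # Add self-loops (assumed convention), then read degrees from the row sums
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_hat]
    A_norm = [[d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


A = [[0.0, 1.0], [1.0, 0.0]]    # two connected nodes
H = [[1.0, 0.0], [0.0, 1.0]]    # one-hot input features
W = [[1.0, -1.0], [2.0, 0.0]]   # hypothetical weight matrix
H_next = gcn_layer(A, H, W)
```

Both nodes end up with the same representation because, after normalization, each node averages its own features with those of its single neighbor.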

2.4.2. Graph Attention Layer

The graph attention layer (GATConv) [32] dynamically allocates attention weights to neighboring nodes, allowing each node to emphasize informative neighbors while downplaying less relevant ones. By adaptively assigning attention weights based on the features of neighboring nodes, GATConv learns richer and more discriminative representations for each node. This flexibility enables the model to effectively leverage both local and global information, thereby enhancing performance. The process of updating node features through the GATConv layer after information aggregation and transformation is as follows:

$$h_i' = \sigma\left( \sum_{j \in N(i)} \mathrm{softmax}_j\left( \mathrm{LeakyReLU}\left( a^T [W h_i \,\|\, W h_j] \right) \right) W h_j \right) \tag{20}$$

where $N(i)$ represents the set of neighboring nodes of node i, $h_i$ and $h_j$ are the feature vectors of nodes i and j, respectively, a is a learnable parameter vector, and W is a learnable weight matrix.
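A single-node, single-head version of Equation (20) can be sketched in plain Python as follows. The function names, the LeakyReLU slope of 0.2, and the use of ReLU for the outer activation are assumptions made for illustration. The feature vectors and parameters are hypothetical.

```python
import math


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z


def gat_update(h, neighbors, i, W, a, sigma=lambda z: max(0.0, z)):
    """Return the updated features of node i and its attention weights."""
    Wh_i = matvec(W, h[i])
    Wh = {j: matvec(W, h[j]) for j in neighbors[i]}
    # Attention logits: LeakyReLU(a^T [W h_i || W h_j])
    logits = {j: leaky_relu(sum(ak * zk for ak, zk in zip(a, Wh_i + Wh[j])))
              for j in Wh}
    # Softmax over the neighbors of i (shifted by the maximum for stability)
    top = max(logits.values())
    exps = {j: math.exp(s - top) for j, s in logits.items()}
    total = sum(exps.values())
    alpha = {j: e / total for j, e in exps.items()}
    out = [sigma(sum(alpha[j] * Wh[j][k] for j in Wh))
           for k in range(len(Wh_i))]
    return out, alpha


h = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
neighbors = {0: [1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]

# With a = 0 every neighbor receives the same attention weight.
uniform_out, uniform_alpha = gat_update(h, neighbors, 0, W, a=[0.0] * 4)
# A non-zero a lets the layer favor the neighbor with the larger first feature.
biased_out, biased_alpha = gat_update(h, neighbors, 0, W, a=[0.0, 0.0, 1.0, 0.0])
```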

2.4.3. GraphSAGE Layer

GraphSAGE, or Graph Sample and Aggregation, represents a significant advancement in graph representation learning, particularly suited for large-scale graphs with diverse node features. Its primary objective is to capture the structural information and dependencies inherent in graphs in order to generate meaningful node embeddings [33]. It samples a fixed-size neighborhood for each node, capturing local graph structure. Then, it aggregates information from the sampled neighborhood to generate a representation for the target node. The aggregation process can be executed using various methods, such as the mean aggregator, LSTM aggregator, and pooling aggregator. In this paper, the mean aggregator was chosen for implementation. The aggregated features at layer $l + 1$ are as follows:

$$h_{agg}^{(l+1)} = \mathrm{ReLU}\left( W \cdot \mathrm{AGGREGATE}\left( h_{samples}^{(l)}, h_v^{(l)} \right) \right) \tag{21}$$

where $h_{samples}^{(l)}$ represents the aggregated features from sampled neighboring nodes at layer l, $h_v^{(l)}$ is the feature vector of the central node, W is a learnable weight matrix, and $\mathrm{AGGREGATE}$ denotes the aggregation function.
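The sample-then-aggregate step with a mean aggregator can be sketched as follows. The sketch assumes that the mean is taken over the central node together with its sampled neighbors, which is one common reading of the mean aggregator. The function name, sample size, and data are hypothetical.

```python
import random


def sage_mean_update(h, neighbors, v, W, sample_size, rng):
    """ReLU(W . mean(h_v and the features of up to sample_size neighbors))."""
    candidates = sorted(neighbors[v])
    if len(candidates) > sample_size:
        sampled = rng.sample(candidates, sample_size)  # fixed-size neighborhood
    else:
        sampled = candidates
    group = [h[v]] + [h[u] for u in sampled]
    mean = [sum(column) / len(group) for column in zip(*group)]
    return [max(0.0, sum(w * x for w, x in zip(row, mean))) for row in W]


h = {"v": [1.0, 1.0], "a": [3.0, 1.0], "b": [2.0, 4.0]}
neighbors = {"v": {"a", "b"}}
W = [[1.0, 0.0], [0.0, -1.0]]  # hypothetical weight matrix
updated = sage_mean_update(h, neighbors, "v", W, sample_size=5,
                           rng=random.Random(0))
```

Because the node has fewer neighbors than the sample size, all of them are used, the mean feature vector is [2, 2], and the ReLU clips the negative second component to zero.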

2.4.4. Top-k Pooling Layer

NPI-WGNN uses top-k pooling, which adapts to the data and reduces the graph size as the GNN becomes deeper. The top-k pooling method introduces a parameter k, ranging from 0 to 1, that determines the fraction of nodes to be kept [35].
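The node-selection idea can be sketched as below. The scoring rule is an assumption, since it is not described in the text: each node is scored by projecting its features onto a learnable vector p, the highest-scoring fraction k of nodes is kept, and the kept features are gated by the tanh of their scores, as in commonly used top-k pooling formulations. The function name and data are hypothetical.

```python
import math


def top_k_pool(X, p, k):
    """Keep the ceil(k * N) highest-scoring nodes and gate their features."""
    norm = math.sqrt(sum(c * c for c in p))
    # Projection score of every node (assumed scoring rule)
    scores = [sum(x * c for x, c in zip(row, p)) / norm for row in X]
    keep = max(1, math.ceil(k * len(X)))
    kept = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)[:keep]
    pooled = [[x * math.tanh(scores[i]) for x in X[i]] for i in kept]
    return kept, pooled


X = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 0.5]]  # four nodes, two features
p = [1.0, 1.0]                                        # hypothetical projection
kept, pooled = top_k_pool(X, p, k=0.5)
```

With k = 0.5, two of the four nodes survive, and the edges of the pooled graph would be restricted to those between kept nodes.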

3. Datasets

The datasets used in this study include RPI2241, NPInter2, and RPI7317, each encompassing interactions between various types of ncRNAs and RNA-binding proteins (RBPs). The ncRNAs in these datasets include ribosomal RNA (rRNA), microRNA (miRNA), small nuclear RNA (snRNA), transfer RNA (tRNA), and long non-coding RNAs (lncRNAs). The RNAs exhibit diverse molecular sizes, with miRNAs typically ranging from 19 to 25 nucleotides, snRNAs averaging around 150 nucleotides, tRNAs between 76 and 90 nucleotides, and lncRNAs often exceeding 200 nucleotides and extending up to several thousand nucleotides. RNA-binding proteins (RBPs), essential for RNA metabolism, vary in molecular size from approximately 20 kDa to over 150 kDa depending on their function and structure.

3.1. RPI2241

The RPI2241 dataset was constructed by inferring ncRNA–protein interactions from structural data on protein–ncRNA complexes in the Protein–RNA Interface Database (PRIDB). This dataset emphasizes structural interactions derived from computational modeling, providing a high-confidence set of predicted interactions based on known RNA-binding motifs and structural interfaces.

3.2. NPInter2

The NPInter2 dataset was produced through a combination of high-throughput experimental techniques, capturing a broad spectrum of RNA–protein interactions across different species. This dataset integrates data from multiple experimental sources, including RNA immunoprecipitation (RIP), cross-linking immunoprecipitation (CLIP), and yeast two-hybrid assays, to enhance interaction coverage and diversity. This multi-source approach provides a comprehensive dataset of ncRNA–protein interactions that is well-suited for evaluating prediction models across varied interaction types.

3.3. RPI7317

The RPI7317 dataset primarily relies on CLIP-seq (Cross-Linking and Immunoprecipitation coupled with high-throughput sequencing) to identify direct interactions between RNAs and proteins. By employing stringent filtering processes to reduce false positives, this dataset ensures a high-confidence set of RNA–protein interactions. CLIP-seq’s precise detection of interaction sites at nucleotide-level resolution allows for reliable data that is instrumental for training and evaluating interaction prediction models.

Each dataset offers distinct characteristics: RPI2241 is structurally inferred, NPInter2 integrates multiple experimental techniques, and RPI7317 focuses on high-confidence direct interactions via CLIP-seq. Together, these datasets provide a comprehensive basis for evaluating the predictive capabilities of our proposed NPI-WGNN model across different RNA types, protein sizes, and interaction verification methods.

4. Experimental Results

4.1. Datasets and Evaluation Strategies

As shown in Table 1, we collected three different ncRNA–protein datasets in order to assess the effectiveness of our technique. RPI2241 [16] is a dataset predicted by the computation of structural information, whereas NPInter2.0 [36] and RPI7317 [37] are datasets confirmed experimentally.

We built negative samples by randomly selecting non-interacting pairings of ncRNA and protein, as NPInter2.0 and RPI7317 only include positive samples. The negative samples were constructed to be equivalent in number to the positive samples.
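The balanced negative sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_negatives`, the fixed seed, and the rejection-sampling loop are assumptions, and the text only states that non-interacting pairs were chosen at random in equal number to the positives.

```python
import random

def sample_negatives(positive_pairs, seed=0):
    """Draw as many unobserved (ncRNA, protein) pairs as there are positives."""
    positives = set(positive_pairs)
    rnas = sorted({r for r, _ in positives})
    proteins = sorted({p for _, p in positives})
    # Guard against graphs so dense that a balanced negative set cannot exist.
    available = len(rnas) * len(proteins) - len(positives)
    if available < len(positives):
        raise ValueError("not enough unobserved pairs to balance the positives")
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.choice(rnas), rng.choice(proteins))
        if pair not in positives:  # keep only pairs with no known interaction
            negatives.add(pair)
    return sorted(negatives)

positives = [("r1", "p1"), ("r1", "p2"), ("r2", "p2"), ("r3", "p3")]
negatives = sample_negatives(positives)
```

Pairs sampled this way are only unobserved, not verified non-interactions, so some of them may be unlabelled positives.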

Because the GNN layers compute the features of each node from its neighboring nodes, the number of neighbors strongly affects how well node characteristics are captured. The prediction performance of our method is therefore directly influenced by the number of nodes within the local graph structures. As Table 1 shows, the average number of nodes in the one-hop enclosing subgraph is substantially lower for RPI2241 (9.1, versus 215.8 for NPInter2.0 and 376.4 for RPI7317), which suggests that prediction performance on this dataset is likely to be lower than on the others.
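To make the quantity reported in the last column of Table 1 concrete, the sketch below counts the nodes in the one-hop enclosing subgraph of each interacting pair and averages them. It is an illustration under assumptions: the helper names are hypothetical, the toy edge list is invented, and whether the reported average counts the two target nodes themselves is not stated in this section (here they are counted).

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets for a bipartite ncRNA-protein edge list."""
    adj = defaultdict(set)
    for rna, protein in edges:
        adj[rna].add(protein)
        adj[protein].add(rna)
    return adj

def one_hop_enclosing_nodes(adj, rna, protein):
    """The two target nodes plus every node adjacent to either of them."""
    return {rna, protein} | adj[rna] | adj[protein]

# Toy graph (invented for illustration only).
edges = [("r1", "p1"), ("r1", "p2"), ("r2", "p1"), ("r3", "p3")]
adj = build_adjacency(edges)
sizes = [len(one_hop_enclosing_nodes(adj, r, p)) for r, p in edges]
mean_size = sum(sizes) / len(sizes)
```

An isolated pair such as `("r3", "p3")` yields a subgraph containing only the two targets, which is the kind of neighborhood that gives a GNN little to aggregate.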

We used accuracy, sensitivity (recall), specificity, precision, the Matthews correlation coefficient (MCC), and the F1-score to assess the effectiveness of our technique. These measures are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{22}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{23}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{24}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{25}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{26}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{27}$$

Here, $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Five-fold cross-validation is used to optimize the model parameters. We implemented NPI-WGNN in Python using PyTorch 2.2.1, PyTorch Geometric 2.5.2, and NetworkX 3.3. GCNConv has 75 input channels, matching the size of the node feature vector, and 128 output channels. GATConv and SAGEConv each have 128 input and 128 output channels. All other parameters are the same as those in NPI-GNN.
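The six evaluation measures can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name and the example counts are assumptions, and the MCC uses the standard denominator $(TP + FP)(TP + FN)(TN + FP)(TN + FN)$.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision, MCC and F1 from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when a row or column of the confusion matrix is empty.
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "mcc": mcc,
        "f1": f1,
    }

# Invented counts, for illustration only.
m = classification_metrics(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, the MCC uses all four counts in both numerator and denominator, which is why it is often reported alongside F1 for binary interaction prediction.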

4.2. Prediction Performance of NPI-WGNN

We evaluated the prediction performance of NPI-WGNN on the NPInter2.0, RPI7317, and RPI2241 datasets using five-fold cross-validation. Table 2 compares our method with NPI-GNN [19], RPISeq-RF [16], EDLMFC [38], and IPMiner [17]. NPI-WGNN improves on the base NPI-GNN model in accuracy, sensitivity, specificity, precision, Matthews correlation coefficient (MCC), and F1-score on all three datasets, and on NPInter2.0 and RPI7317 it achieves the highest accuracy, sensitivity, MCC, and F1-score among the compared methods.
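The five-fold protocol can be sketched as a simple index split. This is a generic illustration, not the authors' pipeline: the function name, the seed, and the choice to split sample indices uniformly at random (rather than, for example, stratifying by label) are assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k roughly equal random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding assigns every k-th shuffled index to the same fold.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(n_samples=23, k=5))
```

Each sample appears in exactly one test fold, so the reported metrics can be averaged over the five held-out folds.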

As shown in Table 2, NPI-WGNN achieves the highest performance on NPInter2.0 and RPI7317, demonstrating its effectiveness across various evaluation metrics. However, the prediction performance on RPI2241 is comparatively lower, which may be attributed to several characteristics inherent to this dataset as mentioned in [19].

Unlike NPInter2 and RPI7317, which are built on experimentally verified interactions obtained through high-throughput techniques such as CLIP-seq and yeast two-hybrid assays, the RPI2241 dataset is derived from inferred structural information on protein–ncRNA complexes in the PRIDB database. This reliance on structural inference introduces specific limitations. First, inferred datasets often have lower diversity compared with datasets constructed from direct experimental interactions, as they represent interactions within a specific structural context. This potentially limits the generalizability of interactions, resulting in lower diversity within RPI2241 and impacting the ability of our model to capture the full spectrum of ncRNA–protein interactions.

Additionally, RPI2241 has a lower average node density in the enclosing subgraphs, which may contribute to decreased performance by limiting the local neighborhood information that NPI-WGNN can leverage during the learning process. This sparsity, combined with lower diversity, affects the F1-score and other metrics on RPI2241, as seen in Table 2. Notably, while NPI-WGNN generally outperforms other methods across datasets, its F1-score on RPI2241 is slightly lower than that of the IPMiner method, suggesting that IPMiner’s approach might be less affected by sparsity in this particular dataset.
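As a consistency check on the RPI2241 rows of Table 2, applying the F1 definition (Equation (27)) to the reported precision and sensitivity reproduces the reported F1-scores for NPI-WGNN and IPMiner, respectively:

```latex
F1_{\text{NPI-WGNN}} = \frac{2 \times 0.745 \times 0.586}{0.745 + 0.586} = \frac{0.87314}{1.331} \approx 0.656
\qquad
F1_{\text{IPMiner}} = \frac{2 \times 0.660 \times 0.659}{0.660 + 0.659} = \frac{0.86988}{1.319} \approx 0.660
```

The values agree with Table 2 to within rounding (65.6% and 65.9%). The small F1 gap arises because the higher precision of NPI-WGNN (74.5% versus 66%) is offset by its lower sensitivity (58.6% versus 65.9%).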

In summary, while NPI-WGNN demonstrates robust improvements over other methods on NPInter2.0 and RPI7317, the inherent limitations of RPI2241—particularly data sparsity and its structural inference basis—highlight the importance of dataset characteristics in ncRNA–protein interaction prediction. This underscores the potential for further improvements in model adaptation across datasets with varying levels of diversity and structural context.

We also plotted precision–recall and ROC curves to compare NPI-WGNN with NPI-GNN on NPInter2.0. As shown in Figure 4, the AUPR values of NPI-GNN and NPI-WGNN are 0.960 and 0.977, respectively. As shown in Figure 5, the corresponding AUROC values are 0.969 and 0.982.
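For readers unfamiliar with the two areas, the sketch below computes AUROC and AUPR from labels and predicted scores in plain Python. It is an illustration under assumptions: the function names and toy data are invented, AUROC is computed as the pairwise ranking probability, and AUPR is computed as step-wise average precision, which may differ from the interpolation used to produce Figures 4 and 5.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    tp = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank  # precision at each recalled positive
    return total / n_pos

# Invented toy predictions, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_area = auroc(labels, scores)
pr_area = average_precision(labels, scores)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch but too slow for the full datasets; a sort-based implementation would be used in practice.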

4.3. The Effect of Different GNN Layers on NPI-WGNN Performance

In this section, we report an ablation study of how each GNN layer affects our model's performance. Table 3 presents the results obtained when the GCN layer, the GAT layer, or both are replaced with GraphSAGE layers (the all-GraphSAGE configuration follows the NPI-GNN model), alongside the results for the full combination of GCN, GAT, and GraphSAGE layers.

The impact of GCN, GAT, and GraphSAGE layers on model performance reveals unique advantages when applied individually or in combination. Analyzing each layer separately, GraphSAGE excels in scalability due to its sampling-based aggregation, which suits large graphs but may overlook intricate local connections in dense areas, thereby affecting localized accuracy. GCN effectively captures global graph structure by leveraging full adjacency matrices, achieving high accuracy in strongly connected graphs; however, this approach can limit scalability for extensive datasets. GAT’s attention mechanism enables dynamic neighbor weighting, which enhances accuracy on heterogeneous graphs but increases computational demand. When combined, these layers complement each other, integrating GraphSAGE’s scalability, GCN’s global perspective, and GAT’s adaptive attention, collectively enhancing predictive accuracy and robustness across varying graph complexities. The empirical results confirm this complementary effect, underscoring the advantage of hybrid architectures in diverse graph-based applications.

4.4. Sequence Information Versus Network Topological Data

To further investigate the impact of additional node information, we compared the results obtained with NPI-GNN when k-mer frequencies are used as additional node attributes against the results obtained when centrality measures are used instead. The results in Figure 6 indicate that the NPI-GNN model depends more on the network structure than on sequence information. In other words, structural information about the network is the more useful source of additional node attributes for this model.
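As a reminder of what the sequence-based attributes look like, the sketch below computes a normalized k-mer frequency vector for an RNA sequence. It is a generic illustration, not the encoding used by NPI-GNN: the function name, the choice of k = 3, the `ACGU` alphabet, and the normalization by the number of valid windows are all assumptions.

```python
from itertools import product

def kmer_frequencies(sequence, k=3, alphabet="ACGU"):
    """Frequency of every possible k-mer, in lexicographic alphabet order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing unknown symbols
            counts[kmer] += 1
            windows += 1
    return [counts[km] / windows if windows else 0.0 for km in kmers]

# Invented toy sequence, for illustration only.
vec = kmer_frequencies("AUGGCUAUGG", k=3)
```

Such a vector depends only on the sequence of a single node, whereas centrality measures depend on the position of the node in the interaction network, which is the contrast examined in Figure 6.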

Because this work extracts extensive structural and topological information, such as network edge weights and centrality measures, the model's reliance on topological data supports our claim that the NPI-WGNN approach is effective.

5. Conclusions

This study presents NPI-WGNN, an innovative approach for predicting ncRNA–protein interactions using weighted graph neural networks with a focus on network topological features. Our method demonstrates significant improvements over existing techniques, particularly the base NPI-GNN model, across multiple datasets. The key contributions and findings of this work include:

Enhancing graph representation: by incorporating edge weights derived from a bipartite high-order common neighbor (HOCN) similarity measure, we created a more informative graph structure. This weighted approach allows the model to better capture the strength of relationships between nodes, leading to more accurate predictions.
Improving node embeddings: weighted node2vec, which takes edge weights into account during the random walk process, enables a more nuanced exploration of the graph structure. This results in richer and more informative node representations compared with traditional node2vec.
Incorporating centrality measures: by including various centrality measures (degree, closeness, betweenness, etc.) as node attributes, we provide the model with additional topological insights. This enhances the model's ability to understand the importance and roles of different nodes within the network.
Advancing GNN architecture: we combine multiple graph convolutional layers (GCNConv, GATConv, and SAGEConv) with top-k pooling and recurrent neural network layers. This architecture allows for effective feature extraction, aggregation, and processing of graph data, leading to superior predictive performance.
Extensive experiments: experiments on different benchmark datasets demonstrate the superiority of NPI-WGNN over the base NPI-GNN model. Our approach achieves higher accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient (MCC) across different datasets.
Highlighting the impact of data sparsity: the reduced performance on the RPI2241 dataset shows how data sparsity and the average node density in enclosing subgraphs affect model performance.

Furthermore, our experiments reinforce the significance of network topological information over sequence-based data in predicting ncRNA–protein interactions. This finding supports our focus on enhancing and leveraging topological features in the NPI-WGNN model.

In conclusion, NPI-WGNN represents a significant advancement in computational methods for predicting ncRNA–protein interactions. By effectively leveraging graph structure, topological features, and advanced neural network architectures, our approach provides more accurate and reliable predictions. This can greatly aid researchers in understanding the complex regulatory functions of ncRNAs and their roles in various biological processes and diseases.

While NPI-WGNN demonstrates promising results, there are several avenues for future research and improvement:

Handling data sparsity: developing techniques to improve performance on sparse datasets like RPI2241 is crucial. This could involve exploring methods for data augmentation or developing models that can better handle limited node neighborhoods.
Incorporating additional biological information: while our model focuses on topological features, integrating other types of biological data (e.g., evolutionary conservation, structural information) could potentially enhance predictive power further.
Explainable AI techniques: developing methods to interpret the decisions made by the NPI-WGNN model could provide valuable insights into the biological mechanisms underlying ncRNA–protein interactions.
Large-scale application: applying and validating the model on larger, more comprehensive datasets of ncRNA–protein interactions as they become available.
Transfer learning: investigating the potential of transfer learning approaches to improve performance on smaller or domain-specific datasets by leveraging knowledge from larger, more general datasets.
Dynamic interaction prediction: extending the model to predict dynamic changes in ncRNA–protein interactions under different cellular conditions or in response to various stimuli.
Integration with other omics data: exploring ways to integrate NPI-WGNN predictions with other omics data (e.g., transcriptomics, proteomics) to provide a more comprehensive understanding of cellular regulatory networks.
Optimization and scalability: further optimizing the model architecture and implementation to improve computational efficiency and scalability for very large datasets.
Experimental validation: collaborating with experimental biologists to validate novel predictions made by NPI-WGNN and refine the model based on new experimental data.
Application to drug discovery: investigating the potential of NPI-WGNN in identifying novel therapeutic targets or predicting drug-target interactions involving ncRNAs.

By pursuing these future directions, we can continue to improve our understanding of ncRNA–protein interactions and their roles in cellular processes, ultimately contributing to advancements in fields such as systems biology, personalized medicine, and drug discovery.

Author Contributions

Conceptualization, F.K.; methodology, F.K. and Z.N.; software, F.K. and Z.N.; validation, F.K. and Z.N.; formal analysis, F.K. and Z.N.; investigation, M.N.; writing—original draft preparation, F.K.; writing—review and editing, F.K., Z.N. and S.S.; visualization, F.K.; supervision, M.N. and S.S.; project administration, M.N. and S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We would like to express our sincere gratitude to Aso Mafakheri for his invaluable help and support throughout the development of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Knowling, S.; Morris, K.V. Non-coding RNA and antisense RNA. Nature's trash or treasure? Biochimie 2011, 93, 1922–1927.
2. Henras, A.K.; Dez, C.; Henry, Y. RNA structure and function in C/D and H/ACA s(no)RNPs. Curr. Opin. Struct. Biol. 2004, 14, 335–343.
3. Kung, J.T.; Colognori, D.; Lee, J.T. Long noncoding RNAs: Past, present, and future. Genetics 2013, 193, 651–669.
4. Okamura, K.; Lai, E.C. Endogenous small interfering RNAs in animals. Nat. Rev. Mol. Cell Biol. 2008, 9, 673–678.
5. Hogan, D.J.; Riordan, D.P.; Gerber, A.P.; Herschlag, D.; Brown, P.O. Diverse RNA-binding proteins interact with functionally related sets of RNAs, suggesting an extensive regulatory system. PLoS Biol. 2008, 6, e255.
6. Kang, Q.; Meng, J.; Luan, Y. RNAI-FRID: Novel feature representation method with information enhancement and dimension reduction for RNA–RNA interaction. Brief. Bioinform. 2022, 23, bbac107.
7. Kang, Q.; Meng, J.; Su, C.; Luan, Y. Mining plant endogenous target mimics from miRNA-lncRNA interactions based on dual-path parallel ensemble pruning method. Brief. Bioinform. 2022, 23, bbab440.
8. Lim, G.H.; Zhu, S.; Zhang, K.; Hoey, T.; Deragon, J.M.; Kachroo, A.; Kachroo, P. The analogous and opposing roles of double-stranded RNA-binding proteins in bacterial resistance. J. Exp. Bot. 2019, 70, 1627–1638.
9. Yuan, L.; Huang, D.S. A Network-guided Association Mapping Approach from DNA Methylation to Disease. Sci. Rep. 2019, 9, 5601.
10. Hafner, M.; Landthaler, M.; Burger, L.; Khorshid, M.; Hausser, J.; Berninger, P.; Rothballer, A.; Ascano, M.; Jungkamp, A.C.; Munschauer, M.; et al. Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell 2010, 141, 129–141.
11. Ray, D.; Kazan, H.; Chan, E.T.; Castillo, L.P.; Chaudhry, S.; Talukder, S.; Blencowe, B.J.; Morris, Q.; Hughes, T.R. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nat. Biotechnol. 2009, 27, 667–670.
12. Licatalosi, D.D.; Mele, A.; Fak, J.J.; Ule, J.; Kayikci, M.; Chi, S.W.; Clark, T.A.; Schweitzer, A.C.; Blume, J.E.; Wang, X.; et al. HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 2008, 456, 464–469.
13. Li, A.; Ge, M.; Zhang, Y.; Peng, C.; Wang, M. Predicting long noncoding RNA and protein interactions using heterogeneous network model. BioMed Res. Int. 2015, 2015, 671950.
14. Xie, G.; Wu, C.; Sun, Y.; Fan, Z.; Liu, J. LPI-IBNRA: Long non-coding RNA-protein interaction prediction based on improved bipartite network recommender algorithm. Front. Genet. 2019, 10, 343.
15. Wang, J.; Zhao, Y.; Huang, X.; Shi, Y.; Tan, J. Recent Advances in Predicting ncRNA-Protein Interactions Based on Machine Learning. Curr. Chin. Sci. 2021, 1, 513–522.
16. Muppirala, U.K.; Honavar, V.G.; Dobbs, D. Predicting RNA-protein interactions using only sequence information. BMC Bioinform. 2011, 12, 489.
17. Pan, X.; Fan, Y.X.; Yan, J.; Shen, H.B. IPMiner: Hidden ncRNA-protein interaction sequential pattern mining with stacked autoencoder for accurate computational prediction. BMC Genom. 2016, 17, 582.
18. Wang, L.; Yan, X.; Liu, M.L.; Song, K.J.; Sun, X.F.; Pan, W.W. Prediction of RNA-protein interactions by combining deep convolutional neural network with feature selection ensemble method. J. Theor. Biol. 2019, 461, 230–238.
19. Shen, Z.A.; Luo, T.; Zhou, Y.K.; Yu, H.; Du, P.F. NPI-GNN: Predicting ncRNA-protein interactions with deep graph neural networks. Brief. Bioinform. 2021, 22, bbab051.
20. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24.
21. Khoushehgir, F.; Sulaimany, S. Negative link prediction to reduce dropout in Massive Open Online Courses. Educ. Inf. Technol. 2023, 28, 10385–10404.
22. Wang, R.; Liu, G.; Wang, C. Identifying protein complexes based on an edge weight algorithm and core-attachment structure. BMC Bioinform. 2019, 20, 471.
23. Jaccard, P. The distribution of the flora in the alpine zone. New Phytol. 1912, 11, 37–50.
24. Zhang, M.; Chen, Y. Link Prediction Based on Graph Neural Networks. arXiv 2018, arXiv:1802.09691.
25. Katz, L. A new status index derived from sociometric analysis. Psychometrika 1953, 18, 39–43.
26. Brin, S.; Page, L. Reprint of: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. 2012, 56, 3825–3833.
27. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
28. Saxena, A.; Iyengar, S. Centrality measures in complex networks: A survey. arXiv 2020, arXiv:2011.07190.
29. Borgatti, S.P.; Halgin, D.S. Analyzing Affiliation Networks. In The SAGE Handbook of Social Network Analysis; SAGE Publications Ltd.: London, UK, 2014.
30. Yu, Y.; Zhou, B.; Chen, L.; Gao, T.; Liu, J. Identifying important nodes in complex networks based on node propagation entropy. Entropy 2022, 24, 275.
31. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
32. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
33. Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. arXiv 2017, arXiv:1706.02216.
34. Jain, L.C.; Medsker, L.R. Recurrent Neural Networks: Design and Applications; CRC Press: Boca Raton, FL, USA, 1999.
35. Gao, H.; Ji, S. Graph U-Nets. arXiv 2019, arXiv:1905.05178.
36. Yuan, J.; Wu, W.; Xie, C.; Zhao, G.; Zhao, Y.; Chen, R. NPInter v2.0: An updated database of ncRNA interactions. Nucleic Acids Res. 2014, 42, D104–D108.
37. Fan, X.N.; Zhang, S.W. LPI-BLS: Predicting LncRNA–protein Interactions with a Broad Learning System-Based Stacked Ensemble Classifier. Neurocomputing 2019, 370, 88–93.
38. Wang, J.; Zhao, Y.; Gong, W.; Liu, Y.; Wang, M.; Huang, X.; Tan, J. EDLMFC: An ensemble deep learning framework with multi-scale features combination for ncRNA–protein interaction prediction. BMC Bioinform. 2021, 22, 133.

Figure 1. Different types of ncRNA–protein extracted features.

Figure 2. Flowchart of NPI-WGNN. (a) Extracting a weighted bipartite graph of ncRNA–protein interactions and then extracting 1-hop enclosing subgraphs for every protein and ncRNA nodes. (b) Extracting node features containing structural label, weighted node2vec, and centrality measures for every ncRNA and protein nodes. (c) Every enclosing subgraph and node information matrix is given as input to the network, which contains three graph convolutional-based modules, three global pooling modules, an additive module, two fully connected layers, and a two-layer recurrent neural network.

Figure 3. One-hop enclosing subgraph. X and Y are the target nodes, and 0, 1, 2, 3, 4, 8, 9, 10, and 13 are the extracted one-hop neighboring nodes. 4 and 13 are common nodes.

Figure 4. The precision–recall curves of NPI-GNN and NPI-WGNN methods.

Figure 5. The ROC curves of NPI-GNN and NPI-WGNN methods.

Figure 6. Performance comparison of NPI-GNN on NPInter2.0, with centrality measures vs. sequence information (k-mer frequencies) as the node attributes.

Table 1. Dataset details.

Datasets	Links	RNAs	Proteins	$N_{es}$
NPInter2.0	10,412	4636	449	215.8
RPI7317	7317	1874	118	376.4
RPI2241	2241	838	2040	9.1

Note: $N_{es}$ is the average number of nodes in the 1-hop enclosing subgraph.

Table 2. Comparing the effectiveness of different techniques on NPInter2.0, RPI7317, and RPI2241 by five-fold cross-validation. Note that the performance data for the EDLMFC approach was only available for the NPInter2.0 dataset.

Datasets	Methods	Acc	Sensitivity	Specificity	Precision	MCC	F1
NPInter2.0	RPISeq-RF	94.4%	94%	94.9%	94%	0.889	94%
	IPMiner	95.2%	94.6%	95.9%	94.5%	0.904	94.5%
	NPI-GNN	93.3%	95.6%	91.1%	91.5%	0.868	93.5%
	EDLMFC	89.7%	91.7%	87.7%	88.2%	0.795	89.9%
	NPI-WGNN	96.1%	97.5%	94.8%	94.7%	0.921	96%
RPI7317	RPISeq-RF	91.2%	91.1%	91.5%	91.3%	0.825	91.1%
	IPMiner	91.3%	90.2%	92.4%	92.2%	0.827	91.2%
	NPI-GNN	91.5%	92.7%	90.7%	90.7%	0.830	91.7%
	NPI-WGNN	92.5%	93.1%	91.8%	92.2%	0.846	92.6%
RPI2241	RPISeq-RF	64.6%	65.2%	63%	66.3%	0.293	65.7%
	IPMiner	66%	65.9%	66%	66%	0.320	65.9%
	NPI-GNN	62.6%	49.8%	74.8%	67.2%	0.270	57.2%
	NPI-WGNN	69.3%	58.6%	80.2%	74.5%	0.394	65.6%

Table 3. The effect of different GNN layers on NPI-WGNN performance.

Datasets	Methods	Acc	Sensitivity	Specificity	Precision	MCC	F1
NPInter2.0	SAGE+SAGE+SAGE	95.1%	96.4%	93.8%	94%	0.902	95.2%
	SAGE+GAT+SAGE	95.2%	95.6%	94.7%	94.6%	0.904	95.1%
	GCN+SAGE+SAGE	94.3%	95.1%	93.5%	93.6%	0.875	94.3%
	GCN+GAT+SAGE	96.1%	97.5%	94.8%	94.7%	0.921	96%
RPI7317	SAGE+SAGE+SAGE	92.3%	93%	91.5%	91.9%	0.844	92.4%
	SAGE+GAT+SAGE	92.2%	92%	91.7%	92.1%	0.843	92%
	GCN+SAGE+SAGE	92.1%	92.4%	91.5%	91.6%	0.842	91.9%
	GCN+GAT+SAGE	92.5%	93.1%	91.8%	92.2%	0.846	92.6%
RPI2241	SAGE+SAGE+SAGE	66.1%	50.3%	75.2%	68.3%	0.325	57.9%
	SAGE+GAT+SAGE	66.5%	57.9%	73.8%	69.3%	0.334	63.1%
	GCN+SAGE+SAGE	69.1%	56.7%	80%	74.2%	0.391	64.2%
	GCN+GAT+SAGE	69.3%	58.6%	80.2%	74.5%	0.394	65.6%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
