Article

NodeVector: A Novel Network Node Vectorization with Graph Analysis and Deep Learning

Computer Engineering Department, Bursa Technical University, 16310 Yildirim, Turkey
Appl. Sci. 2024, 14(2), 775; https://doi.org/10.3390/app14020775
Submission received: 22 November 2023 / Revised: 10 January 2024 / Accepted: 11 January 2024 / Published: 16 January 2024
(This article belongs to the Special Issue Artificial Neural Network Applications in Pattern Recognition)

Abstract

Network node embedding captures structural and relational information of nodes in the network and allows us to use machine learning algorithms for various prediction tasks on network data that have an inherently complex and disordered structure. Network node embedding should preserve as much information as possible about important network properties where information is stored, such as network structure and node properties, while representing nodes as numerical vectors in a lower-dimensional space than the original high-dimensional space. Strong node embedding algorithms are therefore a powerful tool for machine learning, providing effective and efficient node representations. Recent research in representation learning has led to significant advances in automating features through unsupervised learning, inspired by advances in natural language processing. Here, we seek to improve the representation quality of node embeddings with a new node vectorization technique that uses network analysis to overcome network-based information loss. In this study, we introduce the NodeVector algorithm, which combines network analysis and neural networks to transfer information from the target network to node embeddings. As a proof of concept, our experiments on different categories of network datasets showed that our method achieves better results than its competitors on target networks. This is the first study to produce node representations by unsupervised learning using the combination of network analysis and neural networks while taking the network data structure into account. Based on the experimental results, the use of network analysis, complex initial node representations, balanced negative sampling, and neural networks has a positive effect on the representation quality of network node embeddings.

1. Introduction

Network analysis is a powerful methodology used to study and understand the intricate relationships and connections between entities within a system [1]. It involves analyzing the structure, dynamics, and characteristics of networks, such as social networks, computer networks, or biological networks, to uncover valuable insights [2]. By examining nodes (individual entities) and edges (links or relationships), network analysis allows us to discover patterns, identify key influencers, detect communities, and measure overall network resilience and efficiency [3]. It enables us to visualize and quantify complex interactions, reveal hidden relationships, and predict the behavior or spread of information, diseases, or trends [4]. Network analysis has applications in various fields, including sociology, business, epidemiology, and information technology, providing a versatile tool to comprehend the interconnected nature of our world.
Features refer to the distinctive characteristics or attributes that define and differentiate something from others. In various domains, features play a crucial role in describing and understanding objects, data, or systems. In computer science and machine learning, features are used to represent data points or objects in a way that facilitates analysis and pattern recognition [5]. They can be numerical, categorical, or textual, and are selected or designed based on their relevance and ability to capture meaningful information. Effective features capture relevant patterns, trends, or properties that help in solving specific tasks, such as classification, clustering, or regression [6]. Whether it is identifying important variables in a dataset, extracting key characteristics from images or text, or describing the distinguishing attributes of a product or service, features are essential for making sense of complex information and enabling effective decision-making.
Feature engineering is a critical process in machine learning and data analysis, involving the creation, transformation, and selection of input features to improve the performance and effectiveness of predictive models. It aims to extract relevant information from raw data and represent it in a form that facilitates the learning process and enhances the model’s ability to capture patterns and make accurate predictions [7]. Feature engineering involves several techniques, including feature extraction, where new features are derived from existing ones, feature transformation, which involves applying mathematical or statistical operations to the features, and feature selection, which involves identifying the most informative and discriminative features for the task at hand. By carefully crafting and refining features, practitioners can reduce noise, highlight important patterns, handle missing data, and ensure that the model has access to the most relevant and meaningful information [8]. Effective feature engineering requires a deep understanding of the problem domain, the available data, and the underlying relationships, and it can significantly impact the performance and interpretability of machine learning models.
Unsupervised learning is a branch of machine learning that deals with the analysis of data without explicit labels or target variables. Unlike supervised learning, where the model is trained on labeled data to make predictions or classifications, unsupervised learning focuses on finding patterns, structures, and relationships within the data themselves [9]. The primary goal of unsupervised learning is to uncover hidden or inherent structures and gain insights into the data without prior knowledge or guidance. Unsupervised learning is a powerful tool for exploratory data analysis, data preprocessing, and gaining a deeper understanding of complex datasets, even when labeled data are scarce or unavailable [10].
Studies on the vectorization of nodes in network datasets are inspired by natural language processing (NLP) techniques. To use NLP techniques, sentences are derived from the network data structure; the created sentences are then suitable for natural language processing algorithms. A sentence inherently carries information: its words appear in a fixed order, they can form subgroups according to the meaning of the sentence, the position of each word is determined by its meaning, and the relative placement and distribution of the words together constitute the meaning of the sentence. Due to this natural structure, the information a sentence contains is valuable, and it should be used without loss. For this reason, natural language processing algorithms are based on sentence processing.
Sentences can be converted into graph data so that the words in the sentence represent nodes and the interactions between words represent edges. During this transformation, some of the information contained in the sentence is lost. For this reason, the transformation is one-way: it is not possible to recreate exactly the original sentences from the graph data. When sentences are derived from graph data, the probability of reproducing the initial sentences is very low, and a large number of meaningless and incorrect sentences are produced instead. A simple derivation scenario is shown in Figure 1.
In this paper, we argue that developing a node representation method by deriving sentences from a dataset modeled with a graph causes the information contained in the graph data to be lost. We suggest that the use of graph analysis instead of deriving sentences from graph data produces more useful results. Our motivation is that there is no work in the literature that produces node representations through unsupervised learning using the combination of network analysis and neural networks while taking the network data structure into account; therefore, our proposed method differs from all similar methods in the literature. Our contributions are as follows: in order to test our hypothesis, we develop a node representation algorithm based on network analysis, examine the effects of the techniques that influence the algorithm and the results, and compare the performance of the algorithm on different graph datasets.
The rest of the paper is organized as follows. The next section (Section 2) provides an overview of the literature on existing state-of-the-art network node vectorization algorithms. Then, we describe networks (Section 3.1), the skip-gram model (Section 3.2), negative sampling (Section 3.3), and network analysis (Section 3.4). We then discuss our node representation method and the node similarity detection algorithm (Section 3.5) that performs network enrichment with the help of network analysis. We then share the pseudocode and discuss our NodeVector algorithm (Section 3.6) in detail. The last section offers details about the datasets, and we discuss the results (Section 4 and Section 5).

2. Background

Studies in network node embedding represent a diverse range of techniques and approaches for learning meaningful representations of nodes in networks. Perozzi et al. [11] develop an unsupervised method, DeepWalk, for learning latent representations of vertices in a network. They draw inspiration from natural language processing and word embeddings to create meaningful representations for nodes in networks. The algorithm learns node representations by treating random walks in the network as sentences and applying skip-gram models, and it creates node representations through unsupervised learning. To learn feature representations for nodes in a network, Grover et al. [12] propose a framework, node2vec, that maps nodes to a low-dimensional space of features. By employing a combination of breadth-first and depth-first search strategies during random walks, this algorithm allows various neighborhoods around nodes to be transformed into paths (sentences) consisting of consecutive pairs of nodes. Then, the algorithm creates node representations through unsupervised learning using the generated sentences. For network embedding, LINE is an unsupervised technique developed by Tang et al. [13]. It focuses on preserving network proximities and structures. This algorithm has a sampling strategy and an objective function that takes into account the probabilities of node pairs being connected in both first-order and second-order proximity. Then, the algorithm generates node representations via unsupervised learning from the sampled sentences. To learn node representations in network data from structural identity, Ribeiro et al. [14] propose the struc2vec algorithm. This approach considers not just the local neighborhood, but also the structural identity of nodes in a network. This algorithm is inspired by natural language processing. It leverages random walks to capture structural information about nodes in a network, and it incorporates higher-order structural information by considering nodes with similar random walk patterns. The algorithm generates node representations via unsupervised learning. Developed by Kipf et al. [15], Graph Convolutional Networks (GCNs) generalize the concept of convolutional neural networks (CNNs) to handle irregular and non-Euclidean data structures such as graphs. GCNs operate by recursively aggregating information from neighboring nodes in a network, allowing each node to update its feature representation based on its local connectivity patterns. This algorithm can create node representations through supervised learning using node labels. Hamilton et al. [16] propose GraphSAGE, which learns embeddings for nodes by sampling and aggregating features from their neighbors. Instead of relying solely on the local neighborhood of nodes, GraphSAGE employs a sampling and aggregation framework: it samples and aggregates feature information from a node's neighbors. The algorithm can take node labels as input and produce new node embeddings with the supervised learning method.
Unsupervised techniques for network node embedding hold significant value within the field due to their capacity to discern meaningful node representations derived exclusively from the inherent topological structure of the network. This property makes them particularly suitable in situations characterized by scarce or completely absent labeled data. Unsupervised embedding also creates input for supervised learning techniques and enables those methods to produce more accurate results. All of these studies made significant contributions to the field, and they continue to be foundational for various applications in network analysis. Researchers interested in network embeddings should study the techniques and principles underlying this important sub-field of machine learning.
In this paper, we focus on unsupervised learning techniques. Generating sentences using various techniques from the dataset modeled with a network may lead to loss of information contained in the network. Unlike existing literature, our new algorithm generates node representation using network analysis rather than generating sentences from network data. To the best of our knowledge, this is the first study to combine network analysis with deep learning to produce a node representation with unsupervised learning.

3. Materials and Methods

We develop a node vector representation algorithm, NodeVector, to be used in solving problems modeled with networks. Given a network, our method finds vector representations of all nodes. We introduce the networks, the skip-gram model, negative sampling, and network analysis in Section 3.1, Section 3.2, Section 3.3, and Section 3.4, respectively. We discuss our node representation method and node similarity detection algorithm for network enrichment in Section 3.5. We then share the pseudocode and discuss our NodeVector algorithm in detail in Section 3.6.

3.1. Networks

Many problems in the literature can be modeled with networks and solved with the help of various network analysis techniques and learning algorithms. In this section, we provide an overview of the different categories of datasets used in the experiments, such as protein interaction networks, social networks, and citation networks.
Protein–protein interaction networks depict the interactions between proteins in a biological system. These networks reveal the connections and communication among proteins, shedding light on cellular processes [17]. By mapping experimental or predicted interactions, these networks provide insights into signaling pathways, protein complexes, and regulatory networks. They help identify key proteins and modules, aiding in disease research and drug target identification [18]. Protein–protein interaction networks are valuable resources for understanding cellular systems and developing therapeutic interventions.
Social networks are online platforms that revolutionize communication, connecting individuals, communities, and businesses. Users create profiles, connect with others, and share posts, photos, and messages [19]. These networks form virtual communities based on shared interests and facilitate interactions. They also serve as powerful tools for businesses to engage with a wide audience and promote their offerings [20]. Social networks have transformed how we connect and communicate in the digital age.
Citation networks capture the connections between academic papers through citations. They reveal patterns of influence, track knowledge dissemination, and aid in research evaluation [21]. By analyzing these networks, researchers gain insights into scholarly impact and the flow of ideas, fostering knowledge discovery and advancement. Citation networks are valuable tools in bibliometrics and information retrieval, facilitating the identification of influential papers and research trends [22].

3.2. Skip-Gram Model

The skip-gram model is a popular algorithm used in natural language processing (NLP) and word embedding. It is designed to learn high-quality distributed representations of words by predicting the context words given a target word [23]. The model aims to capture the semantic and syntactic relationships between words by training on large amounts of text data [24]. By considering the surrounding context words, the skip-gram model generates vector representations, known as word embeddings, that encode the meaning and relationships of words in a dense, continuous vector space. These embeddings can be used in various NLP tasks, such as language modeling, information retrieval, and sentiment analysis [25]. The skip-gram model, along with other word embedding techniques, has revolutionized NLP by enabling algorithms to understand and process natural language more effectively. The skip-gram model is designed to process sentences. Although there are various uses of the network data type in the literature, all of them are based on deriving sentences from the network data type. In this study, the skip-gram model is adopted, together with network analysis, to preserve network information without deriving sentences from the target network.
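To make the windowed prediction principle concrete, the following minimal Python sketch enumerates the (target, context) training pairs that skip-gram derives from a sentence; the toy sentence and the window size are illustrative assumptions, not part of the original study.

def skipgram_pairs(sentence, window=2):
    """Yield (target, context) pairs within a fixed-size window."""
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield (target, sentence[j])

sentence = ["networks", "store", "structural", "information"]
print(list(skipgram_pairs(sentence, window=1)))
# [('networks', 'store'), ('store', 'networks'), ('store', 'structural'), ...]

Each pair becomes one positive training example for the model; NodeVector keeps this pair-based training scheme but, as described in Section 3.5, obtains the pairs from network analysis instead of sentences.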

3.3. Negative Sampling

Negative sampling is a technique commonly used in training algorithms for word embeddings and recommendation systems [26]. It addresses the computational inefficiency of traditional methods that require calculating probabilities for all possible negative samples. Instead of considering all negative samples, negative sampling randomly selects a small subset of negative samples during each training iteration [27]. The idea is to create a balanced training set that includes both positive (observed) and negative (unobserved) samples. By focusing on a subset of negative samples, negative sampling makes the training process more efficient and scalable [25]. This technique allows the model to learn to differentiate between positive and negative samples and captures meaningful relationships and similarities in the embedding space [28]. Negative sampling has proven to be effective in training embedding models, improving training speed and the quality of learned representations.
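As a sketch of the idea, the following Python fragment draws a few negative (unobserved) contexts for a target item. The uniform draw and the variable names are simplifying assumptions of this example; word2vec-style implementations typically draw from a smoothed unigram distribution instead.

import random

def negative_samples(target, positives, vocabulary, k=5):
    """Draw up to k items never observed as contexts of `target`."""
    candidates = [w for w in vocabulary if w != target and w not in positives]
    return random.sample(candidates, min(k, len(candidates)))

vocabulary = ["a", "b", "c", "d", "e", "f"]
print(negative_samples("a", positives={"b"}, vocabulary=vocabulary, k=3))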

3.4. Network Analysis

Network analysis investigates the relationships and components of complex systems using nodes and edges. It examines network structure, dynamics, and properties to uncover insights and patterns [1]. Applications range from social and biological networks to transportation and computer networks [2]. By analyzing connections, interactions, and information flow, network analysis reveals system behavior, resilience, and efficiency. It identifies important nodes, communities, and pathways, aiding interventions, resource allocation, and predicting information spread [4]. This framework visualizes and interprets the interdependencies and emergent properties within networks.
Centrality measures rank the importance of nodes in a network based on different metrics. Degree centrality focuses on the number of connections, while betweenness centrality identifies nodes that act as bridges [29]. These measures help to understand network structure and identify influential nodes. Centrality measures are valuable for studying information flow, disease spread, and the diffusion processes in complex networks [3].
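The snippet below illustrates, with NetworkX, the two centrality measures named above on a small built-in toy graph; the graph choice is purely illustrative.

import networkx as nx

G = nx.karate_club_graph()                  # small toy social network
degree = nx.degree_centrality(G)            # importance by number of connections
betweenness = nx.betweenness_centrality(G)  # importance as a bridge on shortest paths

# The five most central nodes under each measure:
print(sorted(degree, key=degree.get, reverse=True)[:5])
print(sorted(betweenness, key=betweenness.get, reverse=True)[:5])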

3.5. Node Similarities

Networks exhibit uncertain topologies for various reasons [30]. This means that the connections and relationships in these networks are subject to false positives and false negatives. These false positive and false negative interactions can arise due to various general factors such as data collection limitations, noise in the data, privacy concerns, or network-specific factors such as the dynamic nature of social interaction networks or the inherent uncertainty in biological processes for biological networks. The occurrence of false positives and false negatives poses a significant challenge in network analysis, with a high prevalence in numerous networks such as social or biological networks. Research has shown that in protein interaction networks, the data often contain a significant number of false positive and false negative interactions. Specifically, it has been reported that the rates of false positive and false negative interactions can exceed half of the total network [31].
Network edge enrichment is a method for improving prediction accuracy by adding similarity-inferred edges to networks. This is achieved by first identifying pairs of nodes that are similar, and then adding an edge between these pairs if they do not already have one. The resulting enriched network is then used for the prediction task. Network edge enrichment has produced successful results in network analysis, for example in social network analysis [32] and PPI network analysis [33]. Example applications include identifying communities and predicting future interactions in social networks, and predicting protein functions in protein interaction networks. The results are typically more accurate than those obtained from the original network. Based on these literature results, we developed the hypothesis that network enrichment methods can increase the representation ability of network node embedding methods. To test and validate our hypothesis, we developed network enrichment methods that can be used with our new node encoding algorithm and examined their effects on the results. Following from our hypothesis, we consider that using network enrichment instead of a high percentage of negative samples affects the estimation results positively, and that the network enrichment methods we develop are preferable for reducing the negative sample rate.
To process network nodes by computational methods, such as measuring similarity between nodes or using machine learning algorithms, at least a simple numerical representation of the nodes is required. Since nodes are considered independent units, one-hot encoding is the most commonly used representation type in the literature. Unlike in the literature, we use an alternative encoding approach for representing network nodes, departing from the commonly used one-hot encoding method. While one-hot encoding is widely employed, it has limitations in capturing interaction information and can lead to a loss of important contextual details during subsequent operations. To address these limitations, in this study, the row belonging to the relevant node in the adjacency matrix is augmented and used as a multi-label one-hot encoding for the initial node representation.
For the mathematical representation of network nodes, we define the leveled neighborhood matrix representation. The level corresponds to the shortest-path distance between nodes. Under this definition, for a source node $n$, the $L_1$ level representation includes the nodes that can be reached with at most 1 edge, while the $L_2$ level representation includes the nodes that can be reached with at most 2 edges. Similarly, the $L_k$ representation contains the nodes reachable with at most $k$ edges. In the neighborhood matrix representation, unlike the original representation, a self-loop is added to each node. In this way, it is ensured that the node itself is included in the representation. This added edge also differentiates the representations of nodes $n_x$ and $n_y$ whose neighbors are all the same. Figure 2 summarizes the $L_1$, $L_2$, and $L_3$ level sample node representations. For given nodes $n_x$ and $n_y$ on target network $G$, we use the Matching Index [34] to measure how similar (in terms of neighbors) the representations of nodes $n_x$ and $n_y$ are. We formulate the matching index between the representations of nodes $n_x$ and $n_y$ as the ratio of the number of common neighbors to the number of all their neighbors. For a given level $k$, let us denote the sets of neighbors of nodes $n_x$ and $n_y$ in network $G$ with $n_x^n$ and $n_y^n$, respectively. We compute the representation similarity (RS) of these given nodes with the following Equation (1). Due to the symmetric structure of the measure, the computational complexity is $O\left(\frac{n(n+1)}{2}\right)$, where $n$ is the number of nodes. RS takes the value of 1 only when the target node is compared with itself, $RS_k(n_x, n_x)$, and is less than 1 in all other cases.

$$RS_k(n_x, n_y) = \frac{|n_x^n \cap n_y^n|}{|n_x^n \cup n_y^n|}. \quad (1)$$
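A minimal Python sketch of the leveled neighborhood representation and Equation (1) follows, assuming a NetworkX graph; the helper names are illustrative, and the self-loop is realized by including the node itself (distance 0) in its own neighbor set.

import networkx as nx

def level_neighbors(G, node, k):
    """Nodes reachable from `node` with at most k edges, including the node
    itself (the self-loop added in the representation)."""
    return set(nx.single_source_shortest_path_length(G, node, cutoff=k))

def rs(G, x, y, k):
    """Representation similarity (matching index): |intersection| / |union|."""
    nx_k, ny_k = level_neighbors(G, x, k), level_neighbors(G, y, k)
    return len(nx_k & ny_k) / len(nx_k | ny_k)

G = nx.karate_club_graph()
print(rs(G, 0, 1, k=1))  # < 1 for distinct nodes; rs(G, x, x, k) is always 1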
Nodes in networks tend to form clusters to perform various functions. With the RS similarity measure we defined, we measure the similarity of the clusters formed by the neighbors of the nodes located at a predefined distance, such as $L_1$, $L_2$, and $L_3$, where $k$ is 1, 2, and 3, respectively. With the help of this representation and similarity measurement, we define the clusters formed by nodes; we move from a narrow node definition (there is interaction between nodes, there is no interaction between nodes, the nodes are located on the same path, etc.) to a broader node definition. We perform network enrichment by using the calculated similarities to find possible edges that have not yet been detected or that are likely to occur. Networks tend to exhibit the small-world property; that is, interaction between two nodes can be established with a minimal number of edges. Due to this feature of networks, a maximum of 3 is used for the edge-count parameter $k$. As a first step in identifying candidate interactions, we calculate the $L_1$, $L_2$, and $L_3$ similarities between all pairs of nodes. We then sort the similarities at each level, with the most similar pair of nodes first. For a given network $G$, let us denote the candidate edge ratio with $\epsilon$, the number of nodes with $n_n$, and the candidate edge count by adjacent similarity with $cec_{adjsim}$. We calculate the number of new edges to be added from each level to the given network $G$ with the following Equation (2):

$$cec_{adjsim} = n_n \cdot \epsilon / 100. \quad (2)$$
Finally, we select the first $cec_{adjsim}$ pairs of nodes ordered by similarity at the $L_1$, $L_2$, and $L_3$ levels and add them as edges to target network $G$. Algorithm 1 presents the pseudocode of our network enrichment method by adjacent similarity.
Algorithm 1 Candidate Edge Calculation.

function Get_Edges(G, Levels, Max_Level, cec_adjsim)
    /* Step I: adjacent similarity */
    for k = 1 to Max_Level do
        /* similarities of levels */
        GEdgeSims_Lk ← RS_k(Levels[k])
        GEdgeSims_Lk ← sort(GEdgeSims_Lk)
        GCandEdges.append(GEdgeSims_Lk[0 : cec_adjsim])    /* add top-k interactions as candidates */
    end for
    /* Step II: centrality similarity */
    for node n in G do
        cv[n, :] ← all centralities of n
    end for
    for idx = 1 to NumberOfCentrality do
        cv[:, idx] ← Normalize(cv[:, idx])    /* normalize by each centrality */
    end for
    /* pairwise vector distance */
    for i = 1 to len(G) do
        for j = 1 to len(G) do
            L1D[i, j] ← ||cv[i, :] − cv[j, :]||_1
            if L1D[i, j] > 95 then
                GFeatureSims.append([i, j, L1D[i, j]])    /* use only > 95 similar */
            end if
        end for
    end for
    GFeatureSims ← sort(GFeatureSims) by L1D    /* get sorted similarities */
    GCandFEdges ← GFeatureSims[0 : cec_adjsim]    /* add top-k interactions as candidates */
    CandidateEdges.append(GCandEdges)
    CandidateEdges.append(GCandFEdges)
    return CandidateEdges
end function
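For concreteness, the following Python sketch reproduces Step I of Algorithm 1 (candidate edges by adjacent similarity) on a toy NetworkX graph. The rs helper repeats the matching index from the sketch above; $\epsilon$ is set larger than in the experiments only because the toy graph is small, and repeated candidate pairs are deliberately kept, as Section 3.6 explains.

import itertools
import networkx as nx

def rs(G, x, y, k):
    nx_k = set(nx.single_source_shortest_path_length(G, x, cutoff=k))
    ny_k = set(nx.single_source_shortest_path_length(G, y, cutoff=k))
    return len(nx_k & ny_k) / len(nx_k | ny_k)

def candidate_edges_by_adjacent_similarity(G, max_level=3, epsilon=2):
    cec = int(G.number_of_nodes() * epsilon / 100)  # Equation (2)
    candidates = []
    for k in range(1, max_level + 1):
        sims = sorted(((rs(G, x, y, k), x, y)
                       for x, y in itertools.combinations(G.nodes, 2)),
                      reverse=True)                  # most similar pair first
        candidates += [(x, y) for _, x, y in sims[:cec]]  # top cec per level
    return candidates

G = nx.karate_club_graph()
print(candidate_edges_by_adjacent_similarity(G, max_level=2, epsilon=10))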
As another approach to network enrichment, network node centrality analysis can be used. Nodes in the network can be evaluated as important or unimportant for the network according to different centrality measures. Likewise, some nodes can be identified as similar or dissimilar according to different centrality measures. Although network centrality measurements are known to be useful in solving various problems, there is no standard, because the problems to which each measurement can be applied differ. When different centrality measures are calculated for each node in the network, a unique pattern is formed for each node. We develop our second network enrichment algorithm, to be used in our vectorization method, by measuring the similarity of the patterns formed for each node. As a first step in identifying candidate interactions, we calculate node centrality measures for all nodes in a given network $G$. We then normalize each measurement to the 0–1 range. For centrality measurements, we use degree centrality, closeness centrality, information centrality, current flow closeness centrality, betweenness centrality, load centrality, harmonic centrality, current flow betweenness centrality, second-order centrality, and PageRank centrality. The computational complexity of all network centrality metrics is at most $O(n^2)$ ($n$ is the number of nodes). For each node, we combine the calculated measurements into a vector to form the centrality vector of the corresponding node. To identify pairs of nodes with maximally similar patterns, we first calculate the $L_1$ norm distance ($L1D$) between all pairs of nodes. For given nodes $n_x$ and $n_y$, let us denote the centrality measure vectors with $cv_x$ and $cv_y$, respectively. We calculate the $L_1$ norm distance for nodes $n_x$ and $n_y$ with the following Equation (3):
$$L1D(x, y) = ||cv_x - cv_y||_1. \quad (3)$$
We then apply a 0–1 normalization to the vector containing the distances between each node and the other nodes, and for filtering we discard the pairs below the 95% similarity threshold. Then, we sort the calculated $L_1$ norm distances, with the most similar pair of nodes first. Finally, we select the first $cec_{adjsim}$ pairs of nodes ordered by the $L_1$ norm distance and add them as edges to the target network $G$. Algorithm 1 presents the pseudocode of our network enrichment method by node centrality similarity.
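The sketch below illustrates the centrality-pattern similarity of Equation (3) in Python. To keep it short, only three of the ten centrality measures listed above are used, and the toy graph is an assumption of the example.

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
measures = [nx.degree_centrality, nx.closeness_centrality,
            nx.betweenness_centrality]
nodes = list(G.nodes)
columns = [m(G) for m in measures]                 # one dict per centrality
cv = np.array([[col[n] for col in columns] for n in nodes])

# 0-1 normalization per centrality measure (per column)
cv = (cv - cv.min(axis=0)) / (cv.max(axis=0) - cv.min(axis=0))

# Pairwise L1 norm distances between centrality vectors, Equation (3)
l1d = np.abs(cv[:, None, :] - cv[None, :, :]).sum(axis=2)
print(l1d[0, 1])  # distance between the centrality patterns of nodes 0 and 1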
Negative sampling is an important technique for applying unsupervised machine learning to graph-modeled problems. On the other hand, we think that, beyond the equal numbers of positive and negative samples needed to balance the dataset, learning algorithms tend to learn the randomly generated negatives instead of the information in the network when more negative samples are used. In this study, we present our hypothesis that using a balanced positive–negative sample distribution on the enriched network improves the quality of node representation more than a high negative sampling rate does. Our method of generating node representations with candidate edge detection, using the information already present in the network, may lead to new network analysis development studies in the future.

3.6. NodeVector

Here, we formally define our novel network node vectorization algorithm, NodeVector. Let us denote the given network with $G = (V, E)$, where $V$ and $E$ denote the set of nodes and the set of interactions among those nodes, respectively. In this paper, we focus on undirected networks and consider all interactions in networks such as Protein–Protein Interaction (PPI), BlogCatalog, etc., as undirected.
Let us denote the adjacency matrix of $G$ with $A_G$. Each row and each column in $A_G$ corresponds to a specific node of network $G$. Our approach operates through two distinct phases, with the initial phase dedicated to network enrichment using node similarities. The second phase adapts deep neural networks to generate novel vector representations of each node. Algorithm 2 presents the pseudocode of our method. Our algorithm requires two input parameters: an undirected network, $G = (V, E)$, and the candidate edge ratio $\epsilon$. Our algorithm generates a vector representation of each node in network $G$.
In the first step, the network enrichment step, we use the $L_k$ level adjacent similarity and the network centrality pattern similarities (see Section 3.5) to identify possible candidate interactions of the input network $G$. The purpose of Phase I is essentially to identify undetected interactions and add them to the target network $G$. For each node in target network $G$, we calculate the multi-label neighborhood representations $L_k$ for different neighborhood level parameters $k$. We then calculate the $RS$ similarity measure for all pairs of nodes and for each level. We add the top $cec_{adjsim}$ edges with the highest similarity at each level to the target network $G$. Similarly, for each node in target network $G$, we calculate node centrality measures and apply 0–1 normalization. We then calculate the $L1D$ similarity measure for all pairs of nodes. Let us denote the node centrality similarity with $cs$. We perform 0–1 normalization and similarity filtering on $cs$ for each node in target network $G$. Finally, we add the top $cec_{adjsim}$ most similar pairs as edges to network $G$.
Let us denote the network formed after the new edges are added for network enrichment with $G'$. The new network $G'$ can contain repeating edges. Let us denote the edge between node $n_x$ and node $n_y$ in target network $G$ with $e_{x,y}$. In the adjacent similarity or centrality similarity calculations, we allow the repeated addition of edge $e_{x,y}$ if nodes $n_x$ and $n_y$ are determined to be a candidate edge with high similarity. In this way, we obtain more representations of such edges in the second phase of our algorithm, following our hypothesis that such repetitive edges are true with higher probability and contain more information about the network.
In the second phase of our algorithm, we adopt the skip-gram approach and a deep neural network for node vectorization with unsupervised learning. This approach, which has achieved successful results in natural language processing studies in the literature, is based on the principle of predicting the context words surrounding a specific target word within a fixed-size window. The studies in the literature that apply this technique to graph datasets have focused on generating various sentences from the graph data. In this study, unlike the existing literature, we do not produce sentences.
Algorithm 2 NodeVector Calculation.
Require: Network G = (V, E), ϵ
Ensure: Node embeddings of target network

/* Initial node representations */
L1, L2, L3 ← calculate L1, L2, L3 level representations
cec_adjsim ← n_n · ϵ / 100
/* Phase I: Network enrichment */
Levels ← [L1, L2, L3]
Max_Level ← 3
CandidateEdges ← Get_Edges(G, Levels, Max_Level, cec_adjsim)
/* Phase II: Embedding calculation */
/* create dataset */
SP.append(GEdges, CandidateEdges)
/* negative sampling */
for node pairs (n_x, n_y) in SP do
    N_xy_S ← N_G_S − (N_x_L2 ∪ N_y_L2)
    sample ← sample(N_xy_S)    /* get random sample */
    SN.append(n_x, sample)
    SN.append(n_y, sample)
end for
/* create positive data */
for node pairs (n1, n2) in SP do
    target.append(L1(n1))
    context.append(L1(n2))
    label.append(true)
end for
/* create negative data */
for node pairs (n1, n2) in SN do
    target.append(L1(n1))
    context.append(L1(n2))
    label.append(false)
end for
/* create network and train */
EmbeddingSize ← 128
EmbedModel.append(InputLayer(len(G)))
EmbedModel.append(Dense(EmbeddingSize))
TargetModel.append(EmbedModel(target))
ContextModel.append(EmbedModel(context))
NetworkModel.append(Max(TargetModel, ContextModel))
NetworkModel.append(Dense(1))
NetworkModel ← train on target, context, label
NodeEmbeddings ← NetworkModel(G)
return NodeEmbeddings
In the initial step of the second phase of our algorithm, we create the training dataset containing positive and negative samples from the enriched network $G'$. Let us denote the positive samples and negative samples with $S_P$ and $S_N$, respectively. To generate the positive samples $S_P$ from the enriched network $G'$, we consider the neighbors located one edge away from each node. We add all interactions matching this description to the positive sample list of the target node. We evaluate repeating edges separately and do not delete them, so in total we have as many positive samples as the number of edges in network $G'$. We generate negative samples from network $G'$ so that our training dataset remains balanced. The disadvantage is that the dataset is a network rather than a set of sentences: the information contained in sentences is not available in the network. In addition, networks have a high rate of false positives and false negatives (see Section 3.5). We narrow the sampling space to select the most accurate negative samples. As we move away from a given node $n_x$ in network $G'$, the probability of an interaction with node $n_x$ (a false negative) decreases. Let us denote the edge between node $n_x$ and node $n_y$ in the positive samples with $e_{xy}$, the node sampling space of edge $e_{xy}$ with $N_{xy}^S$, all nodes in network $G'$ with $N_G^S$, and the $L_2$ level neighbors of nodes $n_x$ and $n_y$ with $N_x^{L_2}$ and $N_y^{L_2}$, respectively. We create the negative sampling space for edge $e_{xy}$ with the following Equation (4):
$$N_{xy}^S = N_G^S - (N_x^{L_2} \cup N_y^{L_2}). \quad (4)$$
To generate the negative samples $S_N$, for every node pair in the positive sample edge set we randomly select nodes from the negative sampling space of that pair. With our sampling algorithm, we generate balanced positive $S_P$ and negative $S_N$ datasets that represent the information contained in the target network $G$.
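A small Python sketch of Equation (4) follows: the negative sampling space of a positive edge excludes every node within two edges of either endpoint. The toy graph and function names are illustrative assumptions.

import random
import networkx as nx

def negative_space(G, x, y):
    """Nodes farther than two edges from both endpoints of edge (x, y)."""
    near_x = set(nx.single_source_shortest_path_length(G, x, cutoff=2))
    near_y = set(nx.single_source_shortest_path_length(G, y, cutoff=2))
    return set(G.nodes) - (near_x | near_y)

G = nx.karate_club_graph()
space = negative_space(G, 0, 1)
if space:
    print(random.choice(sorted(space)))  # one negative sample for edge (0, 1)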
We adopt skip-gram and a deep neural network with an unsupervised learning approach for our node vectorization algorithm, which creates node vectors that represent, in a smaller dimensional space, the high-level information that each node has in target network $G$. We use a four-level deep neural network (DNN) to generate node vector representations. Figure 3 shows the graphical representation of the designed deep neural network. The first level is the input level for positive or negative node pairs. We use multi-label one-hot encoding for the node input representation in the neural network; that is, the input layer works with the $L_1$ level node representation (see Section 3.5). At this level, we use two node inputs. The first input node is called the target node and the second input node is called the context node. We use all the nodes in the network, one by one, as the target node. Against each target node, the positive and negative sample nodes created by our sampling algorithm are used as context nodes. The second level of the DNN, named EncoderLevel, is a fully connected dense layer with Glorot uniform initialization and the Rectified Linear Unit (ReLU) activation function. Let us denote the embedding size of each node vector representation in network $G$ with $ES_G$ and the number of nodes in network $G$ with $NS_G$. The input size of the EncoderLevel is equal to $NS_G$, and the output size is equal to $ES_G$. The EncoderLevel basically converts the high-dimensional node representation to the low-dimensional node representation. We define only one EncoderLevel and use it to convert both the target and the context input layers. We design the third level of the DNN to merge the low-dimensional representations of the target and context nodes produced by the EncoderLevel. At this level of the DNN, we use the maximum operator to merge the node representations. The last level of the DNN is a fully connected dense layer with Glorot uniform initialization and the sigmoid activation function. The input size of this layer is equal to the embedding size $ES_G$, and the output size is 1. This level converts the merged embedding into the true/false (interaction/no interaction) information. With this method, the nodes in the target network are initially represented by their $L_1$ level neighborhoods. As a result, a low-dimensional representation vector of each node in the network is created from the interaction information between the nodes, which is produced by analyzing the information of the entire network, the node, and its neighbors. Our algorithm can be applied to any type of network and enriched with network-specific analysis. Algorithm 2 presents the pseudocode of our network node vectorization method, NodeVector.
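The following Keras sketch mirrors the four-level architecture described above: two multi-label one-hot inputs, one shared dense EncoderLevel (ReLU, Glorot uniform), an element-wise maximum merge, and a sigmoid output of size 1. The layer sizes follow the text; the optimizer and loss are assumptions of this sketch, not specifications from the paper.

import tensorflow as tf

def build_nodevector_model(num_nodes, embedding_size=128):
    target = tf.keras.Input(shape=(num_nodes,), name="target")
    context = tf.keras.Input(shape=(num_nodes,), name="context")

    # Shared EncoderLevel: one dense layer applied to both inputs
    encoder = tf.keras.layers.Dense(embedding_size, activation="relu",
                                    kernel_initializer="glorot_uniform",
                                    name="encoder")
    merged = tf.keras.layers.Maximum()([encoder(target), encoder(context)])
    output = tf.keras.layers.Dense(1, activation="sigmoid",
                                   kernel_initializer="glorot_uniform")(merged)

    model = tf.keras.Model(inputs=[target, context], outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_nodevector_model(num_nodes=3890)  # e.g., the Homo sapiens PPI

After training, applying the shared encoder to each node's $L_1$ representation yields its 128-dimensional embedding.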

4. Results

In this section, we evaluate the experimental performance of our algorithm on target networks. We perform experiments using well-known datasets, then measure performance with the $F_1$ score of the standard supervised learning task, multi-label node classification. In detail, the $F_1$ score of the target network is the average of the $F_1$ scores of all individual nodes. Let us denote the $F_1$ score of node $i$ and the overall $F_1$ score of network $G$ with $F_{1_i}$ and $F_{1_G}$, respectively. $F_{1_i}$ is calculated as the micro-average $F_1$ score over all predictions belonging to node $i$, and $F_{1_G}$ is calculated with the following formula: $F_{1_G} = (F_{1_1} + \dots + F_{1_N}) / N$, where $N$ is the number of nodes. For the classification task, the node feature representations are input to a one-vs-rest logistic regression classifier with L2 regularization. All experimental studies are carried out using the Python 3 (www.python.org) programming language. In the following, we describe the datasets used in the experiments and the implementation details; then, we provide the evaluation results.
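A sketch of this evaluation protocol in Python with scikit-learn is given below; the random arrays stand in for the learned embeddings and node labels and are assumptions of the example (L2 regularization is scikit-learn's default penalty).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))        # stand-in node embeddings
Y = rng.integers(0, 2, size=(200, 5))  # stand-in multi-label node classes

clf = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
clf.fit(X[:150], Y[:150])
print(f1_score(Y[150:], clf.predict(X[150:]), average="micro"))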

4.1. Dataset

We use a subgraph extracted from the Protein–Protein Interaction (PPI) network [35], specifically within the context of Homo sapiens. This subgraph was carefully curated to encompass nodes associated with labels derived from hallmark gene sets [36], serving as indicators of distinct biological states. The collection has 50 hallmarks from more than 4000 gene sets. The subgraph consists of 3890 nodes, intricately interconnected by 76,584 edges, and is characterized by a diverse repertoire of 50 distinct labels.
As a second PPI dataset in this study, we use the model organism S. cerevisiae. We source the PPI dataset, specifically S. cerevisiae validated ORFs, from BioGRID [35]. This dataset comprises curated edges and features, focusing exclusively on protein–protein physical interactions validated through wet-lab experiments. Repetitive interactions, loops, and unconfirmed ORF node interactions are deliberately excluded. We use the MIPS (Munich Information Center for Protein Sequences) [37] functional catalog version 2.1 (FunCat) third level for functional category tags. This version includes 181 functional categories at the third level.
The BlogCatalog [38] dataset represents a complex social network comprising bloggers registered on the BlogCatalog platform. Each node within the network corresponds to an individual blogger, and the edges interconnecting these nodes denote interpersonal relationships. The labels associated with each node denote inferred blogger interests, which are determined through the analysis of metadata provided by the bloggers themselves. Notably, this network dataset exhibits substantial scale, encompassing a total of 10,312 nodes connected by 333,983 edges, and categorizes bloggers into 39 distinct interest categories. This resource serves as a valuable asset for the empirical study of social dynamics and content preferences within the blogging community.
The Wikipedia [39] dataset pertains to a co-occurrence network constructed from words found in the initial million bytes of the Wikipedia dump. These words are assigned labels corresponding to their Part-of-Speech (POS) tags, as determined through the Stanford POS-Tagger [40]. The network consists of 4777 nodes, interconnected by 184,812 edges, and encompasses a diverse array of 40 distinct labels.
It is important to note that, for all networks, the evaluation of node and interaction accuracy of the original networks falls outside the scope of our study. To represent a graph $G$ with $n$ nodes, an adjacency matrix with $n \times n$ elements can be used. The row and column numbers of the matrix represent the nodes of the network, and each row–column intersection represents the presence of an edge between those nodes: $A_{ij} = 1$ if there is an edge (or edges) between nodes $v_i$ and $v_j$, and $A_{ij} = 0$ if there is no edge between them.
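As a minimal illustration, NetworkX can produce this adjacency matrix directly for a toy graph:

import networkx as nx

G = nx.Graph([(0, 1), (1, 2)])  # three nodes, two edges
A = nx.to_numpy_array(G)        # A[i, j] = 1 if v_i and v_j interact, else 0
print(A)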

4.2. Evaluation

Our main purpose in this study is to develop a more general node vectorization method for the network data structure that incorporates network analysis methods. In order to evaluate our method, first, the effect of the negative sample percentage on the results is examined. In addition, the case of using all interactions, instead of sampling with random walks from the dataset, is examined. In order to evaluate the effect of the initial node representation method, we obtain results using both single-label one-hot encoding and multi-label one-hot encoding on the dataset in which all interactions are used. In this experiment, we aim to show the effect of the node vectorization phases on vector representation performance. To observe the effect of the negative sampling ratio, we use 10, 5, 4, 3, 2, 1, and 0 as the ratio in the negative sampling calculations. In order to evaluate the sampling effect of random walks, we use a separate dataset in which all interactions are used instead of sampling. We use the Homo sapiens PPI as the network dataset, with MSigDB Collections hallmark gene set labels. The purpose of this experiment is to evaluate the performance of our algorithm against state-of-the-art approaches on the node classification task. We compare the $F_1$ score of our method with DeepWalk, LINE, node2vec, and struc2vec. The parameters of the competing algorithms are set as recommended in the relevant reference articles. All algorithms use 128 as the node embedding size. For the NodeVector algorithm, different values can be chosen for the variable $k$ in the $L_k$ adjacent similarity calculation. The networks used in the experiments in the literature have the small-world property, and the neighborhood clusters formed at $k > 3$ cover more than half of the target network. For this reason, we limit the $k$ parameter of NodeVector to a maximum of three. We limit the candidate edge ratio $\epsilon$ to two in the $cec_{adjsim}$ edge count calculation of the NodeVector algorithm. Figure 4 illustrates the results.
As shown in Figure 4, the use of negative samples has a positive effect on all algorithms. This is an expected result, due to the dataset balancing function. Although the role of negative sampling is critical, its impact is limited: continuously increasing the negative sample rate does not ensure continuous improvement of the results for any algorithm. However, the results of all algorithms decrease dramatically when negative samples are not used; the least affected algorithm is NodeVector. The experiments using all interactions directly have the lowest performance, while sampling with random walks has a positive effect on the results. The network enrichment and cluster node representation adopted by NodeVector have an even more positive impact on the results. In all cases, the NodeVector algorithm produces better results, and all algorithms have acceptable results on the problem.
The method in which all interactions are used directly and nodes are represented independently with single-label one-hot encoding has the lowest performance. On the other hand, in the same method, when the node representation is replaced with multi-label one-hot encoding that covers the cluster the node is in, the results are lower than those of its competitors, but increase to an acceptable performance level. This result shows the importance of the initial node representation in the methods used. Contrary to the one-hot encoding method, which encodes words independently in natural language processing, the data in the network data structure can, due to its structure, be represented as clusters with multi-label one-hot encoding. The experimental results show that making the nodes independent causes information loss in the network, and that the representation of the neighborhood patterns nodes create with the nodes around them contains a high amount of information. Unlike the text data type, using the $L_1$ level representation for the network data type increases representation performance.
The method of generating sentences from the network by random walks, used by the state-of-the-art algorithms in the literature, is a common method that has proven itself with a positive effect in many studies and forms the basis of existing methods. In this set of experiments, we examine the effect of random walk sample selection on NodeVector algorithm performance. Our method can produce repetitive edges during network enrichment. To sample repeated edges with random walks, we use the number of repeated edges between two nodes as the weight of the edge between those nodes. During a random walk, we take all the weights at each node into account when choosing the direction. To create a balanced dataset, we use as many negative examples as positive examples. Another parameter that can affect the performance of our method is the number of candidate edges used in network enrichment. This experiment evaluates the ways in which the number of candidate edges affects performance. Different values can be selected for the variable $\epsilon$ in the $cec_{adjsim}$ edge count calculation. We use 2, 3, 4, and 5 as the candidate edge ratio $\epsilon$. Figure 5 illustrates the results.
According to the experimental results, random walk sampling has a positive effect on the results of the NodeVector algorithm, as in similar studies in the literature. Although random walk provides a positive effect, the performance increase over the NodeVector algorithm remains limited. As shown in Figure 5, the two percent network enrichment edge addition rate has the highest performance values. Increasing the edge ratio beyond three percent of the number of available edges adversely affects the results.
The basis of the node vectorization method we propose in this study is the embedding of multidimensional representation data into lower-dimensional vector representations with the help of deep artificial neural networks. Since training neural networks requires time and processing power, a long training phase can be a disadvantage. The number of epochs in gradient descent is a hyperparameter that controls the number of complete traversals of the training dataset. In this set of experiments, we examine the effect of the epoch parameter on the results. We use the values 2, 4, 6, 8, 10, 20, 50, and 100 for the epoch parameter. Figure 6 illustrates the results.
According to the experimental results, low-dimensional vector representations cannot be learned fully with epoch numbers below eight, which has a negative impact on the results. Vector representations created with epoch numbers of 10 and above have maximum efficiency; increasing the epoch number above 10 does not increase performance. In vector representation learning, it is therefore sufficient to use 10 as the epoch number. The results obtained are compatible with the literature. With the network enrichment used in the NodeVector algorithm, increasing the number of edges in the network, and representing the initial nodes with a multi-label one-hot vector instead of a one-hot vector, do not create an additional load on deep artificial neural network training. The training cost is the same as that of the competing algorithms.
Although our study mainly focuses on node function prediction in protein interaction networks, which is a multi-label node classification problem, the method we develop is designed to work on all network data structures, regardless of the problem. In this set of experiments, we test the performance of our method on different network datasets available in the literature. We use the S. cerevisiae, BlogCatalog, and Wikipedia datasets in the experimental studies. The parameters used for our method are those obtained in the previous experiments, belonging to the best performances. Likewise, the state-of-the-art methods are used with the parameters recommended in the literature. As can be seen from the previous experimental results, the comparisons are fair, since all techniques offer good results with similar parameters. Figure 7 illustrates the results.
The experimental results in Figure 7 show that our NodeVector algorithm performs well on all datasets. NodeVector has the highest $F_1$ score, and struc2vec has the lowest. All algorithms have acceptable results on all datasets, and the algorithms based on random walk sampling have similar performance. NodeVector can be applied to all problems that can be expressed with the graph data type. Regardless of the problem, NodeVector can transfer the information in the network to the node representation with the help of network analysis and embedding.
According to all the experimental results, our NodeVector algorithm is successful. The $L_1$ neighborhood representation, which incorporates the interactions of each node in the network, contains a high amount of information and contributes positively to performance. The detection of undetected potential edges by network analysis, and network enrichment by adding to the network the repetitive edges that produce positive results in different analyses, increase the node representation quality and improve the results. Random walk sampling has a positive effect, but a limited one. Likewise, although using negative samples to create a balanced dataset has a positive effect, there is no need to use a high rate of negative samples. The low-dimensional node representations created by unsupervised learning with the help of skip-gram and a deep artificial neural network are applicable to all problems stored in the network data structure. Unlike the text data type, since the information contained in sentences is not available in the network data type, it is beneficial to use network analyses to preserve the information contained in the network data instead of reproducing sentences. Based on network analysis, our generalized NodeVector algorithm achieves results at least as good as its alternatives for the network data type. Our method of creating node representations using the existing information in the network may lead to novel network analysis development work in the future.

5. Conclusions

In this paper, we propose that using network analysis, instead of deriving sentences from networks, produces more effective results when developing a method for embedding nodes in datasets modeled with networks. This is the first study to combine network analysis with deep learning to produce a node representation with unsupervised learning. With the purpose of demonstrating that our algorithm transfers the information in the network data structure to node embeddings and performs better than its competitors, we first define NodeVector, a novel node embedding algorithm for the network data type. Then, we evaluate the performance of NodeVector against competing state-of-the-art network node vectorization algorithms, and we evaluate the relationship between performance and the negative sample percentage, the initial node representation, random walk sampling, the training epoch size, and different network datasets. In contrast to the majority of existing studies, we take advantage of network analysis, complex initial node representations, and balanced negative sampling. Our method consists of two phases. The first phase implements network enrichment using node similarities. The second adapts deep neural networks to generate novel vector representations of each node. The experimental results confirm that our method is successful and transfers the information in the network to the node embeddings well. They also demonstrate the potential of collaboration between network analysis and deep machine learning to produce more informative node embeddings for use in classification problems. With the NodeVector algorithm we developed, more informative node embeddings are obtained. According to the experimental results, the use of network analysis, complex initial node representations, balanced negative sampling, and deep neural networks has a significant effect on the classification performance of node embeddings. Therefore, techniques that make inferences from the target network through network analysis will have a higher success rate. Moreover, the results are a strong indication that network analysis is a potential candidate information extraction method for node embedding algorithms. These findings may motivate subsequent novel studies in forthcoming research on network node embedding techniques. It is a limitation that not all problems and not all network analysis techniques could be examined. For this reason, network analysis can be diversified in a problem-specific way and can have a positive impact on performance. In future studies, developing problem-specific network analysis methods for the network data sources of different problems and finding integration methods with complex machine learning algorithms will enhance the progression of the developed methodology and increase the effectiveness of the outcomes.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in the study can be accessed from the cited sources. Additionally, all data and source code used in the study are available at nodevector.btu.edu.tr (accessed on 20 August 2023).

Acknowledgments

The author is grateful to the TUBITAK ULAKBIM High Performance and Grid Computing Center (TRUBA resources) and the Bursa Technical University High-Performance Computing Laboratory.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Borgatti, S.P.; Mehra, A.; Brass, D.J.; Labianca, G. Network analysis in the social sciences. Science 2009, 323, 892–895. [Google Scholar] [CrossRef] [PubMed]
  2. Newman, M. Networks; Oxford University Press: Oxford, UK, 2018. [Google Scholar]
  3. Freeman, L.C.; Borgatti, S.P.; White, D.R. Centrality in valued graphs: A measure of betweenness based on network flow. Soc. Netw. 1991, 13, 141–154. [Google Scholar] [CrossRef]
  4. Pavlopoulos, G.A.; Wegener, A.L.; Schneider, R. A survey of visualization tools for biological network analysis. Biodata Min. 2008, 1, 12. [Google Scholar] [CrossRef]
  5. Guyon, I.; Elisseeff, A. An introduction to feature extraction. In Feature Extraction: Foundations and Applications; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–25. [Google Scholar]
  6. Tzanakou, E.M. Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar]
  7. Ozdemir, S.; Susarla, D. Feature Engineering Made Easy: Identify Unique Features From Your Dataset in Order to Build Powerful Machine Learning Systems; Packt Publishing Ltd.: Birmingham, UK, 2018. [Google Scholar]
  8. Bonaccorso, G. Machine Learning Algorithms; Packt Publishing Ltd.: Birmingham, UK, 2017. [Google Scholar]
  9. Dridi, S. Supervised Learning: A Systematic Literature Review. Preprint, 2021. Available online: https://osf.io/preprints/osf/tysr4 (accessed on 10 January 2023).
  10. Watson, D.S. On the Philosophy of Unsupervised Learning. Philos. Technol. 2023, 36, 28. [Google Scholar] [CrossRef]
  11. Perozzi, B.; Al-Rfou, R.; Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 701–710. [Google Scholar]
  12. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864. [Google Scholar]
  13. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, New York, NY, USA, 18–22 May 2015; pp. 1067–1077. [Google Scholar]
  14. Figueiredo, D.R.; Ribeiro, L.F.R.; Saverese, P.H. struc2vec: Learning node representations from structural identity. arXiv 2017, arXiv:1704.03165. [Google Scholar]
  15. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  16. Hamilton, W.L.; Ying, R.; Leskovec, J. Representation learning on graphs: Methods and applications. arXiv 2017, arXiv:1709.05584. [Google Scholar]
  17. Koh, G.C.; Porras, P.; Aranda, B.; Hermjakob, H.; Orchard, S.E. Analyzing protein-protein interaction networks. J. Proteome Res. 2012, 11, 2014–2031. [Google Scholar] [CrossRef]
  18. Bajpai, A.K.; Davuluri, S.; Tiwary, K.; Narayanan, S.; Oguru, S.; Basavaraju, K.; Dayalan, D.; Thirumurugan, K.; Acharya, K.K. Systematic comparison of the protein-protein interaction databases from a user’s perspective. J. Biomed. Inform. 2020, 103, 103380. [Google Scholar] [CrossRef]
  19. Knoke, D.; Yang, S. Social Network Analysis; SAGE Publications: Thousand Oaks, CA, USA, 2019. [Google Scholar]
  20. Milroy, L.; Llamas, C. Social networks. In The Handbook of Language Variation and Change; Wiley: Hoboken, NJ, USA, 2013; pp. 407–427. [Google Scholar]
  21. Radicchi, F.; Fortunato, S.; Vespignani, A. Citation networks. In Models of Science Dynamics: Encounters between Complexity Theory and Information Sciences; Springer: Berlin/Heidelberg, Germany, 2011; pp. 233–257. [Google Scholar]
  22. McLaren, C.D.; Bruner, M.W. Citation network analysis. Int. Rev. Sport Exerc. Psychol. 2022, 15, 179–198. [Google Scholar] [CrossRef]
  23. Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 2003, 3, 1137–1155. [Google Scholar]
  24. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  25. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
  26. Park, J.; Lee, Y.C.; Kim, S.W. Effective and efficient negative sampling in metric learning based recommendation. Inf. Sci. 2022, 605, 351–365. [Google Scholar] [CrossRef]
  27. Hafidi, H.; Ghogho, M.; Ciblat, P.; Swami, A. Negative sampling strategies for contrastive self-supervised learning of graph representations. Signal Process. 2022, 190, 108310. [Google Scholar] [CrossRef]
  28. Gutmann, M.U.; Hyvärinen, A. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. J. Mach. Learn. Res. 2012, 13. [Google Scholar]
  29. Brandes, U. Network Analysis: Methodological Foundations; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005; Volume 3418. [Google Scholar]
  30. Gabr, H.; Rivera-Mulia, J.C.; Gilbert, D.M.; Kahveci, T. Computing interaction probabilities in signaling networks. EURASIP J. Bioinform. Syst. Biol. 2015, 2015, 10. [Google Scholar] [CrossRef] [PubMed]
  31. Stumpf, M.P.; Wiuf, C. Incomplete and noisy network data as a percolation process. J. R. Soc. Interface 2010, 7, 1411–1419. [Google Scholar] [CrossRef]
  32. De Oliveira, É.T.C.; de França, F.O. Enriching networks with edge insertion to improve community detection. Soc. Netw. Anal. Min. 2021, 11, 89. [Google Scholar] [CrossRef]
  33. Zhou, J.; Xiong, W.; Wang, Y.; Guan, J. Protein function prediction based on PPI networks: Network reconstruction vs. edge enrichment. Front. Genet. 2021, 12, 758131. [Google Scholar] [CrossRef]
  34. Li, A.; Horvath, S. Network neighborhood analysis with the multi-node topological overlap measure. Bioinformatics 2007, 23, 222–231. [Google Scholar] [CrossRef]
  35. Chatr-Aryamontri, A.; Breitkreutz, B.J.; Oughtred, R.; Boucher, L.; Heinicke, S.; Chen, D.; Stark, C.; Breitkreutz, A.; Kolas, N.; O’Donnell, L.; et al. The BioGRID interaction database: 2015 update. Nucleic Acids Res. 2015, 43, D470–D478. [Google Scholar] [CrossRef] [PubMed]
  36. Liberzon, A.; Subramanian, A.; Pinchback, R.; Thorvaldsdóttir, H.; Tamayo, P.; Mesirov, J.P. Molecular signatures database (MSigDB) 3.0. Bioinformatics 2011, 27, 1739–1740. [Google Scholar] [CrossRef] [PubMed]
  37. Ruepp, A.; Zollner, A.; Maier, D.; Albermann, K.; Hani, J.; Mokrejs, M.; Tetko, I.; Güldener, U.; Mannhaupt, G.; Münsterkötter, M.; et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004, 32, 5539–5545. [Google Scholar] [CrossRef] [PubMed]
  38. Zafarani, R.; Liu, H. Social Computing Data Repository at ASU. 2009. Available online: http://datasets.syr.edu (accessed on 5 January 2023).
  39. Mahoney, M. Large Text Compression Benchmark. 2011. Available online: https://cs.fit.edu/~mmahoney/compression/text.html (accessed on 5 November 2022).
  40. Toutanova, K.; Klein, D.; Manning, C.D.; Singer, Y. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Stroudsburg, PA, USA, 27 May–1 June 2003; pp. 252–259. [Google Scholar]
Figure 1. Sentence–graph–sentence derivation scenario. (a) Representation of the original sentences that existed at the beginning. (b) Representation of the sentences with an undirected network data structure. Articles and auxiliary verbs are deleted for a fair comparison. (c) Sentences generated from the network.
Figure 2. L1, L2, and L3 level sample network node representation. Each row in the table corresponds to the L1, L2, and L3 level representation for Node 2. Each neighborhood level is colored separately.
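The neighborhood levels illustrated in Figure 2 can be made precise as shortest-path distance shells: level k contains the nodes exactly k hops from the source. A small sketch follows; the helper name neighborhood_levels is our illustrative assumption.

```python
import networkx as nx

def neighborhood_levels(graph: nx.Graph, source, max_level: int = 3):
    """Group nodes by shortest-path distance (level) from `source`."""
    distances = nx.single_source_shortest_path_length(graph, source,
                                                      cutoff=max_level)
    levels = {k: set() for k in range(1, max_level + 1)}
    for node, dist in distances.items():
        if dist >= 1:                      # skip the source itself
            levels[dist].add(node)
    return levels

G = nx.path_graph(6)                       # nodes 0-1-2-3-4-5 in a line
print(neighborhood_levels(G, source=2))    # {1: {1, 3}, 2: {0, 4}, 3: {5}}
```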
Figure 3. Graphical representation of a deep neural network. N denotes the number of nodes in the target network.
Figure 4. Evaluation of the F1 score for competitor algorithms on Homo sapiens. The negative sampling ratio takes the values 10, 5, 4, 3, 2, 1, and 0. OnlyInteractions: uses all interactions without sampling. SL: uses single-label one-hot encoding. ML: uses multi-label one-hot encoding. The y-axis represents the F1 score, the x-axis represents the applied algorithms, and the coloring represents the negative sampling ratio.
Figure 5. Evaluation of the effect of random walk sampling and the candidate edge ratio on the NodeVector algorithm. The candidate edge ratio takes the values 2, 3, 4, and 5. The y-axis represents the F1 score, the x-axis represents the NodeVector algorithm with/without sampling, and the coloring represents the candidate edge ratio.
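As a rough illustration of the random walk sampling evaluated in Figure 5, a uniform random walk over a graph can be sketched as follows; the walk length and the stopping rule are assumptions, not the paper's exact settings.

```python
import random
import networkx as nx

def random_walk(graph: nx.Graph, start, length: int = 10, seed=None):
    """Return the node sequence visited by a uniform random walk."""
    rng = random.Random(seed)
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:                  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

G = nx.karate_club_graph()
print(random_walk(G, start=0, length=8, seed=42))
```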
Figure 6. Evaluation of the effect of the training repetition parameter (epoch) on the NodeVector algorithm. The epoch parameter takes the values 2, 4, 6, 8, 10, 20, 50, and 100. The y-axis represents the F1 score, and the x-axis represents the NodeVector algorithm under the different training repetition parameters.
Figure 7. Evaluation of the F1 score for competitor algorithms on the S. cerevisiae, BlogCatalog, and Wikipedia datasets. The y-axis represents the F1 score, and the x-axis represents the competing algorithms on the different network datasets.