Article

A Fusion Link Prediction Method Based on Limit Theorem

National Digital Switching System Engineering and Technological R&D Center, Zhengzhou 450002, China
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2018, 8(1), 32; https://doi.org/10.3390/app8010032
Submission received: 3 December 2017 / Revised: 17 December 2017 / Accepted: 22 December 2017 / Published: 28 December 2017
(This article belongs to the Special Issue Selected Papers from IEEE ICICE 2017)


Featured Application

The proposed theory can guide the design of combination methods, and the proposed TLF method can fuse multiple similarity indices in link prediction.

Abstract

The theoretical limit of link prediction is a fundamental problem in this field, and taking the network structure as the object of study is the mainstream approach to it. This paper proposes a new viewpoint: link prediction methods can be divided into single and combination methods, based on the way they derive the similarity matrix, and investigates whether a theoretical limit exists for combination methods. We propose and prove necessary and sufficient conditions for a combination method to reach the theoretical limit. The limit theorem reveals the essence of combination methods, which is to estimate the probability density functions of existing and nonexistent links. Based on the limit theorem, a new combination method, the theoretical limit fusion (TLF) method, is proposed. Simulations and experiments on real networks demonstrate that the TLF method achieves higher prediction accuracy.

1. Introduction

Limit theory is a basic theoretical issue that has attracted wide interest across many fields. On the 100th anniversary of its founding, Science raised 125 unresolved scientific questions, many of which relate to limit theory [1]. Link prediction aims to predict missing links in a current network and new or dissolving links in a future network [2]. With the continuous improvement of link prediction methods, the theoretical limit of link prediction has attracted considerable research interest [3].
Considering structural or attribute features, link prediction methods based on classification were first proposed by the computer science community [4,5]. Subsequently, methods grounded more deeply in network structure, such as similarity-based methods [6], became a focus; these methods pay more attention to physical meaning. At the same time, similarity index fusion methods sprang up [7,8]. In recent years, with the development of deep learning, deep feature extraction methods have been proposed [9,10], and the fusion of structure and attribute information has again received attention [11,12,13,14]. These methods share a strong consistency. We divide link prediction methods into single and combination methods, based on whether they use multidimensional information and whether they define the relation among its dimensions directly. For example, the RA index [15], which directly defines the relation between common neighbors and node degree, is a single method; classification-based methods, index fusion methods, and methods fusing structure and attribute information are combination methods.
Most combination methods perform better than the single methods they fuse, and are robust across many network types. However, what is the reason for this improved accuracy and robustness, and is there a theoretical limit for combination methods? This paper proposes a mathematical description of combination methods and obtains necessary and sufficient conditions for the theoretical limit. The limit theorem also has practical value: it reveals that the ultimate goal of a combination method is to estimate the probability density functions of existing and nonexistent links. Thus, an appropriate form of the transformation function can be selected from the complete set. Based on the limit theorem, a new combination method, the theoretical limit fusion (TLF) method, is proposed. We use the Parzen kernel method [16] of density estimation in the TLF method. Simulations and empirical studies show that the TLF method achieves higher prediction accuracy.
Section 2 introduces a mathematical description of the theoretical limit of combination methods and the evaluation metrics for link prediction. Section 3 proposes and proves necessary and sufficient conditions for the theoretical limit of combination methods. Section 4 proposes a fusion link prediction method based on the limit theorem (the TLF method). Section 5 provides simulation examples for the limit theorem, compares the proposed TLF method with other combination methods, and reports comparison experiments on real networks. Sections 6 and 7 discuss the results and conclude the paper.

2. Problem Description and Evaluation Metrics

2.1. Problem Description

Given a network $G(V,E)$ at time $t$, let $V=\{v_1,v_2,\ldots,v_N\}$ be the set of nodes and $E=\{e_1,e_2,\ldots,e_M\}$ the set of links. The observed links $E$ are randomly divided into a training set $E^T$ and a probe set $E^P$, where $E=E^T\cup E^P$ and $E^T\cap E^P=\varnothing$. Link prediction aims to predict missing links in the current network or new links at a future time $t'$ ($t'>t$) [2]. Combination methods fuse several similarity indices into a synthetic index and can be described mathematically as follows. Let $\mathbf{X}=(X_1,X_2,\ldots,X_n)^T$ be the scores of existing links given by $n$ structural similarity indices, following the probability density function (pdf) $f(\mathbf{x})=f(x_1,x_2,\ldots,x_n)$. Let $\mathbf{Y}=(Y_1,Y_2,\ldots,Y_n)^T$ be the scores of nonexistent links under the same $n$ indices, following $g(\mathbf{x})=g(x_1,x_2,\ldots,x_n)$. We need to find a transformation function $l(\mathbf{x})$ and obtain the synthetic scores $X'=l(\mathbf{X})$ and $Y'=l(\mathbf{Y})$ that maximize the evaluation metrics. Figure 1 shows the diagram of combination methods.

2.2. Evaluation Metrics

Let the synthetic score $X'=l(\mathbf{X})$ follow pdf $f_{X'}(x)$, and $Y'=l(\mathbf{Y})$ follow $g_{Y'}(x)$, where $X'$ and $Y'$ are independent. We then have the following metrics.

2.2.1. Area under the Receiver Operation Characteristics Curve (AUC)

A receiver operating characteristic (ROC) curve is a two-dimensional depiction of classifier performance [17]. In link prediction, the abscissa of the ROC curve represents the probability that a nonexistent link scores above some threshold $\mu$, i.e., the false positive rate (FPR), $\mathrm{FPR}=\int_{\mu}^{+\infty} g_{Y'}(x)\,dx$. The ordinate represents the probability that a missing link scores above $\mu$, i.e., the true positive rate (TPR), $\mathrm{TPR}=\int_{\mu}^{+\infty} f_{X'}(x)\,dx$; TPR is equivalent to Recall. Following [18], the AUC can be derived as
$$\mathrm{AUC}=P(X'>Y')=\iint_{x>y} f_{X'}(x)\,g_{Y'}(y)\,dx\,dy=\frac{1}{2}\iint_{x>y} f_{X'}(x)\,g_{Y'}(y)\,dx\,dy+\frac{1}{2}\Bigl(1-\iint_{x\le y} f_{X'}(x)\,g_{Y'}(y)\,dx\,dy\Bigr)=\frac{1}{2}\iint \operatorname{sgn}(x-y)\,f_{X'}(x)\,g_{Y'}(y)\,dx\,dy+\frac{1}{2}=\frac{1}{2}E\bigl[\operatorname{sgn}(X'-Y')+1\bigr],\qquad(1)$$
where
$$\operatorname{sgn}(x)=\begin{cases}1, & x>0\\ 0, & x=0\\ -1, & x<0\end{cases}$$
In a real network, the original data is randomly divided into a training set and a probe set. Equation (1) means that over $n$ independent comparisons, if there are $n'$ comparisons in which the missing link returns a higher score and $n''$ comparisons in which the missing and nonexistent links return the same score, the algorithmic expression of AUC is
$$\mathrm{AUC}=\frac{n'+0.5\,n''}{n}.$$
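The paper's Matlab code is not shown; as an illustrative sketch only, the sampling procedure behind this AUC estimate can be written as follows (the function and variable names are ours):

```python
import random

def auc_score(missing_scores, nonexistent_scores, n_comparisons=10000, seed=0):
    """Estimate AUC by random pairwise comparison: n' comparisons where the
    missing link scores higher, n'' where the scores tie, out of n in total."""
    rng = random.Random(seed)
    n_higher = n_tie = 0
    for _ in range(n_comparisons):
        x = rng.choice(missing_scores)       # score of a random missing link
        y = rng.choice(nonexistent_scores)   # score of a random nonexistent link
        if x > y:
            n_higher += 1
        elif x == y:
            n_tie += 1
    return (n_higher + 0.5 * n_tie) / n_comparisons
```

If every missing link outscores every nonexistent link the estimate is 1.0; identical score distributions give 0.5, matching a random predictor.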

2.2.2. Precision

Precision is defined as the ratio of correct predictions to all (correct and erroneous) predictions when score $>\mu$, i.e.,
$$\mathrm{Precision}=\frac{P(\omega_1)\int_{\mu}^{+\infty} f_{X'}(x)\,dx}{P(\omega_1)\int_{\mu}^{+\infty} f_{X'}(x)\,dx+P(\omega_2)\int_{\mu}^{+\infty} g_{Y'}(x)\,dx}=\frac{P(\omega_1)\,\mathrm{TPR}}{P(\omega_1)\,\mathrm{TPR}+P(\omega_2)\,\mathrm{FPR}}.$$
In a real network, if $m$ of the top $L$ predicted links are correct (i.e., $m$ links lie in $E^P$), then
$$\mathrm{Precision}=\frac{m}{L}.\qquad(5)$$
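The top-$L$ Precision above is straightforward to compute; a hypothetical helper (Python for illustration, names ours) might look like:

```python
def precision_at_L(ranked_links, probe_set, L):
    """Precision = m / L: the fraction of the top-L predicted links
    that actually appear in the probe set E^P."""
    top = ranked_links[:L]                     # links sorted by synthetic score
    m = sum(1 for link in top if link in probe_set)
    return m / L
```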
Owing to the imbalance of positive and negative samples, link prediction usually uses the AUC metric. In applications, high Precision means the predicted target links are accurate and can be used directly. AUC and Precision are two important metrics in link prediction; we study the theoretical limit under both.

3. Theoretical Limit Theorem

Theorem 1.
Let $\mathbf{X}=(X_1,X_2,\ldots,X_n)^T$ and $\mathbf{Y}=(Y_1,Y_2,\ldots,Y_n)^T$ be random vectors following the joint distributions $f(\mathbf{x})$ and $g(\mathbf{x})$, respectively, where $m\{\mathbf{x}: f(\mathbf{x})/g(\mathbf{x})=C,\ g(\mathbf{x})\neq 0,\ C\in\mathbb{R}\}=0$ ($m$ denotes the measure of a set). Then the following conditions are equivalent.
(a)
A monotonically increasing function $r(x)$ exists such that $l(\mathbf{x})=r[f(\mathbf{x})/g(\mathbf{x})]$, $g(\mathbf{x})\neq 0$, for a.e. $\mathbf{x}\in\mathbb{R}^n$.
(b)
The transformation function $l(\mathbf{x})$ produces the maximum AUC.
If we add to the Theorem the condition that the prior probabilities of existing and nonexistent links are $P(\omega_1)$ and $P(\omega_2)$, respectively, then the following condition is equivalent to (a) and (b):
(c)
For any $\alpha$, there exists a corresponding threshold $\mu_l$ for the transformation $l(\mathbf{x})$ satisfying $\alpha=P(\omega_1)\int_{\mu_l}^{+\infty} f_{X'}(x)\,dx+P(\omega_2)\int_{\mu_l}^{+\infty} g_{Y'}(x)\,dx$, such that $l(\mathbf{x})$ produces the maximum Precision.
Proof. 
(a) $\Rightarrow$ (b):
By the equivalent definition, the maximum AUC is the maximum area under the ROC curve. If, for every FPR, the corresponding TPR on the ROC curve reaches its maximum, then the AUC reaches its maximum, where
$$\mathrm{FPR}=\int_{\mu}^{+\infty} g_{Y'}(x)\,dx=\int_{E(l(\mathbf{x})>\mu)} g(\mathbf{x})\,d\mathbf{x},$$
$$\mathrm{TPR}=\int_{\mu}^{+\infty} f_{X'}(x)\,dx=\int_{E(l(\mathbf{x})>\mu)} f(\mathbf{x})\,d\mathbf{x},$$
where $E(l(\mathbf{x})>\mu)$ denotes the set $\{\mathbf{x}\in\mathbb{R}^n: l(\mathbf{x})>\mu\}$ for some $\mu\in\mathbb{R}$, and $m\{\mathbf{x}: l(\mathbf{x})=C,\ C\in\mathbb{R}\}=0$.
We use Lagrange multipliers to solve this problem. For any specified FPR (denoted $\mathrm{FPR}_0$), the TPR on the ROC curve reaching its maximum is equivalent to $\varphi$ reaching its maximum, where
$$\varphi=\int_{E(l(\mathbf{x})>\mu)} f(\mathbf{x})\,d\mathbf{x}+\lambda\Bigl[\mathrm{FPR}_0-\int_{E(l(\mathbf{x})>\mu)} g(\mathbf{x})\,d\mathbf{x}\Bigr]=\lambda\,\mathrm{FPR}_0+\int_{E(l(\mathbf{x})>\mu)} \bigl[f(\mathbf{x})-\lambda g(\mathbf{x})\bigr]\,d\mathbf{x}.$$
The function $\varphi$ is maximized if we choose the set $E(l(\mathbf{x})>\mu)$ so that the integrand is positive, i.e., if
$$f(\mathbf{x})-\lambda g(\mathbf{x})>0,\qquad(8)$$
then $\mathbf{x}\in E(l(\mathbf{x})>\mu)$. That is, regardless of the value of $\lambda$, if we select the set of $\mathbf{x}$ on which the integrand $f(\mathbf{x})-\lambda g(\mathbf{x})$ is positive, $\varphi$ reaches its maximum; if the set contains points where the integrand is negative, $\varphi$ decreases. Let $l(\mathbf{x})=f(\mathbf{x})/g(\mathbf{x})$ and $\mu=\lambda$; then the set $E(l(\mathbf{x})>\mu)$ equals $E(f(\mathbf{x})/g(\mathbf{x})>\lambda)$, which satisfies (8), i.e.,
$$\varphi=\lambda\,\mathrm{FPR}_0+\int_{E(f(\mathbf{x})/g(\mathbf{x})>\lambda)} \bigl[f(\mathbf{x})-\lambda g(\mathbf{x})\bigr]\,d\mathbf{x}.$$
Thus, for every FPR, the corresponding TPR on the ROC curve reaches its maximum, so the AUC is maximal when $\mathbf{X}$ and $\mathbf{Y}$ are transformed by $l(\mathbf{x})=f(\mathbf{x})/g(\mathbf{x})$.
Let $r(x)$ be a monotonically increasing function and $h(x)$ its inverse. Since $h'(x)=1/r'(h(x))$, $h(x)$ and $r(x)$ have the same monotonicity and both are increasing, so $|h'(x)|=h'(x)$. The pdf of $X_2=r(X_1)$ is $f_{X_2}(x)=f_{X_1}[h(x)]\,h'(x)$, and the pdf of $Y_2=r(Y_1)$ is $g_{Y_2}(x)=g_{Y_1}[h(x)]\,h'(x)$. Thus,
$$\mathrm{AUC}=P(X_2>Y_2)=\int_{-\infty}^{+\infty} f_{X_2}(x)\int_{-\infty}^{x} g_{Y_2}(y)\,dy\,dx=\int_{-\infty}^{+\infty} f_{X_1}(h(x))\,h'(x)\int_{-\infty}^{x} g_{Y_1}(h(y))\,h'(y)\,dy\,dx=\int_{-\infty}^{+\infty} f_{X_1}(u)\int_{-\infty}^{u} g_{Y_1}(v)\,dv\,du=P(X_1>Y_1).$$
We have proved (a) $\Rightarrow$ (b).
(b) $\Rightarrow$ (a): Suppose $l_2(\mathbf{x})\neq r[l(\mathbf{x})]$ for any increasing function $r(x)$, but $\mathbf{X},\mathbf{Y}$ transformed by $l_2(\mathbf{x})$ also produce the maximum AUC; then the corresponding ROC curves must coincide. Otherwise, wherever the two ROC curves differ, for some FPR at least one curve does not reach the maximum TPR, contradicting maximal AUC. Since $m\{\mathbf{x}: f(\mathbf{x})/g(\mathbf{x})=C,\ g(\mathbf{x})\neq 0,\ C\in\mathbb{R}\}=0$ and the ROC curves coincide at every point $(\mathrm{FPR},\mathrm{TPR})$, we have:
  • (i) For any $\mathrm{FPR}\in[0,1]$ and any $\mu_{\mathrm{FPR}}$, there exists $\mu_{2,\mathrm{FPR}}$ such that $E(l(\mathbf{x})>\mu_{\mathrm{FPR}})=E(l_2(\mathbf{x})>\mu_{2,\mathrm{FPR}})$ for a.e. $\mathbf{x}\in\mathbb{R}^n$;
  • (ii) For any $\mu'_{\mathrm{FPR}}>\mu_{\mathrm{FPR}}$, if $E(l(\mathbf{x})>\mu_{\mathrm{FPR}})=E(l_2(\mathbf{x})>\mu_{2,\mathrm{FPR}})$ and $E(l(\mathbf{x})>\mu'_{\mathrm{FPR}})=E(l_2(\mathbf{x})>\mu'_{2,\mathrm{FPR}})$, then $\mu'_{2,\mathrm{FPR}}>\mu_{2,\mathrm{FPR}}$.
Let $y_1=l(\mathbf{x})$. Then a set of $y_1$ with nonzero measure exists such that $l_2(\mathbf{x})\neq r[l(\mathbf{x})]$, i.e., $m\{y_1: l_2(\mathbf{x})\neq r[l(\mathbf{x})]\}\neq 0$. Let $\sigma=\{y_1: l_2(\mathbf{x})\neq r[l(\mathbf{x})]\}$. If for $y_1\in\sigma$, $l_2(\mathbf{x})$ and $l(\mathbf{x})$ satisfy a functional relation $l_2(\mathbf{x})=s[l(\mathbf{x})]$ with $s(x)$ not increasing, then for any $\mu_1\in\sigma$, condition (ii) fails. If for $y_1\in\sigma$, $l_2(\mathbf{x})$ and $l(\mathbf{x})$ are not functionally related, then neither condition (i) nor (ii) holds. Thus (b) $\Rightarrow$ (a) is established.
(c) $\Leftrightarrow$ (b): Let $k=\mathrm{TPR}/\mathrm{FPR}$ be the slope of the secant from any point on the ROC curve to the origin; then $\mathrm{Precision}=k/(k+\lambda)$ with $\lambda=P(\omega_2)/P(\omega_1)$. For any $\alpha$, that $l(\mathbf{x})$ produces maximum Precision is equivalent to $k$ reaching its maximum, which in turn is equivalent to maximizing $\mathrm{TPR}=\int_{\mu_l}^{+\infty} f_{X'}(x)\,dx$ subject to $\alpha=P(\omega_1)\int_{\mu_l}^{+\infty} f_{X'}(x)\,dx+P(\omega_2)\int_{\mu_l}^{+\infty} g_{Y'}(x)\,dx$. Since this holds for any $\alpha$, it is equivalent to the TPR reaching its maximum for every $\mathrm{FPR}\in[0,1]$, i.e., to $l(\mathbf{x})$ producing the maximum AUC. ☐
Note 1: The condition $m\{\mathbf{x}: f(\mathbf{x})/g(\mathbf{x})=C,\ g(\mathbf{x})\neq 0,\ C\in\mathbb{R}\}=0$ excludes the case in which $f(\mathbf{x})/g(\mathbf{x})=C$ (a constant) on a set $\sigma=\{\mathbf{x}: f(\mathbf{x})/g(\mathbf{x})=C,\ g(\mathbf{x})\neq 0\}\subset\mathbb{R}^n$ of nonzero measure, where the transformation function could be defined arbitrarily. For example, construct the pdf of a random vector $\tilde{\mathbf{X}}$ as
$$\tilde f(\mathbf{x})=\begin{cases} f(\mathbf{x}), & \mathbf{x}\in\mathbb{R}^n\setminus\sigma,\\ k\,g(\mathbf{x}), & \mathbf{x}\in\sigma,\quad k\in\mathbb{R},\ k<\dfrac{f(\mathbf{x})}{g(\mathbf{x})},\end{cases}$$
with $\int_{\mathbb{R}^n}\tilde f(\mathbf{x})\,d\mathbf{x}=1$. Let the transformation function be
$$l(\mathbf{x})=\begin{cases} f(\mathbf{x})/g(\mathbf{x}), & \mathbf{x}\in\mathbb{R}^n\setminus\sigma,\\ l'(\mathbf{x}), & \mathbf{x}\in\sigma.\end{cases}$$
Then, no matter how $l'(\mathbf{x})$ is defined on $\sigma$, $l(\mathbf{x})$ produces the maximum AUC of $(\tilde{\mathbf{X}},\mathbf{Y})$ only when $l'(\mathbf{x})<\min[f(\mathbf{x})/g(\mathbf{x})]$. In particular, if $f(\mathbf{x})=g(\mathbf{x})$ for all $\mathbf{x}\in\mathbb{R}^n$, then AUC = 0.5 regardless of how $l(\mathbf{x})$ is defined; thus the maximum AUC is 0.5.
Note 2: The arbitrariness of the ratio $\alpha$ must be emphasized in condition (c). If "any $\alpha$" were omitted, then (b) $\Rightarrow$ (c) would still hold but (c) $\Rightarrow$ (b) would not. In application, $\alpha$ is a ratio of the whole data: for any $l(\mathbf{x})$, a ratio $\alpha$ corresponds to a threshold $\mu$.
The theorem shows that no matter which evaluation criterion is chosen, the transformation functions providing maximum link prediction accuracy constitute a function cluster, $\Phi=\{l(\mathbf{x}): l(\mathbf{x})=r[f(\mathbf{x})/g(\mathbf{x})],\ g(\mathbf{x})\neq 0\}$, where $r(x)$ is a monotonically increasing function. Therefore, the accuracy of a combination method is always greater than or equal to the accuracy of each single dimension.
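As a quick numerical illustration of the function cluster $\Phi$: applying any monotonically increasing $r(x)$ (here $e^x$) to both score sets leaves the AUC unchanged, since the AUC depends only on the ordering of the scores. A short Python check with made-up scores (all values ours):

```python
import math

def exact_auc(xs, ys):
    """AUC = P(X > Y) + 0.5 * P(X = Y), computed over all score pairs."""
    pairs = [(x, y) for x in xs for y in ys]
    wins = sum(1 for x, y in pairs if x > y)
    ties = sum(1 for x, y in pairs if x == y)
    return (wins + 0.5 * ties) / len(pairs)

xs = [0.2, 1.5, 0.9, 2.1]          # synthetic scores of existing links
ys = [0.1, 0.8, 1.0]               # synthetic scores of nonexistent links
base = exact_auc(xs, ys)
# exp is monotonically increasing, so the pairwise ordering is preserved
transformed = exact_auc([math.exp(x) for x in xs], [math.exp(y) for y in ys])
```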

4. A Fusion Link Prediction Method Based on Limit Theorem

4.1. The Algorithm

The limit theorem shows that when the transformation function is chosen as $l(\mathbf{x})=f(\mathbf{x})/g(\mathbf{x})$ or a monotonically increasing transformation of it, the AUC and Precision of the synthetic score reach their maximum. In a real network, $f(\mathbf{x})$ and $g(\mathbf{x})$ are unknown, so the pdfs must be estimated from the multidimensional data. Let the estimated pdfs be $\hat f(\mathbf{x})$ and $\hat g(\mathbf{x})$. On the basis of the limit theorem, we define the transformation function as the ratio of the estimated pdfs, i.e.,
$$\hat l(\mathbf{x})=\hat f(\mathbf{x})/\hat g(\mathbf{x}),\qquad(14)$$
We then obtain the synthetic score $s=\hat l(\mathbf{x})$ and use it for link prediction. This method is called the theoretical limit fusion (TLF) method.
Before estimating $f(\mathbf{x})$ and $g(\mathbf{x})$, the input link prediction scores need to be normalized,
$$s'_k(i,j)=\frac{0.5\,N^2\,s_k(i,j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} s_k(i,j)},\qquad k=1,2,\ldots,d,$$
where $s_k(i,j)$ is the $k$-th similarity score for node pair $(i,j)$, $N$ is the dimension of the adjacency matrix, and $d$ is the number of similarity indices.
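As an illustration of this normalization (Python rather than the paper's Matlab; the function name is ours), the rescaled matrix entries average exactly 0.5 over the $N^2$ node pairs:

```python
import numpy as np

def normalize_scores(S):
    """Rescale an N x N similarity matrix so its entries average to 0.5
    over all N^2 node pairs, per the normalization above."""
    N = S.shape[0]
    return 0.5 * N**2 * S / S.sum()
```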
The limit theorem transforms the index fusion problem into the estimation of pdfs, so statistical density estimation methods can be applied directly. The Parzen kernel method [16] of density estimation is used in this paper. The multivariate kernel density estimate is defined as
$$\hat f(\mathbf{x})=\frac{1}{n_s h^d}\sum_{i=1}^{n_s} K\Bigl[\frac{1}{h}(\mathbf{x}-\mathbf{x}_i)\Bigr],\qquad(16)$$
where $h$ is the window width, $n_s$ is the sample size, and $K(\mathbf{x})$ is a multivariate kernel defined for $d$-dimensional $\mathbf{x}$ such that
$$\int_{\mathbb{R}^d} K(\mathbf{x})\,d\mathbf{x}=1.$$
A commonly used form of kernel is the Gaussian kernel,
$$K(\mathbf{x})=\frac{1}{(2\pi)^{d/2}}\exp\Bigl(-\frac{\mathbf{x}^T\mathbf{x}}{2}\Bigr).$$
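A compact sketch of the Parzen estimate with the Gaussian kernel, and the TLF ratio of Equation (14) built from it (Python for illustration; the paper's implementation is in Matlab, and the function names and the underflow guard `eps` are ours):

```python
import numpy as np

def parzen_kde(samples, x, h):
    """Parzen estimate (1 / (n_s h^d)) * sum_i K((x - x_i) / h)
    with the multivariate Gaussian kernel K."""
    n_s, d = samples.shape
    u = (x - samples) / h                          # (n_s, d) scaled differences
    K = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n_s * h ** d)

def tlf_score(x, existing, nonexistent, h=0.1, eps=1e-300):
    """TLF synthetic score: ratio of the two estimated pdfs.
    eps guards against division by an underflowed denominator."""
    return parzen_kde(existing, x, h) / max(parzen_kde(nonexistent, x, h), eps)
```

A point near the existing-link samples and far from the nonexistent-link samples gets a ratio well above 1, i.e., a high synthetic score.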
In summary, the steps of TLF are listed in Table 1.

4.2. Complexity Analysis

For a given undirected, unweighted graph $G(V,E)$, let $N=|V|$ be the number of nodes, $m=|E|$ the number of edges, and $n_s$ the sample size. During the estimation of the pdfs in (16), all samples are scanned once; a scan requires time $O(d\cdot n_s)$, which is less than $O(N^2)$. This is the model training (pdf estimation) step. All combination methods share an unavoidable cost: obtaining the similarity matrix, i.e., the final link prediction scores according to Equation (14), which requires time $O(d\cdot n_s\cdot N^2)$. So the TLF method takes more than $O(N^3)$ time. The main space requirement is storing the estimator and the adjacency matrix or final similarity matrix; the space complexity is $O(N^2)$.

5. Simulation and Experiment

We implemented the algorithm in Matlab (MathWorks, Beijing, China) and ran it on a single machine with RedHat 6.4, 16 GB of memory, and a 3.4 GHz CPU; the Matlab version is R2014b. In the simulations of Section 5.1, 4-dimensional pdfs are used to verify the limit theorem and the effectiveness of the TLF method. We also test the resulting method on real networks. We use the TLF method to fuse 4 local similarity indices: CN [19], AA [20], RA, and PA [21,22]. These are simple indices with low computational complexity, about $O(N\cdot\langle k\rangle^2)$, where $\langle k\rangle$ is the average node degree. The CN index considers only the common neighbors of a node pair; the PA index considers only the degrees of the two nodes; AA and RA consider both common neighbors and node degrees, with different weights. We compare the method with fusion methods such as naïve Bayes and logistic regression, and with global indices whose computational complexity exceeds $O(N^3)$.

5.1. Simulation Examples

Four types of structural similarity indices were simulated to score node pairs with and without links, with the pdfs of the indices given below. We construct 3 groups of known distributions for the similarity index pdfs. One thousand samples drawn from 10,000 existing links and 100,000 samples of nonexistent links were generated following the appropriate pdfs. The 1000 samples serve as the probe set; the 100,000 samples together with the 1000 probe links serve as unknown links for training; and the remaining 9000 samples serve as the training set of existing links. Each sample has 4 dimensions, simulating 4 similarity scores. We first compute AUC and Precision for each dimension, then use the proposed TLF method to obtain the synthetic score and compute its AUC and Precision, compared with other combination methods such as naïve Bayes and logistic regression. Finally, we compute AUC and Precision from the theoretical limit theorem and compare with the above methods.
Let random vectors $\mathbf{X}=(X_1,X_2,X_3,X_4)^T$ and $\mathbf{Y}=(Y_1,Y_2,Y_3,Y_4)^T$ be the scores of existing and nonexistent links, following the pdfs $f(\mathbf{x})=f(x_1,x_2,x_3,x_4)$ and $g(\mathbf{x})=g(x_1,x_2,x_3,x_4)$, respectively.
Let $f(\mathbf{x})$ and $g(\mathbf{x})$ be 4-dimensional normal distributions,
$$f(\mathbf{x})=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\Bigl\{-\frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\Bigr\},$$
where $\operatorname{diag}(\Sigma)=(\sigma_1^2,\sigma_2^2,\sigma_3^2,\sigma_4^2)^T$ and $\Sigma_{ij}=r_{ij}\sigma_i\sigma_j$.
The parameter sets for the 2 groups of simulation examples are as follows.
Group 1: Θ 1 f = { μ 1 f , Σ 1 f } , and Θ 1 g = { μ 1 g , Σ 1 g } ;
Group 2: Θ 2 f = { μ 2 f , Σ 2 f } and Θ 2 g = { μ 2 g , Σ 2 g } .
In each group, $\boldsymbol\mu_1^f=(1,2,1.7,2.1)^T$, $\boldsymbol\mu_1^g=(1.3,2.5,2.1,2.8)^T$, $\boldsymbol\mu_2^f=(1,2,1.7,2.1)^T$, $\boldsymbol\mu_2^g=(1.5,3.5,2.8,3)^T$, $\operatorname{diag}(\Sigma_1^f)=(1.5^2,2.2^2,3^2,2.5^2)^T$, $\operatorname{diag}(\Sigma_1^g)=(2^2,2.2^2,3^2,2.5^2)^T$, $\operatorname{diag}(\Sigma_2^f)=(1.5^2,2.2^2,3^2,2.5^2)^T$, $\operatorname{diag}(\Sigma_2^g)=(2.5^2,3.5^2,4^2,2.5^2)^T$,
$$r_1^f=r_1^g=\begin{bmatrix}1&0.8&0.76&0.56\\0.8&1&0.85&0.74\\0.76&0.85&1&0.93\\0.56&0.74&0.93&1\end{bmatrix},\qquad r_2^f=r_2^g=\begin{bmatrix}1&0.62&0.45&0.34\\0.62&1&0.28&0.47\\0.45&0.28&1&0.65\\0.34&0.47&0.65&1\end{bmatrix}.$$
The window width of the TLF method in groups 1 and 2 is $h=0.1$.
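The Matlab experiments are not reproduced here, but the limit theorem can be checked on a toy version of such Gaussian settings. The following Python sketch uses our own simplified 2-dimensional parameters (not the paper's Group 1) and compares the AUC of a single score dimension with the AUC of the limit transformation; for equal-covariance Gaussians, $\log(f/g)$ is linear in $\mathbf{x}$, here proportional to $x_1+x_2$, so the limit fusion reduces to summing the scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Toy scores: existing links ~ N((1,1), I), nonexistent links ~ N((0,0), I).
X = rng.normal([1.0, 1.0], 1.0, size=(n, 2))
Y = rng.normal([0.0, 0.0], 1.0, size=(n, 2))

def auc(x, y):
    """Exact sample AUC, P(X' > Y') + 0.5 * P(X' = Y'), over all n^2 pairs."""
    return ((x[:, None] > y[None, :]).mean()
            + 0.5 * (x[:, None] == y[None, :]).mean())

auc_single = auc(X[:, 0], Y[:, 0])          # one similarity index alone
# Equal-covariance Gaussians: log(f/g) is proportional to x1 + x2,
# so the limit transformation reduces to the coordinate sum.
auc_limit = auc(X.sum(axis=1), Y.sum(axis=1))
```

With these parameters the theoretical values are $\Phi(1/\sqrt{2})\approx 0.76$ for a single dimension and $\Phi(1)\approx 0.84$ for the limit fusion; the sampled AUCs fall close to these, and the fused score dominates, as the theorem requires.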
Group 3: Let
$$f_3(\mathbf{x})=x_1x_2x_3x_4+x_1x_4+x_3\exp(x_1)\log(x_2)\qquad(0\le x_1\le 3,\ 1\le x_2\le 3,\ 3\le x_3\le 5,\ 2\le x_4\le 3.5),$$
and
$$g_3(\mathbf{x})=x_1x_2x_3x_4+x_3\exp(x_1)\log(x_2)\qquad(0\le x_1\le 4,\ 1\le x_2\le 3,\ 3\le x_3\le 5,\ 2.5\le x_4\le 5).$$
We omit the normalizing constants that make the integrals of $f(\mathbf{x})$ and $g(\mathbf{x})$ equal to 1. The simulation results of group 3 are shown in Table 2.
The window width of the TLF method in group 3 is $h=0.1$.
The simulation results in Table 2 and Table 3 show that the theoretical limit of a combination method can be calculated from Theorem 1, and that the limit AUC and Precision are the highest among all listed methods, though we cannot enumerate all possible conditions. The results also show that the TLF method fuses the information effectively and attains the optimum accuracy. We further verify that a monotonically increasing transformation does not change the theoretical limit. Theorem 1 thus provides a platform for comparing combination methods by constructing distributions, and guides an effective combination method, TLF.

5.2. Experiments in Real Networks

The significance of the simulation is that the theoretical limit can be derived by theoretical or numerical calculation, and every combination method can be compared against it, revealing shortcomings and gaps that guide the design of more rational methods. However, simulated data differ from real network data, so we use the TLF method to fuse several similarity indices and test it on real networks. The basic similarity indices are the Common Neighbor index (CN) [19], Adamic-Adar index (AA) [20], Resource Allocation index (RA), and Preferential Attachment index (PA) [21,22]; these are local indices. Several global indices, such as the Katz index [23], Average Commute Time index (ACT), and cosine similarity index (Cos+), serve as comparisons [24,25]. The definitions and meanings of these indices are listed in Table 4.
We use the TLF method to fuse the 4 local similarity indices, and compare with fusion methods such as naïve Bayes and logistic regression and with the global indices. Our experiments are performed on 11 real networks: (1) Food Web Everglades Web (FWEW) [26]; (2) Food Web Florida Bay (FWFB) [27]; (3) Protein-Protein Interactions Cell (PPI Cell) [28]; (4) CKM-3 [29]; (5) Netscience (NS) [30]; (6) Yeast [31]; (7) Political Blogosphere (PB) [32]; (8) Email [33]; (9) CA-GrQc (CG) [34]; (10) Com-dblp (CD) [35]; (11) Email Enron (EE) [36,37]. The basic topological features of the 11 networks are listed in Table 5. Each dataset is randomly divided into a training set containing 90% of the links and a probe set containing the remaining 10%.
Table 6 and Table 7 compare the TLF method with the other combination methods and global indices using the AUC and Precision metrics. Each result is the average of 10 realizations. When computing the Precision metric of Equation (5), we take L = 100 in datasets 1 to 8 and L = 1000 in datasets 9 to 11. In the large networks, the TLF method samples the data to save computing resources; in datasets 10 and 11, the under-sampling rate is set to 1000.
The results show that the TLF method performs better than the other fusion methods, such as naïve Bayes and logistic regression, under either evaluation metric. Almost all combination methods outperform the 4 basic indices. By the limit theorem, a combination method depends on each of its dimensions, so the improvement of a fused index is restricted by the quality of each similarity index. The experimental results expose this problem: if the single similarity indices perform poorly, the fused index cannot significantly improve accuracy. For example, in the CKM-3 network, although fusing the 4 basic indices with the TLF method improves the AUC markedly, it cannot surpass the Katz index (0.928).

6. Discussion

Many combination methods try to find the nonlinear relation among the dimensions, seeking a more reasonable fusion function to improve prediction accuracy. For example, the link prediction method based on the Choquet fuzzy integral [7] uses fuzzy measures to weigh the importance of each similarity index in the fusion process and the interactions between them. The logistic regression based index adopts the logistic function to learn the relation of multiple structural features, yielding an adaptive link prediction method [38]. In fact, according to the limit theorem, the sought nonlinear relation is the ratio of the two joint probability density functions, or a monotonically increasing transformation of it. The best fusion function is a measure of the difference between existing and nonexistent links, reflecting their relative likelihoods. The essence of combination methods is to approximate these pdfs from different angles, and the limit theorem provides a unified interpretation for all of them. On this basis, the proposed TLF method estimates the two pdfs directly, and the simulations and real-network experiments show that it has a better fusion effect.

7. Conclusions

This paper proposes a mathematical description of link prediction combination methods and derives the limit theorem. Before this description, many combination methods had been put forward and widely used, but each was developed separately, without a unified explanation. The limit theorem resolves this problem and provides guidance for link prediction method design. The TLF method based on the limit theorem achieves higher prediction accuracy.

Acknowledgments

We thank Professor Guo'en Hu for his inspiration. This work was partially supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61521003) and the National Natural Science Foundation of China (No. 61601513).

Author Contributions

Yiteng Wu and Hongtao Yu proposed mathematical description of combination method; Yiteng Wu proposed and proved the theoretical limit theorem; Yiteng Wu and Ruiyang Huang designed the experiments and analyzed the results. Yingle Li and Senjie Lin wrote part of code.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Seife, C. What are the limits of conventional computing. Science 2005, 309, 96. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, P.; Xu, B.; Wu, Y.; Zhou, X. Link prediction in social networks: The state-of-the-art. Sci. China Inf. Sci. 2015, 58, 1–38. [Google Scholar] [CrossRef]
  3. Lü, L.; Pan, L.; Zhou, T.; Zhang, Y.-C.; Stanley, H.E. Toward link predictability of complex networks. Proc. Natl. Acad. Sci. 2015, 112, 2325–2330. [Google Scholar] [CrossRef] [PubMed]
  4. Lü, L.; Zhou, T. Link prediction in complex networks: A survey. Phys. A Stat. Mech. Appl. 2011, 390, 1150–1170. [Google Scholar] [CrossRef]
  5. Wohlfarth, T.; Ichise, R. Semantic and Event-Based Approach for Link Prediction. In Proceedings of the Practical Aspects of Knowledge Management (PAKM 2008), Yokohama, Japan, 22–23 November 2008. [Google Scholar] [CrossRef]
  6. Chiancone, A.; Franzoni, V.; Li, Y.; Markov, K.; Milani, A. Leveraging Zero Tail in Neighbourhood for Link Prediction. In Proceedings of the 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Singapore, 6–9 December 2015; pp. 135–139. [Google Scholar] [CrossRef]
  7. Yu, H.T.; Wang, S.H.; Ma, Q. Link prediction algorithm based on the Choquet fuzzy integral. Intell. Data Anal. 2016, 20, 809–824. [Google Scholar] [CrossRef]
  8. He, Y.-l.; Liu, J.N.K.; Hu, Y.-X.; Wang, X.-Z. OWA operator based link prediction ensemble for social network. Expert Syst. Appl. 2015, 42, 21–50. [Google Scholar] [CrossRef]
  9. Liao, L.; He, X.; Zhang, H.; Chua, T.-S. Attributed Social Network Embedding. Trans. Knowl. Data Eng. 2017. Available online: http://www.comp.nus.edu.sg/~xiangnan/papers/attributed-social-network-embedding.pdf (accessed on 5 September 2017).
  10. Grover, A.; Leskovec, J. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  11. Wang, Z.; Chen, C.; Li, W. Predictive Network Representation Learning for Link Prdiction. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, 7–11 August 2017; pp. 969–972. [Google Scholar]
Figure 1. Combination methods.
Table 1. The steps of theoretical limit fusion (TLF) method.
Step 1: Divide the network into a training set, $E^T$, and a probe set, $E^P$.
Step 2: Normalize the similarity indices according to Equation (15), then distinguish existing links from nonexistent links in the training set.
Step 3: Estimate the pdfs of existing links and nonexistent links based on Equation (16), obtaining the estimates $\hat{f}(x)$ and $\hat{g}(x)$.
Step 4: Compute the synthetic score of the $n$ structural similarity indices according to Equation (14).
Step 5: Calculate the accuracy, using a metric such as AUC or Precision, on the probe set.
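As a rough illustration of Steps 2–4, the sketch below estimates the two pdfs with a Parzen (Gaussian-kernel) window and fuses the $n$ normalized index scores of each candidate link into a single synthetic score. Since Equations (14)–(16) are not reproduced here, the likelihood-ratio form of the synthetic score and the function names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def parzen_pdf(samples, h):
    """Multivariate Parzen-window density estimate with a product Gaussian
    kernel; `samples` is an (m, n) array of n normalized index scores."""
    samples = np.asarray(samples, dtype=float)
    m, n = samples.shape
    norm = m * (h * np.sqrt(2.0 * np.pi)) ** n
    def pdf(x):
        x = np.atleast_2d(np.asarray(x, dtype=float))     # (q, n) query points
        d = x[:, None, :] - samples[None, :, :]           # (q, m, n) offsets
        k = np.exp(-0.5 * np.sum((d / h) ** 2, axis=2))   # kernel responses
        return k.sum(axis=1) / norm
    return pdf

def tlf_score(pos_train, neg_train, candidates, h=0.1):
    """Synthetic score sketch: estimated likelihood ratio f_hat(x)/g_hat(x),
    where f_hat and g_hat are the pdfs of existing and nonexistent links."""
    f_hat = parzen_pdf(pos_train, h)   # pdf estimate for existing links
    g_hat = parzen_pdf(neg_train, h)   # pdf estimate for nonexistent links
    return f_hat(candidates) / (g_hat(candidates) + 1e-12)
```

Candidates whose index-score vectors resemble the existing-link sample then receive high synthetic scores, and the window width h plays the same tuning role as in Tables 6 and 7.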
Table 2. Simulation results of group 1 and group 2. The bold figure indicates the best accuracy in each dimension and combination method.
| Parameters | Accuracy | Dim1 | Dim2 | Dim3 | Dim4 | NB | LR | TLF | Theoretical Limit | Transform by Increasing Function |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Group 1 | AUC | 0.554 | 0.566 | 0.547 | 0.585 | 0.610 | 0.668 | 0.691 | 0.738 | 0.738 |
| Group 1 | Precision | 0.047 | 0.015 | 0.014 | 0.027 | 0.038 | 0.020 | 0.097 | 0.120 | 0.120 |
| Group 2 | AUC | 0.569 | 0.660 | 0.604 | 0.622 | 0.765 | 0.676 | 0.786 | 0.792 | 0.792 |
| Group 2 | Precision | 0.114 | 0.140 | 0.081 | 0.038 | 0.153 | 0.051 | 0.212 | 0.241 | 0.241 |
Table 3. Simulation results of group 3. The bold figure indicates the best accuracy in each dimension and combination method.
| Accuracy | Dim1 | Dim2 | Dim3 | Dim4 | NB | LR | TLF | Theoretical Limit | Transform by Increasing Function |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AUC | 0.770 | 0.505 | 0.488 | 0.878 | 0.938 | 0.923 | 0.950 | 0.956 | 0.956 |
| Precision | 0.567 | 0.007 | 0.007 | 0.654 | 0.711 | 0.100 | 0.815 | 0.858 | 0.858 |
Table 4. Definitions and descriptions of similarity indices.
| Index | Equation | Description |
| --- | --- | --- |
| CN | $s^{\mathrm{CN}}(i,j) = \lvert \Gamma(i) \cap \Gamma(j) \rvert$ | $\Gamma(i)$ is the set of neighbors of node $i$, and $\lvert \cdot \rvert$ denotes the cardinality of a set. The CN index counts the common neighbors of nodes $i$ and $j$. |
| AA | $s^{\mathrm{AA}}(i,j) = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{\log k_z}$ | The AA index weights each common neighbor $z$ by the reciprocal of the logarithm of its degree $k_z$. |
| RA | $s^{\mathrm{RA}}(i,j) = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{k_z}$ | The RA index weights each common neighbor by the reciprocal of its degree. |
| PA | $s^{\mathrm{PA}}(i,j) = k_i k_j$ | The PA index models preferential attachment as the product of the two node degrees. |
| Katz | $s^{\mathrm{Katz}}(i,j) = \left[ \lim_{n \to \infty} \sum_{m=1}^{n} (\alpha A)^m \right]_{ij}$ | $A$ is the adjacency matrix of the network. The Katz index counts all paths between the two nodes, assigning greater weight (via the damping factor $\alpha$) to shorter paths. |
| ACT | $s^{\mathrm{ACT}}(i,j) = \frac{1}{l_{ii}^{+} + l_{jj}^{+} - 2 l_{ij}^{+}}$ | $l_{xy}^{+}$ is the corresponding element of $L^{+}$, the pseudoinverse of the Laplacian matrix. |
| Cos+ | $s^{\mathrm{Cos+}}(i,j) = \frac{v_i^{T} v_j}{\lvert v_i \rvert \cdot \lvert v_j \rvert} = \frac{l_{ij}^{+}}{\sqrt{l_{ii}^{+} \cdot l_{jj}^{+}}}$ | Based on $L^{+}$, Cos+ computes the cosine similarity of the node vectors $v_i$ and $v_j$. |
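The four local indices in Table 4 translate directly into code; a minimal sketch from a binary adjacency matrix (the helper name `local_indices` is ours, for illustration only):

```python
import numpy as np

def local_indices(A, i, j):
    """CN, AA, RA and PA scores for the node pair (i, j), computed from a
    binary symmetric adjacency matrix A with 0-indexed nodes."""
    A = np.asarray(A, dtype=int)
    k = A.sum(axis=1)                        # node degrees k_z
    common = np.flatnonzero(A[i] & A[j])     # common neighbors Γ(i) ∩ Γ(j)
    return {
        "CN": int(len(common)),
        "AA": sum(1.0 / np.log(k[z]) for z in common),  # assumes k_z >= 2
        "RA": sum(1.0 / k[z] for z in common),
        "PA": int(k[i] * k[j]),
    }
```

For example, in a 4-node graph where nodes 0 and 1 both link to nodes 2 and 3, the pair (0, 1) has CN = 2, RA = 1 and PA = 4.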
Table 5. Basic topological features of the 11 example networks. $\lvert V \rvert$ and $\lvert E \rvert$ are the total numbers of nodes and links, respectively. $\langle k \rangle$ is the average node degree. $r$ and $C$ are the assortativity coefficient and clustering coefficient, respectively. $H$ is the degree heterogeneity, defined as $H = \langle k^2 \rangle / \langle k \rangle^2$.

| Data | $\lvert V \rvert$ | $\lvert E \rvert$ | $\langle k \rangle$ | $r$ | $C$ | $H$ |
| --- | --- | --- | --- | --- | --- | --- |
| FWEW | 69 | 880 | 25.51 | −0.298 | 0.560 | 1.275 |
| FWFB | 128 | 2075 | 32.42 | −0.112 | 0.335 | 1.24 |
| PPI_Cell | 127 | 237 | 3.732 | 0.035 | 0.455 | 1.649 |
| CKM-3 | 246 | 423 | 3.439 | 0.102 | 0.356 | 1.335 |
| Yeast | 2375 | 11,693 | 9.85 | 0.469 | 0.378 | 3.48 |
| PB | 1222 | 16,717 | 27.36 | −0.221 | 0.361 | 2.97 |
| NS | 1589 | 2742 | 3.451 | 0.462 | 0.889 | 2.011 |
| Email | 1133 | 5451 | 9.622 | 0.078 | 0.297 | 1.942 |
| CG | 5242 | 14,496 | 1.11 | 0.659 | 0.720 | 3.05 |
| CD | 425,957 | 1,049,866 | 4.93 | 0.267 | 0.267 | 4.412 |
| EE | 36,692 | 183,831 | 10.02 | −0.111 | 0.746 | 13.9 |
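The degree-based columns of Table 5 follow directly from the adjacency matrix; a minimal sketch (the helper name `degree_stats` is ours):

```python
import numpy as np

def degree_stats(A):
    """Average degree <k> and degree heterogeneity H = <k^2>/<k>^2
    computed from a binary symmetric adjacency matrix A."""
    k = np.asarray(A).sum(axis=1).astype(float)   # degree sequence
    k_mean = k.mean()
    H = float((k ** 2).mean() / k_mean ** 2)      # H = 1 for regular graphs
    return float(k_mean), H
```

For a regular graph H equals 1, and H grows as the degree distribution becomes more heterogeneous, e.g. H = 4/3 for a 4-node star.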
Table 6. Comparisons of the AUC value between TLF and other combination methods or global indices. For each network, the selected window width h is reported alongside the TLF AUC value. The bold figure indicates the best AUC.

| Data | CN | AA | RA | PA | ACT | Cos+ | Katz | NB | LR | TLF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FWEW | 0.687 | 0.694 | 0.714 | 0.819 | 0.793 | 0.511 | 0.727 | 0.825 | 0.832 | 0.876 (h = 0.1) |
| FWFB | 0.624 | 0.624 | 0.624 | 0.742 | 0.727 | 0.649 | 0.680 | 0.749 | 0.762 | 0.781 (h = 0.1) |
| PPI_Cell | 0.736 | 0.745 | 0.740 | 0.699 | 0.779 | 0.783 | 0.822 | 0.753 | 0.679 | 0.831 (h = 0.3) |
| CKM-3 | 0.661 | 0.665 | 0.661 | 0.585 | 0.560 | 0.535 | 0.928 | 0.683 | 0.675 | 0.713 (h = 0.15) |
| Yeast | 0.918 | 0.918 | 0.915 | 0.869 | 0.903 | 0.958 | 0.962 | 0.925 | 0.934 | 0.968 (h = 0.2) |
| PB | 0.922 | 0.928 | 0.928 | 0.906 | 0.890 | 0.932 | 0.934 | 0.931 | 0.936 | 0.949 (h = 0.3) |
| NS | 0.994 | 0.994 | 0.995 | 0.709 | 0.558 | 0.507 | 0.996 | 0.998 | 0.999 | 0.999 (h = 0.2) |
| Email | 0.849 | 0.852 | 0.851 | 0.817 | 0.801 | 0.889 | 0.908 | 0.865 | 0.870 | 0.912 (h = 0.15) |
| CG | 0.966 | 0.965 | 0.967 | 0.992 | 0.549 | 0.679 | 0.996 | 0.984 | 0.991 | 0.994 (h = 0.1) |
| CD | 0.962 | 0.968 | 0.971 | 0.943 | 0.912 | 0.971 | 0.915 | 0.975 | 0.973 | 0.982 (h = 0.15) |
| EE | 0.981 | 0.984 | 0.984 | 0.927 | 0.903 | 0.980 | 0.514 | 0.985 | 0.987 | 0.992 (h = 0.15) |
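The AUC values in Table 6 follow the standard link-prediction estimate: repeatedly compare the score of a randomly chosen probe-set link with that of a randomly chosen nonexistent link, crediting 1 for a win and 0.5 for a tie. A sketch of that sampling estimate (function name and sample count are our choices):

```python
import numpy as np

def auc(probe_scores, nonexistent_scores, n_compare=20000, seed=0):
    """Sampled AUC: estimated P(probe link outscores nonexistent link)
    plus half the estimated probability of a tie."""
    rng = np.random.default_rng(seed)
    p = rng.choice(np.asarray(probe_scores, dtype=float), size=n_compare)
    q = rng.choice(np.asarray(nonexistent_scores, dtype=float), size=n_compare)
    return float(((p > q).sum() + 0.5 * (p == q).sum()) / n_compare)
```

A predictor that always ranks probe links above nonexistent ones scores 1.0; one that cannot separate them at all scores 0.5, the random baseline against which the table's values should be read.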
Table 7. Comparisons of the Precision value between TLF and other combination methods or global indices. For each network, the window width h is the same as in Table 6. The bold figure indicates the best Precision.

| Data | CN | AA | RA | PA | ACT | Cos+ | Katz | NB | LR | TLF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FWEW | 0.143 | 0.145 | 0.162 | 0.334 | 0.271 | 0.004 | 0.196 | 0.301 | 0.325 | 0.543 |
| FWFB | 0.071 | 0.072 | 0.083 | 0.240 | 0.184 | 0.029 | 0.148 | 0.249 | 0.283 | 0.382 |
| PPI_Cell | 0.052 | 0.048 | 0.073 | 0.012 | 0.045 | 0.061 | 0.058 | 0.072 | 0.068 | 0.085 |
| CKM-3 | 0.051 | 0.059 | 0.062 | 0.011 | 0.001 | 0.003 | 0.061 | 0.060 | 0.062 | 0.064 |
| Yeast | 0.652 | 0.703 | 0.461 | 0.439 | 0.487 | 0.291 | 0.721 | 0.712 | 0.723 | 0.785 |
| PB | 0.381 | 0.320 | 0.212 | 0.100 | 0.129 | 0.298 | 0.381 | 0.411 | 0.395 | 0.452 |
| NS | 0.820 | 0.971 | 0.982 | 0.008 | 0.004 | 0.006 | 0.823 | 0.988 | 0.986 | 0.991 |
| Email | 0.202 | 0.253 | 0.214 | 0.039 | 0.031 | 0.086 | 0.231 | 0.263 | 0.289 | 0.347 |
| CG | 0.972 | 0.969 | 0.967 | 0.991 | 0.557 | 0.663 | 0.998 | 0.983 | 0.989 | 0.996 |
| CD | 0.901 | 0.924 | 0.931 | 0.892 | 0.867 | 0.937 | 0.912 | 0.939 | 0.942 | 0.951 |
| EE | 0.981 | 0.984 | 0.987 | 0.924 | 0.898 | 0.912 | 0.516 | 0.988 | 0.985 | 0.992 |
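The Precision metric in Table 7 is the fraction of the top-L ranked candidate links that turn out to be true (probe-set) links; a minimal sketch (helper name ours):

```python
def precision_at_L(scores, is_probe, L):
    """Fraction of the L highest-scoring candidate links that are probe-set
    links; `scores` and `is_probe` (0/1 flags) are parallel sequences."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sum(is_probe[t] for t in ranked[:L]) / L
```

Unlike AUC, which averages over the whole ranking, Precision only rewards a method for placing true links at the very top, which is why the two tables can rank the methods differently.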

Wu, Y.; Yu, H.; Huang, R.; Li, Y.; Lin, S. A Fusion Link Prediction Method Based on Limit Theorem. Appl. Sci. 2018, 8, 32. https://doi.org/10.3390/app8010032
