Our index is constructed in two steps. First, we compute the vertex cover, S. Second, we process each vertex of S to construct two-hop labels. Below, we discuss the details.
3.2.1. Vertex Cover Computation
The vertex cover, S, in our approach should satisfy two conditions: (1) When selecting a vertex and adding it into S in each iteration, the selected vertex should be the one covering the largest number of uncovered edges. In this way, we can obtain a smaller vertex cover to reduce the two-hop index size. Assuming that, after a vertex is processed, all the edges incident to it will be deleted, we know that the number of uncovered edges with respect to a vertex is its current degree. Therefore, the selected vertex in each iteration is the one with the largest degree among all the remaining vertices in V\S, i.e., a vertex u's degree is the number of its out- and in-neighbors that do not belong to S. (2) It should be computed efficiently. If we want to satisfy the first condition, it means that, after choosing a vertex, u, and adding it to S, we need to change the degrees of u's out- and in-neighbors and then perform a sorting operation so that, in the next iteration, we can find the vertex that has the largest degree on the fly. If we maintain this order using a priority queue, the cost of computing the vertex cover is O((|V| + |E|)log|V|). Here, we propose a novel approach which guarantees that a vertex cover satisfying the first condition can be computed in linear time, i.e., O(|V| + |E|).
We first sort all the vertices in descending order with respect to their degrees and use an array, A, to maintain the sorting result. For each degree, d, there may exist many vertices with that same degree in A. We use a hash table, H, to record, for the vertices with the same degree, d, the position of the last such vertex in A (the subscript of the last vertex). For example, for G in Figure 3a, the array, A, and the hash table, H, are shown in Figure 3a. For G, there are two vertices with a degree = 4, i.e., d and g. H[4] = 1 means that, in A, for the vertices with a degree = 4, the position of the last one is one, denoting vertex g. When we take a vertex, u, and add it to S, we need to update the degrees of u's neighbors and then adjust their positions in A and update H.
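As an illustration, the bookkeeping structures above can be sketched as follows (a hypothetical sketch, not the paper's code; the function name build_degree_index and the dict-based input are our own):

```python
# Sketch of the bookkeeping structures: A keeps the vertices sorted by
# degree in descending order, pos records each vertex's subscript in A,
# and H[d] records the subscript in A of the LAST vertex with degree d.
def build_degree_index(degrees):
    # degrees: dict mapping each vertex to its degree
    A = sorted(degrees, key=lambda v: degrees[v], reverse=True)
    pos = {v: i for i, v in enumerate(A)}
    H = {}
    for i, v in enumerate(A):
        H[degrees[v]] = i   # overwritten until the last vertex of the block
    return A, pos, H
```

Because A is sorted, the vertices of each degree occupy one contiguous block, and H marks the end of every block.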
Theorem 2. Let u be the vertex with the largest degree. The operation of removing u from the given graph, G, and sorting all the remaining vertices according to their new degrees can be performed with a cost of O(deg(u)), where deg(u) is the number of u's in- and out-neighbors.
Proof of Theorem 2. Let v be one of u’s in- or out-neighbors, and let I(v) be the set of vertices in A that have the same degree as v. When we remove u from G, v’s degree will decrease by one (Case 1) or by two (Case 2). Case 1 means that exactly one of the edges (u, v) and (v, u) exists, and Case 2 means that both (u, v) and (v, u) exist.
Consider Case 1, and let d = deg(v). Before deleting u, all the vertices in A are sorted by degree. Because the vertex, w, placed immediately after the last vertex of I(v) in A has a degree satisfying deg(w) ≤ d − 1, we only need to swap v and the last vertex of I(v), which guarantees that, after u is deleted, all the vertices of I(v) still have the same degree and that their positions in A remain continuous. After the adjustment, we update the hash table by reducing the end value of degree d by one, i.e., H[d] = H[d] − 1, denoting that the size of I(v) is reduced by one. Then, if the new degree of v satisfies d − 1 > deg(w), it means that v starts a new degree group and that we need to add a new entry into H, i.e., H[d − 1] = H[d] + 1, the position to which v was swapped. Therefore, the cost of processing neighbor, v, is O(1).
Consider Case 2. It equals performing Case 1 twice; therefore, it can also be performed in O(1) time.
As a result, the overall cost of processing u is O(deg(u)). □
The Algorithm: Based on Theorem 2, we have a greedy approach shown by Algorithm 1, which sorts all the vertices by degree in line 1, initializes the hash table and the vertex cover in lines 2–3, and then computes the vertex cover in lines 4–15 by sequentially scanning array A until all the edges are removed from the graph. In each iteration (lines 4–15), Algorithm 1 sequentially fetches a vertex, u, from A and adds it to the vertex cover, S, in lines 5–6. In lines 7–14, we process each of u's neighbors, v, according to Theorem 2. First, we swap v and the last vertex that has the same degree as v in line 8. In line 9, we update the hash table by reducing the end value of H[deg(v)] by one, denoting that the number of vertices with a degree = deg(v) is reduced by one. In line 10, we reduce the degree of v by one. At last, we remove the edge between u and v in lines 12–13.
Algorithm 1 CompVC |
1 | sort all vertices in array A by degree in decreasing order |
2 | initialize the end value H[d] of each degree d in hash table H |
3 | S ← ∅ /*S is the vertex cover*/ |
4 | while (E ≠ ∅) do |
5 | u ← the next vertex in A whose degree is greater than zero |
6 | S ← S ∪ {u} |
7 | foreach edge, e, incident to u, with v being the other endpoint of e, do |
8 | last ← H[deg(v)]; swap v and A[last] /*last = the subscript in A of the last vertex with degree deg(v)*/ |
9 | H[deg(v)] ← H[deg(v)] − 1 |
10 | deg(v) ← deg(v) − 1 |
11 | if (deg(v) > deg(A[last + 1])) then H[deg(v)] ← last |
12 | if (e = (u, v)) then E ← E \ {(u, v)} |
13 | else E ← E \ {(v, u)} |
14 | endfor |
15 | endwhile |
16 | return S |
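To make the procedure concrete, here is a runnable sketch of the greedy idea behind Algorithm 1 (our own code, not the authors' implementation; for simplicity it assumes a simple graph given as undirected vertex pairs, whereas the paper handles the two directed edge cases of Theorem 2 separately):

```python
from collections import defaultdict

def comp_vc(edges):
    # Build an undirected view: deg[v] counts uncovered edges incident to v.
    deg = defaultdict(int)
    adj = defaultdict(set)
    for u, v in edges:
        deg[u] += 1; deg[v] += 1
        adj[u].add(v); adj[v].add(u)
    # A: vertices sorted by degree (descending); pos[v]: v's index in A;
    # H[d]: index in A of the last vertex whose current degree is d.
    A = sorted(deg, key=lambda x: deg[x], reverse=True)
    pos = {v: i for i, v in enumerate(A)}
    H = {}
    for i, v in enumerate(A):
        H[deg[v]] = i
    remaining = len(edges)
    S, i = [], 0
    while remaining > 0:
        u = A[i]; i += 1
        if deg[u] == 0:
            continue
        S.append(u)
        remaining -= deg[u]          # all of u's uncovered edges get covered
        for v in adj[u]:
            adj[v].discard(u)
            d = deg[v]
            last = H[d]              # last slot of the degree-d block
            w = A[last]
            A[pos[v]], A[last] = w, v        # swap v to the end of its block
            pos[w], pos[v] = pos[v], last
            H[d] = last - 1                  # the degree-d block shrinks
            deg[v] = d - 1
            if H.get(d - 1, -1) < last:      # v starts a (possibly new) block
                H[d - 1] = last
        deg[u] = 0
        adj[u].clear()
    return S
```

Each neighbor update is one swap plus two hash-table updates, which is the O(1) cost established by Theorem 2.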
Example 2: Consider G in Figure 3a. The initial status of the corresponding array, A, and the hash table, H, is shown in Figure 3a. Since all the vertices in A are sorted by degree, the first vertex to be processed is d. When processing d, we first add it to the vertex cover, S, then process its neighbors in order. For the neighbor f, deg(f) = 2, and we first compute last = H[2] and then swap vertex i and f in line 8 of Algorithm 1. After that, we update the hash value H[2] to five, due to the fact that the number of vertices with a degree = 2 is reduced by one. At last, we update f′s degree to one and remove the edge between d and f from E. The other three neighbors of d are processed similarly to f. After processing all of d’s neighbors, the status of G, A, and H is shown in Figure 3b. The next vertex to be processed is g. The processing is similar to that of d. After processing g, the status of G, A, and H is shown in Figure 3c. The following processing is similar. After the remaining vertices are processed in the same way, E becomes empty, and the computed vertex cover, S, is returned.
As a comparison, kReach [10] can also compute a vertex cover in linear time. It randomly selects an edge, (u, v), rather than a vertex, in each iteration, adds both u and v into S, and then deletes all the edges incident to u and v. This operation repeats until all the edges are removed from G. For G in Figure 1, depending on the order in which the edges are chosen, the vertex cover computed in this way can be larger than the result of our approach.
Analysis: For Algorithm 1, we use an array, A, to maintain all the vertices, and its size is O(|V|). Since the maximum degree, dmax, is much less than |V|, we know that, for Algorithm 1, the space complexity is O(|V|).
Consider the time complexity. The cost of the sorting operation by degree in line 1 is O(|V|) using counting sort, with a space cost of O(dmax). The cost of initializing the hash table, H, in line 2 is also O(|V|), due to sequentially scanning A once. Since the cost of each iteration of lines 4–15 to process a vertex, u, is O(deg(u)) by Theorem 2, the time cost over all the vertices is O(|E|). Hence, for Algorithm 1, the time complexity is O(|V| + |E|).
It is worth noting that kReach [10] also computes a vertex cover in linear time, O(|V| + |E|). Here, we give an intuitive comparison between the two approaches. The common feature is that they both can obtain a vertex cover in linear time and that the result of both approaches is a two-approximate minimum vertex cover. The difference lies in two aspects. First, the size of the vertex cover, S, of kReach could be large in practice due to the fact that it randomly selects an edge, rather than a vertex, in each iteration. We see in Section 4 that our approach usually generates a vertex cover much smaller than that of kReach. Second, as pointed out by [16], processing vertices in descending order with respect to their degrees is usually a better choice to generate an index with a smaller size; however, kReach cannot guarantee that the vertices in S always have a larger degree. For example, for G in Figure 1, the vertex cover computed by kReach contains vertex j, whose degree is one; as a comparison, the vertex cover computed by our approach contains f instead, and the degree of f is two, which is greater than that of j.
3.2.2. The Baseline Two-Hop Distance Label
In this section, we discuss how to build the smallest two-hop distance label based on the vertex cover, S. Specifically, our approach takes the vertices of S as hop vertices and only generates a two-hop distance label for the vertices in S as shown by Algorithm 2.
In Algorithm 2, we process all the vertices in the vertex cover, S, to generate a two-hop distance label. It first calls Algorithm 1 in line 1 to compute the vertex cover, S. In lines 2–29, it processes all the vertices of S in the same order as they entered S. For each vertex, u, Algorithm 2 performs BFS in lines 3–15 and reverse BFS in lines 16–28. During the BFS and reverse BFS of u, it only constructs a two-hop distance label for the vertices that belong to S. Specifically, for each vertex, v, being visited when performing BFS, we first check whether v belongs to S in line 4. If v ∈ S (line 4 holds), we first obtain the distance, l_uv, from u to v through BFS in line 5. Then, we compute the distance from u to v in lines 6–10 using the current distance labels. If the distance computed from the current labels is no greater than l_uv (line 10 holds), it means that v is useless when computing the lengths of the paths from u. Thus, we do not need to process v anymore and terminate the processing of v in line 11. Otherwise, we add the distance entry, (u, l_uv), into the label L_in(v) in line 14. After performing BFS in lines 3–15, we perform reverse BFS from u in lines 16–28. In line 17, we first check whether the currently visited vertex, v, belongs to S. If yes (line 17 holds), we first obtain the length, l_v, of the path from v to u in line 18. Then, in lines 19–23, we check whether the distance computed from the current labels is no greater than l_v. If line 23 holds, it means that v is useless. Therefore, we terminate the processing of v in line 24. Otherwise, we add the distance entry, (u, l_v), into the distance label L_out(v) in line 27. For example, given the vertex processing order determined by Algorithm 1, we obtain the two-hop distance label shown in Figure 2c.
Algorithm 2 genVCIndex |
1 | S ← CompVC(G) /*Algorithm 1*/ |
2 | foreach (vertex u ∈ S) do |
3 | for each vertex, v, being visited when performing BFS from u |
4 | if (v ∈ S) then |
5 | l_uv ← the distance from u to v by BFS |
6 | dist ← +∞ |
7 | foreach ((w, d1) ∈ L_out(u)) do |
8 | if ((w, d2) ∈ L_in(v)) then dist ← min{dist, d1 + d2} |
9 | endfor |
10 | if (dist ≤ l_uv) then |
11 | terminate the processing of v |
12 | endif |
13 | /*v survives the pruning check*/ |
14 | add (u, l_uv) into L_in(v) |
15 | endif |
16 | for each vertex, v, being visited when performing reverse BFS from u |
17 | if (v ∈ S) then |
18 | l_v ← the distance from v to u by reverse BFS |
19 | dist ← +∞ |
20 | foreach ((w, d1) ∈ L_in(u)) do |
21 | if ((w, d2) ∈ L_out(v)) then dist ← min{dist, d1 + d2} |
22 | endfor |
23 | if (dist ≤ l_v) then |
24 | terminate the processing of v |
25 | endif |
26 | /*v survives the pruning check*/ |
27 | add (u, l_v) into L_out(v) |
28 | endif |
29 | endfor |
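The pruned construction can be sketched as follows (a simplified sketch under our own naming, not the paper's code: out_adj/in_adj are adjacency dicts, S is the vertex cover as a list fixing the processing order, and labels are dicts from hop vertex to distance; on a prune, the sketch also stops expanding the pruned vertex, in the style of pruned landmark labeling):

```python
from collections import deque

def query_dist(L_out_u, L_in_v):
    # distance through the best common hop vertex (+inf if none)
    best = float('inf')
    for w, d1 in L_out_u.items():
        if w in L_in_v:
            best = min(best, d1 + L_in_v[w])
    return best

def build_labels(out_adj, in_adj, S):
    L_in = {v: {} for v in out_adj}    # L_in[v]: hop w -> dist(w, v)
    L_out = {v: {} for v in out_adj}   # L_out[v]: hop w -> dist(v, w)

    def bfs(u, adj, labelled_dist, label):
        dist, q = {u: 0}, deque([u])
        while q:
            x = q.popleft()
            if x == u:
                label[x][u] = 0
            elif x in S:
                if labelled_dist(x) <= dist[x]:
                    continue            # pruned: the labels so far suffice
                label[x][u] = dist[x]
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)

    for u in S:
        # forward BFS fills in-labels; reverse BFS fills out-labels
        bfs(u, out_adj, lambda x: query_dist(L_out[u], L_in[x]), L_in)
        bfs(u, in_adj, lambda x: query_dist(L_out[x], L_in[u]), L_out)
    return L_out, L_in
```

Only vertices of S (plus the BFS root itself) ever receive label entries, matching lines 4 and 17 of Algorithm 2.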
Analysis: Let L be the number of entries in the longest distance label. The cost of lines 6–10 is O(L); thus, the time complexity of Algorithm 2 is O(|S|(|V| + |E|)L). Additionally, the size of the distance label is O(|S|L). In practice, S is usually much smaller than V; thus, it is reasonable that, based on our approach, the index size and index construction time could be reduced, which is verified by our experimental results in Section 4.
Query Answering: With the two-hop distance label constructed with Algorithm 2, we answer a kHRQ using Algorithm 3. There are four cases that we need to consider, as indicated by Equation (3) and Algorithm 3.
Case 1: both u and v belong to the vertex cover, S (line 1 of Algorithm 3). In this case, we answer this query directly by comparing the two-hop distance labels of u and v to obtain the length of the path from u to v as indicated by Equation (1) and function query in Algorithm 3.
Case 2: u belongs to S, but v does not (line 3). In this case, we need to answer for each in-neighbor, w, of v whether u can reach w in k − 1 steps, as indicated by lines 4–9. If there exists an in-neighbor, w, such that query(u, w, k − 1) returns TRUE, then we know that the given query returns TRUE in line 6; otherwise, the result is FALSE.
Case 3: v belongs to S, but u does not (line 10). In this case, we need to answer for each out-neighbor, w, of u whether w can reach v in k − 1 steps, as indicated by lines 11–16. If there exists an out-neighbor, w, such that query(w, v, k − 1) returns TRUE, then we know that the given query returns TRUE in line 13; otherwise, the result is FALSE.
Algorithm 3 VCRea (u, v, k) |
1 | Case 1: (u ∈ S and v ∈ S) |
2 | return query(u, v, k) |
3 | Case 2: (u ∈ S and v ∉ S) |
4 | foreach (w ∈ N_in(v)) do |
5 | if (query(u, w, k − 1) = TRUE) then |
6 | return TRUE |
7 | endif |
8 | endfor |
9 | return FALSE |
10 | Case 3: (u ∉ S and v ∈ S) |
11 | foreach (w ∈ N_out(u)) do |
12 | if (query(w, v, k − 1) = TRUE) then |
13 | return TRUE |
14 | endif |
15 | endfor |
16 | return FALSE |
17 | Case 4: (u ∉ S and v ∉ S) |
18 | foreach (s ∈ N_out(u)) do |
19 | foreach (t ∈ N_in(v)) do |
20 | if (query(s, t, k − 2) = TRUE) then |
21 | return TRUE |
22 | endif |
23 | endfor |
24 | endfor |
25 | return FALSE |
26 | Function query(u, v, k) |
27 | dist ← +∞ |
28 | foreach ((w, d1) ∈ L_out(u)) do |
29 | if ((w, d2) ∈ L_in(v)) then |
30 | dist ← min{dist, d1 + d2} |
31 | endif |
32 | endfor |
33 | if (dist ≤ k) then |
34 | return TRUE |
35 | endif |
36 | return FALSE |
Case 4: both u and v do not belong to S (line 17). In this case, we need to check for each pair of vertices, s and t, whether s can reach t in k − 2 steps, where s is an out-neighbor of u and where t is an in-neighbor of v, as indicated by lines 18–25. If there exists a vertex pair such that query(s, t, k − 2) returns TRUE, then we know that the given query returns TRUE in line 21; otherwise, the result is FALSE.
Example 3: Consider the graph in Figure 1. Figure 2c shows its two-hop label. As both d and i belong to S, a query between d and i corresponds to Case 1, which can be answered by first computing the distance from d to i using Formula (1). Since the distance is 4 by Formula (1), we know that the result is FALSE for any k less than four. A query whose source belongs to S but whose target does not corresponds to Case 2. Similarly, a query whose target belongs to S but whose source does not corresponds to Case 3, and a query from a to f corresponds to Case 4, which needs to call function query() twice, due to the fact that the out-degree of a is one and the in-degree of f is two. Let L be the number of entries in the longest two-hop label; the cost of function query is O(L). Therefore, for Algorithm 3, the time complexity is O(dmax² · L), where dmax is the maximum in-degree (out-degree) among all the vertices.
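The four-case dispatch of Algorithm 3 can be sketched as follows (our own sketch, not the paper's code; labels are dicts from hop vertex to distance, and since S is a vertex cover, the recursion bottoms out after at most one neighbor expansion per endpoint):

```python
def label_dist(L_out_u, L_in_v):
    # length of the best path through a common hop vertex (inf if none)
    return min((d1 + L_in_v[w] for w, d1 in L_out_u.items() if w in L_in_v),
               default=float('inf'))

def vc_reach(u, v, k, S, out_adj, in_adj, L_out, L_in):
    if k < 0:
        return False
    if u in S and v in S:                                  # Case 1
        return label_dist(L_out[u], L_in[v]) <= k
    if u in S:                                             # Case 2
        return any(vc_reach(u, w, k - 1, S, out_adj, in_adj, L_out, L_in)
                   for w in in_adj[v])
    if v in S:                                             # Case 3
        return any(vc_reach(w, v, k - 1, S, out_adj, in_adj, L_out, L_in)
                   for w in out_adj[u])
    return any(vc_reach(s, t, k - 2, S, out_adj, in_adj, L_out, L_in)
               for s in out_adj[u] for t in in_adj[v])     # Case 4
```

The hop budget shrinks by one per expanded neighbor, mirroring the k − 1 and k − 2 arguments in Algorithm 3.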
3.2.3. The Degree-Constraint-Based Two-Hop Label
Although the index size of Algorithm 2 is small, the cost of query answering is expensive, especially when both query vertices have a large degree and do not belong to the vertex cover, S. In fact, the probability that both of the two query vertices do not belong to S is high in practice, which means that the baseline query algorithm, i.e., Algorithm 3, may not be as efficient as expected.
Table 1 shows the ratio of the vertex cover size to that of V and the probabilities that only one query vertex belongs to S and that both query vertices do not belong to S, respectively. From Table 1, we know that the vertex cover could sometimes be small, especially when the graph is sparse, which means that the probability that both query vertices do not belong to S could be large in practice. For example, for the twitter dataset, the ratio of the vertex cover size to that of V is 2.33%. In this case, the probability that only one query vertex belongs to S is 4.56%, and the probability that both query vertices do not belong to S is 95.39%, which means that, in most cases, both query vertices do not belong to S; that is, Case 4 will be processed frequently when calling Algorithm 3 to answer queries. Hence, we need to reduce the cost of Case 4 in Algorithm 3 to improve the overall query performance.
Different from the existing approach [14], which chooses partial hop vertices to construct a two-hop distance label, our vertex cover is the smallest vertex set that can be used to construct a two-hop distance label to answer all queries without graph traversal. To reduce the cost of Case 4, we propose degree-constraint heuristics to guide the construction of the two-hop distance label. The basic idea is that we do not enlarge the vertex cover but construct a two-hop distance label for more vertices of V \ S when their out- or in-degree is greater than a given threshold, d. In this way, for a given kHRQ, even if both of the two query vertices do not belong to S, we can still directly answer it by calling function query() once if the degrees of both query vertices are greater than d. For the worst case, where both query vertices do not belong to S and their degrees are no greater than d, i.e., Case 4, function query() will be called at most d² times to obtain the result of the query.
The Algorithm: Algorithm 4 shows how to construct the degree-constraint two-hop distance label. Compared with Algorithm 2, which generates a two-hop distance label only for the vertices of S, Algorithm 4 additionally generates a two-hop distance label for vertices with an out- or in-degree greater than the given threshold, d, even if they do not belong to S, as indicated by lines 4 and 17. Specifically, when performing BFS from u, we add (u, l_uv) to the in-label, L_in(v), of each visited vertex, v, if v belongs to S or the in-degree of v is greater than d (lines 4 and 14). Similarly, when performing reverse BFS from u, we add (u, l_v) to the out-label, L_out(v), of each visited vertex, v, if v belongs to S or the out-degree of v is greater than d (lines 17 and 27). For example, when d = 1, Algorithm 4 generates the two-hop distance label shown in Figure 2d. Compared with the baseline two-hop distance label shown in Figure 2c with respect to Algorithm 2, a new label is constructed only for vertex f. However, in this case, the benefit is obvious; we can guarantee that, for any query, function query() is called only once.
Algorithm 4 genVCIndex+ |
1 | S ← CompVC(G) /*Algorithm 1*/ |
2 | foreach (vertex u ∈ S) do |
3 | for each vertex, v, being visited when performing BFS from u |
4 | if (v ∈ S or |N_in(v)| > d) then |
5 | l_uv ← the distance from u to v by BFS |
6 | dist ← +∞ |
7 | foreach ((w, d1) ∈ L_out(u)) do |
8 | if ((w, d2) ∈ L_in(v)) then dist ← min{dist, d1 + d2} |
9 | endfor |
10 | if (dist ≤ l_uv) then |
11 | terminate the processing of v |
12 | endif |
13 | /*v survives the pruning check*/ |
14 | add (u, l_uv) into L_in(v) |
15 | endif |
16 | for each vertex, v, being visited when performing reverse BFS from u |
17 | if (v ∈ S or |N_out(v)| > d) then |
18 | l_v ← the distance from v to u by reverse BFS |
19 | dist ← +∞ |
20 | foreach ((w, d1) ∈ L_in(u)) do |
21 | if ((w, d2) ∈ L_out(v)) then dist ← min{dist, d1 + d2} |
22 | endfor |
23 | if (dist ≤ l_v) then |
24 | terminate the processing of v |
25 | endif |
26 | /*v survives the pruning check*/ |
27 | add (u, l_v) into L_out(v) |
28 | endif |
29 | endfor |
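The only change relative to Algorithm 2 is the admission condition in lines 4 and 17, which might be sketched as follows (an assumption on our part: the threshold is compared against the degree on the same side as the label being built):

```python
# Hypothetical admission checks for the degree-constraint label (lines 4
# and 17 of Algorithm 4): a visited vertex receives an entry when it lies
# in the cover S or its degree on the relevant side exceeds the threshold d.
def wants_in_label(v, S, in_adj, d):
    # checked during a forward BFS before adding to L_in(v)
    return v in S or len(in_adj[v]) > d

def wants_out_label(v, S, out_adj, d):
    # checked during a reverse BFS before adding to L_out(v)
    return v in S or len(out_adj[v]) > d
```

Everything else (the BFS, the pruning in lines 6–11 and 19–24) is unchanged from Algorithm 2.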
Let L be the number of entries in the longest distance label. The cost of lines 6–10 is O(L); thus, for Algorithm 4, the time complexity is O(|S′|(|V| + |E|)L). Additionally, the size of the distance label is O(|S′|L), where S′ = S ∪ S1 ∪ S2; S1 is the set of vertices, each of which has an out-degree > d, and S2 is the set of vertices, each of which has an in-degree > d. As |S| ≤ |S′| ≤ |V|, the degree-constraint two-hop distance label achieves higher query efficiency at the price of a larger index size than that of Algorithm 2.
Query Answering: Based on the two-hop distance label constructed with Algorithm 4, we answer queries using Algorithm 5. Compared with the baseline algorithm, i.e., Algorithm 3, Algorithm 5 also answers a query in four cases, as indicated by Equation (4). The difference lies in that, in each case, we not only check whether u or v belongs to S, but we also check whether the number of out-neighbors (in-neighbors) of u (v) is greater than the predefined threshold, d (lines 1 and 2). In this way, we enlarge the probability that the given query is answered with the first three cases, which, in turn, reduces the probability of Case 4. It is worth noting that, when d = 0, we can answer any query by calling function query() once, which is as efficient as generating a two-hop distance label for all the vertices but has a smaller index size. When d > 0, to answer the given query, Algorithm 5 will call function query() once for Case 1, at most d times for Case 2 and Case 3, and at most d² times for Case 4. Since the probability of Case 4 in Algorithm 5 is small compared with that in Algorithm 3, the query performance can be improved, as shown by our experimental results.
Algorithm 5 VCRea+ (u, v, k, d) |
1 | c_u ← (u ∈ S or |N_out(u)| > d) /*u has an out-label*/ |
2 | c_v ← (v ∈ S or |N_in(v)| > d) /*v has an in-label*/ |
3 | Case 1: (c_u and c_v) |
4 | return query(u, v, k) |
5 | Case 2: (c_u and not c_v) |
6 | foreach (w ∈ N_in(v)) do |
7 | if (query(u, w, k − 1) = TRUE) then |
8 | return TRUE |
9 | endif |
10 | endfor |
11 | return FALSE |
12 | Case 3: (c_v and not c_u) |
13 | foreach (w ∈ N_out(u)) do |
14 | if (query(w, v, k − 1) = TRUE) then |
15 | return TRUE |
16 | endif |
17 | endfor |
18 | return FALSE |
19 | Case 4: (not c_u and not c_v) |
20 | foreach (s ∈ N_out(u)) do |
21 | foreach (t ∈ N_in(v)) do |
22 | if (query(s, t, k − 2) = TRUE) then |
23 | return TRUE |
24 | endif |
25 | endfor |
26 | endfor |
27 | return FALSE |
Example 4 continues Example 3. The query that corresponds to Case 2 in Example 3 corresponds to Case 1 here, due to the degree-constraint two-hop distance label. Similarly, the query that corresponds to Case 4 in Example 3 corresponds to Case 3 here. Additionally, as d = 1, each case of Algorithm 5 will call function query() at most once for any query. Therefore, compared with Algorithm 3, the number of calls of function query() can be reduced with Algorithm 5.
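Algorithm 5's dispatch differs from Algorithm 3 only in the two flags computed in its lines 1 and 2; a sketch (again under our own naming, with hypothetical labels as dicts from hop vertex to distance):

```python
def label_dist(L_out_u, L_in_v):
    # length of the best path through a common hop vertex (inf if none)
    return min((d1 + L_in_v[w] for w, d1 in L_out_u.items() if w in L_in_v),
               default=float('inf'))

def vc_reach_plus(u, v, k, d, S, out_adj, in_adj, L_out, L_in):
    if k < 0:
        return False
    has_out = u in S or len(out_adj[u]) > d   # u carries an out-label
    has_in = v in S or len(in_adj[v]) > d     # v carries an in-label
    if has_out and has_in:                                 # Case 1
        return label_dist(L_out[u], L_in[v]) <= k
    if has_out:                                            # Case 2
        return any(vc_reach_plus(u, w, k - 1, d, S, out_adj, in_adj,
                                 L_out, L_in) for w in in_adj[v])
    if has_in:                                             # Case 3
        return any(vc_reach_plus(w, v, k - 1, d, S, out_adj, in_adj,
                                 L_out, L_in) for w in out_adj[u])
    return any(vc_reach_plus(s, t, k - 2, d, S, out_adj, in_adj,
                             L_out, L_in)
               for s in out_adj[u] for t in in_adj[v])     # Case 4
```

With a larger threshold d, more vertices fail the flag checks, so the choice of d trades index size against how often the neighbor-expansion cases fire.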