Our index is constructed in two steps. First, we compute the vertex cover, S. Second, we process each vertex of S to construct two-hop labels. Below, we discuss the details.
3.2.1. Vertex Cover Computation
The vertex cover, S, in our approach should satisfy two conditions: (1) When selecting a vertex and adding it into S in each iteration, the selected vertex should be the one covering the largest number of uncovered edges. In this way, we can obtain a smaller vertex cover to reduce the two-hop index size. Assuming that, after a vertex is processed, all the edges incident to it will be deleted, we know that the number of uncovered edges with respect to a vertex is its current degree. Therefore, the selected vertex in each iteration is the one with the largest degree among all the remaining vertices in V\S, i.e., a vertex u's degree is the number of its out- and in-neighbors that do not belong to S. (2) It should be computed efficiently. If we want to satisfy the first condition, it means that, after choosing a vertex, u, and adding it to S, we need to change the degrees of u's out- and in-neighbors and then perform a sorting operation so that, in the next iteration, we can find the vertex that has the largest degree on the fly. If we maintain this order using a priority queue, the cost of computing the vertex cover is O((|V| + |E|)log|V|). Here, we propose a novel approach which guarantees that a vertex cover satisfying the first condition can be computed in linear time, i.e., O(|V| + |E|).
We first sort all the vertices in descending order with respect to their degrees and use an array, A, to maintain the sorting result. For each degree, d, there may exist many vertices with that same degree in A. We use a hash table, H, to record, for the vertices with the same degree, d, the position of the last such vertex in A (the subscript of the last vertex). For example, for G in Figure 3a, the array, A, and the hash table, H, are shown in Figure 3a. For G, there are two vertices with a degree = 4, i.e., d and g. H[4] = 1 means that, in A, for the vertices with a degree = 4, the position of the last one is one, denoting vertex g. When we take a vertex, u, and add it to S, we need to update the degrees of u's neighbors and then adjust their positions in A and update H.
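As an illustration, the bookkeeping structures above can be sketched as follows (a hypothetical sketch, not the paper's code; the function name build_degree_index and the dict-based input are our own):

```python
# Sketch of the bookkeeping structures: A keeps the vertices sorted by
# degree in descending order, pos records each vertex's subscript in A,
# and H[d] records the subscript in A of the LAST vertex with degree d.
def build_degree_index(degrees):
    # degrees: dict mapping each vertex to its degree
    A = sorted(degrees, key=lambda v: degrees[v], reverse=True)
    pos = {v: i for i, v in enumerate(A)}
    H = {}
    for i, v in enumerate(A):
        H[degrees[v]] = i   # overwritten until the last vertex of the block
    return A, pos, H
```

Because A is sorted, the vertices of each degree occupy one contiguous block, and H marks the end of every block.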
Theorem 2. Let u be the vertex with the largest degree. The operation of removing u from the given graph, G, and sorting all the remaining vertices according to their new degrees can be performed with a cost of O(deg(u)), where deg(u) is the number of u's in- and out-neighbors.
Proof of Theorem 2. Let v be one of u’s in- or out-neighbors, and let I(v) be the set of vertices in A that have the same degree as v. When we remove u from G, v’s degree will decrease by one (Case 1) or by two (Case 2). Case 1 means that exactly one of the edges (u, v) and (v, u) exists, and Case 2 means that both (u, v) and (v, u) exist.
Consider Case 1, and let d = deg(v). Before deleting u, all the vertices in A are sorted by degree. Because the vertex, w, placed immediately after the last vertex of I(v) in A has a degree satisfying deg(w) ≤ d − 1, we only need to swap v and the last vertex of I(v), which guarantees that, after u is deleted, all the vertices of I(v) still have the same degree and that their positions in A remain continuous. After the adjustment, we update the hash table by reducing the end value of degree d by one, i.e., H[d] = H[d] − 1, denoting that the size of I(v) is reduced by one. Then, if the new degree of v satisfies d − 1 > deg(w), it means that v starts a new degree group and that we need to add a new entry into H, i.e., H[d − 1] = H[d] + 1, the position to which v was swapped. Therefore, the cost of processing neighbor, v, is O(1).
Consider Case 2. It equals performing Case 1 twice; therefore, it can also be performed in O(1) time.
As a result, the overall cost of processing u is O(deg(u)). □
The Algorithm: Based on Theorem 2, we have a greedy approach shown by Algorithm 1, which sorts all the vertices by degree in line 1, initializes the hash table and the vertex cover in lines 2–3, and then computes the vertex cover in lines 4–15 by sequentially scanning array A until all the edges are removed from the graph. In each iteration (lines 4–15), Algorithm 1 sequentially fetches a vertex, u, from A and adds it to the vertex cover, S, in lines 5–6. In lines 7–14, we process each of u's neighbors, v, according to Theorem 2. First, we swap v and the last vertex that has the same degree as v in line 8. In line 9, we update the hash table by reducing the end value of H[deg(v)] by one, denoting that the number of vertices with a degree = deg(v) is reduced by one. In line 10, we reduce the degree of v by one. At last, we remove the edge between u and v in lines 12–13.
Algorithm 1 CompVC |
1 | sort all vertices in array A by degree in decreasing order |
2 | initialize the end value H[d] of each degree d in hash table H |
3 | S ← ∅ /*S is the vertex cover*/ |
4 | while (E ≠ ∅) do |
5 | u ← the next vertex in A whose degree is greater than zero |
6 | S ← S ∪ {u} |
7 | foreach edge, e, incident to u, with v being the other endpoint of e, do |
8 | last ← H[deg(v)]; swap v and A[last] /*last = the subscript in A of the last vertex with degree deg(v)*/ |
9 | H[deg(v)] ← H[deg(v)] − 1 |
10 | deg(v) ← deg(v) − 1 |
11 | if (deg(v) > deg(A[last + 1])) then H[deg(v)] ← last |
12 | if (e = (u, v)) then E ← E \ {(u, v)} |
13 | else E ← E \ {(v, u)} |
14 | endfor |
15 | endwhile |
16 | return S |
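To make the procedure concrete, here is a runnable sketch of the greedy idea behind Algorithm 1 (our own code, not the authors' implementation; for simplicity it assumes a simple graph given as undirected vertex pairs, whereas the paper handles the two directed edge cases of Theorem 2 separately):

```python
from collections import defaultdict

def comp_vc(edges):
    # Build an undirected view: deg[v] counts uncovered edges incident to v.
    deg = defaultdict(int)
    adj = defaultdict(set)
    for u, v in edges:
        deg[u] += 1; deg[v] += 1
        adj[u].add(v); adj[v].add(u)
    # A: vertices sorted by degree (descending); pos[v]: v's index in A;
    # H[d]: index in A of the last vertex whose current degree is d.
    A = sorted(deg, key=lambda x: deg[x], reverse=True)
    pos = {v: i for i, v in enumerate(A)}
    H = {}
    for i, v in enumerate(A):
        H[deg[v]] = i
    remaining = len(edges)
    S, i = [], 0
    while remaining > 0:
        u = A[i]; i += 1
        if deg[u] == 0:
            continue
        S.append(u)
        remaining -= deg[u]          # all of u's uncovered edges get covered
        for v in adj[u]:
            adj[v].discard(u)
            d = deg[v]
            last = H[d]              # last slot of the degree-d block
            w = A[last]
            A[pos[v]], A[last] = w, v        # swap v to the end of its block
            pos[w], pos[v] = pos[v], last
            H[d] = last - 1                  # the degree-d block shrinks
            deg[v] = d - 1
            if H.get(d - 1, -1) < last:      # v starts a (possibly new) block
                H[d - 1] = last
        deg[u] = 0
        adj[u].clear()
    return S
```

Each neighbor update is one swap plus two hash-table updates, which is the O(1) cost established by Theorem 2.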
Example 2: Consider G in Figure 3a. The initial status of the corresponding array, A, and the hash table, H, is shown in Figure 3a. Since all the vertices in A are sorted by degree, the first vertex to be processed is d. When processing d, we first add it to the vertex cover, S, then process its neighbors in order. For the neighbor f, deg(f) = 2, and we first compute last = H[2] and then swap vertex i and f in line 8 of Algorithm 1. After that, we update the hash value H[2] to five, due to the fact that the number of vertices with a degree = 2 is reduced by one. At last, we update f′s degree to one and remove the edge between d and f from E. The other three neighbors of d are processed similarly to f. After processing all of d’s neighbors, the status of G, A, and H is shown in Figure 3b. The next vertex to be processed is g. The processing is similar to that of d. After processing g, the status of G, A, and H is shown in Figure 3c. The following processing is similar. After the remaining vertices are processed in the same way, E becomes empty, and the computed vertex cover, S, is returned.
As a comparison, kReach [10] can also compute a vertex cover in linear time. It randomly selects an edge, (u, v), rather than a vertex, in each iteration, adds both u and v into S, and then deletes all the edges incident to u and v. This operation repeats until all the edges are removed from G. For G in Figure 1, depending on the order in which the edges are chosen, the vertex cover computed in this way can be larger than the result of our approach.
Analysis: For Algorithm 1, we use an array, A, to maintain all the vertices, and its size is O(|V|). Since the maximum degree, dmax, is much less than |V|, we know that, for Algorithm 1, the space complexity is O(|V|).
Consider the time complexity. The cost of the sorting operation by degree in line 1 is O(|V|) using counting sort, with a space cost of O(dmax). The cost of initializing the hash table, H, in line 2 is also O(|V|), due to sequentially scanning A once. Since the cost of each iteration of lines 4–15 to process a vertex, u, is O(deg(u)) by Theorem 2, the time cost over all the vertices is O(|E|). Hence, for Algorithm 1, the time complexity is O(|V| + |E|).
It is worth noting that kReach [10] also computes a vertex cover in linear time, O(|V| + |E|). Here, we give an intuitive comparison between the two approaches. The common feature is that they both can obtain a vertex cover in linear time and that the result of both approaches is a two-approximate minimum vertex cover. The difference lies in two aspects. First, the size of the vertex cover, S, of kReach could be large in practice due to the fact that it randomly selects an edge, rather than a vertex, in each iteration. We see in Section 4 that our approach usually generates a vertex cover much smaller than that of kReach. Second, as pointed out by [16], processing vertices in descending order with respect to their degrees is usually a better choice to generate an index with a smaller size; however, kReach cannot guarantee that the vertices in S always have a larger degree. For example, for G in Figure 1, the vertex cover computed by kReach contains vertex j, whose degree is one; as a comparison, the vertex cover computed by our approach contains f instead, and the degree of f is two, which is greater than that of j.
3.2.2. The Baseline Two-Hop Distance Label
In this section, we discuss how to build the smallest two-hop distance label based on the vertex cover, S. Specifically, our approach takes the vertices of S as hop vertices and only generates a two-hop distance label for the vertices in S as shown by Algorithm 2.
In Algorithm 2, we process all the vertices in the vertex cover, S, to generate a two-hop distance label. It first calls Algorithm 1 in line 1 to compute the vertex cover, S. In lines 2–29, it processes all the vertices of S in the same order as they entered S. For each vertex, u, Algorithm 2 performs BFS in lines 3–15 and reverse BFS in lines 16–28. During the BFS and reverse BFS of u, it only constructs a two-hop distance label for the vertices that belong to S. Specifically, for each vertex, v, being visited when performing BFS, we first check whether v belongs to S in line 4. If v ∈ S (line 4 holds), we first obtain the distance, l_uv, from u to v through BFS in line 5. Then, we compute the distance from u to v in lines 6–10 using the current distance labels. If the distance computed from the current labels is no greater than l_uv (line 10 holds), it means that v is useless when computing the lengths of the paths from u. Thus, we do not need to process v anymore and terminate the processing of v in line 11. Otherwise, we add the distance entry, (u, l_uv), into the label L_in(v) in line 14. After performing BFS in lines 3–15, we perform reverse BFS from u in lines 16–28. In line 17, we first check whether the currently visited vertex, v, belongs to S. If yes (line 17 holds), we first obtain the length, l_v, of the path from v to u in line 18. Then, in lines 19–23, we check whether the distance computed from the current labels is no greater than l_v. If line 23 holds, it means that v is useless. Therefore, we terminate the processing of v in line 24. Otherwise, we add the distance entry, (u, l_v), into the distance label L_out(v) in line 27. For example, given the vertex processing order determined by Algorithm 1, we obtain the two-hop distance label shown in Figure 2c.
Algorithm 2 genVCIndex |
1 | S ← CompVC(G) /*Algorithm 1*/ |
2 | foreach (vertex u ∈ S) do |
3 | for each vertex, v, being visited when performing BFS from u |
4 | if (v ∈ S) then |
5 | l_uv ← the distance from u to v by BFS |
6 | dist ← +∞ |
7 | foreach ((w, d1) ∈ L_out(u)) do |
8 | if ((w, d2) ∈ L_in(v)) then dist ← min{dist, d1 + d2} |
9 | endfor |
10 | if (dist ≤ l_uv) then |
11 | terminate the processing of v |
12 | endif |
13 | /*v survives the pruning check*/ |
14 | add (u, l_uv) into L_in(v) |
15 | endif |
16 | for each vertex, v, being visited when performing reverse BFS from u |
17 | if (v ∈ S) then |
18 | l_v ← the distance from v to u by reverse BFS |
19 | dist ← +∞ |
20 | foreach ((w, d1) ∈ L_in(u)) do |
21 | if ((w, d2) ∈ L_out(v)) then dist ← min{dist, d1 + d2} |
22 | endfor |
23 | if (dist ≤ l_v) then |
24 | terminate the processing of v |
25 | endif |
26 | /*v survives the pruning check*/ |
27 | add (u, l_v) into L_out(v) |
28 | endif |
29 | endfor |
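The pruned construction can be sketched as follows (a simplified sketch under our own naming, not the paper's code: out_adj/in_adj are adjacency dicts, S is the vertex cover as a list fixing the processing order, and labels are dicts from hop vertex to distance; on a prune, the sketch also stops expanding the pruned vertex, in the style of pruned landmark labeling):

```python
from collections import deque

def query_dist(L_out_u, L_in_v):
    # distance through the best common hop vertex (+inf if none)
    best = float('inf')
    for w, d1 in L_out_u.items():
        if w in L_in_v:
            best = min(best, d1 + L_in_v[w])
    return best

def build_labels(out_adj, in_adj, S):
    L_in = {v: {} for v in out_adj}    # L_in[v]: hop w -> dist(w, v)
    L_out = {v: {} for v in out_adj}   # L_out[v]: hop w -> dist(v, w)

    def bfs(u, adj, labelled_dist, label):
        dist, q = {u: 0}, deque([u])
        while q:
            x = q.popleft()
            if x == u:
                label[x][u] = 0
            elif x in S:
                if labelled_dist(x) <= dist[x]:
                    continue            # pruned: the labels so far suffice
                label[x][u] = dist[x]
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    q.append(y)

    for u in S:
        # forward BFS fills in-labels; reverse BFS fills out-labels
        bfs(u, out_adj, lambda x: query_dist(L_out[u], L_in[x]), L_in)
        bfs(u, in_adj, lambda x: query_dist(L_out[x], L_in[u]), L_out)
    return L_out, L_in
```

Only vertices of S (plus the BFS root itself) ever receive label entries, matching lines 4 and 17 of Algorithm 2.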
Analysis: Let L be the number of entries in the longest distance label. The cost of lines 6–10 is O(L); thus, the time complexity of Algorithm 2 is O(|S|(|V| + |E|)L). Additionally, the size of the distance label is O(|S|L). In practice, S is usually much smaller than V; thus, it is reasonable that, based on our approach, the index size and index construction time could be reduced, which is verified by our experimental results in Section 4.
Query Answering: With the two-hop distance label constructed with Algorithm 2, we answer a kHRQ using Algorithm 3. There are four cases that we need to consider, as indicated by Equation (3) and Algorithm 3.
Case 1: both u and v belong to the vertex cover, S (line 1 of Algorithm 3). In this case, we answer this query directly by comparing the two-hop distance labels of u and v to obtain the length of the path from u to v as indicated by Equation (1) and function query in Algorithm 3.
Case 2: u belongs to S, but v does not (line 3). In this case, we need to answer for each in-neighbor, w, of v whether u can reach w in k − 1 steps, as indicated by lines 4–9. If there exists an in-neighbor, w, such that query(u, w, k − 1) returns TRUE, then we know that the given query returns TRUE in line 6; otherwise, the result is FALSE.
Case 3: v belongs to S, but u does not (line 10). In this case, we need to answer for each out-neighbor, w, of u whether w can reach v in k − 1 steps, as indicated by lines 11–16. If there exists an out-neighbor, w, such that query(w, v, k − 1) returns TRUE, then we know that the given query returns TRUE in line 13; otherwise, the result is FALSE.
Algorithm 3 VCRea (u, v, k) |
1 | Case 1: (u ∈ S and v ∈ S) |
2 | return query(u, v, k) |
3 | Case 2: (u ∈ S and v ∉ S) |
4 | foreach (w ∈ N_in(v)) do |
5 | if (query(u, w, k − 1) = TRUE) then |
6 | return TRUE |
7 | endif |
8 | endfor |
9 | return FALSE |
10 | Case 3: (u ∉ S and v ∈ S) |
11 | foreach (w ∈ N_out(u)) do |
12 | if (query(w, v, k − 1) = TRUE) then |
13 | return TRUE |
14 | endif |
15 | endfor |
16 | return FALSE |
17 | Case 4: (u ∉ S and v ∉ S) |
18 | foreach (s ∈ N_out(u)) do |
19 | foreach (t ∈ N_in(v)) do |
20 | if (query(s, t, k − 2) = TRUE) then |
21 | return TRUE |
22 | endif |
23 | endfor |
24 | endfor |
25 | return FALSE |
26 | Function query(u, v, k) |
27 | dist ← +∞ |
28 | foreach ((w, d1) ∈ L_out(u)) do |
29 | if ((w, d2) ∈ L_in(v)) then |
30 | dist ← min{dist, d1 + d2} |
31 | endif |
32 | endfor |
33 | if (dist ≤ k) then |
34 | return TRUE |
35 | endif |
36 | return FALSE |
Case 4: both u and v do not belong to S (line 17). In this case, we need to check for each pair of vertices, s and t, whether s can reach t in k − 2 steps, where s is an out-neighbor of u and where t is an in-neighbor of v, as indicated by lines 18–25. If there exists a vertex pair such that query(s, t, k − 2) returns TRUE, then we know that the given query returns TRUE in line 21; otherwise, the result is FALSE.
Example 3: Consider the graph in Figure 1. Figure 2c shows its two-hop label. As both d and i belong to S, a query between d and i corresponds to Case 1, which can be answered by first computing the distance from d to i using Formula (1). Since the distance is 4 by Formula (1), we know that the result is FALSE for any k less than four. A query whose source belongs to S but whose target does not corresponds to Case 2. Similarly, a query whose target belongs to S but whose source does not corresponds to Case 3, and a query from a to f corresponds to Case 4, which needs to call function query() twice, due to the fact that the out-degree of a is one and the in-degree of f is two. Let L be the number of entries in the longest two-hop label; the cost of function query is O(L). Therefore, for Algorithm 3, the time complexity is O(dmax² · L), where dmax is the maximum in-degree (out-degree) among all the vertices.
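The four-case dispatch of Algorithm 3 can be sketched as follows (our own sketch, not the paper's code; labels are dicts from hop vertex to distance, and since S is a vertex cover, the recursion bottoms out after at most one neighbor expansion per endpoint):

```python
def label_dist(L_out_u, L_in_v):
    # length of the best path through a common hop vertex (inf if none)
    return min((d1 + L_in_v[w] for w, d1 in L_out_u.items() if w in L_in_v),
               default=float('inf'))

def vc_reach(u, v, k, S, out_adj, in_adj, L_out, L_in):
    if k < 0:
        return False
    if u in S and v in S:                                  # Case 1
        return label_dist(L_out[u], L_in[v]) <= k
    if u in S:                                             # Case 2
        return any(vc_reach(u, w, k - 1, S, out_adj, in_adj, L_out, L_in)
                   for w in in_adj[v])
    if v in S:                                             # Case 3
        return any(vc_reach(w, v, k - 1, S, out_adj, in_adj, L_out, L_in)
                   for w in out_adj[u])
    return any(vc_reach(s, t, k - 2, S, out_adj, in_adj, L_out, L_in)
               for s in out_adj[u] for t in in_adj[v])     # Case 4
```

The hop budget shrinks by one per expanded neighbor, mirroring the k − 1 and k − 2 arguments in Algorithm 3.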
3.2.3. The Degree-Constraint-Based Two-Hop Label
Although the index size of Algorithm 2 is small, the cost of query answering is expensive, especially when both query vertices have a large degree and do not belong to the vertex cover, S. In fact, the probability that both of the two query vertices do not belong to S is high in practice, which means that the baseline query algorithm, i.e., Algorithm 3, may not be as efficient as expected.
Table 1 shows the ratio of the vertex cover size to that of V and the probabilities that only one query vertex belongs to S and that both query vertices do not belong to S, respectively. From Table 1, we know that the vertex cover could sometimes be small, especially when the graph is sparse, which means that the probability that both query vertices do not belong to S could be large in practice. For example, for the twitter dataset, the ratio of the vertex cover size to that of V is 2.33%. In this case, the probability that only one query vertex belongs to S is 4.56%, and the probability that both query vertices do not belong to S is 95.39%, which means that, in most cases, both query vertices do not belong to S; that is, Case 4 will be processed frequently when calling Algorithm 3 to answer queries. Hence, we need to reduce the cost of Case 4 in Algorithm 3 to improve the overall query performance.
Different from the existing approach [14], which chooses partial hop vertices to construct a two-hop distance label, our vertex cover is the smallest vertex set that can be used to construct a two-hop distance label to answer all queries without graph traversal. To reduce the cost of Case 4, we propose degree-constraint heuristics to guide the construction of the two-hop distance label. The basic idea is that we do not enlarge the vertex cover but construct a two-hop distance label for more vertices of V \ S when their out- or in-degree is greater than a given threshold, d. In this way, for a given kHRQ, even if both of the two query vertices do not belong to S, we can still directly answer it by calling function query() once if the degrees of both query vertices are greater than d. For the worst case, where both query vertices do not belong to S and their degrees are no greater than d, i.e., Case 4, function query() will be called at most d² times to obtain the result of the query.
The Algorithm: Algorithm 4 shows how to construct the degree-constraint two-hop distance label. Compared with Algorithm 2, which generates a two-hop distance label only for the vertices of S, Algorithm 4 additionally generates a two-hop distance label for vertices with an out- or in-degree greater than the given threshold, d, even if they do not belong to S, as indicated by lines 4 and 17. Specifically, when performing BFS from u, we add (u, l_uv) to the in-label, L_in(v), of each visited vertex, v, if v belongs to S or the in-degree of v is greater than d (lines 4 and 14). Similarly, when performing reverse BFS from u, we add (u, l_v) to the out-label, L_out(v), of each visited vertex, v, if v belongs to S or the out-degree of v is greater than d (lines 17 and 27). For example, when d = 1, Algorithm 4 generates the two-hop distance label shown in Figure 2d. Compared with the baseline two-hop distance label shown in Figure 2c with respect to Algorithm 2, a new label is constructed only for vertex f. However, in this case, the benefit is obvious; we can guarantee that, for any query, function query() is called only once.
Algorithm 4 genVCIndex+ |
1 | S ← CompVC(G) /*Algorithm 1*/ |
2 | foreach (vertex u ∈ S) do |
3 | for each vertex, v, being visited when performing BFS from u |
4 | if (v ∈ S or |N_in(v)| > d) then |
5 | l_uv ← the distance from u to v by BFS |
6 | dist ← +∞ |
7 | foreach ((w, d1) ∈ L_out(u)) do |
8 | if ((w, d2) ∈ L_in(v)) then dist ← min{dist, d1 + d2} |
9 | endfor |
10 | if (dist ≤ l_uv) then |
11 | terminate the processing of v |
12 | endif |
13 | /*v survives the pruning check*/ |
14 | add (u, l_uv) into L_in(v) |
15 | endif |
16 | for each vertex, v, being visited when performing reverse BFS from u |
17 | if (v ∈ S or |N_out(v)| > d) then |
18 | l_v ← the distance from v to u by reverse BFS |
19 | dist ← +∞ |
20 | foreach ((w, d1) ∈ L_in(u)) do |
21 | if ((w, d2) ∈ L_out(v)) then dist ← min{dist, d1 + d2} |
22 | endfor |
23 | if (dist ≤ l_v) then |
24 | terminate the processing of v |
25 | endif |
26 | /*v survives the pruning check*/ |
27 | add (u, l_v) into L_out(v) |
28 | endif |
29 | endfor |
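The only change relative to Algorithm 2 is the admission condition in lines 4 and 17, which might be sketched as follows (an assumption on our part: the threshold is compared against the degree on the same side as the label being built):

```python
# Hypothetical admission checks for the degree-constraint label (lines 4
# and 17 of Algorithm 4): a visited vertex receives an entry when it lies
# in the cover S or its degree on the relevant side exceeds the threshold d.
def wants_in_label(v, S, in_adj, d):
    # checked during a forward BFS before adding to L_in(v)
    return v in S or len(in_adj[v]) > d

def wants_out_label(v, S, out_adj, d):
    # checked during a reverse BFS before adding to L_out(v)
    return v in S or len(out_adj[v]) > d
```

Everything else (the BFS, the pruning in lines 6–11 and 19–24) is unchanged from Algorithm 2.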
Let L be the number of entries in the longest distance label. The cost of lines 6–10 is O(L); thus, for Algorithm 4, the time complexity is O(|S′|(|V| + |E|)L). Additionally, the size of the distance label is O(|S′|L), where S′ = S ∪ S1 ∪ S2; S1 is the set of vertices, each of which has an out-degree > d, and S2 is the set of vertices, each of which has an in-degree > d. As |S| ≤ |S′| ≤ |V|, the degree-constraint two-hop distance label achieves higher query efficiency at the price of a larger index size than that of Algorithm 2.
Query Answering: Based on the two-hop distance label constructed with Algorithm 4, we answer queries using Algorithm 5. Compared with the baseline algorithm, i.e., Algorithm 3, Algorithm 5 also answers a query in four cases, as indicated by Equation (4). The difference lies in that, in each case, we not only check whether u or v belongs to S, but we also check whether the number of out-neighbors (in-neighbors) of u (v) is greater than the predefined threshold, d (lines 1 and 2). In this way, we enlarge the probability that the given query is answered with the first three cases, which, in turn, reduces the probability of Case 4. It is worth noting that, when d = 0, we can answer any query by calling function query() once, which is as efficient as generating a two-hop distance label for all the vertices but has a smaller index size. When d > 0, to answer the given query, Algorithm 5 will call function query() once for Case 1, at most d times for Case 2 and Case 3, and at most d² times for Case 4. Since the probability of Case 4 in Algorithm 5 is small compared with that in Algorithm 3, the query performance can be improved, as shown by our experimental results.
Algorithm 5 VCRea+ (u, v, k, d) |
1 | c_u ← (u ∈ S or |N_out(u)| > d) /*u has an out-label*/ |
2 | c_v ← (v ∈ S or |N_in(v)| > d) /*v has an in-label*/ |
3 | Case 1: (c_u and c_v) |
4 | return query(u, v, k) |
5 | Case 2: (c_u and not c_v) |
6 | foreach (w ∈ N_in(v)) do |
7 | if (query(u, w, k − 1) = TRUE) then |
8 | return TRUE |
9 | endif |
10 | endfor |
11 | return FALSE |
12 | Case 3: (c_v and not c_u) |
13 | foreach (w ∈ N_out(u)) do |
14 | if (query(w, v, k − 1) = TRUE) then |
15 | return TRUE |
16 | endif |
17 | endfor |
18 | return FALSE |
19 | Case 4: (not c_u and not c_v) |
20 | foreach (s ∈ N_out(u)) do |
21 | foreach (t ∈ N_in(v)) do |
22 | if (query(s, t, k − 2) = TRUE) then |
23 | return TRUE |
24 | endif |
25 | endfor |
26 | endfor |
27 | return FALSE |
Example 4 continues Example 3. The query that corresponds to Case 2 in Example 3 corresponds to Case 1 here, due to the degree-constraint two-hop distance label. Similarly, the query that corresponds to Case 4 in Example 3 corresponds to Case 3 here. Additionally, as d = 1, each case of Algorithm 5 will call function query() at most once for any query. Therefore, compared with Algorithm 3, the number of calls of function query() can be reduced with Algorithm 5.
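Algorithm 5's dispatch differs from Algorithm 3 only in the two flags computed in its lines 1 and 2; a sketch (again under our own naming, with hypothetical labels as dicts from hop vertex to distance):

```python
def label_dist(L_out_u, L_in_v):
    # length of the best path through a common hop vertex (inf if none)
    return min((d1 + L_in_v[w] for w, d1 in L_out_u.items() if w in L_in_v),
               default=float('inf'))

def vc_reach_plus(u, v, k, d, S, out_adj, in_adj, L_out, L_in):
    if k < 0:
        return False
    has_out = u in S or len(out_adj[u]) > d   # u carries an out-label
    has_in = v in S or len(in_adj[v]) > d     # v carries an in-label
    if has_out and has_in:                                 # Case 1
        return label_dist(L_out[u], L_in[v]) <= k
    if has_out:                                            # Case 2
        return any(vc_reach_plus(u, w, k - 1, d, S, out_adj, in_adj,
                                 L_out, L_in) for w in in_adj[v])
    if has_in:                                             # Case 3
        return any(vc_reach_plus(w, v, k - 1, d, S, out_adj, in_adj,
                                 L_out, L_in) for w in out_adj[u])
    return any(vc_reach_plus(s, t, k - 2, d, S, out_adj, in_adj,
                             L_out, L_in)
               for s in out_adj[u] for t in in_adj[v])     # Case 4
```

With a larger threshold d, more vertices fail the flag checks, so the choice of d trades index size against how often the neighbor-expansion cases fire.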