1. Introduction
This paper considers the solvability of several fundamental state observation, synchronization, and graph computation problems in asynchronous message-passing distributed systems in the presence of Byzantine processes using distributed algorithms. In a distributed algorithm, each process has access only to its local variables and incident edge parameters (such as edge weight, edge cost, edge delay); its local variables are not accessible to any other process/node. We show the impossibility of solving these fundamental problems by proving that they require a solution to the causality determination problem [1,2,3], which has been shown to be unsolvable in asynchronous message-passing distributed systems with Byzantine processes [4].
In a seminal paper, Lamport formulated the “happened before” or causality relation, denoted →, between events in an asynchronous distributed system [5]. Given two events $e$ and $e'$, the causality determination (CD) problem asks to determine whether $e \rightarrow e'$. Examples of events from real-world applications include users making bids in an online auction, physical parameters like temperature and pH value reaching certain thresholds in a chemical manufacturing plant, and program variable values in a parallel program satisfying an application-specific predicate. In computing systems, applications of causality determination include determining consistent recovery points in distributed databases, deadlock detection, termination detection, distributed predicate detection, distributed debugging and monitoring, and the detection of race conditions and other synchronization errors [6]. It was shown in [4] that it is impossible to determine the causality or happened before relation → between two events $e$ and $e'$ when there is even a single Byzantine process in an asynchronous message-passing distributed system. False negatives and/or false positives are possible. A false negative means that $e \rightarrow e'$ whereas $e \nrightarrow e'$ is perceived/detected. A false positive means that $e \nrightarrow e'$ whereas $e \rightarrow e'$ is detected. Specifically, the following results were shown.
It is impossible to determine causality between events in the presence of even a single Byzantine process when e is a communication (send or receive) event and processes communicate by unicasting. This is because both false positives and false negatives can occur.
A similar impossibility result when processes communicate by broadcasting was shown. In this case, false positives cannot occur but false negatives can occur.
An impossibility result similar to the unicasting case was shown when processes communicate by multicasting. Both false positives and false negatives can occur.
We show that many problems in distributed computing in the presence of Byzantine processes, which might be locally solved at event(s) at individual processes but require another process to detect the occurrence of such event(s), are not solvable in asynchronous message-passing systems by showing that solving them requires solving the CD problem. This also establishes the CD problem as a fundamental first-class problem as all the other problems for which we show impossibility results inherently require causality determination between a pair of events. The occurrence of false negatives (false positives) in the CD problem manifests as the occurrence of liveness and safety violations in these problems. A direct implication of our results is that none of the many algorithms proposed to solve these problems over the past five decades for failure-free systems/crash failures can be adapted for Byzantine failures.
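We consider the following problems; the reader is referred to any standard textbook such as [6,7,8,9,10] for a collected source of algorithms to solve these problems in asynchronous message-passing failure-free systems/graphs.
Synchronization and state observation problems: distributed mutual exclusion, global snapshot recording, termination detection, distributed deadlock detection, distributed predicate detection, and causal ordering of messages.
Distributed graph problems: spanning tree construction, minimum spanning tree construction, all–all shortest paths construction, and maximal independent set construction.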
Solving these problems using distributed algorithms requires determining whether a causal path exists between two events $e$ and $e'$, where $e$ is an event at which a process finishes setting its local variables as a result of the distributed algorithm and $e'$ is an event at which an(other) process detects the global completion of the distributed algorithm in order to use or further process its result. Determining the existence of such a causal path in an execution of a Byzantine system is not solvable, as shown in [4]; hence, these problems are also not solvable.
Finally, we generalize our results and show that any problem which uses a distributed algorithm is subject to at least the same limitations as the CD problem in a Byzantine failure-prone system.
The area of distributed computing is known for many impossibility results, even for the more benign crash failure model, such as for the consensus problem in asynchronous systems [11]. As another example, it is known that mutual exclusion cannot be solved even in a crash-prone system, so the result also applies to Byzantine failures. Lynch [12,13] has given a hundred impossibility results in distributed computing. Other impossibility results have been given in [14,15]. These impossibility results identified several classes of more basic tasks or more elementary problems that need to be solved in order to solve these problems. However, none of these more basic tasks was identified as the task of causality determination between events. In our paper, the impossibility results for the problems we identify are related to the impossibility of solving the more basic task of causality determination between events.
Previously, Lynch [12,13] observed that (in the shared memory architecture) the inherent limitations are imposed by local knowledge. This observation, together with Chandy and Misra's results on how processes learn via message chains [16], hints at our results, which are in the context of Byzantine processes. While some of our results may not be very surprising, they nevertheless state and formalize an important outcome that has not been previously enunciated for a large number of important, real-world, and practical problems in asynchronous message-passing distributed systems subject to Byzantine failures. All these problems require relating the partial solutions of the problem at various processes to detecting at another process that these partial solutions have been reached.
Roadmap. Section 2 gives the system model.
Section 3 formulates the problem of determining causality.
Section 4 gives our main impossibility results about the solvability of basic problems using distributed algorithms in the Byzantine failure model.
Section 5 gives a discussion and concludes.
2. System Model
We consider an asynchronous distributed system having Byzantine processes, which are processes that can misbehave [17]. A correct process behaves exactly as specified by the algorithm whereas a Byzantine process may deviate arbitrarily from its protocol by exhibiting arbitrary behaviour at any point during the execution. A Byzantine process cannot impersonate another process or spawn new processes.
The distributed system is modeled as an undirected graph $G = (P, C)$. Here, P is the set of processes communicating asynchronously in the distributed system. Let $|P| = n$. C is the set of FIFO (logical) communication links over which processes communicate by message passing. The terms process and node (of the graph) are used interchangeably.
The distributed system is asynchronous, i.e., there is no fixed upper bound on the message latency, nor any fixed upper bound on the relative speeds of processors [18].
In this paper, we consider only distributed algorithms to solve various problems. A distributed algorithm is one in which each process has access only to its local variables and incident edge parameters; its local variables are not accessible to any other process/node. Exchange of variable values can be carried out explicitly through message-passing. The adjacent process may be Byzantine and hence, information received from it can corrupt local variables.
Let $e_i^x$, where $x \geq 1$, denote the $x$-th event executed by process $p_i$. An event may be an internal event, a message send event, or a message receive event. Let the state of $p_i$ after $e_i^x$ be denoted $s_i^x$, where $x \geq 1$, and let $s_i^0$ be the initial state. The execution at $p_i$ is the sequence of alternating events and resulting states, as $\langle e_i^1, s_i^1, e_i^2, s_i^2, \ldots \rangle$. The execution history at $p_i$ is the finite execution at $p_i$ up to the current or most recent or specified local state. The happened before [5] relation, denoted →, is an irreflexive, asymmetric, and transitive partial order defined over events in a distributed execution that is used to define causality.
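Definition 1. The happened before relation → on events consists of the following rules:
Program Order: For the sequence of events $\langle e_i^1, e_i^2, \ldots \rangle$ executed by process $p_i$, $\forall x, y$ such that $x < y$, we have $e_i^x \rightarrow e_i^y$.
Message Order: If event $e_i^x$ is a message send event executed at process $p_i$ and $e_j^y$ is the corresponding message receive event at process $p_j$, then $e_i^x \rightarrow e_j^y$.
Transitive Order: If $e \rightarrow e' \wedge e' \rightarrow e''$, then $e \rightarrow e''$.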
Definition 2. The causal past of an event e is denoted as $CP(e)$ and is defined as the set of events in E that causally precede e under →.
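As a minimal illustrative sketch (not taken from [5]), Definitions 1 and 2 can be exercised on a small recorded execution. The event encoding `(process, index)`, the `messages` list of send–receive pairs, and the function names `happened_before` and `causal_past` are assumptions introduced only for this example.

```python
from itertools import product


def happened_before(num_events, messages):
    """Return the set of ordered pairs (e, f) such that e -> f.

    num_events: dict mapping process id -> number of events it executed.
    messages:   list of (send_event, receive_event) pairs; an event is a
                tuple (process id, 1-based index).
    """
    events = [(p, x) for p, n in num_events.items() for x in range(1, n + 1)]
    hb = set()
    # Program order: consecutive events at the same process.
    for p, n in num_events.items():
        for x in range(1, n):
            hb.add(((p, x), (p, x + 1)))
    # Message order: each send precedes its matching receive.
    hb.update(messages)
    # Transitive order: Warshall-style closure (k is the outermost loop).
    for k, i, j in product(events, repeat=3):
        if (i, k) in hb and (k, j) in hb:
            hb.add((i, j))
    return hb


def causal_past(event, hb):
    """CP(event): every event that causally precedes `event` under ->."""
    return {e for (e, f) in hb if f == event}


# Process 1 sends at event (1, 1); process 2 receives at event (2, 2).
hb = happened_before({1: 2, 2: 2}, [((1, 1), (2, 2))])
cp = causal_past((2, 2), hb)
```

Here `cp` contains the send event and the earlier local event of process 2, while the second event of process 1 is concurrent with `(2, 2)` and therefore lies outside its causal past.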
3. The Causality Determination Problem Formulation
The problem formulation in this section is based on [4]. An algorithm to solve the causality determination problem collects the execution history of each process in the system and derives causal relations from it. Let $E_i$ denote the actual execution history at $p_i$ and let $E = \bigcup_i \{E_i\}$. For any causality determination algorithm, let $F_i$ be the execution history at $p_i$ as perceived and collected by the algorithm and let $F = \bigcup_i \{F_i\}$. F thus denotes the execution history as collected by the algorithm. Let $T(E)$ and $T(F)$ denote the sets of all events in E and F, respectively. Analogous to Definition 1, we can define the happened before relation on $T(F)$ instead of on $T(E)$.
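Let $e_h^x \rightarrow e_i^*|_E$ and $e_h^x \rightarrow e_i^*|_F$ be the evaluation (1 (true) or 0 (false)) of $e_h^x \rightarrow e_i^*$ using E and F, respectively. Byzantine processes may corrupt the collection of F to make it different from E. We assume that a correct process $p_i$ needs to determine whether $e_h^x \rightarrow e_i^*$ holds and $e_i^*$ is an event in $T(E)$. If $e_h^x \notin T(E)$, then $e_h^x \rightarrow e_i^*|_E$ evaluates to false. If $e_h^x \notin T(F)$ (or $e_i^* \notin T(F)$), then $e_h^x \rightarrow e_i^*|_F$ evaluates to false. We assume an oracle that is used for determining the correctness of the causality determination algorithm; this oracle has access to E, which can be any execution history such that $T(E) \supseteq CP(e_i^*)$. Byzantine processes may collude as follows.
To delete $e_h^x$ from $F_h$ or in general, record F as any alteration of E such that $e_h^x \rightarrow e_i^*|_F = 0$, while $e_h^x \rightarrow e_i^*|_E = 1$; or
To add a fake event $e_h^x$ in $F_h$ or in general, record F as any alteration of E such that $e_h^x \rightarrow e_i^*|_F = 1$, while $e_h^x \rightarrow e_i^*|_E = 0$.
Without loss of generality, we have that $e_h^x \in T(E) \cup T(F)$. Note that $e_h^x$ belongs to $T(F) \setminus T(E)$ when it is a fake event in F.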
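Definition 3. The causality determination problem $CD(E, F, e_i^*)$ for any event $e_i^* \in T(E)$ at a correct process $p_i$ is to devise an algorithm to collect the execution history E as F at $p_i$ such that $valid(F) = 1$, where
$$valid(F) = \begin{cases} 1 & \text{if } \forall e_h^x,\ e_h^x \rightarrow e_i^*|_E = e_h^x \rightarrow e_i^*|_F \\ 0 & \text{otherwise} \end{cases}$$
When one is returned, the algorithm output matches the actual truth and solves CD correctly. Thus, returning one indicates that the problem has been solved correctly by the algorithm using F. A value of 0 is returned if either
$\exists e_h^x$ such that $e_h^x \rightarrow e_i^*|_E = 1 \wedge e_h^x \rightarrow e_i^*|_F = 0$ (denoting a false negative, abbreviated FN); or
$\exists e_h^x$ such that $e_h^x \rightarrow e_i^*|_E = 0 \wedge e_h^x \rightarrow e_i^*|_F = 1$ (denoting a false positive, abbreviated FP).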
To determine whether CD is solved correctly, we have to evaluate $e_h^x \rightarrow e_i^*|_E$ even if $e_h^x \notin T(E)$ because such an $e_h^x$ is recorded by the algorithm as part of F. The key observation we make is that in CD, a single Byzantine process can cause F (as recorded by the algorithm) to be different from E.
An FN arises because a send–receive event pair of E in a causal chain from $e_h^x$ to $e_i^*$ is missing as per F. In addition, an FN may arise if $e_h^x$ is a receive event or an internal event, $e_h^x \in T(E) \setminus T(F)$.
An FP arises because a non-existent send–receive message pair in E appears in a causal chain from $e_h^x$ to $e_i^*$ as per F. In addition, an FP may arise if $e_h^x$ is an internal event, $e_h^x \in T(F) \setminus T(E)$.
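The outcome of Definition 3 can be sketched as an oracle-side check that compares the evaluation over E with the evaluation over F. This is only an illustration under assumptions: `hb_E` and `hb_F` are hypothetical sets of ordered event pairs standing for → on $T(E)$ and $T(F)$, and `classify` and `valid` are names introduced here, not part of [4].

```python
def classify(candidates, target, hb_E, hb_F):
    """Compare `e -> target` evaluated on E and on F for each candidate e."""
    outcome = {}
    for e in candidates:
        in_E = (e, target) in hb_E  # evaluation using E (oracle's view)
        in_F = (e, target) in hb_F  # evaluation using F (algorithm's view)
        if in_E and not in_F:
            outcome[e] = "FN"       # real causality that F misses
        elif in_F and not in_E:
            outcome[e] = "FP"       # causality in F that never happened
        else:
            outcome[e] = "ok"
    return outcome


def valid(candidates, target, hb_E, hb_F):
    """Return 1 if F agrees with E for every candidate event, else 0."""
    results = classify(candidates, target, hb_E, hb_F)
    return int(all(v == "ok" for v in results.values()))


# Hypothetical data: "a" really precedes "t" but was deleted from F;
# "b" is a fake event that F shows as preceding "t"; "c" is concurrent.
hb_E = {("a", "t")}
hb_F = {("b", "t")}
result = classify(["a", "b", "c"], "t", hb_E, hb_F)
```

Only the oracle, which has access to E, can run such a check; a correct process sees F alone, which is why it cannot tell the three cases apart.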
It has been proved in [4] that for send and receive events, solving the CD problem in asynchronous message-passing systems prone to Byzantine process failures is subject to both false positives and false negatives under the unicast and multicast modes of communication, and subject to false negatives under the broadcast mode of communication.
4. Impossibility Results
Consider the following class of problems. There are events $e_h^x$ at which local (possibly partial) solution(s) at $p_h$ are obtained but which require the detection of such $e_h^x$ events at some event $e_i^*$ at a remote process $p_i$. In the presence of Byzantine processes, such problems are not solvable in asynchronous message-passing systems because this requires solving the CD problem. This also establishes the CD problem as a fundamental first-class problem as these other problems for which we show impossibility results inherently require causality determination between a pair of events. The occurrence of false negatives (false positives) in the CD problem manifests as the occurrence of liveness and safety violations in these problems. A direct implication of our results is that none of the many algorithms proposed to solve these problems over the past five decades for failure-free systems/crash failures [6,7,8,9,10] can be adapted for Byzantine failures.
We begin by showing the following result regarding internal events at a process.
Theorem 1. For an internal event $e_h^x$, it is impossible to prevent false negatives or false positives in determining $e_h^x \rightarrow e_i^*$ correctly at a correct process $p_i$, i.e., $e_h^x \rightarrow e_i^*|_F$ matching $e_h^x \rightarrow e_i^*|_E$, in an asynchronous message-passing system with one or more Byzantine processes.
Proof. There may be no other event in the rest of the system to corroborate the occurrence of an internal event at a process. A Byzantine process can choose not to reveal an internal event to the rest of the system, leading to a false negative that cannot be prevented. It may also choose to add a fake internal event in what it reveals to the rest of the system, leading to a false positive that cannot be prevented. □
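The indistinguishability argument in this proof can be illustrated with a small sketch. Everything here is an assumption made for illustration: the list encoding of a local history, the helper `byzantine_report`, and `detector`, which stands in for any decision rule at a correct process that can depend only on the reported history.

```python
def byzantine_report(actual_events, hide=(), fake=()):
    """History that a Byzantine process reveals: drops `hide`, appends `fake`."""
    return [e for e in actual_events if e not in hide] + list(fake)


def detector(reported_events, event):
    """A stand-in decision rule; it can use only what was reported."""
    return event in reported_events


# Execution A: the internal event "done" really occurred, but is hidden.
report_a = byzantine_report(["init", "done"], hide=("done",))
# Execution B: "done" never occurred and nothing is altered.
report_b = byzantine_report(["init"])
# Execution C: "done" never occurred, but the process fabricates it.
report_c = byzantine_report(["init"], fake=("done",))
# Execution D: "done" really occurred and is reported faithfully.
report_d = byzantine_report(["init", "done"])

false_negative = not detector(report_a, "done")  # the event happened in A
false_positive = detector(report_c, "done")      # the event did not happen in C
```

Because `report_a` equals `report_b` and `report_c` equals `report_d`, any rule that answers correctly in executions B and D must answer incorrectly in A and C, whatever the rule is.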
For the problems for which we are about to show the impossibility results, the event under consideration is seen as an internal event.
4.1. Synchronization and State Observation Problems
4.1.1. Distributed Mutual Exclusion (ME)
The ME problem is specified as follows.
Safety specification of ME states that no two processes should gain access to the critical section (CS) at the same time.
Liveness specification of ME states that some process should eventually be able to gain access to the critical section (CS). In addition, fairness requirements of varying degrees of stringency are typically specified.
Theorem 2. In a system with even one Byzantine process, the distributed ME problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving ME requires satisfying $(e_h^x \rightarrow e_i^*) \wedge \psi$, where $e_h^x$ is an “exit CS (critical section)” event, $h \neq i$, $\psi$ is a predicate capturing the other requirements of the ME problem besides the first predicate, $e_i^*$ is an event where the CS is entered, and → is defined on messages of the ME algorithm.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ in the formula $(e_h^x \rightarrow e_i^*) \wedge \psi$ can be satisfied. Hence, ME cannot be solved.
A false positive of the CD problem is a safety violation—multiple processes in CS—in the ME problem. A false negative of the CD problem is a liveness violation—no process can enter CS—in the ME problem. □
4.1.2. Global Snapshot Recording (GSR)
The GSR problem is specified as follows.
Safety specification of GSR states that a recorded global state should include the recording of the local state of each process, that all in-transit messages in each channel should be recorded, and that such a global state should be consistent.
Liveness specification of GSR states that once a recording of a global snapshot is initiated, its recording should be eventually completed.
Theorem 3. In a system with even one Byzantine process, the distributed GSR problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving GSR requires satisfying $(e_h^x \rightarrow e_i^*) \wedge \psi$, where $e_h^x$ is an event where the process records its local state, $h \neq i$, $\psi$ is a predicate capturing the other requirements of the GSR problem (the recorded local states at the various processes are consistent, the channel states recording is complete) besides the first predicate, $e_i^*$ is an event where the completion of the global state recording is detected, and → is defined on snapshot recording algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ in the formula $(e_h^x \rightarrow e_i^*) \wedge \psi$ can be satisfied. Hence, GSR cannot be solved.
A false positive of the CD problem can result in a safety violation in the GSR problem, namely an inconsistent global state, supposing the false positive event is where the process is supposed to record its local state as per the algorithm but does not, receives application messages, and later records the local state. Alternately, in the definition of the formula, let $e_h^x$ be an event where the process completes the recording of its local state and states of incoming channels. A false positive of the CD problem can then result in a safety violation in the GSR problem, namely an incomplete global state, with some local states and channel states not recorded. A false negative of the CD problem is a liveness violation in the GSR problem: the detection of the completion of a global snapshot recording never occurs. □
4.1.3. Termination Detection (TD)
The TD problem is specified as follows.
Safety specification of TD states that global termination, meaning that all processes are passive and there are no in-transit application messages in a (transitless) consistent global state, should not be declared unless global termination has occurred.
Liveness specification of TD states that some process should eventually be able to detect global termination once it has occurred.
Theorem 4. In a system with even one Byzantine process, the distributed TD problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving TD requires satisfying $(e_h^x \rightarrow e_i^*) \wedge \psi$, where $e_h^x$ is an event where the process becomes passive, $h \neq i$, $\psi$ is a predicate capturing the other requirements of the TD problem (other processes are passive in a transitless consistent global state) besides the first predicate, $e_i^*$ is an event where global termination is detected, and → is defined on termination detection algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ in the formula $(e_h^x \rightarrow e_i^*) \wedge \psi$ can be satisfied. Hence, TD cannot be solved.
A false positive of the CD problem is a safety violation—no real global termination has occurred—in the TD problem. A false negative of the CD problem is a liveness violation—real global termination is not detectable—in the TD problem. □
4.1.4. Distributed Deadlock Detection (DD)
The DD problem is specified as follows.
Safety specification of DD states that only a process that is part of a deadlock cycle or knot should be aborted/killed as part of deadlock resolution.
Liveness specification of DD states that once a deadlock (cycle or knot in the wait-for graph) occurs in a consistent global state it should be detected and deadlock resolution performed.
Theorem 5. In a system with even one Byzantine process, the distributed DD problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving DD requires satisfying $(e_h^x \rightarrow e_i^*) \wedge \psi$, where $e_h^x$ is an event where the process gets blocked, $h \neq i$, $\psi$ is a predicate capturing the other requirements of the DD problem (existence of a cycle or knot in the wait-for graph in a consistent global state) besides the first predicate, $e_i^*$ is an event where the deadlock is detected, and → is defined on deadlock detection algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ in the formula $(e_h^x \rightarrow e_i^*) \wedge \psi$ can be satisfied. Hence, DD cannot be solved.
A false positive of the CD problem can result in a safety violation—unnecessary abortion—in the DD problem. A false negative of the CD problem is a liveness violation—deadlock not detectable—in the DD problem. □
4.1.5. Distributed Predicate Detection (PD)
The PD problem is specified as follows.
Safety specification of PD states that a global predicate is not declared as true when the global predicate is false.
Liveness specification of PD states that some process should eventually be able to detect that a global predicate had become true after the predicate became true.
Theorem 6. In a system with even one Byzantine process, the distributed PD problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving PD requires satisfying $(e_h^x \rightarrow e_i^*) \wedge \psi$, where $e_h^x$ is an event where the local variable at the process takes a value that can satisfy the global predicate $\phi$, $h \neq i$, $\psi$ is a predicate capturing the other requirements of the PD problem (conditions on how the various local variable values can be combined to satisfy the global predicate $\phi$) besides the first predicate, $e_i^*$ is an event where the global predicate $\phi$ is detected as true, and → is defined on predicate detection algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ in the formula $(e_h^x \rightarrow e_i^*) \wedge \psi$ can be satisfied. Hence, PD cannot be solved.
A false positive of the CD problem is a safety violation in the PD problem: there is no real satisfaction of the global predicate $\phi$. A false negative of the CD problem is a liveness violation in the PD problem: the global predicate $\phi$ that becomes true is never detected (because the process did not disclose its local value that could satisfy $\phi$). □
4.1.6. Causal Ordering of Messages (CO)
The CO problem is specified as follows [19,20,21,22].
Safety specification of CO states that if the send event of message $m$ causally happens before the send event of message $m'$, then at each common destination of $m$ and $m'$, $m'$ cannot be delivered before $m$.
Liveness specification of CO states that a message sent by a correct process to another correct process should be eventually delivered.
Theorem 7. In a system with even one Byzantine process, the distributed CO problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving CO requires satisfying $(e_h^x \rightarrow e_j^y) \wedge \psi$, where $e_h^x$ is a send event of a message $m$ to $p_i$, $e_j^y$ is an event where $p_j$ sends a message $m'$ to $p_i$, $\psi$ is a predicate on when/whether $p_i$ can safely deliver $m'$ sent at $e_j^y$ to itself (i.e., $p_i$ has received $m'$ and determines it is safe to give $m'$, with respect to all other messages sent to itself in the execution, to the application), $e_i^*$ is an event where $p_i$ delivers the message $m'$ from $p_j$, and → is defined on application messages.
As $e_h^x$ is a send event, from [4] for the CD problem, detecting $e_h^x \rightarrow e_j^y$ is susceptible to false positives and/or false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_j^y$ in the formula $(e_h^x \rightarrow e_j^y) \wedge \psi$ can be satisfied. Hence, CO cannot be solved.
A false positive of the CD problem can result in a liveness violation in the CO problem: waiting indefinitely at $p_i$ for the delivery of $m'$ until the prior delivery of $m$ that was never sent by $p_h$. A false negative of the CD problem is a safety violation in the CO problem: not waiting for the delivery of $m$ that was sent by $p_h$ at $e_h^x$ to $p_i$. □
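The liveness violation described above can be illustrated with one common vector-timestamp delivery rule for causal broadcast. This is a sketch under assumptions and not the algorithm of [19,20,21,22]: the function name `deliverable`, the list encoding of timestamps, and the three-process scenario are all introduced here for illustration.

```python
def deliverable(msg_vector, sender, delivered):
    """Causal delivery test applied by a receiver to one buffered message.

    msg_vector[k]: number of messages from process k that the sender claims
                   causally precede this message (for k == sender, the count
                   includes this message itself).
    delivered[k]:  number of messages from process k already delivered here.
    """
    for k, claimed in enumerate(msg_vector):
        if k == sender:
            if claimed != delivered[k] + 1:  # must be the next one from sender
                return False
        elif claimed > delivered[k]:         # a claimed dependency is missing
            return False
    return True


delivered = [0, 0, 0]  # the receiver has delivered nothing so far
honest = [0, 1, 0]     # first message from process 1, no dependencies
forged = [1, 1, 0]     # process 1 falsely claims a prior message from process 0

can_deliver_honest = deliverable(honest, 1, delivered)
can_deliver_forged = deliverable(forged, 1, delivered)
```

If process 0 never sent the claimed message, `delivered[0]` stays at 0 and the forged message remains buffered forever, which is the false positive turning into a liveness violation. A timestamp that understates dependencies would pass the same test too early, which corresponds to the false negative case.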
4.2. Distributed Graph Problems
4.2.1. Spanning Tree Construction (ST)
The ST problem is specified as follows.
Safety specification of ST states that a spanning tree (having $n-1$ edges and forming an acyclic sub-graph) is selected.
Liveness specification of ST states that some process should eventually be able to detect that the spanning tree construction is completed.
Theorem 8. In a system with even one Byzantine process, the distributed ST construction problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving ST requires satisfying $e_h^x \rightarrow e_i^*$, where $e_h^x$ is an event where $p_h$ has selected its incident spanning tree edges (of an actual spanning tree), $h \neq i$, $e_i^*$ is an event where $p_i$ determines that the distributed spanning tree determination is complete, and → is defined on spanning tree algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ can be satisfied. Hence, ST cannot be solved.
A false positive of the CD problem is a safety violation in the ST problem: a cycle or a non-tree sub-graph is created instead of a spanning tree. A false negative of the CD problem is a liveness violation in the ST problem: completion of the distributed spanning tree construction is not detectable by any $p_i$. □
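The safety specification of ST can be written as a global check over the union of the edges that the processes report as selected. The sketch below is a centralized oracle-style test, not a distributed algorithm; the function name `is_spanning_tree` and the numbering of nodes from 0 to n-1 are assumptions made for this example.

```python
def is_spanning_tree(n, edges):
    """True iff `edges` over nodes 0..n-1 has n-1 edges and no cycle."""
    if len(edges) != n - 1:
        return False
    parent = list(range(n))  # union-find forest over the nodes

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:  # u and v are already connected: this edge closes a cycle
            return False
        parent[ru] = rv
    return True


tree_edges = [(0, 1), (1, 2), (2, 3)]      # a path spanning four nodes
corrupted_edges = [(0, 1), (1, 2), (2, 0)]  # a cycle; node 3 is left out

tree_ok = is_spanning_tree(4, tree_edges)
corrupted_ok = is_spanning_tree(4, corrupted_edges)
```

An acyclic edge set with exactly n-1 edges over n nodes is necessarily connected, so the two conditions suffice. The corrupted edge set shows the kind of outcome a false positive can produce: the right number of edges, yet not a tree.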
4.2.2. Minimum Spanning Tree Construction (MST)
The MST problem is specified as follows.
Safety specification of MST states that a spanning tree (having $n-1$ edges and forming an acyclic sub-graph) having the minimum possible sum of edge weights is selected.
Liveness specification of MST states that some process should eventually be able to detect that the minimum spanning tree construction is completed.
Theorem 9. In a system with even one Byzantine process, the distributed MST construction problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving MST requires satisfying $e_h^x \rightarrow e_i^*$, where $e_h^x$ is an event where $p_h$ has selected its incident spanning tree edges (of an actual minimum spanning tree), $h \neq i$, $e_i^*$ is an event where $p_i$ determines that the distributed determination of the minimum spanning tree is complete, and → is defined on minimum spanning tree algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ can be satisfied. Hence, MST cannot be solved.
A false positive of the CD problem is a safety violation in the MST problem: a cyclic sub-graph or a non-tree sub-graph or a non-minimal spanning tree is identified as a minimum spanning tree. A false negative of the CD problem is a liveness violation in the MST problem: completion of the minimum spanning tree construction is not detectable by any $p_i$. □
4.2.3. All–All Shortest Paths Construction (AASP)
The AASP problem is specified as follows.
Safety specification of AASP states that for each node of a graph acting as a source (or sink) node, (the spanning tree representing) the shortest paths to (or from) every other node are selected.
Liveness specification of AASP states that each process should eventually be able to detect that the construction of the shortest paths spanning tree rooted at itself is completed.
Theorem 10. In a system with even one Byzantine process, the distributed AASP construction problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving AASP requires satisfying $e_h^x \rightarrow e_i^*$, where $e_h^x$ is an event where $p_h$ has identified its adjacent edges in a shortest paths sink tree rooted at $p_i$ (of an actual shortest path sink tree of $p_i$), $h \neq i$, $e_i^*$ is an event where $p_i$ determines that the distributed determination of the shortest path sink tree rooted at itself is complete, and → is defined on the shortest path sink tree algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that for any i, the predicate $e_h^x \rightarrow e_i^*$ can be satisfied. Hence, AASP cannot be solved.
A false positive of the CD problem is a safety violation in the AASP problem: a shortest paths sink tree is not used as the sink tree rooted at some $p_i$. A false negative of the CD problem is a liveness violation in the AASP problem: completion of the construction of the shortest paths sink tree rooted at some $p_i$ is not detectable by that $p_i$. □
4.2.4. Maximal Independent Set Construction (MIS)
The MIS problem is specified as follows.
Safety specification of MIS states that no two nodes that are neighbors add themselves to the maximal independent set and no superset of the set so constructed satisfies the independent set property.
Liveness specification of MIS states that some process should eventually be able to detect that the maximal independent set construction is complete.
Theorem 11. In a system with even one Byzantine process, the distributed MIS construction problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving MIS requires satisfying $e_h^x \rightarrow e_i^*$, where $e_h^x$ is an event where $p_h$ has determined whether or not it belongs to the maximal independent set (in a true maximal independent set), $h \neq i$, $e_i^*$ is an event where $p_i$ determines that the distributed maximal independent set construction is complete, and → is defined on maximal independent set algorithm messages.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ can be satisfied. Hence, MIS cannot be solved.
A false positive of the CD problem is a safety violation in the MIS problem: two neighboring nodes add themselves to the maximal independent set, or some node that has not added itself to the maximal independent set can still be added to it. A false negative of the CD problem is a liveness violation in the MIS problem: completion of the maximal independent set construction is not detectable by any $p_i$. □
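The two parts of the MIS safety specification (independence and maximality) can be checked globally as sketched below. This is an illustrative centralized check under assumptions, not a distributed algorithm: the adjacency-dictionary encoding and the name `is_maximal_independent_set` are introduced here.

```python
def is_maximal_independent_set(adj, chosen):
    """Check the MIS safety specification on an undirected graph.

    adj:    dict mapping each node to an iterable of its neighbors.
    chosen: the nodes that declared themselves members of the set.
    """
    chosen = set(chosen)
    # Independence: no two neighbors are both in the set.
    independent = all(v not in chosen for u in chosen for v in adj[u])
    # Maximality: every node outside the set has a neighbor inside it,
    # so no further node could be added.
    maximal = all(u in chosen or any(v in chosen for v in adj[u]) for u in adj)
    return independent and maximal


# A path graph 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}

good = is_maximal_independent_set(path, {0, 2})
not_maximal = is_maximal_independent_set(path, {0})        # node 2 could join
not_independent = is_maximal_independent_set(path, {0, 1})  # neighbors both in
```

The two failing cases correspond to the two safety violations named in the proof: a node that could still be added, and two neighbors that both added themselves.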
4.3. Generalized Theorem
A distributed algorithm is an algorithm in which each process is initialized with its local variable values and incident edge parameters; no process has access to any other variables and parameters of the system. The process can communicate only with its neighboring processes (depending on the overlay, if any) along incident edges.
For any problem X which requires a distributed algorithm to solve it, there are two characteristic events. $e_h^x$ is an internal event at which process $p_h$ completes its calculation of local variable values required to solve the problem after communicating with other processes in a distributed manner. $e_i^*$ is an event at which process $p_i$ determines that a global solution to the problem has been attained. When $h \neq i$, in order that $p_i$ at $e_i^*$ can detect that problem X has been solved, there needs to be an actual causal path from $e_h^x$ to $e_i^*$ (i.e., $e_h^x \rightarrow e_i^*|_E = 1$) that is also detectable by $p_i$, i.e., $e_h^x \rightarrow e_i^*|_F = 1$.
Safety specification of X states the correctness conditions of a solution to X, as captured by a global formula $\phi$.
Liveness (or termination) specification of X states that some process should eventually be able to detect that a global formula $\phi$ has become true.
Theorem 12. In a system with even one Byzantine process, when a process $p_i$ has to detect that a problem X has been locally solved at events $e_h^x$, the distributed X problem is subject to the same limitations (exposure to false positives and false negatives) as the CD problem, resulting in safety and liveness violations.
Proof. Solving X requires satisfying $(e_h^x \rightarrow e_i^*) \wedge \psi$, where $e_h^x$ is an event where the local variables at the process take values that specify that the local computation has completed at $p_h$, $h \neq i$, $\psi$ is a predicate capturing the other requirements of the X problem besides the first predicate, $e_i^*$ is an event where the global formula $\phi$ is detected as true, and → is defined on algorithm messages for solving X.
As $e_h^x$ is an internal event, from Theorem 1 for the CD problem, detecting $e_h^x \rightarrow e_i^*$ is susceptible to false positives and false negatives. Thus, it cannot be guaranteed that the predicate $e_h^x \rightarrow e_i^*$ in the formula $(e_h^x \rightarrow e_i^*) \wedge \psi$ can be satisfied. Hence, X cannot be solved.
A false positive of the CD problem is a safety violation in the X problem: there is no real satisfaction of the global formula $\phi$. A false negative of the CD problem is a liveness violation in the X problem: the global formula $\phi$ that becomes true is never detected (because the process did not disclose its local value that could satisfy $\phi$). □