1. Introduction
The distribution of a secure symmetric key is essential to ensuring the confidentiality of data transmission in secure communication. General symmetric key distribution methods such as the Diffie–Hellman algorithm are highly convenient [1]. However, the emergence of quantum computing [2] reduces the complexity of discrete logarithm problems and increases the risk of key cracking. Quantum Key Distribution (QKD) ensures the security of distributed keys through the fundamental principles of quantum mechanics [3]. However, this method requires a dedicated optical channel and has the drawback that the key generation rate decays exponentially with distance [4]. A Physical Unclonable Function (PUF) based on Semiconductor Superlattices (SSL) enables a Secure Key Distribution (SKD) system that synchronizes, through a public channel, the random numbers generated by the spatially separated chaotic systems of both parties [5,6]. The random information output by an SSL device is a complex nonlinear function of the driving signal, and two twinning SSL devices from the same wafer are strongly correlated but difficult to clone on a different wafer at any later time, even if the same process is adopted [7]. These properties not only ensure the security of distributed keys [8] but also avoid the deployment of dedicated channels, so the system can be used on high-mobility terminals such as satellites [9].
To enable more users to communicate securely through a one-time pad cipher, each user needs to share a pair of twinning SSL devices with every other user. However, in large-scale networks, a single node will encounter communication and computing bottlenecks because it must handle tasks such as fuzzy extraction [6]. At the same time, the high manufacturing cost of SSL devices limits large-scale deployment [9]. Therefore, this kind of PUF technology still scales poorly to complex network application scenarios.
By organizing point-to-point SKD systems into Secure Key Distribution Networks (SKDN), the application scope of the SKD system can be expanded. The basic functional requirement of an SKDN is to provide a secure symmetric key distribution service between any pair of nodes using the existing limited resources. In QKD, one implementation uses quantum repeaters, which exploit quantum entanglement of photons to communicate over different optical channels; the network nodes need not be trusted to ensure unconditional security during QKD operations facilitated by such quantum repeaters [10]. Due to the limitations of the underlying physical mechanism, these methods are only suitable for QKD. An SKDN based on trusted relays is flexible and highly scalable. Secure key distribution among connected nodes can be completed by hop-by-hop forwarding over a public channel [11]. Moreover, it allows nodes to deploy other PUF devices for heterogeneous networking [12].
Compared with classical routing [13], the design goals of the routing strategy in SKDN have changed. During key forwarding, the routing strategy is responsible for finding a path with sufficient local key resources to encrypt the forwarded key, so that the key cannot be eavesdropped on in transit. In addition, the trusted relays on the selected path must not leak the key. The increase in the scale of SKDN, however, leads to a higher probability of network layer faults [14]. The term fault refers to disruptions that can significantly impact key distribution performance. The causes may be benign, such as node movement, fail-stop caused by bugs, shortage of local key resources, or congestion caused by bursty traffic. They may also be malicious. For example, while a trusted relay appears to participate normally in the routing process, it may exhibit Byzantine behavior such as tampering with, forging, or selectively discarding packets. In addition, a faulty node may tamper with routing and interfere with other key distribution processes, reducing key distribution security [15]. Improving the fault tolerance of routing protocols ensures normal network functions and at the same time alleviates the urgent need for fault tolerance in the underlying software and hardware [16]. Since SSL-based SKDN is not restricted by dedicated channels, the locations of nodes equipped with matched devices are highly decentralized and dynamic [9]. A well-designed distributed routing strategy can reduce system operation and maintenance costs.
Existing related research primarily focuses on reducing the possibility of faulty nodes passively eavesdropping on keys [15]. For example, random routing is used to avoid a fixed path on which a faulty node is located, or multi-path key distribution is used to ensure that at least one of the multiple paths contains no faulty node, so that the key is not leaked [10]. The key generation rate of the current point-to-point SSL-based technique, however, is only in the range of megabits per second [5]. These methods improve key confidentiality at the expense of consuming a large amount of local keys, which weakens availability. Additionally, they provide no mechanism optimized for the performance issues that arise under the broader fault scenarios described above. Considering these issues, this paper proposes a practical routing strategy for SKDN. By designing the path discovery and fault detection mechanisms of the routing strategy together with the way they cooperate, we ensure that the key distribution system retains a degree of fault tolerance under network layer faults. The main contributions of this paper are summarized as follows:
An on-demand path discovery mechanism in SKDN is proposed. Fault-free path discovery is performed when necessary, and appropriate paths are selected based on local key status to reduce control message propagation overhead and improve key resource utilization.
A fault handling method in the communication key distribution stage is proposed. After analyzing the location of the fault through the acknowledgment-based fault detection mechanism, based on the Dempster–Shafer (DS) evidence theory, the mass function of the weighted observation evidence is calculated to identify possible causes. A new round of path discovery can be used to isolate the cause of fault and transfer key status to improve the ability to handle exceptions.
The effectiveness of the proposed solution under different node scales and different proportions of faulty nodes is evaluated through simulation. Simulation results show that the proposed solution improves metrics such as packet delivery ratio and corrupted key ratio, indicating that the strategy has practical value.
The rest of this paper is organized as follows. Section 2 summarizes the related research. The system and problem model is introduced in Section 3. Section 4 describes the proposed solution. Section 5 presents the simulation results and discussion. The paper is concluded in Section 6.
2. Related Work
In the secure key distribution networks based on the trusted relay assumption, existing work on the design of the routing mechanism mainly considers the performance and security of secure key distribution.
In terms of network performance considerations, early testbeds primarily made modifications based on the Internet routing protocol [
11]. The Defense Advanced Research Projects Agency uses modified OSPF, where link cost is evaluated based on the number of generated keys distributed on a link within the Link State Announcement update interval, but the current load problem is ignored [
4,
17]. In the European project, SECOQC [
18] evaluates the link workload by calculating the actual key forwarding rate, but does not provide a method for calculating the link cost. Some studies optimize key utilization based on the number of local keys of the relay and minimize the number of path hops [
19]. The common feature among the above methods is the utilization of periodic path information exchange, but when the number of SKDN requests increases, on-demand probing provides lower overhead and more timely link status information [
20]. In addition to processing messages according to priority in different message queues on nodes to improve the quality of service, research [
21] has found that the similarity in resource distribution with Mobile Ad Hoc Networks (MANETs) provides a reference for SKDN design, and proposes on-demand routing by calculating geographical distance and link status, which reduces the number of routing control packets. Research [
22] has found that the OLSR algorithm can be improved through interaction with the status of the local key, but the hop count has not been optimized, which also affects system performance; in addition, software-defined networks can be used to collect topology information and complete path calculation [
23] based on application requirements [
24].
Some SKD technologies, such as QKD, can prevent eavesdropping on the link. Attackers can exploit this to carry out denial-of-service attacks [25], but the aforementioned routing mechanisms can be used to select alternative paths if a trusted relay node is available. However, when the trusted relay assumption partially fails, the security risk of active attacks such as traffic redirection [26] increases.
Research [13] has pointed out that routing mechanisms in existing work pay little attention to key protection. Methods for improving the security of forwarded keys can be roughly divided into two types. The first realizes key distribution in an untrusted environment through multiple paths [27], which can prevent eavesdropping by malicious nodes. However, it incurs a huge overhead of key resources when the environment is in fact trustworthy, and it cannot prevent system paralysis caused by faulty nodes [15]. The second uses a stochastic routing mechanism so that key forwarding does not rely on a single path, thereby avoiding potentially risky relay nodes [28]. Although this method can, to a certain extent, prevent active attacks from making the system unavailable, it yields limited performance improvement in the presence of faulty nodes [29]. It also lacks a mechanism for switching between trusted and untrusted environments in order to reduce system overhead.
4. Proposed Solution
Due to the criticality of the basic services provided by the SKDN, in addition to improving its scalability, it is also necessary to consider how to handle the failure of an SKD node. According to the traffic consistency principle [33], the aforementioned faults cause an inconsistency between the inflow and outflow of nodes. To this end, nodes count the acknowledgments of communication keys, check traffic consistency, and detect and locate the abnormal link. However, detection can also be triggered by benign faults. Directly isolating the faulty link solely on the basis of the acknowledgment rate could significantly reduce system availability. We therefore apply DS evidence theory to identify the cause of the fault through multi-information aggregation and make corresponding decisions, and propose an on-demand fault-tolerant routing strategy for SKDN.
4.1. Overview
As shown in Figure 4, this routing strategy provides full life cycle management for communication keys, which is divided into six parts: path discovery, key transmission, fault detection, evidence collection, information aggregation, and decision making.
Path Discovery:
An improved on-demand path discovery mechanism based on signatures and flooding not only reduces the overhead but also prevents path discovery failures caused by tampering or selective loss of routing messages to a certain extent.
Key Transmission:
After the path is discovered, the node sends the communication key to the next hop node in the corresponding routing table entry of the path; at this time, the communication key status is uncertain and cannot be used by the upper layer.
Fault Detection:
The destination node will acknowledge the communication key. An abnormal acknowledgment rate will trigger the fault detection. The detection will start from the intermediate node of the path to detect the upstream and downstream status and recursively detect the faulty subpath until the faulty link is confirmed. This will allow the location of the faulty node to be narrowed down to a single link.
Evidence Collection:
Characterize the acknowledgment rate, bitmap autocorrelation coefficient, and local key status of the current path as evidence of the cause of the link fault, and define and derive the mass function corresponding to each piece of evidence; weight the evidence and perform fusion calculations to obtain the combined mass function.
Information Aggregation and Decision Making:
Select the cause of fault that can maximize the pignistic transformation of the combined mass function, conduct a corresponding new round of path discovery, and convert the communication key status generated during evaluation.
4.2. Path Discovery
The process is mainly divided into two phases: route request and route response. At least one path (if any) free of faulty nodes is found by flooding routing packets.
Each node maintains a routing table, in which each entry contains the following fields: destination address, egress port, relay list, and entry status. When the routing table entry for the destination is unavailable, the node enters the route request phase. The SKD node generates and broadcasts a route request message. The message carries an incremented request sequence number and is signed by the source, so other nodes can identify the request. To ensure the privacy of routing messages, the messages are encrypted with a local key before broadcasting. Each relay maintains a list of recent routing requests. As shown in Figure 5, if the received request message does not match any entry in this list, it is broadcast to other ports; duplicates are not rebroadcast, which reduces the overhead of routing messages and the consumption of local keys.
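The duplicate suppression in the route request phase can be sketched as follows. This is a minimal illustration rather than the protocol's actual implementation: the class and field names (`RouteRequest`, `Relay`, `seen`) are hypothetical, and signature verification and local-key encryption are deliberately omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RouteRequest:
    source: str        # node that originated the request
    seq: int           # request sequence number, incremented per request
    destination: str   # node the source wants to reach


@dataclass
class Relay:
    name: str
    ports: list                                # hypothetical port identifiers
    seen: set = field(default_factory=set)     # recent (source, seq) pairs

    def on_request(self, req: RouteRequest, in_port: str) -> list:
        """Return the ports on which the request is rebroadcast."""
        key = (req.source, req.seq)
        if key in self.seen:
            return []                          # already handled: do not flood again
        self.seen.add(key)
        # Flood to every port except the one the request arrived on.
        return [p for p in self.ports if p != in_port]


relay = Relay("B", ["p1", "p2", "p3"])
first = relay.on_request(RouteRequest("A", 1, "D"), "p1")
duplicate = relay.on_request(RouteRequest("A", 1, "D"), "p2")
```

Keying the list on the pair (source, sequence number) means a retransmitted request with a new sequence number is flooded again, while copies of the same request arriving over different ports are dropped.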
When the route request message arrives at the destination, the route response phase begins. If the current request has not been processed, a route response message containing the incremented response sequence number is broadcast. As shown in Figure 6, when the response does not match the response list maintained by the relay, or it matches but the link status calculated from the packet is better than the record in the table, the cumulative link status in the table is updated. The relay then adds the link status at its ingress to the link status list in the packet, signs it, encrypts it, and forwards it to other ports. If the source does not receive the route response message within a certain period, it simply resends the route request message. If there is at least one path between the source and the destination, the algorithm can discover the path and ensure the reachability of the route.
If the path priority is measured only by accumulating link status while route response messages propagate, a path that cannot provide local key services may be selected. As shown in Figure 7, if A triggers path discovery, this mechanism causes the network layer to select the path containing A, B, and C, even though the key supply capability of the logical link between B and C is poor and key transmission failure is more likely. We therefore consider both the local key status and the hop count of the links traversed by the current message as the path status, and the path with the larger status is selected:
where
P represents the selected path sequence number and
represents the number of hops of path
i. The link status
S in the formula is defined as:
The first factor represents the status of the key cache, defined in terms of the amount of currently cached local key and the storage limit. The second factor is the absolute net key generation rate of the current logical link. If the first factor is small, it implies that a significant number of communication keys have passed through the node in the past period, and the cache requires time to recover. If S is 0, the current logical link cannot provide services.
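Since the formulas themselves are not reproduced in this text, the following sketch rests on two loudly stated assumptions: the link status is taken to be the product of the cache fill ratio and the absolute net key generation rate, and the path status is taken to be the bottleneck link status divided by the hop count. Both forms, and all function names, are hypothetical and only illustrate how a path with one starved link can be ranked below a uniformly healthy path.

```python
def link_status(cached: float, limit: float, net_rate: float) -> float:
    """Assumed form: (cached key / storage limit) * |net key generation rate|."""
    if limit <= 0:
        raise ValueError("storage limit must be positive")
    return (cached / limit) * abs(net_rate)


def path_status(link_statuses: list[float]) -> float:
    """Hypothetical path score: bottleneck link status divided by hop count."""
    if not link_statuses:
        raise ValueError("a path needs at least one link")
    return min(link_statuses) / len(link_statuses)


def select_path(paths: dict[str, list[float]]) -> str:
    """Pick the candidate path with the largest path status."""
    return max(paths, key=lambda name: path_status(paths[name]))


# Two hypothetical three-hop candidates; the first has one nearly empty cache.
candidates = {
    "p1": [link_status(80, 100, 2.0), link_status(5, 100, 2.0), link_status(90, 100, 2.0)],
    "p2": [link_status(60, 100, 2.0)] * 3,
}
chosen = select_path(candidates)
```

Using the minimum rather than the sum is what prevents a single weak link from being masked by strong neighbours, which is the failure mode the text attributes to plain accumulation.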
4.3. Fault Detection
When a path is discovered, communication key distribution can be performed within the path. As mentioned earlier, if there are faulty nodes on the path, transmission abnormalities will occur. Therefore, this detection method is based on the destination node’s acknowledgment of the communication key: if the amount of acknowledgment in a sliding window is greater than the threshold, fault detection at the source is triggered.
Some faulty nodes may send bogus information to the destination node to maintain the traffic consistency principle. However, in the SKDN, the SKD node will utilize the local key to encrypt or decrypt messages when forwarding or receiving messages. If the faulty node does not use the local key, this bogus information will be decrypted into garbled characters and cannot be forwarded to the destination, causing fault detection to be triggered. Therefore, by using the local key, the traffic flowing through the faulty node is limited to a certain normal range, which improves the detection accuracy.
When fault detection is triggered, a detection list is maintained based on the relay list of the corresponding entry of this path in the routing table at that moment. We assume that the path, including source and destination, has an odd number N of nodes, numbered 1 to N from source to destination. When detection is triggered, the abnormal path from 1 to N is divided into two parts by inserting its intermediate node into the detection list. When the next communication request arrives, the detection list is attached to the end of the packet carrying the communication key and then forwarded. All nodes in the detection list, including the destination, must return acknowledgments, and the two subpaths are detected recursively. This continues until detection is triggered on an indivisible path from i to j, at which point one of these two nodes is suspected to be faulty.
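The recursive halving described above behaves like a binary search over the path. The sketch below is a simplified model under stated assumptions: exactly one faulty link lies on the path, and `segment_ok` is a hypothetical oracle standing in for "the acknowledgments observed for the subpath from i to j are normal"; in the actual scheme that information comes from acknowledgments returned by nodes in the detection list.

```python
from typing import Callable


def locate_faulty_link(first: int, last: int,
                       segment_ok: Callable[[int, int], bool]) -> tuple[int, int]:
    """Bisect the path first..last until a single suspicious link remains."""
    lo, hi = first, last
    while hi - lo > 1:
        mid = (lo + hi) // 2          # intermediate node added to the detection list
        if segment_ok(lo, mid):
            lo = mid                  # upstream half is normal: fault is downstream
        else:
            hi = mid                  # fault lies in the upstream half
    return lo, hi


# Hypothetical 9-node path whose link between nodes 5 and 6 misbehaves.
def oracle(i: int, j: int) -> bool:
    return not (i <= 5 and j >= 6)


suspect = locate_faulty_link(1, 9, oracle)
```

Each round halves the suspected subpath, so the number of detection rounds grows roughly with the logarithm of the path length.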
To prevent a faulty node from discarding the acknowledgment so that the fault can be blamed on any node between itself and the destination, a relay that receives a packet containing its own identity in the detection list starts a timer instead of sending an acknowledgment immediately. If the acknowledgment returned by the destination arrives at the relay within the time limit, the relay signs its information together with the packet body and attaches the result to the end of the packet; otherwise, it generates and sends a new acknowledgment.
If a fault-triggered communication key is used for secure communication between source and destination, the communication of both parties may be leaked. We observe the asynchronous nature of key distribution and application, establishing four different states for communication keys: uncertain, suspicious, available, and unavailable. For convenience, we assume that the duration of the fault will exceed the full cycle of fault detection. When the communication key request arrives, the status of the forwarded communication key is uncertain. When the acknowledgment rate is lower than the threshold, the status of the batch of keys is converted to available and returned to the service layer. Otherwise, convert the key status to suspicious.
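The four key states can be modelled as a small state machine. In this sketch the threshold comparison is abstracted into a boolean `detection_triggered`, and mapping a malicious verdict to the unavailable state is an assumption made for illustration (the text only says such keys are cleared from the buffer). All names are hypothetical.

```python
from enum import Enum


class KeyState(Enum):
    UNCERTAIN = "uncertain"        # forwarded, not yet usable by the upper layer
    SUSPICIOUS = "suspicious"      # sent while fault detection was triggered
    AVAILABLE = "available"        # released to the service layer
    UNAVAILABLE = "unavailable"    # assumed state for discarded keys


def after_window(state: KeyState, detection_triggered: bool) -> KeyState:
    """Transition applied when a sliding window of acknowledgments closes."""
    if state is not KeyState.UNCERTAIN:
        return state
    return KeyState.SUSPICIOUS if detection_triggered else KeyState.AVAILABLE


def after_decision(state: KeyState, cause_is_benign: bool) -> KeyState:
    """Transition applied once the cause of the fault has been decided."""
    if state is not KeyState.SUSPICIOUS:
        return state
    return KeyState.AVAILABLE if cause_is_benign else KeyState.UNAVAILABLE
```

Keeping keys in the uncertain or suspicious state until a verdict is reached is what allows distribution to continue asynchronously without exposing the upper layer to keys that may have leaked.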
4.4. Evidence Collection
In addition to malicious causes, benign factors such as poor local key status can also cause detection triggers. Simply isolating the faulty link will reduce system availability. Therefore, we use the Dempster–Shafer evidence theory to provide theoretical tools for multi-information aggregation and subsequent decision making to make a reasonable link status assessment. The Dempster–Shafer evidence theory provides a mathematical model for the uncertainty and imprecision in events associated with certain evidence, and also shows how to combine different evidence to make reasonable deductions about associated events [
34]. We define a frame of discernment
as the possible causes that trigger detection, and the corresponding mass functions are assigned to both subjective and objective knowledge, also known as evidence, generated during the detection process, representing their respective estimations of potential causes:
Mass function
m is the mapping of power sets
to positive real numbers, satisfying:
where
m is defined as the following piecewise linear function:
where
x is derived from each evidence and
b represents the current value of the input evidence
x when the uncertainty of the associated event is highest. The evidence is collected from the communication key acknowledgment rate, current path status, and autocorrelation coefficient of acknowledgment bitmap, as shown below.
Evidence 1: communication key acknowledgment rate
. As shown in the following formula,
is the number of communication key packets sent in a sliding window, and
is the number of acknowledgments; this evidence is collected from the acknowledgment rate of the communication key that triggers the current detection process. Since the false positive rate is a common problem of intrusion detection systems [
35], further analysis is required. It needs to be evaluated based on further evidence to improve the accuracy of inference of the cause of the fault:
Evidence 2: autocorrelation coefficient of acknowledgment bitmap
. This evidence is collected from the autocorrelation coefficient of acknowledgment bitmap that triggers the current detection process. A bitmap is represented by a one-dimensional sequence:
where
indicates whether the
jth acknowledgment that triggers the current detection process is lost or not. Malicious faults usually result in irregular response packet loss, so we use the autocorrelation function [
36] as shown below to evaluate the repeating pattern in the time series represented by the bitmap and further deduce the nature of the anomaly:
Evidence 3: current path status
. This evidence is collected from the current path status
in the routing table.
represents the maximum key generation rate of all logical links. A path with a poor current path status can easily cause detection triggers. If not, the belief that indicates a malicious fault in the path is stronger.
Evidence 4 to Evidence : communication key acknowledgment rate
. This evidence is collected from the acknowledgment rate that triggered the
ith round of the fault detection process:
Evidence to Evidence : autocorrelation coefficient of acknowledgment bitmap
. This evidence is collected from the autocorrelation coefficient of the acknowledgment bitmap in the
ith round of the fault detection process. A bitmap is represented by a one-dimensional sequence:
where
indicates whether the
jth acknowledgment in the
ith round of the fault detection process is lost or not.
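Two of the raw quantities behind the evidence above can be computed as in the following sketch. The exact formulas are not reproduced in this text, so the sketch assumes the acknowledgment rate is the plain ratio of acknowledgments to packets sent and uses the standard sample autocorrelation at a chosen lag; the lag, the bit encoding of the bitmap, and the function names are assumptions.

```python
def ack_rate(sent: int, acked: int) -> float:
    """Assumed form: acknowledgments received divided by packets sent."""
    if sent <= 0:
        raise ValueError("at least one packet must have been sent")
    return acked / sent


def autocorrelation(bitmap: list[int], lag: int = 1) -> float:
    """Standard sample autocorrelation of a 0/1 acknowledgment bitmap."""
    n = len(bitmap)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(bitmap)")
    mean = sum(bitmap) / n
    variance = sum((b - mean) ** 2 for b in bitmap)
    if variance == 0:
        return 0.0  # constant bitmap: no pattern to measure (convention chosen here)
    covariance = sum((bitmap[t] - mean) * (bitmap[t + lag] - mean)
                     for t in range(n - lag))
    return covariance / variance


periodic = [1, 0] * 4                    # strictly alternating loss pattern
r_lag2 = autocorrelation(periodic, 2)    # strong positive value: pattern repeats
r_lag1 = autocorrelation(periodic, 1)    # strong negative value: neighbours differ
```

A bitmap with a clear repeating structure yields autocorrelation values of large magnitude, whereas the irregular loss that the text associates with malicious faults yields values close to zero.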
4.5. Information Aggregation and Decision Making
The mass functions
and
of different observational evidence can be combined through Dempster’s rule to obtain the combined mass function
, as shown in the following formula.
also satisfies the condition of Equation (
5) and is the same as the power set
mapped by
and
:
Therefore, different information can be considered comprehensively to further analyze and reason about the elements in the set. However, the normalized conjunctive rule of combination that defines Dempster’s rule assumes the same degree of belief for different evidence, and lacks the distinction of evidence importance through prior knowledge. For this reason, we calculate the combined mass function by using the weighted Dempster’s rule. After assigning a positive integer importance factor (IF) to each piece of evidence, for any
, Equations (
16) and (
17) will be transformed into:
where
and
are the IFs of
and
, respectively. The IF of the combined mass function
is
. It can be verified that the weighted Dempster’s rule satisfies the condition of Equation (
5); weighting is more general, and due to the special case
, it degenerates into Dempster’s rule. More importantly, Equations (18) and (19) show that the mass function with a high IF contributes more to the combined mass function, effectively utilizing prior knowledge obtained from key transmission.
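For orientation, the unweighted Dempster's rule, which the text describes as a special case of the weighted rule, can be sketched as below. The weighted variant is not reproduced here because its formulas are not available in this text. The two-element frame with the labels "benign" and "malicious" and the numeric masses are assumptions chosen for illustration.

```python
from itertools import product


def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions whose focal sets are frozensets."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b      # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("evidence is totally conflicting")
    # Normalise so that the combined masses again sum to 1.
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


B = frozenset({"benign"})
M = frozenset({"malicious"})
T = B | M                                    # the whole frame: "do not know"

m_a = {B: 0.6, M: 0.1, T: 0.3}               # hypothetical evidence
m_b = {B: 0.5, M: 0.2, T: 0.3}               # hypothetical evidence
m_ab = dempster_combine(m_a, m_b)
```

In this example both sources lean towards a benign cause, so the combined mass on the benign hypothesis is larger than either individual mass, while the mass left on the whole frame shrinks.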
For the obtained combined mass function
, select the element
that can maximize the Pignistic transform of
as the cause of fault:
where Pignistic transform
is as follows:
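The standard pignistic transform spreads the mass of every focal set evenly over its elements. The sketch below assumes a normalized mass function with no mass on the empty set; the hypothesis labels and the example masses are illustrative assumptions, not values from the paper.

```python
def pignistic(m: dict) -> dict:
    """Pignistic probability of each element, assuming no mass on the empty set."""
    betp = {}
    for focal, mass in m.items():
        if not focal:
            continue                     # skip the empty set if it is present
        share = mass / len(focal)        # split the mass evenly over the elements
        for element in focal:
            betp[element] = betp.get(element, 0.0) + share
    return betp


def decide(m: dict) -> str:
    """Return the element with the largest pignistic probability."""
    betp = pignistic(m)
    return max(betp, key=betp.get)


B = frozenset({"benign"})
M = frozenset({"malicious"})
example = {B: 0.5, M: 0.2, B | M: 0.3}   # hypothetical combined mass function
probabilities = pignistic(example)
verdict = decide(example)
```

Because the mass on the whole frame is shared equally, ignorance never favours one cause over another; the decision is driven only by the mass committed to specific hypotheses.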
When the detection is triggered, we collect evidence from
to
, calculate the combined mass function
, and calculate
through Equation (
20). When
is benign, path discovery is performed, and the communication keys sent since detection was triggered are converted to available and returned to the service layer. Otherwise, the communication keys in the buffer are cleared, and the detected link is added to the route request message in the next path discovery so that other nodes ignore the link during the route discovery phase. Algorithm 1 for the combined mass function
is summarized as follows, where
is the IF of each
,
is calculated from Equations (
6)–(
8). The time complexity of combined mass function calculation is
, where
N is the number of nodes on the path. This shows that the computational overhead increases slowly as the network size increases, indicating the practical value of the algorithm.
Algorithm 1 Combined mass function calculation
- Input:
- Output:
- 1:
- 2: for to do
- 3:
- 4: for do
- 5:
- 6: end for
- 7:
- 8: end for
4.6. Analysis
First, we analyze the security of the proposed solution. In the path discovery phase described above, the RSA digital signature algorithm is used to authenticate the relevant information in the message and prevent a faulty node from impersonating other nodes. However, this does not affect the security of the communication key. Due to the characteristics of SKD technology, the generated symmetric keys are independent of each other. Therefore, if the RSA public and private key pair is cracked at some point in the future, neither the keys already generated nor the one-time pad encryption using them is affected. This is called forward security. Another issue that requires attention is the possibility that an adversary stores today's supposedly secure ciphertext and waits for more advanced cryptanalysis technology to extract information from it. This is possible in traditional symmetric key distribution scenarios. In contrast, combining the key generated by SKD technology with one-time pad encryption can ensure the long-term security of the ciphertext, which is of great significance for certain scenarios, such as personal medical records and business secrets.
Next, we analyze and compare the upper limit of fault tolerance and the overhead of this method and existing methods in the scenario shown in Figure 3. We first analyze the shortcomings of existing methods under this mixed fault. The method used in [15], combined with PBFT [37], ensures that in the presence of faulty nodes any two normal nodes can reach a consensus on the same key distribution path construction scheme. On this basis, path discovery between normal nodes can resist Byzantine behavior from faulty nodes, avoiding the previously described traffic redirection attack. However, this method assumes that the leader node proposes a construction plan computed from global network information, which incurs huge resource consumption when implemented in a dynamic SKDN [22]. In addition, the method can tolerate at most f faulty nodes among 3f + 1 nodes simultaneously. In the multi-path communication key distribution strategy proposed in [38], communication keys are distributed on
disjoint paths in sequence. After the destination node XORs the results, it can obtain the ITS secure communication key. This ensures the security of key distribution. However, faulty nodes can reduce the success rate of the multi-path scheme by delaying the communication key of one of the paths. In the literature [
14], it has been proposed that additional
disjoint paths are needed to improve the key distribution liveness. Although the combination of the above methods can ensure the reliability of secure key distribution in the fault scenario shown in
Figure 3, there is an upper limit to the number of faulty nodes that can be resisted, and a large number of redundant disjoint paths are required as the cost of improving reliability, which greatly reduces system availability. For example, when C, E, and G in
Figure 3 become faulty, faulty nodes become the majority of network nodes, and there are not enough disjoint paths. Through its path discovery and fault detection mechanisms, our method can discover the path containing A, B, F, and D and distribute the key on this path. Therefore, ideally, as long as there is a path that does not contain a faulty node, communication keys can be distributed on this path without relying on local keys from other paths, which improves the availability of the system.
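The XOR-based multi-path idea discussed in this comparison can be illustrated with a generic secret-splitting sketch. This is not the exact scheme of [38]; it is a minimal example, with hypothetical function names, of why the destination needs the share from every path, which is also why delaying a single path harms liveness.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(key: bytes, paths: int) -> list[bytes]:
    """Split a key into one share per disjoint path."""
    if paths < 2:
        raise ValueError("at least two paths are needed")
    shares = [secrets.token_bytes(len(key)) for _ in range(paths - 1)]
    # The last share is chosen so that the XOR of all shares equals the key.
    last = reduce(xor_bytes, shares, key)
    return shares + [last]


def recover_key(shares: list[bytes]) -> bytes:
    """XOR all received shares together to rebuild the key."""
    return reduce(xor_bytes, shares)


key = bytes(range(16))                 # hypothetical 128-bit communication key
shares = split_key(key, 3)
recovered = recover_key(shares)
```

Each share consumes local key material on its own path, so the local key cost grows with the number of paths, which matches the availability concern raised in the text.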
5. Simulations and Discussion
This section will verify the effectiveness of the proposed solution through experimental simulation and detailed analysis of relevant results.
5.1. Simulation Environment
We use the ns3 simulator to compare the proposed solution with OSPF, the adaptive stochastic routing (ASR) [
28] and the multi-path communication scheme (MPCS) [
38]. BRITE [
39] is used to generate a random topology under the Waxman model, in which nodes and links are randomly distributed in the grid to ensure that the evaluation of the method based on the simulation results is independent of the specific network structure.
Table 1 shows the parameters used in the simulation. The simulation time was 100 s for each simulation and 20 simulations were run to obtain the average value. We will consider the following two scenarios representing malicious fault in the communication key distribution phase and path discovery phase, respectively. One is the black hole attack (BHA), where the faulty node participates in path discovery normally while performing uniform random selective packet loss; the second is the traffic redirection attack (TRA). Based on BHA, it interferes with the path discovery process involved, and the faulty node falsely reports its local key status, which will cause nodes to select a path containing faulty nodes for key distribution with a greater probability.
Performance metrics including packet delivery ratio, key material utilization, corrupted key ratio, and hop count can be used to assess the availability and reliability of routing strategy within the specified simulation environment, as described below.
Packet Delivery Ratio: This metric is the ratio of communication keys successfully received by the destination to all communication keys sent by the source during the simulation. This is used to evaluate communication key distribution performance under fault scenarios.
Key Material Utilization: This metric is the ratio of local keys consumed by successfully received communication keys during the simulation to the total consumption of local keys. It can reflect the key utilization rate and certain routing overhead, reflecting the availability of the routing strategy.
Corrupted Key Ratio: The metric is the proportion of successfully received communication keys that have passed faulty nodes during the simulation. This reflects the confidentiality of communication keys successfully distributed in fault scenarios.
Hop Count: This metric is the average forwarding number of successfully received communication keys during the simulation. This reflects the effectiveness of the path discovery mechanism with a certain key distribution delay and improved fault tolerance.
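The four metrics can be computed from per-key records as sketched below. The record layout (`KeyRecord`) and the separate `control_key_used` term for local keys spent on routing messages are assumptions for illustration, not the simulator's actual data model.

```python
from dataclasses import dataclass


@dataclass
class KeyRecord:
    delivered: bool        # communication key successfully received
    hops: int              # number of times the key was forwarded
    local_key_used: int    # local key units consumed while forwarding this key
    via_faulty: bool       # key passed through at least one faulty node


def metrics(records: list[KeyRecord], control_key_used: int = 0) -> dict:
    """Compute the four evaluation metrics from per-key records."""
    delivered = [r for r in records if r.delivered]
    total_local = sum(r.local_key_used for r in records) + control_key_used
    useful_local = sum(r.local_key_used for r in delivered)
    n_sent, n_ok = len(records), len(delivered)
    return {
        "packet_delivery_ratio": n_ok / n_sent if n_sent else 0.0,
        "key_material_utilization": useful_local / total_local if total_local else 0.0,
        "corrupted_key_ratio": sum(r.via_faulty for r in delivered) / n_ok if n_ok else 0.0,
        "hop_count": sum(r.hops for r in delivered) / n_ok if n_ok else 0.0,
    }


log = [
    KeyRecord(True, 3, 30, False),
    KeyRecord(True, 2, 20, True),
    KeyRecord(False, 1, 10, False),
    KeyRecord(False, 2, 20, True),
]
result = metrics(log, control_key_used=20)
```

Note that the corrupted key ratio and the hop count are taken over delivered keys only, following the definitions above, so a scheme that drops most keys can still report a low corrupted ratio; the metrics are meant to be read together.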
5.2. Results and Discussion
Based on the above simulation environment configuration, this section compares the proposed solution with the other three methods on the above metrics in the two fault scenarios.
Packet Delivery Ratio:
Figure 8 shows the packet delivery ratio (PDR) of the four methods under different faulty node ratios (FNR) in the BHA scenario for different numbers of nodes. The vertical line segments depict the confidence intervals of the results. As FNR increases, the PDR of all four methods decreases, as expected. The proposed solution has a higher PDR than the other three methods at the same FNR. When the FNR is low, the proposed solution can infer from different evidence that the fault is more likely to be benign, so it performs dynamic route discovery and plans a better path for communication key forwarding. The other methods have no identification mechanism and their paths are relatively fixed, so they have a lower PDR. ASR’s load balancing improves the PDR only slightly. MPCS reduces the possibility of key leakage by transmitting over redundant paths, but this accelerates local key consumption proportionally and decreases system availability. When FNR increases, the proposed solution is more likely to detect the faulty nodes. Although partitioning the network degrades performance and reduces PDR, the faulty nodes are isolated and the path discovery information becomes more accurate, so the PDR remains higher than that of the other methods. As the number of nodes increases, the distributed scalability of the proposed solution enables the system to maintain a high PDR. When the number of nodes is 40, the PDR of the proposed solution decreases significantly when FNR is around
to
, while when the number of nodes is 60, 80, and 100, PDR drops significantly when FNR is greater than
. This shows that as the number of nodes increases, the more complex topology provides more available paths, even though the proposed solution isolates faulty nodes at the same FNR.
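The averaged values and confidence intervals plotted for each FNR can be obtained from the repeated runs along the following lines. This is a sketch under stated assumptions: the function name `mean_with_ci` is hypothetical, the default confidence level of 0.95 is an assumed value rather than one taken from the text, and a normal approximation is used for simplicity (with only 20 runs, a Student t quantile would give a slightly wider interval).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_with_ci(samples: Sequence[float],
                 level: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a two-sided confidence interval.

    `level` is an assumed confidence level; the interval uses the normal
    approximation mean +/- z * s / sqrt(n).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate the spread")
    m = mean(samples)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # two-sided quantile
    half_width = z * stdev(samples) / sqrt(n)     # stdev = sample std. dev.
    return m, m - half_width, m + half_width
```

For example, passing the 20 per-run PDR values measured at one FNR returns the point to plot together with the ends of its vertical line segment.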
Figure 9 shows the PDR of the four methods under different FNR in the TRA scenario for different numbers of nodes. The vertical line segments depict the confidence intervals of the results. As the FNR increases, the PDR of all methods decreases. However, compared with the BHA scenario, the PDR of OSPF, ASR, and MPCS drops rapidly under TRA. This is because the faulty node propagates erroneous status information during route discovery, causing established paths to pass through faulty nodes, and the effect becomes more serious as the FNR increases. With the proposed solution, erroneous status information can likewise cause too many paths to be concentrated on faulty nodes, and the PDR will inevitably drop sharply. However, the inconsistency between the reported status information and the observed PDR improves the fault detection efficiency, so the proposed solution avoids the related faulty nodes in subsequent path discovery and keeps the PDR at a relatively stable level. As a result, the proposed solution performs better than the other three methods at each FNR.
Key Material Utilization:
Figure 10 shows the key material utilization (KMU) of the compared methods under different FNR in the BHA scenario for different numbers of nodes. The vertical line segments depict the confidence intervals of the results. KMU reflects the proportion of key material used to successfully forward communication keys among the total key consumption, and is affected by both the PDR and the local key overhead spent on purposes other than communication key forwarding. Compared with OSPF and ASR, the KMU of the proposed solution is slightly lower because it performs detection and path discovery more frequently under high FNR. MPCS has the lowest KMU because of the resource consumption caused by its mechanism. When OSPF and ASR implement routing control correctly, they incur no additional local key overhead as FNR increases. The proposed solution, in contrast, provides the significant PDR gains mentioned above by improving the fault tolerance of the system at the cost of resource consumption. In general, the KMU difference among OSPF, ASR, and the proposed solution is small, reflecting the effective resource utilization of the proposed solution. For example, when the number of nodes is 100, the improved PDR makes the KMU of the proposed solution even better than that of the comparison methods at FNR around
to
.
Corrupted Key Ratio:
Figure 11 shows the corrupted key ratio (CKR) of the compared methods under different FNR in the BHA scenario with 60 nodes. The CKR of every method increases with FNR, but the rate of increase of the proposed solution is much lower than that of the comparison methods, which is one of its advantages. MPCS has a lower CKR than OSPF and ASR when FNR is low. However, as FNR increases, the network no longer has enough redundant paths for key distribution, which leads to a sharp increase in CKR under MPCS. Through the collaboration of the fault detection and path discovery mechanisms, the proposed strategy finds a suitable path for communication key distribution, which reduces the possibility of the communication key being intercepted by faulty nodes. When the maximum FNR is equal to
, the CKR is still less than
. This information leakage can be eliminated with privacy amplification. Since the paths provided by OSPF and ASR are relatively fixed, their CKR is linearly and positively correlated with FNR, which greatly increases the possibility of communication key leakage.
Hop Count:
Figure 12 shows the hop count of the compared methods under different FNR in the BHA scenario with 60 nodes. OSPF identifies a single shortest path for key distribution. ASR and MPCS select other sub-shortest paths with a certain probability to improve the distribution success rate, and therefore have a larger hop count. As FNR increases, a slight decrease in their hop count can be observed. This is essentially because the comparison methods lack response measures against faulty nodes: long paths are more likely to contain faulty nodes, so their distribution success rate falls and the hop count traversed by successfully received communication keys is reduced. In the absence of faulty nodes, thanks to the reasonable definition of link status, the hop count of the proposed solution lies between those of OSPF and ASR, which also represents a lower distribution delay. Its hop count increases with FNR, which means that the proposed solution can identify faulty nodes and find paths for key distribution that the comparison methods cannot, improving the fault tolerance of the strategy.
6. Conclusions
In this paper, we propose a practical on-demand fault-tolerant routing strategy to improve the availability and reliability of communication key distribution in the presence of network-layer faults. We combine path discovery and fault detection mechanisms to balance the effectiveness and fault tolerance of SKDN. In particular, the strategy adopts fault-free on-demand path discovery and selects the appropriate path for key forwarding based on the local key status. In addition, an acknowledgment-based fault detection mechanism is integrated into the distribution process to locate abnormal links, and the identification accuracy is improved by inferring the possible causes based on DH evidence theory. The system’s reliability is enhanced by responding differently to different causes. The simulation results demonstrate the effectiveness and scalability of the proposed solution compared with the comparison methods under different faulty node ratios. Moreover, the proposed solution has a relatively low local key overhead, which indicates its practicability. In future work, we will consider the impact of changes in bandwidth resources, including the resulting reduction in fault identification accuracy. Additionally, we will consider a mechanism that allows nodes to join and leave dynamically, to prevent delays in path information dissemination. We will also incorporate additional evidence from other intrusion detection systems and reputation systems to analyze network status more comprehensively, and develop a more refined exception response mechanism to further enhance the fault tolerance of SKDN.