In this section, we conduct simulated experiments to evaluate the utility and the robustness of our proposed DP-DHR framework.
In certain fields where sensitive information is collected, such as finance and healthcare, ensuring cyberspace security and data privacy is of the utmost importance. In these contexts, the frequency features of private data are critical factors to consider. For example, the incidence rate of a specific disease in healthcare, or the proportion of high- or low-income individuals in finance, are crucial real-world statistical measures. Therefore, in this section, we consider ‘frequency statistics’ as the executor functionality of the DP-DHR framework and analyze its utility.
To simulate our experiment, we generate a dataset consisting of 100 data points, each with 30 features. This dataset can be represented as a 100 × 30 matrix whose element in the i-th row and j-th column takes a binary value (0 or 1). This representation is common in real-life situations, such as medical databases that record the presence or absence of diseases in a population, or financial databases that encode salary ranges for various individuals.
In all the experiments described below, we fix the candidate percentage parameter (introduced in Algorithm 2) as , indicating that of the perturbed intermediate results are selected as candidates C.
5.2. Two-Dimensional Outputs
If we consider two-dimensional outputs, the quantities of interest are the frequency of data points with a value of 1 in the first dimension and the corresponding frequency in the second dimension. The expected output of the system can therefore be represented as the pairwise frequency formed by these two values. For instance, for the database D given in (10), the expected output of the system is the corresponding pair of frequencies. This type of frequency is often utilized when assessing the concordance property of disease prevalence.
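As a concrete illustration of this executor functionality, the following sketch computes the pairwise frequency from a binary dataset. The dataset, seed, and function name are hypothetical stand-ins, not taken from the paper's implementation.

```python
import random

def pairwise_frequency(data):
    """Fraction of records with value 1 in the first and second features."""
    n = len(data)
    f1 = sum(row[0] for row in data) / n
    f2 = sum(row[1] for row in data) / n
    return (f1, f2)

# A toy stand-in for the 100 x 30 binary dataset described above.
random.seed(0)
dataset = [[random.randint(0, 1) for _ in range(30)] for _ in range(100)]
print(pairwise_frequency(dataset))
```

Each honest executor would evaluate this functionality on the private database before noise is injected.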
In our simulated database, the traditional DHR architecture (referred to as T-DHR), without being subjected to any attacks, produces the exact (unperturbed) pairwise frequency as its output. The disparity between the outputs of the DP-DHR framework and the T-DHR architecture is illustrated in Figure 4.
Figure 4a depicts the outputs obtained at various privacy budgets ε, with a randomly chosen number of executors k. The green point in the figure represents the output produced by the T-DHR architecture, while the red points correspond to the outputs generated by the DP-DHR framework for different values of ε, with the respective values of ε marked next to the points for reference.
Similarly, Figure 4b shows the outputs at different numbers of executors k, with a randomly chosen privacy budget ε. The green point in the figure represents the output of the T-DHR architecture, whereas the red points denote the outputs generated by the DP-DHR framework, with the respective values of k marked adjacent to the red points.
As shown in Figure 4, the difference between the outputs generated by the T-DHR architecture and the DP-DHR framework (measured by the Euclidean distance) diminishes as ε and k increase, respectively. This observation is similar to the phenomenon discussed in the experiments involving ‘1-dimensional outputs’ in Section 5.1; moreover, the former (Figure 4a) demonstrates the tradeoff between utility and privacy established in Remark 3.
As is evident, the hypersphere clustering method plays a crucial role in the advanced decision strategy outlined in Algorithm 1. Thus, our focus is on evaluating the utility of our proposed clustering method under both normal (unattacked) and attacked conditions.
To begin, we conduct experiments under normal conditions. To evaluate the performance of our proposed hypersphere clustering method, we randomly choose the privacy budget ε and showcase the clustering results for different numbers of executors k = 10, 50, 100, 200, 300, 400.
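The details of Algorithm 2 are not reproduced here, but its candidate/outlier split can be sketched with a simple distance-based stand-in: center a hypersphere at a robust anchor (the coordinate-wise median, an assumption of ours) and keep the closest fraction of points as candidates. The true output, noise scale, and candidate ratio below are hypothetical.

```python
import math
import random
import statistics

def hypersphere_cluster(points, candidate_ratio):
    """Simplified stand-in for the hypersphere clustering of Algorithm 2.

    Keeps the `candidate_ratio` fraction of points closest to the
    coordinate-wise median as candidates C; the rest become outliers O.
    """
    dims = range(len(points[0]))
    center = [statistics.median(p[d] for p in points) for d in dims]
    ranked = sorted(points, key=lambda p: math.dist(p, center))
    m = max(1, round(candidate_ratio * len(points)))
    return ranked[:m], ranked[m:]

# k perturbed 2-D intermediate results around a hypothetical true output.
random.seed(0)
true_out = (0.6, 0.4)
results = [(true_out[0] + random.gauss(0, 0.2),
            true_out[1] + random.gauss(0, 0.2)) for _ in range(100)]
C, O = hypersphere_cluster(results, 0.6)
print(len(C), len(O))
```

The actual algorithm may choose the sphere differently; the point is only that nearby results are retained and far-off ones are rejected.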
The clustering results are shown in Figure 5. The blue points are the perturbed intermediate results classified as candidates C, while the red points represent the perturbed intermediate results classified as outliers O. The green points represent the outputs of the T-DHR architecture (as mentioned before), and the brown points represent the outputs generated by the DP-DHR framework.
In Figure 5a–f, we consider the candidate percentage mentioned above; accordingly, the number of blue points is slightly larger than this percentage of k, as per Algorithm 2. From Figure 5, it can be seen that for the DP-DHR framework, the injected random noise causes the perturbed intermediate results (the red and blue points) to deviate from the expected system output (the green points), and the gaps are particularly large for some individual executors. However, since the Gaussian random noise added to the intermediate results is zero-mean, the deviations of individual executors are weakened by averaging over all the candidates, as discussed in the analysis of Algorithm 1. Due to this property, the gap between the outputs of the DP-DHR framework and the T-DHR architecture decreases as the number of executors k increases: for moderate k the difference between them is already relatively small, and for the largest k the outputs of the DP-DHR framework and the T-DHR architecture are almost identical.
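This averaging effect can be checked numerically: with zero-mean Gaussian noise, the mean of the perturbed results concentrates around the true output at a rate of roughly σ/√k. The true output and noise scale below are hypothetical, not the values of the paper's experiments.

```python
import math
import random

random.seed(1)
true_out = (0.6, 0.4)   # hypothetical expected pairwise frequency
sigma = 0.5             # hypothetical Gaussian noise scale

def averaged_output(k):
    """Mean of k Gaussian-perturbed intermediate results."""
    perturbed = [(true_out[0] + random.gauss(0, sigma),
                  true_out[1] + random.gauss(0, sigma)) for _ in range(k)]
    return tuple(sum(c) / k for c in zip(*perturbed))

# The Euclidean gap to the true output shrinks as k grows.
for k in (10, 100, 1000, 10000):
    print(k, round(math.dist(averaged_output(k), true_out), 4))
```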
Then, we perform experiments under attacked conditions, in which the adversary gains control over a certain percentage of executors and aims to manipulate the system output. In our experiments, we consider different attack rates (AR), where 10%, 20%, 30%, or 40% of the executors are assumed to be under the adversary’s control. Because of the majority-rule decision strategy, all the executors controlled by the adversary are assumed to output the same result. Furthermore, for the attacked conditions, we consider two cases: (1) the random noise injection process is not controlled by the adversary, and (2) the random noise injection process is controlled, in which case the attack is more sophisticated. We start with the former case.
In the first case, where the random noise injection process is not controlled, the outputs of the compromised executors are assumed to be fixed values. Random noise is then added to these outputs o to obtain the perturbed intermediate results. The clustering results are shown in Figure 6 and Figure 7, where the green and brown points represent the same quantities as in Figure 5. Additionally, in these figures, the light blue points represent the outputs given by the attacked executors that are classified into the candidates C, and the dark blue points represent the outputs given by the normal executors that are classified into the candidates C. The light red points correspond to the outputs given by the attacked executors that are classified into the outliers O, while the dark red points correspond to the outputs given by the normal executors that are classified into the outliers O.
In the case where the adversary outputs a value far outside the valid range, we simulate the scenario where the adversary intends to mislead the system output regardless of whether the attack is detected. Since frequency values naturally lie in (0, 1), such an output is obviously far from the reasonable range, and this simulation represents a blatant attempt by the adversary to manipulate the system. In this case, Figure 6 illustrates that all the outputs generated by the attacked executors are classified as outliers O, except in Figure 6d. The reason for this exception is that, with the chosen candidate percentage and an attack ratio of AR = 40%, the candidates must include at least one attacked perturbed intermediate result according to Algorithm 2. However, this does not affect the final system output. In general, unless the candidate percentage exceeds the fraction of normal executors, it is highly unlikely for the outputs given by the attacked executors to be classified as candidates when the adversary’s output is significantly different from the expected value. This indicates that our proposed hypersphere clustering method is robust against this kind of attack.
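This filtering behavior can be sketched in simulation. Below, a hypothetical 30% of executors emit an out-of-range value before noise injection, and a simplified median-distance stand-in for the hypersphere filter (the executor count, noise scale, and attack value are all assumptions) rejects them.

```python
import math
import random
import statistics

random.seed(2)
k, AR, ratio = 100, 0.3, 0.6        # hypothetical executor count, attack rate, candidate ratio
true_out, sigma = (0.6, 0.4), 0.1
attack_out = (3.0, 3.0)             # far outside the natural frequency range (0, 1)

def perturb(p):
    return (p[0] + random.gauss(0, sigma), p[1] + random.gauss(0, sigma))

# Compromised executors all emit the same implausible value; honest
# executors emit the true output. Zero-mean noise is added to every result.
results = [perturb(attack_out) for _ in range(int(AR * k))]
results += [perturb(true_out) for _ in range(k - int(AR * k))]

# Simplified hypersphere filter: keep the points closest to the median.
center = [statistics.median(p[d] for p in results) for d in (0, 1)]
candidates = sorted(results, key=lambda p: math.dist(p, center))[:round(ratio * k)]
output = tuple(sum(c) / len(candidates) for c in zip(*candidates))
print(output)
```

Because the attacked results sit far from the bulk of honest results, none of them survives the candidate selection and the averaged output stays near the true frequency.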
In the case where the adversary outputs a fixed value within the valid range, we simulate the scenario where the adversary intends to mislead the system output without being easily detected, since frequency values naturally lie in (0, 1). In this case, Figure 7 demonstrates that while some of the outputs generated by the attacked executors are classified as candidates, the impact of the adversary on the final output of the DP-DHR framework is minimal. This phenomenon becomes more pronounced with an increase in AR, indicating the robustness of Algorithm 2 against this type of attack.
Now, let us consider a more complex scenario where the random noise injection process is controlled by the adversary. In this case, we assume that the outputs o of the executors controlled by the adversary are as follows: (1) both dimensions of o are set to 3; (2) the two dimensions of o are randomly chosen within the valid frequency range; (3) both dimensions of o are set to 1. The clustering results for these scenarios are depicted in Figure 8, Figure 9 and Figure 10, respectively. The point colors in these figures correspond to those in Figure 6 and Figure 7.
Regarding the adversary’s output (3, 3), similarly to Figure 6, we simulate the scenario where the adversary intends to mislead the system output regardless of whether the attack is detected. In Figure 8, the light red point located at the top right corner (coordinates (3, 3)) represents all the intermediate results generated by the attacked executors, since all the controlled intermediate results are mapped to this single point. Specifically, in Figure 8a–d, 40, 80, 120, and 160 intermediate results, respectively, are mapped to this point.
The clustering results presented in Figure 8 demonstrate that the perturbed intermediate results classified as candidates C are almost exclusively generated by the normal executors rather than by the attacked executors controlled by the adversary. Furthermore, the results obtained from the T-DHR architecture and the DP-DHR framework are similar, indicating that our proposed hypersphere clustering method is robust against this kind of attack.
In the case where the adversary’s output is randomized, we simulate an adversary that randomly generates the two dimensions of o within the valid range. The intention of the adversary is to mislead the system output while avoiding easy detection, as frequency values naturally lie in (0, 1). Figure 9 displays the clustering results in this scenario. Here, the intermediate results classified as candidates C may contain results generated by the attacked executors, and the ratio of light blue points (attacked candidates) to the total number of blue points (attacked plus normal candidates) increases as AR grows. However, the difference between the outputs of the T-DHR architecture and the DP-DHR framework is not significant, regardless of the scale of AR. One reason for this is that the expected output in the T-DHR architecture is close to the center of the range from which the adversary samples; if the expected result were different (e.g., (0, 0)), the gap between the outputs might be larger.
In the case where the adversary’s output is (1, 1), we simulate the scenario where the adversary aims to mislead the system output without being easily detected. The clustering results are shown in Figure 10, in which all the attacked results are mapped to the single point (1, 1), represented by the light blue point. As the light blue color indicates, all the attacked results are classified as candidates. This means that all the intermediate results generated by the attacked executors are misclassified and the system is successfully attacked.
As shown in Figure 10, the gap between the outputs given by the DP-DHR framework and the T-DHR architecture is larger than under the previous conditions. Comparing Figure 10a–d with Figure 9 (or Figure 8a–d), the gaps between the brown and green points are notably wider in Figure 10. Due to the relatively large distance between (1, 1) and the expected output, the output of the DP-DHR framework (the brown point) is ‘dragged’ away from the expected output of the T-DHR architecture (the green point) by the attacked executors. When AR is small, the consequence is not very pronounced: for example, in Figure 10a, although all the attacked results are classified as candidates C, the gap between the DP-DHR output and the T-DHR output is not significant due to the small AR. On the contrary, when a large AR is employed, as shown in Figure 10d, the attacked results significantly mislead the output of the DP-DHR framework. In conclusion, the gap between the DP-DHR framework and the T-DHR architecture grows with AR, as shown in Figure 10a–d. This observation aligns with intuition and our expectations.
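The dragging effect admits a simple back-of-the-envelope form: if a fraction a of the candidate set consists of adversarial results fixed at some value, the candidate average moves linearly from the true output toward that value (zero-mean noise averages out and is omitted). The numbers below are hypothetical illustrations, not the paper's measurements.

```python
import math

true_out = (0.6, 0.4)    # hypothetical expected output
attack_out = (1.0, 1.0)  # in-range adversarial value, as in the third case

def dragged_output(a):
    """Candidate average when a fraction `a` of the candidates is adversarial."""
    return tuple((1 - a) * t + a * v for t, v in zip(true_out, attack_out))

# The gap to the true output grows linearly with the adversarial fraction.
for a in (0.1, 0.2, 0.3, 0.4):
    print(a, round(math.dist(dragged_output(a), true_out), 3))
```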
In summary, the proposed DP-DHR framework demonstrates resilience against the majority of attack conditions, with one notable exception: when the adversary coerces all the attacked executors to produce the same value within a reasonable range (e.g., (0, 1) for frequencies), particularly when the disparity between this value and the expected correct output is substantial. This issue is exacerbated at higher values of AR, as shown in Figure 10.
Here, we discuss the effects of an Advanced Persistent Threat (APT) attack, using the coordinated multi-vector attack as an example. In the DHR architecture (or the proposed DP-DHR framework), a coordinated multi-vector attack occurs when more than one executor is controlled by the adversary, and we classify such attacks into two categories:
- 1. The adversary controls no more than half of the executors: Due to the majority-principle decision strategy, the output of the traditional DHR architecture cannot be tampered with in this case, because the intermediate results provided by the controlled executors are categorized as ‘attacked’, while the correct intermediate results are the ones considered when making decisions. As a result, the integrity and the availability of the traditional DHR architecture do not decrease; however, the confidentiality (data privacy) is not protected. For the proposed DP-DHR framework, on the other hand, the theoretical and experimental results given above (Theorem 1 and Figure 6, Figure 7, Figure 8, Figure 9 and Figure 10) show that the confidentiality (data privacy) is enhanced without much sacrifice of the integrity and the availability.
- 2. The adversary controls more than half of the executors: If more than half of the executors are controlled by the adversary, the adversary can fully determine the output of the traditional DHR architecture, because the majority-principle decision strategy allows the adversary to map the majority of the intermediate results to a single value. In this case, neither the traditional DHR architecture nor the proposed DP-DHR framework can ensure the integrity and the availability. However, the confidentiality (data privacy) is still guaranteed in the DP-DHR framework, unlike in the traditional DHR architecture. It is worth noting that such a case is unlikely to occur in the DHR architecture (and the DP-DHR framework), because the executors are heterogeneously designed, which reduces the probability of the majority of them being simultaneously controlled. Additionally, the dynamic property of the DHR architecture (and the DP-DHR framework) further decreases the probability of this case.
Remark 5. Through our analysis of the attacked results, we observed that the outputs generated by the controlled executors can alter the distribution of the injected Gaussian random noise, which in turn affects the privacy properties of the entire DP-DHR system. This phenomenon is even more pronounced in the DHR architecture, as it is designed to tolerate the compromise of several executors, which allows the adversary to use a certain number of controlled executors to construct particular probability distributions that affect the privacy properties.
For instance, to guarantee DP, the online executors not under the adversary’s control sample zero-mean Gaussian random noise. However, if the adversary employs a different distribution, such as a uniform distribution over a symmetric range, and adds this noise to the intermediate results of the controlled executors, then the noise distribution received by the decision module becomes unpredictable after aggregation, as it no longer follows the original Gaussian distribution.
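This distribution shift can be illustrated numerically. Below, honest executors draw zero-mean Gaussian noise while compromised executors draw uniform noise; the aggregate noise sample seen by the decision module then has a visibly inflated variance relative to the calibrated Gaussian. The σ, range, and attack rate are hypothetical choices for illustration only.

```python
import random
import statistics

random.seed(4)
k, AR = 10_000, 0.3      # hypothetical sample count and attack rate
sigma = 0.5              # Gaussian scale used by honest executors
half_width = 2.0         # adversary samples uniformly from [-2, 2]

honest = [random.gauss(0, sigma) for _ in range(int((1 - AR) * k))]
attacked = [random.uniform(-half_width, half_width) for _ in range(int(AR * k))]
mixture = honest + attacked

# The decision module no longer sees the calibrated Gaussian noise:
# the Gaussian/uniform mixture has a noticeably larger variance.
print(round(statistics.pvariance(honest), 3), round(statistics.pvariance(mixture), 3))
```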
Furthermore, due to the dynamic feature of the DHR architecture, an attacked online executor may be scheduled offline while new executors are scheduled online, making the situation even more complicated. This issue is a downstream consequence of our proposed framework, which aims to tackle privacy concerns in the DHR architecture; it stems from the specific noise distribution of the DP tool employed and holds significant research value both theoretically and empirically. We will therefore conduct more detailed research on this matter in future work.
Remark 6. Here, we delve into the implementation and potential implications of the DP-DHR framework in practical scenarios. On one hand, network latency in real-world deployments could impact the efficiency of the DP-DHR framework. Since the framework cannot verify the correctness of individual intermediate results, the effectiveness of the hypersphere clustering method (Algorithm 2) relies on the majority principle. Consequently, the decision module must await the aggregation of most (if not all) of the k intermediate results before reaching a verdict, which may reduce efficiency under prolonged network latency. Moreover, the heterogeneous nature of the DP-DHR framework further complicates efficiency considerations: heterogeneously designed executors exhibit varying processing times for identical inputs and experience divergent network latency within the same cyber environment. This mirrors challenges encountered in traditional DHR architectures. On the other hand, because of the heterogeneously designed executors, the DHR architecture requires more hardware and software resources than a traditional architecture, and the hypersphere clustering method adds further time complexity to the DP-DHR framework, which may increase processing time. However, both the resource costs associated with the DHR architecture and the time complexity introduced by the DP-DHR framework stem from security and privacy requirements and are therefore deemed acceptable. Validation and corresponding experiments under realistic conditions (including live data streams) will be performed in future work.
Based on the comprehensive discussions above, we observed that the gap between the output of our proposed DP-DHR framework, which incorporates the advanced decision strategy and the hypersphere clustering method, and the output of the T-DHR architecture is relatively small under most circumstances. Specifically, the DP-DHR framework is resilient against attacks designed to manipulate the system output regardless of whether they are detected. However, against attacks that aim to mislead the system output without being easily detected, the robustness of the DP-DHR framework diminishes, particularly when the attack rate is high; this behavior is similar to that observed in the T-DHR architecture.