Article

DPTracer: Integrating Log-Driven Accountability into Data Provision Networks

by
JongHyup Lee
Department of Mathematical Finance, Gachon University, Seongnam 13120, Republic of Korea
Appl. Sci. 2024, 14(18), 8503; https://doi.org/10.3390/app14188503
Submission received: 15 August 2024 / Revised: 16 September 2024 / Accepted: 19 September 2024 / Published: 20 September 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Emerging applications such as blockchain, autonomous vehicles, healthcare, federated learning, self-consistent large language models (LLMs), and multi-agent LLMs increasingly rely on the reliable acquisition and provision of data from external sources. Multi-component networks, which supply data to the applications, are defined as data provision networks (DPNs) and prioritize accuracy and reliability over delivery efficiency. However, the effectiveness of the security mechanisms of DPNs, such as self-correction, is limited without a fine-grained log of node activities. This paper presents DPTracer: a novel logging system designed for DPNs that uses tamper-evident logging to address the challenges of maintaining a reliable log in the untrusted environment of a DPN. By integrating logging and validation into the data provisioning process, DPTracer ensures comprehensive logs and continuous auditing. Our system uses Process Tree as a data structure to store log records and generate proofs. This structure permits validating node activities and reconstructing historical data provision processes, which are crucial for self-correction and verifying data sufficiency before results are finalized. We evaluate the overheads introduced by DPTracer in terms of computation, memory, storage, and communication. The results demonstrate that DPTracer incurs reasonable overheads, making it practical for real-world applications. Despite these overheads, DPTracer enhances security by protecting DPNs from post-process and in-process tampering.

1. Introduction

A recurring problem in emerging applications such as blockchain [1,2,3], self-consistent large language models (LLMs) [4], multi-agent LLMs [5], autonomous vehicles [6], healthcare [7], and federated learning [8] is the reliable acquisition and provision of data from external sources through multiple agents. For example, blockchain price oracles aggregate asset prices from centralized or decentralized exchanges to derive a consensus price used in smart contracts [9,10,11]. This collaborative network of multiple nodes, which supplies data to a closed system, is defined as a data provision network (DPN).
A DPN prioritizes providing reliable and accurate data over delivery efficiency. Using several reliability mechanisms, it is structured to mitigate, detect, and correct potential threats or errors that could compromise the accuracy of the provided data. First, nodes within the DPN are inherently untrusted and must perform mutual verification. Second, these nodes collect data from diverse sources, paths, and instances for the same type of information and synthesize them into a single reliable value through the in-network aggregation process, which includes consensus building. Finally, a critical feature of DPNs is the self-correcting mechanism. After data are delivered to the data requester, if an issue arises with the provided data, a self-correction process is initiated, such as the dispute phase in the price oracle [1,2] or the feedback mechanisms in multi-agent LLMs [5]. If a node is identified as the source of the erroneous data during this process, it is adjusted, isolated from subsequent data provision, or even subjected to penalties.
Nevertheless, the effectiveness of self-correction is constrained unless it is performed with sufficiently fine-grained evidence, which a systematic logging system can provide. Furthermore, DPNs often lack responsible loggers. Instead, the responsibility for maintaining and presenting evidence for misbehavior falls to those who propose self-corrections [1].
Maintaining comprehensive log records throughout the data provisioning process benefits DPNs. These logs help identify erroneous nodes during the error-correction process, provide clear evidence, and validate the results’ accuracy based on the sufficiency of the input data. Such detailed logs thus facilitate finer-grained accountability in DPNs, ensuring all actions are traceable and verifiable.
However, managing a log within DPNs presents unique challenges. Nodes in a DPN, often part of a dynamic membership, are generally unsuitable for retaining log records for later dispute processes. Introducing a specialized logger node could address this issue, but it must be designed to operate efficiently in the untrusted environment of a DPN. The logger should support frequent, efficient audits to protect against tampering and must enforce comprehensive logging by all nodes to facilitate potential inspections.
This paper proposes DPTracer: a novel logging system tailored for DPNs that uses a tamper-evident logging approach that assumes the logger is untrusted and subject to rigorous audits. We incorporate logging and validation steps directly into the delivery protocol for comprehensive logs of processes and continuous auditing of both nodes and the logger. Before a node transmits an output to the subsequent node in a DPN, the node must obtain a commitment and proof of the updated log states. By verifying the proof, the node can confirm the consistency of its inputs and outputs in the log and the integrity of logged items up to its output. The commitment is then forwarded along with the output for the subsequent node to continue this validation cascading.
The underlying data structure for the logger is Process Tree, which is based on History Tree [12] and is optimized for generating efficient proofs of versioned logs. DPTracer records the flow-sensitive activities of nodes in a data provision process with Process Tree and generates interval proofs. The proofs validate a node’s input–output relationship and confirm the inclusion of log requests. By leveraging the logs, DPTracer enables the reconstruction of historical data provision processes at the node level and verifies the sufficiency of data inputs before finalizing results.
Our contributions can be summarized as follows:
  • We tackle the challenge of insufficient logging, as logging is essential for self-correction in multi-agent data provisioning environments.
  • We frame multi-agent data provisioning within the structure of a DPN and introduce DPTracer. This logging system is designed for DPNs, incorporates a Process Tree data structure, and integrates logging into the delivery protocol.
  • We present a security analysis and evaluation demonstrating that DPTracer enhances security and maintains reasonable overhead.
The structure of the paper is as follows. Section 2 provides background information on the Merkle tree structure and tamper-evident logging. Section 3 describes the design of DPTracer. The implementation of the logger using Process Tree is detailed in Section 4. Section 5 discusses how logging is integrated into the data provision and dispute processes. Section 6 analyzes the security of DPTracer. Section 7 compares DPTracer with existing tamper-evident logging systems. Section 8 evaluates DPTracer's computation, memory, storage, and communication overhead. Related work is discussed in Section 9. The paper concludes with Section 10.

2. Background

2.1. Multi-Agent Data Provisioning

Price oracle feeds for blockchain and multi-path LLMs exemplify multi-agent data provisioning. Figure 1 illustrates that both structures use staged aggregations to provide a single result. We generalize these structures into a DPN model in Section 3.5.

2.1.1. Price Oracle Aggregation

Price oracles use multi-level aggregation, as depicted in Figure 1a, to consolidate price data from multiple data sources to a single result. For instance, Chainlink, a prominent price oracle service, delivers reliable asset prices to blockchain systems through three-level aggregations [13]. Price data aggregators gather prices from multiple raw sources and perform the first-level aggregation, such as calculating a weighted average based on volume and liquidity. Node operators filter outliers and compute the median value as the second-level aggregation. These results may be delivered individually or collectively by aggregating them [9]. Finally, the multiple values from node operators are merged into a single value through the oracle network aggregation, which may occur in a blockchain system. Other oracle services, like Band Protocol [10] and Pyth Network [14], also use multi-level aggregation and refinement strategies to provide reliable, correct data to blockchain systems.

2.1.2. Multi-Path LLM

The self-consistency strategy of an LLM addresses the inherent inconsistency of LLM outputs and improves correctness by exploring multiple reasoning paths [4]. Figure 1b illustrates this architecture. Multiple instances of an LLM generate outputs from an input, which are then aggregated into a single result by selecting the most consistent (major) answer from the output set. Approaches to enhancing the LLM output through diversifying reasoning continue to evolve: externally with multi-agent LLMs [5] and internally as with tree-of-thoughts strategies [15].

2.1.3. Self-Correction Mechanisms

Self-correction mechanisms enhance data accuracy and quality in multi-agent data provisioning systems. Price oracle systems, for example, have a dispute phase that is initiated by stakeholders to challenge and verify the correctness of the provided data using their own evidence [1,3]. Similarly, multi-agent LLMs use feedback mechanisms based on the previous output to adjust agents’ settings. Self-correction involves identifying errors through continuous monitoring and consensus mechanisms to pinpoint a target to improve. Countermeasures may include adjusting reputations or settings or slashing the deposits of the unreliable agent. Therefore, logs of in-network agent activities are essential for self-correction mechanisms to review data collection and aggregation processes. These logs provide a means to inspect the performance with evidence and refine decision-making models. They are also helpful for verifying the consistency of the data reported by different agents. In this paper, we focus on self-correction mechanisms at the level of dispute resolution in price oracle systems, which require stricter verifiable evidence to identify errors.

2.2. Tamper-Evident Logging Using Merkle Trees

2.2.1. Merkle Tree Structure

A Merkle tree is a binary hash tree in which each leaf is labeled with the cryptographic hash of a data block and every interior (non-leaf) element is labeled with the hash of the concatenated labels of its child elements. It enables quick verification of the integrity of individual data elements without requiring the entire dataset. The topmost hash, known as the root hash or Merkle root, functions as a digest of all the data underneath it. Any changes in data result in a corresponding change in the root hash.
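To make the structure concrete, the following minimal Python sketch (Python is also the language of the DPTracer prototype in Section 8) computes a Merkle root with SHA-256. The function names are illustrative and not part of DPTracer, and duplicating the last label on an odd level is just one common convention for incomplete levels:

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list) -> bytes:
    # Label each leaf with the hash of its data block.
    level = [sha256(b) for b in blocks]
    # Repeatedly hash concatenated pairs of labels until a single root remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # illustrative padding for an odd level
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"block-0", b"block-1", b"block-2", b"block-3"])
# Changing any block changes the root hash.
assert merkle_root([b"tampered", b"block-1", b"block-2", b"block-3"]) != root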

2.2.2. Tamper-Evident Logging

Tamper-evident logging extends the concept of data integrity into log management by ensuring that any changes to log entries are detectable. A tamper-evident log system maintains a log record of events to identify any modification, insertion, or deletion of a log entry. Tamper evidence is used in environments where logs are critical, such as security monitoring and forensic analysis [16].
For tamper-evident logging, while specialized hardware may be applied [17,18], cryptographic methods such as hash chains [19] or hash trees [12,20] are more broadly applicable across various computing environments. The hash-tree method, in particular, has been more favored because of its efficiency; it requires fewer computations and less data during proof generation and verification, making it particularly suitable for systems with resource constraints.

2.2.3. History Tree

History Tree is a data structure used in tamper-evident history logging [12]. It is a versioned hash tree wherein each change to the tree is identified as a distinct version. Events are sequentially added as leftmost leaf elements, updating the tree into a new version each time. Like the Merkle tree, the interior elements of a History Tree are labeled with the hash of the concatenated labels of their child elements. This process continues recursively to the root. Consequently, the root hash of the History Tree becomes a commitment that fixes the integrity of the entire content of the tree. This commitment captures a snapshot of the History Tree representing all events up to the given version number.
More formally, when a new event X i is introduced, the History Tree is updated by appending X i , leading to the version-i tree, and a corresponding commitment C i is computed. This commitment, C i , fixes the log containing events from X 0 to X i . Each commitment thus represents a unique snapshot that captures the cumulative state of the log up to version i.
Frequent auditing by verifying the proofs is essential to maintain the integrity of logs. History Tree uses two types of proofs to support auditing: incremental and membership. Incremental proofs validate the continuity and integrity between different versions of the tree (i.e., different commitments), confirming that two commitments share the same historical log data. In contrast, membership proofs are used to verify the presence of a specific event in a particular version of the History Tree. In combining these two proofs, the History Tree ensures the historical consistency of the log.
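As a rough illustration of versioned commitments (a naive sketch that rebuilds the tree on every append; a real History Tree updates incrementally and keeps earlier versions queryable), each appended event X_i yields a commitment C_i that fixes X_0 through X_i:

import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def commitment(events: list) -> bytes:
    # Simplified: rebuild the hash tree over X_0..X_i and return its root.
    level = [H(e) for e in events]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(H(b""))  # placeholder for a missing right child
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

log = []
commitments = []
for event in [b"X0", b"X1", b"X2"]:
    log.append(event)                    # append X_i as the next leaf
    commitments.append(commitment(log))  # C_i is a snapshot of the log up to version i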

3. DPTracer Design

3.1. Data Provision Log

DPTracer aims to provide a data provision log (DPLog), which records the input–output relationships of nodes during data provisioning. This feature facilitates targeted inspections by enabling errors to be traced back to their sources, addressing any malfunctions identified within the network. Moreover, DPLog enables a comprehensive pre-submission check to ensure that a sufficient number of nodes have participated. The reliability of the DPN results depends on the engagement of nodes. Results are more reliable if they have passed through sufficient nodes, indicating that an adequate range of external data sources has been considered.

3.2. Threat Model

The goal of attackers is to covertly manipulate the DPN to generate incorrect outcomes, thereby exploiting discrepancies in the target system for their gain. Attackers may infiltrate the network as nodes in order to inject incorrect information or omit crucial data. However, the DPLog ensures that such malicious activities are recorded, enabling the inspector to identify them. Thus, we assume that attackers may attempt either to manipulate the log record undetectably during and after data provisioning or to deliver data that are inconsistent with the log records. Additionally, we assume that attackers cannot find collisions in cryptographic hash functions.
We consider the following internal threats and log-tampering attacks:
  • Data modification: attackers may compromise nodes to send incorrect data to subsequent aggregators while deceiving the logger to avoid detection.
  • Log modification: attackers may compromise the logger to record incorrect information that deviates from the data specified in the nodes’ log requests.
  • Data fabrication: attackers may compromise nodes to generate a new data flow with arbitrarily fabricated outputs.
  • Delivery omission: attackers may compromise nodes to omit certain inputs to deliberately skew results.
  • Post-process tampering: attackers may access and modify log entries from previously completed rounds to hide misbehavior in earlier data provisioning.

3.3. Design Goals

DPTracer is structured around several design goals to ensure robustness under the threat model. The following goals are essential for maintaining the integrity and utility of the DPLog throughout its lifecycle:
G 1. Tamper-evident logging: any modification to the DPLog, both during data provisioning and after the submission of data, should be detectable.
G 2. Reconstructable process history: the DPLog must provide comprehensive information that is sufficient to reconstruct the data flow from earlier data provisioning.
G 3. Fine-grained accountability: the DPLog must contain enough detailed information to identify faults down to the individual node level.
G 4. Mutual audit: entities within the DPN should be able to audit each other's behavior during data provisioning.
G 5. Storage flexibility: the DPLog must support efficient long-term storage strategies, which enables selective collapsing of logs from groups of nodes that are no longer of interest without losing verifiability.

3.4. Design Overview

At a high level, DPTracer is designed based on three core features to achieve the design goals:
  • Process Tree: DPTracer uses Process Tree as a data structure for tamper-evident logging ( G 1 ). Process Tree records each output in the DPLog and efficiently generates its proofs, which are essential for frequent auditing ( G 4 ). Grouping leaf elements enables compaction of the storage space without compromising security ( G 5 ).
  • Structure serialization: In DPTracer, each node’s output is recorded in the DPLog along with references to input data that reveal the delivery hierarchy of data provisioning. This serialization facilitates the detailed tracing and reconstruction of data pathways during inspections. The DPLog becomes a reliable source for reverse-engineering the process and indicating which inputs are used in the ongoing data provisioning ( G 2 and G 3 ).
  • Protocol integration: DPTracer integrates logging steps into transmission steps in data provisioning. Nodes and the logger mutually verify each other during these steps, ensuring the entire data provision process is accurately recorded, even in an untrusted environment ( G 4 ).

3.5. DPN Model

Figure 2 illustrates the overview of the DPN model. A DPN operates on a round-based schedule wherein nodes collect information from multiple external data sources in each round and aggregate it into a consolidated result for a data requester. Rounds can be scheduled regularly or can be triggered by specific requests or conditions. For example, a round in a blockchain price oracle is initiated when price fluctuations exceed a predetermined threshold or if there has been no update within a specific period. During such a round, an average asset price is derived from data from various external asset price sources.

3.5.1. Data Collection and Aggregation

At the start of a new round, the collector node fetches information from external data sources and then sends the fetched data to the subsequent aggregator nodes. The aggregator node performs in-network aggregation by receiving data from collector or other aggregator nodes, filtering out outlier values, and synthesizing them into a consolidated output. The submitter is the final aggregator node in data provisioning. It passes the finalized data to the data requester, which is part of the closed system. Once a result is submitted to the data requester, it is considered immutable and cannot be compromised. Depending on DPN applications, the submitter and the data requester may be combined. In such cases, aggregation of the final result occurs within the closed system.

3.5.2. Logging and Verification

The logger plays a crucial role by recording outputs from collector and aggregator nodes to build a DPLog. Whenever a new record is added, it returns a commitment and a proof with which the requesting node can check the consistency of the log and the inclusion of its input. The node also presents the commitment to the next node as proof of engagement in the DPLog. The submitter, in turn, uses the complete log to check the comprehensiveness of the input.

3.5.3. Dispute Process

A DPN initiates the dispute process as a self-correcting mechanism when erroneous data are detected in earlier rounds. During the dispute process, the logger interacts with the inspector. The inspector participates in the inspection process to identify errors by tracing back through data provisioning using the DPLog. This helps to pinpoint the source of any discrepancies or misbehavior.

3.5.4. Membership Management

The memberships of DPN nodes are managed on a round-based system. Different DPNs have varying rules for membership management. For example, in price oracle networks [9,10,14], nodes with sufficient security deposits can request to join the network, while node removal can be voluntary or can be enforced by the network as a result of misbehavior. However, to ensure data sufficiency and maintain consensus, node membership cannot be changed in the middle of a round. Therefore, any requests to join or leave must be submitted at the latest before the start of the target round. Additionally, the logger is assumed to always be included in the DPN for the continuity of the DPLog.

3.5.5. Hierarchical Data Delivery

We define a hierarchical structure called the ‘Report Tree’, through which nodes deliver data during data provisioning. As depicted in Figure 2, the submitter sits at the root of the Report Tree, with collector nodes positioned as the leaves and aggregator nodes located in between and aggregating multiple inputs into a single output. The construction of the Report Tree can be automatically performed each round, as in WSN [21], based on the predefined membership for the round. The logger and the submitter are informed of the Report Tree’s structure at the beginning of each round. In smaller DPNs, the submitter may function as the sole aggregator node and directly aggregate results from all collector nodes to produce the final output. Each node sends its output (upstream report) to its parent node in the Report Tree. For instance, in round r, an upstream report produced by node nα is labeled as R α r . (For brevity, we omit the round number r in R for the current ongoing round.) When two collector nodes nα and nβ generate R α and R β , respectively, from external data sources, an aggregator node nγ receiving these will combine them into R γ : R γ = R α ⊓ R β , where ⊓ denotes the aggregation operator.
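As a concrete illustration (hypothetical values; averaging is only one possible choice of ⊓, matching the aggregation used in the evaluation of Section 8), the following Python snippet shows how upstream reports combine along the Report Tree:

from statistics import mean

def aggregate(*reports: float) -> float:
    # Hypothetical aggregation operator ⊓: a plain average of the inputs.
    return mean(reports)

R_alpha = 101.2                       # upstream report of collector n_alpha
R_beta = 99.4                         # upstream report of collector n_beta
R_gamma = aggregate(R_alpha, R_beta)  # aggregator n_gamma: R_gamma = R_alpha ⊓ R_beta
R_final = aggregate(R_gamma, aggregate(100.8, 100.1))  # the submitter aggregates the aggregators' outputs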

3.5.6. Notation

Table 1 details the notations used throughout the paper to describe various components and processes within DPTracer and the DPN.

4. DPTracer Logger

In this section, we describe Process Tree: the data structure used by the logger to maintain log records and generate proofs efficiently. For each round, a Process Tree is constructed incrementally as the logger receives log entries, each of which embeds the input–output relationship of a node in the data provision flow.

4.1. Process Tree Construction

Process Tree is a variant of History Tree [12] and is structured as a binary hash tree similar to a Merkle tree. Figure 3 depicts a Process Tree consisting of leaf elements labeled as hash values of the corresponding log entries X i , interior elements, and the root. The tree employs a cryptographic hash function, such as SHA-256, to ensure the integrity and uniqueness of each element’s hash value.

4.1.1. Adding Log Entries

Log addition begins when the logger receives an upstream report from a node. Upon receipt, the logger assigns a new version number i to these data and converts them into a log entry X i , which is then added to the tree.
The locations of leaf elements are grouped based on the roles of the nodes. The log entries from collector nodes are stored in Group 1. Similarly, the logger stores the log entries from aggregator nodes and the submitter in Groups 2 and 3, respectively. For example, as illustrated in Figure 3, log entries such as X 1 , X 2 , X 3 , and X 5 from collector nodes are placed in Group 1. Among the aggregator nodes, two entries, X 4 and X 6 , are located in Group 2. When a new log entry is entered into Process Tree, it is placed at the leftmost available slot within its group. Thus, log entries within a group are arranged in the order of increasing version number, though the numbers may not be sequential.
Specifically, when a new log entry is received, the logger performs the following steps:
  • Assign version number: increment the version counter in the logger to obtain the new version number i.
  • Create leaf element: make a leaf element for X i and calculate its hash value as h i = Hash ( X i ) , where the contents of X i are as described in Section 4.2.
  • Place in group: determine the appropriate group based on the node’s role and place the leaf element in the leftmost available slot within that group.
  • Update tree: recompute the hash values of the nodes along the path from the new leaf element up to the root, updating any affected interior elements.
For instance, Figure 4 illustrates an earlier version (version-4) of the Process Tree from Figure 3. In the version-3 tree, which precedes version-4, only three entries ( X 1 , X 2 , and X 3 ) are placed in Group 1 in order. When X 4 is added, assuming it comes from an aggregator node, it is placed at the leftmost slot of Group 2, skipping the free slot of Group 1.

4.1.2. Tree Update Mechanism

The interior elements are iteratively updated from the leaf elements by hashing the concatenated values of the two child elements up to the root. When the Process Tree is updated, only the nodes affected by the addition of a new log entry are recalculated. The update process proceeds as follows:
  • Starting from the new leaf element, traverse up to the root, updating the hash value of each parent element.
  • For each interior element, recalculate the hash value as h = Hash ( h left | h right ) , where h left and h right are the hash values of the left and right child elements, respectively.
  • Update the version number of the root to reflect the latest version.
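The sketch below, in Python, illustrates grouped leaf placement and the root-path update described above. It is a simplified illustration under assumed conventions (fixed leaf slots per group and a constant placeholder hash for untouched subtrees), not the authors' implementation; the class and method names are hypothetical.

import hashlib

EMPTY = hashlib.sha256(b"\x00").digest()    # fixed constant hash for empty marker (⌀) nodes

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class ProcessTree:
    """Illustrative only: a complete binary tree over a fixed number of leaf slots,
    partitioned into role groups."""

    def __init__(self, group_slots):
        self.group_slots = group_slots
        n_slots = max(r.stop for r in group_slots.values())
        size = 1
        while size < n_slots:                # round up to a power of two
            size *= 2
        self.size = size
        self.leaves = [None] * size          # leaf payloads (log entries)
        # hashes[level][index]; level 0 holds leaf hashes, the last level holds the root.
        self.hashes = [[EMPTY] * (size >> level) for level in range(size.bit_length())]
        self.version = -1

    def append(self, entry: bytes, group: int) -> int:
        # Place the entry at the leftmost free slot of its group, then update the root path.
        slot = next(i for i in self.group_slots[group] if self.leaves[i] is None)
        self.version += 1
        self.leaves[slot] = entry
        self.hashes[0][slot] = H(entry)
        index = slot
        for level in range(1, len(self.hashes)):
            index //= 2
            left = self.hashes[level - 1][2 * index]
            right = self.hashes[level - 1][2 * index + 1]
            self.hashes[level][index] = H(left + right)
        return self.version

    def commitment(self) -> bytes:
        # Root hash of the current version; a real logger also binds the version number to it.
        return self.hashes[-1][0]

# Layout mirroring Figure 3: Group 1 (collectors), Group 2 (aggregators), Group 3 (submitter).
tree = ProcessTree({1: range(0, 4), 2: range(4, 7), 3: range(7, 8)})
tree.append(b"report from collector n_alpha", group=1)
tree.append(b"report from collector n_beta", group=1)
tree.append(b"report from aggregator n_gamma", group=2)  # skips the free slots left in Group 1
c = tree.commitment()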

4.1.3. Handling Incomplete Branches

In cases where a node has only one child (due to an incomplete binary tree), the missing child is represented by an empty marker node, denoted as ⌀ and illustrated by the ⌀-node in Figure 3. The hash value of an empty marker node is defined as a fixed constant (e.g., the hash of a designated empty value) to ensure the consistency of the tree structure.

4.1.4. Commitment Generation

The root value of the Process Tree is used as the commitment that fixes the current status of the log entries. As each log entry is assigned to a unique version number, a commitment C i represents the fixed state of the version-i Process Tree and contains log entries from X 0 to X i . For instance, in Figure 3 and Figure 4, the root values represent the commitments for the version-7 Process Tree and version-4 Process Tree, i.e., C 7 and C 4 , respectively. For clarity, each version number is embedded in its corresponding commitment, facilitating accurate tracking of each node’s output in a round. This binding permits a commitment to indicate a node’s output in a round, which is used to reliably serialize data flow in the DPLog.

4.2. Data Flow Serialization

The Process Tree captures the data flow of the Report Tree to express a detailed history of data provisioning. Thus, each log entry includes a reliable reference to a node’s input and upstream report and records the node’s activity. A log entry X i from nα is stored as a tuple consisting of:
X v ( n α ) = ( R α , V α ) ,
where v ( · ) maps a node identifier to its version number in the current Process Tree. The input version set V α comprises the versions of nα’s inputs. For example, if nγ generates an aggregated result from the reports of nα and nβ, then v ( α ) < v ( γ ) , v ( β ) < v ( γ ) , and R γ = R α R β . Accordingly, the logger stores the following entry for nγ:
X v ( γ ) = ( R γ , { v ( α ) , v ( β ) } ) ,
where { v ( α ) , v ( β ) } corresponds to V γ .
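A log entry can thus be viewed as a small record pairing the report with the version numbers of its inputs. The Python sketch below uses a hypothetical serialization for hashing such an entry into the Process Tree; it is not DPTracer's on-wire format.

import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    report: bytes                     # R_alpha: the node's upstream report
    input_versions: tuple             # V_alpha: versions of the node's inputs (empty for collectors)

    def serialize(self) -> bytes:
        # Hypothetical deterministic encoding of (R, V).
        versions = b"".join(v.to_bytes(2, "big") for v in sorted(self.input_versions))
        return len(self.report).to_bytes(4, "big") + self.report + versions

    def leaf_hash(self) -> bytes:
        return hashlib.sha256(self.serialize()).digest()

# Aggregator n_gamma combining inputs logged at versions v(alpha) = 1 and v(beta) = 2:
entry = LogEntry(report=b"aggregated report R_gamma", input_versions=(1, 2))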

4.3. Interval Proof

The logger must demonstrate to the nodes that each log entry is correctly stored in the DPLog and that the updated DPLog is consistent with the process history. Consequently, it uses an interval proof, denoted as π, along with the commitment. An interval proof, designed to cover multiple target log entries, includes the minimally required elements to compute the commitment of the highest version among the targets. Therefore, it comprises the siblings on the unified paths from the target log entries to the root and the target log entries themselves. For example, π 3 , 5 , 6 is the interval proof for targets X 3 , X 5 , and X 6 and consists of the items highlighted in red in Figure 5.

4.3.1. Constructing an Interval Proof

The construction of an interval proof begins by capturing the log entries for the targets, e.g., X 3 , X 5 , and X 6 for π 3 , 5 , 6 . It sets the highest version among the targets as the anchor version of the interval proof. In Figure 5, the anchor version of π 3 , 5 , 6 is 6. The interior elements e 1 and e 2 are siblings along the path from the targets X 3 , X 5 , and X 6 to the root. The commitment C 6 can be computed from the left branch consisting of X 3 , X 5 , and e 1 and the right branch consisting of X 6 and e 2 .
Given a set of target log entries with version numbers v 1 , v 2 , , v k , the logger constructs an interval proof as follows:
  • Identify the anchor version: determine the highest version among the targets and denote it as v max = max ( v 1 , v 2 , , v k ) .
  • Collect target nodes: obtain the leaf elements corresponding to the target log entries.
  • Find common ancestors: for each target element, trace the path to the root, checking the sibling nodes along the way, and find the lowest common ancestor.
  • Build minimal proof set: combine the paths from all target elements to the root, eliminating duplicate elements. The interval proof includes:
    • The log entries of the target leaf elements;
    • The hash values of the sibling elements along the combined paths.
  • Assemble proof: assemble the collected log entries and hash values into the interval proof structure along with the anchor version v max .

4.3.2. Verifying an Interval Proof

From an interval proof, a node can confirm that the logger holds identical data to the target log entries and maintains consistent log records between versions. The node verifies an interval proof by reconstructing the root hash and comparing it with the received commitment. The verification process is as follows:
  • Initialize leaf hashes: use the received log entry data to compute the hash values of the target leaf elements.
  • Reconstruct tree paths: using the hash values from the interval proof, rebuild the hash values of the interior elements along the paths to the root.
  • Compute root hash: combine the reconstructed paths to compute the root hash corresponding to the anchor version v max .
  • Verify commitment: check if the computed root hash matches the received commitment C v max . If they match, the proof is valid.
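To ground these steps, the following self-contained Python sketch builds a small eight-slot tree, constructs an interval proof for three target entries (their log entries plus the sibling hashes off their paths to the root), and verifies it by recomputing the root. The slot layout, names, and proof encoding are illustrative assumptions, not the paper's implementation.

import hashlib

EMPTY = hashlib.sha256(b"\x00").digest()    # placeholder hash for unused leaf slots

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def root_from_leaves(leaf_hashes: list) -> bytes:
    # Reference commitment over all leaf slots (used here only to set up the demo).
    level = leaf_hashes
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def verify_interval_proof(targets, siblings, n_levels, commitment) -> bool:
    # targets  : leaf slot -> log entry bytes covered by the proof
    # siblings : (level, index) -> hash supplied by the logger for off-path subtrees
    # n_levels : number of tree levels above the leaves (3 for 8 leaf slots)
    # Leaf-level hashes are recomputed from the target entries themselves.
    known = {(0, slot): H(entry) for slot, entry in targets.items()}
    for level in range(n_levels):
        parents = sorted({idx // 2 for (lvl, idx) in known if lvl == level})
        for p in parents:
            left = known.get((level, 2 * p), siblings.get((level, 2 * p)))
            right = known.get((level, 2 * p + 1), siblings.get((level, 2 * p + 1)))
            if left is None or right is None:
                return False                 # the proof is missing a required sibling hash
            known[(level + 1, p)] = H(left + right)
    return known[(n_levels, 0)] == commitment

# Demo: 8 leaf slots; Group 1 holds X1, X2, X3, X5; Group 2 holds X4, X6; Group 3 is empty.
leaves = [b"X1", b"X2", b"X3", b"X5", b"X4", b"X6", None, None]
leaf_hashes = [H(x) if x is not None else EMPTY for x in leaves]
commitment = root_from_leaves(leaf_hashes)

# Interval proof for targets X3, X5, X6: the entries plus the sibling hashes off their paths.
targets = {2: b"X3", 3: b"X5", 5: b"X6"}
siblings = {
    (0, 4): leaf_hashes[4],                      # leaf sibling (X4) of target X6
    (1, 0): H(leaf_hashes[0] + leaf_hashes[1]),  # interior sibling covering X1 and X2
    (1, 3): H(leaf_hashes[6] + leaf_hashes[7]),  # interior sibling covering the empty slots
}
assert verify_interval_proof(targets, siblings, 3, commitment)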

4.3.3. Selective Commitment Verification

The commitment of any target log entries can be recomputed using an interval proof. A node selects a version among the targets and replaces the branches leading to the newer-version log entries with ⌀-nodes. Thus, the node can compare the recomputed and received commitments to ensure that both are derived from identical records. To verify the commitment for a specific version v t among the targets, the node can adjust the interval proof as follows:
  • Prune newer branches: replace any branches leading to versions higher than v t with the empty marker hash h (for ⌀-nodes).
  • Recompute root hash: recalculate the root hash using the pruned tree.
  • Compare commitments: verify that the recomputed root hash matches the known commitment C v t .

4.4. Group Collapsing

Flexible storage management is crucial for logs that must be maintained over the long term. There is a trade-off between storage efficiency and the level of detail in logs: more detailed logs enable more accurate inspections but require more storage space.
The log entries of the Process Tree are organized into groups, and they are adjacently positioned in the tree. In this arrangement, the Process Tree can efficiently generate a group digest from the collocated leaf elements of specific group members. By grouping related entries, the system can selectively prune log entries that are not targeted for further inspection, enhancing storage efficiency. The grouping can provide a focused snapshot of relevant data while enabling consistency checks with the previous commitments. For example, as depicted in Figure 3, log entries are grouped by role; Group 1 consists of collector nodes, Group 2 of aggregator nodes, and Group 3 of the submitter. If an inspection of logs past a certain age focuses only on collector nodes (since they represent raw external data inputs), DPTracer might retain only the Group 1 log entries for old data. DPTracer can then collapse the subtree for Groups 2 and 3 into the minimal interior elements necessary for computing the commitment, such as the interior element labeled g 2 , 3 in Figure 3.
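A minimal sketch of this trade-off (hypothetical layout, with Group 1 in the left subtree and Groups 2 and 3 in the right subtree) shows that keeping only the interior digest of the collapsed groups still reproduces the same commitment:

import hashlib

EMPTY = hashlib.sha256(b"\x00").digest()

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def subtree_root(leaf_hashes: list) -> bytes:
    level = leaf_hashes
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Full round: Group 1 (collectors) in slots 0-3, Groups 2 and 3 (aggregators, submitter) in slots 4-7.
group1 = [H(b"X1"), H(b"X2"), H(b"X3"), H(b"X5")]
group23 = [H(b"X4"), H(b"X6"), H(b"X7"), EMPTY]
full_commitment = H(subtree_root(group1) + subtree_root(group23))

# Archived form: keep Group 1's entries, collapse Groups 2 and 3 into the single digest g_{2,3}.
g_2_3 = subtree_root(group23)            # retained interior hash; the collapsed leaf entries are discarded
archived_commitment = H(subtree_root(group1) + g_2_3)
assert archived_commitment == full_commitment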

5. DPTracer Protocol

Figure 6 illustrates the operations within DPTracer. The logger interacts with nodes and the inspector during two processes: data provisioning and the dispute process. During data provisioning in the DPN, nodes collaboratively deliver results. However, if incorrect data are detected and a dispute process is initiated, the inspector identifies erroneous nodes based on the DPLog.

5.1. Nodes

During data provisioning, a node gathers data from external data sources or other nodes and forwards an aggregated result to the next node. Before delivering the result, the node submits a request to the logger to record its output, along with information about its inputs, into the DPLog. Figure 6a illustrates these collaborative steps followed by the node and the logger, and Algorithm 1 details each operation in pseudo-code.
Algorithm 1 Data provisioning operations
1: procedure Node.LogRequest( R , C )
2:     input  R : List of input reports
3:     input  C : List of input commitments (or source IDs)
4:
5:     o ← Aggregate ( R )
6:     SendToLogger ( o , C )
7: end procedure
8:
9: procedure Logger.LogUpdate( R′ , C )
10:     input  R′ : The report to be logged
11:     input  C : List of received commitments (or source IDs)
12:
13:     if  IsAggregator ( C )  then
14:         V ← ExtractVersionSet ( C )
15:         for each ( C′ , q ) in ( C , V )  do
16:             C″ ← RetrieveCommitment ( q )
17:             if  C′ ≠ C″  then
18:                 return ⊥ ▹ Abort on verification fail
19:             end if
20:         end for
21:     else
22:         V ← ExtractSourceIDs ( C )
23:     end if
24:     v ← AssignNewVersionNumber ( )
25:     X v ← ( R′ , V )
26:     AddToProcTree ( X v )
27:     C v ← CalculateCommitment ( v )
28:     π ← GenerateIntervalProof ( V , v )
29:     ▹ π ← GenerateIntervalProof ( v ) for collector nodes
30:     if  IsTheSubmitter ( R′ )  then
31:         L ← TraceInputs ( C )
32:         SendToNode ( C v , π , L )
33:     else
34:         SendToNode ( C v , π )
35:     end if
36: end procedure
37:
38: procedure Node.ReportSend( R , C , R′ , C′ , π )
39:     input  R : List of input reports
40:     input  C : List of input commitments
41:     input  R′ : Current node's report
42:     input  C′ : Commitment received for the report
43:     input  π : Received interval proof
44:
45:     if  IsThisNodeAggregator ( C )  then
46:         for each ( w , q ) in ( R ∪ { R′ } , C ∪ { C′ } )  do
47:             if not IsFound ( w , π )
48:                        or  ComputeRoot ( π , ExtractVersion ( q ) ) ≠ q  then
49:                 return ⊥
50:             end if
51:         end for
52:         if not IsFound ( ExtractVersionSet ( C ) , π )  then
53:             return ⊥
54:         end if
55:     else if  IsThisNodeCollector ( C )  then
56:         if not IsFound ( R′ , π )
57:               or  ComputeRoot ( π , ExtractVersion ( C′ ) ) ≠ C′  then
58:             return ⊥
59:         end if
60:     end if
61:     SendToNextNode ( R′ , C′ )
62: end procedure
  • (Node) LogRequest ( R , C ) → ( R′ , C ) : A node receives multiple pairs consisting of a report and its commitment from input nodes. These reports and commitments are collected in the lists R and C , respectively. The commitment list is empty for the collector nodes, as these nodes directly fetch data from external sources. After gathering the reports, the node generates its output report R′ based on the input reports. This output, along with C , is then sent to the logger as a log request. For collector nodes, identifiers from external data sources are submitted instead of the commitments. For example, in Figure 6a, step (1), node nk receives two inputs: ( R i , C v ( i ) ) and ( R j , C v ( j ) ) . Thus, its report R k is computed as R k = R i ⊓ R j , where ⊓ is the aggregation operator, and C = { C v ( i ) , C v ( j ) } .
  • (Logger) LogUpdate ( R′ , C ) → ( C v , π ) | ⊥ : When a log request from an aggregator node is received, the logger verifies the commitments in C to ensure that each commitment corresponds accurately to the Process Tree root of the specified version. The versions of the inputs, which are encoded in the commitments, are collected into a set V . If any commitment mismatches, the logger halts the process, returns ⊥ to the node, and notifies it of the inconsistency. If all commitments are verified, the logger assigns a new version number v to R′ , composes X v as ( R′ , V ) , and adds X v to Process Tree at the corresponding group. Subsequently, a new commitment C v is calculated. Using the updated Process Tree, the logger generates an interval proof π that spans the input versions in V and includes the new version v. Finally, it sends C v and π back to the node. For instance, in Figure 6a, step (2), after verifying the commitments in C , the logger assigns a new version for R k , denoted as v ( k ) . It then adds R k along with its input versions { v ( i ) , v ( j ) } as a new leaf element, computes C v ( k ) , and generates π v ( i ) , v ( j ) , v ( k ) , confirming that R i , R j , and R k are accurately recorded in the DPLog. For collector nodes, the logger uses the received source IDs in place of V in X v without verifying input commitments and generates π v for the current output only. The rest of the process remains the same as that for the aggregator.
  • (Node) ReportSend ( R , C , R′ , C′ , π ) → ( R′ , C′ ) | ⊥ : The node receives the new commitment for its report from the logger and verifies two conditions. First, the node checks whether all the input reports in R and its own report R′ correspond with the entries in π , which should contain all target reports from Process Tree. It also confirms that its input version set is correctly contained in π . Second, the node verifies that all input commitments in C and the received commitment C′ match the corresponding commitments recalculated from π . Notably, collector nodes are required to perform these verifications only for their own R′ and C′ . If all conditions are satisfied, this confirms that all input nodes have faithfully delivered their reports without any modification and that R′ is correctly logged in the DPLog. Consequently, the node forwards the report and commitment ( R′ , C′ ) to the parent node in the Report Tree. If any condition is not met, the process is halted, and no data are forwarded. In Figure 6a, step (3), nk receives C v ( k ) and π v ( i ) , v ( j ) , v ( k ) and then verifies that R i , R j , R k , and V (extracted from C ) are included in π v ( i ) , v ( j ) , v ( k ) . It also confirms that C v ( i ) , C v ( j ) , and C v ( k ) are accurately recalculated from it. If all conditions hold true, nk ensures the integrity of the inputs and the DPLog and sends R k and C v ( k ) to the next node.

5.2. Submitter

The submitter is a designated aggregator node and submits the final aggregated result to the data requester. Before submission, the submitter assesses the sufficiency of the input and aggregation for that round to ensure the results have been thoroughly processed by the nodes. Figure 6b demonstrates the operations of the submitter. The detailed operations are as follows:
  • (Submitter) LogRequest ( R , C ) → ( R′ , C ) : This operation is the same as that of other nodes as detailed in Section 5.1. For example, as illustrated in Figure 6b, step (1), the submitter nN, which is the final aggregator node, receives input reports and commitments R N − 1 and C v ( N − 1 ) , respectively. It then compiles its report R N , includes C v ( N − 1 ) in C , and sends them to the logger.
  • (Logger) LogUpdate ( R′ , C ) → ( C v , π , L ) | ⊥ : This operation is similar to LogUpdate ( ) performed by nodes, but it additionally includes an engagement list L . When the logger recognizes LogRequest ( ) from the submitter, it recursively traces the input versions V from the submitter’s log entry in Process Tree down to the child nodes’ V . Accordingly, the logger can gather all versions and identifiers of external data sources into L and can capture only the versions and external source identifiers that influence the final results.
  • (Submitter) Submit ( R , C , R′ , C′ , π , L ) → ( R′ , V , C′ ) | ⊥ : As the final step in data provisioning, the submitter verifies the integrity of its direct inputs and the logger’s records using R , R′ , C′ , and π . This step mirrors the verification described in ReportSend ( R , C , R′ , C′ , π ) and is detailed in Algorithm 2. Once the integrity is confirmed, the submitter reviews the engagement level of nodes and external data sources with L . If this engagement level is below a predefined threshold T , which varies depending on the DPN application, the submitter may reject the final result. A low engagement level indicates that the results may not sufficiently reflect a diverse range of data sources or may lack adequate consensus among nodes, possibly caused by intentional delivery omission. In such cases, the submitter sends ⊥ to the data requester to notify it that the final result is rejected. If all conditions are satisfied, the submitter transmits the final result R N , including its input version set V N and commitment C N , to the data requester, as depicted in Figure 6b.
Algorithm 2 Submission
1: procedure Submitter.Submit( R , C , R′ , C′ , π , L )
2:     input  R : List of input reports
3:     input  C : List of input commitments
4:     input  R′ : Submitter's own report
5:     input  C′ : Commitment received along with the report
6:     input  π : Interval proof for verification
7:     input  L : List of node IDs
8:
9:     for each ( w , q ) in ( R ∪ { R′ } , C ∪ { C′ } )  do
10:         if not IsFound ( w , π )
11:               or  ComputeRoot ( π , ExtractVersion ( q ) ) ≠ q  then
12:             return ⊥
13:         end if
14:     end for
15:     if not IsFound ( ExtractVersionSet ( C ) , π )  then
16:         return ⊥
17:     end if
18:
19:     v ← ExtractVersion ( C′ )
20:     if  ComputeEngagementLevel ( L , v ) < T  then
21:         return ⊥
22:     end if
23:
24:     SendFinalResult ( R′ , ExtractVersionSet ( C ) , C′ )
25: end procedure

5.3. Inspector

The dispute process on a DPN is triggered when an erroneous result is detected among the provided results. This process is application-specific and there are varying criteria for determining node misbehavior based on the context.
Despite these variations, a comprehensive historical record of previous data provision processes is universally required across all applications to identify the origins of incorrect data. As illustrated in Figure 6c and detailed in Algorithm 3, the inspector interacts with the logger to manage this process through the following operations:
Algorithm 3 Inspection
1: procedure Inspector.Query( r , s )
2:     input r: Target round number
3:     input s: Version number of the target log entry
4:
5:     SendQueryRequestToLogger ( r , s )
6: end procedure
7:
8: procedure Logger.Response( r , s )
9:     input r: Round number
10:     input s: Version number of the log entry
11:
12:     X s r ← SearchInProcTree ( r , s )
13:     if  X s r ≠ ⌀  then
14:         v N ← GetLastEntryVersion ( r )
15:         π ← GenerateIntervalProof ( s , v N )
16:         ( R , V ) ← X s r
17:         SendToNode ( R , V , π )
18:     else
19:         SendToNode ( ⌀ ) ▹ Log entry not found
20:     end if
21: end procedure
22:
23: procedure Inspector.Detect( R , V , π )
24:     input  R : Report of the target node
25:     input  V : Input list of the target node
26:     input  π : Interval proof for this entry
27:
28:     ( R N r , V N r , C v ( N ) r ) ← RetrieveStoredData ( r )
29:     C′ ← ComputeRoot ( π , N )
30:     if  CheckIncl ( R , V , R N r , V N r , π )  and  C′ = C v ( N ) r  then
31:         if IdentifyError( R , V , history) then
32:             TriggerNextDisputeStep()
33:             return ⊤
34:         else ▹ Further tracking required
35:             V′ ← V′ ∪ V ▹ V′ : Worklist of versions
36:             s ← SelectNext ( V′ )
37:             return  Query ( r , s )
38:         end if
39:     else
40:         return ⊥ ▹ Verification failed or end of tracking
41:     end if
42: end procedure
  • (Inspector) Query ( r , s ) → ( r , s ) : The inspector requests a log entry of version s from round r. When a dispute is initiated, the inspector traces the data flow in reverse. Starting with the final version of the submitter for round r, denoted as v ( N ) (from C v ( N ) r ), this process is iteratively conducted based on the findings from the Detect ( ) operation.
  • (Logger) Response ( r , s ) → ( R , V , π ) | ⌀ : Upon receiving the request, the logger locates the log entry for the given version s within Process Tree for round r. If the log entry is found, the logger sends the R and V from the stored log entry X s r , along with its interval proof π , generated as π s , v ( N ) r . This proof verifies that the log entry has not been tampered with, based on the final commitment for round r, C v ( N ) r . If the log entry is not found in Process Tree, the logger returns ⌀.
  • (Inspector) Detect ( R , V , π ) → ⊤ | ( r , s ) | ⊥ : The inspector first verifies the integrity of the received R and V . The inspector retrieves the stored submission for round r, which includes R N r , V N r , and C v ( N ) r , and checks the received π s , v ( N ) r . The inspector confirms the inclusion of R , V , R N r , and V N r in π s , v ( N ) r and verifies that its computed commitment matches C v ( N ) r . If the log entry is validated, the inspector assesses node errors or misbehavior based on all retrieved log entries. The specific criteria for this assessment can vary depending on the application. If the inspector identifies an erroneous node, it returns ⊤ and triggers further dispute steps, such as voting to determine the penalties, isolating the node, or slashing its deposit. If further tracking is required, the inspector selects the next target version s from the inputs of this version, V , and executes Query ( r , s ) again. If the integrity of the log entry cannot be verified based on the proof and commitment or if there is no further node to traverse, the inspector returns ⊥. This result might involve alternative dispute resolution methods, such as additional manual inspections.

6. Security Analysis

As we described in the threat model of Section 3.2, the goal of attackers in DPNs is to deliver incorrect data to the data requester and to manipulate the log records undetectably. The attacks can be categorized into the internal threats from compromised nodes within the DPNs and the post-process tampering of log records. DPTracer protects the data provisioning process against these attacks by combining data flow tracking and accountable aggregation.

6.1. Internal Threats

We address internal threats by compromised nodes within the DPN, which include data modification, log modification, data fabrication, and delivery omission. DPTracer mitigates these threats through mutual auditing using audit windows and rigorous verification processes to ensure the integrity and reliability of the data provisioning process.

6.1.1. Data Modification

In DPTracer, every report is verified twice during data provisioning. Initially, when a node receives the interval proof from the logger, it verifies whether its newly generated report has been correctly stored. This report is rechecked by the subsequent node, which uses its interval proof and received commitments (as performed by ReportSend ( ) in Section 5.1). This auditing period (audit window) enables each node to ensure that the input and output reports are consistent with the corresponding log entries.
If attackers compromise a node to modify its output, the node might send a correct output to the logger for logging while sending an incorrect output to the subsequent node. In such cases, the subsequent node will detect discrepancies within the audit window, as the interval proof will not match the received input report, given that the interval proof is derived from the correct output.
Furthermore, the interval proof is tailored for the audit window, ensuring the log consistency from the input to the newly added output. These audit windows are linked sequentially from one node to the next node until the submitter is reached. Therefore, they cover the entire data provisioning process.

6.1.2. Log Modification

DPTracer provides tamper-evident logging, ensuring forward integrity [22], where loggers are presumed honest but may be subject to Byzantine failures by node compromises. While traditional forward integrity prevents tampering only before a logger is compromised, DPTracer extends this protection by maintaining tamper evidence throughout the data provisioning process via mutual auditing.
If attackers compromise the logger to modify the log entries during the process, this manipulation is detected within the audit window. When a node validates the interval proof, it recalculates the commitments for all the inputs and outputs from the received interval proof and compares them to the received commitments. Any discrepancies in this comparison reveal modifications to the log entries.

6.1.3. Data Fabrication

Attackers might attempt to mislead the data requester by injecting arbitrary outputs from compromised nodes without legitimate inputs. However, in DPTracer, nodes are required to provide commitments for their outputs, which serve as evidence that the outputs are correctly recorded in the DPLog. The commitments are checked when the subsequent node receives the output from the previous node, and the logger validates them for consistency with the DPLog. If the commitments do not match the DPLog, the fabricated output is rejected by the logger and the subsequent node.

6.1.4. Delivery Omission

The DPLog provides a global view of the ongoing process, helping to verify the reliability and refinement of the final result. Thus, we can detect omissions, such as when a compromised node intentionally fails to deliver inputs or outputs. Process Tree serializes the data flow for each round to the log entries; each log entry includes a V that specifies its inputs, which helps trace the data flow.
If a compromised node abandons the delivery of its output to the next node, this omission will be apparent as the absence of the node in the reconstructed data flows. Similarly, if a node omits an input when generating its report, this omission will also be noticeable as a missing subtree. Such discrepancies, highlighted by the engagement list L , help to detect these omissions during data provisioning. The list L quantifies the number of engaged external sources and involved nodes. When the latest version number, which indicates the count of recorded outputs, is compared with the number of involved nodes from L , any discrepancy indicates omitted node outputs.

6.2. Post-Process Tampering

In DPTracer, an attacker cannot undetectably tamper with the log records of previous rounds. When the submitter sends the final result to the data requester, the final result encapsulates the last log entry and its commitment. Furthermore, the submitted data are assumed to be immutable in DPNs. For example, if the data requester operates within a blockchain system, the data become unchangeable because of the blockchain’s inherent characteristics. Therefore, the Process Tree is finalized at the time of the submission. If attackers modify any log entries from earlier rounds, the root hash value will differ from the submitted commitment for that round. Other tampering attempts on the log entries, such as insertion, deletion, and re-ordering, will also result in noticeable discrepancies in the root hash value. The inspector must detect these inconsistencies. As described in Section 5.3, whenever the inspector retrieves data from the logger, it validates the interval proof against the stored commitment for that round. If the log entries have been altered, the original commitment cannot be recomputed from the interval proof because the attacker cannot find the hash collision in Process Tree.

7. Comparative Analysis

In this section, we compare DPTracer with existing tamper-evident logging systems: History Tree [12], Concurrent Authenticated Tree (CAT) [23], and Accountability-oriented Tree (AoT) [24]. History Tree introduced a versioned Merkle tree to support tamper-evident logging, which Process Tree extends with input-referencing and leaf-grouping. CAT advances the Merkle-tree-based structure by adding chameleon hashing to enable concurrent log additions for large-scale distributed systems. AoT, also based on the Merkle tree, focuses on node-level accountability for IoT devices.
In contrast, DPTracer provides tamper-evident logging specifically for DPNs by enforcing mutual verification based on log records and integrating these verification steps directly into the protocol. Table 2 compares the main features of DPTracer with those of the existing systems. Each scheme is designed for different target applications and operates under different design principles. For example, History Tree prioritizes efficient log verification, CAT is optimized for high concurrency, and AoT focuses on node-level accountability. In comparison, DPTracer targets DPNs, focusing on data-flow transparency for reconstruction and mutual auditing.
The following unique features of DPTracer are highlighted in contrast to existing systems:
  • Data flow tracking and reconstruction: DPTracer provides an end-to-end view of data flows with DPLog based on Process Tree. This allows for the reconstruction of data flows and the identification of misbehaving nodes, which is crucial for data provisioning networks.
  • Mutual auditing: DPTracer integrates mutual audits with interval proofs and commitments for tamper detection during provisioning, while others support one-way audits on log integrity. AoT also integrates periodic audits of protocols, but at coarser intervals than DPTracer.
However, DPTracer has limitations on scalability and overhead due to its focus on DPN features:
  • Scalability: While DPTracer can scale to large DPNs, it incurs higher overhead due to data-flow tracking and mutual auditing mechanisms. Each scheme is optimized for a specific application; for example, CAT is optimized for high-concurrency systems, while AoT is optimized for power-constrained IoT devices.
  • Storage requirements: DPTracer requires more storage to serialize the Process Tree in the DPLog. However, group collapsing (as described in Section 4.4) can improve storage efficiency depending on the focus of inspection. AoT is the most storage-efficient among the existing schemes; it can limit the storage overhead to a constant size of several KBytes per hour per device.
DPTracer is a tamper-evident logging system tailored for DPNs that provides data-flow tracking, reconstruction, and mutual auditing. While it imposes some scalability and storage overhead, these are reasonable trade-offs for the enhanced verifiability in DPNs.

8. Evaluation

In this section, we evaluate DPTracer in terms of security, performance, and scalability. We assess its overhead in computation, memory, storage, and communication. The evaluation covers various DPN configurations, including small-scale and large-scale deployments, to analyze the impact of network complexity on DPTracer’s performance. Additionally, we examine the system’s scalability and explore the trade-offs between security and overhead.

8.1. Computation Overhead

We implemented a prototype of DPTracer in Python and ran experiments on an Apple Silicon 10-core M1 CPU with 32 GB RAM. Each collector node in our simulation receives a 128-byte report from external data sources in a single round ( | R | = 128 ). The aggregator nodes calculate an average of their inputs as the aggregation operation. In the logger, the hash operations for Process Tree use the SHA-256 algorithm with commitments sized at 34 Bytes (32 Bytes for SHA-256 plus 2 Bytes for the prepended version number), i.e., | C | = 34 .
Figure 7 illustrates the execution times required for computing commitments and generating interval proofs when the number of nodes varies from 10 to 990. Both tasks are the primary operations of the logger and occur during LogUpdate(). As depicted in Figure 7, the time to compute a commitment increases from 0.067 ms to around 0.6 ms as the number of nodes grows from 100 to 990, reflecting the expansion of the Process Tree. The time for proof generation grows logarithmically and peaks at 6.03 μs at a node count of 650. Commitment generation thus takes roughly two orders of magnitude longer than proof generation.
Given this computational overhead, DPTracer is practical for real-world applications. For context, the average round times of price oracles such as Chainlink [9] and Band Protocol [10] are at least 5 and 6 s, respectively. Even in the high-frequency-update scenario of Pyth Network [14], where a round can be as quick as 400 ms [25], DPTracer remains well within the performance requirements; in this case, the computation time for commitment generation is under 0.067 ms.
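As a quick sanity check on the round-time budget, the following back-of-the-envelope calculation (using the numbers reported above) shows how small the commitment cost is relative to a 400 ms Pyth round.

```python
# Fraction of a 400 ms round spent on commitment generation (values from Section 8.1).
commitment_ms = 0.067
round_ms = 400
print(f"{commitment_ms / round_ms:.5%}")  # roughly 0.017% of the round time
```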

8.2. Memory Overhead

In conjunction with computation overhead, we measure the memory usage associated with DPTracer’s logging operations. Figure 8 illustrates the memory footprints for storing both the Process Tree and interval proofs.
The memory usage for a Process Tree increases linearly with the number of nodes: as the Process Tree grows from 100 to 990 nodes, memory consumption rises from 47 KBytes to 503 KBytes. Similar to the computation overhead, the memory used for generating proofs is primarily determined by the height of the Process Tree. As depicted in Figure 8, memory usage for proof generation increases logarithmically, ranging from 7.2 KBytes to 9.6 KBytes as the number of nodes grows from 100 to 990.

8.3. Storage Overhead

After a round ends, the Process Tree is archived in the secondary storage of the logger. The primary storage usage in the logger consists of log entries arranged as leaf elements of the Process Tree. Storing the interior elements is optional because they can be recomputed from the leaf elements if needed. This presents a trade-off between computation time and storage efficiency. Retaining the interior elements can speed up the dispute process by reducing the need for immediate calculations. In contrast, not storing these elements saves space but requires more computing power during audits. Additionally, the logger maintains a version map that links each version number to the position of the corresponding leaf element to help interpret V in the log entries.
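The trade-off can be illustrated with a short sketch that archives only the leaf digests and rebuilds the interior elements on demand during an audit; the simple pairwise layout below is an assumption made for illustration and does not reproduce the exact serialization of the Process Tree.

```python
import hashlib

def recompute_root(leaf_digests: list[bytes]) -> bytes:
    """Rebuild the interior elements bottom-up from archived leaf entries.
    Only the leaves are stored; interior nodes are recomputed when needed,
    trading audit-time computation for lower storage. A simplified pairwise
    layout is assumed here."""
    level = list(leaf_digests)
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last digest on odd-sized levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Example: four archived log entries, root digest recomputed during a dispute.
leaves = [hashlib.sha256(bytes([i]) * 128).digest() for i in range(4)]
root = recompute_root(leaves)
```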
Figure 9 illustrates the storage usage of the logger when the number of collector nodes is 100, assuming only the log entries are stored. We analyze the storage overhead for different aggregation factors, denoted as d and representing the average number of children per node. With the same number of collector nodes, a higher d results in fewer aggregator nodes and consequently a lower Report Tree height.
When the size of a version number, |R|, and |C| are 2, 64, and 34 Bytes, respectively, the total storage occupied by the log entries is 17.2 KBytes when d = 6 and increases to 20 KBytes when d = 15. Furthermore, we compare the storage overhead when group collapsing is applied to the aggregator nodes. In such cases, log entries for the collapsed group are removed from storage, but the interior elements required to compute the same commitment must be retained so that the remaining parts can still be validated. With the leaf-grouping feature, the removed parts are replaced with the minimum necessary interior elements. Consequently, as depicted in Figure 9, storage usage decreases by 46.1%, 47.3%, 48.6%, and 48.8% when d is 6, 9, 12, and 15, respectively.

8.4. Communication Overhead

We evaluate the network-wide communication overhead by comparing DPTracer with a reference model through simulations. Because, to the best of our knowledge, this is the first proposal of a log-integrated DPN, we developed a baseline model, referred to as the Output Accumulation Model (OAM), which achieves the same goal of maintaining a complete record of node outputs at the end of a round.
In OAM, we assume that nodes are organized as a Report Tree and deliver their outputs along this tree. However, no logger exists in the DPN. Instead, the submitter collects and retains all node outputs at the end of a round. Therefore, each node must relay its output together with its inputs, so data are accumulated from one node to the next. At the end of the round, when all the outputs have reached the submitter, it broadcasts the complete output record back to the nodes, allowing each node to verify the accuracy of its entry in the total record. This broadcast is performed as a context-aware delivery in the reverse direction of the Report Tree: each node extracts the parts relevant to its descendants from the broadcast message and forwards them to its children, similar to the secure aggregation model in Wireless Sensor Networks (WSNs) [21]. If a node confirms that the received part matches its output for that round, it returns a positive acknowledgment to the submitter. Once all acknowledgments are received, the submitter can confirm its comprehensive record for the round and then submit the final result. OAM achieves post-process tamper-evident logging by storing a cryptographic digest, such as a hash chain, over the confirmed records. However, it lacks support for in-process tamper-evidence or efficient data-flow reconstruction.
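To make the baseline concrete, the sketch below estimates OAM's network-wide message volume for a complete d-ary Report Tree, following the cost model described above (accumulated relays upward, a context-aware broadcast downward, and fixed-size acknowledgments); per-message framing and the irregular shape of real trees are ignored, so the figures are rough estimates rather than the exact simulation.

```python
def oam_message_bytes(num_collectors: int, d: int,
                      report_size: int = 128, ack_size: int = 32) -> int:
    """Rough estimate of OAM's total message volume for a complete d-ary
    Report Tree. Each node relays its own output plus everything accumulated
    from its subtree; the submitter then broadcasts the record back down,
    and every node returns a fixed-size acknowledgment. Framing overhead is
    ignored (an assumption of this sketch)."""
    # Level widths of the Report Tree above the external sources.
    levels = [num_collectors]
    while levels[-1] > 1:
        levels.append(-(-levels[-1] // d))     # ceiling division

    total = 0
    reports_per_node = 1                       # reports carried by a node at the current level
    for width in levels[:-1]:                  # every level except the submitter
        total += width * reports_per_node * report_size   # upward accumulated relay
        total += width * reports_per_node * report_size   # downward confirmation broadcast
        total += width * ack_size                          # acknowledgment to the submitter
        reports_per_node = reports_per_node * d + 1        # d children plus the node's own output

    return total

# Example: 300 collector nodes with aggregation factor d = 6.
print(oam_message_bytes(300, 6))
```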
Figure 10 illustrates the communication overhead of DPTracer and OAM. In the simulation, we explore various DPN structures by adjusting the aggregation factor d and measure the total message sizes for both models with varying numbers of collector nodes. In DPTracer, the communication overhead grows with d. When the number of collector nodes is 300, the total message size of DPTracer increases by 6.8% when d doubles from d = 6 to d = 12. In contrast, the communication overhead of OAM decreases as d increases. This reduction is caused by the Report Tree having fewer levels, which lowers the burden of delivering accumulated reports and the final broadcast in OAM. For example, when d = 6, OAM incurs 68.7% more communication overhead than DPTracer. The gap narrows to 26% when d = 9 and to 17.7% when d = 12. In all scenarios, DPTracer exhibits lower communication overhead than OAM because it combines the delivery and verification processes through the logger, avoiding the onerous confirmation process at the end of the round and the burden of nodes relaying cumulative inputs as required in OAM.

8.5. Real-World Deployment Scenarios

In this section, we evaluate DPTracer under configurations that reflect real-world DPN deployments. Chainlink and Pyth Network are two well-known oracle networks that operate based on the DPN model. However, since their network topologies are not publicly disclosed, we make assumptions based on the available service status information from their official websites [26,27] and define two network scenarios to approximate real-world deployments:
  • Direct Aggregation DPN (DA-DPN): In this scenario, the final result is decided by a single level of aggregator nodes directly connected to the collector nodes. The submitter acts as the final aggregator and the data requester. The Chainlink ETH/USD Price Feed [26] represents this scenario, where the final price is decided by a quorum of aggregator nodes at an on-chain smart contract, which serves as the final aggregator.
  • Network Aggregation DPN (NA-DPN): This scenario involves a network of nodes participating in the aggregation and validation of data. The aggregator network handles multiple types of data from various sources and thus represents a more complex and large-scale DPN compared to DA-DPN. The Pyth Network [14] reflects this scenario, where data providers perform validation and aggregation of price data.
Figure 11 illustrates the node configurations for the two scenarios. We instantiate each scenario according to our DPN model (Section 3.5), specifying both collector and aggregator nodes. For the DA-DPN scenario, we set the number of aggregator nodes to 31, based on the Chainlink ETH/USD Price Feed [26] (as of September 2024). For the NA-DPN scenario, we set the number of aggregator nodes to 114, matching the number of data providers in Pyth Network [27] (as of September 2024).
Price sources encompass various forms of price data, such as APIs and on-chain smart contracts. To represent the normalization and preprocessing required in data collection, which can differ depending on the price source, we assign collector nodes to the price sources. In the DA-DPN scenario, assuming multiple data sources per aggregator node, we set the number of collector nodes to 123 (three data sources per aggregator node). For NA-DPN, we set the number of collector nodes to 507, the number of price feeds served by Pyth Network, since it reflects the average number of unique data types handled within the aggregator network. In total, the DA-DPN and NA-DPN scenarios contain 154 and 621 nodes, respectively.
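The two scenarios can be captured as simple configuration records that drive the simulation; the dataclass below is an illustrative aid (its field names are assumptions), not the structure used in our prototype.

```python
from dataclasses import dataclass

@dataclass
class DPNScenario:
    """Node configuration for a simulated DPN deployment (illustrative)."""
    name: str
    collectors: int    # collector nodes assigned to price sources
    aggregators: int   # aggregator nodes
    max_d: int         # maximum aggregation factor

    @property
    def total_nodes(self) -> int:
        return self.collectors + self.aggregators

# Configurations derived from public service status pages (as of September 2024).
DA_DPN = DPNScenario("DA-DPN (Chainlink ETH/USD)", collectors=123, aggregators=31, max_d=31)
NA_DPN = DPNScenario("NA-DPN (Pyth Network)", collectors=507, aggregators=114, max_d=15)

print(DA_DPN.total_nodes, NA_DPN.total_nodes)  # 154 and 621, as stated above
```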
Figure 12 shows the overheads in the DA-DPN and NA-DPN scenarios. The overall overhead of the NA-DPN is higher due to the larger number of nodes and the complexity of the network structure. However, the high aggregation factor (maximum d = 31) in the DA-DPN, resulting from single-level aggregation, leads to increased memory usage and communication overheads. In contrast, the NA-DPN has a lower aggregation factor (d = 15), as multi-level aggregation can be applied in the aggregator network.
The results in Figure 12 illustrate that the network configuration significantly impacts the performance of the DPN. However, the overheads for both scenarios are still within acceptable limits for real-world DPN deployments. Even in the more complex NA-DPN scenario, DPTracer incurs less than 800 KBytes of network-wide communication overhead and generates commitments and proofs in under 0.05 ms, which is well within the typical round time for oracle networks (ranging from 400 ms to 6 s).

9. Related Work

Advances in cryptographic data structures and secure storage systems have driven the development of secure logging mechanisms. The integration of robust hardware solutions, such as Trusted Execution Environments (TEEs), along with the incorporation of blockchain technology, has further enhanced the capabilities of secure logging systems.
Crosby and Wallach [12] introduced History Tree with a set of semantics for tamper-evident logs, enabling efficient audits of an untrusted logger's stored events and of the historical consistency of the log. With its versioned construction, History Tree supports membership and incremental proofs, which enable clients to validate the log's outputs. Based on this structure, Pulls and Peeters [20] proposed Balloon, which combines two data structures (a hash tree and a History Tree) to create a forward-secure authenticated data structure. Balloon supports efficient membership and non-membership proofs and enables log clients to verify the correctness of inserts without retaining a full copy of the data structure. We further develop the History Tree into the Process Tree by combining the Merkle tree's principles with versioning of log entries. Furthermore, the Process Tree supports input-referencing and leaf-grouping, which enhance inspection during dispute resolution through data-flow reconstruction.
Similar to our approach, the Merkle tree has been adapted to support tamper-evident logging across various environments. For large-scale log streams, Ning et al. [23] extended the Merkle tree into a concurrent authenticated tree by adding chameleon hashing to enable concurrent log additions. More recently, Koisser and Sadeghi [24] proposed a tamper-evident logging system tailored for Internet of Things (IoT) devices that records all outputs from IoT devices to ensure node-level accountability. Our approach, however, focuses on providing tamper-evident logging for DPNs by enforcing mutual verification based on log records and integrating these verification steps directly into the protocol.
Custos [17], proposed by Paccagnella et al., comprises a tamper-evident logging layer that integrates with existing frameworks. It uses a TEE when adding a log entry: each entry is hashed and signed within the TEE enclave, so any modification of the log records can be detected during periodic audits. Similarly, in DPTracer, we use in-process verification to detect any log modification promptly and to ensure the log's comprehensiveness.
Incorporating blockchain as append-only storage for logs has been actively researched. However, adding log entries directly to the blockchain is impractical because of the low transaction speeds and high costs of on-chain transactions. Recent research has therefore focused on off-chain components that condense multiple log events into summarized data, resulting in fewer transactions. Zhao et al. [28] use a Merkle tree to aggregate raw sensing data into a digest, which is then written to the blockchain rather than storing the raw data directly. Similarly, in WedgeBlock [29] by Singh et al., off-chain nodes function as a cache for log data: they respond to user requests while deferring the blockchain transaction that actually stores the log. In this context, DPTracer's commitment is comparable to the condensed data offered by off-chain nodes. However, the primary goal of DPTracer is to ensure tamper-evident logging of the entire data provision process.

10. Conclusions

In this paper, we present DPTracer: a tamper-evident logging system for DPNs that integrates logging and validation into the data delivery process. DPTracer enhances the reliability and accountability of data provision in DPNs by enabling the reconstruction of historical data flows and the validation of node activities, which are crucial for self-correction in DPNs. Its features, such as mutual auditing and data-flow tracking, make DPTracer particularly suitable for applications that require high levels of transparency and accountability.
Our analysis and evaluation demonstrate that DPTracer provides robust security without incurring excessive overhead in terms of computation, memory, storage, or communication. While the communication and storage overheads of DPTracer grow with network complexity, they remain manageable for practical applications, as shown in the real-world deployment scenarios. By allowing DPNs to choose the level of security and traceability, DPTracer provides a flexible solution for the diverse security needs of different applications.
Moreover, the DPTracer approach can be extended to build decentralized data validation beyond DPNs. By integrating logging directly into the validation process, DPTracer can provide an efficient solution for lightweight validation of cross-domain messages, such as between blockchains and power-constrained IoT devices. As future work, we plan to extend DPTracer to support a decentralized data validation system for smart contracts across multiple blockchains.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within this article.

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Eskandari, S.; Salehi, M.; Gu, W.C.; Clark, J. SoK: Oracles from the ground truth to market manipulation. In Proceedings of the 3rd ACM Conference on Advances in Financial Technologies, Arlington, VA, USA, 26–28 September 2021; pp. 127–141.
2. Ezzat, S.K.; Saleh, Y.N.; Abdel-Hamid, A.A. Blockchain oracles: State-of-the-art and research directions. IEEE Access 2022, 10, 67551–67572.
3. Pasdar, A.; Lee, Y.C.; Dong, Z. Connect API with blockchain: A survey on blockchain oracle implementation. ACM Comput. Surv. 2023, 55, 1–39.
4. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.; Narang, S.; Chowdhery, A.; Zhou, D. Self-consistency improves chain of thought reasoning in language models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
5. Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large language model based multi-agents: A survey of progress and challenges. arXiv 2024, arXiv:2402.01680.
6. Mariani, S.; Cabri, G.; Zambonelli, F. Coordination of autonomous vehicles: Taxonomy and survey. ACM Comput. Surv. 2021, 54, 1–33.
7. Liu, F.; Panagiotakos, D. Real-world data: A brief review of the methods, applications, challenges and opportunities. BMC Med. Res. Methodol. 2022, 22, 287.
8. Zhang, H.; Bosch, J.; Olsson, H.H. Real-time end-to-end federated learning: An automotive case study. In Proceedings of the 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 12–16 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 459–468.
9. Breidenbach, L.; Cachin, C.; Chan, B.; Coventry, A.; Ellis, S.; Juels, A.; Koushanfar, F.; Miller, A.; Magauran, B.; Moroz, D.; et al. Chainlink 2.0: Next steps in the evolution of decentralized oracle networks. Chain. Labs 2021, 1, 1–136.
10. Bandchain. Band Protocol. 2020. Available online: https://docs.bandchain.org/ (accessed on 16 September 2024).
11. Benligiray, B.; Milic, S.; Vänttinen, H. Decentralized APIs for Web 3.0. API3 Found. Whitepaper 2020.
12. Crosby, S.A.; Wallach, D.S. Efficient data structures for tamper-evident logging. In Proceedings of the USENIX Security Symposium, Berkeley, CA, USA, 10–14 August 2009; pp. 317–334.
13. The 3 Levels of Data Aggregation in Chainlink Price Feeds. 2020. Available online: https://blog.chain.link/levels-of-data-aggregation-in-chainlink-price-feeds/ (accessed on 16 September 2024).
14. Pyth. Pyth Network. Available online: https://pyth.network/ (accessed on 16 September 2024).
15. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2024, 36.
16. Kwon, Y.; Wang, F.; Wang, W.; Lee, K.H.; Lee, W.C.; Ma, S.; Zhang, X.; Xu, D.; Jha, S.; Ciocarlie, G.; et al. MCI: Modeling-based causality inference in audit logging for attack investigation. In Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, San Diego, CA, USA, 18–21 February 2018.
17. Paccagnella, R.; Datta, P.; Hassan, W.U.; Bates, A.; Fletcher, C.; Miller, A.; Tian, D. Custos: Practical tamper-evident auditing of operating systems using trusted execution. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 23–26 February 2020.
18. Ahmad, A.; Lee, S.; Peinado, M. Hardlog: Practical tamper-proof system auditing using a novel audit device. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 23–25 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1791–1807.
19. Maniatis, P.; Baker, M. Secure history preservation through timeline entanglement. In Proceedings of the 11th USENIX Security Symposium (USENIX Security 02), San Francisco, CA, USA, 5–9 August 2002.
20. Pulls, T.; Peeters, R. Balloon: A forward-secure append-only persistent authenticated data structure. In Proceedings of the Computer Security—ESORICS 2015: 20th European Symposium on Research in Computer Security, Vienna, Austria, 21–25 September 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 622–641.
21. Chan, H.; Perrig, A.; Song, D. Secure hierarchical in-network aggregation in sensor networks. In Proceedings of the 13th ACM Conference on Computer and Communications Security, Alexandria, VA, USA, 30 October–3 November 2006; pp. 278–287.
22. Bellare, M.; Yee, B. Forward Integrity for Secure Audit Logs; Technical Report; University of California at San Diego: San Diego, CA, USA, 1997.
23. Ning, F.; Wen, Y.; Shi, G.; Meng, D. Efficient tamper-evident logging of distributed systems via concurrent authenticated tree. In Proceedings of the 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), San Diego, CA, USA, 10–12 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–9.
24. Koisser, D.; Sadeghi, A.R. Accountability of Things: Large-Scale Tamper-Evident Logging for Smart Devices. arXiv 2023, arXiv:2308.05557.
25. Everything You Need to Know About the Pyth Network. 2023. Available online: https://pyth.network/blog/what-is-the-pyth-network (accessed on 16 September 2024).
26. Chainlink. ETH/USD Price Feed Status. 2024. Available online: https://data.chain.link/feeds/ethereum/mainnet/eth-usd (accessed on 16 September 2024).
27. Pyth Network. Pyth KPI Metrics. 2024. Available online: https://kpi.pyth.network/ (accessed on 16 September 2024).
28. Zhao, W.; Aldyaflah, I.M.; Gangwani, P.; Joshi, S.; Upadhyay, H.; Lagos, L. A blockchain-facilitated secure sensing data processing and logging system. IEEE Access 2023, 11, 21712–21728.
29. Singh, A.A.; Zhou, Y.; Sadoghi, M.; Mehrotra, S.; Sharma, S.; Nawab, F. WedgeBlock: An Off-Chain Secure Logging Platform for Blockchain Applications. In Proceedings of the 26th International Conference on Extending Database Technology, Ioannina, Greece, 28–31 March 2023; pp. 526–539.
Figure 1. Examples of multi-agent data provisioning.
Figure 2. Data provision network model.
Figure 3. Process Tree at version 7.
Figure 4. Process Tree at version 4.
Figure 5. Interval proof (π_{3,5,6}).
Figure 6. Operations of nodes, the submitter, and the inspector interacting with the logger.
Figure 7. Computational overhead.
Figure 8. Memory overhead.
Figure 9. Storage overhead.
Figure 10. Communication overhead.
Figure 11. Real-world deployment scenarios of price oracle networks.
Figure 12. Overhead in the DA-DPN and NA-DPN scenarios.
Table 1. Notations.
Symbol | Description
X_i | Version-i (i-th) log entry
C_i | Version-i commitment
C | Set of commitments
π_{i,j,k} | Interval proof for versions i, j, and k
r | Round number
N | Total number of nodes
n_α | Node with identifier α
R_α | Upstream report of n_α
R | Set of upstream reports
v(n_α) | Version number of n_α
V | Set of input versions
L | Engagement list of nodes contributing to the final result
 | Aggregation operator
Table 2. Comparison of DPTracer with existing tamper-evident logging systems.
Feature | DPTracer | History Tree [12] | CAT [23] | AoT [24]
Tamper-Evidence | Tamper-evidence with mutual auditing across the entire data flow. | Tamper-evidence through the Merkle-tree-based History Tree. | Focuses on concurrency via Concurrent Authenticated Tree (CAT). | Time-sparse tree with parity integrity for constant storage overhead.
Target Applications | Data provisioning networks requiring data-flow transparency and strong tamper-evidence. | Logging systems with efficient static log verification. | Large-scale distributed systems with high concurrency needs. | Large-scale IoT environments with minimal storage and periodic tamper-evidence.
Data Flow Tracking | Provides full transparency with the ability to reconstruct data flow from input to output. | Focuses on static log integrity, not data flow tracking. | No data flow tracking. | Focuses on accountability between devices without flow tracking.
Mutual Auditing | Integrates real-time mutual audits between nodes and the logger to the protocol. | Supports immediate one-way log audit. | Supports one-way log audit. | Integrates periodic audits in data reporting.
Scalability | Scalable for DPNs but incurs higher overhead due to data-flow tracking. | Highly scalable for static datasets. | Optimized for large-scale, high-concurrency environments. | Scalable for IoT environments with low overhead.
Storage Requirements | Higher storage due to Process Tree serialization of data flow. | Moderate storage due to Merkle-tree-based structure. | Moderate storage, efficient handling of concurrent logs. | Low storage overhead with constant space requirement.
