### *Layer-2 Solution*

While Layer-1 solutions in DLTs define the form of the ledger, its distribution, consensus mechanism and features, Layer-2 solutions are built on top of Layer-1 without changing its trust assumptions, i.e., the consensus mechanism or the structure [63,64]. Layer-2 protocols allow users to communicate through mediums external to the DLT network, reducing the transaction load on the underlying DLT. On top of the IOTA Layer-1 DLT, we designed a Layer-2 solution using a DHT with the aim of facilitating the search for large amounts of data through specific keywords (Figure 2). Indeed, in order to obtain information from an IOTA message within a stream channel, it is necessary to know the exact address of the message or of the channel, i.e., the announcement link. However, the announcement link of a stream channel does not provide any information related to the type and kind of its messages. IOTA (like the majority of DLTs) provides no mechanism for discovery based on the content of the data/stream channels available in the Tangle. This is the issue we deal with in this paper. In the remainder of this section, we describe how to surmount such limitations: in our system, every stream channel is indexed by a keyword set, and we show how such a keyword set is exploited to look for specific kinds of content.

**Figure 2.** Layers in the context of DLTs. Layer zero consists of the DLT network, while Layer-1 is the set of software frameworks run by the network nodes (e.g., the ledger). Layer-2 solutions are the ones that leverage Layer-1 for other services, i.e., the hypercube DHT in our case.

### *4.4. Hypercube-Structured DHT*

Considering *O* as the set of all stream channels in IOTA, the idea is to map each object *o* ∈ *O* to a keyword set *Ko* ⊆ *W*, where *W* is the keyword space, i.e., the set of all keywords considered. In general, we refer to *K* ⊆ *W* as a keyword set that can be associated with a data content (i.e., its metadata) or with a query (i.e., we are looking for some content with specific metadata). By using a uniform hash function *h* : *W* → {0, 1, . . . , *r* − 1}, a keyword set *K* can be represented by the result of such a function, i.e., a string of bits *u* where the 1s are set in the positions given by *one*(*u*) = {*h*(*k*) | *k* ∈ *K*}. In other words, each *k* ∈ *W* has a fixed position in the *r*-bit string given by *h*(*k*), and that position can be associated with more than one *k* (i.e., a hash collision). Then, every keyword set *K* is represented by an *r*-bit string whose positions are "activated", i.e., set to 1, by all the *k* ∈ *K*.
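As a minimal sketch of this mapping (the SHA-256-based hash function and the dimension `R = 8` are our own illustrative assumptions, not the paper's implementation), a keyword set can be turned into its *r*-bit string as follows:

```python
import hashlib

R = 8  # hypercube dimension r (illustrative; the paper's tests use r = 7..13)

def h(keyword: str) -> int:
    """Uniform hash h : W -> {0, 1, ..., r-1}, here built on SHA-256."""
    return int.from_bytes(hashlib.sha256(keyword.encode()).digest(), "big") % R

def keyword_set_to_bits(keywords: set[str]) -> int:
    """Represent a keyword set K as an r-bit string u: bit h(k) is set to 1
    for every k in K (hash collisions simply activate the same position)."""
    u = 0
    for k in keywords:
        u |= 1 << h(k)
    return u
```

Note that two distinct keywords colliding under *h* activate the same bit, so distinct keyword sets may map to the same *r*-bit string.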

We use these *r*-bit strings to identify logical nodes in a DHT network, e.g., for *r* = 4, a node id can take values such as 0100 or 1110. In particular, inspired by [24], we refer to the geometric form of the hypercube to organize the topological structure of such a DHT network. *Hr*(*V*, *E*) is an *r*-dimensional hypercube, with a set of vertices *V* and a set of edges *E* connecting them. Each of the 2<sup>*r*</sup> vertices represents a logical node, whilst edges are formed when two vertices differ by only one bit, e.g., 1011 and 1010 share an edge. In the network, the nodes represented by vertices that share an edge are network neighbors as well. To find out how far apart two vertices *u* and *v* are within the hypercube, the Hamming distance can be used, i.e., *Hamming*(*u*, *v*) = ∑<sub>*i*=0</sub><sup>*r*−1</sup> (*ui* ⊕ *vi*), where ⊕ is the XOR operation and *ui* is the bit at the *i*-th position of the *u* string, e.g., for *u* = 1011 and *v* = 1010, we have *Hamming*(*u*, *v*) = 1.
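The Hamming distance and the neighbor relation above can be sketched directly on integer node ids (a simple sketch; function names are ours):

```python
def hamming(u: int, v: int, r: int) -> int:
    """Hamming(u, v) = sum over i of (u_i XOR v_i) on r-bit ids."""
    return bin((u ^ v) & ((1 << r) - 1)).count("1")

def are_neighbors(u: int, v: int, r: int) -> bool:
    """Two hypercube vertices share an edge iff they differ in exactly one bit."""
    return hamming(u, v, r) == 1

def neighbors(u: int, r: int) -> list:
    """The r neighbors of vertex u in the r-dimensional hypercube H_r:
    flip each bit position in turn."""
    return [u ^ (1 << i) for i in range(r)]
```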

### 4.4.1. Keyword-Based Complex Queries

In our system, contents can be discovered through queries that are based on the lookup of multiple keywords associated with data. Such queries are processed by the DHT-based indexing scheme described in the previous section. The base idea is to associate a keyword set to each IOTA stream channel through the DHT. In particular, each logical node locally stores an index table that associates a keyword set *Ko* with the announcement link of an IOTA stream channel, i.e., the reference of an object *o*. Then, given a keyword set *K*, the associated *r*-bit string is used to reach, through a routing mechanism, the logical node responsible for *K*, in order to obtain the set of objects {*o* ∈ *O* | *Ko* ⊇ *K*}. For instance, with *W* = {*"Turin", "Lingotto", "Temperature", "Celsius"*} and 1010 representing the keyword set *K* = {*"Turin", "Temperature"*}, if *u* ∈ *V* is the node responsible for *K* because the id of *u* is equal to 1010, then *u* is in charge of maintaining a list of announcement links of IOTA stream channels containing the temperature of the city of Turin. Once that node is located, the objects {*o* ∈ *O* | *Ko* = *K*} it stores in its index table can be returned or aggregated with other nodes' objects. These objects consist of a list of announcement links that can be used to obtain messages from IOTA.
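A logical node's index table can be sketched as a map from a keyword-set bit string to the announcement links registered under it (the class name, method names and the link string are illustrative assumptions, not the paper's implementation):

```python
class LogicalNode:
    """A DHT logical node identified by an r-bit id, storing an index table
    that maps a keyword-set bit string to IOTA announcement links."""

    def __init__(self, node_id: int):
        self.node_id = node_id
        self.index = {}  # bit string (int) -> set of announcement links

    def publish(self, key_bits: int, announcement_link: str) -> None:
        """Register an object (a stream channel) under its keyword set."""
        self.index.setdefault(key_bits, set()).add(announcement_link)

    def pin_search(self, key_bits: int) -> set:
        """Return the objects whose keyword set is exactly K (Ko = K)."""
        return self.index.get(key_bits, set())
```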

### 4.4.2. Multiple Keywords Search

Our system provides two functions for making queries based on multiple keywords:

- *Pin Search*: given a keyword set *K*, it retrieves the objects stored by the single node responsible for *K*, i.e., {*o* ∈ *O* | *Ko* = *K*};
- *Superset Search*: given a keyword set *K*, it retrieves the objects whose keyword set contains *K*, i.e., {*o* ∈ *O* | *Ko* ⊇ *K*}, from all the nodes responsible for a superset of *K*.
For the Pin Search, we need to retrieve objects from only one node, whilst for the Superset Search, we need to retrieve objects from all nodes that are responsible for a superset of *K*. Such nodes are contained in the sub-hypercube *SH*(*S*, *F*) induced by the node *u* responsible for *K*, where *S* includes all the nodes *s* ∈ *V* that "contain" *u*, i.e., *ui* = 1 ⇒ *si* = 1, while *F* includes all the edges *e* ∈ *E* between such nodes. Thus, during a Superset Search, the induced sub-hypercube is computed and then only the nodes in such a sub-hypercube are queried using a spanning binomial tree, as described in [24] (Definition 4.2). The limit *l* is a query parameter that indicates the maximum number of objects to return when traversing the spanning binomial tree.
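The vertex set *S* of the induced sub-hypercube can be enumerated by setting every subset of the bits that are 0 in *u* (a sketch of the set computation only, not of the spanning-binomial-tree traversal described in [24]):

```python
def superset_nodes(u: int, r: int) -> list:
    """Vertices s of the sub-hypercube induced by u, i.e., all s such that
    u_i = 1 implies s_i = 1: set any subset of the bits that are 0 in u."""
    zero_positions = [i for i in range(r) if not (u >> i) & 1]
    nodes = []
    for mask in range(1 << len(zero_positions)):
        s = u
        for j, pos in enumerate(zero_positions):
            if (mask >> j) & 1:
                s |= 1 << pos
        nodes.append(s)
    return nodes
```

For *u* with *m* bits set, the sub-hypercube contains 2<sup>*r*−*m*</sup> vertices, which is why the Superset Search bounds the traversal with the limit *l*.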

### 4.4.3. The Query Routing Mechanism

Queries can be injected into the system by users external to the DHT at any network node *v* ∈ *V*. Through a routing mechanism, the query reaches the node *u* ∈ *V* that is responsible for the keyword set *K*. This process is described in detail in Algorithm 1.
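A standard greedy hypercube routing scheme can serve as a sketch of how a query moves from the injection node to the responsible node: at each hop, one bit in which the current vertex differs from the target is flipped, decreasing the Hamming distance by one (this is our illustration of hypercube routing in general, not a transcription of Algorithm 1):

```python
def route(start: int, target: int, r: int) -> list:
    """Greedy hypercube routing: repeatedly flip the lowest differing bit,
    moving to a neighbor one step closer to the target node."""
    path = [start]
    current = start
    while current != target:
        diff = current ^ target
        current ^= diff & -diff  # flip the lowest-order differing bit
        path.append(current)
    return path
```

The path length equals the Hamming distance between the two ids, so any query needs at most *r* hops.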


### **5.** *k***-DaO Use Case: Participatory Data Stewardship and Citizen-Generated Data Creation**

The aim of this section is to describe a possible implementation of the above architecture through a specific use case. We first describe the scenario and then we go into the details of the technical specification. With this scenario, we find ourselves in the general context of facilitating the use of privately held data for the public interest. This is in line with the vision of the European Union's strategy on data sharing for public interest [7]. More specifically, the vision we intend to pursue with the implementation of our decentralized personal data marketplace is part of the intent to enable different stakeholders (government, businesses and citizens) to give access to and to use data transformed into non-personal form in order to create value and to make better decisions. In fact, the European Data Strategy elaborates on some points in this area, with components related to data governance and common data spaces, also by means of the Data Governance Act [14]. It enables the safe reuse of certain categories of public-sector data such as personal data.

The specific context of our scenario deals with participatory data [2] such as citizen-generated data. Citizen-generated data, which cover a range of scenarios from participatory sensing to crowdsourced geospatial datasets, can be integrated with open data portals and, in the future, with shared data spaces. Although, to date, they are not as impactful, the aim is to increase and improve the presence of such data and to involve citizens in designing open data policy, processes and governance [65]. In most cases, citizen-generated data should be made orthogonal to the application of data protection laws and regulations, e.g., the GDPR. Therefore, citizen-generated data should not contain personal data, or personal data shall be appropriately anonymized or aggregated.

With this in mind, we describe the use case with the help of Figure 3. At the highest level, the flow of data is as follows: (i) citizens store and maintain their personal data in a PDS; (ii) a data aggregator undertakes the task of aggregating a specific kind of data and accesses the PDS through smart-contract access policies; (iii) the aggregator uses algorithms such as *k*-anonymity [66] to render the input personal data anonymous; and (iv) the citizen-generated anonymized aggregated dataset is published for potential data consumers. The main idea is to enable the participation of data owners in the dataset generation through a DAO. A token-based incentive for DAO members, i.e., tokenized data structures, can be used to enable participants to work together to build a curated dataset, in pursuit of the instantiation of a decentralized, tokenized data marketplace [19].

**Figure 3.** Citizen-generated data use case. Data owners store personal data in a PDS and set some access policies through smart contracts. A data aggregator accesses these data and produces an anonymized dataset in a participatory data stewardship framework. The anonymized aggregated dataset can then be accessed by other data consumers.

We imagine a concrete scenario of citizen-generated hiking trails or pedestrian travel routes, produced using GPS-enabled smartphones running applications such as Komoot or AllTrails [67]. For this scenario, one can simply consider three kinds of personal data: (i) the user's travel trace, i.e., a set of latitude and longitude points associated with a timestamp; (ii) the user's photos taken during the travel; and (iii) the list of nearby Bluetooth devices, updated at a constant interval.

### *5.1. Anonymizing Data by Aggregation*

Figure 3 shows an overview of the interaction between the main actors. Data owners (leftmost boxes) maintain personal data in a PDS implemented using IPFS [20] as the DFS. These data are travel traces, photos and Bluetooth ids recorded during the data owners' hiking sessions. Data owners also register the personal data they want to share along with descriptions of what they measure, i.e., keywords in the hypercube DHT (not shown in the figure; see Section 4.4). Each piece of data is then indexed in the IOTA DLT through a new stream channel for each hiking session (not shown in the figure). The messages in the channel refer to data in IPFS using the CID as an immutable universal identifier. An access-control smart contract owned by the data owner (between data owners and aggregator in the figure) points to the different stream channels using the associated announcement links. This smart contract is stored in a private permissioned Ethereum blockchain implemented using GoQuorum [68], i.e., the authorization blockchain. The data aggregator (in the middle of the figure) interacts with such a blockchain to request the data owners' data in line with their policies. If it manages to access the data of at least *k* data owners, the aggregator creates a *k*-DaO with the owners in the same blockchain, in order to work in a participatory data stewardship framework [2]. The anonymized dataset must meet certain requirements; otherwise, the *k*-DaO may decide to stop production. For instance, the data aggregator should be able to perform the data aggregation, producing a dataset that presents properties of *k*-anonymity and differential privacy [69]. This dataset can then be accessed by a variety of data consumers (rightmost boxes in the figure) using the same data marketplace, in a process where every participant in the dataset creation is rightfully rewarded.

We now make a brief digression on what it means to apply anonymization techniques in this case. GDPR Recital 26 states that personal data become anonymous if it is 'reasonably likely' that no identification of a natural person can be derived from them [13]. This is based on the fact that the anonymization of a dataset can only be defined as robust on a case-by-case basis [70]. Some techniques can provide privacy guarantees and can be used to generate efficient anonymization processes, but only if their application is engineered appropriately. The *k*-anonymity proposal was introduced in [66], and it is considered one of the most popular approaches for syntactic protection, i.e., each release of data must be indistinguishably related to no fewer than a certain number (e.g., *k*) of individuals in the population. For instance, through a generalization approach, original values are substituted with more general ones, such as a date of birth generalized by removing the day and month of birth. On the other hand, we find semantic techniques, i.e., where the result of an analysis carried out on a dataset is insensitive to the insertion or deletion of a tuple in the dataset. Differential privacy [69] is the main example in this case: a dataset is released and recipients learn properties that hold for the population as a whole but are probably wrong for any single individual. This can be achieved, for instance, by adding noise to the original dataset.
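The generalization approach and the resulting *k*-anonymity property can be sketched as follows (the choice of quasi-identifiers, the record layout and the truncation rules are illustrative assumptions, not the paper's aggregation pipeline):

```python
from collections import Counter

def generalize(record: dict) -> tuple:
    """Generalization step: keep only the birth year (dropping day and month)
    and truncate the ZIP code to its first three digits."""
    return (record["birth_date"][:4], record["zip"][:3])

def is_k_anonymous(records: list, k: int) -> bool:
    """A release is k-anonymous if every combination of generalized
    quasi-identifiers is shared by at least k records."""
    counts = Counter(generalize(rec) for rec in records)
    return all(c >= k for c in counts.values())
```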

#### *5.2. Step Zero: Search Data on the Decentralized Marketplace*

The first step, or "step zero", before accessing any piece of data is the search for a specific kind of data, i.e., the data subset that a potential data consumer is interested in. This is the part where the hypercube DHT comes into play (see Section 4.4 for a detailed explanation). Figure 4 shows an example of the search for data on the decentralized marketplace. The data aggregator requests a Superset Search to the hypercube, with the keyword set *K* = {*"walk", "mountain", "Tuscany"*}. The hypercube returns a set of announcement links pointing to IOTA stream channels containing related data, e.g., *6bb3347. . . :219*. The first message of the stream channel is open to the marketplace users and includes information that points to the smart contract used for the access control, namely the identifier of the authorization blockchain network and the smart contract address. Each subsequent message of the channel includes information pointing to the data themselves, i.e., hash links in the form of IPFS CIDs. In particular, each message stores the CID of an IPFS directory that stores the location data, photo and Bluetooth ids at a specific timestamp, e.g., *QmW...V4b2t* is the CID of the directory and *QmW...V4b2t/1* contains a location point and timestamp. Of course, the data are encrypted; hence, the content of the IPFS data is not meaningful at this point. The next step, thus, is to gain access to the content key used for the encryption.

**Figure 4.** Example of searching data on the decentralized marketplace.

#### *5.3. Smart Contracts Implementing the Distributed Access Control*

As seen in Section 4.2, the interesting aspect of smart contracts is that an algorithm executed in a decentralized manner enables two parties, i.e., the data owner and the aggregator, to reach an agreement in the transaction of the data. This not only increases the disintermediation in such a process but also leaves traces that can later be audited and provides incentives for all the actors to behave correctly. Figure 5 graphically shows the process of the data aggregator accessing a data owner's data, while Figure 6 shows the UML Class Diagram of the smart contract implementations we discuss in this subsection.

**Figure 5.** Example of the distributed access control where a data aggregator requests access to the data of some data owners.

**Figure 6.** UML Class Diagram of *DataOwnerContract* and *AggregationContract*. Some classes, attributes and methods have been removed to render the diagram clearer.


### *5.4. Smart Contracts Implementing the k-DaO*

The *k*-DaO is a DAO composed of the *k Da*ta *O*wners that grant access to their data to the data aggregator. Simply put, the aggregator stakes a safety deposit, and the DAO can be used to start, at any moment, a vote to redeem this stake. The rationale behind this is to limit the aggregator's malicious behavior. Moreover, if the creation process of the anonymized dataset involves a more complex case of a curated dataset (e.g., OpenStreetMap [71]), then DAO members can make new proposals and add suggestions to vote on in order to steer the development of the dataset generation. Figure 7 graphically shows the process of *k*-DaO creation and voting, while Figure 8 shows the UML Class Diagram of the smart contract implementations we discuss in this subsection.


**Figure 7.** Example of the anonymized dataset creation and DAO voting.

### *5.5. Anonymized Aggregated Dataset*

Finally, the work of the aggregator culminates in the production of new data in the form of anonymized aggregated data, providing anonymity by design. Multiple configurations of aggregated data can be produced, if agreed upon beforehand. Additionally, some kind of proof can be implemented for measuring the exact quantity of data used from each subject's dataset, e.g., storing in the *kDaO* contract the root of a Merkle tree whose leaves are the hashes of all the data pieces used; *k*-DaO members can then validate it by requesting (off-chain) the leaves from the aggregator.
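The Merkle root mentioned above can be computed as in the following sketch (SHA-256 and the duplicate-last-leaf convention for odd levels are our assumptions; the paper does not fix a specific tree layout):

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(pieces: list) -> bytes:
    """Root of a Merkle tree whose leaves are the hashes of the data pieces.
    An odd-sized level duplicates its last node (one common convention)."""
    level = [sha256(piece) for piece in pieces]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

Storing only this 32-byte root on-chain commits the aggregator to the exact set of data pieces, which members can audit by requesting individual leaves and their sibling hashes off-chain.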

For the sake of the citizen-generated data use case, the result of the whole process is stored in an open data platform. If needed, some data, such as the participants list, can be shown upon request, but they are not public, since the authorization blockchain is a private permissioned one. In other cases, the resulting dataset can be encrypted, uploaded to IPFS and then referenced in new stream channels. In this case, the dataset is treated as all the other kinds of data in the marketplace, and data consumers can access it through a *DataOwnerContract* owned by the aggregator. Some kind of royalties can then be transferred directly to the *k*-DaO members, where the payment is proportional to the contribution produced by each participant, e.g., aggregator = 55%, data owner1 = 20%, data owner2 = 10%, data owner3 = 15%.

### **6. Performance Evaluation**

Based on the above *k*-DaO use case, we conducted the performance evaluation in three stages: (i) in the first stage, we simulated a DHT network implementing the hypercube queries of the use case's "step zero", in order to test the average steps necessary to reach all nodes; (ii) in the second stage, we set up a local permissioned authorization blockchain to test the distributed access control in use case's steps one and two; and (iii) in the third stage, we evaluate the implementation of all smart contracts by measuring the gas usage.

In this work, we lack an analysis of the performance of storing and retrieving data from IOTA and IPFS. However, we dealt with these aspects in previous work, specifically testing the storing of personal data such as location data and photos, i.e., testing IOTA [74] and DFSs including IPFS [75]. We refer the reader to these two studies. Moreover, since these are separate systems, the latencies of the operations add up: a data aggregator first needs to obtain the content key from the authorization blockchain (evaluated in this work) and then operate with IOTA or IPFS.

The implementation of the decentralized personal data marketplace components can be found as open-source code on GitHub [76–79].

### *6.1. Hypercube DHT Simulation*

We conducted a simulation assessment using PeerSim, a simulation environment developed to build P2P networks using extensible and pluggable components [80,81]. Once the hypercube-structured DHT was designed and implemented for the multiple keywords search (Section 4.4.2), we focused on studying the efficiency of the routing mechanism. The simulation implementation and the test data can be found as open source code in [82]. Below are the main results obtained.

### 6.1.1. Tests Setup

Several tests were carried out assuming different scenarios in which the network consisted of a variable number of nodes and stored a variable number of objects. In order to evaluate the Pin Search and the Superset Search, tests were carried out on different sizes of the hypercube. Specifically, the number of nodes varied from 128 (*r* = 7) up to 8192 (*r* = 13). Then, for each dimension *r*, a different number of randomly created keyword objects, i.e., IOTA announcement links, was inserted in the DHT. The number of objects taken into consideration was 100, 1000 and 10,000.

### 6.1.2. Results

Given the nature of the tests, i.e., a simulated network, we considered the number of hops required for each new query as the parameter to be evaluated. A hop occurs when a query message is passed from one DHT node to the next. The query keyword sets were randomly generated, and the starting node was randomly chosen. For each type of test, 50 repetitions were performed, and the average results were then calculated. For the Superset Search, the limit value was set to *l* = 10 objects.
