**2. Background**

In this section, we introduce the main concepts and technologies involved in our work.

### *2.1. Distributed Hash Table (DHT)*

A Distributed Hash Table (DHT) is a distributed infrastructure and storage system that provides the functionality of a hash table, i.e., a data structure that efficiently maps "keys" to "values". It consists of a P2P network of nodes that hold the table data and rely on a routing mechanism for locating objects in the network [24]. Each node in the DHT network is responsible for a portion of the system's keys and makes the objects mapped to those keys reachable. In addition, each node stores a partial view of the entire network, with which it exchanges routing information. To reach nodes in another part of the network, a routing procedure typically traverses several nodes, getting closer to the destination at each hop. This type of infrastructure has been used as a key element to implement complex and decentralized services, such as Content-Addressable Networks (CANs) [27], Decentralized File Storage (DFS) [20], cooperative web caching, multicast and domain name services.
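As a minimal sketch of how key responsibility can be assigned, consider a simplified consistent-hashing ring (node names, the identifier-space size and the successor rule are illustrative assumptions, not the scheme of any specific DHT):

```python
import hashlib

BITS = 16  # small identifier space, for illustration only

def ident(name: str) -> int:
    # Map a node name or a key onto the identifier ring via hashing.
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % (1 << BITS)

class HashRing:
    def __init__(self, node_names):
        self.ring = sorted(ident(n) for n in node_names)

    def responsible(self, key: str) -> int:
        # Successor rule: each node is responsible for the keys between
        # its predecessor's identifier and its own.
        k = ident(key)
        for nid in self.ring:
            if nid >= k:
                return nid
        return self.ring[0]  # wrap around the ring
```

In a real DHT, each node would know only a partial view of this ring and the `responsible` lookup would be performed hop by hop across several nodes.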

### *2.2. Decentralized File Storage (DFS)*

Decentralized File Storage (DFS) is a solution for storing files as in Cloud Storage [28] while retaining the benefits of decentralization [9]. Such systems offer higher data availability and resilience thanks to data replication. A DFS comprises a P2P network of nodes that provide storage and follow the same protocol for content storing and retrieval. With content-based addressing, contents are queried directly through the network rather than by establishing a connection with a server. In order to know which DFS node in the network owns the requested contents, it is possible to rely on a DHT in charge of mapping the contents, i.e., files and directories, to the addresses of the peers owning such data. A principal example of DFS is the InterPlanetary File System (IPFS) [20], a protocol that builds a distributed file system over a P2P network with a focus on data resilience. The IPFS P2P network stores and shares files and directories in the form of IPFS objects that are identified by a CID (Content Identifier). The CID acts as an immutable universal identifier used to retrieve an object in the network: only the file digest is needed, i.e., the result of a hash function applied to the data. Users who want to locate an object use this identifier as a handle. When an IPFS object is shared in the network, it is identified by the CID derived from the object hash, for instance a directory with a CID equal to *QmbWqxBEKC3P8tqsKc98xmWNzrzDtRLMiMPL8wBuTGsMnR*. Even if other nodes in the network share the same exact directory, the CID will always be the same.
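The deterministic derivation of a content identifier can be sketched as follows (a simplification: real IPFS CIDs wrap the digest in a multihash and base-encode it, which is omitted here):

```python
import hashlib

def content_id(data: bytes) -> str:
    # Simplified content identifier: the hex digest of the data.
    # Actual IPFS CIDs additionally encode the hash function used
    # (multihash) and a base encoding (multibase).
    return hashlib.sha256(data).hexdigest()

# Byte-identical content always yields the same identifier,
# regardless of which node shares it.
```

This is what makes the identifier both universal (any node computes the same value) and verifiable (a retriever can re-hash the data and compare).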

### *2.3. Distributed Ledger Technology (DLT)*

Distributed Ledger Technologies (DLTs) consist of networks of nodes that maintain a single ledger and follow the same protocol, including a consensus mechanism, for appending information to it. The blockchain is a type of DLT where the ledger is organized into blocks, each sequentially linked to the previous one. The execution of the same protocol, i.e., source code, ensures that, in most cases, the ledger is tamper-proof and unforgeable. This allows for a trust mechanism to be created without the need for third-party intermediaries [29,30].

There are different implementations of DLTs, each with its pros and cons. In permissionless ones, anyone can take part in the consensus mechanism, while this is not true in permissioned ones. Another distinction lies in the support of smart contracts, e.g., Ethereum [22]. This feature is often in tension with other key features, such as the scalability and responsiveness of the system [31]. Conversely, some implementations are designed to provide better scalability at the expense of lacking some features. IOTA [21], for instance, implements a more scalable solution for distributing the ledger. It is a Layer-1 solution, while Layer-2 solutions are technologies that operate on top of an underlying DLT to improve its scalability [32].

### *2.4. Smart Contract and Decentralized Autonomous Organization (DAO)*

A smart contract is a new contract paradigm that does not fully embody the features of a legal contract but acts as a self-managed structure able to execute code that enforces agreements between two or more parties. A smart contract consists of instructions that, once distributed on the ledger, cannot be altered. Thus, the result of its execution will always be the same for all DLT nodes running the same protocol. When a smart contract is deployed on the DLT and the issuer is confident that the code embodies the intended behavior (e.g., by reviewing the code), then transactions originating from that contract do not require the presence of a third party to have value [33].

Smart contracts are fundamental components of Ethereum that reside on the blockchain and are triggered by specific transactions [34]. Moreover, smart contracts can communicate with other contracts and even create new ones. These contracts make it possible to build Decentralized Applications (dApps) and Decentralized Autonomous Organizations (DAOs) [6,32,35–37]. A DAO is a virtual entity managed by a set of interconnected smart contracts, where various actors maintain the organization's state through a consensus system and are able to implement transactions, currency flows, rules and rights within the organization. Members of a DAO can propose options for decisions in the organization and discuss and vote on them through transparent mechanisms [26].

### *2.5. IOTA and Streams*

In this work, we specifically refer to the IOTA DLT as a technology that uses a different paradigm for managing the ledger; however, there are many other alternatives, such as Radix [38] or Nano [39]. IOTA is a DLT that allows hosts in a network to transfer immutable data among each other. The IOTA ledger, i.e., the Tangle [21], is based on a Directed Acyclic Graph (DAG) where vertices represent transactions and edges represent validations of previous transactions. This validation approach is designed to address two major issues of traditional blockchain-based DLTs, i.e., latency and fees. IOTA has been designed to offer fast validation, and no fees are required to add a transaction to the Tangle [40]. When a new transaction is to be issued, two previous transactions must be referenced as valid (i.e., tip selection), and then a small amount of Proof-of-Work is performed.

An important feature offered by IOTA is Streams [41], a communication protocol that adds the ability to emit and access encrypted message streams over the Tangle [40]. Message streams take the form of channels, i.e., linked lists of ordered messages stored in transactions. Once a stream channel is created, only the channel author can publish encrypted messages on it. Subscribers that possess the channel encryption key (or set of keys, since each message can be encrypted using a different key) can decode messages. A channel is addressed using an "announcement link". In other words, IOTA Streams enables users to subscribe to and follow a message stream channel generated by some device. From a logical point of view, channels are an ordered set of messages; in fact, a channel is referenced through the link of a "starting" message.
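The channel structure described above can be sketched as an append-only, hash-linked list (this is a toy model for intuition only; it does not reproduce the actual Streams message format, key handling or Tangle anchoring):

```python
import hashlib

class Channel:
    """Toy model of a Streams-like channel: an append-only, hash-linked
    list of messages, addressed via the id of its first ("announcement")
    message."""

    def __init__(self, author: str):
        self.author = author
        self.messages = []  # list of (msg_id, payload) in publication order

    def publish(self, sender: str, payload: bytes) -> str:
        # Only the channel author may publish new messages.
        if sender != self.author:
            raise PermissionError("only the channel author can publish")
        prev = self.messages[-1][0] if self.messages else b"announce"
        msg_id = hashlib.sha256(prev + payload).digest()  # link to predecessor
        self.messages.append((msg_id, payload))
        return msg_id.hex()

    @property
    def announcement_link(self) -> str:
        # The handle used by subscribers to address the whole channel.
        return self.messages[0][0].hex() if self.messages else ""
```

Subscribers would additionally hold the channel encryption key(s) to decode each payload, which is omitted here.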

### *2.6. Proxy Re-Encryption (PRE) and Cryptographic Threshold Schemes*

Distributed systems usually store data as they are received, without further processing for confidentiality. Therefore, data can be accessed by any network participant. In order to protect personal data, we employ two cryptographic schemes in this work, which are described in the following. Proxy Re-Encryption (PRE) is a cryptographic protocol in which it is not necessary to know the recipient of the data in advance [42]. PRE is a type of public key encryption built around a proxy. A sender encrypts a plaintext with a specific public key, obtaining a ciphertext. Then, the untrusted proxy transforms the ciphertext into a new ciphertext decryptable with the recipient's private key, which is unrelated to the first public key. This operation is performed without the proxy learning anything about the underlying plaintext. It is made possible by a re-encryption key, generated by the sender using the recipient's public key and shared with the proxy.

A Threshold Proxy Re-Encryption (TPRE) scheme adds a layer of complexity [23]. A (*t*, *n*)-threshold scheme can be employed to share a secret among a set of *n* participants, allowing the secret to be reconstructed from any subset of *t* (with *t* ≤ *n*) or more fragments, but from no subset of fewer than *t*. In a network where more than one node keeps secret fragments, a mutual consensus is reached when *t* nodes provide their shares to a secret recipient, enabling the latter to learn the secret. A sender can use this to share the re-encryption key in fragments with a network of proxies; none of them can obtain the whole key without the help of at least *t* − 1 other proxies.
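A classical instance of a (*t*, *n*)-threshold scheme is Shamir's secret sharing, sketched below (this illustrates the threshold property in isolation; it is not the TPRE construction of [23], which applies thresholding to re-encryption key fragments):

```python
import random

P = 2**127 - 1  # a Mersenne prime; all arithmetic is done modulo P

def split(secret: int, t: int, n: int):
    # Degree-(t-1) polynomial with the secret as its constant term;
    # each share is a point (x, f(x)) on the polynomial.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term;
    # any t or more shares determine the polynomial uniquely.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret
```

With fewer than *t* shares, every candidate secret remains equally consistent with the known points, which is what makes the scheme information-theoretically hiding.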

### **3. Related Works**

In this section, we describe the related work based on the different topics addressed in this paper. To the best of our knowledge, no other works have developed a personal data marketplace using the same set of technologies and techniques; thus, we subdivide this section into work related to each part of our proposed solution.

### *3.1. Decentralized Data Marketplace*

The use of DLTs has been proposed for the implementation of data marketplaces in order to obtain the following benefits [43,44]: (i) no need to rely on third-party platforms, (ii) better resilience against network partitioning and single points of failure, and (iii) privacy-preserving mechanisms [45]. Most of the related work investigated data distribution through DLTs, focusing in particular on the use of off-chain storage based on DFS with data links referenced in DLTs [6,35,45]. Data exchange with such technologies can lead to a transparent market, where transactions between data owners and data consumers are recorded on DLTs and where smart contracts enable the self-enforcement of fair exchanges between participants and the automatic resolution of disputes [46]. In [5], the authors provided the implementation of a data marketplace based on the use of DFS for storing data and a payment protocol that exploits Ethereum smart contracts. Similarly, in [17,18], the proposed systems were based on P2P interactions and smart contracts to reach an agreement while also integrating other components such as the IOTA DLT. Lopez and Farooq [47] presented a framework for a Smart Mobility Data Market in which participants shared their data and could transact this information with another participant, as long as both parties reached an agreement. Their work focuses on the protection of individuals' personal information, while maintaining data transparency and user-governed access control. Aiello et al. [35] designed IPPO, an architecture that allows users to generate and share anonymized datasets with service providers on a distributed marketplace, while monitoring the behavior of web services to discourage the most intrusive forms of tracking.

With respect to our work, these proposals build similar architectures but lack insight into decentralized access-control mechanisms and/or decentralized data search.

### *3.2. Decentralized Access Control*

DLTs have desirable features that make them a reliable alternative infrastructure for access-control systems. Their distributed nature solves the single point of failure problem and mitigates the concern for privacy leakage by eliminating third parties. Traditional access-control policies have been combined with DLTs: discretionary (DAC), to manage personal data "off-chain" (i.e., not directly stored in the DLT) through the access-control policy on the blockchain [48]; mandatory (MAC), to constrain the ability of a subject to access a datum through smart contracts [9]; role-based (RBAC), for achieving cross-organizational authentication for user roles [49]; and attribute-based (ABAC), to grant or deny user requests based on the attributes of a user, an object and environment conditions [50]. Among DLT-based access-control mechanisms, Attribute-Based Encryption (ABE) [51] offers the best policy expressiveness without introducing many elements into the system infrastructure. ABE encrypts the data using a set of attributes that form a policy. Only those holding a secret key that meets the policy can decrypt the data. In [51], the authors designed a system using ABE-based access control and smart contracts to grant data access, with a policy mechanism similar to our solution's, while the authors of [52,53] proposed similar frameworks that combine DFS and blockchains to achieve fine-grained ABE-based access control. However, in all three of these cases, the secret attribute keys are issued directly by the data owner in the DLT or by a central authority.

### *3.3. Decentralized Data Search*

With respect to our hypercube DHT contribution, decentralized data search on DLT and DFS is a field that scholars and developers have addressed with only a few efforts. Indeed, one of the open concerns with respect to these novel technologies is implementing data discovery and lookup operations in a decentralized way. The Graph is one of the first protocols (and currently the most used) aiming to provide a "Decentralized Query Protocol" [32]. The Graph network consists of a Layer-2 protocol based on the use of a Service Addressable Network, i.e., a P2P network for locating nodes capable of providing a particular service such as computational work (instead of objects, as in a CAN). In [54], the authors proposed a Layer-1 keyword search scheme that implements oblivious keyword search in DFS. Their protocol is based on a keyword search with authorization for maintaining privacy, with retrieval requests stored as transactions in a blockchain (i.e., Layer-1). Specifically for IPFS [20], in order to overcome the file search limitation, a generic search engine has been developed, namely "ipfs-search" [55]. This solution is rather centralized and does not escape a problem of concentration similar to that of the conventional web. In response to this, a decentralized solution called Siva [56] has been proposed. An inverted index of keywords is built for the contents published on IPFS, and users can search through it; however, Siva is proposed as an enhancement of the IPFS public network DHT and does not feature any optimization of the keyword storage structure apart from the use of caching. Finally, a Layer-2 solution for keyword search in DFS has been proposed in [44], where a combination of a decentralized B+Tree and HashMaps is used to index IPFS objects.

### *3.4. Decentralized Personal Data Management*

The popularity of Internet of Things devices and smartphones and the associated generation of large amounts of data from their sensors [57] have resulted in individuals' interest in the production and consumption of data via a data marketplace [11]. Making data, which are mostly personal, available for access and trade is expected to become a part of the data-driven digital economy [14]. In this context, we find a set of technologies referred to as Personal Information Management Systems, which help individuals reach the vision of Self-Sovereign Identity (SSI). SSI consists of individuals' complete control of their digital identities and personal data through decentralization. SSI has been generically implemented as a set of technological components that are deployed in decentralized environments for the purpose of providing, requesting and obtaining qualified data in order to negotiate and/or execute electronic transactions [16].

The databox, for instance, is a Personal Data Store (PDS) [8,9], i.e., a concept describing a set of storage and access-control technologies that enable users to have direct control of their data. In [11,58], the databox is a platform that provides means for individuals to manage personal data and control access by other parties wishing to use their data, supporting incentives for all parties. A model that puts the concept of SSI into practice is the Solid project [12]. Solid has the purpose of letting users choose where their data reside and who is allowed to access and reuse them. Semantic Web technologies are used to decouple user data from the applications that use them. The storage itself can be conceived in a different manner, while the use of the Semantic Web represents, in our view, the core element that eases data interoperability and favors reasoning over individuals' policies. Semantic Web standards bring structure to the meaningful contents of the Web by promoting common data formats and exchange protocols, such as ontologies. The advantages are that many ontologies are recommended by the World Wide Web Consortium (W3C), and are thus universally understood, and that reasoning with information represented using these data models is facilitated by mapping with a formal language. An example is the Open Digital Rights Language (ODRL) policy expression language, which can be used in conjunction with other standard ontologies to manage access control to personal data in Solid [59]. Another possible approach is to implement policy expression languages as smart contracts, in order to manage control automatically [60].

### **4. Decentralized Personal Data Marketplace Architecture**

In this paper, we are interested in describing the fundamentals of a decentralized personal data marketplace: (i) data marketplace, because we intend to provide a system that enables data owners to benefit from the sharing of the data they own; the benefits can be purely economical but also linked to participation in an ecosystem, e.g., sharing data for social good and research; on the other hand, we intend to provide easier data access to data consumers, especially the ones who do not have the resources to compete with Big Tech companies; (ii) personal data, because we specifically focus on the type of data that is generated by individuals through their personal devices; thus, we assume that the role of data owner in the system is going to be engaged by individuals themselves or by some other entities on their behalf, with a strong emphasis on the concept of Self-Sovereign Identity [16]; and (iii) decentralized, because we make use of several decentralized systems that help to more easily achieve disintermediation in the process of transacting data.

In this section, we devise the marketplace architecture through a description of four pillar systems and their interactions. As shown in Figure 1, the different architectural components can be organized into four layers:


**Figure 1.** Decentralized data marketplace architecture.

With respect to this architecture, our aim in the following subsections is twofold: (i) to describe in detail the hypercube DHT system and (ii) to describe the interaction between all the architectural components. More specifically, we do not go into the details of all the possible configurations of the DFS, DLT and smart contract layers, as the discussion would become too scattered and stray from the issues related to the decentralized personal data market.

### *4.1. DFS-Based Personal Data Store*

Data generated by personal devices or third-party systems on behalf of individuals are often private in nature, but incentivizing their sharing (as opposed to keeping them locked in data silos) can be beneficial in terms of economic gain and social good. However, the main challenge is often to provide access under certain conditions that data subjects find acceptable and compliant with regulations (e.g., GDPR).

A technological solution opposed to centralized data silos consists of the use of DFS for storing such personal data. DFS are usually built on top of a P2P network that is freely accessible and where nodes execute the same protocol to store and retrieve data. Moreover, at the heart of such systems we often find data replication protocols that enable high data availability. All this means that data owners holding some data on their devices can easily participate in the DFS network or reach a DFS node to store and replicate data. This, in turn, makes data owners confident that their data can be retrieved by any data provider that, in turn, can participate in the network or contact a DFS node. However, to be on the safe side, data owners should incentivize DFS nodes to store and replicate their data. How to do this is beyond the scope of this paper, and we refer the reader to our previous work that investigates this topic [9].

DFS protocols often build in the identification of data through immutable universal identifiers that directly represent their content, in order to uniquely identify the contents disseminated in the network. An implementation of this feature is the use of the hash digest of a piece of data as a deterministically derived identifier. Thus, any node of the network holding the same piece of data, i.e., with the exact same content, can use its hash to derive its immutable universal identifier. Any other node in the network can use this identifier to retrieve the piece of data from other nodes and to verify its integrity through the hash.

Finally, since data can be easily replicated in the P2P network and, thus, can be easily accessed by nodes that the data owner might not be aware of, we resort to the use of encryption as a data protection mechanism. Such a mechanism is required both by the privacy needs of data owners and, specifically, by compliance with personal data regulations. Strong, state-of-the-art cryptographic algorithms help avoid the re-identification of such pseudonymous data, i.e., encrypted personal data, when shared in the DFS network [61].

### *4.2. Smart Contract-Based Distributed Access Control*

Smart contracts are the part of the proposed architecture where the access-control logic for sharing encrypted personal data is executed. Through dedicated smart contracts, access to data can be purchased or enabled directly by the owner. Access is authorized only to consumers indicated by the policies of a data owner's contract. One such policy is for the smart contract to maintain an Access Control List (ACL) that represents the rights to access one or more pieces of data. In the rest of the paper, we focus on the application of this policy.
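The ACL-holding contract can be modeled minimally as follows (an in-memory Python sketch; method names and the string-based keys are illustrative assumptions, not the API of any specific smart contract platform):

```python
class AccessControlContract:
    """Minimal model of the ACL-holding smart contract: the data owner
    grants or revokes access rights; anyone can check them."""

    def __init__(self, owner: str):
        self.owner = owner
        self.acl = set()  # public keys of authorized data consumers

    def grant(self, caller: str, consumer_pk: str):
        # Only the data owner may modify the ACL.
        if caller != self.owner:
            raise PermissionError("only the data owner can modify the ACL")
        self.acl.add(consumer_pk)

    def revoke(self, caller: str, consumer_pk: str):
        if caller != self.owner:
            raise PermissionError("only the data owner can modify the ACL")
        self.acl.discard(consumer_pk)

    def is_authorized(self, consumer_pk: str) -> bool:
        # Read-only check used by the authorization blockchain nodes.
        return consumer_pk in self.acl
```

On an actual ledger, the caller check would be enforced by the platform's transaction-signing mechanism rather than by a plain parameter.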

According to our solution, the nodes of a network that maintain a permissioned blockchain are responsible for enforcing the access rights specified in the ACLs of smart contracts. We take advantage of the high degree of trust that a blockchain provides for the data written in the ledger and, then, focus on the trust given to the nodes of this "authorization" blockchain, which must read from the ledger and follow the correct policy. If a data consumer is listed in the ACL, then this consumer is eligible to access certain data and, consequently, to obtain the key used for encrypting the data in the DFS. Authorization blockchain nodes rely on ACLs to make sure that only a data consumer entitled to this information can obtain such a key. For the encryption operation, we refer to a hybrid cryptographic scheme, making use of both asymmetric and symmetric keys. Generally, each piece of data is encrypted using a symmetric "content" key *k*, and then this key is encrypted using an asymmetric keypair (*pkKEM*, *skKEM*). This constitutes a Key Encapsulation Mechanism (KEM) [62], in which the key is encapsulated and the capsule is distributed, instead of distributing the encrypted data.
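The hybrid flow can be sketched as follows. This is a toy model: the keystream cipher stands in for a real authenticated cipher (e.g., AES-GCM), and the asymmetric encapsulation under (*pkKEM*, *skKEM*) is replaced by a symmetric placeholder so that the block stays self-contained:

```python
import hashlib, os

def keystream(key: bytes, n: int) -> bytes:
    # Toy hash-based keystream; a real system would use an
    # authenticated cipher such as AES-GCM or ChaCha20-Poly1305.
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # XOR stream cipher: the same call encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# Hybrid flow: the data is encrypted under a fresh content key k,
# and k itself is encapsulated into a capsule.
k = os.urandom(32)                        # symmetric content key
ciphertext = xor_cipher(k, b"personal data record")
kem_stand_in = os.urandom(32)             # placeholder for the (pk_KEM, sk_KEM) step
capsule = xor_cipher(kem_stand_in, k)     # the capsule, not the data, is distributed
```

The point of the KEM is that only the small capsule needs to be shared with (and re-encrypted for) each authorized consumer, while the bulk ciphertext stays untouched in the DFS.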

#### 4.2.1. Access Mechanism

To ensure complete protection of the individual's data, only the authorized recipient of personal data should obtain the key, and the nodes of the authorization blockchain should not be able to exploit it. For this reason, we make use of a (*t*, *n*)-threshold scheme to share the capsule that contains the content keys among the blockchain nodes. In particular, the Threshold Proxy Re-Encryption (TPRE) scheme is employed.


The access mechanism is as follows: (i) the public key of the data consumer *pkDC* is listed in the ACL (provided in detail in Section 5); (ii) the data consumer requests the release of a cfrag to at least *t* authorization blockchain nodes using a message signed with *skDC*; (iii) upon consumer request, each node checks if the signatory *pkDC* is in the ACL through an interaction with the smart contract in the blockchain; (iv) if this is the case, then each node releases the cfrag; (v) once the data consumer obtains *t* cfrags, the capsule can be reconstructed and decrypted with *skDC*; and (vi) the decryption reveals the content key *k* needed to decrypt the desired piece of data stored in the DFS.
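Steps (i)–(iv) of this mechanism can be sketched as follows (a toy model: signature verification and the final reconstruction/decryption of steps (v)–(vi) are abstracted away, and the node count, threshold and key names are illustrative):

```python
class AuthorizationNode:
    """One node of the authorization blockchain, holding a capsule
    fragment (a stand-in for a TPRE cfrag)."""

    def __init__(self, acl, fragment):
        self.acl = acl            # the ACL as read from the smart contract
        self.fragment = fragment

    def request_fragment(self, consumer_pk: str, signature_ok: bool):
        # Steps (ii)-(iv): verify the signed request, check the ACL,
        # and only then release this node's cfrag.
        if not signature_ok:      # stand-in for verifying a sk_DC signature
            raise PermissionError("invalid signature")
        if consumer_pk not in self.acl:
            raise PermissionError("consumer not in ACL")
        return self.fragment

# A consumer listed in the ACL collects fragments from t of the n nodes;
# capsule reconstruction and decryption (steps (v)-(vi)) are left abstract.
acl = {"pk_DC"}
nodes = [AuthorizationNode(acl, f"cfrag-{i}") for i in range(5)]
t = 3
cfrags = [n.request_fragment("pk_DC", True) for n in nodes[:t]]
```

A consumer not listed in the ACL (or presenting a bad signature) obtains no fragment from any honest node, so fewer than *t* cfrags can ever be gathered.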

### *4.3. DLT Indexing and Validation*

One of the main use cases of DLTs is data sharing, due to their intrinsic tamper-resistance. Once collected, in many cases, data can be stored directly on-chain, in a DLT, to validate their integrity. However, avoiding on-chain storage is a preferable solution, not only for retaining high data-read availability and better data-write performance [9] but also because on-chain personal data are generally incompatible with data protection requirements (e.g., the guarantee of personal data deletion for a data subject). Thus, our solution consists of storing personal data in a DFS and referencing them in a DLT via their immutable universal identifiers, e.g., hash pointers. Moreover, due to the nature of some DLTs, related pieces of data can already be linked and indexed in the ledger. That is the case of the IOTA DLT, which manages the upload of data in the form of a stream channel thanks to the Streams protocol. We refer to this DLT and this protocol to ease the description of the following parts.
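The off-chain-storage-with-on-chain-pointer pattern can be sketched as follows (the dictionary and list are stand-ins for the DFS and the ledger, respectively; the sketch only illustrates the integrity-validation idea):

```python
import hashlib

dfs = {}      # DFS stand-in: identifier -> data (off-chain)
ledger = []   # DLT stand-in: append-only list of hash pointers

def store(data: bytes) -> str:
    cid = hashlib.sha256(data).hexdigest()
    dfs[cid] = data      # the data itself stays off-chain
    ledger.append(cid)   # only the immutable pointer is written on-chain
    return cid

def verify(cid: str) -> bool:
    # Integrity validation: recompute the digest of the retrieved data
    # and compare it with the on-chain pointer.
    return hashlib.sha256(dfs[cid]).hexdigest() == cid
```

Deleting the off-chain copy satisfies a data-deletion request while the on-chain pointer, which reveals nothing about the content, remains in the ledger.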
