Article

Security and Ownership in User-Defined Data Meshes

by Michalis Pingos 1,*, Panayiotis Christodoulou 2 and Andreas S. Andreou 1,*
1 Faculty of Engineering and Technology, Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, Limassol 3036, Cyprus
2 School of Business, Department of Digital Innovation, Institute for the Future, University of Nicosia, Nicosia 2417, Cyprus
* Authors to whom correspondence should be addressed.
Algorithms 2024, 17(4), 169; https://doi.org/10.3390/a17040169
Submission received: 2 March 2024 / Revised: 6 April 2024 / Accepted: 17 April 2024 / Published: 22 April 2024
(This article belongs to the Special Issue Hybrid Intelligent Algorithms)

Abstract
Data meshes are an approach to data architecture and organization that treats data as a product and decentralizes data ownership and access. The field has emerged recently and presents several open challenges related to data ownership, governance, security, monitoring, and observability. To address these challenges, this paper introduces an innovative algorithmic framework leveraging data blueprints to enable the dynamic creation of data meshes and data products in response to user requests, ensuring that stakeholders have access to specific portions of the data mesh as needed. Ownership and governance concerns are addressed through a unique mechanism involving Blockchain and Non-Fungible Tokens (NFTs). This facilitates the secure and transparent transfer of data ownership, with the ability to mint time-based NFTs. By combining these advancements with the fundamental tenets of data meshes, this research offers a comprehensive solution to the challenges surrounding data ownership and governance. It empowers stakeholders to navigate the complexities of data management within a decentralized architecture, ensuring a secure, efficient, and user-centric approach to data utilization. The proposed framework is demonstrated using real-world data from a poultry meat production factory.

1. Introduction

Nowadays, big data are often characterized as the “new oil” and recognized as a valuable human asset. The effective aggregation and analysis of these data may unearth information that provides insights into numerous facets of everyday activities and offers the ability to anticipate future occurrences. Big data refer to the substantial volumes of digital information consistently produced by machines and the global population from diverse sources such as social media, Internet of Things (IoT) devices, machine and sensor logs, public records and open data, online transactions, websites and applications, research and scientific instruments, etc. [1]. The vast majority of big data originate from heterogeneous data sources, yielding a variety of data types that include structured, unstructured, and semi-structured data. Encompassing a diverse range of content, big data span from textual information to multimedia elements, such as images, videos, and audio [2].
The three primary characteristics (3 Vs) of big data, as presented by Doug Laney in 2001, form and define its fundamental framework [3]. Firstly, Volume represents the sheer amount of data generated from data sources, often reaching levels that challenge typical data processing methods. The second characteristic, Velocity, denotes the speed at which data are created, processed, and made available for analysis. Fast processing speeds are required to keep up with the increasing rate of data creation driven by real-time data sources like social media and sensors. Thirdly, Variety highlights the diversity of data types, encompassing structured, unstructured, and semi-structured information. By integrating a broad range of textual, visual, and audio information, this inclusivity recognizes that big data extend beyond traditional databases. Taken together, these three qualities create the foundation for realizing and capitalizing on the possibilities of big data in a data-driven modern world. In addition, seven more characteristics were added to this list after 2001, thus creating the 10 Vs as terms related to big data. The new properties are Value, Veracity, Volatility, Validity, Vulnerability, Variability, and Visualization [4], and they offer additional descriptive assets of big data.
In the pre-big data era, storage designs were mostly based on file systems and conventional relational databases. Relational databases with clear schemas, like MySQL and Oracle, were great at handling structured data. Many organizations used file-based storage systems, such as Network Attached Storage (NAS) and Storage Area Network (SAN), to store documents and other kinds of files. During the same period, the conventional approach to address escalating data requirements involved vertical scaling, which entailed augmenting resources on a single server [5]. The era of big data, characterized by immense data volumes, rapid data transfer rates, and a diversity of weakly structured data from numerous heterogeneous sources, as also reflected in the 10 Vs, brought about a fundamental transformation of storage architectures. NoSQL databases, such as MongoDB and Cassandra, as well as distributed storage systems like the Hadoop Distributed File System (HDFS), have now become more popular [6].
The complex interactions amongst data lakes, data meshes, and data markets in the big data era have created a dynamic ecosystem that transforms how businesses manage and extract value from heterogeneous data sources and big data [7]. The data mesh, first introduced by Zhamak Dehghani in 2019, is an innovative data management framework that diverges from the conventional data lake storage architecture, and data markets build directly on it. These storage architectures and structures can be deployed using storage and processing technologies, such as Apache Hadoop and Apache Spark, or cloud-based solutions such as Amazon S3, Azure Data Lake Storage, or Google Cloud Storage [8]. While these frameworks are linked with big data processing, the primary unsolved problems revolve around security, encompassing issues related to privacy, regulatory requirements, and access control. Notably, weaknesses in metadata management pose challenges, as data in lakes or meshes can be replaced without proper management of their contents [9].
The primary research contribution of this paper lies in the introduction of an innovative framework that leverages semantic data blueprints (SDB) [10] for the dynamic assembly of data meshes and data products responding to user demands on the one hand, and for ensuring that stakeholders access specific areas of the data mesh as needed via transfer of ownership on the other. The integration of Non-Fungible Tokens (NFTs) and Blockchain technology establishes a novel approach to addressing data ownership and governance concerns. The core of this framework is a dedicated algorithm that executes specific steps to facilitate secure and transparent data ownership transfers by combining the ability to mint time-based NFTs with extended functionality.
The proposed approach expands earlier research on the subject that proposed SDB, a semantic metadata enrichment technique for data lakes that enables the effective storing and retrieval of data from distributed and heterogeneous data sources, as well as ensuring security in data lakes using Blockchain technology and NFTs [10,11,12]. The same concepts are employed in this study, but this time they are aligned with the characteristics of data meshes, ensuring security and ownership through the integration of Blockchain and NFT technology and thereby paving the way for the development of data markets. In this context, a data mesh is viewed as the evolution of a data lake: it manages massive amounts of data (big data) expressed in a variety of formats (structured, unstructured, and semi-structured) and, most crucially, makes these data simple, rapid, and efficient to trace. Users and their preferences or needs shape the form and variety of data products synthesized within the data mesh. Users initiate a series of functions within the system, including data retrieval from a data lake, data product construction, and ownership transfer. Through this iterative process, the framework generates a diverse array of outputs, encompassing both traditional data products and NFTs, which underscores the collaborative nature of data management and emphasizes its adaptability to the evolving needs and preferences of its users.
Real-world manufacturing data from Paradisiotis Group (PARG), a significant local industrial player in Cyprus, are used to illustrate the proposed approach. PARG is one of the most significant companies in Cyprus in the field of poultry farming and the production and trading of poultry meat, employing experts across these areas. It provides a large assortment of food products that are delivered to local supermarkets. The operational procedures and production data of the factory are treated as confidential for privacy and security reasons. Consequently, this study uses a masked and de-identified rendition of the data and only presents a portion of the processes, providing limited but specific details. However, the case study reported in the present paper successfully illustrates the fundamental ideas of the proposed framework, confirming its applicability and effectiveness.
The remainder of the paper is structured as follows: Section 2 discusses the technical background and related research in the areas of data lakes and data meshes. Section 3 outlines previous research performed on the semantic enrichment technique, which is adopted and extended in this work to address security and ownership aspects. Section 4 presents the extended data meshes framework and discusses its main components. This is followed by demonstrating the applicability and assessing the performance of the proposed framework in Section 5 through a case study conducted using real-world data collected at PARG. Finally, Section 6 concludes the paper and highlights future research directions.

2. Technical Background

2.1. Understanding Data Lakes and Data Meshes

A data lake (DL) is a centralized architecture designed to store vast amounts of structured, unstructured, and semi-structured data at any scale. Unlike traditional databases or data warehouses that require data to be structured before storage, a DL can host raw data in its native format [13]. This means that data from various sources such as logs, clickstreams, social media, videos, and sensors can be stored without the need for pre-defined schemas. DLs offer storage flexibility, allowing data to be stored in raw form without defining a schema upfront. This feature enables the accommodation of various data types and formats from diverse sources at any production frequency. DLs are highly scalable and capable of handling very large datasets, making them ideal for big data applications. They often provide cost-effective storage options by leveraging cloud object storage, resulting in more economical solutions compared to traditional data warehouses.
DLs seamlessly integrate with tools and technologies that enable the stored data to be processed, queried, and analyzed. Properly configured DLs can implement security measures and data governance policies to ensure privacy and compliance with regulations. While DLs offer a high degree of flexibility, they require careful management to prevent them from becoming data swamps, that is, hosting places where data are poorly organized, difficult to find, and hard to analyze [9]. To address this concern, practices like metadata management, data cataloguing, and the establishment of data governance policies are crucial. Figure 1 presents the structure of a DL and an algorithmic description of how the DL concept works in practice, from collecting data and annotating them with metadata to storing and finally retrieving them based on metadata tags.
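To make this algorithmic description concrete, the following minimal Python sketch (our own toy illustration, not a production system; all names are hypothetical) stores payloads in their native form and retrieves them through metadata tags:

# Toy data lake: raw payloads are kept in their native form, while a
# metadata catalogue of tags makes them discoverable and retrievable.
from typing import Any

class ToyDataLake:
    def __init__(self) -> None:
        self.storage: dict[int, Any] = {}          # raw data, any format
        self.catalogue: dict[int, set[str]] = {}   # metadata tags per item
        self._next_id = 0

    def ingest(self, payload: Any, tags: set[str]) -> int:
        # Store the payload as-is (no upfront schema) and record its tags.
        item_id = self._next_id
        self._next_id += 1
        self.storage[item_id] = payload
        self.catalogue[item_id] = tags
        return item_id

    def retrieve(self, tag: str) -> list[Any]:
        # Return every payload annotated with the requested tag.
        return [self.storage[i] for i, t in self.catalogue.items() if tag in t]

lake = ToyDataLake()
lake.ingest({"sensor": "temperature", "value": 21.5}, {"structured", "farm"})
lake.ingest(b"raw-image-bytes", {"unstructured", "image", "farm"})
print(lake.retrieve("farm"))   # both items are found via their metadata tags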
The concept of a data mesh (DM), as mentioned in the previous section, was introduced in 2019 [14] and essentially represents a novel approach to data management within large organizations. Unlike traditional methods, a DM emphasizes several key concepts to revolutionize data handling. Firstly, it advocates for domain-oriented ownership. This means that data domains are entrusted to the teams or business units possessing the highest expertise in each specific domain. These teams bear the responsibility for ensuring the quality, accessibility, and privacy of their respective domain’s data. Additionally, a DM promotes the idea of decentralized data products. Here, data are treated as a product, and each domain team is accountable for the entire data lifecycle within their domain. This encompasses tasks such as production, consumption, quality assurance, privacy measures, and comprehensive documentation. Furthermore, DMs advocate for federated computational governance, an approach where each domain team defines and enforces the computational logic specific to that domain. This logic is then executed within the broader context of the mesh [15].
To facilitate autonomy and efficiency, a DM incorporates a self-serve data infrastructure. This infrastructure is designed to empower domain teams with the necessary tools and resources to independently manage their data products, reducing the reliance on centralized data engineering teams. Embracing an API-first approach, a DM encourages the utilization of Application Programming Interfaces (APIs) for seamless data exchange and communication between different components of the system. This promotes loose coupling and flexibility in how data are consumed and utilized. Figure 2 presents the structure of a DM and the algorithm that serves as the core of its operation.
Furthermore, DMs emphasize a holistic view of the data product lifecycle. This encompasses stages such as discovery, ingestion, processing, storage, access, and consumption. Each of these stages is carefully considered and managed by the respective domain teams, ensuring a comprehensive and efficient data handling process. By adopting a DM approach, organizations aim to address the challenges of scaling data operations in a complex environment, where multiple teams work on diverse data domains. It provides a framework for decentralizing data ownership and enabling more effective, scalable, and resilient data operations.
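As a hedged sketch of these principles (a toy registry under hypothetical names, not a reference implementation), domain-oriented ownership and data products can be modelled as follows:

# Toy data mesh registry: each domain team owns, publishes, and serves
# its own data products; consumers discover products by domain.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class DataProduct:
    name: str
    domain: str                 # owning data domain
    owner_team: str             # team accountable for the product lifecycle
    serve: Callable[[], Any]    # API-style access point exposed by the team

@dataclass
class ToyDataMesh:
    products: dict[str, DataProduct] = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        # Federated governance: the owning team registers its own products.
        self.products[product.name] = product

    def discover(self, domain: str) -> list[str]:
        return [p.name for p in self.products.values() if p.domain == domain]

mesh = ToyDataMesh()
mesh.publish(DataProduct("daily-feed-metrics", "livestock-records", "farm-team",
                         serve=lambda: [{"farm": "Limassol", "feed_kg": 1240}]))
print(mesh.discover("livestock-records"))            # ['daily-feed-metrics']
print(mesh.products["daily-feed-metrics"].serve())   # the product's data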
In turn, a data market is an ecosystem or marketplace where individuals, companies, or systems can buy, sell, or exchange data by leveraging the idea of DMs. Data suppliers in a data market offer datasets for purchase or access by data consumers for a range of applications, such as analysis, research, machine learning, and more [7]. Data markets facilitate the efficient sharing and monetization of data, allowing businesses to leverage external sources of information to enhance their insights and decision-making processes.
Using a large manufacturing company in the field of poultry farming and poultry meat trading as our case study and example demonstrator, we were able to identify various operational areas, including livestock records, agricultural data, supply chain management information, financial transactions, and trading analytics. Each operational area is assigned to a specialized team responsible for its monitoring and upkeep. Moreover, each team is tasked with generating specific data products tailored to their respective domains, such as APIs for accessing data, algorithms for analyzing trading trends, tools for secure data sharing, and reporting mechanisms for financial analytics. This makes the proposed approach an ideal way of sharing portions of data across authorized groups (e.g., departments) or individuals. Adopting a federated computational governance approach ensures that each team defines and enforces the computational logic for their specific domain, facilitating the implementation of specialized algorithms and quality checks. Additionally, each team has access to a self-serve data infrastructure, equipped with tools and resources for managing their data products independently, thereby ensuring autonomy and operational efficiency. To enhance interoperability within the field, the implementation of APIs and adherence to industry standards are prioritized, allowing seamless communication and data exchange between different operational areas. This approach contributes to the optimization of data management and the creation of tailored products, ultimately benefiting stakeholders in poultry farming and trading, including producers, traders, and administrators.
It is important to note that while DMs are more about organizational and conceptual principles for data management, DLs refer specifically to the technology and infrastructure for storing large volumes of raw data. These concepts are not mutually exclusive, and, in practice, organizations can implement a DM framework while utilizing a DL as the underlying basic component of their technical infrastructure for data storage and processing.

2.2. Understanding Blockchain and NFTs

Blockchain serves the purpose of providing a secure and transparent means for recording and transferring data. Notably, it addresses privacy concerns by anonymizing personal data, contributing to its increasing popularity and integration into infrastructure, opening avenues for innovative applications [16]. Functioning as a decentralized database on a peer-to-peer network, Blockchain establishes a distributed communication network enabling non-trusting nodes to interact without relying on a central authority. Its protocols ensure a verifiable and trustworthy system, offering traceability, transparency, and enhanced privacy and security features. In essence, Blockchain is evolving into a fundamental technology with wide-ranging applications and use cases such as IoT, smart contracts, NFTs, cybersecurity and cryptocurrency, providing a foundation for secure and trustworthy data transactions [17].
Algorithmically, Blockchain includes a number of essential elements, procedures, and guidelines to create a strong and feature-rich decentralized system. Initializing basic elements, such as a consensus mechanism and cryptographic algorithms for secure key management and hashing, is the first step in the process. Implementing token and smart contract standards like ERC-20 and ERC-721 increases functionality by managing the creation, transfer, and ownership of assets [18]. Using mechanisms such as zero-knowledge proofs, the method smoothly incorporates decentralized identifier (DID) standards to guarantee secure identification and privacy, offering strong user data protection. The Interledger Protocol and other interoperability standards also make cross-chain communication easier [19]. The integration of decentralized storage protocols, such as IPFS, ensures file storage that is dispersed and resistant to censorship. Security measures provide protection against vulnerabilities, compliance standards guarantee conformity to legal requirements, and governance standards support secure and efficient decision-making. This all-encompassing strategy creates a conceptual framework for building a Blockchain that integrates fundamental criteria, promoting a safe, interoperable, and privacy-respecting decentralized ecosystem.
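At the heart of these elements lies the hash-linking of records. The minimal Python sketch below (illustrative only; no consensus, networking, or signatures) shows how committing each block to its predecessor makes any tampering detectable:

# Minimal hash-linked chain: each block commits to its predecessor,
# so altering any historical record invalidates all later hashes.
import hashlib, json

def block_hash(block: dict) -> str:
    payload = json.dumps({k: block[k] for k in ("index", "data", "prev_hash")},
                         sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, data: str) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)

def verify(chain: list) -> bool:
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False                      # block contents were altered
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False                      # link to the predecessor is broken
    return True

chain: list = []
append_block(chain, "mint token #0")
append_block(chain, "transfer token #0")
print(verify(chain))            # True
chain[0]["data"] = "tampered"
print(verify(chain))            # False: the chain detects the modification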
Non-Fungible Tokens (NFTs) are a ground-breaking innovation in the ownership and management of digital assets. Because every NFT is distinct and has a unique identifier, it cannot be copied or exchanged on a like-for-like basis; Blockchain technology is used to accomplish this uniqueness. NFTs are used to verify ownership of a wide range of digital and physical goods [20], including digital art, music videos, real estate, gaming avatars, etc. NFTs are also crucial to Web 3.0, the next iteration of the Internet promoted by many companies and analysts. Blockchain’s decentralized structure guarantees the integrity and transparency of ownership data, and smart contracts streamline transactions by automating tasks like ownership transfers and royalty distribution.
Finally, the NFT process algorithm starts with the digital asset being initialized, having its nature defined, and being given a unique identifier. The implementation of a smart contract that oversees the NFT requires integration with a Blockchain platform, such as Ethereum, via the ERC-721 standard [21]. An NFT is created during the minting process by adding ownership information and other pertinent metadata to the smart contract. Smart contract updates enable ownership transfers, guaranteeing safe and transparent transactions documented on the Blockchain. The NFT ecosystem is made more efficient by automating features in the smart contract, such as the distribution of royalties upon resale. NFTs are posted on NFT marketplaces such as OpenSea (New York, NY, USA) or Rarible (Wilmington, NC, USA), where buyers and sellers can transact, making the tokens more widely available [20]. Verifying the integrity of the associated metadata and examining ownership records on the Blockchain are the two steps in the process of authenticating NFTs. The foundation of the NFT lifecycle is the aforementioned algorithmic procedure, which provides a methodical way to create, transfer, and confirm ownership of distinct digital assets on the Blockchain. The whole process is graphically depicted in Figure 3.
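This lifecycle can be mirrored by a toy registry (a hedged Python sketch with hypothetical names, not an ERC-721 implementation): minting records an owner and a digest of the metadata, transfers update the ownership record, and authentication re-checks both:

# Toy NFT registry mirroring the lifecycle above: mint with metadata,
# transfer ownership, and authenticate by re-checking the metadata digest.
import hashlib, json

class ToyNFTRegistry:
    def __init__(self) -> None:
        self.owners: dict[int, str] = {}
        self.metadata_digest: dict[int, str] = {}
        self._next_id = 0

    @staticmethod
    def _digest(metadata: dict) -> str:
        return hashlib.sha256(json.dumps(metadata, sort_keys=True).encode()).hexdigest()

    def mint(self, to: str, metadata: dict) -> int:
        token_id = self._next_id          # unique identifier per token
        self._next_id += 1
        self.owners[token_id] = to
        self.metadata_digest[token_id] = self._digest(metadata)
        return token_id

    def transfer(self, token_id: int, sender: str, to: str) -> None:
        assert self.owners[token_id] == sender, "only the owner may transfer"
        self.owners[token_id] = to        # ownership update recorded in the ledger

    def authenticate(self, token_id: int, claimed_owner: str, metadata: dict) -> bool:
        # Two checks: the ownership record and the integrity of the metadata.
        return (self.owners.get(token_id) == claimed_owner
                and self.metadata_digest.get(token_id) == self._digest(metadata))

registry = ToyNFTRegistry()
art = {"title": "Farm panorama", "uri": "ipfs://example"}
t = registry.mint("0xAlice", art)
registry.transfer(t, "0xAlice", "0xBob")
print(registry.authenticate(t, "0xBob", art))   # True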

3. Related Research

The combination of DLs, DMs, and Blockchain-based technologies—specifically, NFTs—in the field of modern data management creates a dynamic synergy that is changing how businesses handle data ownership, accessibility, and storage. DLs function as large storage spaces for heterogeneous data, promoting a single repository that can handle a variety of data types. In addition, the DM paradigm supports distributed data processing and domain-oriented ownership using decentralized data architectures. A new dimension is brought to data ownership and authenticity by the integration of NFTs on Blockchain platforms, which offer a safe and verifiable framework for identifying the provenance and ownership of individual pieces of data. The integration of Blockchain, DLs, and DMs improves the scalability and flexibility of data ecosystems and lays the groundwork for more open, safe, and cooperative data management procedures, as related work reveals.
A dedicated DL architecture was used in [13] to investigate how Blockchain technology might be integrated to improve the scheme’s metadata management. It specifically presented the use of NFTs that are stored on the Blockchain to represent metadata for every data source. The intention was to use Blockchain technology to improve the DL’s semantic metadata, which could lead to better data management, organization, and retrieval. Furthermore, ref. [11] addressed the challenges associated with the smart processing of big data in the context of DLs, as well as ownership and security, using Blockchain and NFT technologies. It emphasized the need for a disciplined approach to manage diverse data sources within DLs for predictive and prescriptive analytics. That paper introduced a novel standardization framework that integrates the 5 Vs of big data characteristics and blueprint ontologies. The framework utilized a pond architecture to organize DLs and incorporated a metadata semantic enrichment mechanism for efficient storage and retrieval. Notably, the mechanism supported visual querying and enhanced security through Blockchain and NFTs. The authors also provided a comparative analysis with other metadata systems, demonstrating promising results based on a set of functional properties.
An enhanced DL metadata framework called DLMetaChain was introduced in [12], which can manage data from diverse sources like IoT data using Blockchain. The paper discussed the changing IoT ecosystem, where a variety of sources produce large amounts of data that are then converted into useful information. Metadata management becomes difficult when storing such data, including IoT data, in repositories such as DLs, especially when it comes to security and access control. The principal aim was to design an architecture that utilizes Blockchain technology to guarantee the data integrity of the DL by impeding any unsanctioned changes or additions.
A visionary approach to establish a distributed federated medical DL and ecosystem was proposed in [22], involving hospitals and personal health data from wearable medical devices. It emphasized the creation of a Blockchain-based platform with commercial incentives, addressing data ownership, patient privacy, and controlled access. The platform facilitated owner-centric medical data exchange, securely aggregated data from various hospitals, and unlocked academic and business value by representing medical data as NFTs. The primary goal was to improve healthcare research while fostering a sustainable medical data ecosystem.
Our framework builds upon existing research by integrating privacy engineering principles into decentralized data architectures [23], particularly focusing on DM. Unlike previous work, we propose a comprehensive algorithmic framework utilizing data blueprints to enable dynamic creation and governance of DMs and data products. Leveraging Blockchain and NFTs, our approach addresses privacy concerns, ownership, governance, and transparency. Additionally, we provide a practical implementation of data blueprinting, ensuring robust privacy practices and secure data ownership transfer.
While the study in [24] provides insights from industry interviews into DM adoption, our research goes further by proposing a concrete algorithmic framework to tackle these challenges. Using Blockchain and NFTs, our approach not only addresses governance and ownership issues, but also facilitates tailored DM creation and management. Furthermore, a real-world data demonstration validates the effectiveness and practical applicability of our framework.
In contrast to [25], which provides a philosophical reflection on data ownership, the present study translates ethical considerations into actionable mechanisms. By integrating Blockchain and NFTs, the proposed approach promotes transparent and equitable data ownership within decentralized architectures, bridging ethical reflections with technological innovations to realize the redistributive and socio-cultural dimensions of data ownership.
The importance of stakeholder buy-in is highlighted in [26]. The present study provides a compelling narrative and a concrete technological framework, which, by aligning technical excellence with strategic business value, serves as a catalyst for organizational transformation towards a user-centric, decentralized data architecture.
Finally, in order to manage data at scale, ref. [27] investigated how a Blockchain-powered metadata catalogue might be integrated into a DM architecture. The metadata catalogue improved governance, efficiency, access, and discovery. The catalogue managed metadata across a dispersed network of data domains with federated governance, immutability, and transparency due to the use of Blockchain technology. A proof-of-concept solution utilizing Hyperledger Fabric was presented, with advantages including increased reliability, efficiency, and transparency. It also discussed and suggested possible solutions for issues including governance, scalability, and interoperability.

4. A Framework for Supporting the Transfer of Ownership in Data Meshes

This section describes the proposed framework for transferring the ownership of data products residing in DMs. The framework follows a series of algorithmic steps that include the creation of the DM through its transformation from a DL that bears a specific architectural structure, and the development of the appropriate smart contracts, the execution of which facilitates the transfer and proves the ownership of a specific data product. In our case, the fluidity of information in a DM depends on the type and diversity of the data stored in the DL from which data products are created.

4.1. Semantically Enriched Creation of Data Lake Architecture and Data Mesh Products

A metadata mechanism is of paramount importance for a DL as it functions as its organizational backbone, offering a systematic and detailed catalog of the diverse datasets hosted within the DL. Without such a metadata mechanism, a DL will gradually be transformed into a data swamp. In essence, a metadata mechanism provides data owners with a vital insight into the type and context of the stored information by capturing important details about the origin, structure, relationships, and usage of data. By providing this information, navigating and mapping raw data become feasible, making data searching, retrieval, and management easier and more efficient.
The SDB is a metadata enrichment mechanism that identifies and characterizes a candidate source before it becomes a member of a DL [10]. The framework described in [10] integrates blueprint ontologies with the 5 Vs big data features, namely Volume, Velocity, Variety, Veracity, and Value, to support data processing (storage and retrieval) in DLs organized in a pond architecture. The latter structures a DL in several distinct data ponds, each of which holds or refers to a certain type of data according to the pond design. Depending on the type of data (structured, semi-structured, unstructured), each pond has a unique data processing and storage method. When extracting data from the DL, this built-in pond architecture is quite useful as it supports quick and easy access to the storage space.
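A minimal sketch of this routing idea (our own toy illustration, with hypothetical pond names, not the framework's implementation) dispatches each payload to the pond matching its variety:

# Pond routing sketch: each pond holds one variety of data, so retrieval
# can go straight to the pond matching the requested type.
PONDS: dict[str, list] = {"structured": [], "semi-structured": [], "unstructured": []}

def route_to_pond(payload, variety: str) -> None:
    if variety not in PONDS:
        raise ValueError(f"unknown variety: {variety}")
    PONDS[variety].append(payload)

route_to_pond({"co2_ppm": 412}, "structured")
route_to_pond("<xml>feed log</xml>", "semi-structured")
route_to_pond(b"raw-image-bytes", "unstructured")
print({k: len(v) for k, v in PONDS.items()})   # one item in each pond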
As previously mentioned, a dedicated blueprint is developed to describe each data source that stores data in the DL. Specifically, the blueprint of a source consists of two interconnected blueprints, as shown in Figure 4, the stable and the dynamic blueprint [10].
The former is static and describes the name and type of the source, the type of data it produces, as well as the value, velocity, variety, and veracity of the data source pushed into the DL. The latter is a dynamic blueprint that involves attributes that are not stable over time and essentially characterizes volatile properties such as the volume of data, the last source update, and keywords characterizing the source. The dynamic blueprint is updated every time data sources produce new real-time or batch data, or whenever its description through keywords is modified. The metadata description (SDB) is provided in Turtle (Terse RDF Triple Language, TTL) using the Resource Description Framework (RDF), which is a well-known framework for describing resources on the Web. The metadata mechanism contains TTL descriptions for all the sources included in the DL. TTL is a serialization format that provides a concise and human-readable way to represent RDF data, making it easier for both machines and humans to work with semantic information on the Web. RDF represents information as triples, which consist of subject–predicate–object statements. The resource being described is the subject, the property or attribute is the predicate, and another resource or value is the object. An example of a triple may be ex:variety “unstructured”, which means that the subject is the source, the predicate is “variety”, and the object is the value “unstructured”.
Our framework supports the categorization of data into three pillars: structured, semi-structured, and unstructured. This classification allows for the flexible inclusion of diverse datasets that transcend the constraints of data representation through meticulous metadata recording and semantic enrichment. By introducing stable and dynamic blueprints, the framework captures the essence of data variety, thereby ensuring independence from specific representations and underscoring its adaptability and robustness across varying datasets, marking a significant stride in modern data analytics.
Let us assume that a user requests access to specific sources producing data and storing them in the DL. In this case, a dedicated SPARQL query is formed and executed in the DL. When the query starts executing, it first asks the owner of the data for her/his approval. If the owner approves the query, then the framework, and specifically the metadata mechanism of the DL, is utilized to create the corresponding DM product that satisfies the query, as presented earlier in Figure 3. Figure 3 also shows that the user has access only to the sources requested through the corresponding APIs. Furthermore, this access is restricted to the specific person and is valid only within a specific period via Blockchain and NFT technologies, as presented in detail in the next subsection.
Analytically, the steps taken are as follows (a minimal code sketch of these two steps is given right after the list):
  • The owner of the contract can add an administrator on the contract by calling the addAdmin() function, passing an EVM-compatible address. Once an administrator is created, (s)he obtains access through her/his address to certain admin-only functions on the contract.
  • An administrator can mint an NFT by executing the safeMint() function, providing the address of the recipient, an expiration date in UNIX epoch time, the query that is associated with the NFT, and its access level. If the value of the access level is set to 1, then the NFT grants read-only access to its new owner and the NFT is non-transferable, while, if the value is set to 2, the owner of the NFT, besides read access, is also able to transfer access, and thus can transfer the ownership of the NFT to a different user. At any given time, the current owner of the NFT can access and read the data.
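The following self-contained Python simulation sketches the contract logic described in these steps and in Section 4.2 (an illustrative model under our own naming, e.g., MeshAccessNFT; it is not the deployed Solidity code):

# Simulation of the described contract: time-based NFTs carrying a
# query, with access level 1 (read-only) or 2 (read + transferable).
import time

READ_ONLY, TRANSFERABLE = 1, 2

class MeshAccessNFT:
    def __init__(self, contract_owner: str) -> None:
        self.contract_owner = contract_owner
        self.admins: set[str] = set()
        self.tokens: dict[int, dict] = {}   # token_id -> owner, expiry, query, level
        self._next_id = 0

    def add_admin(self, caller: str, admin: str) -> None:
        assert caller == self.contract_owner, "only the contract owner adds admins"
        self.admins.add(admin)

    def safe_mint(self, caller: str, to: str, expires: int, query: str, level: int) -> int:
        assert caller in self.admins, "admin-only function"
        assert level in (READ_ONLY, TRANSFERABLE)
        token_id = self._next_id
        self._next_id += 1
        self.tokens[token_id] = {"owner": to, "expires": expires,
                                 "query": query, "level": level}
        return token_id

    def can_view(self, caller: str, token_id: int) -> bool:
        # Read access requires current ownership and a non-expired token.
        t = self.tokens[token_id]
        return caller == t["owner"] and time.time() < t["expires"]

    def transfer(self, caller: str, token_id: int, to: str) -> None:
        t = self.tokens[token_id]
        assert caller == t["owner"], "only the current owner may transfer"
        assert t["level"] == TRANSFERABLE, "token is non-transferable"
        assert time.time() < t["expires"], "token has expired"
        t["owner"] = to   # the new owner now holds read (and transfer) access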

4.2. Smart Contract Architecture

The Blockchain-based architecture uses a specially designed ERC721 smart contract that was implemented to evaluate the proposed framework. ERC721 is a standard used in EVM-compatible Blockchain networks to represent the ownership of NFTs, where each token is unique and has its own metadata. In this study, we decided to develop our smart contract based on the ERC721 standard for two main reasons: (i) with ERC721, users can securely own, transfer, and manage their digital assets with transparent and verifiable ownership records; (ii) the ERC721 standard ensures that NFTs can easily interact with several wallets and decentralized applications (dApps), enhancing their utility and accessibility. Additionally, as the smart contract was developed for deployment on EVM-compatible Blockchain networks such as Ethereum, Polygon, and Avalanche, which use a Proof-of-Stake consensus mechanism, block production does not require large energy consumption.
The purpose of the smart contract developed in this paper is threefold: (i) allow data owners to mint time-based NFTs and transfer them to an address, (ii) allow NFT owners to read specific portions of data for a certain period of time, and (iii) allow NFT owners to transfer ownership of the data to a different user. The proposed smart contract involves three main actors: the contract owner, who is the deployer of the contract and responsible for registering administrators onto the contract; the contract administrators, who oversee the minting process; and the authorized users, who can view or transfer data. The administrator algorithmic workflow of the proposed framework is depicted in Figure 5 and summarized in pseudocode.
Figure 6 presents the algorithmic workflow followed for authorized users. Authorized users can view the assigned data based on two parameters, the expiration date and the query. A user holding a valid NFT can access a token-gated website to view the data. The website checks the eligibility of the connected address to allow or refuse access to the user. Finally, as depicted in Figure 7, NFTs are separated into two categories, transferable and non-transferable. When an NFT is minted, the admin specifies if the token has read-only or transfer access. When a user who holds a specific non-expired NFT initiates a transfer function, the contract checks whether the token can be transferred or not to a different address and proceeds to accept or reject the request accordingly. If the NFT is successfully transferred, then the new owner of the NFT is automatically granted access to the token-gated website and can view the data.
Two notes should be made here: (i) By using hash functions and asymmetric encryption, we ensure alignment with the different data privacy regulations (e.g., GDPR and CCPA). Each NFT in the developed system is owned by an EVM-compatible address; thus, only the address that owns the NFT can access the system and/or transfer the NFT. Consequently, granting access to data can only be initiated by the owner of the NFT, and only in cases in which the administrator enabled transfer access to the specific NFT during its minting (creation). (ii) Delays and costs that may occur are normally associated with the number of transactions stored on the Blockchain as well as the data volume in each transaction. In the proposed approach, this volume is kept minimal as the data themselves are not stored on the Blockchain, but rather a metadata description of these data. Moreover, as a transaction on the Blockchain does not concern the data, but only the metadata, the scaling of the DL (i.e., its expansion) does not affect the description of an existing source. In general, the scalability of the DL and the DM (i.e., the increase in their size) does not impact the metadata. The only case in which metadata are affected is when new sources of data are hosted in the DL. However, again, the information included in the metadata is very limited and is produced as a one-off for each new source. Finally, it should be noted that this approach focuses on the transformation of the DL into data products (DM), which again does not depend on the volume of the data kept in the DL and, hence, is not affected by scale.

5. Framework Demonstration through a Real-World Case Study

5.1. The PARADISIOTIS Group (PARG) Factory Case Study

As previously mentioned, this research utilizes a real-world case study from the area of smart manufacturing to demonstrate the applicability of the framework. Specifically, it utilizes data recorded at the PARG factory, the main business line of which is chicken farming and poultry meat production and distribution. PARG is a continuously growing company that over the years has invested in modern and technologically advanced equipment for breeding processes (e.g., automatic ventilation systems, technology-assisted mills for mixing ingredients and preparing chick food, etc.) and the production line (the cutting, mixing, and packaging of poultry meat). The management of the factory constantly seeks to improve performance and quality levels by frequently adapting production processes and adopting new technologies.
Data are produced within the factory mainly by two systems. (i) CUBORA is a fully operational heating control system designed to produce and monitor data related to poultry heating and emissions into the feeding atmosphere. This system is essential for ensuring the healthy growth and well-being of chicks on farms. (ii) AGROLOGIC specializes in the field of automated climate controllers and feeding and weighing systems. AGROLOGIC is integrated with a Chore Time controller and collects metrics from several remote sensors that are distributed in the farms, such as CO2, temperature, humidity, air static pressure, and light intensity level. All metrics are recorded in a database and accessed through a Web application in real time. Furthermore, images of the farms and/or equipment may be recorded for shift managers to visually inspect when necessary. Finally, the system generates alerts via an embedded GSM modem if any of the metrics exceed pre-defined thresholds.
The PARG case study presents all the characteristics of big data originating from heterogeneous sources with atypical patterns, which produce various kinds of structured, semi-structured, and unstructured data at high frequencies. These heterogeneous data need to be treated differently from normal production-speed data and be stored in more flexible and/or higher servicing-speed storage architectures or structures than classic relational databases and data warehouses, such as big data warehouses, DLs, and DMs. To this end, the current study developed a dedicated DL for PARG in a controlled (lab) environment and applied the basic principles of SDB, Blockchain, and NFT technologies for creating data products and domains. The latter are produced based on a DM constructed through the DL metadata mechanism. User requests for access to these data products are addressed to the data owner, and ownership may be granted through NFTs based on the relevant privileges, at the same time providing the ability to grant access and use the data only for a specific period of time.

5.2. Use Case Scenarios

As previously mentioned, a request to access a DL is supported by the utilization of the SDB semantic enrichment mechanism, which is the cornerstone for creating a data product as part of the DM according to user preferences and ownership granting. Access and ownership for a specific period of time is recorded on Blockchain using a dedicated NFT. The use case of the PARG factory focuses on the department of poultry feeding where sources produce data during the feed cycle of chickens within a specific farm. An excerpt of the structure of the corresponding SDB is depicted in Figure 8. We selected the following metadata characteristics to describe a source which produces data for monitoring chicken flock farming in different locations: (i) Source Name; (ii) Location; (iii) Feed cycle start; (iv) Feed cycle end; (v) Keywords; (vi) Variety; (vii) Velocity; (viii) Volume; and (ix) Source Path.
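For concreteness, a source blueprint carrying these nine characteristics could be serialized in TTL as follows (an illustrative excerpt; all values are hypothetical, and the ex: namespace matches the queries of Section 5.2.1):

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .

ex:cubora42 rdf:type ex:Description ;
    ex:sourceName     "CUBORA-HeatCtrl-42" ;
    ex:location       "Limassol" ;
    ex:feedCycleStart "2024-01-10" ;
    ex:feedCycleEnd   "2024-02-21" ;
    ex:keywords       "feeding, heating, emissions" ;
    ex:variety        "semi-structured" ;
    ex:velocity       "real-time" ;
    ex:volume         "850MB" ;
    ex:sourcePath     "/parg-lake/ponds/semi-structured/cubora42" .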
The use case scenarios tested are based on user requests to access a specific portion of data. For example, PARG stakeholders (shift managers, farm carers, production workers) often need to consult data related to the number of chicks in a farm, the environmental conditions within a farm, electricity consumption, emissions in the atmosphere, biomass production, etc. Therefore, data products in the DM were constructed to reflect these pieces of available information. Furthermore, in the scenarios below, we also assume that the shift manager wishes to acquire access to all information related to the Limassol farm and that, at some point, (s)he wishes to transfer this access to the head of production. Normally, access permissions are requested by sending a message to the owner of the data through a dedicated SPARQL query. This query is essentially executed in all scenarios that follow. Essentially, a user requests access to a specific portion (data sources) of a DL through the DM data products that are constructed to provide information for the location “Limassol”, as presented in Figure 8.
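Such a request can be reproduced offline with the rdflib Python library. The sketch below uses hypothetical source names and paths, and a variant of the query of Algorithm 1 in which the selected variables are explicitly bound:

# Run a Limassol location query against two hypothetical source blueprints.
# Requires: pip install rdflib
from rdflib import Graph

ttl = """
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex:  <http://example.org/> .
ex:cubora42   rdf:type ex:Description ; ex:location "Limassol" ;
              ex:sourcePath "/parg-lake/ponds/semi-structured/cubora42" .
ex:agrologic7 rdf:type ex:Description ; ex:location "Nicosia" ;
              ex:sourcePath "/parg-lake/ponds/structured/agrologic7" .
"""
g = Graph()
g.parse(data=ttl, format="turtle")

query = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/>
SELECT ?source ?sourcePath
WHERE { ?source rdf:type ex:Description ;
                ex:location "Limassol" ;
                ex:sourcePath ?sourcePath . }
"""
for row in g.query(query):
    print(row.source, row.sourcePath)   # only the Limassol source is returned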
To evaluate the efficiency of the proposed framework, we first developed a dedicated smart contract that was deployed in the Sepolia Test Network, and then we executed a series of transactions. The smart contract’s address is 0x88790ed3407e3b395ab0276d5305a273a497612b, and the contract owner is 0xfb43d1384FC250B59996933CA2D8C7667227BE52. The reader can refer to the smart contract’s URL on etherscan.io for complete access to the source code of the contract. By using the smart contract, we have explored various scenarios to showcase its fundamental features, which include NFT minting with additional on-chain information, data retrieval, and transfer restrictions.

5.2.1. Scenario 1—Minting

As previously mentioned, this scenario demonstrates how a user that wishes to access all sources of the factory that produce data during breeding in the city of Limassol is serviced by executing the SPARQL query listed below. This scenario illustrates the minting capability of the smart contract, as outlined in the proposed administrative algorithmic workflow. Initially, the admin of the smart contract with the address 0xfb43d1384FC250B59996933CA2D8C7667227BE52 sequentially executes two minting transactions to the address 0xcF1aB65AE4EFaA9BE8cDB13078360B811D11616D, the first not allowing the token to be transferable (Algorithm 1) (i.e., the ownership of the data may not be passed on to another user), and the second allowing the token to be transferable (Algorithm 2). The processes are executed with the following parameters:
Algorithm 1: First token process parameters
1: Date of Expiration: 1706094000 (Wednesday, 24 January 2024 11:00:00 UTC)
2: Query:
3: PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
4: PREFIX ex: <http://example.org/>
5: SELECT ?location ?sourcePath
6: WHERE { ?source rdf:type ex:Description; ex:location “Limassol”; }
7: Transferrable: NO (flag is set to 1)
Algorithm 2: Second token process parameters
1: Date of Expiration: 1706095000 (Wednesday, 24 January 2024 11:16:40 UTC)
2: Query:
3: PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
4: PREFIX ex: <http://example.org/>
5: SELECT ?location ?sourcePath
6: WHERE { ?source rdf:type ex:Description; ex:location “Limassol”; }
7: Transferrable: YES (flag is set to 2)
Once the transactions are confirmed on the Blockchain network, the address 0xcF1aB65AE4EFaA9BE8cDB13078360B811D11616D becomes the owner of both token ids #0 and #1, as depicted in Figure 9. Essentially, the owner has access to the PARG sources for Limassol’s farm with either token#0 or token#1. The main difference between the two tokens is the ability to use them for transferring ownership to another user, as demonstrated below.

5.2.2. Scenario 2—Retrieving Data

This scenario presents the retrieving capabilities of the smart contract, which are based on certain requirements. Here, we are using the address 0xcF1aB65AE4EFaA9BE8cDB13078360B811D11616D, which corresponds to the owner of both NFTs #0 and #1. This address is checked to comply with two restrictions: first, that it is the owner of the NFT, and second, that the NFT has not expired. These restrictions safeguard that the address has permission to retrieve the data recorded on the smart contract for each token (see also Figure 10). Therefore, access to the data product constructed to include all information produced in the Limassol farm is now granted to the owner of the corresponding address. If any address other than the owner of the NFT attempts to retrieve those data, it is automatically blocked by the smart contract and is not allowed to enter the token-gated website (see Scenario 3).

5.2.3. Scenario 3—Applying Transfer Restrictions

This scenario demonstrates the transfer restrictions that are set by the administrator when a token is minted. As described in Scenario 1, token#0 was minted as a non-transferable token, while token#1 was minted with transferable properties. As outlined in Figure 11, when the owner attempts to transfer token#0 to a different address, the transaction is blocked by the smart contract, as this is not a valid action due to transfer restrictions. Subsequently, when the owner of token#1 tries to transfer the token, this is carried out successfully, as token#1 has the appropriate transfer rights and hence the permission to do so. Here, the owner of token#1 transfers the token to the address 0xC70bc32E46378B5a01c713d6dB18042Acd8F0200. Upon confirmation of the transaction on the Blockchain network, the previous owner of the token loses access to it, as the access rights are now transferred to the new owner. Therefore, access to the data products is secured via Blockchain, and single control of ownership is guaranteed by the NFT.
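For readers who wish to trace the logic offline, the three scenarios can be replayed against the MeshAccessNFT simulation sketched in Section 4.1 (illustrative addresses; level 1 = read-only, level 2 = transferable, as before):

# Replaying Scenarios 1-3 with the MeshAccessNFT sketch from Section 4.1.
import time

contract = MeshAccessNFT(contract_owner="0xOwner")
contract.add_admin("0xOwner", "0xAdmin")

limassol_query = 'SELECT ?source ?sourcePath WHERE { ?source ex:location "Limassol" . }'
expiry = int(time.time()) + 3600   # valid for one hour

# Scenario 1: mint a non-transferable (#0) and a transferable (#1) token.
t0 = contract.safe_mint("0xAdmin", "0xShiftMgr", expiry, limassol_query, level=1)
t1 = contract.safe_mint("0xAdmin", "0xShiftMgr", expiry, limassol_query, level=2)

# Scenario 2: the owner can read; any other address is refused.
print(contract.can_view("0xShiftMgr", t0))   # True
print(contract.can_view("0xIntruder", t0))   # False

# Scenario 3: token #0 is blocked from transfer; token #1 transfers cleanly.
try:
    contract.transfer("0xShiftMgr", t0, "0xHeadProd")
except AssertionError as e:
    print("transfer of token #0 rejected:", e)
contract.transfer("0xShiftMgr", t1, "0xHeadProd")
print(contract.can_view("0xHeadProd", t1))   # True: access moved with the NFT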

6. Conclusions

This paper introduces an innovative framework for securing the access and ownership of data meshes based on Blockchain and NFTs. The framework is applied to a data lake storage architecture, which may host big data of any scale, frequency, and format, and utilizes semantic data blueprints for dynamically constructing data products in data meshes. These products are designed to meet user demands and ensure that stakeholders access specific areas of the data mesh as needed through the transfer of ownership. The integration of NFTs and Blockchain technology offers a novel approach to addressing ownership and governance concerns. A dedicated algorithm was developed that incorporates the ability to mint time-based NFTs, thus facilitating secure and transparent data ownership transfers. The proposed framework was demonstrated using a real-world case study from the smart manufacturing area. Specifically, a data lake was built to host data produced at a poultry meat production factory by several sensors and automated systems during the breeding process followed in the farms. Specific portions of data were selected to construct data products in a custom data mesh, which were then used as key elements for granting access and transferring ownership to authorized users via the execution of smart contracts and NFTs. The scenarios tested suggested successful behavior in terms of ease of use, transparency, and correctness. It should also be noted that users in the factory (workers and managers at breeding sites and on the production line) were able to easily follow the algorithmic approach of the proposed framework and apply its steps efficiently, greatly appreciating the ability to share data.
Future research will focus on enhancing and automating parts of the framework by utilizing recommender systems driven by user preferences and/or the history of transactions for creating data products. Specifically, the system will be modified to employ advanced algorithms for analyzing user behavior and preferences to generate access recommendations. This will allow data products to be constructed upfront, thus speeding up the process for granting access and transferring ownership.

Author Contributions

Conceptualization, M.P. and A.S.A.; methodology, M.P. and A.S.A.; software, M.P. and P.C.; validation, M.P., P.C. and A.S.A.; formal analysis, M.P.; investigation, M.P.; resources, M.P. and A.S.A.; data curation, M.P.; writing—original draft preparation, M.P.; writing—review and editing, M.P. and A.S.A.; visualization, M.P. and P.C.; supervision, A.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gupta, S.; Kar, A.K.; Baabdullah, A.M.; Al-Khowaiter, W.A.A. Big Data with Cognitive Computing: A Review for the Future. Int. J. Inf. Manag. 2018, 42, 78–89. [Google Scholar] [CrossRef]
  2. Blazquez, D.; Domenech, J. Big Data Sources and Methods for Social and Economic Analyses. Technol. Forecast. Soc. Chang. 2017, 130, 99–113. [Google Scholar] [CrossRef]
  3. Al-Sai, Z.A.; Husin, M.H.; Syed-Mohamad, S.M.; Abdin, R.M.S.; Damer, N.A.; Abualigah, L.; Gandomi, A.H. Explore Big Data Analytics Applications and Opportunities: A Review. Big Data Cogn. Comput. 2022, 6, 157. [Google Scholar] [CrossRef]
  4. Khan, N.; Alsaqer, M.; Shah, H.; Badsha, G.; Abbasi, A.A.; Salehian, S. The 10 Vs, Issues and Challenges of Big Data. In Proceedings of the 2018 International Conference on Big Data and Education, New York, NY, USA, 9–11 March 2018. [Google Scholar] [CrossRef]
  5. Khine, P.P.; Wang, Z. A Review of Polyglot Persistence in the Big Data World. Information 2019, 10, 141. [Google Scholar] [CrossRef]
  6. Shahid, A.; Nguyen, T.-A.N.; Kechadi, M.-T. Big Data Warehouse for Healthcare-sensitive Data Applications. Sensors 2021, 21, 2353. [Google Scholar] [CrossRef] [PubMed]
  7. Driessen, S.; den Heuvel, W.J.V.; Monsieur, G. Promote: A Data Product Model Template for Data Meshes. In Proceedings of the International Conference on Conceptual Modeling, Lisbon, Portugal, 6–9 November 2023; Springer Nature: Cham, Switzerland, 2023; pp. 125–142. [Google Scholar]
  8. Kunigk, J.; Buss, I.; Wilkinson, P.; George, L. Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale; O’Reilly Media: Sebastopol, CA, USA, 2018. [Google Scholar]
  9. Derakhshannia, M.; Gervet, C.; Hajj-Hassan, H.; Laurent, A.; Martin, A. Data Lake Governance: Towards a Systemic and Natural Ecosystem Analogy. Future Internet 2020, 12, 126. [Google Scholar] [CrossRef]
  10. Pingos, M.; Andreou, A. A Data Lake Metadata Enrichment Mechanism via Semantic Blueprints. In Proceedings of the 17th International Conference on Evaluation of Novel Approaches to Software Engineering, Virtual, 25–26 April 2022; pp. 186–196. [Google Scholar] [CrossRef]
  11. Pingos, M.; Andreou, A.S. Exploiting Metadata Semantics in Data Lakes Using Blueprints. In Proceedings of the International Conference on Evaluation of Novel Approaches to Software Engineering, Virtual, 25–26 April 2022; Springer Nature: Cham, Switzerland, 2022; pp. 220–242. [Google Scholar] [CrossRef]
  12. Pingos, M.; Christodoulou, P.; Andreou, A. DLMetaChain: An IoT Data Lake Architecture Based on the Blockchain. In Proceedings of the 13th International Conference on Information, Intelligence, Systems & Applications (IISA), Corfu, Greece, 18–20 July 2022; pp. 1–8. [Google Scholar] [CrossRef]
  13. Beheshti, A.; Benatallah, B.; Nouri, R.; Tabebordbar, A. CoreKG: A knowledge lake service. Proc. VLDB Endow. 2022, 11, 1942–1945. [Google Scholar] [CrossRef]
  14. Dehghani, Z. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Available online: https://martinfowler.com/articles/data-monolith-to-mesh.html (accessed on 10 January 2024).
  15. Dehghani, Z. Data Mesh Principles and Logical Architecture. Available online: https://martinfowler.com/articles/data-mesh-principles.html (accessed on 10 January 2024).
  16. Viriyasitavat, W.; Da Xu, L.; Bi, Z.; Hoonsopon, D. Blockchain technology for applications in internet of things—Mapping from system design perspective. IEEE Internet Things J. 2019, 6, 8155–8168. [Google Scholar] [CrossRef]
  17. Alam, T. Blockchain-based Internet of Things: Review, Current Trends, Applications, and Future Challenges. Computers 2022, 12, 6. [Google Scholar] [CrossRef]
  18. Di Angelo, M.; Salzer, G. Tokens, types, and standards: Identification and utilization in Ethereum. In Proceedings of the 2020 IEEE International Conference on Decentralized Applications and Infrastructures (DAPPS), Oxford, UK, 3–6 August 2020; pp. 1–10. [Google Scholar]
  19. Yildiz, H.; Küpper, A.; Thatmann, D.; Göndör, S.; Herbke, P. Towards Interoperable Self-Sovereign Identities. IEEE Access 2023, 11, 114080–114116. [Google Scholar] [CrossRef]
  20. Rehman, W.; e Zainab, H.; Imran, J.; Bawany, N.Z. NFTs: Applications and challenges. In Proceedings of the 2021 22nd International Arab Conference on Information Technology (ACIT), Muscat, Oman, 21–23 December 2021; pp. 1–7. [Google Scholar]
  21. Phuc, N.T.; Khanh, H.V.; Khoa, T.D.; Khiem, H.G.; Huong, H.L.; Ngan, N.T.; Triet, N.M.; Kha, N.H.; Anh, N.T.; Bang, L.K.; et al. An Enhanced CoD System Leveraging Blockchain, Smart Contracts, and NFTs: A New Approach for Trustless Transactions. Int. J. Adv. Comput. Sci. Appl. 2023, 14. [Google Scholar] [CrossRef]
  22. Shae, Z.Y.; Tsai, J.J. On the Design of Medical Data Ecosystem for Improving Healthcare Research and Commercial Incentive. In Proceedings of the IEEE 3rd International Conference on Cognitive Machine Intelligence (CogMI), Atlanta, GA, USA, 13–15 December 2021. [Google Scholar]
  23. Borovits, N.; Kumara, I.; Tamburri, D.A.; Van Den Heuvel, W.J. Privacy Engineering in the Data Mesh: Towards a Decentralized Data Privacy Governance Framework. In Proceedings of the International Conference on Service-Oriented Computing, Rome, Italy, 28 November–1 December 2023; Springer Nature: Singapore, 2023; pp. 265–276. [Google Scholar]
  24. Bode, J.; Kühl, N.; Kreuzberger, D.; Hirschl, S.; Holtmann, C. Towards Avoiding the Data Mess: Industry Insights from Data Mesh Implementations. arXiv 2023, arXiv:2302.01713. [Google Scholar]
  25. Hummel, P.; Braun, M.; Dabrock, P. Own data? Ethical reflections on data ownership. Philos. Technol. 2021, 34, 545–572. [Google Scholar] [CrossRef]
  26. Owczarek, D. Securing Stakeholder Buy-In for Data Mesh: A Strategic Guide for Moving Forward with Data Mesh Implementation. Available online: https://nexocode.com/blog/posts/securing-data-mesh-stakeholder-buy-in/ (accessed on 3 March 2024).
  27. Dolhopolov, A.; Castelltort, A.; Laurent, A. Implementing a Blockchain-Powered Metadata Catalog in Data Mesh Architecture. In Proceedings of the International Congress on Blockchain and Applications, Guimaraes, Portugal, 12–14 July 2023; Springer Nature: Cham, Switzerland, 2023; pp. 348–360. [Google Scholar]
Figure 1. Data lake architecture and the concept algorithmic approach.
Figure 2. Data mesh architecture and the concept algorithmic approach.
Figure 3. The algorithmic process of transferring ownership from the data owner to a user.
Figure 4. Data source blueprint description using the 5 Vs of big data characteristics [10].
Figure 5. The admin algorithmic workflow and pseudocode.
Figure 6. The authorized user algorithmic workflow and pseudocode.
Figure 7. The algorithmic user workflow for transferable and non-transferable NFTs.
Figure 8. PARG’s data lake SDB structure for use case scenarios.
Figure 9. Owner for token ids #0 and #1.
Figure 10. Query results directly from on-chain data and portal.
Figure 11. Result for token#0 and token#1.