A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures

Chereja, Ionela; Erdei, Rudolf; Pasca, Emil; Delinschi, Daniela; Avram, Anca; Matei, Oliviu

doi:10.3390/math13071116

Open AccessArticle

A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures

by

Ionela Chereja

^1,*,

Rudolf Erdei

¹,

Emil Pasca

¹

,

Daniela Delinschi

¹

,

Anca Avram

¹ and

Oliviu Matei

^1,2

¹

Department of Electrical, Electronics and Computer Engineering, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania

²

R&D Department, Holisun, str. Cărbunari nr. 8, 430397 Baia Mare, Romania

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(7), 1116; https://doi.org/10.3390/math13071116

Submission received: 16 February 2025 / Revised: 21 March 2025 / Accepted: 27 March 2025 / Published: 28 March 2025

Download

Browse Figures

Versions Notes

Abstract

:

Data-centric operational systems, machine learning (ML), and other analytical and artificial intelligence (AI) pipelines are becoming increasingly imperative for organizations seeking to increase the protection of sensitive data while satisfying customer expectations. This paper proposes a novel methodology to assess the level of vulnerability assigned to each of the data storage components in complex multilayered data ecosystems through a nuanced assessment of data persistence and content metrics. The suggested methodology introduces a new and effective way to address the issues of determining perceived privacy risk across data storage layers and informing necessary security measures for an ecosystem by calculating an ecosystem vulnerability score. This offers a comprehensive overview of data vulnerability, aiding in the identification of high-risk components and guiding strategic decisions for enhancing data privacy and security measures. With consistent and generalized assessment of risk, the methodology can properly pinpoint the most vulnerable storage systems and assist in directing efforts to mitigate them.

Keywords:

privacy compliance; data privacy; ethical AI under GDPR; machine learning; multi-layer data warehousing architecture; ecosystem architecture; cybersecurity; data protection; data vulnerability; data risk assessment; ecosystem vulnerability score

MSC:

68P27

1. Introduction

As the world becomes increasingly digitized, we are generating more data than ever before. These data can come from a variety of sources, including social media, user behavior and tracking, and wearable technology. The majority of the systems that support the currently interconnected modern life are digitized, and they need to collect, store, and process this information. Although there are many benefits to this digital revolution, it also raises important questions about privacy [1]. Specifically, who has access to these data, what is done with it, and how can it be protected? These questions are paramount if the data in question can infringe on someone’s right to privacy by including information that can identify an individual.

Cybersecurity threats and attacks have become more frequent, and data breaches have been called “the new norm” [2]. These have an impact on businesses from a multitude of sectors, changing how consumers relate to them regarding the type of data that has been compromised. Ensuring a balance between cybersecurity and user-friendly interfaces in the current digital landscape has become increasingly imperative for organizations seeking to protect sensitive data and satisfy customer expectations. Companies are seeking to implement systems that are easy to use while making sure they are safeguarded against attacks that could erode the trust of their users. The security measures needed to achieve the protection of data extend beyond the storage and processing of the data itself to the customer experience [3]. Protecting the data that are entrusted to companies by their clients is, therefore, a concern that becomes central to the entire ecosystem, and it becomes a critical part of the architecture of that system.

As organizations seek to implement data mining (DM), machine learning (ML), and other artificial intelligence (AI) and analytical pipelines to extract data-driven insights and overviews, the privacy assessment of that entire data flow becomes paramount. Not only is the security of the source system and its data storage important, but each component of the analytical pipeline that processes or accesses those data is equally important [4]. As a result, data privacy practices, the data management life cycle, and architecture of the entire ecosystem, from source(s) to visualization, must be evaluated [5]. Any adjustment in that architecture in favor of increasing privacy could affect the performance and cost efficiency of the systems [6].

An automated methodology is necessary to strike the delicate balance between privacy and efficiency in such complex analytical pipelines. This methodology should account for the need to dynamically adjust privacy parameters based on the sensitivity of the data processed while being adaptive to diverse hardware capacities and computing costs. This paper expands on previously published work [7], which proposes a methodology to evaluate the level of privacy assigned to a DM pipeline. Therefore, our objective is to broaden the assessment of the source data to include structured, semi-structured, and non-structured data vulnerability and expand the proposed vulnerability score for each layer in the entire ecosystem architecture. This work will thus propose an enhanced framework that can be used in multi-layered ecosystem architectures. It includes the implications of using different data types and the usage of their respective storage engines, the consequences of storing and duplicating data in separate layers of such architecture in the calculation of the total vulnerability score, and adjusting the privacy parameters of a pipeline as the result of the computed score by using a privacy by design architectural approach. At the moment of publication, to the best of our knowledge, no such framework is available.

The paper follows a structured approach, beginning with a look at previous work, which includes background information related to data privacy concerns, an overview of the available related work at the time of research, and a look at data ecosystems and data flows. We then detail the proposed methodology, introducing the vulnerability score and the data persistence and content metrics. Subsequently, we define the Ecosystem Vulnerability Score and its calculation. The following section discusses a practical application of the methodology and the obtained results, after which we examine limitations and further research in Section 5. Finally, Section 6 synthesizes the applications of the methodology and the impact on data privacy in complex data ecosystems.

2. Materials and Methods

2.1. Background

Data processing comprises almost all ways in which you can use personal data, for example, collection, storage, deletion, alteration, analysis, transfers, viewing, anonymization, etc. In practice, this can be defined as anything that is done with personal data is considered processing. Not only is data privacy preservation of critical relevance, but its complexity increases with the use of data processing operations such as data linkage [8] or data integration [9]. Such manipulation of data can even introduce new sensitive information due to the generation of new patterns that can identify individuals.

Regulations such as GDPR define and categorize types of data into predefined categories, making it clearer for the entities that fall under its regulations which types of data need to be processed for privacy compliance. However, depending on the area of activity of a business, there may be more types of data that can be considered sensitive and could inflict negative consequences in the event of a data breach [10,11,12,13]. Taking that into account, each company should have a way to assess the vulnerability of its data not only by regulatory standards such as those enforced by GDPR but also by the internally recognized vulnerability or sensitivity of specific information.

In previous work on vulnerability assessment [7], we focused on proposing a privacy assessment methodology for systems that use data mining and machine learning to extract data insights. This methodology can be generally applied to any Machine Learning or Data Mining model and its corresponding source. However, to properly assess the vulnerability that an entity using multiple systems faces, it is necessary to include a view of the multiple stages that data goes through in an ecosystem architecture. As privacy is a critical concern for any digital system, there has been a lot of focus and research on this topic. Different approaches have been used to estimate the impact of data processing. An approach to estimate the maximum probable loss in the event of a data breach is proposed by K. Jung [14]. By using this framework, the consequences of a potential data breach by means of maximum data loss will inform the data controller or processor about the extent of the necessary data protection mechanisms.

Each entity that processes data in order to provide its service becomes responsible for assessing the type of data it stores in order to accurately implement mechanisms of data protection. The sensitivity and value of certain types of data, or combinations of data [15], should be ascertained properly. Data belonging to companies listed on the stock exchange are considered highly valuable, making it an attractive target for data misappropriation and theft [16]. Security events in publicly traded companies can have a direct monetary implication in assessing value changes in the stock price [17]. However, each company that stores client data is a potential target for cyber-attacks and is responsible for assessing its vulnerability and attached value. Not only is there an impact on the reputation of a company that has encountered a breach [10], on the reactions of individuals affected by it [11], and on market value [12], but they can also be subjected to penalties and fines under certain legal regulations [13].

There are various regulations and global standards that aim to safeguard data privacy and prevent cyber crimes against it [18]. Companies that fall under the jurisdiction of these regulations are obligated to comply with the laws and are subject to audits that can lead to high fines [19]. The General Data Protection Regulation (GDPR) is considered one of the most important such regulations on data privacy [20], as it provides explicit guidelines underlining compliance and non-compliance factors. However, beyond categorizing a particular set of data as compliant or noncompliant, the difficulty encountered by data processors reflects on the ability to find a reliable way of assessing its vulnerability in its raw form and after implementing protection mechanisms. Furthermore, while the use of differential privacy [21] mechanisms as a way to mitigate risk is possible, some organizations need to be able to store sensitive data for specific use cases. Therefore, an approach that also allows for these cases is necessary.

2.2. Related Work

Research in the area of data privacy and risks has been a matter of concern over the years, with an increase in studies beginning in 2018 that could be attributed to the implementation of the EU GDPR [22]. Research has been carried out on GDPR compliance, such as the usability of privacy policies under GDPR rules [23], automatic assessment of these policies [24], impact assessment methodology to support its requirements [25], and the introduction of tools like privacyTracker [26] aimed at providing support for GDPR principles, such as data traceability. Although these approaches provide ways to assess privacy risks, they focus specifically on the rules of compliance with GDPR. As such, these assessments are too specifically focused on predefined regulations and cannot be reliably applied outside of these regulations, as they do not readily provide the flexibility to define case-specific rules. This hinders the ability of data processors to expand the definitions of sensitive or high-risk data, overlooking the distinctive characteristics of organizations, which results in inadequate privacy risk assessments [27].

In [28], Caruccio et al. propose a methodology that analyzes data correlations in relaxed functional dependencies to identify new, possibly sensitive data that could break privacy rules during significant big data operations. While accounting for the introduction of new data associations in a big data system, this methodology is also constrained to the rules of GDPR. Security and privacy issues in relation to Big Data database systems have been investigated in several dimensions to identify and evaluate different standard security mechanisms, offering suggestions for future enhancements [29]. This research includes a look at relational, NoSQL, and NewSQL databases and the maturity of security implementations in these models. Although this approach to security and privacy issues can account for several types of storage systems, it focuses on standard ways to evaluate and implement security mechanisms that can be implemented to enhance security rather than assessing the vulnerability risks.

Focusing on the data analysis step and insight into the data processing flow, there are several approaches to assess vulnerability and privacy. Approaches such as privacy-preserving data mining (PPDM) [30,31] have been explored through a comprehensive analysis of privacy breaches throughout history, highlighting key regulations and legislation that have influenced the evolution of privacy management strategies. Ricardo et al. [32] have classified existing privacy metrics into privacy level, data quality, and complexity metrics to quantify data security, information loss, and technique efficiency. The authors emphasize the need for scalable and efficient PPDM solutions to handle increasing amounts of data. This focuses on the data flow for data analysis, particularly data mining, but does not account for the entire flow that the data move through in order to become available for analysis from the source.

A systematic view of privacy risk assessments has been proposed by Wairimu et al. [22] through a literature review of privacy impact assessments (PIAs) and privacy risk assessments (PRAs). While these methodologies are essential in privacy by design, they have not yet been rigorously and systematically evaluated in practice. According to the findings of this research, it is apparent that privacy impact and risk assessment methodologies exhibit shortcomings, either due to strong reliance on specific regulatory frameworks (e.g., GDPR) or by incurring significant overhead. Moreover, select methodologies are confined to the assessment of privacy risk by using the primary consideration of potential harm to data subjects, leaving broader data vulnerability concerns unaddressed.

While there are approaches to assessing vulnerabilities in relation to privacy, these are either focused on parts of an ecosystem inapplicable to the whole or are linked to certain rules and regulations and are not flexible enough to adjust to requirements of more stringent or more permissive regulations.

In the current study, we propose a privacy assessment framework that aims to decouple the assessment of vulnerability from legislative determinants and specific security measures. Its primary objective is to furnish a quantifiable metric capable of assessing and contrasting vulnerability attached to storing non-transient datasets across the entirety of a complex data ecosystem, regardless of the type of storage system being used. The framework is designed to provide a repeatable, reliable and flexible methodology for interconnected data layers that can be defined according to standardized rules and regulations or adjusted to more stringent or more permissive requirements. These scores will help identify the most vulnerable data layers in an ecosystem and choose appropriate measures in proportion to the identified vulnerabilities.

2.3. Data Ecosystems

A data ecosystem is a novel setting that consists of intricate networks composed of individuals and organizations that exchange and utilize data as the primary resource. These ecosystems offer an environment for fostering, handling, and sustaining data-sharing initiatives [33]. Data ecosystems, as defined by Gartner, “unify data management components distributed across clouds and/or on-premises” [34]. Data Ecosystems have data at their core. They facilitate subsequent integrations across different components, such as applications, marketplaces and exchanges, edge computing, event processing, advanced analyticsand data science AI/ML, all of which contain their own data layer.

Efficiently scaling data, analytics, and AI requires the integration of ecosystems consisting of complementary capabilities necessary to support all data workloads. An effective data ecosystem is one that can accommodate large amounts of data and enable powerful analytics and AI capabilities through seamless integration.

The challenges of data ecosystems have been explored in the work of Ramalli and Permici [35] in the context of scientific data. One of the identified challenges revolves around confidentiality. This provides us with a valid example of data that would not necessarily and objectively qualify as private due to personally identifiable qualities but would, in this context, be marked as sensitive and confidential information. Similarly, the opening of data for sharing introduces a major concern of privacy infringement, thus limiting the potential benefits of usage [36].

A look at the secure data management life cycle for data ecosystems has been proposed by Zahid et al. [5]. This paper presents a secure data management life cycle (SDMLC) framework for managing data throughout their entire life cycle, including creation, processing, storage, usage, sharing, archiving, and destruction or reusing. The framework addresses challenges specific to each stage of the data life cycle. It also identifies several major concerns, including privacy, liability, and security. Therefore, a proper assessment of the vulnerability of the data is paramount. With the methodology presented in this paper, we focus on the data persistence vulnerabilities in particular.

Hybrid use and multi-cloud are key to any data-ecosystem plan [34]. This implies that all data privacy vulnerabilities have to be assessed in the Cloud context as well. The security and privacy challenges of cloud computing are recurrent topics of research [37], including the exploration of data breaches and unauthorized access. The impact of these issues on user trust and data integrity within cloud infrastructure is assessed. The use of AI/ML systems in the Cloud introduces yet another layer of concern for compliance and data vulnerability [38].

2.4. Data Flow Overview

In an enterprise-level ecosystem, data flow through multiple layers. Whereas in the past, an application architecture would commonly be three-tiered, the focus on data collection and analysis has led to different architectural approaches per concern, as well as multiple points of integration between these systems. Following the flow of data from generation to analytics visualization, it becomes clear that not only are much more data being created in each system, but they are also potentially transferred, processed, and duplicated in each of these layers. Furthermore, the type of data handled is much more diverse (structured, semi-structured, and non-structured), and each processing and storage step can generate aggregations and links that could generate even more sensitive data. An overview of the different data layers that could be part of a system’s architecture and their respective types is represented in Figure 1.

2.5. Proposed Vulnerability Framework

A comprehensive vulnerability assessment should be able to provide a framework that can include all applicable data storage layers and all data types and offer the option to also assess the vulnerability of all of these through a unified score. This provides an advantage over other privacy or vulnerability assessment frameworks by offering one way to calculate a comparable risk score for all of the involved data steps. This flexibility allows for an end-to-end understanding of how the data flows and enables an understanding of each layer in terms of risk, but it also highlights where in the entire ecosystem the biggest risks lie. Furthermore, for ecosystems that are not designed to use differential privacy approaches, such a framework can help inform where sensitive data need to be protected if they cannot be sampled, fully anonymized, or removed.

Given that each component in an architecture is susceptible to different types of attack, it is important to clarify the boundaries of the proposed framework. Intrusion detection systems (IDS) [39] are essential components within cybersecurity strategies designed to monitor network or system activities for malicious activities or policy violations. While traditional IDS solutions effectively identify unauthorized activities, integrating a privacy-focused assessment framework can significantly enhance an organization’s overall data protection strategy. Assessing vulnerabilities in data storage layers, as proposed by our methodology, complements IDS capabilities by identifying data sensitivity points that require heightened monitoring and proactive defense mechanisms. The fundamental question to be addressed is what sensitive data can be exposed or maliciously extracted by a potential attacker at each step of the data flow? Most of the processing steps are transient in nature, making the attack window smaller than that of the storage layers. Therefore, we focus the vulnerability assessment framework on persistent layers, as these offer the largest window and the possibility of an attack. In Figure 2, an overview of the vulnerability points is depicted for a data flow from source to delivery of information.

Each of the steps illustrated in Figure 2 might have one or multiple vulnerability points due to the persistence of potentially sensitive data. We label persistent data as any data that have been stored in any format (structured, semi-structured, or non-structured) in any data storage system. Any logging data that persist during the data integration step could store sensitive information for an extended period of time, including data extracts or any form of persistent data caching offered by the data visualization tool.

2.6. Vulnerability Score

As introduced in [7], the Vulnerability Score (VS) is an integrated privacy score used to provide a clear assessment of the inherent vulnerability of a data pipeline. It takes into account several metrics on which the data can be evaluated. These metrics are classified under two main dimensions: metrics for how the data are persisted and metrics for the contents of the data. Taking these two components into account, a more comprehensive look at the vulnerability of the data can be formed and calculated into a score that can inform the extent of the potential risk attached to a set of data.

Our previous work focuses on data mining or machine learning (ML) pipelines, introducing the calculation of VS through the use of two analyses: a data privacy analysisand an ML model privacy analysis. We propose an extension of the framework focused on data privacy analysis, which will allow the assessment of each of the aforementioned persistent data layers for the entire architecture. The total vulnerability score can, therefore, only include the data privacy analysis for end-to-end data flow for environments where ML/data mining has not (yet) been implemented or the combined calculation of both the data privacy analysis and ML model privacy analysis where applicable.

Data privacy analysis focuses on the data, auditing each piece of information one by one, and generating a series of privacy metrics. Considering that the data being stored can be of different types, it is relevant to include a distinction within the metrics that can quantify the vulnerability impact for structured data, semi-structured, non-structured, or unstructured data.

The following metrics quantify a set of relevant questions that should be compiled into a unified vulnerability score. Each of these can be assessed for all the data formats mentioned. In the case of structured data, the evaluation of the metrics is carried out for each column stored in the database. In the case of semi-structured data, such as JSON formats, XML, HTML, or graph data sets, the same metrics will be evaluated, with the addition of an extra data content metric focusing on nested sensitive or secret information. This will therefore quantify not only the presence of sensitive information but also the added complexity of maintaining and potentially having to remove the secret piece of information while attempting to retain the rest of it in its original form.

Non-structured or unstructured data refer to data sets that lack a definable structure or are not organized in a way that is easily searchable or analyzed. Examples of unstructured data include text documents, emails, multimedia files, and social media posts from various platforms that contain both text and visual information. As unstructured data are more prevalent and more complex, potentially containing a mix of highly sensitive information, we include a separate metric for unstructured sensitive/secret information to properly ascertain its impact in terms of vulnerability.

Data Persistence metrics: compute the vulnerability of the persisted data:
(a)
(ENCD) Is the data encoded or stored in plain? Values of 1–2, 1 represents an anonymized field, and 2 is a non-anonymized field. In-between values can be assigned by the assessor contingent on the contents of the field in question.
(b)
(ENCR) Is the value encrypted? Values of 1–2 with a critical dependency on the algorithm used for encryption. The lowest score is assigned to the most secure algorithm, as that will not increase the vulnerability of the data. Subsequently, the least secure and compromised algorithms, as well as no encryption, will have higher values, which will in turn have a bigger impact on the final score. See Appendix A for algorithm assessment.
(c)
(MASK) Are the data masked? Values of 1–2, 1 indicates a masked field, and 2 refers to an unmasked field. In-between values can be assigned by the assessor contingent on the masking policy applied.
Data content metrics: quantify how useful data are to third parties:
(a)
(EXPL) Exploitability refers to the extent to which a particular piece of information can be leveraged or utilized to achieve a particular goal or result. This can be 1–2. More exploitable data will obtain a higher value as they will create a greater impact on vulnerability. Less sensitive or important information will receive a lower value (see Appendix B).
(b)
(HPI) Has Personal Information. This indicates whether the data contain personal identifiable information (PII) and can take values of 0–1. In-between values can be assigned by the assessor contingent on the contents of the field in question. Lower values indicate non-PII information, and higher values refer to PII-containing information.
(c)
(HSI) Has Sensitive Information. This indicates whether the data contain sensitive/secret information. Can be assigned values of 0–1, in-between values can be assigned by the assessor contingent on the contents of the field in question.
(d)
(HNS) Has Nested Information. Metric indicates whether the data contain nested sensitive information in a semi-structured form, making it more difficult to obfuscate the sensitive components. Can be assigned values of 0–1. In-between values can be assigned by the assessor contingent on the contents of the field in question.
(e)
(HUS) Has Unstructured data that is Sensitive. This indicates whether the data contain unstructured sensitive/secret information. Can be assigned values of 0–1. In-between values can be assigned by the assessor contingent on the contents of the field in question.

2.7. Ecosystem Vulnerability Score

Calculating a vulnerability score for each data storage layer in a data architecture implies that the score can include metrics for different data types that could contain sensitive information. Therefore, we included two data content metrics, which refer specifically to semi-structured and non-structured information. Commonly encountered formats such as JSON can be used in different layers and stored within a different storage form as well. A raw data layer could contain files with a semi-structured JSON format that could store sensitive data. While structured columnar models could have processes that obfuscate or completely remove sensitive information, in the case of semi-structured data, this process becomes particularly difficult. As a result, we propose a separate metric (HNS) that increases the vulnerability score of the data model for such cases.

In a similar way, non-structured data containing sensitive or secret information can further increase the vulnerability score. For example, a non-structured document such as a PDF format, which contains sensitive information which cannot be easily removed or hidden selectively, will inherently carry more risk than a similar piece of information which can be extracted, stored separately, or even removed completely without impacting the rest of the contained information. (HUS) will be used in such cases to quantify this additional risk.

The following equations describe a method to calculate the vulnerability risk of a specific data layer (Equation (4)) and the total ecosystem score (Equation (5)):

V S_{p m} = \sum_{i = 1}^{n} E N C D_{i} \times E N C R_{i} \times M A S K_{i},

(1)

V S_{c m} = \sum_{i = 1}^{n} E X P L_{i} \times (1 + H P I_{i} + H S I_{i} + H N S_{i} + H U S_{i}),

(2)

V S_{s} = V S_{p m} + V S_{c m},

(3)

V S_{s} = \sum_{i = 1}^{n} E N C D_{i} \times E N C R_{i} \times M A S K_{i} + E X P L_{i} \times (1 + H P I_{i} + H S I_{i} + H N S_{i} + H U S_{i}),

(4)

V S_{e} = \sum_{i = 1}^{m} V S_{s},

(5)

where

$V S_{p m}$ is the calculated score for data persistence metrics of the data storage layer;
$V S_{c m}$ is the calculated score for data content metrics of the data storage layer;
$V S_{s}$ is the calculated vulnerability scoreof the data storage layer (using $V S_{p m}$ and $V S_{c m}$ );
$V S_{e}$ is the data ecosystem vulnerability scoreof the totality of data layers involved in the ecosystem;
n is the total number of fields in the dataset;
m is the total number of data storage layers in the ecosystem.

The calculations are designed with the presumption that the data persistence metrics, which contain an assessment of the inherent vulnerability of the data in question, hold a higher weight in the final score. A piece of information that is revealed plainly, for example, will have a significantly higher impact on the vulnerability score; therefore, the values assigned (1–2 as previously defined) will be multiplied in the calculation to reflect this effect. Similarly, the (EXPL) exploitability holds a higher weight, as it signifies the extent to which these particular data can be of use and consequently damaging to the exposed party. The (EXPL) is, however, of particular relevance by further defining the vulnerability of the data contents, which will be reflected in the factor of impact by multiplying it with the sum of these metrics. We choose this approach to underline that even for exposed sensitive information, the impact will be impacted with a greater weight by the way they are persisted.

2.8. Implications for Vulnerability Scores

A higher VS implies a more vulnerable data storage layer. Depending on the type of data, whether personally identifiable, sensitive, or critically important for the organization, more stringent security measures must be taken into account. These must be correlated with the scores, and the resulting data systems must be reassessed. Although the methodology does not yet provide guidance on how to improve protection measures, it includes several metrics for potential measures, included in the data persistence metrics.

In the context of calculating the Ecosystem Vulnerability Score, a relevant consideration is the amount of data duplication and the amount of data layers that will store sensitive or important information. A clear way to obtain information about which layer carries the highest amount of risk is by calculating the VS value for each layer and comparing the resulting numbers. If the assessment of the metrics and functionally similar data is applied consistently, this calculation can provide a very accurate overview throughout an ecosystem.

3. Use Case

3.1. Video-on-Demand Data Ecosystem

This use case provides a partial architectural view of a simplified video-on-demand system, with the focus on using several data storage layers to explore the calculation of the Ecosystem Vulnerability Score. Specific details and project names are not provided to protect against potential attacks resulting from describing architectural information.

Data stored by services that offer media for user consumption, such as that of a video-on-demand system, imply the flow of data over several interconnected applications. These data, while operationally split and stored across multiple platforms that service different components of the system, are eventually meshed together in a data warehouse or data lake whose objective is to provide a single source of truth [40]. These data likely contain highly sensitive information such as names, email addresses, residential addresses, dates of birth, gender information, financial details such as bank account numbers, as well as detailed viewing behavior linked to video metadata and defined preferences. As such data are collected, processed, stored, and transferred across multiple layers, each such storage layer should have an individually calculated

V S_{s}

. Once this calculation is performed, the

V S_{e}

can be computed for the entire ecosystem.

Depending on the extent of the data collected and cross-linked in such a warehouse, the number of sensitive columns can range from dozens to hundreds. Also, considering the most common data warehouse architectures [41], it is not uncommon that by using a two-layer or three-layer architecture, the data will be duplicated by each individual layer. This will therefore render the total

V S_{d}

of the warehouse layer to extreme values. The security and privacy implications of managing such a data warehouse become immediately apparent, and a reliable approach is required to assign the right vulnerability scores and the necessary measures to secure it properly.

We propose a use case for assessing the vulnerability score using a simplified overview of a video-on-demand store. We chose to use this example as a relevant use case because it includes multiple areas that contain sensitive processes and information such as subscription management, payments, viewer behavior tracking, recommendation information, potentially sensitive or restricted audio and video files, as well as many others. The use case will focus mainly on the subscription management flow due to space limitations within this research work and the reduced relevancy of other superfluous information. This will include a depiction of some of the most relevant data regarding subscription information, as well as information about the subscriber and its attached sensitive details. We also aim to offer a clear view of the amount of data duplication that occurs in such an interconnected and interdependent ecosystem, from the supporting systems to the analytical warehousing layer.

3.2. Use Case Ecosystem Overview

In Figure 3, three main components are depicted to represent the way data flows between multiple applications and systems.

Within business applications, the streaming serviceis the customer-facing system. This system stores data in its own database for the purpose of providing access, showing video information and related metadata, recording streaming information related to the user who accessed the system, etc. The streaming serviceuses two back-end systems related to subscription management functionality: a backend video systemand a subscription management system. Data from both systems are sent to the streaming serviceand are stored independently on each of these layers.

Data from each of the business applications flow to the data warehouselayer, within which there are four sequential steps. The first step is for each of the systems to send data independently to be stored in a raw file with a semi-structured format in the Landing component. The following step implies the flattening and normalization of the data in a structured format, loading them into the Staging database, after which they are stored again in a separate layer for the purpose of keeping history. This implies another copy of the data to the Historical Layer. Finally, within the data warehouse, data are modeled, aggregated, centralized, and connected in a single overview in the Analytical layer. The loading of the data into the Analytical layercontains steps for obfuscating sensitive data with procedures as encryption and masking.

The final layer concerns data serving, which includes three different processes for this use case, all of which can be executed in parallel. For the dashboards/visualizations process we assume that the systems would consume data directly from the Analytical layer. Without a separate storage layer, this component will not be part of the Vulnerability Scorecalculation. The internal transient storage of data, such as memory storage and temporary caching, is not part of the focus of this work. The two other serving layers include Machine Learning/Data Miningand Data Sharing. Both can have separate storage layers, either for the models used by Machine Learning/Data Mining, or export copies of the data in semi-structured file formats and/or structured table formats for Data Sharing.

3.3. Vulnerability Assessment Calculations

For each of the storage layers described in Figure 3, the following pieces of data are defined at the field level, and the metrics are evaluated according to the proposed methodology. The Vulnerability Scoreand Ecosystem Vulnerability Scorewill be calculated for the defined use case. For each data layer, the

V S_{s}

calculation is done as shown in Equation (6).

V S_{s} = V S_{p m} + V S_{c m},

(6)

The detailed assessment of the data persistence metrics for the Streaming database is available in Table 1, and that of the data content metrics in Table 2. The calculation of the Vulnerability Score for the Streaming service is shown in Equation (7).

V S_{s} (S t r e a m i n g S e r v i c e) = V S_{p m} + V S_{c m} = 140.8 + 71.6 = 212.4 .

(7)

Similarly, the assessment is carried out for each of the subsequent data storage layers. The results for each are shown below, excluding detailed columnar information for brevity.

Subscription management systemdata contain information related to users, subscriptions, and profiles. The resulting Vulnerability Score is shown in Equation (8).

V S_{s} (S u b s c r i p t i o n M a n a g e m e n t S e r v i c e) = 164.0 .

(8)

The backend video systemmodel contains data related to users, subscriptions, profiles, metadata, and stream details per user. The resulting

V S_{s}

is shown in Equation (9).

V S_{s} (B a c k e n d S t r e a m i n g H o s t i n g S e r v i c e) = 159.5 .

(9)

In our study, the Landing layer resulted in the highest value for

V S_{s}

as calculated by Equation (10). This was caused by the fact that all previous business application data was sent raw to the Landinglayer. This use case assumes the data are sent as files and stored in a semi-structured format, including imbricated sensitive information. The amount of data duplication due to multiple ingestion streams, together with semi-structured formatting implications, results in a value close to 600. While the score by itself does not have a predefined meaning in comparison to the rest of the data storage layers, it is evident that this layer inherently contains the highest risk. As a result, this can be a very clear indication of where the biggest efforts for securing the data should be.

V S_{s} (L a n d i n g) = 559.7 .

(10)

The Staging layer has a similar value for

V S_{s}

, but it is slightly reduced; see Equation (11). This is due to the fact that these data are now flattened in a normalized structured format. However, the layer still contains all the information from all previous sources in a raw format.

V S_{s} (S t a g i n g) = 535.9 .

(11)

The Historical data layer also has a high

V S_{s}

value comparable to Equations (10) and (11); see Equation (12). The slight decrease in the score is due to masking applied at this level.

V S_{s} (H i s t o r i c a l) = 524.7 .

(12)

The score of the Analytical layer is approximately 50% compared to the first three layers in the data warehouse; see Equation (13). This is due to modeling, centralization, aggregation, and further masking of the data.

V S_{s} (A n a l y t i c a l) = 285.2 .

(13)

For the Data Serving layer, we assume the creation of a subset of data for sharing purposes, containing a reduced number of columns of Analytical database. The score is available in Equation (14).

V S_{s} (D a t a S h a r i n g) = 206.5 .

(14)

Finally, the Ecosystem Vulnerability Score can be calculated by adding the scores of each of the evaluated data layers. This takes the assessed risk of the data ecosystem for this specific functionality to a significant value: Equation (15).

V S_{e} = \sum_{i = 1}^{m} V S_{s} = 2647.9 .

(15)

An overview of each of the layers’ vulnerability scores, as well as the ecosystem score, is available in Table 3.

4. Results

Illustrated by the example defined in the use case section, the proposed framework was used to create a clear overview of the vulnerability each piece of information possesses according to the defined metrics. These were defined on the lowest grain available in each data layer, such as per column in the case of structured data or per data structure for semi-structured and non-structured data. Using the equations defined in the vulnerability assessment framework, a vulnerability score was calculated for each of the data layers. This provided a comparable view of the risk attached to each of the storage layers in a data ecosystem, ultimately used to calculate the score of the entire ecosystem. This provides a more flexible assessment that can include each data layer involved in building end-to-end data pipelines and provides a more complete view of vulnerability across entire ecosystems.

The framework provided the flexibility to define subjective weights for data structures that can be more sensitive and carry inherently more risk for specific use cases, such as those defined in Table 1 and Table 2. This places the responsibility of accurately assessing the vulnerability of data content with the data owner while predefining a set of metrics to be used for data persistence, which can also be extended and aligned across data layers.

The results produced offer an easy way to observe the vulnerability of the ecosystem, as shown in Table 3. Actions to better safeguard these layers should be informed by the aforementioned overview.

5. Discussion

The suggested methodology is an effective way to address the issues of determining the perceived privacy risk across data storage layers and informing the necessary security measures for an ecosystem. The use of the vulnerability score for each data layer, as well as the ecosystem vulnerability score, supported by a spectrum of values, provides a clear and measurable way of determining the necessary security expenses. This score has a high potential to standardize vulnerability assessments due to its ease of calculation, simplicity, clarity, and generalizability for nearly all data storage systems and formats.

The presented approach improves on the initial methodology [7] created with a focus on machine learning and its respective data sources to a more comprehensive and flexible analysis of end-to-end data storage privacy concerns. By integrating this methodology into architectural decisions, such as decision analysis and resolution reports, choices of systems and architectures can be driven by a privacy-first approach. Furthermore, any data flow processes within an organization and across organizations can be led by defining data contracts and calculating the vulnerability score even before investing in developing these processes. This can result in reevaluating the necessity of sending certain sensitive data, reducing risk and also potentially reducing costs for securing these data.

The framework can be used for one-time assessments, which inform the current status of an ecosystem or one of its data layers, but it can also be integrated into a continuous view of a system’s vulnerability by means of automation. This will provide a more accurate view of a system’s vulnerability at any given point in time, as data structures and architecture evolve dynamically and constantly, as do regulatory environments and associated risks to already defined data. More research is necessary to create an architectural approach that will enforce an accurate and continuous evaluation of each data piece added or modified as the system changes over time.

While the methodology provides an already reliable and reproducible procedure for assessing perceived risk, more research is needed to enhance the standardization of the proposed metrics. As these metrics are clearly defined, the automatization of score calculation can be integrated into such a system to provide continuous awareness of potential risks.

Security and data privacy are highly volatile issues, making it paramount to keep corrected assessments up-to-date with the developments and research that might impact the values of the metrics involved and their respective weights. We also aim to further investigate the impact of using vulnerability score calculations in creating privacy-conducive data ecosystems, facilitating a privacy-first technique. This can result in data processing and sharing agreements that can have an objective measure to build on. Thus, it is imperative to properly assess the impact so that the right security measures are implemented, proportional to the resulting value.

6. Conclusions

The proposed vulnerability assessment framework offers a novel and robust methodology to assess and quantify the susceptibility of data pipelines. By integrating detailed data privacy analysis and identifying key metrics for data persistence and content, a comprehensive view of data vulnerability within architectures is established. The expanded focus on data privacy analysis emphasizes the importance of evaluating data layers for potential risks, allowing organizations to prioritize security measures effectively. Furthermore, the ecosystem vulnerability score provides a holistic perspective, aiding in the identification of high-risk data storage layers and facilitating informed decisions on data protection strategies. Ultimately, the framework equips organizations with valuable information to improve data privacy and security across various data formats and storage layers, ensuring robust safeguards against potential threats. Further research can extend the framework to enable continuous vulnerability assessment and inform a privacy-first approach to ecosystem architectures.

Author Contributions

Conceptualization, I.C. and R.E.; methodology, I.C.; validation, I.C., R.E., E.P., D.D., A.A. and O.M.; formal analysis, I.C.; investigation, I.C.; resources, I.C.; data curation, I.C.; writing—original draft preparation, I.C.; writing—review and editing, I.C.; visualization, I.C.; supervision, I.C. and O.M.; project administration, I.C.; funding acquisition, O.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Romanian National Authority for Scientific Research and Innovation, CCCDI–UEFISCDI, project number EUROSTARS-4-E!4691-GenDeg 1/2024, within PNCDI IV. This work was also funded by Romania’s National Recovery and Resilience Plan PNRR-III-C9-2022-I8, under grant agreement 760070, project “Collaborative Framework for Smart Agriculture”—COSA. This research has also been funded by the CLOUDUT Project, co-funded by the European Fund of Regional Development through the Competitiveness Operational Programme 2014–2020, contract no. 235/2020.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Oliviu Matei was employed by Holisun. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
CI	Continuous Integration
CD	Continuous Deployment
DM	Data Mining
ENCD	Encoded Data Metric
ENCR	Encrypted Data Metric
EXPL	Exploitability Data Metric
GDPR	General Data Protection Regulation
HPI	Personal Information Data Metric
HSI	Sensitive/Secret Information Data Metric
HNS	Nested Sensitive Information Data Metric
HUS	Unstructured Sensitive/Secret Information Data Metric
IDS	Intrusion Detection Systems
IoT	Internet of Things
MASK	Masked Data Metric
ML	Machine Learning
PPDM	Privacy Preserving Data Mining
PIAs	Privacy Impact Assessments
PII	Personal Identifiable Information
PRAs	Privacy Risk Assessments

Appendix A. Encryption Algorithms Score

Table A1 lists example values for several algorithms commonly used for data encryption. This table is designed to provide a handy reference rather than a comprehensive survey of encryption methods, so while other algorithms do exist, they are not included here. The best-performing algorithms are ranked closer to the value 2.0, with a larger value indicating better performance. In contrast, weaker or less secure algorithms are ranked closer to the value 1.0, which is considered equivalent to having no encryption.

Table A1. Encryption algorithm score.

Value	Algorithm
1.00	AES
1.10	ECC
1.20	3DES
1.30
1.40
1.50	RSA 1024
1.60
1.70
1.80	SHA-1
1.90	MD5
2.00	No Encryption

Appendix B. Exploitability Score

Exploitability assesses how a malicious person can use data for illicit purposes. Estimating values involves scaling the values relative to each other and relative to the most important and least important.

Table A2. Data exploitability score.

Value	Contents	Examples
1.00	Data with no direct or indirect link to specific persons, groups of persons, families, or neighborhoods	Environmental data (temperature)
1.10
1.20	Data with a weak potential link to persons	Land GPS coordinates that might be private property
1.30
1.40	Data with a strong potential link to persons	Anonymous information about houses that might be used to locate them (price or neighborhood)
1.50	Non-exploitable personal information (that might still have potential economic value), i.e., anonymized medical data	Height, hair color, eye color, blood test results, or preferences (food, beverages, or deserts)
1.60
1.70	Anonymized medical data with demographic/personal information	Cancer screening results or AIDS patient data
1.80	Low-exploitable personal information (that is easy to change or cannot be used to locate the person in real life)	Email (but no Password!) or mobile number
1.90	Non-anonymized medical data
2.00	Highly-exploitable and secret personal information, identity information, or national security-related information	Complete name, social security number, home address, home phone number, home building properties (e.g., intercom code), intimate personal photos, personal messages, sexual preferences, passwords, religious views, or political affiliation

References

Schäfer, F.; Gebauer, H.; Gröger, C.; Gassmann, O.; Wortmann, F. Data-driven business and data privacy: Challenges and measures for product-based companies. Bus. Horizons 2023, 66, 493–504. [Google Scholar]
Markos, E.; Peña, P.; Labrecque, L.I.; Swani, K. Are data breaches the new norm? Exploring data breach trends, consumer sentiment, and responses to security invasions. J. Consum. Aff. 2023, 57, 1089–1119. [Google Scholar]
Hassan, A.; Ahmed, K. Cybersecurity’s impact on customer experience: An analysis of data breaches and trust erosion. Emerg. Trends Mach. Intell. Big Data 2023, 15, 1–19. [Google Scholar]
Gupta, S.; Kumar, A.; Gupta, P. Security Landscape of a Strong Ecosystem to Protect Sensitive Information in E-Governance. Int. Res. J. Eng. Technol. (IRJET) 2023, 10, 362–369. [Google Scholar]
Zahid, R.; Altaf, A.; Ahmad, T.; Iqbal, F.; Vera, Y.A.M.; Flores, M.A.L.; Ashraf, I. Secure data management life cycle for government big-data ecosystem: Design and development perspective. Systems 2023, 11, 380. [Google Scholar] [CrossRef]
Carvalho, T.; Moniz, N.; Faria, P.; Antunes, L. Towards a data privacy-predictive performance trade-off. Expert Syst. Appl. 2023, 223, 119785. [Google Scholar]
Erdei, R.; Pasca, E.; Delinschi, D.; Avram, A.; Chereja, I.; Matei, O. Privacy assessment methodology for machine learning models and data sources. In Proceedings of the International Conference on Soft Computing Models in Industrial and Environmental Applications, Salamanca, Spain, 9–11 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 210–220. [Google Scholar]
Vatsalan, D.; Sehili, Z.; Christen, P.; Rahm, E. Privacy-preserving record linkage for big data: Current approaches and research challenges. In Handbook of Big Data Technologies; Springer: Berlin/Heidelberg, Germany, 2017; pp. 851–895. [Google Scholar]
Shelake, V.M.; Shekokar, N. A survey of privacy preserving data integration. In Proceedings of the 2017 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), Mysuru, India, 15–16 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 59–70. [Google Scholar]
Makridis, C.A. Do data breaches damage reputation? Evidence from 45 companies between 2002 and 2018. J. Cybersecur. 2021, 7, tyab021. [Google Scholar] [CrossRef]
Mayer, P.; Zou, Y.; Lowens, B.M.; Dyer, H.A.; Le, K.; Schaub, F.; Aviv, A.J. Awareness, Intention,(In) Action: Individuals’ Reactions to Data Breaches. ACM Trans. Comput.-Hum. Interact. 2023, 30, 1–53. [Google Scholar]
Johnson, M.; Kang, M.J.; Lawson, T. Stock price reaction to data breaches. J. Financ. Issues 2017, 16, 1–13. [Google Scholar]
Wolff, J.; Atallah, N. Early GDPR penalties: Analysis of implementation and fines through May 2020. J. Inf. Policy 2021, 11, 63–103. [Google Scholar]
Jung, K. Extreme data breach losses: An alternative approach to estimating probable maximum loss for data breach risk. N. Am. Actuar. J. 2021, 25, 580–603. [Google Scholar] [CrossRef]
Finck, M.; Pallas, F. They who must not be identified—distinguishing personal from non-personal data under the GDPR. Int. Data Priv. Law 2020, 10, 11–36. [Google Scholar]
Rodrigues, G.A.P.; Serrano, A.L.M.; de Oliveira Albuquerque, R.; Saiki, G.M.; Ribeiro, S.S.; Orozco, A.L.S.; Villalba, L.J.G. Mapping of data breaches in companies listed on the NYSE and NASDAQ: Insights and Implications. Results Eng. 2024, 21, 101893. [Google Scholar]
Ali, S.E.A.; Lai, F.W.; Dominic, P.; Brown, N.J.; Lowry, P.B.B.; Ali, R.F. Stock market reactions to favorable and unfavorable information security events: A systematic literature review. Comput. Secur. 2021, 110, 102451. [Google Scholar]
Rustad, M.L.; Koenig, T.H. Towards a global data privacy standard. Fla. Law Rev. 2019, 71, 365. [Google Scholar]
Tobin, P.; McKeever, M.; Blackledge, J.; Whittington, M.; Duncan, B. UK Financial Institutions Stand to Lose Billions in GDPR Fines: How can They Mitigate This? In Proceedings of the British Accounting and Finance Association Scottish Area Group Conference, BAFA, Aberdeen, UK, 24 August 2017; p. 6. [Google Scholar]
Zaeem, R.N.; Barber, K.S. The effect of the GDPR on privacy policies: Recent progress and future promise. ACM Trans. Manag. Inf. Syst. (TMIS) 2020, 12, 1–20. [Google Scholar]
Olabim, M.; Greenfield, A.; Barlow, A. A differential privacy-based approach for mitigating data theft in ransomware attacks. Authorea Prepr. 2024. [Google Scholar] [CrossRef]
Wairimu, S.; Iwaya, L.H.; Fritsch, L.; Lindskog, S. On the Evaluation of Privacy Impact Assessment and Privacy Risk Assessment Methodologies: A Systematic Literature Review. IEEE Access 2024, 12, 19625–19650. [Google Scholar] [CrossRef]
Renaud, K.; Shepherd, L.A. How to make privacy policies both GDPR-compliant and usable. In Proceedings of the 2018 International Conference On Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), Glasgow, Scotland, 11–12 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–8. [Google Scholar]
Sánchez, D.; Viejo, A.; Batet, M. Automatic Assessment of Privacy Policies under the GDPR. Appl. Sci. 2021, 11, 1762. [Google Scholar] [CrossRef]
Georgiadis, G.; Poels, G. Towards a privacy impact assessment methodology to support the requirements of the general data protection regulation in a big data analytics context: A systematic literature review. Comput. Law Secur. Rev. 2022, 44, 105640. [Google Scholar]
Gjermundrød, H.; Dionysiou, I.; Costa, K. privacyTracker: A privacy-by-design GDPR-compliant framework with verifiable data traceability controls. In Current Trends in Web Engineering: ICWE 2016 International Workshops, DUI, TELERISE, SoWeMine, and Liquid Web, Lugano, Switzerland, June 6–9, 2016; Revised Selected Papers 16; Springer: Berlin/Heidelberg, Germany, 2016; pp. 3–15. [Google Scholar]
Agarwal, S. Developing a structured metric to measure privacy risk in privacy impact assessments. In IFIP International Summer School on Privacy and Identity Management; Springer: Berlin/Heidelberg, Germany, 2015; pp. 141–155. [Google Scholar]
Caruccio, L.; Desiato, D.; Polese, G.; Tortora, G. GDPR compliant information confidentiality preservation in big data processing. IEEE Access 2020, 8, 205034–205050. [Google Scholar]
Samaraweera, G.D.; Chang, J.M. Security and Privacy Implications on Database Systems in Big Data Era: A Survey. IEEE Trans. Knowl. Data Eng. 2021, 33, 239–258. [Google Scholar] [CrossRef]
Kenthapadi, K.; Mironov, I.; Thakurta, A.G. Privacy-preserving Data Mining in Industry. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, Melbourne, VIC, Australia, 11–15 February 2019; pp. 840–841. [Google Scholar] [CrossRef]
Vinothkumar, J.; Santhi. A Brief Survey on Privacy Preserving Techniques in Data Mining. IOSR J. Comput. Eng. 2016, 18, 47–51. [Google Scholar]
Mendes, R.; Vilela, J.P. Privacy-Preserving Data Mining: Methods, Metrics, and Applications. IEEE Access 2017, 5, 10562–10582. [Google Scholar] [CrossRef]
Oliveira, M.I.S.; Oliveira, L.E.R.; Batista, M.G.R.; Lóscio, B.F. Towards a meta-model for data ecosystems. In Proceedings of the 19th Annual International Conference on Digital Government Research: Governance in the Data Age, Delft, The Netherlands, 30 May–1 June 2018; pp. 1–10. [Google Scholar]
Gartner. Data Ecosystems to Support All Your Data Workloads. 2025. Available online: https://www.gartner.com/en/data-analytics/topics/data-ecosystem (accessed on 9 March 2025).
Ramalli, E.; Pernici, B. Challenges of a Data Ecosystem for scientific data. Data Knowl. Eng. 2023, 148, 102236. [Google Scholar]
Ali-Eldin, A.; Zuiderwijk, A.; Janssen, M. A Privacy Risk Assessment Model for Open Data. In Proceedings of the Business Modeling and Software Design; Shishkov, B., Ed.; Springer: Cham, Switzerland, 2018; pp. 186–201. [Google Scholar]
Raja, V.; Chopra, B. Exploring challenges and solutions in cloud computing: A review of data security and privacy concerns. J. Artif. Intell. Gen. Sci. (JAIGS) 2024, 4, 121–144. [Google Scholar]
Bayani, S.V.; Prakash, S.; Shanmugam, L. Data guardianship: Safeguarding compliance in AI/ML cloud ecosystems. J. Knowl. Learn. Sci. Technol. 2023, 2, 436–456. [Google Scholar]
Kheddar, H.; Dawoud, D.W.; Awad, A.I.; Himeur, Y.; Khan, M.K. Reinforcement-Learning-Based Intrusion Detection in Communication Networks: A Review. IEEE Commun. Surv. Tutor. 2024. [Google Scholar] [CrossRef]
Hackl, W.O.; Neururer, S.B.; Schweitzer, M.; Pfeifer, B. Making a Virtue of Necessity-A Highly Structured Clinical Data Warehouse as the Source of Assured Truth in a Hospital. In dHealth; IOS Press: Amsterdam, The Netherlands, 2023; pp. 180–185. [Google Scholar]
Blažić, G.; Poščić, P.; Jakšić, D. Data warehouse architecture classification. In Proceedings of the 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 22–26 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1491–1495. [Google Scholar]

Figure 1. Integrated layers of connected data.

Figure 2. Vulnerability points for data storage layers.

Figure 3. Use case: video-on-demand streaming architectural overview.

Table 1. Data persistence metrics assessment for the streaming service database.

Field Name	Data Persistence Metrics
	ENCD	ENCR	MASK	VSpm
user (internal) identifier	2.0	2.0	2.0	8.0
user login	2.0	2.0	2.0	8.0
user password	2.0	1.2	2.0	4.8
user email	2.0	2.0	2.0	8.0
user name	2.0	2.0	2.0	8.0
user surname	2.0	2.0	2.0	8.0
user nickname	2.0	2.0	2.0	8.0
user birthdate	2.0	2.0	2.0	8.0
user gender	2.0	2.0	2.0	8.0
user address	2.0	2.0	2.0	8.0
user city	2.0	2.0	2.0	8.0
user country	2.0	2.0	2.0	8.0
user postcode	2.0	2.0	2.0	8.0
subscription (internal) identifier	2.0	2.0	2.0	8.0
subscription user identifier	2.0	2.0	2.0	8.0
subscription plan	2.0	2.0	2.0	8.0
subscription startdate	2.0	2.0	2.0	8.0
subscription status	2.0	2.0	2.0	8.0
			Total $V S_{p m}$	140.8

Table 2. Data content metrics assessment for the streaming service database.

Field Name	Data Contents Metrics
	EXPL	HPI	HSI	HNS	HUS	VScm
user (internal) identifier	1.5	1.0	0.0	0.0	0.0	3.0
user login	1.8	1.0	1.0	0.0	0.0	5.4
user password	2.0	1.0	1.0	0.0	0.0	6.0
user email	1.8	1.0	1.0	0.0	0.0	5.4
user name	1.8	1.0	0.0	0.0	0.0	3.6
user surname	1.8	1.0	0.0	0.0	0.0	3.6
user nickname	1.5	1.0	0.0	0.0	0.0	3.0
user birthdate	1.8	1.0	1.0	0.0	0.0	5.4
user gender	2.0	0.5	1.0	0.0	0.0	5.0
user address	2.0	1.0	1.0	0.0	0.0	6.0
user city	2.0	1.0	1.0	0.0	0.0	6.0
user country	1.8	0.5	0.0	0.0	0.0	2.7
user postcode	2.0	1.0	1.0	0.0	0.0	6.0
subscription (internal) identifier	1.5	1.0	0.0	0.0	0.0	3.0
subscription user identifier	1.5	1.0	0.0	0.0	0.0	3.0
subscription plan	1.2	0.0	0.0	0.0	0.0	1.2
subscription startdate	1.4	0.5	0.0	0.0	0.0	2.1
subscription status	1.2	0.0	0.0	0.0	0.0	1.2
					Total $V S_{c m}$	71.6

Table 3. Vulnerability score overview.

Data Layer	Vulnerability Score
Streaming Service	212.4
Subscription Management Service	164.0
Backend Streaming Hosting Service	159.5
Landing	559.7
Staging	535.9
Historical	524.7
Analytical	285.2
Sharing	206.5
Ecosystem	2647.9

Complete calculations are available at https://ionela-chereja.github.io/EcosystemVulnerabilityScore/ (accessed on 27 March 2025).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chereja, I.; Erdei, R.; Pasca, E.; Delinschi, D.; Avram, A.; Matei, O. A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures. Mathematics 2025, 13, 1116. https://doi.org/10.3390/math13071116

AMA Style

Chereja I, Erdei R, Pasca E, Delinschi D, Avram A, Matei O. A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures. Mathematics. 2025; 13(7):1116. https://doi.org/10.3390/math13071116

Chicago/Turabian Style

Chereja, Ionela, Rudolf Erdei, Emil Pasca, Daniela Delinschi, Anca Avram, and Oliviu Matei. 2025. "A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures" Mathematics 13, no. 7: 1116. https://doi.org/10.3390/math13071116

APA Style

Chereja, I., Erdei, R., Pasca, E., Delinschi, D., Avram, A., & Matei, O. (2025). A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures. Mathematics, 13(7), 1116. https://doi.org/10.3390/math13071116

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Privacy Assessment Framework for Data Tiers in Multilayered Ecosystem Architectures

Abstract

1. Introduction

2. Materials and Methods

2.1. Background

2.2. Related Work

2.3. Data Ecosystems

2.4. Data Flow Overview

2.5. Proposed Vulnerability Framework

2.6. Vulnerability Score

2.7. Ecosystem Vulnerability Score

2.8. Implications for Vulnerability Scores

3. Use Case

3.1. Video-on-Demand Data Ecosystem

3.2. Use Case Ecosystem Overview

3.3. Vulnerability Assessment Calculations

4. Results

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Encryption Algorithms Score

Appendix B. Exploitability Score

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI