1. Introduction
As the world becomes increasingly digitized, we are generating more data than ever before. These data can come from a variety of sources, including social media, user behavior and tracking, and wearable technology. The majority of the systems that support modern interconnected life are digitized, and they need to collect, store, and process this information. Although the digital revolution brings many benefits, it also raises important questions about privacy [1]. Specifically, who has access to these data, what is done with them, and how can they be protected? These questions are paramount when the data in question can infringe on someone's right to privacy by including information that can identify an individual.
Cybersecurity threats and attacks have become more frequent, and data breaches have been called "the new norm" [2]. Breaches affect businesses across a multitude of sectors and change how consumers relate to them, depending on the type of data compromised. Balancing cybersecurity with user-friendly interfaces in the current digital landscape has become increasingly imperative for organizations seeking to protect sensitive data and satisfy customer expectations. Companies aim to implement systems that are easy to use while safeguarding them against attacks that could erode the trust of their users. The security measures needed to protect data extend beyond the storage and processing of the data itself to the customer experience [3]. Protecting the data that clients entrust to companies is therefore a concern that is central to the entire ecosystem, and it becomes a critical part of that system's architecture.
As organizations seek to implement data mining (DM), machine learning (ML), and other artificial intelligence (AI) and analytical pipelines to extract data-driven insights and overviews, the privacy assessment of that entire data flow becomes paramount. Not only is the security of the source system and its data storage important, but each component of the analytical pipeline that processes or accesses those data is equally important [4]. As a result, data privacy practices, the data management life cycle, and the architecture of the entire ecosystem, from source(s) to visualization, must be evaluated [5]. Any adjustment of that architecture in favor of increasing privacy could affect the performance and cost efficiency of the systems [6].
An automated methodology is necessary to strike the delicate balance between privacy and efficiency in such complex analytical pipelines. This methodology should account for the need to dynamically adjust privacy parameters based on the sensitivity of the data processed while being adaptive to diverse hardware capacities and computing costs. This paper expands on previously published work [7], which proposes a methodology to evaluate the level of privacy assigned to a DM pipeline. Our objective is to broaden the assessment of source data to cover the vulnerability of structured, semi-structured, and non-structured data and to extend the proposed vulnerability score to each layer of the entire ecosystem architecture. This work thus proposes an enhanced framework that can be used in multi-layered ecosystem architectures. The framework covers the implications of using different data types and their respective storage engines, factors the consequences of storing and duplicating data in separate layers of such an architecture into the total vulnerability score, and adjusts the privacy parameters of a pipeline based on the computed score through a privacy-by-design architectural approach. At the time of publication, to the best of our knowledge, no such framework is available.
The paper follows a structured approach, beginning with a look at previous work, which includes background information related to data privacy concerns, an overview of the related work available at the time of research, and a look at data ecosystems and data flows. We then detail the proposed methodology, introducing the vulnerability score and the data persistence and content metrics. Subsequently, we define the Ecosystem Vulnerability Score and its calculation. The following section discusses a practical application of the methodology and the obtained results, after which we examine limitations and further research in Section 5. Finally, Section 6 synthesizes the applications of the methodology and the impact on data privacy in complex data ecosystems.
2. Materials and Methods
2.1. Background
Data processing comprises almost all ways in which personal data can be used, for example, collection, storage, deletion, alteration, analysis, transfer, viewing, and anonymization. In practice, anything that is done with personal data is considered processing. Not only is data privacy preservation of critical relevance, but its complexity increases with the use of data processing operations such as data linkage [8] or data integration [9]. Such manipulation of data can even introduce new sensitive information due to the generation of new patterns that can identify individuals.
Regulations such as GDPR define and categorize types of data into predefined categories, making it clearer for the entities that fall under their scope which types of data need to be processed for privacy compliance. However, depending on a business's area of activity, there may be additional types of data that can be considered sensitive and whose exposure could inflict negative consequences in the event of a data breach [10,11,12,13]. Taking that into account, each company should have a way to assess the vulnerability of its data not only by regulatory standards such as those enforced by GDPR but also by the internally recognized vulnerability or sensitivity of specific information.
In previous work on vulnerability assessment [7], we focused on proposing a privacy assessment methodology for systems that use data mining and machine learning to extract data insights. That methodology can be generally applied to any machine learning or data mining model and its corresponding source. However, to properly assess the vulnerability faced by an entity using multiple systems, it is necessary to include a view of the multiple stages that data go through in an ecosystem architecture. As privacy is a critical concern for any digital system, the topic has attracted substantial research, and different approaches have been used to estimate the impact of data processing. An approach to estimate the maximum probable loss in the event of a data breach is proposed by K. Jung [14]. By using this framework, the consequences of a potential data breach, expressed as the maximum data loss, inform the data controller or processor about the extent of the necessary data protection mechanisms.
Each entity that processes data in order to provide its service becomes responsible for assessing the type of data it stores in order to accurately implement data protection mechanisms. The sensitivity and value of certain types of data, or combinations of data [15], should be ascertained properly. Data belonging to companies listed on the stock exchange are considered highly valuable, making them an attractive target for data misappropriation and theft [16]. Security events in publicly traded companies can have a direct monetary implication, visible as changes in the stock price [17]. However, each company that stores client data is a potential target for cyber-attacks and is responsible for assessing its vulnerability and the attached value. Not only is there an impact on the reputation of a company that has encountered a breach [10], on the reactions of individuals affected by it [11], and on market value [12], but companies can also be subjected to penalties and fines under certain legal regulations [13].
There are various regulations and global standards that aim to safeguard data privacy and prevent cyber crimes against data [18]. Companies that fall under the jurisdiction of these regulations are obligated to comply with the laws and are subject to audits that can lead to high fines [19]. The General Data Protection Regulation (GDPR) is considered one of the most important such regulations on data privacy [20], as it provides explicit guidelines underlining compliance and non-compliance factors. However, beyond categorizing a particular set of data as compliant or noncompliant, data processors struggle to find a reliable way of assessing data vulnerability, both in raw form and after protection mechanisms have been implemented. Furthermore, while differential privacy [21] mechanisms can be used to mitigate risk, some organizations need to store sensitive data for specific use cases. Therefore, an approach that also allows for these cases is necessary.
2.2. Related Work
Research in the area of data privacy and risks has been a matter of concern over the years, with an increase in studies beginning in 2018 that could be attributed to the implementation of the EU GDPR [22]. Research has been carried out on GDPR compliance, such as the usability of privacy policies under GDPR rules [23], automatic assessment of these policies [24], impact assessment methodology to support its requirements [25], and the introduction of tools like privacyTracker [26] aimed at providing support for GDPR principles, such as data traceability. Although these approaches provide ways to assess privacy risks, they focus specifically on the rules of compliance with GDPR. As such, these assessments are too narrowly focused on predefined regulations and cannot be reliably applied outside of them, as they do not readily provide the flexibility to define case-specific rules. This hinders the ability of data processors to expand the definitions of sensitive or high-risk data, overlooking the distinctive characteristics of organizations, which results in inadequate privacy risk assessments [27].
In [28], Caruccio et al. propose a methodology that analyzes data correlations in relaxed functional dependencies to identify new, possibly sensitive data that could break privacy rules during significant big data operations. While accounting for the introduction of new data associations in a big data system, this methodology is also constrained to the rules of GDPR. Security and privacy issues in relation to Big Data database systems have been investigated in several dimensions to identify and evaluate different standard security mechanisms, offering suggestions for future enhancements [29]. This research includes a look at relational, NoSQL, and NewSQL databases and the maturity of security implementations in these models. Although this approach to security and privacy issues can account for several types of storage systems, it focuses on standard ways to evaluate and implement mechanisms that enhance security rather than on assessing the vulnerability risks.
Focusing on the data analysis step and insight into the data processing flow, there are several approaches to assess vulnerability and privacy. Approaches such as privacy-preserving data mining (PPDM) [30,31] have been explored through a comprehensive analysis of privacy breaches throughout history, highlighting key regulations and legislation that have influenced the evolution of privacy management strategies. Ricardo et al. [32] have classified existing privacy metrics into privacy level, data quality, and complexity metrics to quantify data security, information loss, and technique efficiency. The authors emphasize the need for scalable and efficient PPDM solutions to handle increasing amounts of data. This work focuses on the data flow for data analysis, particularly data mining, but does not account for the entire path that the data travel from the source in order to become available for analysis.
A systematic view of privacy risk assessments has been proposed by Wairimu et al. [22] through a literature review of privacy impact assessments (PIAs) and privacy risk assessments (PRAs). While these methodologies are essential in privacy by design, they have not yet been rigorously and systematically evaluated in practice. According to the findings of this research, privacy impact and risk assessment methodologies exhibit shortcomings, either due to strong reliance on specific regulatory frameworks (e.g., GDPR) or by incurring significant overhead. Moreover, select methodologies confine the assessment of privacy risk primarily to potential harm to data subjects, leaving broader data vulnerability concerns unaddressed.
While there are approaches to assessing vulnerabilities in relation to privacy, they are either focused on parts of an ecosystem and not applicable to the whole, or they are tied to specific rules and regulations and not flexible enough to adjust to more stringent or more permissive requirements.
In the current study, we propose a privacy assessment framework that aims to decouple the assessment of vulnerability from legislative determinants and specific security measures. Its primary objective is to furnish a quantifiable metric capable of assessing and contrasting vulnerability attached to storing non-transient datasets across the entirety of a complex data ecosystem, regardless of the type of storage system being used. The framework is designed to provide a repeatable, reliable and flexible methodology for interconnected data layers that can be defined according to standardized rules and regulations or adjusted to more stringent or more permissive requirements. These scores will help identify the most vulnerable data layers in an ecosystem and choose appropriate measures in proportion to the identified vulnerabilities.
2.3. Data Ecosystems
A data ecosystem is a novel setting that consists of intricate networks composed of individuals and organizations that exchange and utilize data as the primary resource. These ecosystems offer an environment for fostering, handling, and sustaining data-sharing initiatives [33]. Data ecosystems, as defined by Gartner, "unify data management components distributed across clouds and/or on-premises" [34]. Data ecosystems have data at their core. They facilitate subsequent integrations across different components, such as applications, marketplaces and exchanges, edge computing, event processing, and advanced analytics and data science (AI/ML), all of which contain their own data layer.
Efficiently scaling data, analytics, and AI requires the integration of ecosystems consisting of complementary capabilities necessary to support all data workloads. An effective data ecosystem is one that can accommodate large amounts of data and enable powerful analytics and AI capabilities through seamless integration.
The challenges of data ecosystems have been explored in the work of Ramalli and Pernici [35] in the context of scientific data. One of the identified challenges revolves around confidentiality. This provides a valid example of data that would not necessarily and objectively qualify as private due to personally identifiable qualities but would, in this context, be marked as sensitive and confidential information. Similarly, opening data for sharing introduces a major concern of privacy infringement, thus limiting the potential benefits of usage [36].
A look at the secure data management life cycle for data ecosystems has been proposed by Zahid et al. [5]. This paper presents a secure data management life cycle (SDMLC) framework for managing data throughout their entire life cycle, including creation, processing, storage, usage, sharing, archiving, and destruction or reuse. The framework addresses challenges specific to each stage of the data life cycle. It also identifies several major concerns, including privacy, liability, and security. Therefore, a proper assessment of the vulnerability of the data is paramount. With the methodology presented in this paper, we focus on data persistence vulnerabilities in particular.
Hybrid use and multi-cloud are key to any data-ecosystem plan [34]. This implies that all data privacy vulnerabilities have to be assessed in the cloud context as well. The security and privacy challenges of cloud computing are recurrent topics of research [37], including the exploration of data breaches and unauthorized access and the assessment of their impact on user trust and data integrity within cloud infrastructure. The use of AI/ML systems in the cloud introduces yet another layer of concern for compliance and data vulnerability [38].
2.4. Data Flow Overview
In an enterprise-level ecosystem, data flow through multiple layers. Whereas in the past, an application architecture would commonly be three-tiered, the focus on data collection and analysis has led to different architectural approaches per concern, as well as multiple points of integration between these systems. Following the flow of data from generation to analytics visualization, it becomes clear that not only are much more data being created in each system, but they are also potentially transferred, processed, and duplicated in each of these layers. Furthermore, the type of data handled is much more diverse (structured, semi-structured, and non-structured), and each processing and storage step can generate aggregations and links that could generate even more sensitive data. An overview of the different data layers that could be part of a system's architecture and their respective types is represented in Figure 1.
2.5. Proposed Vulnerability Framework
A comprehensive vulnerability assessment should provide a framework that can include all applicable data storage layers and all data types and that can assess the vulnerability of all of these through a unified score. This provides an advantage over other privacy or vulnerability assessment frameworks by offering one way to calculate a comparable risk score for every data step involved. This flexibility enables an end-to-end understanding of how the data flow and of each layer in terms of risk, and it also highlights where in the entire ecosystem the biggest risks lie. Furthermore, for ecosystems that are not designed to use differential privacy approaches, such a framework can help inform where sensitive data need to be protected if they cannot be sampled, fully anonymized, or removed.
Given that each component in an architecture is susceptible to different types of attack, it is important to clarify the boundaries of the proposed framework. Intrusion detection systems (IDS) [39] are essential components within cybersecurity strategies, designed to monitor network or system activities for malicious actions or policy violations. While traditional IDS solutions effectively identify unauthorized activities, integrating a privacy-focused assessment framework can significantly enhance an organization's overall data protection strategy. Assessing vulnerabilities in data storage layers, as proposed by our methodology, complements IDS capabilities by identifying data sensitivity points that require heightened monitoring and proactive defense mechanisms. The fundamental question to be addressed is: what sensitive data can be exposed or maliciously extracted by a potential attacker at each step of the data flow? Most of the processing steps are transient in nature, making their attack window smaller than that of the storage layers. Therefore, we focus the vulnerability assessment framework on persistent layers, as these offer the largest window of opportunity for an attack. In Figure 2, an overview of the vulnerability points is depicted for a data flow from source to delivery of information.
Each of the steps illustrated in Figure 2 might have one or multiple vulnerability points due to the persistence of potentially sensitive data. We label as persistent data any data that have been stored in any format (structured, semi-structured, or non-structured) in any data storage system. This includes logging data persisted during the data integration step, as well as data extracts or any form of persistent data caching offered by the data visualization tool, all of which could store sensitive information for an extended period of time.
2.6. Vulnerability Score
As introduced in [7], the Vulnerability Score (VS) is an integrated privacy score used to provide a clear assessment of the inherent vulnerability of a data pipeline. It takes into account several metrics on which the data can be evaluated. These metrics are classified under two main dimensions: metrics for how the data are persisted and metrics for the contents of the data. Taking these two components into account, a more comprehensive look at the vulnerability of the data can be formed and calculated into a score that can inform the extent of the potential risk attached to a set of data.
Our previous work focuses on data mining or machine learning (ML) pipelines, introducing the calculation of VS through the use of two analyses: a data privacy analysis and an ML model privacy analysis. We propose an extension of the framework focused on the data privacy analysis, which allows the assessment of each of the aforementioned persistent data layers across the entire architecture. For environments where ML/data mining has not (yet) been implemented, the total vulnerability score can therefore include only the data privacy analysis for the end-to-end data flow; where applicable, it combines both the data privacy analysis and the ML model privacy analysis.
Data privacy analysis focuses on the data, auditing each piece of information one by one and generating a series of privacy metrics. Considering that the data being stored can be of different types, it is relevant to include within the metrics a distinction that can quantify the vulnerability impact for structured, semi-structured, and non-structured (unstructured) data.
The following metrics quantify a set of relevant questions that should be compiled into a unified vulnerability score. Each of these can be assessed for all the data formats mentioned. In the case of structured data, the evaluation of the metrics is carried out for each column stored in the database. In the case of semi-structured data, such as JSON formats, XML, HTML, or graph data sets, the same metrics will be evaluated, with the addition of an extra data content metric focusing on nested sensitive or secret information. This will therefore quantify not only the presence of sensitive information but also the added complexity of maintaining and potentially having to remove the secret piece of information while attempting to retain the rest of it in its original form.
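As an illustration of how the extra data content metric for nested information could be evaluated in practice, the following is a minimal Python sketch that locates sensitive keys inside a parsed semi-structured document. The key list and function names are hypothetical assumptions for illustration, not part of the framework's specification.

```python
# Minimal sketch: locating nested sensitive information in a semi-structured
# document (parsed JSON). The SENSITIVE_KEYS set is a hypothetical example;
# in practice it would be defined by the data owner per the framework.

SENSITIVE_KEYS = {"name", "email", "address", "date_of_birth", "iban"}

def nested_sensitive_paths(node, path=""):
    """Return the paths of all sensitive keys found anywhere in the document."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            full_path = f"{path}.{key}" if path else key
            if key.lower() in SENSITIVE_KEYS:
                hits.append(full_path)
            hits.extend(nested_sensitive_paths(value, full_path))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(nested_sensitive_paths(item, f"{path}[{i}]"))
    return hits

record = {"subscriber": {"email": "jane@example.com",
                         "profile": {"date_of_birth": "1990-01-01"}}}
print(nested_sensitive_paths(record))
# ['subscriber.email', 'subscriber.profile.date_of_birth']
```

The paths returned by such a scan indicate how deeply the sensitive values are embedded, which is exactly the maintenance and removal complexity that the nested metric is meant to capture.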
Non-structured or unstructured data refer to data sets that lack a definable structure or are not organized in a way that is easily searchable or analyzed. Examples of unstructured data include text documents, emails, multimedia files, and social media posts from various platforms that contain both text and visual information. As unstructured data are more prevalent and more complex, potentially containing a mix of highly sensitive information, we include a separate metric for unstructured sensitive/secret information to properly ascertain its impact in terms of vulnerability.
2.7. Ecosystem Vulnerability Score
Calculating a vulnerability score for each data storage layer in a data architecture implies that the score can include metrics for different data types that could contain sensitive information. Therefore, we included two data content metrics, which refer specifically to semi-structured and non-structured information. Commonly encountered formats such as JSON can be used in different layers and stored within a different storage form as well. A raw data layer could contain files with a semi-structured JSON format that could store sensitive data. While structured columnar models could have processes that obfuscate or completely remove sensitive information, in the case of semi-structured data, this process becomes particularly difficult. As a result, we propose a separate metric (HNS) that increases the vulnerability score of the data model for such cases.
In a similar way, non-structured data containing sensitive or secret information can further increase the vulnerability score. For example, a non-structured document such as a PDF that contains sensitive information that cannot be easily removed or hidden selectively will inherently carry more risk than a similar piece of information that can be extracted, stored separately, or even removed completely without impacting the rest of the contained information. The HUS metric will be used in such cases to quantify this additional risk.
The following equations describe a method to calculate the vulnerability risk of a specific data layer (Equation (4)) and the total ecosystem score (Equation (5)):

$$VS = \sum_{i=1}^{n} DPM_i \times DCM_i \qquad (4)$$

$$EVS = \sum_{j=1}^{m} VS_j \qquad (5)$$

where
$DPM_i$ is the calculated score for the data persistence metrics of field $i$ in the data storage layer;
$DCM_i$ is the calculated score for the data content metrics of field $i$ in the data storage layer;
$VS$ is the calculated vulnerability score of the data storage layer (using $DPM$ and $DCM$);
$EVS$ is the data ecosystem vulnerability score of the totality of data layers involved in the ecosystem;
$n$ is the total number of fields in the dataset;
$m$ is the total number of data storage layers in the ecosystem.
The calculations are designed with the presumption that the data persistence metrics, which capture the inherent vulnerability of how the data in question are stored, hold a higher weight in the final score. A piece of information that is revealed in plain form, for example, will have a significantly higher impact on the vulnerability score; therefore, the values assigned (1–2, as previously defined) are used as multipliers in the calculation to reflect this effect. Similarly, the exploitability metric (EXPL) holds a higher weight, as it signifies the extent to which these particular data can be of use, and consequently damaging, to the exposed party; it further qualifies the vulnerability of the data contents and is reflected in the impact factor by multiplying it with the sum of the content metrics. We chose this approach to underline that, even for exposed sensitive information, the overall impact is weighted most heavily by the way the data are persisted.
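To make the calculation concrete, here is a minimal Python sketch of Equations (4) and (5) under the symbol names used above. The class and function names are illustrative assumptions; actual metric values come from the framework's persistence and content metric tables.

```python
# Minimal sketch of Equations (4) and (5). Names are illustrative only.

from dataclasses import dataclass

@dataclass
class FieldAssessment:
    persistence: float  # DPM_i: persistence metrics score (1-2 values act as multipliers)
    content: float      # DCM_i: EXPL multiplied by the sum of the content metrics

def layer_vs(fields: list[FieldAssessment]) -> float:
    """Equation (4): VS of one data storage layer, summed over its n fields."""
    return sum(f.persistence * f.content for f in fields)

def ecosystem_evs(layers: dict[str, list[FieldAssessment]]) -> float:
    """Equation (5): EVS as the sum of the VS of all m storage layers."""
    return sum(layer_vs(fields) for fields in layers.values())
```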
2.8. Implications for Vulnerability Scores
A higher VS implies a more vulnerable data storage layer. Depending on the type of data, whether personally identifiable, sensitive, or critically important for the organization, more stringent security measures must be taken into account. These must be correlated with the scores, and the resulting data systems must be reassessed. Although the methodology does not yet provide guidance on how to improve protection measures, it includes several metrics for potential measures as part of the data persistence metrics.
In the context of calculating the Ecosystem Vulnerability Score, a relevant consideration is the amount of data duplication and the number of data layers that store sensitive or important information. A clear way to determine which layer carries the highest amount of risk is to calculate the VS value for each layer and compare the resulting numbers. If the assessment of the metrics and of functionally similar data is applied consistently, this calculation can provide a very accurate overview throughout an ecosystem.
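Continuing the sketch from Section 2.7 with purely illustrative numbers (not the scores from the use case below), comparing layers reduces to ranking their VS values:

```python
# Purely illustrative values, not the scores from the use case.
layers = {
    "Landing":    [FieldAssessment(2.0, 8.0), FieldAssessment(2.0, 6.0)],
    "Staging":    [FieldAssessment(1.8, 8.0), FieldAssessment(1.8, 6.0)],
    "Analytical": [FieldAssessment(1.2, 5.0), FieldAssessment(1.2, 4.0)],
}
scores = {name: layer_vs(fields) for name, fields in layers.items()}
print(scores)                       # VS per layer
print(max(scores, key=scores.get))  # most vulnerable layer: 'Landing'
print(ecosystem_evs(layers))        # EVS for the whole ecosystem
```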
3. Use Case
3.1. Video-on-Demand Data Ecosystem
This use case provides a partial architectural view of a simplified video-on-demand system, with the focus on using several data storage layers to explore the calculation of the Ecosystem Vulnerability Score. Specific details and project names are not provided to protect against potential attacks resulting from describing architectural information.
Data stored by services that offer media for user consumption, such as a video-on-demand system, imply the flow of data over several interconnected applications. These data, while operationally split and stored across multiple platforms that service different components of the system, are eventually meshed together in a data warehouse or data lake whose objective is to provide a single source of truth [40]. These data likely contain highly sensitive information such as names, email addresses, residential addresses, dates of birth, gender information, financial details such as bank account numbers, as well as detailed viewing behavior linked to video metadata and defined preferences. As such data are collected, processed, stored, and transferred across multiple layers, each storage layer should have an individually calculated VS. Once this calculation is performed, the EVS can be computed for the entire ecosystem.
Depending on the extent of the data collected and cross-linked in such a warehouse, the number of sensitive columns can range from dozens to hundreds. Also, considering the most common data warehouse architectures [41], it is not uncommon that, in a two-layer or three-layer architecture, the data are duplicated in each individual layer. This can therefore drive the total VS of the warehouse layer to extreme values. The security and privacy implications of managing such a data warehouse become immediately apparent, and a reliable approach is required to assign the right vulnerability scores and the necessary measures to secure it properly.
We propose a use case for assessing the vulnerability score using a simplified overview of a video-on-demand store. We chose this example as a relevant use case because it includes multiple areas that contain sensitive processes and information, such as subscription management, payments, viewer behavior tracking, recommendation information, and potentially sensitive or restricted audio and video files, among many others. Due to space limitations, the use case focuses mainly on the subscription management flow and omits less relevant information. It includes a depiction of some of the most relevant data regarding subscription information, as well as information about the subscriber and its attached sensitive details. We also aim to offer a clear view of the amount of data duplication that occurs in such an interconnected and interdependent ecosystem, from the supporting systems to the analytical warehousing layer.
3.2. Use Case Ecosystem Overview
In Figure 3, three main components are depicted to represent the way data flow between multiple applications and systems.
Within business applications, the streaming service is the customer-facing system. This system stores data in its own database for the purpose of providing access, showing video information and related metadata, recording streaming information related to the user who accessed the system, etc. The streaming service uses two back-end systems related to subscription management functionality: a backend video system and a subscription management system. Data from both systems are sent to the streaming service and are stored independently on each of these layers.
Data from each of the business applications flow to the data warehouse layer, within which there are four sequential steps. The first step is for each of the systems to send data independently to be stored in a raw file with a semi-structured format in the Landing component. The following step implies the flattening and normalization of the data into a structured format, loading them into the Staging database, after which they are stored again in a separate layer for the purpose of keeping history. This implies another copy of the data in the Historical layer. Finally, within the data warehouse, data are modeled, aggregated, centralized, and connected in a single overview in the Analytical layer. The loading of the data into the Analytical layer contains steps for obfuscating sensitive data with procedures such as encryption and masking.
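As a hedged illustration of what the obfuscation step during the Analytical load could look like, the sketch below masks an email address and pseudonymizes an identifier. The field names, the salt, and the masking policy are assumptions for illustration; a production system would rely on proper key management and organization-defined masking rules.

```python
# Minimal sketch of masking/pseudonymization during the Analytical load.
# Field names and the salt are hypothetical placeholders.

import hashlib

def mask_email(email: str) -> str:
    """Mask the local part of an email, keeping the domain for analytics."""
    local, _, domain = email.partition("@")
    return f"{'*' * len(local)}@{domain}"

def pseudonymize(value: str, salt: str = "per-deployment-secret") -> str:
    """Replace an identifier with a stable salted-hash pseudonym."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

staging_row = {"subscriber_id": "12345", "email": "jane.doe@example.com"}
analytical_row = {
    "subscriber_id": pseudonymize(staging_row["subscriber_id"]),
    "email": mask_email(staging_row["email"]),
}
print(analytical_row)  # e.g., {'subscriber_id': '…', 'email': '********@example.com'}
```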
The final layer concerns data serving, which includes three different processes for this use case, all of which can be executed in parallel. For the dashboards/visualizations process, we assume that the systems consume data directly from the Analytical layer. Without a separate storage layer, this component will not be part of the Vulnerability Score calculation. The internal transient storage of data, such as memory storage and temporary caching, is not part of the focus of this work. The two other serving layers include Machine Learning/Data Mining and Data Sharing. Both can have separate storage layers, either for the models used by Machine Learning/Data Mining or for export copies of the data in semi-structured file formats and/or structured table formats for Data Sharing.
3.3. Vulnerability Assessment Calculations
For each of the storage layers described in Figure 3, the following pieces of data are defined at the field level, and the metrics are evaluated according to the proposed methodology. The Vulnerability Score and Ecosystem Vulnerability Score will be calculated for the defined use case. For each data layer, the VS calculation is carried out as shown in Equation (6).
The detailed assessment of the data persistence metrics for the Streaming database is available in Table 1, and that of the data content metrics in Table 2. The calculation of the Vulnerability Score for the Streaming service is shown in Equation (7).
Similarly, the assessment is carried out for each of the subsequent data storage layers. The results for each are shown below, excluding detailed columnar information for brevity.
The subscription management system data contain information related to users, subscriptions, and profiles. The resulting Vulnerability Score is shown in Equation (8).
The backend video system model contains data related to users, subscriptions, profiles, metadata, and stream details per user. The resulting VS is shown in Equation (9).
In our study, the Landing layer resulted in the highest value for VS, as calculated by Equation (10). This is caused by the fact that all previous business application data are sent raw to the Landing layer. This use case assumes the data are sent as files and stored in a semi-structured format, including nested sensitive information. The amount of data duplication due to multiple ingestion streams, together with the implications of semi-structured formatting, results in a value close to 600. While the score by itself does not have a predefined meaning in comparison to the rest of the data storage layers, it is evident that this layer inherently contains the highest risk. As a result, this can be a very clear indication of where the biggest efforts for securing the data should be concentrated.
The Staging layer has a similar value for VS, but it is slightly reduced; see Equation (11). This is due to the fact that these data are now flattened into a normalized structured format. However, the layer still contains all the information from all previous sources in a raw format.
The Historical data layer also has a high VS value, comparable to Equations (10) and (11); see Equation (12). The slight decrease in the score is due to masking applied at this level.
The score of the Analytical layer is approximately 50% of that of the first three layers in the data warehouse; see Equation (13). This is due to modeling, centralization, aggregation, and further masking of the data.
For the Data Serving layer, we assume the creation of a subset of data for sharing purposes, containing a reduced number of columns from the Analytical database. The score is available in Equation (14).
Finally, the Ecosystem Vulnerability Score can be calculated by adding the scores of each of the evaluated data layers. This takes the assessed risk of the data ecosystem for this specific functionality to a significant value; see Equation (15).
An overview of each layer's vulnerability score, as well as the ecosystem score, is available in Table 3.
4. Results
Illustrated by the example defined in the use case section, the proposed framework was used to create a clear overview of the vulnerability each piece of information possesses according to the defined metrics. These were defined on the lowest grain available in each data layer, such as per column in the case of structured data or per data structure for semi-structured and non-structured data. Using the equations defined in the vulnerability assessment framework, a vulnerability score was calculated for each of the data layers. This provided a comparable view of the risk attached to each of the storage layers in a data ecosystem, ultimately used to calculate the score of the entire ecosystem. This provides a more flexible assessment that can include each data layer involved in building end-to-end data pipelines and provides a more complete view of vulnerability across entire ecosystems.
The framework provided the flexibility to define subjective weights for data structures that can be more sensitive and carry inherently more risk for specific use cases, such as those defined in Table 1 and Table 2. This places the responsibility of accurately assessing the vulnerability of data content with the data owner, while predefining a set of metrics to be used for data persistence, which can also be extended and aligned across data layers.
The results produced offer an easy way to observe the vulnerability of the ecosystem, as shown in Table 3. Actions to better safeguard these layers should be informed by the aforementioned overview.
5. Discussion
The suggested methodology is an effective way to address the issues of determining the perceived privacy risk across data storage layers and informing the necessary security measures for an ecosystem. The use of the vulnerability score for each data layer, as well as the ecosystem vulnerability score, supported by a spectrum of values, provides a clear and measurable way of determining the necessary security expenses. This score has a high potential to standardize vulnerability assessments due to its ease of calculation, simplicity, clarity, and generalizability for nearly all data storage systems and formats.
The presented approach improves on the initial methodology [7], which was created with a focus on machine learning and its respective data sources, extending it into a more comprehensive and flexible analysis of end-to-end data storage privacy concerns. By integrating this methodology into architectural decisions, such as decision analysis and resolution reports, choices of systems and architectures can be driven by a privacy-first approach. Furthermore, any data flow processes within an organization and across organizations can be guided by defining data contracts and calculating the vulnerability score even before investing in developing these processes. This can result in reevaluating the necessity of sending certain sensitive data, reducing risk and also potentially reducing the costs of securing these data.
The framework can be used for one-time assessments, which inform the current status of an ecosystem or one of its data layers, but it can also be integrated into a continuous view of a system’s vulnerability by means of automation. This will provide a more accurate view of a system’s vulnerability at any given point in time, as data structures and architecture evolve dynamically and constantly, as do regulatory environments and associated risks to already defined data. More research is necessary to create an architectural approach that will enforce an accurate and continuous evaluation of each data piece added or modified as the system changes over time.
While the methodology provides an already reliable and reproducible procedure for assessing perceived risk, more research is needed to enhance the standardization of the proposed metrics. As these metrics are clearly defined, the automation of the score calculation can be integrated into such a system to provide continuous awareness of potential risks.
Security and data privacy are highly volatile issues, making it paramount to keep completed assessments up-to-date with the developments and research that might impact the values of the metrics involved and their respective weights. We also aim to further investigate the impact of using vulnerability score calculations in creating privacy-conducive data ecosystems, facilitating a privacy-first approach. This can result in data processing and sharing agreements built on an objective measure. Thus, it is imperative to properly assess the impact so that the right security measures are implemented, proportional to the resulting value.