1. Introduction
Information Systems (ISs) and Information and Communication Technologies (ICTs) provide new and innovative mechanisms of communication. In cyberspace, instant messaging is changing lifestyles and the way business is conducted across all industries [
1]. In this sense, advances in ICT have allowed new ways of providing services in the health areas. ICTs have substantial potential to help provide high-quality health services by promoting better outcomes through access to healthcare information [
1]. Thus, the importance of the various ICT tools aimed at human beings led to the construction of Health Information Systems (HISs), which [
2] defines as “data, information and knowledge processing systems in healthcare environments”. Telemedicine is an example of a health scenario that uses HISs in an integrated manner. It has a range of medical equipment and integrated systems that generate diagnoses and treatments regardless of geographic distance, developing patient benefits, such as reduced travel or faster access to knowledge [
3,
4]. It is important to emphasize that telemedicine is a health scenario that uses different HIS types for the health data management process. Like any other healthcare scenario, it generates large volumes of data that need to be managed.
According to [
5], health sectors worldwide are characterized by the increased production of data related to patient care demands. This increase includes hospital records, test results, and data from devices that are part of the Internet of Things (IoT), among other medical data. The authors also claim that we face huge amounts of data on various aspects of our lives, especially in the healthcare industry. Like any other industry, healthcare organizations are producing data at a tremendous rate, which presents many advantages and challenges at the same time. Technological advances have helped generate a large quantity of data, and managing it has become a complex task even with current technologies, such as blockchain, cloud computing, fog computing, and artificial intelligence, among others. Accordingly, some research addresses these technological aspects from the perspectives of computer science and information sciences [
6,
7,
8,
9,
10]. In this sense, the need to manage large volumes of data in HISs calls for computational strategies that address the historical processing of these data (i.e., provenance management) [
11,
12]. In this article, we present a Systematic Literature Review (SLR) that investigates the computational strategies adopted using methods, techniques, models, methodologies, and technologies that contribute to the provenance data management in HISs. According to [
11,
12], data provenance is essential for the auditing, screening, and lineage of data. It can also be considered metadata that describe the origin of the data and the entire path they take throughout their life cycle.
1.1. Motivation
In [
13], data provenance is described as a process that aims to provide an overview of the origin of the data used by information systems. It focuses on identifying the data sources and the transformations that the data have undergone over time. Data provenance is relevant to different application scenarios; the health scenario is the focus of this work. The use of data provenance in the health context is a growing research area based on the most varied types of scientific experiments. The technologies applied in this area have been obtaining significant results [
14]. However, it is essential to emphasize that one of the main problems that data provenance faces is traceability. This ubiquitous problem is usually found in databases that are the result of several transformation steps. As an example, we can mention scientific databases and data warehouses [
15].
Problems in data management processes can lead to data loss and privacy exposure. In the medical context, this is vital, both in terms of security and data availability, which, in many cases, are obtained in an emergency [
16]. From this perspective, the studies selected in this SLR present challenges in the use of different methods, techniques, models, and methodologies related to data provenance, supported by technological tools that can contribute to provenance data management in HISs. Managing provenance data in HISs is important because these data are sensitive and essential for medical decision making, which is one of the greatest motivations for carrying out this research. Interoperability between HISs has also become one of the largest problems, especially in terms of health data security. In recent years, it has been shown that the secure exchange of medical information significantly benefits people’s quality of life, improving their care and treatment. The interoperability of the entire healthcare ecosystem is a constant challenge, all the more so given the risks posed to the security of healthcare information [
17]. In this way, problems with interoperability, tracking, and security with regard to the management of provenance data in HISs are some of the points observed in this SLR study. Thus, this article aims to answer general and specific SLR questions, and it proposes a taxonomy related to the management of provenance data in HISs that contributes to further research in this area.
1.2. Contributions
The contributions of this SLR are as follows: (i) an SLR that focuses on different methods, techniques, models, and methodologies related to provenance data management in HISs, taking into account the challenges, approaches, advantages, and technologies used; (ii) a presentation of the categories that stand out in the management of provenance data from HISs, such as storage, availability, traceability, confidentiality, integrity, authenticity, and auditability; (iii) a taxonomy, based on the results of the SLR, that presents a process for provenance data management in HISs considering four dimensions: methods, techniques, models, and methodologies; different types of HISs; computational technologies used for HISs; and international standards used in HISs; (iv) a preliminary analysis of technological tools and solutions from the medical systems industry that contribute to the management of provenance data in HISs.
1.3. General Structure of the Work
This article is organized as follows:
Section 2 describes the background, and data provenance and HISs are conceptualized. In
Section 3, related works and the need for this SLR are highlighted. In
Section 4, we present the review methodology used to conduct the SLR and the backward and forward snowballing technique.
Section 5 presents the similarities of the selected studies by the SLR, along with the backward and forward snowballing technique. In
Section 6, the SLR report is presented.
Section 7 presents the proposed taxonomy for managing provenance data in HISs based on the results presented. In
Section 8, the data provenance in the medical systems industry is presented. Furthermore, in
Section 9, threats to the validity are presented, followed by
Section 10, which discusses important open-issue points of this SLR. Finally, in
Section 11 and
Section 12, conclusions and future works are presented, respectively.
3. Related Works and the Need for This SLR
To date, very limited research has been published on the management of provenance data in HISs. Moreover, no dedicated and detailed SLR can be found that covers the processes and activities involved in managing provenance data in HISs and presents the existing methods, techniques, models, and methodologies for managing provenance data, the different types of HISs, and the computational technologies and international standards that contribute to the successful management of provenance data in HISs.
Thus, in this section, the existing research is summarized, presenting its contributions and limitations, compared with our work. As noted in the literature, most works do not follow an SLR methodology. These works are focused on a specific application domain or consider only limited aspects or those that contribute in parts to the process of managing provenance data in HISs. This in fact highlights the importance and necessity of the current study. In
Table 1, we indicate whether each recent study followed an SLR protocol, used general review techniques, was set in the context of HISs, or focused on the management of provenance data in HISs.
In the study [
68], the authors also focus on security as one of the important factors for the communication of health devices in the Internet of Health Things (IoHT) scenario for the smooth running of health activities. As a contribution of this study, the authors emphasize the importance of safety and provenance in the IoHT, limiting themselves only to this scenario.
In the study [
69], the authors report that when working with sensitive data, such as health data, security mechanisms are needed. As a contribution to the study, the authors present research that provides a broad view of the security mechanisms applied, along with Semantic Web technologies that can allow their use with health data. These studies present mechanisms that address various attributes, such as authentication, authorization, integrity, availability, confidentiality, privacy, and provenance. Although the study addresses these issues, it is limited only to health data security mechanisms.
In the study [
70], the authors discuss the challenges for healthcare data management systems in terms of data transparency, traceability, immutability, auditing, data provenance, flexible access, trust, privacy, and security. As a contribution, blockchain technology is discussed as promising for healthcare sectors. The study is limited to electronic health records (EHRs) and electronic medical records (EMRs).
In the study [
71], the authors focus on the positive contributions of Fast Healthcare Interoperability Resources (FHIR), including challenges, implementation, opportunities, and future applications, limiting themselves to the use of FHIR only in an electronic health record (EHR) scenario.
In the study [
72], the authors discuss the Internet of Things (IoT) and blockchain for the expansion of healthcare systems in relation to their scalability and consistency on a decentralized platform. As a contribution, the study presents research focused on the IoT and eHealth systems that explores the application of blockchain technology in various fields of eHealthcare, and it is limited to this purpose.
In the study [
73], the authors discuss the integration and exchange of information between health organizations and, as a contribution, present some tools used in this process. The proprietary way of storing electronic health records of patient history is highlighted, and the study is limited to semantic interoperability.
In [
74], the authors discuss the importance of Health Level Seven International (HL7) and Fast Healthcare Interoperability Resources (FHIR) as the leading interoperability standards for healthcare data exchange and the clinical research process. The study’s contributions are focused on expanding and funding HL7–FHIR-enabled solutions for clinical research and are limited to that purpose.
Importantly, in the content of the studies presented in
Table 1, the blockchain and health devices linked to the IoHT are trends for use in health data, which, in fact, will require provenance data management processes in future health scenarios. It is also noteworthy that most of the current works presented in this section discuss the importance of the use of artificial intelligence (AI) combined with data provenance in systems linked to the health sectors. We can agree with the statement in the study [
75], in which the authors argue that data provenance is important to improve AI-based systems, which is a trend for HISs to contribute to decision making. Another important point to highlight is that our SLR makes improvements in the face of the limitations presented in the studies in this section, highlighting the importance of current research.
4. Review Methodology
An SLR is a type of scientific research that aims to gather, evaluate, and summarize the results of multiple primary studies. This type of review also seeks to answer a set of formulated questions, using systematic and explicit methods to identify, select, and evaluate relevant research. SLRs typically collect and analyze data from included primary studies using statistical methods that summarize their results [
76,
77,
78,
79,
80,
81]. This type of study is necessary because, among the numerous studies published on a given topic, it identifies those of superior methodological quality that can inform practice. Furthermore, conflicting results often emerge from different studies that address the same question, and individual studies rarely have sufficient statistical power to provide definitive answers [
80,
82]. SLRs are of great importance as a scientific research tool for decision making at much lower costs than those required for large-scale studies [
79,
82].
4.1. Review Planning
Many activities should be considered before performing an SLR. We conducted this review based on Kitchenham’s guidelines for performing SLRs in software engineering [
78]. These guidelines set out steps that should be considered when building an SLR. The steps used in this study were as follows:
Choice of databases and terms for search strategy;
Strategy and criteria for selecting primary studies;
Strategy for assessing the quality of the selection of primary studies;
Strategy for data extraction and synthesis;
Identification of relevant studies;
Strategy for a summary of relevant studies.
4.2. Research Question Definitions
The SLR presented here can also contribute to constructing new hypotheses, evidence, and synthesizing results that aid in managing provenance data in HISs. Therefore, one of the essential processes of an SLR is the construction of research questions [
76,
77,
78]. This study classified questions into General Research Questions (GRQs) and Specific Research Questions (SRQs). It is important to emphasize that, for the formulation of the general questions of this SLR, the mnemonic SPICE, proposed by [
83], was used. The mnemonic SPICE comprises the following: (i) setting: where? (e.g., In what context are you addressing the issue?); (ii) perspective: for whom? (e.g., Who are the participants?); (iii) intervention: what? (e.g., What is being performed?); (iv) comparison: compared with what? (e.g., What are your alternatives?); and (v) evaluation: with what result? (e.g., How will you measure whether the intervention was successful?).
The phases of the SPICE mnemonic adapted for this article are as follows: (i) setting: type of HIS; (ii) perspective: HIS professionals and users; (iii) intervention: storage, availability, traceability, confidentiality, integrity, authenticity, and auditability; (iv) comparison: in addition to methods, techniques, models, or methodologies for managing provenance data, other alternatives that can be compared, such as the technologies used in HISs; and (v) evaluation: results presented by using methods, techniques, models, methodologies, and technologies used to manage provenance data in HISs.
These five phases were established to ensure the quality of the returned primary studies. The general and specific research questions (GRQs and SRQs) formulated with the SPICE mnemonic for this SLR are presented below.
General Research Questions (GRQs): GRQ1: What are the different methods, techniques, models, and methodologies used for the provenance data management in HISs?; and GRQ2: What are the challenges regarding the different methods, techniques, models, and methodologies identified in relation to the provenance data management in HISs?
Specific Research Questions (SRQs): SRQ1: Taking into account the most representative aspects regarding the provenance data management in HISs, how did these systems approach the different methods, techniques, models, and methodologies identified?; SRQ2: What are the main advantages of applying different methods, techniques, models, or methodologies for the provenance data management in HISs?; and SRQ3: What are the main technologies identified in the different methods, techniques, models, or methodologies that contributed to the provenance data management in HISs?
4.3. Search String Construction and Libraries
The databases for searching primary studies were chosen for their close fit with the theme of the study area. They were as follows: the ACM Digital Library (see
http://dl.acm.org (accessed on 11 April 2023)); the IEEExplore Digital Library (see
http://ieeexplore.ieee.org (accessed on 11 April 2023)); ScienceDirect (see
http://www.sciencedirect.com (accessed on 11 April 2023)); SpringerLink (see
http://link.springer.com (accessed on 11 April 2023)); Scopus (see
http://www.scopus.com (accessed on 11 April 2023)); and Web of Science (WoS) (see
http://webofscience.com (accessed on 11 April 2023)). The interval used for searches in the chosen databases was from 2010 to 2020. This time interval was stipulated considering the pre-tests carried out in the databases, which showed more results after 2010. The searched terms were limited only to the metadata (e.g., titles, abstracts, and keywords) of the articles.
The research strategy’s string concatenated the terms “Data Provenance” and “Health” to identify different methods, techniques, models, or data provenance methodologies, considering the HIS types. According to [
78], variants and synonyms related to the research topic (e.g., telemedicine, e-Health, m-Health, and healthcare) were introduced for more accurate results. Finally, the Boolean logical operators (AND, OR) were used to form the following search string: “Data Provenance AND (Health OR Telemedicine OR e-Health OR m-Health OR Healthcare)”.
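As a rough illustration of how such a string could be assembled and applied to article metadata, the following Python sketch builds the Boolean expression from the terms above and checks a record against it. The function names and the record layout are hypothetical, and the matching is a plain substring test; each digital library has its own query syntax, so this is not the exact query submitted to any database.

```python
# Terms taken from the search string in the text.
CORE_TERM = "Data Provenance"
CONTEXT_TERMS = ["Health", "Telemedicine", "e-Health", "m-Health", "Healthcare"]


def build_search_string(core, context_terms):
    """Join the core term with the OR-ed context terms."""
    return f"{core} AND ({' OR '.join(context_terms)})"


def matches(metadata, core=CORE_TERM, context_terms=CONTEXT_TERMS):
    """Return True if the metadata satisfies: core AND (any context term).

    `metadata` is a hypothetical dict with 'title', 'abstract', and
    'keywords' entries; matching is a case-insensitive substring test.
    """
    text = " ".join(
        str(metadata.get(field, "")) for field in ("title", "abstract", "keywords")
    ).lower()
    return core.lower() in text and any(t.lower() in text for t in context_terms)


query = build_search_string(CORE_TERM, CONTEXT_TERMS)
```

Restricting `matches` to the title, abstract, and keywords mirrors the decision to search only article metadata.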
4.4. Inclusion and Exclusion Criteria
Articles in English published in journals or conferences were considered due to their relevance in computer science. Other documents, such as technical reports, dissertations, theses, and books, among others, were not selected. In this sense, the inclusion (
Table 2) and exclusion (
Table 3) criteria were defined for the selection of primary studies.
4.5. Quality Assessment Strategy
The quality of an article can be measured by its relevance and the scientific value of its content. To assess the quality of the selection of primary studies, some criteria were introduced to check whether the articles are relevant studies or not. As described by [
78], these procedures are necessary to assess the quality of selected works. The quality evaluation of the primary studies in this article considered each study’s purpose, contextualization, literature review, related works, methodology, results, and conclusion, in line with its aims and indication of future studies. Thus, during the analysis of the primary studies and the collection of results, the criteria formulated in
Table 4 were applied, allowing for a different, broader process of validation of the studies.
Quality assessment can serve as a recommendation for future research, providing information on the quality of information from each assessed study [
84]. It is described in
Table 5 to cover the scope of the studies, allowing us to find answers to the general and specific questions stipulated.
To assess the adequacy degree of the quality criteria, the assessment strategy proposed by [
85] was adopted, allowing gradual responses from 0 (strongly disagree) to 2 (strongly agree), as shown in
Table 6.
To aid in the assessment, the Likert-3 scale was adapted for each quality criterion proposed, as can be seen in
Table 7.
In this sense, the quality levels of the 14 studies selected in this SLR unanimously assumed the scale “2” for the five quality criteria (QC1, QC2, QC3, QC4, QC5) based on
Table 5 and
Table 6.
Next, after the Likert-3 scale was adapted to each quality criterion, the quality levels were analyzed, as proposed by [
86], and they are presented in
Table 8.
Thus, it was possible to observe that two studies (14.3%) were rated “very good,” satisfying the quality criteria in a positive way. More notably, 12 studies (85.7%) were rated “great,” which shows that most of the studies selected in this SLR are of high quality for the study of provenance data management in HISs. These percentages, analyzed with the Likert-3 scale, support the quality of the selected studies. After these steps, the data were extracted and synthesized to obtain a broader view of the theme proposed in this article.
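The reported percentages can be reproduced with a short Python sketch. The helper names are hypothetical, and treating a study's overall score as the plain sum of its five Likert-3 answers is an assumption made only for illustration; the mapping from scores to quality levels is the one given in the quality-level table and is not reproduced here.

```python
QUALITY_CRITERIA = ("QC1", "QC2", "QC3", "QC4", "QC5")
SCALE = (0, 1, 2)  # Likert-3: 0 = strongly disagree, 2 = strongly agree


def total_score(answers):
    """Sum the Likert-3 answers of one study over the five quality criteria.

    Assumption: the overall score is a plain sum; the text does not state this.
    """
    for criterion in QUALITY_CRITERIA:
        if answers[criterion] not in SCALE:
            raise ValueError(f"{criterion} must be one of {SCALE}")
    return sum(answers[c] for c in QUALITY_CRITERIA)


def level_shares(level_counts):
    """Return the percentage of studies in each quality level, to one decimal."""
    total = sum(level_counts.values())
    return {level: round(100 * n / total, 1) for level, n in level_counts.items()}


# Counts reported in the text for the 14 selected studies.
shares = level_shares({"very good": 2, "great": 12})
```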
4.6. Data Extraction and Synthesis Strategy
The strategy for the extraction and synthesis of the data from the retrieved studies was carried out in a structured way through the export of the documents to Mendeley (see
https://www.mendeley.com/ (accessed on 11 April 2023)) to eliminate duplicate studies. For better visualization of the data, an electronic spreadsheet containing essential information was generated. The data extracted from the studies were then synthesized and organized to answer the general and specific questions.
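Duplicate removal was carried out in Mendeley. Purely to illustrate the idea, a minimal Python sketch that deduplicates exported records by DOI or, when no DOI is present, by a normalized title might look as follows. The record fields and the normalization rule are assumptions and do not describe Mendeley's actual algorithm.

```python
import re


def normalize_title(title):
    """Lowercase the title and collapse non-alphanumeric runs into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first occurrence of each record, keyed by DOI or normalized title.

    `records` is a list of hypothetical dicts with optional 'doi' and 'title' keys.
    """
    seen = set()
    unique = []
    for record in records:
        doi = (record.get("doi") or "").strip().lower()
        key = ("doi", doi) if doi else ("title", normalize_title(record.get("title", "")))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

A record exported with a DOI from one database and without one from another would not be matched by this simple key, which is one reason a manual check of the remaining list is still advisable.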
4.7. Primary Studies Identification
The electronic libraries mentioned earlier, which were used to retrieve the primary studies that make up this SLR, aim to cover the essential journals and conferences within computer science. The search results still had to pass through the filters of the SLR process and the synthesis phase of the relevant studies. The synthesis followed a sequence of steps to filter the studies related to methods, models, techniques, methodologies, and technologies used to manage provenance data in HISs. First, a first reading filter was applied, which considered only metadata (for example, title, abstract, and keywords) and the inclusion and exclusion criteria. Then, a second reading filter, which covered introductions, results, and conclusions, was applied. Thus, it was possible to select only the articles that met the previously specified selection criteria and answered the questions in this SLR.
4.8. Systematic Review Conduction
This section presents how the primary studies were identified and used to answer the questions discussed here.
Figure 1 shows the selection process of primary studies at each stage of the SLR.
We observe in
Figure 1 that many duplicate studies were found. This happens because digital databases often index primary studies from other databases. Different factors can justify the number of studies that are returned in each database. Some of these factors are related to the order in which the research was conducted, the total number of studies in the base, and the relevance of the base to the research question. To better understand (
Figure 1), in Step 1, the query string was run on the selected databases between 25 and 27 June 2021. The search interval was 10 years (from 2010 to 2020), returning a total of 239 studies. Of these 239 retrieved studies, 11 were taken from the ACM Digital Library, 50 from IEEExplore, 3 from ScienceDirect, 66 from Scopus, 59 from SpringerLink, and 50 from Web of Science.
Table 9 summarizes these data showing the number of articles retrieved per database.
In Step 2, 71 duplicate studies were found. Thus, in Step 3, the first filter was performed (reading the titles, abstracts, and keywords), in which 147 studies were discarded for not meeting the inclusion criteria of this SLR, leaving 21 studies for the execution of the second filter. When the abstracts of these studies were read, it was observed that they had characteristics related to the management of data from HISs. In Step 4, the second filter was performed (reading the introductions, results, and conclusions) for the remaining 21 studies. Strong relationships with provenance data management in HISs were observed in 14 studies, whereas 7 studies showed no solid relationship and were discarded. Finally, the 14 studies were selected for full reading, as they met all the selection criteria specified in the steps. All 14 primary studies selected to compose this research were evaluated against all the quality criteria already mentioned. It is important to emphasize that the exclusion process resulted in 14 studies not related to the management of provenance data in HISs. Although data provenance may have been mentioned in their abstracts as one of the use cases, it was not the focus of the authors’ research. They only mentioned data provenance in one of the subsections as a potential area of application in health, without contributing to new ideas applied in HISs.
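The counts reported for each step can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers stated above; the variable names are illustrative.

```python
# Studies retrieved per database in Step 1, as reported in the text.
retrieved = {
    "ACM Digital Library": 11,
    "IEEExplore": 50,
    "ScienceDirect": 3,
    "Scopus": 66,
    "SpringerLink": 59,
    "Web of Science": 50,
}

total = sum(retrieved.values())                # Step 1: all retrieved studies
after_duplicates = total - 71                  # Step 2: duplicates removed
after_first_filter = after_duplicates - 147    # Step 3: metadata reading filter
after_second_filter = after_first_filter - 7   # Step 4: second reading filter
```

The chain yields 239, 168, 21, and 14 studies, which agrees with the totals given for Steps 1 to 4.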
4.9. Backward and Forward Snowballing
According to [
78], SLRs must be executed strictly following a predefined search strategy. This search strategy must be impartial and must allow the integrity of the research to be assessed. In [
78], the authors argue that initial searches for studies can be performed using several digital libraries, and they indicate that other complementary searches should be employed (e.g., manual searches in journals). An example of a manual procedure often used in addition to the SLR is snowballing. This search strategy consists of iteratively exploring the list of references (backward) and articles that have a citation of the selected article (forward) [
78,
87,
88,
89]. For this reason, we identify an initial set, defined as the starting point. This initial set is a collection of already selected studies to compose the systematic mapping, from which their references and citations will be verified [
87,
88]. In this set, only the studies that will be included for the final analysis are included. The next step is to start the first iteration, conducting snowballing (backward and forward). After executing the backward and forward processes, the retrieved documents are added to the total of the initial set that was evaluated at the beginning of the process [
88]. The iterations are defined in [
88] as follows:
Backward snowballing: The intention is to use the study reference list to find new works to be included. When checking the list of references, exclude according to basic exclusion criteria, such as year of publication, written language, or publication type. The next step is to exclude the studies already found before, and then the others are candidates for inclusion. Then, read the other information and parts with greater relevance;
Forward snowballing: The intention is to use the list of citations of the included works. Google Scholar can be used to view the citations that each article has. Each citation is analyzed from an overview, and if information such as the title and abstract is sufficient, the article can be included in the list for further reading.
4.9.1. Execution of Backward and Forward Snowballing
A second step in the SLR was performed using the snowballing technique (backward and forward) based on [
88] to cover primary studies that were not previously identified. In this process, the 14 selected primary studies were used as the initial input set. It is important to emphasize that the inclusion and exclusion criteria used in the execution of the snowballing technique were the same as those used in the SLR, presented previously in
Table 2 and
Table 3 of this article, respectively. We considered only one caveat, in the inclusion criterion (IC1) in
Table 2: in addition to studies published in journals and conferences, technical reports and e-books were also considered, for greater breadth in the execution of the technique. In the execution of the snowballing technique, both backward and forward, the references were analyzed, and the primary studies that met the interests of this SLR were added to the studied scope.
Figure 2 shows the steps used in the snowballing technique applied in this article.
As shown in
Figure 2, the snowballing process was performed in four steps, which are described below:
Step 1: The initial set of 14 accepted studies was evaluated to start the process of the snowballing techniques (backward and forward);
Step 2: In the snowballing process (backward), the years of publication of the articles in the reference list were checked to see whether they met the criteria previously defined. Soon after, four verifications were carried out following the recommendations of [
88]: (1) title verification in the reference list; (2) reference location verification; (3) reading of the abstract of the referenced study; and (4) verification of the complete references of the referenced study. Three iterations were performed through a manual search. In the first iteration, the proposed initial set with the references listed by the SLR was evaluated. In the second iteration, 73 studies were nominated for possible inclusion. However, in the third iteration, only 1 of the 73 studies had relevance associated with the theme of this research, and 72 were excluded. Therefore, the backward process resulted in only one new work for inclusion in the initial set, and the process was concluded;
Step 3: In snowballing (forward), we used Google Scholar as a citation search engine. According to [
88], Google Scholar avoids a bias in the search, as it indexes the main research bases among other renowned international bases. It was verified whether the year of publication of the articles also met the previously defined criteria. Soon after, four verifications were carried out following the recommendations of [
88]: (1) the title of the cited study was verified; (2) reading of the study summary; (3) reading of the place of citation performed; and (4) the complete citation of the study was verified. Thus, three iterations of manual research were performed to evaluate the studies on the citation list. In the first iteration, citations from the initial set presented by the SLR were evaluated. In the second iteration, 37 articles related to the topic indicated for possible inclusion were evaluated. In the third iteration, it was observed that, in the studies found, the iterations tended to leave the initial theme increasingly dispersed, and no more relevant sources were found. Thus, the 37 studies were evaluated, and it was observed that 2 of these studies had relevance associated with the theme of this research. Therefore, 2 studies were included in the forward snowballing process, and 35 studies were excluded. No new studies were found, and the forward process was completed;
Step 4: Finally, three studies were added to the initial set provided. In this phase, the three studies were included through the snowballing process, allowing us to delve into the topic presented in this research. It is important to emphasize that the exclusion process carried out on the backward and forward snowballing also follows the same practices described in the initial conduct of this SLR.
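The iterative logic of the steps above can be sketched in Python. This is a simplified illustration under stated assumptions: the `references` and `cited_by` mappings, the `is_relevant` predicate, and the function name are hypothetical, and the sketch merges backward and forward passes into one loop, whereas in this SLR the two passes were carried out separately and manually.

```python
def snowball(start_set, references, cited_by, is_relevant):
    """Repeat backward and forward snowballing until no new study is added.

    references: dict mapping a study to the studies in its reference list (backward).
    cited_by:   dict mapping a study to the studies that cite it (forward).
    is_relevant: callable standing in for the inclusion and exclusion criteria.
    """
    included = set(start_set)
    frontier = set(start_set)
    while frontier:
        candidates = set()
        for study in frontier:
            candidates.update(references.get(study, ()))  # backward step
            candidates.update(cited_by.get(study, ()))    # forward step
        # Drop studies already found, then apply the selection criteria.
        new_studies = {c for c in candidates - included if is_relevant(c)}
        included |= new_studies
        frontier = new_studies  # the next iteration starts from the new studies
    return included
```

The loop stops when an iteration adds no study, which corresponds to the point at which no more relevant sources were found.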
4.9.2. Quality of the Studies Found in the Snowball Technique Process (Backward and Forward)
It is important to emphasize that the three studies found here meet the quality criteria presented in
Table 4 and are considered “Great” according to the quality levels of the studies in
Table 8 (both tables are already presented in this article). This shows that the quality of the studies described in this article, both those resulting from the SLR and those resulting from the snowballing technique (backward and forward), contributes to significant research on the topic in question.
7. Towards a Taxonomy for Provenance Data Management in HISs
A taxonomy that involves the IS area connected to health structures can contribute to structuring the knowledge and emerging research in health information technologies. It is necessary to study the high complexity and diversity of health information technologies. Therefore, a taxonomy contributes to the identification and structural nature of constructs relevant to the development of theories in healthcare settings [
114]. Although there are studies that explicitly relate some types of categorization schemes, taxonomies, and identification of a significant number of comparison dimensions for data provenance characteristics, as in the case of the studies [
20,
28,
107,
115,
116,
117,
118], these studies do not specifically address the provenance of health data in HISs. Although this makes it difficult to compare them and, at the same time, to identify applications and aspects related to the management of provenance data in HISs, they served as a basis for building the data provenance assumptions of the taxonomy proposed here. It is therefore important to emphasize that the proposed taxonomy is adaptive: it concerns the management of provenance data specifically in HISs and can be expanded, improved, and evaluated by other researchers in future studies in different HIS scenarios.
It is important to point out that provenance data management in HISs is not restricted to provenance-specific issues. Thus, taking a comprehensive and systematic view of previous and recent studies in the area, we defined a unified taxonomy to support data provenance management strategies across different types of HISs. The proposed taxonomy is divided into four dimensions: (i) methods, techniques, models, and methodologies for managing provenance data in HISs; (ii) different types of HISs; (iii) computational technologies employed in HISs; and (iv) international standards between HISs. These dimensions were abstracted from the main characteristics observed in the studies described and analyzed in this SLR, and they are presented in the following subsections.
7.1. Methods, Techniques, Models, and Methodologies for Management Provenance Data Existing in HISs
The literature offers several methods, techniques, models, and methodologies that can be used to manage provenance data in HISs. The selected studies mention the following: (i) PROV [
91,
96,
98,
99,
103,
105,
106]; (ii) PROV-DM [
91,
96,
99,
104]; (iii) PROV-N [
98,
99]; (iv) PROV-O [
91,
95,
96,
98,
99]; (v) OPM [
97,
98,
100,
101,
103]; (vi) PROV-IoT [
101]; (vii) PROV-Comics [
99]; (viii) PROV-Chain [
100]; (ix) PTN [
95]; (x) BFTRN [
92]; (xi) TVC [
90]; and (xii) ATDM [
90].
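To make the kind of information these models capture more concrete, the sketch below records a few PROV-style statements for a health record. It is only an illustration under stated assumptions: the relation names (`wasGeneratedBy`, `used`, `wasAttributedTo`, `wasDerivedFrom`) follow PROV-DM, but the `ProvDocument` class, its methods, and all identifiers such as `ehr:record-v1` are hypothetical and do not come from any of the selected studies or from a PROV library.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """Tiny in-memory store of PROV-style statements (illustrative only)."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    # Each relation is stored as a (name, subject, object) triple.
    relations: list = field(default_factory=list)

    def was_generated_by(self, entity, activity):
        self.entities.add(entity)
        self.activities.add(activity)
        self.relations.append(("wasGeneratedBy", entity, activity))

    def used(self, activity, entity):
        self.activities.add(activity)
        self.entities.add(entity)
        self.relations.append(("used", activity, entity))

    def was_attributed_to(self, entity, agent):
        self.entities.add(entity)
        self.agents.add(agent)
        self.relations.append(("wasAttributedTo", entity, agent))

    def was_derived_from(self, derived, source):
        self.entities.update({derived, source})
        self.relations.append(("wasDerivedFrom", derived, source))

    def lineage(self, entity):
        """Return every entity the given entity was derived from, transitively."""
        found, stack = [], [entity]
        while stack:
            current = stack.pop()
            for name, subject, obj in self.relations:
                if name == "wasDerivedFrom" and subject == current and obj not in found:
                    found.append(obj)
                    stack.append(obj)
        return found


# Hypothetical example: a record is created at admission and later updated.
doc = ProvDocument()
doc.was_generated_by("ehr:record-v1", "act:admission")
doc.was_attributed_to("ehr:record-v1", "agent:nurse-01")
doc.used("act:lab-update", "ehr:record-v1")
doc.was_generated_by("ehr:record-v2", "act:lab-update")
doc.was_derived_from("ehr:record-v2", "ehr:record-v1")
doc.was_attributed_to("ehr:record-v2", "agent:physician-07")
```

Calling `doc.lineage("ehr:record-v2")` walks the derivation statements back to earlier versions of the record, which is the sort of traceability question that provenance models are meant to answer.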
7.2. Different Types of HISs
HISs are found in several countries, where they are used to streamline processes involving health data. This dimension lists the main internationally known HISs that manage provenance data: (i) EHRs [
93,
94,
95,
96,
103,
106]; (ii) PHRs [
90,
97,
99,
101,
102,
103,
105]; (iii) the LHS [
98]; (iv) the CRIS [
91]; (v) HMSs [
92]; and (vi) HISs [
105].
7.3. Computational Technologies Employed in HISs
This dimension presents the main computational technologies that the selected studies employ in their proposals for provenance data management in HISs. The main technologies are as follows: (i) ETL [
96,
98]; (ii) mobile technologies (smartphones, tablets, and sensors for data collection) [
91,
92,
93,
94,
99,
102,
103] and PDA [
90]; (iii) use of the Semantic Web (XML [
90,
94,
101,
105,
106], OWL [
91,
96,
98,
105], RDF [
91,
96,
98,
105]), and semantic web languages (SPARQL [
91,
96,
98]); (iv) cloud computing structures [
90,
94,
99,
100,
101,
102,
103]; (v) private networks for monitoring patient data [
90,
91,
92,
94,
95,
97,
100,
101,
102,
103]; (vi) relational and non-relational database management systems [
91,
92,
93,
94,
95,
96,
97,
99,
100,
101,
102,
103,
104,
105]; use of MySQL [
90] and Neo4j [
98,
99], and use of a standard-driven declarative query language such as the Cypher query language [
90]; (vii) DICOM standards set [
103,
106]; (viii) document type (CDA [
103,
106]; CCD [
106]; PDF [
102,
103,
106]; CSV [
99]); (ix) JavaScript Object Notation (JSON) [
98,
103,
106]; (x) blockchain [
93,
100,
102,
103,
104,
106]; and (xi) middleware [
93,
94].
7.4. International Standards between HISs
This dimension lists the main international standards between HISs that, according to the authors of the selected studies, contribute to the data provenance management process: (i) HIPAA [
95]; (ii) IHE [
103,
106]; (iii) HL7 [
103,
105,
106]; (iv) FHIRs [
103,
106]; and (v) XDS [
103,
106]. Our proposed taxonomy is limited to the main classifications of one specific area, data provenance, in the context of managing provenance data in HISs. The elements of the four dimensions mentioned above were identified considering not only those that have been in wide use the longest, but also those that have emerged recently. The proposed taxonomy presented in
Figure 6 illustrates a process for the provenance data management in HISs, covering a spectrum of alternatives along the specified dimensions.
In
Figure 6, we present a proposal for a unified taxonomy for provenance data management in HISs. This taxonomy covers four dimensions derived from the answers to the general and specific questions, based on the primary studies previously selected in the SLR. The aim of this taxonomy proposal is to identify a set of essential characteristics that contribute to provenance data management in HISs. The main elements of each dimension were treated by the studies read in the SLR as elements of high interest for data provenance in the context of HISs. Our taxonomy also considers the impact of managing provenance data in the HISs that occur most frequently in different health scenarios. This contributed to the observation that combining different methods, techniques, models, methodologies, and computational technologies to manage provenance data is a trend to be considered in different HIS scenarios.
Therefore, given the wide variety of terms and concepts used in the literature relating to the provenance data management in HISs, we not only provide the reader with a consistent taxonomy of provenance data concepts, but also relate them to terminology used by other researchers. As a result, our taxonomy focuses on different directions regarding the flow of the provenance data management in HISs.
Finally, the four dimensions of our taxonomy aim to inform and improve the understanding of the distinction between different perspectives regarding the provenance data management in HISs. In fact, our taxonomy can contribute to the decision and selection of the most adequate solution for the needs of the healthcare scenario. In addition, potential researchers in the field, software developers, and others interested in the available approaches to managing provenance data in HISs presented here can understand the open problems seen in practice in order to improve their research and contribute to new implementations.
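As an illustration of how the taxonomy could support the selection of a solution, the sketch below encodes part of it as a lookup structure. The citation numbers are copied from Sections 7.1 to 7.4; the dimension keys, the function names, and the choice to include only a subset of the technologies in Section 7.3 are assumptions made here for brevity, not part of the proposed taxonomy.

```python
# Each element maps to the reference numbers of the selected studies that mention it.
TAXONOMY = {
    "methods": {
        "PROV": {91, 96, 98, 99, 103, 105, 106},
        "PROV-DM": {91, 96, 99, 104},
        "PROV-N": {98, 99},
        "PROV-O": {91, 95, 96, 98, 99},
        "OPM": {97, 98, 100, 101, 103},
        "PROV-IoT": {101},
        "PROV-Comics": {99},
        "PROV-Chain": {100},
        "PTN": {95},
        "BFTRN": {92},
        "TVC": {90},
        "ATDM": {90},
    },
    "his_types": {
        "EHRs": {93, 94, 95, 96, 103, 106},
        "PHRs": {90, 97, 99, 101, 102, 103, 105},
        "LHS": {98},
        "CRIS": {91},
        "HMSs": {92},
        "HISs": {105},
    },
    # Subset of Section 7.3 only; the remaining technologies are omitted here.
    "technologies": {
        "ETL": {96, 98},
        "DICOM": {103, 106},
        "JSON": {98, 103, 106},
        "blockchain": {93, 100, 102, 103, 104, 106},
        "middleware": {93, 94},
    },
    "standards": {
        "HIPAA": {95},
        "IHE": {103, 106},
        "HL7": {103, 105, 106},
        "FHIRs": {103, 106},
        "XDS": {103, 106},
    },
}


def studies_matching(*selections):
    """Return the studies that mention every (dimension, element) pair given."""
    if not selections:
        return []
    reference_sets = [TAXONOMY[dimension][element] for dimension, element in selections]
    return sorted(set.intersection(*reference_sets))


def profile(study):
    """List, per dimension, the taxonomy elements that a study mentions."""
    return {
        dimension: [name for name, refs in elements.items() if study in refs]
        for dimension, elements in TAXONOMY.items()
    }


# Which selected studies combine EHRs with blockchain?
ehr_blockchain = studies_matching(("his_types", "EHRs"), ("technologies", "blockchain"))
```

A query such as the last line narrows the selected studies to those that match a given healthcare scenario, and `profile` gives the complementary view of where a single study sits along the four dimensions.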
8. Data Provenance in the Medical Systems Industry
Industrial efforts in data provenance are increasingly evolving, particularly in the healthcare industry, which has aggressively invested in provenance technology [
119]. In this sense, the entire healthcare ecosystem is moving towards Healthcare 4.0 through the application of Industry 4.0 methodologies [
120]. As the scope of this article is focused on the analysis of studies found in the scientific literature, we seek to follow some of the contributions of studies [
121,
122] to perform a preliminary analysis of the main tools or technological solutions that contribute to the management of provenance data in HISs found in the medical systems industry.
In this sense, using the five essential elements for the use of data provenance that are part of the taxonomy of [
28] (i.e., data quality, audit trail, replication, attribution, and informational), five questions were elaborated to evaluate the tools or technological solutions in the medical systems industry. These questions serve to create the characterization process based on studies [
121,
122]. The questions are as follows:
Q01: What does the tool or technological solution offer to qualify the provenance data in HISs?
Q02: Does the tool or technological solution allow audit tests to be carried out on the provenance data managed in HISs?
Q03: Does the tool or technological solution make it possible to replicate the provenance data managed in HISs?
Q04: Does the tool or technological solution enable the attribution of provenance data managed in HISs?
Q05: Does the tool or technological solution support the informational use of provenance data managed in HISs?
To answer these questions, we used Google Scholar following the criteria defined in [
121]. Thus, in the eight retrieved studies, we obtained a set of 10 tools or technological solutions found in the medical systems industry that contribute to the management of provenance data in HISs. After that, to evaluate the 10 tools or technological solutions based on studies [
121,
122], we used the following rules: (i) Y means “Yes” and represents that this tool or technological solution fully answers this question; (ii) N means “No” and represents that this tool or technological solution does not support this question; (iii) P means “Partially” and represents that this tool or technological solution only partially supports this question.
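The Y/N/P rules above lend themselves to a simple tally. The sketch below is a hedged illustration of how such an assessment could be summarized; the tool names `tool_a` and `tool_b` and their ratings are invented placeholders and are not the values reported in Table 13.

```python
from collections import Counter

QUESTIONS = ("Q01", "Q02", "Q03", "Q04", "Q05")
VALID_RATINGS = {"Y", "N", "P"}  # Yes, No, Partially


def summarize(ratings):
    """Count Y/N/P answers per tool.

    `ratings` maps a tool name to a dict of question -> rating.
    """
    summary = {}
    for tool, answers in ratings.items():
        for question in QUESTIONS:
            if question not in answers:
                raise ValueError(f"{tool}: missing answer for {question}")
            if answers[question] not in VALID_RATINGS:
                raise ValueError(f"{tool}: invalid rating {answers[question]!r}")
        counts = Counter(answers[question] for question in QUESTIONS)
        summary[tool] = {rating: counts[rating] for rating in ("Y", "P", "N")}
    return summary


def fully_supported(summary):
    """Return the tools that fully answer all five questions."""
    return sorted(tool for tool, counts in summary.items()
                  if counts["Y"] == len(QUESTIONS))


# Placeholder ratings, NOT taken from Table 13.
example_ratings = {
    "tool_a": {"Q01": "Y", "Q02": "Y", "Q03": "Y", "Q04": "Y", "Q05": "Y"},
    "tool_b": {"Q01": "Y", "Q02": "P", "Q03": "N", "Q04": "Y", "Q05": "P"},
}
example_summary = summarize(example_ratings)
```

Validating each rating before counting keeps an incomplete or mistyped assessment from silently skewing the totals.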
Table 13 presents the preliminary assessment of the 10 tools or technological solutions found in the eight studies referring to the medical systems industry. They are artificial intelligence (AI); big data analytics (BDA); cloud computing; fog computing; the IoT; FHIRs; findable, accessible, interoperable, and reusable (FAIR); consumer-generated health data (CGHD); HL7; and blockchain.
Regarding the technological tools or solutions evaluated in
Table 13, some appear more frequently, such as blockchain, IoT, cloud computing, and fog computing. Based on the reading of the studies, evidently this frequency is related to the need to use computer networks in health and the large volume of data generated over time. In this sense, the tools or technological solutions mentioned above that appear most frequently are currently the ones that contribute most to the management of provenance data in HISs pointed out by the medical systems industry.
Another important observation is that the vast majority of the tools or technological solutions fully answer the questions prepared on the basis of the study by [
28] on the uses of data provenance. Thus, from
Table 13, we can observe that most tools or technological solutions are suitable for managing provenance data in HISs. Most of the tools or technological solutions evaluated in
Table 13 also appear in the studies evaluated. Finally, although this analysis is very preliminary, we consider that it points to an important evolution for software engineering in this respect. This opens the way for broader future research on this topic, as the studies presented in
Table 13 present relevant technological tools or solutions for the management of provenance data in HISs, which are still in constant evolution. Therefore, it is important to emphasize that the technologies/solutions presented in
Table 13 go beyond the relevant studies presented in
Table 11, as they present even more differentiated approaches to the problem investigated in this SLR.
However, an important point to be highlighted in the technologies/solutions presented in
Table 13 is the concern with the reliability of the systems used in the management of provenance data in HISs. For [
130,
131], reliability plays a very important role in obtaining quality software. Analyzing these studies [
130,
131] in the context of provenance data management in HISs, reliability is a necessary factor in the medical industry for guaranteeing the quality of stored and shared information and the trustworthiness of health data.
11. Conclusions
Provenance data management in HISs is addressed through different methods, techniques, models, and methodologies, in different health scenarios, and with different computational technologies. However, this theme is still barely explored in the literature. In this SLR, we focused on studies that presented approaches to provenance data management in HISs in order to map what already exists and to explore what is being developed on this theme. Based on the results of this SLR, it was possible to answer the general and specific questions. Regarding the main methods, techniques, models, and methodologies for managing provenance data in different HISs, the models that appeared most often in the selected studies were PROV, PROV-O, OPM, PROV-DM, and PROV-N; the PROV family is specified by the W3C, whereas OPM is an earlier community-developed model. In addition to these most frequently applied models, other models derived from them were observed, such as PROV-IoT, based on PROV-DM and OPM; PROV-Chain, based on blockchain technologies and the OPM model; PROV-Comics, based on PROV, PROV-DM, PROV-O, and PROV-N; and PTN, based on the PROV-O model. BFTRN, TVC, ATDM, and an algorithm with data provenance techniques for middleware were also identified. These approaches can have different applications in different HISs, depending on the need for and use of the computational strategies mentioned in this SLR. Different types of HISs were found and are presented in this SLR, such as EHRs, PHRs, the LHS, HMSs, the CRIS, and the HIS, with PHRs appearing in 41% of the selected studies. Reading these studies shows the value placed on provenance data in terms of storage, availability, traceability, confidentiality, integrity, authenticity, and auditability in these systems. 
Special attention should be paid to EHRs, which, in the United States, must comply with HIPAA standards focused on the confidentiality, integrity, and availability of protected health information. The main benefits of the HIPAA standards for healthcare institutions are as follows: ensuring the confidentiality, integrity, and availability of all information created, received, stored, or transmitted; identifying and protecting against threats to data security or integrity; protecting data against uses or disclosures not consented to by the data subject; and ensuring that employees and collaborators comply with good information security practices. Therefore, it is important that healthcare institutions, such as hospitals, seek to comply with HIPAA standards in order to be seen as institutions that meet rigorous international standards of health information security. This contributes to the success of the strategies used to manage provenance data in HISs. It is noteworthy that, of the 17 studies selected for this SLR, 59% are conference papers, which reflects a common publication pattern in the field of computer science. It is also important to highlight that, among the selected studies, those dated 2020 present HISs focused on IoT in health (IoHT) scenarios, remote health monitoring, and mobile health devices monitored in cloud applications, among other scenarios designed for the convenience of patients, which appear to be a global trend in health scenarios. Finally, AI, blockchain, middleware, fog computing, cloud computing, BDA, and HL7 FHIRs, among other technologies, are trends that contribute to the management of provenance data in HISs. 
Another point to be highlighted concerns the challenges reported in the studies for the different methods, techniques, models, and methodologies identified for the management of provenance data in HISs. These include inconsistencies, leaks, and data security problems that can occur in HISs; the need to make provenance data in HISs more secure and reliable; unusual security structures for the management of provenance data in HISs; limits to presenting physicians with the clinical context of medical record data; interoperability, privacy, and confidentiality issues of provenance data in HISs; and, finally, challenges related to real-time applications running on health devices in PHR/IoHT scenarios, which may include barriers to use due to regulatory, financial, and organizational issues, in addition to the lack of interoperability standards between HISs. Another important observation was the identification of the main categories present in the selected studies in relation to the management of provenance data in HISs (namely, storage, availability, traceability, confidentiality, integrity, authenticity, and auditability), which are mentioned as positive factors. In addition, by bringing together the results of the general and specific questions of this SLR, it was possible to propose a taxonomy with the following dimensions: methods, techniques, models, and methodologies for managing provenance data in HISs; different types of HISs; computational technologies employed in HISs; and international standards between HISs. The taxonomy is based on the selected studies and is intended to update the understanding of the subject for researchers, software developers, and professionals working in the management of provenance data in HISs. 
Thus, we consider that the proposed taxonomy provides valuable information about the different views of provenance data management in HISs. In addition, the taxonomy can be useful for identifying similarities and differences between the technologies, methods, techniques, models, and methodologies used to manage provenance data in HISs. Another important point concerns the studies focused on the medical systems industry, which present tools or technological solutions also mentioned in the studies selected in this SLR. Among these, blockchain, IoT, cloud computing, and fog computing stand out, and some studies also mention middleware. This suggests that the industry follows science with regard to the tools and technological solutions that contribute to the management of provenance data in HISs. In this sense, it is possible to conclude that this research presents evidence that researchers and professionals in the field can use to inform decision making about the HISs of their own countries, with the benefits to healthcare in view.