Article

A Comprehensive Data Maturity Model for Data Pre-Analysis

by Lukas Knoflach 1,†, Lin Shao 2,† and Torsten Ullrich 1,2,*,†
1 Institute of Visual Computing, Graz University of Technology, 8010 Graz, Austria
2 Fraunhofer Austria Research GmbH, 8010 Graz, Austria
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Data 2025, 10(4), 55; https://doi.org/10.3390/data10040055
Submission received: 21 February 2025 / Revised: 11 April 2025 / Accepted: 17 April 2025 / Published: 19 April 2025

Abstract:
Data analysis is widely used in research and industry where there is a need to extract information from data. A significant amount of time within a data analysis project is required to prepare the data for subsequent analysis. This paper presents a comprehensive weighted maturity model to estimate the readiness of data for subsequent data analysis, with the goal of avoiding delays due to data quality problems. The maturity model uses a questionnaire with nine criteria to determine the maturity level of data preparation. The maturity model is integrated into a web application that provides an automated evaluation of maturity and a novel visualization approach that combines a modified spider chart and augmented chord diagrams. The comprehensive weighted maturity model is a ready-to-use application that provides prospective users with an easy and quick way to check their data for maturity for subsequent data analysis, with the goal of improving the data preparation process. The weighted maturity model is applicable to all types of data analysis, regardless of the domain of the data.

1. Motivation

The analysis of data to extract information is a crucial task in many applications and provides a plethora of use cases for research and industry. A 2017 survey conducted by CrowdFlower, later known as Figure Eight Inc. and acquired by Appen Limited in 2019 (https://en.wikipedia.org/wiki/Figure_Eight_Inc. (accessed on 8 February 2025)), revealed that 51% of the time spent on data analysis is dedicated to data preparation, encompassing activities such as data collection, labeling, cleaning, and organization [1]. Irrespective of the precise figure, the significant time investment required for data preparation in data analysis projects is confirmed by various scientific references as well as by data analysis software providers and news portals, whose estimates for the share of time spent on data preparation range from 37.75% over 45% and 60% up to 70% and 80% [2,3,4,5,6,7,8,9].
This considerable time investment required for data preparation is a well-known challenge in research, and numerous efforts have been made to reduce it. In order to obtain a valid estimate of the readiness of available data for a subsequent data analysis, it would be beneficial to have a maturity model that can describe the level of maturity of data preparation. Such a maturity model would function as an auxiliary tool in data analysis projects to rapidly assess the level of maturity of the data used for the subsequent data analysis.

2. Related Work

Different standardized processes and models that are suitable for data analysis are available. One widely known and often applied standardized process is the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. The CRISP-DM process model is a tool-, industry-, and application-neutral model, which means that it can be used independently of the technology used and the industry sector [10,11]. The CRISP-DM model divides the life cycle of data into six phases and is depicted as a circle to symbolize the cyclic nature of data analysis itself, irrespective of data mining.
Another standardized process for data analysis is the Knowledge Discovery from Databases (KDD) process, which is an iterative and interactive process to extract knowledge from data [12]. The KDD process basically consists of nine different steps to gather knowledge from given data. However, the KDD process is also described with seven steps, where three steps are summarized [12,13,14]. The Sample, Explore, Modify, Model, and Assess (SEMMA) model is another standardized process for data analysis, which is applicable to a variety of different industries [15]. The SEMMA process consists of five different steps; however, the model is not strictly sequential. This means that, depending on the data analysis, individual steps of the SEMMA process may be skipped or repeated several times. The SEMMA process is linked to the software SAS Enterprise Miner (https://www.sas.com/en_us/software/enterprise-miner.html (accessed on 8 February 2025)), a software tool to streamline data mining processes. If these three processes are compared, similarities between their different phases, stages, and steps can be deduced [13,16]. As the KDD process was published first, the CRISP-DM and SEMMA processes can be seen as implementations of the previously published KDD process [13]. However, the CRISP-DM process is more complete than the SEMMA process [16]. In addition, CRISP-DM can be seen as the standard approach to data analysis in industry; for example, the statistical analysis software SPSS replaced its self-developed five A’s process for data analysis with the CRISP-DM process [17].
Maturity models are well-known practices to determine the level of maturity for a specific task and are used in a variety of different application areas [18]. The Capability Maturity Model (CMM) is considered the first maturity model and was published in 1993 to enhance and guide the development of software processes [19]. The CMM is a generic model and consists of five maturity levels in ascending order to assess and also continuously improve processes of organizations. The CMM can be used to address software management and engineering process issues [20]. An improved version of the CMM is the Capability Maturity Model Integration (CMMI) [21]. The CMMI is a well-accepted maturity model in the literature and also uses five maturity levels to assess processes and methods of organizations. Although the CMMI was originally developed to improve the processes of organizations, it is used as a basis for maturity models in a wide range of different areas [22].
In the area of big data, different maturity models are available that serve different purposes. Due to the large number of available maturity models for big data, review and overview studies are available that compare them [20,22,23]. Therefore, only a selection of maturity models and application areas within big data that are relevant for this work are further described. Many maturity models within the big data area deal with the advantages and business opportunities that big data offers to organizations. One of these maturity models focuses especially on small and medium enterprises and provides a big data maturity model with a roadmap for data analysis in an organization [24]. Another available maturity model for big data also focuses on business opportunities created by big data; however, this maturity model is based on qualitative data analysis, where interviews with domain experts were conducted to derive the maturity model [25].
Another group of maturity models that are relevant for data analysis can be summarized as analytics maturity models. Analytics is understood as the extensive use of data for quantitative and statistical analyses and for explanatory and predictive models in order to derive fact-based decisions and actions [26]. Accordingly, these maturity models assess the capabilities of organizations to conduct and implement data analysis for specific tasks [27]. As is the case with the maturity models for big data, many analytics maturity models can be found in the literature, and review and overview studies that compare these maturity models are available [22,27]. Therefore, only a few maturity models that are relevant for this work are further described. The Data Warehousing Institute (TDWI)’s analytics maturity model uses a survey to assess the level of maturity of an organization in the analytics domain. This maturity model uses five stages to guide an organization toward gaining more value from its investments in analytics [28]. Another maturity model, the Capability Maturity Model for Advanced Data Analytics (ADA-CMM), guides organizations through the development and improvement of their advanced data analytics capabilities to enhance their competitiveness and business operations [29]. Furthermore, some maturity models are available for more specific tasks with data. The master data management maturity model (MD3M), for example, focuses on the maturity of the master data of an organization by using a maturity matrix and an assessment questionnaire [30]. Another maturity model, which is based on the CMM, focuses on the data management of scientific data to support organizations in the assessment and improvement of data management practices [31]. Regarding data management, several maturity models are also available.
The Data Management Maturity (DMM) model is based on the CMMI and consists of six key categories to help organizations build and improve data management capabilities and to assess their current state in data management [32,33,34].
To assess data quality in organizations, maturity models that deal with data quality management (DQM) are available. These maturity models basically classify organizations into different maturity levels according to their currently applied DQM practices [35]. Other maturity models support the establishment of corporate DQM or operate on an enterprise-wide DQM with different measures to assess the DQM of an enterprise [36,37]. These maturity models often use maturity levels and different criteria to assess an organization’s capabilities for DQM, whereby the criteria relate either to the organization being assessed or to its data management practices [38,39]. To apply a maturity model and determine the level of maturity, open-source tools are available for specific maturity models that enable organizations to assess their level of maturity with little time investment [40].
As outlined above, many maturity models within the context of data and analysis are available in the literature. Across all of the maturity models presented, it is conspicuous that they focus on classifying an organization into a maturity level with respect to specific, model-dependent criteria. These models apply a holistic view to a complete analysis process, to the integration of data in the organization, or to the overall performance of an organization regarding DQM, and assign the organization a model-specific level of maturity. Therefore, a research gap can generally be identified in the utilization of maturity models in the data science domain [22].
To the best of the authors’ knowledge, there is no maturity model available that focuses on the individual steps within a specific data analysis or on the data preparation process for a specific data analysis. Although quantitative and qualitative approaches and open-source tools are available to easily determine the level of maturity for an organization, these approaches and tools also focus on the assessment of an organization and not on the assessment of data regarding the subsequent data analysis. Additionally, in the qualitative approaches, the qualitative data are used for the development of the maturity model and not for additional weighting purposes.
To sum up, the identified research gap is the lack of a maturity model that assesses available data for subsequent data analysis. Furthermore, no approaches are available that additionally weight the different criteria of a maturity model based on a qualitative data analysis. Consequently, the comprehensive weighted maturity model developed in this work tackles the described gap and provides a maturity model to assess data for subsequent data analysis. The maturity model operates as an auxiliary tool within the first two phases, and especially in the third phase, of the CRISP-DM model. With the additional weighting of the internal criteria of the maturity model, a more expressive level of maturity for the data preparation process can be determined.

3. Methodology

The occurrence of delays in data analysis projects can be attributed to various factors. Data quality issues, such as the incompleteness, inconsistency, and inaccuracy of data, necessitate extensive cleaning, which can be a significant contributing factor to delays. Additionally, the lack of clarity in project objectives, the evolution of stakeholder requirements, and the absence of domain expertise can result in rework and inefficiencies. Finally, inadequate planning, unrealistic timelines, or unexpected roadblocks like regulatory constraints can hinder timely completion.

3.1. Questionnaire

This research began with a workshop series on lessons learned in data analysis within the research project PRESENT–PREdictions for Science, Engineering N’ Technology. The objective of the workshops was twofold: first, to examine data analysis failures and delays, and second, to explore the underlying causes of these issues and methods for identifying them as early as possible. The result was a compilation of reasons for delays, which were grouped and systematized. For each reason, a diagnostic question was formulated with the objective of identifying the corresponding aspect; the resulting questionnaire raises awareness of these criteria. A comprehensive analysis of the failures was conducted, resulting in the identification of prevalent failure types. To address the enumerated failure types, a system of weighted factors for each cluster was developed.
To be able to determine the level of maturity of data, these criteria clusters were evaluated. For the sake of simplicity, the term “criteria” is used within this work. However, it should be noted that the term “dimensions” can also be used for this maturity model. The developed maturity model consists of nine criteria that were identified in order to determine the level of maturity for data preparation [41,42,43,44].
C1
Completeness
The completeness criterion describes whether sufficient data have been collected for the analysis and whether the dataset is populated accordingly. This criterion also includes the completeness of the paradata and metadata.
C2
Uniqueness
The uniqueness criterion describes the uniqueness of the data. The goal of this criterion is to record all entries in the dataset only once to avoid duplicates within the data.
C3
Timeliness
The timeliness criterion describes the delay between the receipt of the data and the inclusion of the data in the corresponding dataset. This criterion is highly dependent on the application of the data preparation process and the subsequent data analysis.
C4
Validity/Interoperability
The validity/interoperability criterion describes the correctness of the data in terms of formats, syntax, and further processing steps. Different formats in the datasets, text responses that could have been converted to numbers, or incorrectly selected scaling can make further processing of the data very difficult.
C5
Accuracy
The accuracy criterion describes how accurately the data reflect the underlying reality. In other words, this criterion is used to assess whether the collected data accurately represent the real world or the event that is being described by the data. In addition, this criterion also indicates whether potential influences on the data have occurred during the data preparation process.
C6
Consistency
The consistency criterion describes whether the data are logically interconnected. This means that there are no contradictions in the dataset; where contradictions do occur, it is important to distinguish between permissible, valid contradictions in the records and true contradictions in the data themselves, i.e., contradictions with related data. The consistency criterion focuses only on true inconsistencies, such that similar records provide the same information and the data are consistent.
C7
Credibility/Accessibility/Findability
The credibility/accessibility/findability criterion describes the reliability of the data and the underlying data sources. This criterion depends on factors such as the verifiability of the data sources, the use of known and widely used methods in the data preparation process, and the authorization to use the collected and prepared data.
C8
Relevance/Interpretability
The relevance/interpretability criterion describes the relevance and affiliation of the data to the defined objectives of the subsequent data analysis. Accordingly, this criterion is concerned with the comprehension of the defined objectives of the subsequent analysis in order to prepare the data in the best possible way during the data preparation process.
C9
Reusability
The reusability criterion describes whether the data can be reused for other analyses. Reusability does not only depend on the publication of the prepared data, paradata, and metadata, but also on the nature of the dataset. This means that complete, cleaned, and organized datasets are well suited for reuse.
The questionnaire used to determine the level of maturity for data preparation follows the following scheme:
  • Each question in the questionnaire is clearly assigned to only one criterion.
  • The criteria are not communicated to the respondent in advance; therefore, the respondent is unaware of which question is assigned to which criterion.
  • The order of the questions gives no indication of the criteria.
  • All questions are decision questions that can be answered with a yes, a no, or a not relevant.
  • All questions are formulated independently of any topic.
  • If a question is unclear, an explanation or an example is provided.
The questionnaire consists of a total of 52 questions, divided into the criteria as shown in Table 1.
The number of questions of a criterion is not critical to the criterion’s impact on the level of maturity because the criteria’s interdependencies are taken into account in the calculation of the level of maturity (see Section 4). The presented questionnaire is applicable for data preparation, which includes the discovery, validation, structuring, and cleaning of data with data filtering and data enrichment [45].
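The questionnaire scheme above can be sketched as a simple data structure; the question texts, identifiers, and per-criterion counts below are hypothetical placeholders, not the actual 52 questions of the questionnaire:

```javascript
// Hypothetical sketch: each questionnaire item is assigned to exactly one
// criterion (C1..C9) and accepts the answers "yes", "no", or "not relevant".
const questions = [
  { id: 1, criterion: "C1", text: "Is the dataset fully populated?" },
  { id: 2, criterion: "C1", text: "Are paradata and metadata complete?" },
  { id: 3, criterion: "C2", text: "Are duplicate records excluded?" },
];

// Tally how many questions belong to each criterion (cf. Table 1).
function questionsPerCriterion(qs) {
  const counts = {};
  for (const q of qs) {
    counts[q.criterion] = (counts[q.criterion] || 0) + 1;
  }
  return counts;
}
```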

3.2. Importance Weighting via Expert Interviews

In order to assign a weight to each question according to its importance for data maturity, several experts were interviewed. In this work, the term “expert” is defined as an individual who engages in the processing and analysis of data on a daily basis as part of their professional responsibilities. Consequently, these experts possess the expertise necessary to perform the steps required for data preparation and analysis. The principal objective of the interviews was to discuss potential problems that may arise during the process of data analysis. This includes insights into problems that occurred in past data analysis projects as well as general insights and experiences that could prevent problems in the data preparation process and in data analysis. The interviews were based on a semi-structured interview guide (see Appendix A). In total, 15 interviews were conducted. The experts interviewed were classified according to the type of company they worked for. Five of the interviewed experts worked at a university, where they taught or conducted data analyses for their institution. Six worked at non-university research institutions, where they performed data analysis tasks for external customers. The remaining four worked at industrial companies, where they either worked with company-internal data and conducted data analyses for different use cases or worked in an advisory role for different applications involving data and data analyses. All participants worked for companies in Austria. Given that the interviews were conducted according to a semi-structured interview guide, the duration of the interviews varied; the average duration was 26 min and 32 s. To facilitate subsequent analysis, a transcript was created for each interview. The gathered qualitative data from the interviews were used for the qualitative content analysis [46].
The objective of the qualitative content analysis was to determine a weighting scheme for the maturity model based on the experiences and recommendations of the experts. Consequently, the research question for the qualitative content analysis is formulated as follows:
  • How important are the predefined individual criteria in the data preparation process?
Based on the formulated research question, several conclusions can be drawn. Firstly, the scope of the qualitative content analysis is limited to the data preparation process. Secondly, the criteria are predefined. Therefore, it was not the objective of the qualitative content analysis to generate new criteria from the collected qualitative data. In the context of the qualitative content analysis, this means that the codes were built a priori, independently of the empirical data, an approach that is also known as deductive code building. Thirdly, the importance of these criteria in the data preparation process should be determined from the qualitative content analysis. Consequently, the qualitative content analysis was based on the evaluative qualitative content analysis with evaluative, scaling codes, as this method was the most appropriate for the objective of the research question [46]. The applied evaluative qualitative content analysis uses deductive code building, whereby each criterion of the maturity model is associated with a single code. This methodology enables the assessment of the importance of the criteria of the maturity model for the data preparation process. Therefore, the anonymized transcripts of the interviews were evaluated according to an evaluative qualitative content analysis [46]. In the first step, the qualitative dataset was coded according to a defined coding frame (see Appendix B). Afterwards, all coded segments of the transcripts were additionally coded according to defined characteristics for the codes according to their importance. The evaluation was supported by MAXQDA (https://www.maxqda.com (accessed on 9 February 2025)), which is a software program developed for computer-aided analyses of qualitative data.

3.3. Qualitative Content Analysis

Table 2 presents the results of the qualitative content analysis. For each criterion (C1–C9) and each expert (B1–B15), the table lists the absolute frequencies, split into important (!!) and unimportant (−) codings. Additionally, the table presents the corresponding sums in absolute form.

3.4. Implementation

The focus of the automated evaluation of the maturity model is to facilitate easy access without the necessity of installing additional software. It should be sufficient for users of the maturity model to have a comprehensive understanding of their data and the condition of the corresponding dataset, as well as to be fully informed about the overarching task with the data and the subsequent data analysis. Based on these requirements, the automated evaluation for the level of maturity was developed as a static web application, as just a web browser is required to access it. An additional constraint for the web application is that the web server can only be used as a storage location for the corresponding files to host the website. Consequently, the web application solely uses client-side scripting [47] and can be used offline and on-premise. The web application is based on the web standards model and uses HyperText Markup Language (HTML), particularly HTML5 (https://html.spec.whatwg.org/multipage (accessed on 9 February 2025)); Cascading Style Sheets (CSS), particularly CSS3 (https://www.w3.org/TR/2001/WD-css3-roadmap-20010523 (accessed on 9 February 2025)); and JavaScript. In order to ensure that the content of the web application is appropriately and correctly displayed on different devices with different screen resolutions, the web application adheres to the principles of a responsive web design [48,49]. Therefore, the web application uses Bootstrap where the responsive web design can be established due to the grid system feature and the internal CSS media queries of Bootstrap. With Bootstrap, the content of the web application is optimized for the device that is used to view the web application [50,51].
A significant aspect of the web application is the integration of the developed questionnaire to determine the level of maturity for data preparation. To implement the questionnaire into the web application, the JavaScript library SurveyJS, particularly the SurveyJS Form Library (https://surveyjs.io/form-library (accessed on 9 February 2025)), was used. As it is a standalone JavaScript library free from any additional dependencies, it was also especially suitable for utilization in the web application for the maturity model, thus satisfying the requirement of client-side scripting. Upon completion of the questionnaire, the option to save the data in a comma-separated value (CSV) file is available. This file can be utilized for documentation purposes and for the purpose of re-evaluating the questionnaire at a later time. Following the successful completion of preliminary checks, the level of maturity for data preparation is calculated according to the calculations shown in Section 4. The results for the level of maturity and for the influence factors are conveyed afterwards to the visualization (see Section 5).
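The CSV export of the completed questionnaire could, for instance, be serialized client-side as in the following sketch; the column layout shown here is an assumption for illustration and may differ from the one used by the web application:

```javascript
// Hypothetical sketch of serializing questionnaire answers to CSV for the
// save/re-evaluation feature. The actual column layout of the web
// application may differ.
function answersToCsv(answers) {
  // answers: array of { id, criterion, answer } objects
  const header = "question_id,criterion,answer";
  const rows = answers.map((a) => [a.id, a.criterion, a.answer].join(","));
  return [header, ...rows].join("\n");
}
```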
The designed questionnaire is not only implemented in a static web application (see https://codeberg.org/FraunhoferAustria/PRESENT_DataMaturityModel.git (accessed on 14 April 2025)), but it is also available as a user-friendly checklist that fits on a single double-sided piece of paper (see Appendix C).

4. Determination of the Level of Maturity

The criteria shown in Section 3.1 cannot be considered in isolation, as the aspects covered by each criterion are strongly interrelated. Therefore, these interdependencies are shown in Figure 1 to incorporate them in the determination of the level of maturity.
When calculating the level of maturity for data preparation, the criteria, the criteria interdependencies, and a weighting scheme are taken into account. Therefore, the following basic rules are defined for determining the level of maturity:
  • If all criteria are completely fulfilled, the level of maturity is 100%.
  • A question of a criterion influences the criterion itself as well as all other criteria that depend on this criterion (see Figure 1).
  • The effect of a question on the criteria is referred to as influence factor in this model.
  • Each question and, thus, each influence factor is weighted according to the weighting scheme.
  • The influence factor is equal to 1 if the criterion corresponds to the criterion to which the question is assigned.
  • The influence factor is halved per dependency.
As each question in the questionnaire is clearly assigned to only one of the nine criteria (see Section 3.1), there are nine question categories, where one category corresponds to one criterion. With this information and these rules, an overview table between the categories of the questions and the influence factors of the criteria can be created. In order to identify the influence factors, the shortest possible connection is always selected. Based on the criteria interdependencies outlined in Figure 1, the shortest possible connections between all criteria and the question categories are counted. It is important to note that the number of connections may only be counted in the direction of the arrow. As a consequence, not every criterion can be reached by every category of a question. As mentioned above, the influence factor is equal to 1 if the category of the question is equal to the criterion. In other words, this implies that the influence factor is equal to 1 if the number of connections is 0. The influence factor is divided by a factor of 2 with each new connection, resulting in Table 3, which shows the influence factors between the criteria and the categories of the questions.
That means, if the number of connections is equal to 1, the influence factor is equal to 0.5; if the number of connections is 2, the influence factor is equal to 0.25; if the number of connections is 3, the influence factor is equal to 0.125; and so on.
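The shortest-connection counting and halving rule can be sketched as follows; the dependency edges used here are an illustrative stand-in, since Figure 1 is not reproduced in the text:

```javascript
// Influence factor = 0.5^(shortest directed connection count), and 0 if the
// criterion cannot be reached in the direction of the arrows.
// The edge list below is an illustrative stand-in for Figure 1.
const dependsOn = {
  C1: ["C5", "C9"],
  C2: ["C6"],
  C5: ["C6"],
};

function influenceFactor(category, criterion, edges) {
  if (category === criterion) return 1; // 0 connections
  // Breadth-first search along the arrow direction finds the shortest path.
  const queue = [[category, 0]];
  const seen = new Set([category]);
  while (queue.length > 0) {
    const [node, dist] = queue.shift();
    for (const next of edges[node] || []) {
      if (next === criterion) return Math.pow(0.5, dist + 1);
      if (!seen.has(next)) {
        seen.add(next);
        queue.push([next, dist + 1]);
      }
    }
  }
  return 0; // criterion not reachable from this question category
}
```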
Based on the results of the evaluative qualitative content analysis, the weighting scheme for the maturity model was developed. In the first step, the importance of each criterion was determined. As each code in the evaluative qualitative content analysis corresponds to one criterion of the maturity model, the results of the analysis can be directly transferred to the weighting scheme. As the number of coded segments varied among the interviews, relative frequencies were used. To retrieve the importance of the criteria, the number of coded segments with the characteristic Unimportant/Uncritical (−) was subtracted from the number of coded segments with the characteristic Important/Critical (!!) for all codes. To obtain one value for the importance of each criterion, the sum of these differences was calculated over all interviews, so that all interviews were assessed equally. In the final step of determining the weights for the criteria, it was necessary to transform the importance values in such a manner that they could be applied in the maturity model. The maximum possible weight for a criterion is 1. Negative values for the weights may lead to confusion and contradictions with the logic of the questionnaire; therefore, the minimum possible weight for a criterion is 0. A negative importance value occurred exclusively for uniqueness, a criterion that the majority of experts cited as being of negligible significance. This scheme for the weights was applied to the determined importance of the criteria: initially, all negative values of the importance are set to zero, and afterwards, the values are normalized by the maximum value of the importance.
These two steps ensure that the final weights range from 0 to 1, where 0 equals the minimum possible weight and 1 equals the maximum possible weight. The application of this scheme is presented in Table 4. The last column contains the normalized values that correspond to the final weights for the criteria.
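The two transformation steps can be sketched as follows; the importance values used for illustration are invented, not those of Table 4:

```javascript
// Turn per-criterion importance scores (sum of "!!" minus "-" codings over
// all interviews) into weights in [0, 1]: clamp negative values to zero,
// then normalize by the maximum value.
function importanceToWeights(importance) {
  const clamped = importance.map((v) => Math.max(v, 0));
  const max = Math.max(...clamped);
  return clamped.map((v) => v / max);
}
```

For example, illustrative importance scores of 4, −2, and 8 yield the weights 0.5, 0, and 1, respectively.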
Before the level of maturity can be calculated, all questions of the completed questionnaire have to be evaluated regarding the correct answer of the underlying criterion. If a question is answered correctly with regard to the criterion, the influence factors of this question are determined and noted for all criteria in accordance with Table 3. If a question is answered incorrectly with regard to the criterion, the influence factors are zero for all criteria for this question. If a question is answered with not relevant, the question is completely neglected within the evaluation and the subsequent calculation of the level of maturity. The determined weights for the criteria describe the importance of a criterion in the data preparation process. As each question of the questionnaire influences the other criteria due to the criteria interdependencies, the weight of a criterion directly impacts the influence factor of each question.
The mathematical calculation for the level of maturity for data preparation ρ with the integrated weighting scheme can be described as follows:
$$\rho = 100 \cdot \frac{\sum_{i=1}^{n} \Theta_i}{\sum_{i=1}^{n} \Theta_i^{\max}}$$
where n is the number of criteria, $\Theta_i$ is the weighted influence factor of criterion i over all questions, and $\Theta_i^{\max}$ is the maximum achievable weighted influence factor. The weighted influence factor of a criterion is calculated by
$$\Theta_i = \sum_{j=1}^{m} \sum_{k=1}^{n} \kappa_{j,k} \cdot \omega_i$$
where m is the number of questions that were answered with either a yes or a no, $\kappa_{j,k}$ is the influence factor of criterion k for question j, and $\omega_i$ is the weight of the criterion. Through the product $\kappa_{j,k} \cdot \omega_i$, each influence factor of a question is additionally weighted with the weight of the criterion to which the question is assigned. The impact of a question in the questionnaire on the maturity level therefore depends on the weight of its assigned criterion. In this calculation, the maturity level is expressed as a percentage. To provide more detailed feedback on each criterion, it is also useful to calculate percentages of the weighted influence factors relative to the corresponding maximum achievable weighted influence factors.
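As an illustrative sketch (our own Python interpretation, not the authors' implementation), the calculation can be operationalized as follows: each question's influence factors from Table 3 are scaled by the weight, from Table 4, of the criterion to which the question is assigned, and questions marked not relevant are excluded from both the achieved and the maximum sums. The example answers at the end are hypothetical.

```python
def maturity_level(questions, kappa, omega):
    """Sketch of the maturity level calculation (as a percentage).

    questions: list of (criterion_index, answer) pairs, where answer is
               "correct", "incorrect", or "not relevant".
    kappa:     kappa[c][k] is the influence factor of criterion k for a
               question of category c (Table 3).
    omega:     weights of the criteria (Table 4).
    """
    n = len(omega)
    achieved = [0.0] * n
    maximum = [0.0] * n
    for criterion, answer in questions:
        if answer == "not relevant":
            continue  # neglected entirely in the evaluation
        for k in range(n):
            weighted = kappa[criterion][k] * omega[criterion]
            maximum[k] += weighted       # maximum achievable share
            if answer == "correct":
                achieved[k] += weighted  # achieved share
    return 100.0 * sum(achieved) / sum(maximum)

# Influence factors of Table 3 (rows: question category, columns: criteria).
kappa = [
    [1, 0.5, 0.0625, 0.5, 0.25, 0.5, 0.25, 0.125, 0.5],
    [0.25, 1, 0.015625, 0.5, 0.0625, 0.125, 0.0625, 0.03125, 0.5],
    [0.25, 0.125, 1, 0.125, 0.5, 0.125, 0.25, 0.5, 0.25],
    [0.5, 0.25, 0.03125, 1, 0.125, 0.25, 0.125, 0.0625, 0.5],
    [0.25, 0.125, 0.25, 0.125, 1, 0.125, 0.5, 0.5, 0.5],
    [0.25, 0.125, 0.125, 0.5, 0.5, 1, 0.5, 0.25, 0.5],
    [0.25, 0.125, 0.25, 0.125, 0.5, 0.125, 1, 0.5, 0.5],
    [0.5, 0.25, 0.5, 0.25, 0.25, 0.25, 0.125, 1, 0.25],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
]
# Final weights of Table 4.
omega = [0.70, 0.00, 0.02, 0.29, 0.80, 0.32, 0.56, 1.00, 0.07]

# Hypothetical mini-questionnaire: one Completeness question answered
# correctly, one Accuracy question answered incorrectly, and one
# Timeliness question marked not relevant.
questions = [(0, "correct"), (4, "incorrect"), (2, "not relevant")]
rho = maturity_level(questions, kappa, omega)  # approximately 48.9
```

Note that a criterion with weight zero (here, Uniqueness) cannot influence the result, which matches the behavior of the dynamic weight adjustment described in Section 5.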

5. Result Representation

As the developed weighted maturity model contains corresponding influencing factors and criteria interdependencies, we developed a novel interactive visualization technique that presents the complex relationships of the model in a compact form. The visualization is intended to help users interpret the results of the maturity model, show the percentage dependencies of the influencing factors, and provide an alternative perspective of the level of maturity.
Primarily, the result for the level of maturity and the percentage values of the influencing factors are visualized based on the layout of a spider chart (also known as a radar chart), covering the nine criteria of the maturity model. A spider chart is a visualization technique for multivariate data, using radial axes where connected data points form a shape to highlight relationships between variables [52,53]. Instead of using shape representations, we fill in the areas of the radial layout to represent the influence factor for the corresponding criteria. Since the maturity model consists of nine criteria, the visualization technique creates a nonagon where each segment corresponds to one criterion. Information about influencing factors is visualized in the corresponding nonagon segment as stacked areas, similar to a stacked bar chart. In addition, each criterion is indicated by a distinct color where the contrast between the colors is maximized [54]. Figure 2 illustrates this using an evaluated questionnaire.
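The underlying layout geometry is straightforward; the following sketch (our own illustrative Python, not the D3.js code of the web application, and with hypothetical input percentages) computes the wedge boundaries and the filled radius fraction per criterion.

```python
import math

def nonagon_segments(percentages):
    """Return, for each criterion, the start/end angle of its wedge
    (in radians) and the fraction of the radius to fill, so that a
    stacked area of the given percentage can be drawn per segment."""
    n = len(percentages)       # nine criteria yield a nonagon
    span = 2 * math.pi / n     # each wedge covers 360/n degrees
    return [{"start": i * span,
             "end": (i + 1) * span,
             "fill": pct / 100.0}
            for i, pct in enumerate(percentages)]

# Hypothetical per-criterion percentages of the influence factors.
segments = nonagon_segments([70, 0, 2, 29, 80, 32, 56, 100, 7])
# Nine wedges of 40 degrees each; the eighth wedge (100%) is filled completely.
```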
Since the criteria are interdependent, the contribution of each criterion to the achieved influence factor must also be visualized. This means we visualize the composition of each influence factor, which consists of the corresponding criterion itself and all other criteria it depends on. For this purpose, we incorporate chord diagrams into the visualization technique [55,56]. In a chord diagram, connections between categories are displayed as curved lines (chords) that link different segments around the circular layout. These connections visually represent relationships, dependencies, or flows between the categories. In our case, the chords illustrate the influence factor of a criterion, breaking it down into its own share and the contributions from other dependent criteria. Figure 2 shows an illustration of this interaction by visualizing the achieved shares for the Accuracy criterion. If a criterion has no connecting chord to the highlighted criterion, it did not contribute to the influence factor in this specific case and for the underlying maturity model results. This should not be confused with the criteria interdependencies in Figure 1, which represent the general relationships between the criteria. Additionally, the area of the influencing factor is divided into bins of varying sizes, with the width representing the proportion that the corresponding criterion contributes to the influencing factor. The outermost strip always represents the current criterion for which the corresponding proportions are specified; the following strips are sorted according to their share. Mouse-over interaction activates the corresponding chord view to explore the shares of a criterion, with tables displayed alongside the visualization for an overview of all shares of the highlighted criterion.
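The share decomposition behind the chords can likewise be sketched; the following is an illustrative simplification in Python (the matrix, weights, and counts below are toy values, not the model's), in which a criterion's achieved influence factor is split into the contributions of the question categories and ordered as the strips are.

```python
def influence_shares(i, correct_counts, kappa, omega):
    """Split the achieved weighted influence factor of criterion i into
    the relative shares contributed by each question category c; the
    criterion's own share comes first, the rest is sorted by size."""
    contrib = {c: correct_counts[c] * kappa[c][i] * omega[c]
               for c in range(len(omega)) if correct_counts[c] > 0}
    total = sum(contrib.values())
    if total == 0:
        return []
    shares = {c: value / total for c, value in contrib.items()}
    order = sorted(shares, key=lambda c: (c != i, -shares[c]))
    return [(c, shares[c]) for c in order]

# Toy setup with three criteria (illustrative values only).
kappa = [[1.0, 0.5, 0.25],
         [0.5, 1.0, 0.125],
         [0.25, 0.5, 1.0]]
omega = [1.0, 0.5, 0.8]
correct_counts = [2, 1, 0]  # correctly answered questions per category
shares = influence_shares(0, correct_counts, kappa, omega)
# Criterion 0 contributes 8/9 of its own influence factor; criterion 1 the rest.
```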
This visualization technique offers an alternative view by displaying both the achieved and missing shares for each criterion. It allows for a clear representation of what is needed to reach the maximum maturity level for each influence factor (see Figure 3). Another feature allows for the dynamic adjustment of criteria weightings (see Table 4). When the weighting scheme is modified, the visualization and underlying maturity results are automatically updated based on the specified weights. This functionality enables refining the results to better align with the specific application area. For example, if a criterion is deemed irrelevant to data preparation or subsequent analysis, its weight can be set to zero, ensuring that all related questions in the questionnaire have no impact on the maturity level for data preparation.
Our novel visualization technique is integrated into the web application for the weighted maturity model. Since the application relies entirely on client-side scripting, the visualization utilizes the open-source JavaScript library D3.js (https://d3js.org (accessed on 9 February 2025)), which offers powerful functions and tools for creating dynamic, interactive visualizations [57].

6. Conclusions

The developed comprehensive weighted maturity model enhances the data preparation process, ensuring that the data are prepared in the best possible way to meet the objectives of the data analysis. The calculated level of maturity for data preparation is a quantifiable value that describes the extent to which the prepared data align with the subsequent data analysis. The integration of the weighting scheme for the criteria used to ascertain the level of maturity is a significant development, as it makes the level of maturity considerably more expressive for the data preparation process. The automated evaluation via a web application, together with a novel visualization of the maturity results and its additional features for deeper insight, provides an accessible, straightforward, and fast way to apply the weighted maturity model within a data analysis project.
An obvious step for future work is to publish the web application on a publicly accessible web page so that the weighted maturity model can be used by anyone. Furthermore, a comparison of the novel visualization with established visualization types is necessary to demonstrate its advantages. The gathered weights reveal Relevance/Interpretability as the most important criterion for the data preparation process. This observation is substantiated by the literature, as understanding the objectives can be seen as perhaps the most important phase of a data analysis project [11]. The Uniqueness criterion, conversely, is of the least importance. This observation prompts a review of the criteria, with the possibility of removing the Uniqueness criterion or substituting it with an alternative criterion that seems to be missing from the current weighted maturity model. In such cases, the criteria interdependencies, as well as the evaluative qualitative content analysis, must be revised to adjust the weighted maturity model to the changes in the criteria.

Author Contributions

Conceptualization and methodology, L.K., L.S. and T.U.; software, L.K.; investigation, L.K., L.S. and T.U.; writing and visualization, L.K., L.S. and T.U.; supervision, L.S. and T.U.; project administration, T.U.; funding acquisition, T.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Austrian Research Promotion Agency (FFG) within the research project “PREdictions for Science, Engineering N’ Technology (PRESENT)”, GA No. FO999899544.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available in a publicly accessible repository: https://codeberg.org/FraunhoferAustria/PRESENT_DataMaturityModel.git (accessed on 14 April 2025).

Conflicts of Interest

Authors Lin Shao and Torsten Ullrich were employed by Fraunhofer Austria Research GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Interview Guide

It should be noted that the original interview guide and, thus, the version distributed to the interview participants were formulated in German. The interview guide shown below presents its English version.
[Interview guide image: Data 10 00055 i001]

Appendix B. Coding Frame

Table A1. Codebook for the qualitative content analysis.
Name of the code: Completeness
Content description: Describes whether sufficient data have been collected for the analysis, whether the dataset is filled accordingly, and whether the corresponding paradata and metadata are available.
Use of the code: Completeness is to be coded if the following aspects are mentioned:
  • Enough data collected, control of representativeness;
  • Empty fields within the dataset and their impact;
  • Paradata information such as the identity of the data collector, the devices used, or the description of the environmental conditions;
  • Metadata information such as a description of the meaning of the values within the dataset, a description of applied methods for data cleaning and data organization, or a description of special formats and abbreviations.
Name of the code: Uniqueness
Content description: Describes the impact of duplicates in the dataset.
Use of the code: Uniqueness is to be coded if the following aspects are mentioned:
  • Duplicates and their impact;
  • Application of a data cleaning process to detect duplicates.
Name of the code: Timeliness
Content description: Describes effects of delays between the receipt and the recording of the data.
Use of the code: Timeliness is to be coded if the following aspects are mentioned:
  • If a delay between the data collection and the integration of the data into the dataset affects the data analysis or the project.
Name of the code: Validity/Interoperability
Content description: Describes the correctness of the data in terms of formats, syntax, and further processing steps.
Use of the code: Validity/Interoperability is to be coded if the following aspects are mentioned:
  • Impacts of formats on subsequent analysis steps;
  • Uniform way of formatting and organizing the data;
  • Presence of non-digital data.
Name of the code: Accuracy
Content description: Describes how accurately the data reflect the underlying event that is being described.
Use of the code: Accuracy is to be coded if the following aspects are mentioned:
  • Accuracy of data sources such as the calibration of sensors;
  • Influences on the data such as the integrity of the data or influences during the data preparation process due to labeling, classifications, or the type of data collection;
  • Usage of correct scaling within the data collection for scalable data;
  • Plausibility of the data.
Name of the code: Consistency
Content description: Describes whether the data are logically interconnected and the influences of genuine contradictions.
Use of the code: Consistency is to be coded if the following aspects are mentioned:
  • Synchronization for time-related data;
  • Genuine contradictions in the data such as different data records for the same information due to spelling errors or different units;
  • Equal data for repeated data collections.
Name of the code: Credibility/Accessibility/Findability
Content description: Describes the reliability of the data and the underlying data sources.
Use of the code: Credibility/Accessibility/Findability is to be coded if the following aspects are mentioned:
  • Usage of standardized or independent methods within the data preparation process;
  • Authorization to use the collected data;
  • Accessibility of the data such as the process to be able to gather the data;
  • Verifiability of the collected data for externals such as the reproducibility of the data collection or the public accessibility of the data;
  • Publication of the data sources and the applied methods.
Name of the code: Relevance/Interpretability
Content description: Describes the affiliation of the data to the objectives of the data analysis.
Use of the code: Relevance/Interpretability is to be coded if the following aspects are mentioned:
  • Understanding the objectives of the data in all steps of the data preparation process and understanding the data themselves;
  • Association of the data with the subsequent data analysis;
  • Interpretability of the data.
Name of the code: Reusability
Content description: Describes the aim to reuse the data for other analyses or projects.
Use of the code: Reusability is to be coded if the following aspects are mentioned:
  • Reusability of the dataset or aspects of the data for other analyses;
  • Publication of the data and the corresponding results of the data analysis.

Appendix C. Checklist

The final questionnaire can be used via the web application, which includes an automatic evaluation method and a result visualization function, or via a simple checklist:
[Checklist images: Data 10 00055 i002, Data 10 00055 i003]

References

  1. CrowdFlower. 2017 Data Scientist Report; CrowdFlower: San Francisco, CA, USA, 2017. [Google Scholar]
  2. Anaconda. 2022 State of Data Science. Available online: https://www.anaconda.com/resources/whitepapers/state-of-data-science-report-2022 (accessed on 9 February 2025).
  3. Woodie, A. Data Prep Still Dominates Data Scientists’ Time, Survey Finds. Available online: https://www.bigdatawire.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/ (accessed on 9 February 2025).
  4. Anwar, M. Was ist Datenbereinigung? Ein vollständiger Leitfaden. Available online: https://www.astera.com/de/type/blog/data-cleansing/ (accessed on 9 February 2025).
  5. IBM Documentation. SPSS Modeler. Available online: https://www.ibm.com/docs/de/spss-modeler/18.4.0?topic=preparation-data-overview (accessed on 9 February 2025).
  6. Dasu, T.; Johnson, T. Exploratory Data Mining and Data Cleaning; John Wiley & Sons: Hoboken, NJ, USA, 2003. [Google Scholar]
  7. Matzer, M. Datenaufbereitung ist ein Unterschätzter Prozess. Available online: https://www.bigdata-insider.de/datenaufbereitung-ist-ein-unterschaetzter-prozess-a-803469/ (accessed on 9 February 2025).
  8. Pragmatic Institute. Overcoming the 80/20 Rule in Data Science. Available online: https://www.pragmaticinstitute.com/resources/articles/data/overcoming-the-80-20-rule-in-data-science/ (accessed on 9 February 2025).
  9. Lohr, S. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. The New York Times. Available online: https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html (accessed on 9 February 2025).
  10. Wirth, R.; Hipp, J. CRISP-DM: Towards a Standard Process Model for Data Mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; Volume 1, pp. 29–39. [Google Scholar]
  11. Shearer, C. The CRISP-DM Model: The New Blueprint for Data Mining. J. Data Warehous. 2000, 5, 13–22. [Google Scholar]
  12. Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. From Data Mining to Knowledge Discovery in Databases. AI Mag. 1996, 17, 37–54. [Google Scholar] [CrossRef]
  13. Azevedo, A.; Santos, M.F. KDD, SEMMA and CRISP-DM: A Parallel Overview. In Proceedings of the IADIS European Conference Data Mining, Amsterdam, The Netherlands, 24–26 July 2008. [Google Scholar]
  14. Dåderman, A.; Rosander, S. Evaluating Frameworks for Implementing Machine Learning in Signal Processing: A Comparative Study of CRISP-DM, SEMMA and KDD. Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2018. [Google Scholar]
  15. SAS Help Center. Introduction to SEMMA. Available online: https://documentation.sas.com/doc/en/emref/14.3/n061bzurmej4j3n1jnj8bbjjm1a2.htm (accessed on 9 February 2025).
  16. Shafique, U.; Qaiser, H. A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA). Int. J. Innov. Sci. Res. 2014, 12, 217–222. [Google Scholar]
  17. Jackson, J. Data Mining; A Conceptual Overview. Commun. Assoc. Inf. Syst. 2002, 8, 19. [Google Scholar] [CrossRef]
  18. Wendler, R. The Maturity of Maturity Model Research: A Systematic Mapping Study. Inf. Softw. Technol. 2012, 54, 1317–1339. [Google Scholar] [CrossRef]
  19. Paulk, M.; Curtis, B.; Chrissis, M.; Weber, C. Capability Maturity Model, Version 1.1. IEEE Softw. 1993, 10, 18–27. [Google Scholar] [CrossRef]
  20. Al-Sai, Z.A.; Abdullah, R.; Husin, M.H. A Review on Big Data Maturity Models. In Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 9–11 April 2019; pp. 156–161. [Google Scholar] [CrossRef]
  21. Moore, D.T. Roadmaps and Maturity Models: Pathways toward Adopting Big Data. In Proceedings of the Conference for Information Systems Applied Research, Baltimore, MD, USA, 6–9 November 2014; Volume 2167. [Google Scholar]
  22. Gökalp, M.O.; Gökalp, E.; Kayabay, K.; Koçyiğit, A.; Eren, P.E. The Development of the Data Science Capability Maturity Model: A Survey-Based Research. Online Inf. Rev. 2021, 46, 547–567. [Google Scholar] [CrossRef]
  23. Al-Sai, Z.A.; Husin, M.H.; Syed-Mohamad, S.M.; Abdullah, R.; Zitar, R.A.; Abualigah, L.; Gandomi, A.H. Big Data Maturity Assessment Models: A Systematic Literature Review. Big Data Cogn. Comput. 2023, 7, 2. [Google Scholar] [CrossRef]
  24. Coleman, S.; Göb, R.; Manco, G.; Pievatolo, A.; Tort-Martorell, X.; Reis, M.S. How Can SMEs Benefit from Big Data? Challenges and a Path Forward. Qual. Reliab. Eng. Int. 2016, 32, 2151–2164. [Google Scholar] [CrossRef]
  25. Comuzzi, M.; Patel, A. How Organisations Leverage Big Data: A Maturity Model. Ind. Manag. Data Syst. 2016, 116, 1468–1492. [Google Scholar] [CrossRef]
  26. Davenport, T.; Harris, J. Competing on Analytics: Updated, with a New Introduction: The New Science of Winning; Harvard Business Press: Brighton, MA, USA, 2017. [Google Scholar]
  27. Król, K.; Zdonek, D. Analytics Maturity Models: An Overview. Information 2020, 11, 142. [Google Scholar] [CrossRef]
  28. Halper, F. TDWI Analytics Maturity Model. TDWI Res. 2020, 22, 22. [Google Scholar]
  29. Korsten, G.; Aysolmaz, B.; Ozkan, B.; Turetken, O. A Capability Maturity Model for Developing and Improving Advanced Data Analytics Capabilities. Pac. Asia J. Assoc. Inf. Syst. 2024, 16, 1. [Google Scholar]
  30. Spruit, M.; Pietzka, K. MD3M: The Master Data Management Maturity Model. Comput. Hum. Behav. 2015, 51, 1068–1076. [Google Scholar] [CrossRef]
  31. Crowston, K.; Qin, J. A Capability Maturity Model for Scientific Data Management: Evidence from the Literature. Proc. Am. Soc. Inf. Sci. Technol. 2011, 48, 1–9. [Google Scholar] [CrossRef]
  32. CMMI Institute. Data Management Maturity (DMM) Model. Available online: https://stage.cmmiinstitute.com/getattachment/cb35800b-720f-4afe-93bf-86ccefb1fb17/attachment.aspx (accessed on 14 April 2025).
  33. Yang, B.; Wu, H.; Zhang, H. Research and Application of Data Management Based on Data Management Maturity Model (DMM). In Proceedings of the ICMLC 2018: 2018 10th International Conference on Machine Learning and Computing, Macau, China, 26–28 February 2018; pp. 157–160. [Google Scholar] [CrossRef]
  34. Belghith, O.; Skhiri, S.; Zitoun, S.; Ferjaoui, S. A Survey of Maturity Models in Data Management. In Proceedings of the 2021 IEEE 12th International Conference on Mechanical and Intelligent Manufacturing Technologies (ICMIMT), Cape Town, South Africa, 13–15 May 2021; pp. 298–309. [Google Scholar] [CrossRef]
  35. Ryu, K.S.; Park, J.S.; Park, J.H. A Data Quality Management Maturity Model. ETRI J. 2006, 28, 191–204. [Google Scholar] [CrossRef]
  36. Hüner, K.M.; Ofner, M.; Otto, B. Towards a Maturity Model for Corporate Data Quality Management. In Proceedings of the 2009 ACM Symposium on Applied Computing, Honolulu, HI, USA, 8–12 March 2009; SAC ’09. pp. 231–238. [Google Scholar] [CrossRef]
  37. Ofner, M.; Otto, B.; Österle, H. A Maturity Model for Enterprise Data Quality Management. Enterp. Model. Inf. Syst. Archit. 2013, 8, 4–24. [Google Scholar] [CrossRef]
  38. Kim, S.; Pérez-Castillo, R.; Caballero, I.; Lee, D. Organizational Process Maturity Model for IoT Data Quality Management. J. Ind. Inf. Integr. 2022, 26, 100256. [Google Scholar] [CrossRef]
  39. Kirikoglu, O. A Maturity Model for Improving Data Quality Management. Master’s Thesis, University of Twente, Enschede, The Netherlands, 2017. [Google Scholar]
  40. Twilt, S. A Data Analytics Maturity Assessment Model for Data-Intensive Organizations. Master’s Thesis, Utrecht University, Utrecht, The Netherlands, 2023. [Google Scholar]
  41. Giovannini, E.; Ward, D. Quality Framework for OECD Statistics. In Proceedings of the Conference on Data Quality for International Organizations, Wiesbaden, Germany, 27–28 May 2004. [Google Scholar]
  42. Durand, M. Quality Framework and Guidelines for OECD Statistical Activities, Version 2011/1; OECD: Paris, France, 2012. [Google Scholar]
  43. Askham, N.; Cook, D.; Doyle, M.; Fereday, H.; Gibson, M.; Landbeck, U.; Lee, R.; Maynard, C.; Palmer, G.; Schwarzenbach, J. The Six Primary Dimensions for Data Quality Assessment; Data Management Association: Bristol, UK, 2013. [Google Scholar]
  44. RDA FAIR Data Maturity Model Working Group. FAIR Data Maturity Model: Specification and Guidelines. Research Data Alliance. Available online: https://zenodo.org/records/3909563#.YGRNnq8za70 (accessed on 14 April 2025). [CrossRef]
  45. Hameed, M.; Naumann, F. Data Preparation: A Survey of Commercial Tools. ACM SIGMOD Rec. 2020, 49, 18–29. [Google Scholar] [CrossRef]
  46. Kuckartz, U.; Rädiker, S. Qualitative Inhaltsanalyse. Methoden, Praxis, Umsetzung Mit Software Und Künstlicher Intelligenz, 6th ed.; Beltz Juventa: Weinheim, Germany, 2024. [Google Scholar]
  47. Schultz, D.; Cook, C. Client-Side Scripting Basics. In Beginning HTML with CSS and XHTML: Modern Guide and Reference; Apress: New York, NY, USA, 2007; pp. 251–279. [Google Scholar] [CrossRef]
  48. Marcotte, E. Responsive Web Design. Available online: https://alistapart.com/article/responsive-web-design/ (accessed on 9 February 2025).
  49. Giurgiu, L.; Gligorea, I. Responsive Web Design Techniques. In International Conference Knowledge-Based Organization; Sciendo: Warsaw, Poland, 2017; Volume 23, pp. 37–42. [Google Scholar] [CrossRef]
  50. Spurlock, J. Bootstrap: Responsive Web Development; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2013. [Google Scholar]
  51. Gaikwad, S.S.; Adkar, P. A Review Paper On Bootstrap Framework. IRE J. 2019, 2, 349–351. [Google Scholar]
  52. Liu, W.Y.; Wang, B.W.; Yu, J.X.; Li, F.; Wang, S.X.; Hong, W.X. Visualization Classification Method of Multi-Dimensional Data Based on Radar Chart Mapping. In Proceedings of the 2008 International Conference on Machine Learning and Cybernetics, Kunming, China, 12–15 July 2008; Volume 2, pp. 857–862. [Google Scholar] [CrossRef]
  53. Porter, M.M.; Niksiar, P. Multidimensional Mechanics: Performance Mapping of Natural Biological Systems Using Permutated Radar Charts. PLoS ONE 2018, 13, e0204309. [Google Scholar] [CrossRef] [PubMed]
  54. Green-Armytage, P. A Colour Alphabet and the Limits of Colour Coding. JAIC-J. Int. Colour Assoc. 2010, 5, 1–23. [Google Scholar]
  55. Mazel, J.; Fontugne, R.; Fukuda, K. Visual Comparison of Network Anomaly Detectors with Chord Diagrams. In Proceedings of the SAC 2014: Symposium on Applied Computing, Gyeongju, Republic of Korea, 24–28 March 2014; pp. 473–480. [Google Scholar] [CrossRef]
  56. Keahey, T.A. Using Visualization to Understand Big Data. In IBM Business Analytics Advanced Visualisation; IBM Corporation: New York, NY, USA, 2013; Volume 16. [Google Scholar]
  57. Teller, S. Data Visualization with D3.js; Packt Publishing: Birmingham, UK, 2013. [Google Scholar]
Figure 1. Overview of the interdependencies between the criteria. The direction of the arrows indicates the direction of influence among the criteria. There are two types of connections: unidirectional and bidirectional.
Figure 2. Illustration of the novel visualization approach by showing the achieved shares of a criterion. In this example, the visualization depicts the attained value for accuracy, along with the factors that exert a dependent influence on it.
Figure 3. This alternative perspective on our visualization technique is intended to demonstrate the missing shares for a specific criterion. In this instance, the connections illustrate the impact of the specified criterion on the unaccounted portion of reusability.
Table 1. Division of the questions in the questionnaire among the criteria.
Criterion: Number of Questions in the Questionnaire
Completeness: 12
Uniqueness: 4
Timeliness: 1
Validity/Interoperability: 6
Accuracy: 8
Consistency: 6
Credibility/Accessibility/Findability: 7
Relevance/Interpretability: 6
Reusability: 2
Table 2. Absolute frequencies of the characteristics for each code.
C1C2C3C4C5C6C7C8C9
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
InterviewB15110000031000012000
B2501000106140901000
B35000000090204010120
B4010000406040308010
B56011105020101002010
B6400000403010703010
B7310100502170308000
B81310100204021206000
B91020103110010111041
B10510100005140403200
B111010000302061508002
B12501100016001106101
B13000000003000004000
B14220200101000327000
B151600200328141403220
8086920314705364563826114
Table 3. Influence factors of the criteria for each question category.
Columns (criteria, in order): Completeness | Uniqueness | Timeliness | Validity/Interoperability | Accuracy | Consistency | Credib./Accessib./Findab. | Relevance/Interpretability | Reusability
Question category Completeness: 1 | 0.5 | 0.0625 | 0.5 | 0.25 | 0.5 | 0.25 | 0.125 | 0.5
Question category Uniqueness: 0.25 | 1 | 0.015625 | 0.5 | 0.0625 | 0.125 | 0.0625 | 0.03125 | 0.5
Question category Timeliness: 0.25 | 0.125 | 1 | 0.125 | 0.5 | 0.125 | 0.25 | 0.5 | 0.25
Question category Validity/Interoperability: 0.5 | 0.25 | 0.03125 | 1 | 0.125 | 0.25 | 0.125 | 0.0625 | 0.5
Question category Accuracy: 0.25 | 0.125 | 0.25 | 0.125 | 1 | 0.125 | 0.5 | 0.5 | 0.5
Question category Consistency: 0.25 | 0.125 | 0.125 | 0.5 | 0.5 | 1 | 0.5 | 0.25 | 0.5
Question category Credib./Accessib./Findab.: 0.25 | 0.125 | 0.25 | 0.125 | 0.5 | 0.125 | 1 | 0.5 | 0.5
Question category Relevance/Interpretability: 0.5 | 0.25 | 0.5 | 0.25 | 0.25 | 0.25 | 0.125 | 1 | 0.25
Question category Reusability: 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
Table 4. Final weights for the criteria for the maturity model based on the results of the qualitative content analysis.
Criterion | Importance | Scheme Step 1 | Final Weights (Scheme Step 2: normalization by the maximum, 3.21)
Completeness | 2.25 | 2.25 | 0.70
Uniqueness | −0.09 | 0.00 | 0.00
Timeliness | 0.07 | 0.07 | 0.02
Validity/Interop. | 0.93 | 0.93 | 0.29
Accuracy | 2.57 | 2.57 | 0.80
Consistency | 1.03 | 1.03 | 0.32
Credib./Access./Find. | 1.81 | 1.81 | 0.56
Relevance/Interpret. | 3.21 | 3.21 | 1.00
Reusability | 0.23 | 0.23 | 0.07
