Event Log Data Quality Issues and Solutions
Abstract
1. Introduction
2. Theoretical Foundations
2.1. Event Log Concept
Case ID | Activity Name | Timestamp | Resource |
---|---|---|---|
Case 1 | Milling | 29 January 2023 23:24 | Machine 1 |
Case 1 | Laser marking | 30 January 2023 05:44 | Machine 2 |
Case 1 | Round grinding | 30 January 2023 06:59 | Machine 3 |
Case 1 | Packing | 30 January 2023 07:21 | Employee 1 |
Case 2 | Milling | 31 January 2023 13:20 | Machine 1 |
Case 2 | Laser marking | 1 February 2023 08:18 | Machine 2 |
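As a minimal sketch, the example log above can be held in memory as a list of typed records and grouped into one trace per case. The names `Event`, `parse`, `EVENT_LOG`, and `traces` are illustrative assumptions, not part of any process mining tool.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    case_id: str
    activity: str
    timestamp: datetime
    resource: str


def parse(ts: str) -> datetime:
    # Timestamp format as printed in the table, e.g. "29 January 2023 23:24".
    return datetime.strptime(ts, "%d %B %Y %H:%M")


# The six events from the example table.
EVENT_LOG = [
    Event("Case 1", "Milling", parse("29 January 2023 23:24"), "Machine 1"),
    Event("Case 1", "Laser marking", parse("30 January 2023 05:44"), "Machine 2"),
    Event("Case 1", "Round grinding", parse("30 January 2023 06:59"), "Machine 3"),
    Event("Case 1", "Packing", parse("30 January 2023 07:21"), "Employee 1"),
    Event("Case 2", "Milling", parse("31 January 2023 13:20"), "Machine 1"),
    Event("Case 2", "Laser marking", parse("1 February 2023 08:18"), "Machine 2"),
]


def traces(log):
    """Group events by case ID and order each case's activities by timestamp."""
    grouped = defaultdict(list)
    for event in sorted(log, key=lambda e: e.timestamp):
        grouped[event.case_id].append(event.activity)
    return dict(grouped)


if __name__ == "__main__":
    for case_id, trace in traces(EVENT_LOG).items():
        print(case_id, "->", trace)
```

Sorting by timestamp before grouping means the resulting traces depend on timestamp quality, which is one reason timestamp issues matter for later analysis.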
2.2. Event Log Data Quality Issues
- The missing data category refers to a quality issue where information is missing in a log, although it is mandatory. For example, “the missing data: case issue refers to the scenario where a case has been executed in reality, but it has not been recorded in the log” [3];
- The incorrect data category refers to a quality issue where information is provided but logged incorrectly. For example, “the incorrect cases issue corresponds to the scenario where certain cases in the log belong to a different process” [3];
- The imprecise data category refers to quality issues where the logged entries are too coarse, leading to a loss of precision. For example, “the imprecise activity names issue responds to a scenario where within a trace, there may be multiple events with the same activity name” [3];
- The irrelevant data category refers to quality issues where the logged entries may be irrelevant for process mining analysis. For example, “the irrelevant cases issue responds to a scenario where certain cases in an event log are deemed to be irrelevant for a particular context of analysis” [3].
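Two of these categories lend themselves to simple automated checks. The sketch below is only an illustration under stated assumptions: events are plain dictionaries, the set of mandatory attributes in `MANDATORY` is a hypothetical choice, and a repeated activity name is treated merely as a hint of an imprecise label.

```python
from collections import Counter

# Assumed set of mandatory attributes; a real log schema may differ.
MANDATORY = ("case_id", "activity", "timestamp", "resource")


def missing_fields(event: dict) -> list:
    """Return the mandatory attributes that are absent or empty in one event."""
    return [field for field in MANDATORY if event.get(field) in (None, "")]


def repeated_activities(trace: list) -> list:
    """Return activity names that occur more than once within one trace.

    Repetition is only a hint of an imprecise (too coarse) activity name;
    it can also reflect a legitimate loop or rework step.
    """
    counts = Counter(trace)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    event = {"case_id": "Case 1", "activity": "Milling", "timestamp": None}
    print(missing_fields(event))
    print(repeated_activities(["Check", "Milling", "Check"]))
```

Incorrect and irrelevant data generally cannot be detected this way, because they require knowledge of the real process or of the analysis context.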
- The case entity refers to a process instance being executed;
- The event entity refers to the activity of a process;
- The relationship entity refers to an association between cases and events;
- The case and event attributes entity refers to additional information a case or event can have. For example, for the event Milling (see Figure 1), the number of product parts can be missing;
- The position and timestamp entities both refer to the recorded time of the events, where the position entity describes the position of recorded events, and the timestamp entity describes the actual timestamp of an event;
- The activity name entity refers to the name or label of the recorded events;
- The resource entity refers to resources utilized to perform an activity, e.g., a human or a machine.
- Voluminous data, referring to a large number of recorded cases and events;
- Case heterogeneity, referring to a large number of distinct process traces, i.e., different executions of the same process;
- Event granularity, referring to a large number of distinct activities.
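The three complexity-related properties above can be approximated with simple counts. This is a rough sketch, assuming the log is given as `(case_id, activity)` pairs already in execution order; the function name `log_complexity` and the returned keys are illustrative.

```python
def log_complexity(log):
    """Count cases, events, distinct trace variants, and distinct activities.

    `log` is an iterable of (case_id, activity) pairs in execution order.
    """
    traces = {}
    for case_id, activity in log:
        traces.setdefault(case_id, []).append(activity)
    variants = {tuple(trace) for trace in traces.values()}
    activities = {a for trace in traces.values() for a in trace}
    return {
        "cases": len(traces),                               # volume
        "events": sum(len(t) for t in traces.values()),     # volume
        "variants": len(variants),                          # case heterogeneity
        "activities": len(activities),                      # event granularity
    }


if __name__ == "__main__":
    sample = [
        ("Case 1", "Milling"), ("Case 1", "Laser marking"),
        ("Case 1", "Round grinding"), ("Case 1", "Packing"),
        ("Case 2", "Milling"), ("Case 2", "Laser marking"),
    ]
    print(log_complexity(sample))
```

What counts as "large" for any of these numbers depends on the process and the tool, so the sketch reports raw counts rather than applying a threshold.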
2.3. A Review of Event Log Preprocessing Techniques
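Preprocessing Technique Categories | Preprocessing Technique | Software Tool | Frequency of Occurrence | Primary Studies (per Category) |
---|---|---|---|---|
Trace clustering | Trace clustering plug-in | ProM | 13% | [23,24,25,26,27,28] |
Trace clustering | Minimum spanning tree (MST) clustering | ProM | 3% | |
Trace clustering | Statistical inference-based clustering | / | 3% | |
Trace clustering | Total | | 19% | |
Trace/event filtering | Branch and bound algorithm | / | 9% | [26,29,30,31,32,33,34,35,36] |
Trace/event filtering | Entropy-based activity filtering | RapidProM | 3% | |
Trace/event filtering | Infrequent behavior filter | ProM | 6% | |
Trace/event filtering | Repair log plug-in | ProM | 6% | |
Trace/event filtering | Total | | 24% | |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | Bayesian network | MATLAB | 9% | [14,15,16,17,18,19,20,21,22] |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | Random forest | / | 3% | |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | SIER and MIEC algorithms | / | 3% | |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | Natural language processing (NLP) | / | 6% | |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | LSTM artificial neural network | / | 3% | |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | Decision tree algorithm CART | / | 3% | |
Artificial intelligence (AI), machine learning (ML), deep learning (DL) | Total | | 27% | |
Log repair techniques | Heuristic log repair | ProM | 6% | [37,38] |
Log repair techniques | Total | | 6% | |
Embedded preprocessing | Inductive miner | ProM | 3% | [39] |
Embedded preprocessing | ILP miner | ProM | 3% | |
Embedded preprocessing | Total | | 6% | |
Alignment-based techniques | Cost-based alignment | ProM | 6% | [14,16,40] |
Alignment-based techniques | Alignment-based conformance checking | ProM | 3% | |
Alignment-based techniques | Total | | 9% | |
Event abstraction | Semantic abstraction | CPN Tools | 3% | [41] |
Other | Blockchain technology | / | 3% | [42] |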
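To make the trace/event filtering category more concrete, the following sketch drops cases whose trace variant is rare. It is a deliberately simplified stand-in based on variant frequency, not the algorithm behind the ProM infrequent behavior filter or any other technique listed in the table; the function name and the `min_share` parameter are assumptions made for illustration.

```python
from collections import Counter


def filter_infrequent_variants(traces: dict, min_share: float) -> dict:
    """Keep only cases whose trace variant covers at least `min_share` of all cases.

    `traces` maps a case ID to its ordered list of activity names.
    """
    if not traces:
        return {}
    variant_counts = Counter(tuple(trace) for trace in traces.values())
    total = len(traces)
    return {
        case_id: trace
        for case_id, trace in traces.items()
        if variant_counts[tuple(trace)] / total >= min_share
    }


if __name__ == "__main__":
    log = {
        "c1": ["A", "B"],
        "c2": ["A", "B"],
        "c3": ["A", "B"],
        "c4": ["B", "A"],  # rare variant, covers 25% of cases
    }
    print(sorted(filter_infrequent_variants(log, min_share=0.5)))
```

A filter of this kind removes whole cases, so it trades completeness for a simpler discovered model; rare but legitimate behavior is discarded along with noise.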
3. Materials and Methods
3.1. Research Instrument
3.2. Sample and Data Collection
3.3. Applied Data Analysis Techniques
4. Results
4.1. Socio-Demographic Structure of Participants
4.2. The Perceived Importance and Frequency of Use of Event Log Data Quality Issues
4.3. The Perceived Importance and Frequency of Use of Event Log Preprocessing Techniques
4.4. The IPA Analysis of Perceived Importance and Frequency of Use of Event Log Data Quality Issues
- LDQI 1 Missing data: Case;
- LDQI 2 Missing data: Event (scattered event);
- LDQI 3 Missing data: Relationship (elusive case);
- LDQI 13 Incorrect data: Timestamp;
- LDQI 18 Imprecise data: Timestamp;
- LDQI 22 Volume, granularity, complexity.
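An importance-performance style analysis places each item in one of four quadrants according to two mean scores. The sketch below assumes the second axis is frequency and that the quadrant boundaries sit at the grand mean of each axis; both the boundary choice and the placeholder scores are assumptions for illustration and are not taken from the survey results.

```python
from statistics import mean


def ipa_quadrants(scores: dict) -> dict:
    """Assign each item to a quadrant.

    `scores` maps an item name to (mean importance, mean frequency).
    Cut-offs are the grand means of each axis (an assumed convention).
    """
    importance_cut = mean(imp for imp, _ in scores.values())
    frequency_cut = mean(freq for _, freq in scores.values())
    result = {}
    for item, (imp, freq) in scores.items():
        imp_level = "high" if imp >= importance_cut else "low"
        freq_level = "high" if freq >= frequency_cut else "low"
        result[item] = f"{imp_level} importance / {freq_level} frequency"
    return result


if __name__ == "__main__":
    # Placeholder scores on a 1-5 scale, invented purely to exercise the code.
    placeholder = {
        "item A": (4.0, 4.0),
        "item B": (4.0, 2.0),
        "item C": (2.0, 4.0),
        "item D": (2.0, 2.0),
    }
    for item, quadrant in ipa_quadrants(placeholder).items():
        print(item, "->", quadrant)
```

Using scale midpoints instead of grand means as cut-offs is an equally common choice and can move borderline items into a different quadrant.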
4.5. The IPA Analysis of Perceived Importance and Frequency of Use of Preprocessing Techniques
4.6. The Relationship between Event Log Data Quality Issues and Categories of Preprocessing Techniques
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Data Quality in Process Mining: Issues and Solutions
Appendix A.1. Demographics
- Poor
- Fair
- Good
- Very good
- Excellent
- Researcher
- Practitioner
- Both
- Other (please specify)
- Less than 1 year
- 1–5 years
- 5–10 years
- More than 10 years
- ProM
- Celonis
- Fluxicon Disco
- RapidProM
- IBM Process Mining
- SAP Signavio Business Intelligence
- ARIS Process Mining
- MPM ProcessMining
- QPR ProcessAnalyzer
- Apromore
- Appian Process Mining
- Other (please specify)
Appendix A.2. Event Log Data Quality Issues
- Missing data (different data can be missing from the event log),
- Incorrect data (data exists but is recorded incorrectly),
- Imprecise data (data are too coarse, leading to loss of precision), and
- Irrelevant data (the log entries are irrelevant for process mining tasks), manifested through event log entities (event, case, activity name, etc.).
Data quality issue | Not Important | Slightly Important | Moderately Important | Important | Very Important |
---|---|---|---|---|---|
Missing data: Case This quality issue refers to the scenario where a case has been executed in reality but has not been recorded in the log. | |||||
Missing data: Event (Scattered Event) This quality issue refers to the scenario where one or more events are missing within the trace, although they occurred in reality. | |||||
Missing data: Relationship (Elusive Case) This quality issue corresponds to the scenario where the associations between events and cases are missing. | |||||
Missing data: Activity name This quality issue corresponds to the scenario where the activity names of events are missing. | |||||
Missing data: Case and/or event attribute This quality issue corresponds to the scenario where the values corresponding to case and/or event attributes are missing. | |||||
Missing data: Timestamp This quality issue corresponds to the scenario where for one or more events, no timestamp is given. | |||||
Missing data: Resource This quality issue corresponds to the scenario where the resources that executed an activity have not been recorded. | |||||
Incorrect data: Case This quality issue corresponds to the scenario where certain cases in the log belong to a different process. | |||||
Incorrect data: Event This quality issue corresponds to the scenario where certain events in the event log are logged incorrectly. | |||||
Incorrect data: Relationship (Scattered Case) This quality issue corresponds to the scenario where the associations between events and cases are logged incorrectly. | |||||
Incorrect data: Activity name (Polluted Label, Distorted Label) This quality issue corresponds to the scenario where the activity names of events are logged incorrectly. | |||||
Incorrect data: Case and/or event attribute This quality issue corresponds to the scenario where the values corresponding to case and/or event attributes are logged incorrectly. | |||||
Incorrect data: Timestamp (Form-based Event Capture, Inadvertent Time Travel, Unanchored Event) This quality issue corresponds to the scenario where the recorded timestamps of (some or all) events in the log do not correspond to the real time at which the events occurred. | |||||
Incorrect data: Resource (Polluted Label) This quality issue corresponds to the scenario where the resources that executed an activity are logged incorrectly. | |||||
Imprecise data: Relationship This quality issue refers to the scenario in which due to the chosen definition of a case, it is not possible anymore to correlate events in the log to another case type. | |||||
Imprecise data: Activity name (Homonymous Label) This quality issue corresponds to the scenario in which activity names are too coarse. As a result, within a trace, there may be multiple events with the same activity name. | |||||
Imprecise data: Case and/or event attribute (Synonymous labels) This quality issue refers to the scenario in which, for a case and/or event attribute, it is not possible to properly use its value as the provided value is too coarse. | |||||
Imprecise data: Timestamp (Unanchored Event) This quality issue corresponds to the scenario where timestamps are imprecise, and a too coarse level of abstraction is used for the timestamps of (some of the) events. | |||||
Imprecise data: Resource This quality issue refers to the scenario in which, for the resource attribute of an event, more specific information is known about the resource(s) that performed the activity, but coarser resource information has been recorded. | |||||
Irrelevant data: Case This quality issue corresponds to the scenario where certain cases in an event log are deemed to be irrelevant for a particular context of analysis. | |||||
Irrelevant data: Event (Form-based Event Capture, Collateral Events) In some applications, certain logged events may, as recorded, be irrelevant for analysis. | |||||
Volume, granularity, complexity |
Appendix A.3. Event Log Preprocessing Techniques
- Trace clustering (e.g., Trace clustering plug-in in ProM, Minimum Spanning Tree clustering, Statistical inference-based clustering, K-means trace clustering),
- Repair log techniques (e.g., Heuristic log repair plug-in in ProM, Repair log plug-in in ProM),
- Trace/event filtering (e.g., Infrequent behavior filter, Entropy-based activity filtering, branch and bound algorithm),
- Event abstraction (e.g., Semantic abstraction),
- Artificial Intelligence, Deep Learning, Machine Learning algorithms (e.g., Bayesian networks, Branch and bound algorithm, LSTM Artificial Neural Network, Decision Tree Algorithm CART),
- Alignment-based techniques (e.g., Cost-based alignment, Alignment-based conformance checking, Trace alignment, TraceMatching plug-in in ProM),
- Embedded preprocessing (Preprocessing techniques are embedded in a process discovery algorithm such as Inductive miner, Split miner, ILP miner, or in an Interactive process discovery approach).
Preprocessing technique | Not Important | Slightly Important | Moderately Important | Important | Very Important |
---|---|---|---|---|---|
Trace clustering | |||||
Repair log techniques | |||||
Trace/Event filtering | |||||
Event abstraction | |||||
Artificial Intelligence, Deep Learning, Machine Learning algorithms | |||||
Alignment-based techniques | |||||
Embedded preprocessing |
Appendix A.4. The Selection of Preprocessing Techniques
- Trace clustering
- Repair log techniques
- Trace/Event filtering
- Event abstraction
- Artificial Intelligence, Deep Learning, Machine Learning algorithms
- Alignment-based techniques
- Embedded preprocessing
- Other (please specify)
References
- Van der Aalst, W.; Carmona, J. Process Mining Handbook; van der Aalst, W.M.P., Carmona, J., Eds.; Lecture Notes in Business Information Processing; Springer International Publishing: Cham, Germany, 2022; ISBN 978-3-031-08847-6. [Google Scholar]
- Van Der Aalst, W.; Adriansyah, A.; Alves De Medeiros, A.K.; Arcieri, F.; Baier, T.; Blickle, T.; Chandra Bose, J.; Van Den Brand, P.; Brandtjen, R.; Buijs, J.; et al. Process Mining Manifesto; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
- Bose, R.P.J.C.; Mans, R.S.; Van Der Aalst, W.M.P. Wanna Improve Process Mining Results? In Proceedings of the 2013 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2013, Singapore, 16–19 April 2013; pp. 127–134. [Google Scholar]
- Suriadi, S.; Andrews, R.; ter Hofstede, A.H.M.; Wynn, M.T. Event Log Imper fection Patterns for Process Mining: Towards a Systematic Approach to Cleaning Event Logs. Inf. Syst. 2017, 64, 132–150. [Google Scholar] [CrossRef]
- Andrews, R.; Suriadi, S.; Ouyang, C.; Poppe, E. Towards Event Log Querying for Data Quality: Let’s Start with Detecting Log Imperfections. In Proceedings of the Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics),Valletta, Malta, 22–26 October 2018; Springer: Cham, Germany, 2018; Volume 11229 LNCS, pp. 116–134. [Google Scholar] [CrossRef] [Green Version]
- Andrews, R.; Emamjome, F.; Ter Hofstede, A.H.M.; Reijers, H.A. An Expert Lens on Data Quality in Process Mining; IEEE: Piscataway, NJ, USA, 2020. [Google Scholar]
- Fischer, D.A.; Goel, K.; Andrews, R.; van Dun, C.G.J.; Wynn, M.T.; Röglinger, M. Towards Interactive Event Log Forensics: Detecting and Quantifying Timestamp Imperfections. Inf. Syst. 2022, 109, 102039. [Google Scholar] [CrossRef]
- Verhulst, R. Evaluating Quality of Event Data within Event Logs: An Extensible Framework; Eindhoven University of Technology: Eindhoven, The Netherlands, 2016. [Google Scholar]
- Vugs, L.; van Asseldonk, M.; van Son, N. Lumigi: Shining Light on Your Process Data. In Proceedings of the 3rd International Conference on Process Mining (ICPM 2021), Eindhoven, The Netherlands, 31 October–4 November 2021. [Google Scholar]
- Kherbouche, M.O.; Laga, N.; Masse, P.-A. Towards a Better Assessment of Event Logs Quality. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016. [Google Scholar]
- Khannat, A.; Sbai, H.; Kjiri, L. Event Logs Pre-Processing for Configurable Process Discovery: Ontology-Based Approach. In Proceedings of the Colloquium in Information Science and Technology, CIST, Agadir—Essaouira, Morocco, 5 June 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020; Volume 2020, pp. 139–144. [Google Scholar]
- Marin-Castro, H.M.; Tello-Leal, E. Event Log Preprocessing for Process Mining: A Review. Appl. Sci. 2021, 11, 10556. [Google Scholar] [CrossRef]
- Levy, D. Production Analysis with Process Mining Technology. Dataset 2014. [Google Scholar]
- Rogge-Solti, A.; Mans, R.S.; van der Aalst, W.M.P.; Weske, M. Improving Documentation by Repairing Event Logs. Lect. Notes Bus. Inf. Process 2013, 165 LNBIP, 129–144. [Google Scholar] [CrossRef] [Green Version]
- Rogge-Solti, A.; Mans, R.S.; Van Der Aalst, W.M.P.; Weske, M. Repairing Event Logs Using Timed Process Models. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; SringerLink: Manhatn, NY, USA, 2013; Volume 8186 LNCS, pp. 705–708. [Google Scholar] [CrossRef]
- Shahzadi, S.; Fang, X.; Shahzad, U.; Ahmad, I.; Benedict, T. Repairing Event Logs to Enhance the Performance of a Process Mining Model. Math. Probl. Eng. 2022, 2022, 4741232. [Google Scholar] [CrossRef]
- Lu, Y.; Chen, Q.; Poon, S.K. A Deep Learning Approach for Repairing Missing Activity Labels in Event Logs for Process Mining. Information 2022, 13, 234. [Google Scholar] [CrossRef]
- Sim, S.; Bae, H.; Choi, Y. Likelihood-Based Multiple Imputation by Event Chain Methodology for Repair of Imperfect Event Logs with Missing Data. In Proceedings of the 1st International Conference on Process Mining, ICPM, Aachen, Germany, 24–26 June 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 9–16. [Google Scholar]
- Liu, Y.; Yang, L.; Ghasemkhani, A.; Livani, H.; Centeno, V.A.; Chen, P.-Y.; Zhang, J. Robust Event Classification Using Imperfect Real-World PMU Data. IEEE Internet Things J. 2023, 10, 7429–7438. [Google Scholar] [CrossRef]
- Horita, H.; Kurihashi, Y.; Miyamori, N. Extraction of Missing Tendency Using Decision Tree Learning in Business Process Event Log. Data 2020, 5, 82. [Google Scholar] [CrossRef]
- Ramos-Gutiérrez, B.; Varela-Vaca, Á.J.; Ortega, F.J.; Gómez-López, M.T.; Wynn, M.T. A NLP-Oriented Methodology to Enhance Event Log Quality; Augusto, A., Gill, A., Nurcan, S., Reinhartz-Berger, I., Schmidt, R., Zdravkovic, J., Eds.; Online Conference; Springer: Berlin/Heidelberg, Germany, 2021; Volume 421. [Google Scholar]
- Chen, Q.; Lu, Y.; Tam, C.S.; Poon, S.K. A Multi-View Framework to Detect Redundant Activity Labels for More Representative Event Logs in Process Mining. Future Internet 2022, 14, 181. [Google Scholar] [CrossRef]
- Liu, J.; Xu, J.; Zhang, R.; Reiff-Marganiec, S. A Repairing Missing Activities Approach with Succession Relation for Event Logs. Knowl Inf Syst 2021, 63, 477–495. [Google Scholar] [CrossRef]
- Ceravolo, P.; Damiani, E.; Torabi, M.; Barbon, S. Toward a New Generation of Log Pre-Processing Methods for Process Mining. In Proceedings of the 15th International Conference on Business Process Management, BPM 2017, Barcelona, Spain, 3 August 2017; pp. 55–70. [Google Scholar]
- Sadeghianasl, S. The Quality Guardian: Improving Activity Label Quality in Event Logs Through Gamification. In Proceedings of the 2022 Best Dissertation Award, Doctoral Consortium, and Demonstration and Resources Track at BPM, BPM-D 2022, Münster, Germany, 13–15 September; Janiesch, C., Francescomarino, C.D., Grisold, T., Kumar, A., Mendling, J., Pentland, B., Reijers, H., Winter, R., Weske, M., Eds.; CEUR-WS: Leusden, The Netherlands, 2022; Volume 3216, pp. 1–5. [Google Scholar]
- Lu, X.; Fahland, D.; Van Der Aalst, W.M.P. Interactively Exploring Logs and Mining Models with Clustering, Filtering, and Relabeling. In Proceedings of the CEUR Workshop Proceedings, Rio de Janeiro, Brazil, 21 September 2016; CEUR-WS: Leusden, The Netherlands, 2016; Volume 1789, pp. 44–49. [Google Scholar]
- Nguyen, P.; Slominski, A.; Muthusamy, V.; Ishakian, V.; Nahrstedt, K. Process Trace Clustering: A Heterogeneous Information Network Approach. In Proceedings of the 16th SIAM International Conference on Data Mining 2016, SDM 2016, Miami, Floirda, USA, 5-7 May; Venkatasubramanian, S.C., Meira, W., Eds.; Society for Industrial and Applied Mathematics Publications: Philadelphia, PA, USA, 2016; pp. 279–287. [Google Scholar]
- Boltenhagen, M.; Chatain, T.; Carmona, J. Generalized Alignment-Based Trace Clustering of Process Behavior. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; SpringerLink: Aachen, Germany, 2019; Volume 11522, pp. 237–257. [Google Scholar]
- Huang, R.; Wang, J.; Song, S.; Lin, X.; Zhu, X.; Pei, J. Efficiently Cleaning Structured Event Logs: A Graph Repair Approach. ACM Trans. Database Syst. 2023, 48, 1–44. [Google Scholar] [CrossRef]
- Wang, J.; Song, S.; Zhu, X.; Lin, X.; Sun, J. Efficient Recovery of Missing Events. IEEE Trans Knowl. Data Eng. 2016, 28, 2943–2957. [Google Scholar] [CrossRef] [Green Version]
- Conforti, R.; La Rosa, M.; Ter Hofstede, A.H.M. Filtering Out Infrequent Behavior from Business Process Event Logs. IEEE Trans. Knowl. Data Eng. 2017, 29, 300–314. [Google Scholar] [CrossRef] [Green Version]
- van Zelst, S.J.; Fani Sani, M.; Ostovar, A.; Conforti, R.; La Rosa, M. Filtering Spurious Events from Event Streams of Business Processes. In Lecture Notes in Computer Science; Including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; SpringerLink: Tallinn, Estonia, 2018; Volume 10816, pp. 35–52. [Google Scholar]
- Song, S.; Huang, R.; Cao, Y.; Wang, J. Cleaning Timestamps with Temporal Constraints. VLDB J. 2021, 30, 425–446. [Google Scholar] [CrossRef]
- Tax, N.; Sidorova, N.; van der Aalst, W.M.P. Discovering More Precise Process Models from Event Logs by Filtering out Chaotic Activities. J. Intell. Inf. Syst. 2019, 52, 107–139. [Google Scholar] [CrossRef] [Green Version]
- Fani Sani, M.; van Zelst, S.J.; van der Aalst, W.M.P. Repairing Outlier Behaviour in Event Logs. Lect. Notes Bus. Inf. Process 2018, 320, 115–131. [Google Scholar] [CrossRef]
- Sani, M.F.; van Zelst, S.J.; van der Aalst, W.M.P. Improving Process Discovery Results by Filtering Outliers Using Conditional Behavioural Probabilities. Lect. Notes Bus. Inf. Process 2018, 308, 216–229. [Google Scholar] [CrossRef]
- Song, W.; Xia, X.; Jacobsen, H.A.; Zhang, P.; Hu, H. Heuristic Recovery of Missing Events in Process Logs. In Proceedings of the IEEE International Conference on Web Services, ICWS 2015, New York, NY, USA, 27 June–2 July 2015; Zhu, H., Miller, J.A., Eds.; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2015; pp. 105–112. [Google Scholar]
- Kong, L.; Li, C.; Ge, J.; Li, Z.; Zhang, F.; Luo, B. An Efficient Heuristic Method for Repairing Event Logs Independent of Process Models. In Proceedings of the 4th International Conference on Internet of Things, Big Data and Security, IoTBDS 2019, Heraklion, Greece, 2-4 May 2019; Ramachandran, M., Walters, R., Wills, G., Eds.; SciTePress: Setúbal, Portugal, 2019; pp. 83–93. [Google Scholar]
- Lu, X.; Fahland, D.; van den Biggelaar, F.J.H.M.; van der Aalst, W.M.P. Handling Duplicated Tasks in Process Discovery by Refining Event Labels. In Lecture Notes in Computer Science; Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; SringerLink: Manhatan, NY, USA, 2016; Volume 9850 LNCS, pp. 90–107. [Google Scholar] [CrossRef] [Green Version]
- Dixit, P.M.; Suriadi, S.; Andrews, R.; Wynn, M.T.; ter Hofstede, A.H.M.; Buijs, J.C.A.M.; van der Aalst, W.M.P. Detection and Interactive Repair of Event Ordering Imperfection in Process Logs. In Lecture Notes in Computer Science; Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; SrpingerLink: Tallinn, Estonia, 2018; Volume 10816, pp. 274–290. [Google Scholar]
- Richetti, P.H.P.; Baião, F.A.; Santoro, F.M. Declarative Process Mining: Reducing Discovered Models Complexity by Pre-Processing Event Logs. In Lecture Notes in Computer Science; Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics; SringerLink: Manhatan, NY, USA, 2014; Volume 8659 LNCS, pp. 400–407. [Google Scholar] [CrossRef]
- Ekici, B.; Tarhan, A.; Ozsoy, A. Data Cleaning for Process Mining with Smart Contract. In Proceedings of the 4th International Conference on Computer Science and Engineering, UBMK 2019, Samsun, Turkey, 11–15 September 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2019; pp. 324–329. [Google Scholar]
- Groves, R.M.; Fowler, F.J.; Couper, M.P.; Lepkowski, J.M.; Singer, E.; Tourangeau, R. Survey Methodology; John Wiley and Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
- Etikan, I. Comparison of Convenience Sampling and Purposive Sampling. Am. J. Theor. Appl. Stat. 2016, 5, 1–6. [Google Scholar] [CrossRef] [Green Version]
- Campbell, S.; Greenwood, M.; Prior, S.; Shearer, T.; Walkem, K.; Young, S.; Bywaters, D.; Walker, K. Purposive Sampling: Complex or Simple? Research Case Examples. J. Res. Nurs. 2020, 25, 652–661. [Google Scholar] [CrossRef]
- Palinkas, L.A.; Horwitz, S.M.; Green, C.A.; Wisdom, J.P.; Duan, N.; Hoagwood, K. Purposeful Sampling for Qualitative Data Collection and Analysis in Mixed Method Implementation Research. Adm. Policy Ment. Health Ment. Health Serv. Res. 2015, 42, 533–544. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Martilla, J.; James, J. Importance-Performance Analysis. J. Mark. 1977, 41, 77–79. [Google Scholar] [CrossRef]
- Pallant, J. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using IBM SPSS, 7th ed.; Routledge: London, UK, 2011. [Google Scholar]
Data quality issues/imperfection patterns (rows) vs. event log entities (columns) | Case | Event | Relat. | Case/Event Attr. | Position/Timestamp | Activity Name | Resource |
---|---|---|---|---|---|---|---|
Missing data | I1 | I2, IP1 | I3, IP2 | I4, I9 | I5, I7 | I6 | I8 |
Incorrect data | I10 | I11 | I12, IP3 | I13, I18 | I14, I16, IP6, IP7, IP8 | I15, IP4, IP5 | I17, IP4 |
Imprecise data | / | / | I19 | I20, I25, IP9 | I21, I23, IP8 | I22, IP10 | I24 |
Irrelevant data | I26 | I27, IP6, IP11 | / | / | / | / | / |
Dimension | Item | Source |
---|---|---|
Data quality issues | Missing data: Case | [3] |
Data quality issues | Missing data: Event (scattered event) | [3,4] |
Data quality issues | Missing data: Relationship (elusive case) | [3,4] |
Data quality issues | Missing data: Activity name | [3] |
Data quality issues | Missing data: Case and/or event attribute | [3] |
Data quality issues | Missing data: Timestamp | [3] |
Data quality issues | Missing data: Resource | [3] |
Data quality issues | Incorrect data: Case | [3] |
Data quality issues | Incorrect data: Event | [3] |
Data quality issues | Incorrect data: Relationship (scattered case) | [3,4] |
Data quality issues | Incorrect data: Activity name (polluted/distorted label) | [3,4] |
Data quality issues | Incorrect data: Case and/or event attribute | [3] |
Data quality issues | Incorrect data: Timestamp (form-based event capture, inadvertent time travel, unanchored event) | [3,4] |
Data quality issues | Incorrect data: Resource (polluted label) | [3,4] |
Data quality issues | Imprecise data: Relationship | [3] |
Data quality issues | Imprecise data: Activity name (homonymous label) | [3,4] |
Data quality issues | Imprecise data: Case and/or event attribute (synonymous label) | [3,4] |
Data quality issues | Imprecise data: Timestamp (unanchored event) | [3,4] |
Data quality issues | Imprecise data: Resource | [3] |
Data quality issues | Irrelevant data: Case | [3] |
Data quality issues | Irrelevant data: Event (form-based event capture, collateral events) | [3,4] |
Data quality issues | Volume, granularity, complexity | [3] |
Preprocessing techniques | Trace clustering | [12,23,24,25,26,27,28] |
Preprocessing techniques | Repair log techniques | [12,37,38] |
Preprocessing techniques | Trace/event filtering | [12,26,29,30,31,32,33,34,35,36] |
Preprocessing techniques | Event abstraction | [12,41] |
Preprocessing techniques | AI, ML, DL | [12,14,15,16,17,18,19,20,21,22] |
Preprocessing techniques | Alignment-based techniques | [12,14,16,40] |
Preprocessing techniques | Embedded preprocessing | |
Region | % of Respondents |
---|---|
Europe | 71% |
Asia | 15.9% |
South America | 7% |
Australia | 3% |
USA | 3% |
Africa | 0.5% |
Data Preprocessing Experience | % of Respondents |
---|---|
Fair | 10.2% |
Good | 29.93% |
Very good | 36.73% |
Excellent | 23.13% |
Process Mining Experience | % of Respondents |
---|---|
1–5 years | 57.4% |
6–10 years | 23.2% |
More than 10 years | 13.8% |
Less than one year | 5.4% |
Software Tools | % of Respondents |
---|---|
Celonis | 28% |
ProM | 20% |
Fluxicon Disco | 11% |
PM4Py | 10% |
Apromore | 5% |
Noreja Process Intelligence | 3% |
SAP Signavio Process Intelligence | 3% |
Befha Lab | 2% |
R | 2% |
RapidProM | 1% |
Event Log Data Quality Issues (perceived importance in %) | Not Important | Slightly Important | Moderately Important | Important | Very Important |
---|---|---|---|---|---|
Missing data: Case | 4.0 | 15.8 | 21.8 | 28.7 | 29.7 |
Missing data: Event (scattered event) | 0 | 5.0 | 20.3 | 47.5 | 27.2 |
Missing data: Relationship (elusive case) | 1.0 | 6.9 | 15.3 | 36.6 | 40.1 |
Missing data: Activity name | 5.4 | 10.9 | 31.7 | 22.8 | 29.2 |
Missing data: Case and/or event attribute | 0 | 15.8 | 45.0 | 25.7 | 13.4 |
Missing data: Timestamp | 0.5 | 3.0 | 4.5 | 18.8 | 73.3 |
Missing data: Resource | 6.9 | 25.7 | 30.2 | 28.7 | 8.4 |
Incorrect data: Case | 1.0 | 15.8 | 18.3 | 41.1 | 23.8 |
Incorrect data: Event | 0 | 5.0 | 19.8 | 44.1 | 31.2 |
Incorrect data: Relationship (scattered case) | 1.0 | 8.4 | 21.8 | 37.6 | 31.2 |
Incorrect data: Activity name (polluted label, distorted label) | 2.0 | 19.3 | 35.1 | 22.3 | 21.3 |
Incorrect data: Case and/or event attribute | 1.0 | 11.4 | 31.7 | 41.6 | 14.4 |
Incorrect data: Timestamp (form-based event capture, unanchored event, inadvertent time travel) | 0 | 8.9 | 11.9 | 27.7 | 51.5 |
Incorrect data: Resource (polluted label) | 5.9 | 19.8 | 38.1 | 28.2 | 7.9 |
Imprecise data: Relationship | 4.0 | 13.4 | 29.7 | 37.6 | 15.3 |
Imprecise data: Activity name (homonymous label) | 3.0 | 11.9 | 40.1 | 31.2 | 13.9 |
Imprecise data: Case and/or event attribute (synonymous label) | 2.0 | 18.3 | 33.2 | 37.1 | 9.4 |
Imprecise data: Timestamp (unanchored event) | 1.0 | 5.0 | 18.3 | 31.2 | 44.6 |
Imprecise data: Resource | 7.9 | 29.2 | 34.2 | 22.8 | 5.9 |
Irrelevant data: Case | 15.3 | 25.2 | 29.7 | 21.8 | 7.9 |
Irrelevant data: Event (form-based event capture, collateral events) | 16.3 | 30.2 | 28.2 | 17.8 | 7.4 |
Volume, granularity, complexity | 1.0 | 6.9 | 23.8 | 39.1 | 29.2 |
Event Log Data Quality Issues (frequency of encounters in %) | Never | Not Often | Sometimes | Often | Very Often |
---|---|---|---|---|---|
Missing data: Case | 5.0 | 31.7 | 34.2 | 19.8 | 9.4 |
Missing data: Event (scattered event) | 1.0 | 22.3 | 29.7 | 38.6 | 8.4 |
Missing data: Relationship (elusive case) | 7.4 | 26.7 | 36.1 | 24.3 | 5.4 |
Missing data: Activity name | 7.4 | 35.6 | 29.2 | 21.8 | 5.9 |
Missing data: Case and/or event attribute | 3.0 | 17.8 | 41.6 | 29.7 | 7.9 |
Missing data: Timestamp | 11.9 | 38.1 | 26.2 | 13.4 | 10.4 |
Missing data: Resource | 3.0 | 23.8 | 33.7 | 27.7 | 11.9 |
Incorrect data: Case | 11.9 | 40.6 | 25.7 | 17.8 | 4.0 |
Incorrect data: Event | 4.5 | 38.6 | 32.7 | 21.3 | 3.0 |
Incorrect data: Relationship (scattered case) | 6.4 | 40.6 | 34.7 | 12.4 | 5.9 |
Incorrect data: Activity name (polluted label, distorted label) | 5.9 | 33.2 | 37.6 | 17.8 | 5.4 |
Incorrect data: Case and/or event attribute | 3.5 | 33.2 | 42.1 | 16.8 | 4.5 |
Incorrect data: Timestamp (form-based event capture, unanchored event, inadvertent time travel) | 6.9 | 31.7 | 21.3 | 27.2 | 12.9 |
Incorrect data: Resource (polluted label) | 8.9 | 40.6 | 34.2 | 13.4 | 3.0 |
Imprecise data: Relationship | 5.9 | 39.6 | 35.6 | 17.8 | 1.0 |
Imprecise data: Activity name (homonymous label) | 5.9 | 29.2 | 38.6 | 21.8 | 4.5 |
Imprecise data: Case and/or event attribute (synonymous label) | 5.4 | 31.2 | 37.1 | 19.8 | 6.4 |
Imprecise data: Timestamp (unanchored event) | 3.5 | 32.2 | 35.1 | 15.3 | 13.9 |
Imprecise data: Resource | 10.4 | 34.2 | 34.7 | 16.8 | 4.0 |
Irrelevant data: Case | 5.9 | 36.6 | 25.7 | 22.8 | 8.9 |
Irrelevant data: Event (form-based event capture, collateral events) | 5.0 | 32.2 | 32.2 | 22.3 | 8.4 |
Volume, granularity, complexity | 5.4 | 13.9 | 29.2 | 37.1 | 14.4 |
Preprocessing Techniques (perceived importance in %) | Not Important | Slightly Important | Moderately Important | Important | Very Important |
---|---|---|---|---|---|
Trace clustering | 5.4 | 8.9 | 35.1 | 36.6 | 13.9 |
Repair log techniques | 8.4 | 22.8 | 24.3 | 32.7 | 11.9 |
Trace/event filtering | 3 | 3.5 | 11.9 | 39.6 | 42.1 |
Event abstraction | 5 | 2.5 | 37.6 | 28.2 | 26.7 |
AI, ML, DL | 10.4 | 20.3 | 32.7 | 26.2 | 10.4 |
Alignment-based techniques | 10.4 | 21.8 | 38.1 | 22.3 | 7.4 |
Embedded preprocessing | 8.9 | 16.3 | 22.8 | 33.7 | 18.3 |
Preprocessing Techniques (frequency of encounters in %) | Never | Not Often | Sometimes | Often | Very Often |
---|---|---|---|---|---|
Trace clustering | 14.9 | 20.3 | 33.2 | 20.8 | 10.9 |
Repair log techniques | 20.3 | 22.8 | 31.2 | 21.8 | 4 |
Trace/event filtering | 4.5 | 6.4 | 17.3 | 32.7 | 39.1 |
Event abstraction | 8.4 | 16.3 | 30.2 | 24.8 | 20.3 |
AI, ML, DL | 24.8 | 20.8 | 23.8 | 19.8 | 10.9 |
Alignment-based techniques | 26.7 | 23.8 | 23.8 | 20.3 | 5.4 |
Embedded preprocessing | 19.8 | 23.3 | 18.8 | 21.8 | 16.3 |
Event Log Data Quality Issues | Importance: Mean | Importance: Std. Dev. | Frequency: Mean | Frequency: Std. Dev. |
---|---|---|---|---|
LDQ1 Missing data: Case | 3.64 | 1.177 | 2.97 | 1.046 |
LDQ2 Missing data: Event (scattered event) | 3.97 | 0.822 | 3.31 | 0.945 |
LDQ3 Missing data: Relationship (elusive case) | 4.08 | 0.959 | 2.94 | 1.013 |
LDQ4 Missing data: Activity name | 3.59 | 1.173 | 2.83 | 1.042 |
LDQ5 Missing data: Case and/or event attribute | 3.37 | 0.906 | 3.22 | 0.932 |
LDQ6 Missing data: Timestamp | 4.61 | 0.753 | 2.72 | 1.156 |
LDQ7 Missing data: Resource | 3.06 | 1.077 | 3.22 | 1.033 |
LDQ8 Incorrect data: Case | 3.71 | 1.031 | 2.61 | 1.036 |
LDQ9 Incorrect data: Event | 4.01 | 0.843 | 2.80 | 0.927 |
LDQ10 Incorrect data: Relationship (scattered case) | 3.90 | 0.974 | 2.71 | 0.972 |
LDQ11 Incorrect data: Activity name (polluted label, distorted label) | 3.42 | 1.086 | 2.84 | 0.971 |
LDQ12 Incorrect data: Case and/or event attribute | 3.57 | 0.907 | 2.86 | 0.895 |
LDQ13 Incorrect data: Timestamp (form-based event capture, unanchored event, inadvertent time travel) | 4.22 | 0.973 | 3.07 | 1.176 |
LDQ14 Incorrect data: Resource (polluted label) | 3.12 | 1.012 | 2.61 | 0.931 |
LDQ15 Imprecise data: Relationship | 3.47 | 1.033 | 2.68 | 0.869 |
LDQ16 Imprecise data: Activity name (homonymous label) | 3.41 | 0.969 | 2.90 | 0.959 |
LDQ17 Imprecise data: Case and/or event attribute (synonymous label) | 3.34 | 0.949 | 2.91 | 0.991 |
LDQ18 Imprecise data: Timestamp (unanchored event) | 4.13 | 0.950 | 3.04 | 1.083 |
LDQ19 Imprecise data: Resource | 2.90 | 1.034 | 2.70 | 0.999 |
LDQ20 Irrelevant data: Case | 2.82 | 1.172 | 2.92 | 1.090 |
LDQ21 Irrelevant data: Event (form-based event capture, collateral events) | 2.70 | 1.160 | 2.97 | 1.041 |
LDQ22 Volume, granularity, complexity | 3.89 | 0.942 | 3.41 | 1.067 |
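
The means and standard deviations above can be related back to the percentage distributions reported for each item. The following is a minimal sketch, not the authors' procedure: it assumes the five response options are coded 1 (Not Important) to 5 (Very Important), which the article's visible text does not state explicitly, and it uses the reported importance shares for "Missing data: Timestamp". The name `likert_mean_sd` is hypothetical.

```python
import math

# Assumed coding: 1 = Not Important ... 5 = Very Important.
# Percentages are the reported importance shares for "Missing data: Timestamp".
SHARES = {1: 0.5, 2: 3.0, 3: 4.5, 4: 18.8, 5: 73.3}


def likert_mean_sd(shares):
    """Return (mean, population SD) of a Likert item from percentage shares."""
    total = sum(shares.values())  # may deviate slightly from 100 due to rounding
    mean = sum(code * pct for code, pct in shares.items()) / total
    variance = sum(pct * (code - mean) ** 2 for code, pct in shares.items()) / total
    return mean, math.sqrt(variance)


mean, sd = likert_mean_sd(SHARES)
print(f"mean={mean:.2f}, sd={sd:.3f}")
```

Under these assumptions the result is close to the reported 4.61 and 0.753 for LDQ6; small deviations for other items are to be expected because the percentages are rounded.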
Preprocessing Techniques | Importance: Mean | Importance: Std. Deviation | Frequency: Mean | Frequency: Std. Deviation |
---|---|---|---|---|
Trace clustering | 3.45 | 1.017 | 2.93 | 1.201 |
Repair log techniques | 3.17 | 1.160 | 2.66 | 1.144 |
Trace/event filtering | 4.14 | 0.964 | 3.96 | 1.108 |
Event abstraction | 3.69 | 1.049 | 3.32 | 1.210 |
AI, ML, DL | 3.06 | 1.140 | 2.71 | 1.326 |
Alignment-based techniques | 2.95 | 1.075 | 2.54 | 1.234 |
Embedded preprocessing | 3.36 | 1.211 | 2.92 | 1.378 |
Chi-Square Tests | Value | df | Asymp. Sig. (2-Sided) |
---|---|---|---|
Pearson chi-square | 1025.284 a | 160 | 0.000 |
Likelihood ratio | 1052.269 | 160 | 0.000 |
No. of valid cases | 4242 |
Symmetric Measures | Measure | Value | Approx. Sig. |
---|---|---|---|
Nominal by nominal | Cramér’s V | 0.174 | 0.000 |
No. of valid cases | | 4242 | |
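
The reported Cramér's V can be cross-checked against the chi-square output using the standard definition V = sqrt(chi² / (n · min(r − 1, c − 1))). The sketch below is illustrative only. The contingency table dimensions (21 × 9) are an assumption inferred from df = 160 = 20 × 8; they are not stated in the test output, and the cross-tabulation shown in the article lists a different number of rows and columns.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a Pearson chi-square statistic and table dimensions."""
    k = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi2 / (n * k))


# Reported values: Pearson chi-square = 1025.284, valid cases = 4242, df = 160.
# ASSUMPTION: a 21 x 9 table, the factorization of df = 160 that matches V.
N_ROWS, N_COLS = 21, 9
assert (N_ROWS - 1) * (N_COLS - 1) == 160

v = cramers_v(1025.284, 4242, N_ROWS, N_COLS)
print(round(v, 3))
```

With these inputs the computed value rounds to the reported 0.174, which is conventionally read as a weak-to-moderate association for a table of this size.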
Event Log Data Quality Issues (% of respondents) | Alignment-Based Techniques | Embedded Preprocessing | Event Abstraction | AI, ML, DL | Repair Log Techniques | Trace Clustering | Trace/Event Filtering | SQL |
---|---|---|---|---|---|---|---|---|
Missing data: Case | 11.9 | 1.0 | 5.4 | 14.9 | 21.8 | 25.7 | 19.3 | 0 |
Missing data: Event (scattered event) | 1.5 | 6.9 | 17.3 | 16.8 | 20.8 | 8.4 | 28.2 | 0 |
Missing data: Relationship (elusive case) | 3.2 | 5.0 | 21.2 | 13.8 | 12.1 | 26.3 | 18.4 | 0 |
Missing data: Activity name | 6.4 | 3.0 | 25.2 | 21.8 | 0.5 | 21.3 | 10.9 | 0 |
Missing data: Case and/or event attribute | 9.4 | 5.0 | 4.0 | 33.7 | 13.4 | 5.0 | 29.7 | 0 |
Missing data: Timestamp | 6.9 | 6.9 | 12.4 | 15.8 | 33.7 | 4.5 | 18.8 | 0 |
Missing data: Resource | 5.0 | 6.4 | 5.4 | 25.7 | 3.5 | 9.9 | 21.8 | 3.5 |
Incorrect data: Case | 6.4 | 6.9 | 2.0 | 4.5 | 28.2 | 19.3 | 27.7 | 5.0 |
Incorrect data: Event | 4.0 | 5.9 | 9.9 | 10.4 | 27.7 | 8.9 | 29.2 | 4.0 |
Incorrect data: Relationship (scattered case) | 5.9 | 9.9 | 6.9 | 11.4 | 17.8 | 11.9 | 32.2 | 4 |
Incorrect data: Activity name (polluted label, distorted label) | 7.4 | 5.9 | 15.8 | 16.3 | 25.7 | 5.0 | 19.8 | 4 |
Incorrect data: Case and/or event attribute | 4.0 | 9.4 | 13.4 | 19.3 | 22.3 | 5.9 | 24.8 | 0 |
Incorrect data: Timestamp (form-based event capture, inadvertent time travel, unanchored event) | 11.9 | 5.9 | 6.9 | 10.9 | 30.7 | 4.0 | 28.7 | 0 |
Incorrect data: Resource (polluted label) | 2.0 | 7.9 | 17.8 | 10.9 | 19.8 | 10.9 | 24.8 | 5.9 |
Imprecise data: Relationship | 6.4 | 12.4 | 11.4 | 12.4 | 18.8 | 8.9 | 24.3 | 5.4 |
Imprecise data: Activity name (homonymous label) | 5.0 | 11.4 | 30.7 | 8.9 | 16.3 | 5.0 | 17.8 | 5 |
Imprecise data: Case and/or event attribute (synonymous label) | 5.0 | 12.4 | 7.9 | 10.9 | 15.3 | 18.8 | 24.3 | 5.4 |
Imprecise data: Timestamp (unanchored event) | 2.5 | 9.4 | 6.9 | 14.9 | 33.2 | 9.9 | 17.8 | 4.5 |
Imprecise data: Resource | 4.5 | 10.4 | 9.9 | 22.3 | 7.9 | 11.4 | 26.7 | 5.4 |
Irrelevant data: Case | 2.0 | 10.4 | 6.9 | 7.4 | 6.4 | 5.0 | 58.4 | 3.5 |
Irrelevant data: Event (form-based event capture, collateral events) | 0 | 4.5 | 9.9 | 9.4 | 5.4 | 5.0 | 58.4 | 3.5 |
Volume, granularity, complexity | 4.0 | 6.9 | 15.3 | 16.8 | 1.0 | 10.4 | 42.1 | 3.5 |
Data Quality Issues | Technique Category Rank 1 | Technique Category Rank 2 | Technique Category Rank 3 |
---|---|---|---|
Missing data: Case | Trace clustering | Repair log techniques | Trace/event filtering |
Missing data: Event (scattered event) | Trace/event filtering | Repair log techniques | Event abstraction |
Missing data: Relationship | Trace clustering | Event abstraction | Trace/event filtering |
Incorrect data: Timestamp | Repair log techniques | Trace/event filtering | / |
Imprecise data: Timestamp | Repair log techniques | Trace/event filtering | / |
Volume, granularity, complexity | Trace/event filtering | AI, ML, DL | Event abstraction |
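
The ranks above are consistent with ordering the technique categories by the share of respondents who selected them for each issue in the cross-tabulation. A minimal sketch of that ordering follows, using three rows copied from the cross-tabulation; the helper name `top_k` is hypothetical. The rule that leaves the third rank empty ("/") for the timestamp issues is not stated in the visible text, so no cutoff is modelled here.

```python
# Column order as in the cross-tabulation of issues and technique categories.
TECHNIQUES = [
    "Alignment-based techniques", "Embedded preprocessing", "Event abstraction",
    "AI, ML, DL", "Repair log techniques", "Trace clustering",
    "Trace/event filtering", "SQL",
]

# Shares (% of respondents) copied from three rows of the cross-tabulation.
SHARES = {
    "Missing data: Case": [11.9, 1.0, 5.4, 14.9, 21.8, 25.7, 19.3, 0.0],
    "Missing data: Event (scattered event)": [1.5, 6.9, 17.3, 16.8, 20.8, 8.4, 28.2, 0.0],
    "Volume, granularity, complexity": [4.0, 6.9, 15.3, 16.8, 1.0, 10.4, 42.1, 3.5],
}


def top_k(row, k=3):
    """Return the k technique categories with the highest share, best first."""
    ranked = sorted(zip(TECHNIQUES, row), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


for issue, row in SHARES.items():
    print(issue, "->", top_k(row))
```

For these three issues the ordering reproduces the ranks listed in the table.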
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Dakic, D.; Stefanovic, D.; Vuckovic, T.; Zizakov, M.; Stevanov, B. Event Log Data Quality Issues and Solutions. Mathematics 2023, 11, 2858. https://doi.org/10.3390/math11132858