Article
Peer-Review Record

Automated Trace Clustering Pipeline Synthesis in Process Mining

Information 2024, 15(4), 241; https://doi.org/10.3390/info15040241
by Iuliana Malina Grigore 1, Gabriel Marques Tavares 2,3, Matheus Camilo da Silva 1, Paolo Ceravolo 4 and Sylvio Barbon Junior 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 March 2024 / Revised: 5 April 2024 / Accepted: 18 April 2024 / Published: 20 April 2024
(This article belongs to the Special Issue Advances in Machine Learning and Intelligent Information Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a solution for trace clustering pipeline synthesis that optimizes a multi-objective function matching clustering and process quality metrics.

The abstract is adapted to the paper content.

Lines 37-38: I don’t understand this sentence. Rewriting it would improve the paper.

Lines 59-70: Please avoid the term “we” in the paper.

The introduction is good.

Section 2 needs more detail and more references, and its link to the subject of the paper should be made clearer.

Section 3 seems good.

In Sections 4 and 5, it would help to add sentences at the beginning and end of each section to link it with the previous and next sections. Section 5 ends with a formula that is given no further explanation.

Section 6 seems good, but the subsection titles could be revised to improve the paper.

The discussion is too short. I think each idea could be explained in more detail.

The conclusion should be improved to raise the quality of the paper.

Comments on the Quality of English Language

Good quality of language.

Author Response

Reply uploaded.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript introduces a new approach to automating the development of trace clustering pipelines in process mining by means of genetic programming techniques. The suggested approach optimizes pipeline components such as encoding techniques, preprocessing methods, clustering algorithms, and hyperparameters to find the best pipeline for a given event log. One of its main advantages is the use of a multi-objective function that includes both clustering quality metrics and process mining metrics to guide the evolutionary optimization. The approach is tested on four real-world event log datasets.

This manuscript focuses on one of the most significant issues that process mining faces: the complexity of manually designing good trace clustering pipelines. Applying AutoML approaches to this task is novel and of great interest to the process mining community. The genetic programming approach is logical and well-founded given such a large, hierarchical search space. Experimental results on multiple real datasets show that the method is promising for enhancing downstream analysis by identifying a meaningful set of sub-logs that reduces log complexity while preserving reasonable model quality.

The manuscript is lucidly written and well-organized in general. The background, methodology, results and discussion sections are reasonable. The manuscript seems to be scientifically sound in general, with an experimental design that is suitable for the testing of the proposed AutoML approach for trace clustering pipeline synthesis. A few potential limitations to note:

- The limited number of event logs (4) used in the assessment may not represent all PM use cases. Testing on a larger set of logs would strengthen the findings.

- The multi-objective function could be elaborated. What are the specific clustering and PM metrics employed, and how are they calculated and aggregated?

- It would help to provide more data on the genetic algorithm hyperparameters (e.g. population size, selection pressure) and on how sensitive the outcomes are to these settings.

- The current quality metrics seem acceptable, but it would be useful to investigate other quality measures from both the clustering and PM perspectives.

- A more detailed presentation of the hyperparameter settings for the genetic algorithm (population size, crossover/mutation rates, etc.) and of the influence of these choices on the results would improve the paper.

- Limitations of the present experiments are mentioned (small log collection, impossibility of direct comparison with prior AutoML methods). Expanding on these limitations and describing them in detail would be helpful for future research.

To enhance the reproducibility of the main results, I suggest the following additional improvements:

- The exact formulation of the multi-objective scoring function could be stated more precisely. Both the silhouette coefficient and the sequence entropy metrics are defined, but the specific aggregation (weighted average?) used to combine the two is not stated.

- The genetic algorithm parametrization is not completely specified: population size, number of generations, crossover and mutation rates, and parent selection mechanism. This information would help reproducibility.

- The hardware infrastructure (for example, CPU, memory) used in the experiments is not provided. This might affect the time results that are used for method comparison.

- Source code and documentation links for the implementation would be a welcome addition. The manuscript states that code is available, but does not give a direct URL or repository.

Specific Comments:

- Section 5.4: Could you please specify which “two important metrics” are amalgamated in the multi-objective function? Although the silhouette coefficient and sequence entropy metrics are defined, it is not clearly stated that these two are the ones used.

- Figure 4: The label should indicate that the x-axis represents execution time. Also, verify whether the colors in the caption match the figure (i.e., “Random Search” appears violet, not green, to me).

- A few small grammatical errors must be corrected in the revision (e.g. instead of “Contrariwise” in Section 7, should it be “Contrarily”?).

The paper includes 41 references. Of these, fewer than twenty were published in the last four years (2020-2023). I do not think that the older references bias the research grounding. Apart from that, I see 7 self-citations. Do they really all have to be present? I gather that the authors have positioned their work within the current research landscape in process mining and AutoML. Nevertheless, seven is too many.

Author Response

Reply uploaded.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The article requires substantial changes.

The summary should be improved by including quantitative results of the metrics used to measure the proposed approach.

The organization of the document must be improved; it is tiring to read the article with a large amount of theoretical information without exemplifying the proposal. For example, the information presented in the related work section should probably be used in the discussion section to compare, argue, and discuss the advantages of the presented proposal.

Section 4 Trace Clustering can be included in the preliminary section as a subsection. Additionally, the text on lines 200 to 226 is not required in this section, as it is presented as a subsection of work related to trace clustering.

The Materials and Methods section should be improved, describing the proposed methodology in depth, with less literature review. For example, lines 231 to 235 state: "The proposed method employs an evolutionary algorithm to automatically generate and optimize clustering pipelines by evolving a population of potential solutions, as depicted in Figure 1. It combines the optimization power of evolutionary computation with the domain knowledge of PM, to generate effective trace clustering pipelines tailored to event log datasets..."; however, I did not find a figure, diagram, pseudocode, code, or configuration parameters of the algorithm used that would allow a clear understanding of the proposal.

How was encoding functionality added to the pipeline search space?

Can you describe the Multi-Objective (MO) function in depth?

What are the dimensions that the proposed method optimizes?

In the results section, new data sets are recommended in addition to the classic event records used in the experimentation.

Lines 449-461 mention a data mining analysis. Is this analysis missing from the document, or does it refer to the graph presented in Figure 4?

Comments on the Quality of English Language

A native speaker must review the English language and the structure of the manuscript.

Author Response

Reply uploaded.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The state-of-the-art includes several data sets (event logs) available for executing new experiments with techniques based on the process mining approach.
