1. Introduction
The efficient management of production and business processes has a high priority in every organization. Activities are composed into different workflows that are executed regularly to achieve the defined goals. Efficient workflow design and control is of particularly high importance for today's complex production processes.
Process mining is an umbrella term covering different activities related to workflow control. Process mining is used to discover, monitor, or improve processes and process models by extracting knowledge from the measured event logs. One of the key functionality fields is process discovery [
1], which is used to build up the workflow process model from the available event logs. This tool is especially useful when manual workflow schema construction is too expensive or impossible. In these cases, automatic workflow schema induction provides a feasible and cost-effective solution. The generated workflow model can later be used, among others, for process automation or conformance checking.
According to the classification of the dominating business process-modelling tools presented in Mili et al. [
2], the main approaches in process modelling are as follows:
Automaton-based methods like Petri nets [
3], IDEF [
4], EPC (event-driven process chain) [
5], and role activity diagrams [
6];
Data flow diagrams and value-stream mapping;
Workflow modelling languages, like WPDL [
7] and BPMN [
8];
Programming language-oriented methods, including unified modelling languages (UML).
The Petri net model plays a key role as one of the oldest and most abstract models having many higher-level extensions. Regarding practical expressive power, the Business Process Model and Notation (BPMN) [
9] is the dominating tool. These models usually support XML-based data interchange formats. One of the most powerful modelling languages is the YAWL language [
10]. This model is the result of a systematic analysis by the Workflow Patterns Initiative [
11] on the existing process modelling notations and workflow languages. In the practical applications, the workflow structure can be modelled with the following control elements:
Sequence: linear structure, where a task in a process can be executed after the completion of a preceding task;
Branching (XOR): only one of the several outgoing paths is enabled;
Fork (AND): the process path is split into two or more branches, which are activated concurrently;
Split (OR): one or more of the several outgoing paths are enabled;
Merging: synchronization of the several incoming threads.
Process mining tools can be used also for performance analysis [
12], where the process mining is used to automate finding or calculating the performance metrics. This solution provides faster, immediate feedback on the current process status. Another application field is conformance checking [
13]. The engine compares the running or terminated processes with the given workflow model to discover differences and similarities between the model and event log. Most of the approaches perform the analysis on the complete terminated trace, which does not enable the prompt interaction of the control module. To enhance this important feature, current works focus on the development of on-line conformance analysis [
14]. The fourth pillar of process mining is the model enhancement domain [
15]. Having an existing workflow model, the enhancement module can manage dynamic workflow behavior; it can update the model with new elements and dependencies.
In all components of process mining, the workflow model plays a central role. The manual construction of the model usually requires the cooperation of several experts and considerable time to finalize; thus, manual model construction requires a high-cost investment. Driven by the requirements of cost-effective workflow model construction, automatic workflow model discovery has become a highly investigated research domain.
The analysis of widely used process discovery methods in industry reveals that classical automata or pattern-matching-based approaches impose significant limitations on their applicability. The goal of this paper is to present a novel approach capable of alleviating these constraints. The proposed method advances the edge prediction approach utilized in graph neural networks by introducing a new multi-aspect embedding with a two-dimensional similarity matrix. Comparative tests clearly demonstrate the efficiency of the proposed method and its potential for broader applicability.
In the
Section 2 of the paper, we present a survey on the existing approaches in process discovery and graph link prediction domains. In
Section 3, we introduce a special equivalence relationship on the node set of the event log traces. Here, we propose a novel approach for equivalence mining and describe the schema induction process. The presented approach utilizes the discovered equivalence links. For the detection of the schema nodes, we apply a quasi-clique detection algorithm. In
Section 4, we present the test environment and analyze the test results. We included two benchmark methods for the link prediction task and three industry-standard methods for the schema induction task. As the test results show, the proposed method dominates all the benchmark methods involved in the tests.
2. Related Works
2.1. Process Discovery
The automatic workflow schema induction is a complex process; it requires the integration of several processing modules. One of the key components is the mining of frequent subsequences in the event log. Frequent path mining is also of interest in many other domains. For example, Laxman et al. developed an algorithm for sequence prediction over long categorical event streams [
16]. Karoly and Abonyi [
17] and Weiss [
18] propose methods to extract temporal patterns in industrial alarm management systems. Analyzing study paths for the prediction of student dropout [
19] and identifying efficient learning patterns in e-learning environments [
20,
21] are key concepts in educational data mining. In these examples, frequent patterns are selected based on their utility. In [
22], the authors argue that the cost perspective should also be considered when mining event sequences.
The drawbacks of the traditional frequent pattern mining solutions come to the surface when dealing with massive datasets [
23]. The algorithms have problems with runtime and accuracy [
24], and their output data are difficult to interpret and handle [
25]. In order to efficiently mine frequent patterns, a tree-based representation has proved to be more compact and, practically, more usable [
26]. For this purpose, Lin et al. proposed the FUSP-tree structure [
27], which gave rise to numerous variations of incremental tree-based pattern mining algorithms [
28,
29,
30,
31].
A more efficient approach is to apply incremental discovery, which derives event patterns recursively. Another representation model was proposed by Chen [
32], who introduced a directed acyclic graph representation, which allows for pattern growth along both ends of the detected patterns. This algorithm ensures a faster pattern discovery, too. The work of Patel and Patel [
33] combines the graph-based approach with the clustering of patterns to avoid the recursive reconstruction of intermediate trees. Singh et al. [
34] use a graph-based approach to extract frequent sequential web access patterns, while Dong et al. [
35] present a new weighted graph structure and a method to find variable-length sequential patterns.
The first proposals on process discovery were published in the late 1990s [
36]. The first widely used method is the alpha miner [
37]. The alpha-miner algorithm generates a Petri net schema in the following steps:
Discovering distinct events;
Discovering starting and final events;
Grouping the events and constructing a footprint matrix to show direct succession, parallel, and choice control structures;
Determining the schema nodes and connecting them.
In the literature, we can find many application use cases of the alpha miner, including modeling educational processes [
38], clinical workflow analysis [
39], and software engineering [
40].
The inductive miner [
41] is an improved, powerful process discovery algorithm. It applies a structured, divide-and-conquer approach. The method produces correct, valid process models in the form of block-structured models (process trees and Petri nets). The approach processes the trace sequence recursively; the method splits the large input log into smaller segments, generates models for the segments, and finally merges the local models into a global final model. Regarding recent research directions, the main focus areas relate to the adaptation of the method to large-scale problems using a simplified, approximated model [
42].
The heuristic miner generates the workflow schema based on probabilistic models using dependency frequency measures [
43]. One key benefit of this approach is that the model can handle small amounts of noise in the event logs.
A recent comparative analysis of the different process discovery algorithms was presented in [
44]. In the tests on healthcare datasets, the inductive miner achieved the best result. Based on the author’s experiences, the PM4Py [
45] framework is ideal for professionals with knowledge in the area, while Disco [
46] is the simplest and most intuitive tool for discovering processes.
2.2. Graph Neural Networks in Process Mining
As neural network technology has proven its efficiency in many application areas, there is active research on developing process discovery methods that apply neural networks. The main strength of the neural network technology over the automaton and pattern-matching approaches is that it can also process complex structures and noisy datasets.
To analyze interactions and dependencies within event-based process flows, researchers have explored various approaches, including both graph neural network (GNN) [
47] models and alternative machine learning techniques. GNNs have proven effective for link prediction in event sequence analysis, where they capture complex graph-based dependencies. For instance, Kipf and Welling’s work on semi-supervised GNNs [
48] laid the groundwork for using these networks to infer missing links, leveraging node features and graph structures. Their approach achieved notable accuracy in static graphs but faced challenges in adapting to dynamic, trace-based datasets, where graph structures emerge over time. Hamilton et al.’s GraphSAGE model introduced an efficient node-embedding technique via sampling [
49], which is particularly useful in event traces with sparse connections between events. Compared to Kipf and Welling’s approach, GraphSAGE is more adaptable to dynamic graphs due to its inductive nature. However, its reliance on sampling can lead to information loss in rare but significant transitions. Velickovic et al.’s graph attention networks (GATs) assign varying importance to edges, enhancing the interpretability of high-probability links [
50]. Unlike GraphSAGE, which emphasizes embedding efficiency, GATs provide edge-level attention, improving interpretability. However, their reliance on predefined edge weights makes them less effective in trace-based datasets with uncertain or evolving connections. Pattern-matching techniques have also been used to identify recurring structures or motifs within event sequences [
16]. These methods excel in rigid environments with well-defined rules but struggle to adapt to dynamic datasets. In contrast, deep learning methods like GNNs provide greater adaptability in scenarios with evolving patterns. Sequence-based models, such as long short-term memory networks (LSTMs), excel at capturing temporal dependencies within sequential data [
23]. However, they focus primarily on the temporal order of events, often overlooking the relational structures critical for graphs. Compared to GNNs, LSTMs are less effective in capturing the interplay of relational and temporal dependencies in event-based datasets. DeepWalk, a random-walk-based model, generates low-dimensional embeddings through graph connectivity [
51]. While effective for static graphs, DeepWalk struggles with dynamically evolving structures, as its embeddings rely on fixed connectivity patterns. Statistical methods, such as those by Yao et al. [
52], provide insights into graph properties using metrics like clustering coefficients and edge density. Although these methods enhance interpretability, they lack predictive power compared to GNN-based techniques. When comparing these methods, GNNs stand out for their ability to model relational dependencies, with GraphSAGE offering scalability and GATs enhancing interpretability. LSTMs are strong in temporal modeling but fall short in capturing structural relationships, while pattern-matching techniques and statistical methods provide useful insights but lack adaptability. DeepWalk bridges connectivity-based learning and dimensionality reduction but struggles with dynamic graphs.
Graph neural networks (GNNs) offer a mechanism to embed graphs explicitly [
53]. These networks process graph data to produce vector representations, which can be used for tasks such as classification. GNNs achieve this by learning graph embeddings guided by specific training objectives.
A key challenge in GNNs is adapting the concept of convolution to graph structures, inspired by traditional convolutional neural networks. Unlike Euclidean domains, defining convolution on graphs involves complex theoretical considerations due to the lack of a straightforward definition. Two main approaches have emerged: spectral and spatial methods.
Spectral Methods: These rely on the convolution theorem, which equates convolution in the spatial domain to a product in the frequency domain. Although the theorem is derived for Euclidean spaces, some approaches extend it to graphs by utilizing the spectrum of the graph Laplacian. However, spectral methods face some limitations: (a) they are sensitive to changes in graph topology, as minor structural variations can significantly impact convolution results; and (b) the computation requires matrix diagonalization, which is computationally expensive due to the lack of an efficient graph-based fast Fourier transform (FFT).
Spatial Methods: These approaches attempt to replicate the classical convolution operation directly on the graph’s spatial domain. While they aim to preserve the graph’s structural information, many methods simplify the graph too much, potentially overlooking critical details.
2.3. Link Prediction with GNN
We selected the GNN approach for the equivalence link prediction, as the GNN has superior efficiency in the standard link-prediction problems. The input feature vectors for the link prediction are generated as low-dimensional embedding vectors describing the context and feature parameters of the nodes and edges.
In the GNN, for each node, the engine collects the context description vectors found in the neighboring nodes. The usual approach is to perform random walks and to collect the context vectors of the nodes touched in the walk. In the next phase, the collected context vectors are aggregated into an updated feature vector of the current node. The generated status vectors can be used as input feature vectors of the graph components in the required classification problems.
Regarding the concrete embedding generation approaches, we selected the following popular benchmark methods: Node2Vec, Hin2Vec, GraphSage, and HinSage.
The simplest approach is the Node2Vec [
54] method, which can be applied to homogeneous graphs. In this case, there is only one type of node and edge. The method uses a biased random walk to collect the local status vectors from the neighborhood nodes. The sampled node sequences are processed with the Skip-Gram model, which learns embeddings by maximizing the likelihood of co-occurring nodes within a defined context window. The simplicity and scalability of Node2Vec make it effective for homogeneous graphs, but it is limited in its ability to capture semantic relationships in heterogeneous information networks.
The Hin2Vec [
55] approach is an extension of the Node2Vec framework to heterogeneous graphs. In heterogeneous graphs, nodes and edges are associated with types; different nodes or edges may belong to different types. This type information can be used to impose constraints on the generated walks during the embedding calculations. Like Node2Vec, Hin2Vec uses the Skip-Gram model to learn embeddings, but it outputs not only node embeddings but also link embeddings.
GraphSage [
56] is a general inductive framework that uses node feature data to generate the embedding vector. The resulting embedding is calculated by sampling and aggregating features from a node’s local neighborhood. The proposed method supports both supervised and unsupervised learning; it can be used for node classification/regression, and link prediction for homogeneous networks. The pseudo code of embedding generation is shown in Algorithm 1.
Algorithm 1: GraphSAGE embedding generation algorithm
Input: graph G(V, E); input features {x_v}; depth K; weight matrices W^k; non-linearity σ; differentiable aggregator functions AGGREGATE_k; neighborhood function N
Output: vector representation z_v
1. h^0_v ← x_v, for all v ∈ V
2. for k = 1 … K do
3.   for v ∈ V do
4.     h^k_{N(v)} ← AGGREGATE_k({h^{k−1}_u : u ∈ N(v)})
5.     h^k_v ← σ(W^k · concat(h^{k−1}_v, h^k_{N(v)}))
6.   end
7.   h^k_v ← h^k_v / ‖h^k_v‖₂, for all v ∈ V
8. end
9. z_v ← h^K_v, for all v ∈ V
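To make the aggregation loop of Algorithm 1 concrete, the following minimal NumPy sketch performs a forward pass with a mean aggregator and a ReLU non-linearity; the function and variable names are illustrative, and each weight matrix W^k is assumed to have 2·d_{k−1} columns so that it matches the concatenation in line 5.

import numpy as np

def graphsage_embed(features, neighbors, weights, K):
    # features : dict node -> initial feature vector x_v (NumPy array)
    # neighbors: dict node -> list of neighbor nodes (the function N)
    # weights  : list of K weight matrices W^k
    h = {v: np.asarray(x, dtype=float) for v, x in features.items()}  # h^0_v = x_v
    for k in range(K):
        h_new = {}
        for v in h:
            # AGGREGATE_k: mean of the neighbors' previous-layer vectors
            neigh = [h[u] for u in neighbors[v]] or [np.zeros_like(h[v])]
            h_neigh = np.mean(neigh, axis=0)
            # combine own and neighborhood vectors, apply the non-linearity
            z = weights[k] @ np.concatenate([h[v], h_neigh])
            h_new[v] = np.maximum(z, 0.0)
        # normalize every vector to unit L2 norm (line 7 of Algorithm 1)
        h = {v: x / (np.linalg.norm(x) + 1e-12) for v, x in h_new.items()}
    return h  # z_v = h^K_v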
The HinSage [
57] method is an enhancement of the inductive GraphSAGE model for heterogeneous graph representation learning. The method employs a neural aggregation mechanism to generate embeddings using sampling and aggregation steps. For each node, type-specific neighbors are sampled, and separate aggregations are performed for each type. The aggregated embeddings are then combined to update the node's representation. Unlike Node2Vec and Hin2Vec, HinSage does not require manually defined meta-paths, as it learns the relationships directly from the data. Additionally, HinSage supports inductive learning, making it suitable for dynamic or partially observed graphs where new nodes or edges are introduced. We remark that most embedding frameworks found in the literature use a transductive approach; thus, they can only generate embeddings for a single fixed graph. As experience shows, transductive approaches do not generalize efficiently to unseen nodes and, consequently, cannot learn to generalize across different graphs.
2.4. Limitations of Current Process Discovery Methods
Considering the traditional, automaton-based process discovery methods, all the leading methods are based on some form of the directly-follows graph (DFG), which is a directed graph that represents the sequence of activities. Process mining practitioners tend to use such simplified DFGs actively. In order to cope with complexity, DFGs are usually simplified in many ways [58]. One standard method is the removal of infrequent nodes and edges. Another approach is to assume that each event node type is identified uniquely and, thus, every node type has a unique node in the DFG and in the generated schema graph. In some real application fields, especially with an imperfect information background [59], this uniqueness assumption cannot be guaranteed. In these cases, the resulting schemas are incorrect and misleading. Another source of confusion is the multiple occurrence of a given event type inside the traces. The same event type may have different prefix and postfix sequences at different positions. Mapping the different subtypes to the same event type causes over-generalization; in this case, the schema generates hallucinations and invalid traces. Besides these simplifying assumptions, another key issue is the complexity limitation of the standard methods. Due to the inherent algorithmic nature of the automaton approaches, the detection of complex, deeply nested structures is a great challenge, and the current methods are not able to manage these complex structures within an acceptable time.
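As a small illustration of the uniqueness assumption discussed above, the following sketch builds a plain directly-follows graph from an event log; because every event label is mapped to exactly one node, two occurrences of the same event type in different contexts inevitably collapse into a single DFG node (the function name and data layout are chosen only for this example).

from collections import Counter

def build_dfg(traces):
    # traces: list of traces, each a list of event labels, e.g. ['a', 'b', 'c']
    # returns a Counter mapping (label_i, label_j) -> directly-follows frequency
    dfg = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            dfg[(src, dst)] += 1
    return dfg

# The two 'b' events play different roles in the process, yet they share one DFG node.
print(build_dfg([['a', 'b', 'c', 'b', 'd'], ['a', 'b', 'd']]))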
Based on these shortcomings, there has been intensive research in recent years on improving the efficiency of process discovery methods, focusing mainly on the application of machine learning techniques. Although there are many promising results, the current technology is far from perfect. Summarizing the current achievements, we can highlight the following approaches:
Current industry-level solutions (like the alpha miner, inductive miner, or heuristic miner) are universal tools, but they cannot cope with event type multi-occurrences and deeper nesting.
Application of recurrent neural networks (LSTM) for process discovery [
60]. The model is strong at finding XOR patterns, but weaker at discovering loops or event type multi-occurrences.
There are only a few methods for process mining using graph neural networks technology. The most recent approach [
61] applies the GNN for an extension of the DFG approach; the proposed method is not suitable for managing the problem of event type multi-occurrences. The main reason for the lack of more GNN-based solutions is that the GNN architecture mainly focuses on the manipulation of edges and nodes inside a fixed input graph. The generation of a new graph from very different graph pieces requires a novel GNN architecture.
The main goal of our investigation is to address these shortcomings and to propose a new process discovery approach which integrates both machine learning components and traditional graph methods. The proposed method applies convolutional neural networks (CNNs) for fuzzy loop detection and similarity calculation. As the GNN is the standard tool for link prediction in graphs, we apply the GNN approach to find the equivalent event instances of a given event type and then employ standard quasi-clique detection methods to determine the event nodes of the resulting schema graph.
3. Proposed Schema Induction Method
The set of event types is given as a finite alphabet of activity symbols. The workflow model is defined by a directed graph containing action nodes and control nodes, where each action node has one input and one output edge; a fork node has one input edge and several output edges; a merge node has several input edges and one output edge; the start node has no input edge and one output edge; and the terminal node has one input edge and no output edge.
Each node in the model is uniquely identified by its index value. We assign an event type to each action node; in our model, different nodes of the workflow model may have the same event type. Two activity nodes are in adjacency relation if they are connected by a direct edge or by a path in the model in which every intermediate node is a control node.
A trace related to model M is defined as a sequence of elementary events, where each event corresponds to an action node of the model. The corresponding model node for a trace event e is denoted by its m-identifier. The input dataset, an event log L, is given by a set of traces related to the same model, and the set of events on L is the union of the events occurring in its traces. We define an equivalence relation Λ on this event set, where two events are equivalent if and only if they correspond to the same node of the workflow model.
Regarding the investigated task, we can see that our process discovery problem domain differs from the standard graph mining methods in many aspects:
The input graph and the output graph are disjoint;
The resulting graph is a new graph generated during the prediction process;
The input graph is a set of traces, where each trace is a sequence of nodes;
Each node in the input graph has a one-dimensional neighborhood.
Thus, our task is not a standard node or edge classification task; rather, it is the clustering of the input nodes where each cluster corresponds to a node in the resulting schema graph. The outline of the proposed method consists of the following steps:
Develop a link prediction method for mining the equivalence pairs on the nodes of the incoming traces;
Determine the best fitting equivalence classes on the event set;
Link prediction between the equivalence classes;
Finalization of the workflow graph model.
The equivalence relation plays a crucial role in the schema induction, as each node in the workflow schema model corresponds to an equivalence class, and the links in the schema are generated from the links between the equivalence classes.
3.1. Loop Detection
Loops cause special difficulties in process discovery methods; the detection of loops is a hard task. Although finding exact repetitions can be solved with direct sequence matching algorithms, finding nested loops or fuzzy (not exactly repeating) loops is a complex problem. To enhance the efficiency of schema discovery, we also apply a loop detection mechanism as a pre-processing step, which identifies subsequences that are likely to correspond to loops. This information is used as a component of the calculated context description matrix for equivalence mining.
In the loop detection module, we apply a loop prediction mechanism using a neural network approach. Given a sequence of length n, the input context is a similarity matrix of size n × n, which indicates which positions contain the same value. A matrix element is equal to 1 if the corresponding elements are the same; otherwise, the matrix element is 0.
Figure 1 shows examples for the construction of the similarity matrix. The applied convolutional network takes this similarity matrix as input, and the output is the prediction of the loop sections within the input sequence. With this method, we convert the loop detection problem into an image classification problem, namely, the similarity matrix can be considered as a two-dimensional image. In the case of loop segments, this image contains some characteristic lines parallel to the main diagonal. The applied convolutional network can detect these characteristic lines in the image.
We remark that, as this approach requires an input matrix of quadratic size, it can be applied to moderate-sized sequences. In our problem domain, this condition is met; the maximal trace length in practice was always moderate.
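A minimal sketch of the similarity-matrix construction used as the CNN input is given below; repetitions of a subsequence appear as lines parallel to the main diagonal of this 0/1 matrix, which is exactly the pattern the convolutional network is trained to detect (the function name is illustrative).

import numpy as np

def self_similarity_matrix(trace):
    # binary self-similarity matrix: 1 where the event types at two positions match
    n = len(trace)
    sim = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            sim[i, j] = 1.0 if trace[i] == trace[j] else 0.0
    return sim

# The repeated segment ('b', 'c') yields off-diagonal lines parallel to the main diagonal.
print(self_similarity_matrix(['a', 'b', 'c', 'b', 'c', 'd']))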
3.2. Equivalence Prediction
If two nodes in the input traces are Λ-equivalent, then they share the same neighborhood context in the workflow schema model. Thus, they will share similar neighborhood contexts at the trace level, too. This means that in the trace-level neighborhood, the values and positions of the event types should be similar. It is important to preserve not only the value distribution but also the position parameters. For example, consider the following two traces:
- trace1:
t0, t1, t2, t3, t4
- trace2:
t0, t4, t3, t2, t1
The right-side contexts of event type t0 share the same set of event types, but the positions are very different. To preserve both the value and position information in a more efficient way, we propose to use a similarity matrix to describe context similarity.
Definition 1 (Multi-perspective embedding).
Having M different aspects, the multi-perspective embedding feature description of a node n is represented by a set of embedding vectors instead of a single vector, E(n) = {e_1(n), e_2(n), …, e_M(n)}, where e_i(n) is the embedding vector of node n related to aspect i.
Example 1. In a graph, for the context description of a node, we can generate distinct description vectors for the different distance positions. Thus, e_i(n) denotes the subcontext description related to nodes with a distance value i from the current node n. The perspective here is based on the distance value in the graph, and each aspect has a corresponding subcontext. In general, the different aspects may also have overlapping subcontexts. Having the multi-perspective embedding, we apply a novel similarity operator which generates a similarity matrix instead of a scalar similarity value.
Definition 2 (Context similarity matrix).
The context similarity matrix S for two contexts is an M × M matrix with S_{i,j} = sim(e_i(n_1), e_j(n_2)), where we assume that all aspects are comparable and a similarity value can be calculated. In the binary case, S_{i,j} takes the value 1 for matching elements and 0 otherwise. In our problem domain, having only directed sequence contexts, we apply the distance-based aspect definition. In this formula, the aggregation collects all element pairs from the graph neighborhood having the distances i and j. The aggregation function used for feature calculation can be any of the usual operators, such as the average or max operators. In the case of directed graphs, we propose two S matrices for a given node, one for the context of incoming edges and one for the outgoing edges.
Example 2. Let us take the following three traces:
- trace1 context:
t1, t2, t3, t4
- trace2 context:
t4, t3, t2, t1
- trace3 context:
t1, t2, t5, t4
The corresponding similarity matrices indicate, position by position, where the contexts contain matching event types (see the sketch below for their explicit computation). Thus, we use matrices for the context similarity instead of the usual cosine similarity. In our current problem domain, as the context is always a simple directed sequence, we apply the matrices as similarity representations of two nodes.
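The binary context similarity matrices of Example 2 can be reproduced with the short sketch below, which compares two directed sequence contexts position by position (the distance-based aspects); pairing trace1 with trace2 and with trace3 is an assumption made for the illustration, and the helper name is not part of the proposed framework.

import numpy as np

def context_similarity_matrix(ctx_a, ctx_b):
    # binary similarity matrix S with S[i, j] = 1 iff ctx_a[i] == ctx_b[j]
    return np.array([[1.0 if a == b else 0.0 for b in ctx_b] for a in ctx_a],
                    dtype=np.float32)

trace1 = ['t1', 't2', 't3', 't4']
trace2 = ['t4', 't3', 't2', 't1']
trace3 = ['t1', 't2', 't5', 't4']

# trace1 vs. trace2: the matches form the anti-diagonal (same values, reversed order)
print(context_similarity_matrix(trace1, trace2))
# trace1 vs. trace3: the matches lie on the main diagonal, except the third position (t3 vs. t5)
print(context_similarity_matrix(trace1, trace3))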
For the prediction of the equivalence of two nodes, we utilize a neural network architecture. As the input is a square matrix, we employ a convolutional neural network (CNN). The main motivation for using the similarity matrix instead of the usual similarity vector was the fact that we must discover fuzzy similarity among the different subsections. In this case, any two different subsections may relate to each other; thus, a rigid linear structure is not flexible enough. Treating the similarity matrix as a special image matrix, we can use a CNN to discover the similarity relationships among the different image sections.
The general structure of the proposed equivalence prediction framework is presented in
Figure 2. The inputs are the context descriptions of two events from the same or different traces. We apply an embedding module to generate the context feature vectors. In the next step, we generate a similarity matrix that provides a multi-aspect context description. This matrix is then processed by a CNN module to predict the similarity level.
For the training of the CNN module, the input dataset was generated with the following procedure:
Generation of synthetic workflow models of different complexity;
Generating random traces, where each node has a label (i.e., event type, visible attribute) and a parent ID (i.e., the identification of the corresponding node in the schema, hidden attribute);
Clustering the nodes in the trace pool based on the label;
Generating node pairs where both members belong to the same cluster (same visible parameter);
Generation of a balanced training dataset from the pairs (binary category: matching parent ID or not);
In the prediction phase, the trained CNN will be used to find those node pairs in the trace pool, which are in equivalence. As the prediction is not perfect, we employ a post-processing phase, where we build the equivalence classes for the nodes using additional consistency optimization.
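A hedged sketch of the balanced pair-generation step described above is shown next; it assumes that each synthetic trace node carries a visible label (event type) and a hidden parent ID, and all names and the sampling strategy are illustrative only.

import random
from collections import defaultdict

def make_training_pairs(nodes, pairs_per_label=1000, seed=0):
    # nodes: list of dicts such as {'label': 'a', 'parent': 3, 'context': ...}
    # positive pairs share label and parent ID; negative pairs share only the label
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for n in nodes:
        by_label[n['label']].append(n)
    positives, negatives = [], []
    for group in by_label.values():
        if len(group) < 2:
            continue
        for _ in range(pairs_per_label):
            a, b = rng.sample(group, 2)
            (positives if a['parent'] == b['parent'] else negatives).append((a, b))
    k = min(len(positives), len(negatives))  # keep the two classes balanced
    return positives[:k] + negatives[:k]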
3.3. Building Process Schema Graph
Having an imperfect equivalence prediction among the nodes, we can construct a graph where the node set is equal to the nodes in the trace pool and two nodes are connected if and only if the link is predicted by the CNN module. As the prediction is not perfect, the initial graph is not well separable into disjoint components. In order to determine the partitioning with the best separation, we utilize an optimization process.
To find the best separated clusters, we apply a quasi-clique detection algorithm. A quasi-clique in an undirected graph is defined as a subset of vertices whose edge density reaches at least a prescribed threshold. The maximum quasi-clique problem consists in finding quasi-cliques of the largest cardinality, and it is a known NP-hard problem [62]. The task of finding the partition with the best separation is equivalent to finding the maximal quasi-cliques.
In the literature, there are three method families for finding the maximal quasi-cliques. The first group covers the methods providing exact solutions; the linear programming and dynamic programming methods belong to this family. The main drawback of these solutions is that they cannot solve large-scale problems; their execution complexity is too large. For practical applications, general heuristics or customized heuristics are the preferred approaches. One of the candidate methods for general heuristics is the greedy algorithm [
63].
Usually, the greedy algorithm is extended with some special pre-processing or post-processing modules [
64]. In the literature, there are only a few proposals on customized heuristics; we can highlight the work of Guo et al. [
65] as the most recent novel approach. Their work is designed for large-scale quasi-clique mining problems; it uses a divide-and-conquer approach implemented in a parallel architecture.
In our test framework, we developed a heuristic method that works in a greedy way to extend the current clique. The main steps of the proposed algorithm can be summarized in the following list:
Calculating the degrees of the graph nodes (deg(n));
Ordering the nodes in descending degree values (list D); initially, each node in D is free (not assigned to a clique);
Taking the first free node in D (n0), open a new clique C (n0); the status of n0 is set to “processed”;
Processing the free nodes in D (node n) in a loop;
Selecting the candidate node ns with maximal density with respect to the current clique;
If dens(ns) > alpha threshold, then
ns is assigned to C(n0), and the status of ns is set to “processed”;
If there are free nodes in D, then go to Step 4; otherwise, terminate the process;
Close the current clique;
If there are free nodes in D, then, go to Step 3; otherwise, terminate the process.
The result of the presented algorithm is the set of quasi-cliques. The resulting quasi-cliques are treated as nodes in the final schema graph.
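The greedy heuristic above can be sketched as follows; here dens(n) is taken as the fraction of current clique members adjacent to the candidate node, and alpha is the acceptance threshold (this density definition and the names are assumptions made for the illustration).

def greedy_quasi_cliques(adj, alpha=0.6):
    # adj: dict node -> set of neighbor nodes (the predicted equivalence links)
    # returns a list of node sets, i.e., the quasi-cliques / schema-node candidates
    free = set(adj)
    order = sorted(adj, key=lambda n: len(adj[n]), reverse=True)  # Steps 1-2
    cliques = []
    for seed in order:
        if seed not in free:
            continue
        clique = {seed}            # Step 3: open a new clique with the seed node
        free.discard(seed)
        while True:
            best, best_dens = None, 0.0
            for n in free:         # Steps 4-5: candidate with maximal density
                dens = len(adj[n] & clique) / len(clique)
                if dens > best_dens:
                    best, best_dens = n, dens
            if best is None or best_dens <= alpha:
                break              # Step 9: close the current clique
            clique.add(best)       # Steps 6-7: extend the clique
            free.discard(best)
        cliques.append(clique)
    return cliques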
For the generation of the edges in the workflow schema model, we calculate the connection strength (st) for every directed cluster pair. To accept a link e in the schema graph, its connection strength must exceed a given acceptance threshold.
In the post-processing stage, we perform a finalization of the schema graph, where we adjust the links so that the graph satisfies the integrity constraints of a valid workflow model.
The resulting graph will be presented as the induced workflow model schema graph.
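One possible reading of the connection-strength criterion is sketched below: the strength of a directed cluster pair is estimated from how often members of the first quasi-clique are directly followed by members of the second one in the traces, and an edge is kept when its relative strength exceeds a threshold. The exact formula of the paper is not reproduced here, so both the normalization and the names are assumptions.

from collections import Counter

def schema_edges(traces, node_cluster, threshold=0.2):
    # traces: list of traces given as lists of trace-node identifiers
    # node_cluster: dict mapping each trace node to its quasi-clique index
    strength = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            strength[(node_cluster[src], node_cluster[dst])] += 1
    total_out = Counter()
    for (c_src, _), cnt in strength.items():
        total_out[c_src] += cnt
    # keep a directed edge if its relative outgoing strength reaches the threshold
    return {edge: cnt / total_out[edge[0]]
            for edge, cnt in strength.items()
            if cnt / total_out[edge[0]] >= threshold}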
3.4. Complexity Analysis
For the execution complexity analysis of the proposed method, we decompose the algorithm into the following key modules:
Pre-processing for eliminating trace duplications;
Pre-processing for fuzzy loop detection;
Calculating the similarity matrices;
Quasi-clique detection;
Schema generation;
In our model, the cost complexity value depends on the following parameters:
- -
N: number of traces in the train dataset.
- -
L: average length of the traces.
- -
E: number of distinct event types.
- -
W: window size for the graph context description.
In Step 1, we convert each trace into a string and use a hash structure or set to eliminate the duplications. The cost estimation is proportional to N · L, scaled by the bucket density factor of the hash table.
The cost of fuzzy loop detection is proportional to the number of traces, scaled by the cost factor of the CNN1 module.
Step 3 requires more resources, as we must test all the possible event pairs belonging to the same event type. In this case, the number of pairs to be tested grows quadratically with the number of events per event type, and the total cost of calculating the similarity values for all pairs is this pair count multiplied by the cost factor of the CNN2 module. Having the similarity graph, the next step is to detect the quasi-cliques inside the graph. We have a separate graph for each event type, whose node size is the number of events of that type, and the general cost of clique detection grows polynomially with this node size. In the last step, we construct a graph from the cliques; as the number of cliques is less than the number of events, the cost of this step is dominated by the previous ones. Thus, taking only the dominant components, the final cost approximation is determined by the pairwise similarity calculation (Step 3) and the quasi-clique detection (Step 4).
This cost analysis shows that there are two critical components in the proposed methodology. The first is calculating the similarity values for each related event pair, and the second is mining the quasi-cliques. In this version, we used the baseline implementation of both modules. The further optimization of these units will be the next step of the project.
4. Test Experiments
For the tests, we generated synthetic event logs in order to cover a wide range of complexity scales. For the test data, we generated schemas of hierarchical structure, as industrial process models are also typically restricted to this kind of schema family. Many of the existing schema mining models apply an event tree structure for the schema representation. We introduced a well-nested linear schema description language with the following components:
[S1, S2, …] : sequence
‘X’, [S1, S2, …] : XOR branching
‘L’, [S1] : LOOP with kernel S1
‘L’, [S1, S2] : LOOP with a XOR kernel
‘%’ : start node
‘#’ : end node
‘t’ : activity event of type t
For example, the description [‘%’, ‘a’, ‘X’, [[‘a’], [‘b’]], ‘d’, ‘e’, ‘#’] denotes a schema containing sequences and a simple XOR branching, where one of branches contains the event ‘a’, while the other contains event ‘b’.
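To illustrate the linear schema description language, the sketch below generates one random trace from such a nested-list description; the loop repetition count and the treatment of the XOR kernel are assumptions made for the example, not the exact generator used in our test framework.

import random

def generate_trace(schema, rng=None, max_loop=3):
    # generate one random trace from the nested-list schema description
    rng = rng or random.Random(0)
    trace = []
    i = 0
    while i < len(schema):
        item = schema[i]
        if item == 'X':                        # XOR: pick exactly one branch
            trace += generate_trace(rng.choice(schema[i + 1]), rng, max_loop)
            i += 2
        elif item == 'L':                      # LOOP: repeat the kernel several times
            for _ in range(rng.randint(1, max_loop)):
                trace += generate_trace(rng.choice(schema[i + 1]), rng, max_loop)
            i += 2
        elif item in ('%', '#'):               # start / end markers carry no event
            i += 1
        else:                                  # plain activity event of type item
            trace.append(item)
            i += 1
    return trace

print(generate_trace(['%', 'a', 'X', [['a'], ['b']], 'd', 'e', '#']))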
To show the schema complexity, we introduced a complexity measure. The measure depends on the maximal depth of the nesting and is denoted by a symbol pair, where n denotes the depth of the XOR components and m denotes the depth of the LOOP components in the maximal nesting element.
Table 1 shows some examples for the complexity measure.
The test environment was implemented in the Python language using the TensorFlow/Keras 2.10 framework.
4.1. Tests on Prediction of Equivalence
For the prediction of the Λ equivalence, we built an architecture containing four main modules. Three modules process the context features related to the event types: one module for the input context, one for the output context, and one that combines both directions. We have four input contexts; three of them are assigned to the mentioned three modules. The fourth input contains a context generated by the loop detection module; it indicates the likelihood of loop existence. The main neural network block processes the concatenation of the three outputs plus the fourth context as input for the link prediction. The code outline is presented in Listing 1.
Listing 1. Architecture of the neural network for equivalence prediction. |
input_shape1 = (2*wsize, 2*wsize, 1)
input_shape2 = (wsize, wsize, 1)
input_shape3 = (2, )

I1 = Input(shape=input_shape1, name="I1")
L11 = Conv2D(32, (2, 2), activation='relu', padding='same', name="L11")(I1)
L12 = MaxPooling2D((2, 2), name="L12")(L11)
L13 = Conv2D(64, (3, 3), activation='relu', name="L13")(L12)
L14 = MaxPooling2D((2, 2), name="L14")(L13)
L15 = Flatten(name="L15")(L14)
L16 = Dense(128, activation='relu', name="L16")(L15)
L17 = Dropout(0.5, name="L17")(L16)
O1 = Dense(2, activation='softmax', name="O1")(L17)

I2 = Input(shape=input_shape2, name="I2")
L21 = Conv2D(32, (2, 2), activation='relu', padding='same', name="L21")(I2)
L22 = MaxPooling2D((2, 2), name="L22")(L21)
L23 = Conv2D(64, (3, 3), activation='relu', name="L23")(L22)
L25 = Flatten(name="L25")(L23)
L26 = Dense(128, activation='relu', name="L26")(L25)
L27 = Dropout(0.5, name="L27")(L26)
O2 = Dense(2, activation='softmax', name="O2")(L27)

I3 = Input(shape=input_shape2, name="I3")
L31 = Conv2D(32, (2, 2), activation='relu', padding='same', name="L31")(I3)
L32 = MaxPooling2D((2, 2), name="L32")(L31)
L33 = Conv2D(64, (3, 3), activation='relu', name="L33")(L32)
L35 = Flatten(name="L35")(L33)
L36 = Dense(128, activation='relu', name="L36")(L35)
L37 = Dropout(0.5, name="L37")(L36)
O3 = Dense(2, activation='softmax', name="O3")(L37)

I4 = Input(shape=input_shape3, name="I4")

X1 = Concatenate(name="X1")([O1, O2, O3, I4])
X2 = Dense(16, activation='relu', name="X2")(X1)
O = Dense(2, activation='sigmoid', name="O")(X2)

model1 = Model(inputs=I1, outputs=O1)
model2 = Model(inputs=I2, outputs=O2)
model3 = Model(inputs=I3, outputs=O3)
model = Model(inputs=[I1, I2, I3, I4], outputs=O)
In the training phase, we first trained the model1, model2, and model3 components separately. In the second phase, we froze these models and trained only the main model.
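The two-phase training described above can be expressed in Keras with a few lines; this is a minimal sketch that reuses the model objects of Listing 1, while the optimizer, losses, and the commented fit calls (with assumed data variables) are illustrative only.

# Phase 1: train the three context submodels separately.
for sub in (model1, model2, model3):
    sub.compile(optimizer='adam', loss='categorical_crossentropy')
    # sub.fit(x_sub, y_sub, epochs=30, batch_size=64)   # per-submodel data assumed

# Phase 2: freeze the submodels and train only the main model.
for sub in (model1, model2, model3):
    for layer in sub.layers:
        layer.trainable = False

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit([x1, x2, x3, x4], y, epochs=30, validation_split=0.2)  # assumed data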
In the first tests, we investigated the accuracy of the proposed embedding and similarity methods. We involved two widely used benchmark embedding methods used in heterogeneous graphs containing labeled nodes. These two methods are the Hin2Vec and GraphSAGE methods. We applied the implementation of Hin2Vec found at the Github repository
https://github.com/csiesheep/hin2vec/tree/master (accessed on 10 December 2024). Regarding GraphSAGE, we used the implementation developed in the Stellargraph project, the code is available at
https://pypi.org/project/stellargraph-mvisani/ (accessed on 12 December 2024).
In the efficiency tests, we used a wide range of parameters, including the following:
- -
Epoch number (10, 20, 30, or 40)
- -
Window size of the context (4, 6, or 8)
- -
Train/test split ratio (10%, 20%, or 30%)
Based on the test results with the proposed module, the following setting provided the best results: (epoch=30, window size: 6, train/test split ratio: 20%). The test results are summarized in
Table 2. The table shows the measured accuracy values; the first number is the validation accuracy, while the second is the test accuracy. The values are aggregated values of five runs. These results show the significant dominance of the proposed method for equivalence link prediction. The dominance of the proposed method was visible in all parameter settings; thus, we applied this method in the main process discovery.
In the tests, we first employed training data of homogeneous complexity; all traces are related to schemas of the same complexity level. As the test results given in
Table 2 show, the proposed method significantly dominates the other benchmark methods. In these tests, we processed training sets of moderate size (10,000 items). In the second phase of the test experiments, we generated heterogeneous training sets containing traces from schemas of different complexity levels. Here, the size of the training sets ranges from 100,000 items (dataset ALL_2) to 200,000 items (dataset ALL_3b). We can see that the proposed method also provides a significantly better classification accuracy here.
4.2. Tests on Schema Induction
In this subsection, we present the comparison test results for the proposed schema induction method. In the tests, we also involved the industry-standard PM4PY methods, namely the alpha miner, the heuristic miner, and the inductive miner. The implementation of all these methods is available at the Github repository
https://github.com/process-intelligence-solutions/pm4py (accessed on 28 November 2024). In the tests, we applied the simple invocation of these methods, as presented in Listing 2.
Listing 2. Testing the PM4PY methods. |
(event_log, dataf) = conv_to_df(traces_1)

pnet, initial_marking, final_marking = alpha_miner.apply(event_log)
pm4py.view_petri_net(pnet, initial_marking, final_marking, format='png')

tree = pm4py.discover_process_tree_inductive(event_log)
pnet, initial_marking, final_marking = pm4py.convert_to_petri_net(tree)
pm4py.view_petri_net(pnet, initial_marking, final_marking, format='png')

heu_net = pm4py.discover_heuristics_net(event_log, dependency_threshold=0.0)
pm4py.view_heuristics_net(heu_net)
In addition to the PM4PY methods, the efficiency comparison tests also cover the GNN-based process mining method proposed in [
61]. Unlike the PM4PY methods, this method applies machine learning engines for enhanced efficiency in schema induction.
Regarding the test datasets, we involved both synthetic and real-world benchmark datasets. For the generation of synthetic datasets, we developed a log data generation framework in which we can build trace datasets of different complexity levels. The complexity level is measured by the nesting level of the control structures and by the size of the schema graph.
The input for the schema induction task is a list of traces, which all belong to the same hidden schema. For the generation of the input trace list, first, a target schema was constructed, which was used for the generation of the traces. The events in the traces were identified by the event type. As the schema may contain the same event type at different nodes, the trace may also contain the repetition of the event types.
In the resulting graph of our proposed method, the nodes are labeled as t_i, where t is the event type and i denotes the index of the corresponding quasi-clique. The event type is input data, while the clique indexes are generated during the induction process.
The quality of schema generation is evaluated with the process mining measures proposed in [
66]. In the efficiency comparison tests, we applied the following measures.
- -
Simplicity (mS): This metric is calculated as the average number of incoming and outgoing arcs per node. A higher mS value means larger complexity, and it may indicate over-generalization.
- -
Structural similarity (mT): This metric shows the relative difference of the process matrices that contain the directly-follows relations between activities of the models to be compared. This metric returns a value between 0 and 1, where a high value indicates that the mined and the original models are similar.
- -
Capacity (mC): The number of different traces that can be generated from the schema. In the tests of complex schemas, we applied a random sampling approach to estimate this parameter.
- -
Precision (mP): This metric checks whether the traces enabled by the model correspond to the observed traces in the log; the higher the difference, the less precise the mined model is.
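As an example of these metrics, the simplicity measure mS can be computed directly from the node and edge lists of a schema graph, as in the sketch below; the structural-similarity and precision formulas are not reproduced here.

def simplicity(nodes, edges):
    # mS: average number of incoming plus outgoing arcs per node
    degree = {n: 0 for n in nodes}
    for src, dst in edges:
        degree[src] += 1   # outgoing arc of src
        degree[dst] += 1   # incoming arc of dst
    return sum(degree.values()) / len(nodes)

# small example schema: a -> b, a -> c, b -> d, c -> d
print(simplicity(['a', 'b', 'c', 'd'],
                 [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]))  # 2.0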
4.3. Tests on Synthetic Data
In the following, we first present some test examples of the different complexity levels. In the examples, we show the ground truth schema and the resulting schemas generated by the involved methods.
For this simple case, all the PM4PY methods (alpha miner, heuristic miner, and inductive miner) generated the same correct schema which is shown in
Figure 3.
The output of the proposed method is represented in
Figure 4.
Alpha miner:
Figure 5. Result schema by alpha miner for T2.
Inductive miner:
Figure 6. Result schema by inductive miner for T2.
Heuristic miner:
Figure 7. Result schema by heuristic miner for T2.
Proposed method:
Figure 8. Result schema by the proposed method for T2.
Evaluation: Only the proposed method was able to distinguish the two events with the same event type (b). The results of the alpha miner (Figure 5), the inductive miner (Figure 6), and the heuristic miner (Figure 7) contain invalid edges in the resulting schema graph. The related measures for our proposed method (Figure 8) are as follows:
The heuristic miner yields the following measurement values:
Alpha miner:
Figure 9. Result schema by alpha miner for T3.
Inductive miner:
Figure 10. Result schema by inductive miner for T3.
Heuristic miner:
Figure 11. Result schema by heuristic miner for T3.
Proposed method:
Figure 12. Result schema by the proposed method for T3.
Evaluation: All the benchmark methods failed to detect the different occurrence contexts of the same event type (see
Figure 9,
Figure 10 and
Figure 11). Only the proposed method (
Figure 12) induced the expected schema. The related measures for our proposed method are as follows:
The heuristic miner yields the following measurement values:
Alpha miner:
Figure 13. Result schema by alpha miner for T4.
Heuristic miner:
Figure 14. Result schema by heuristic miner for T4.
Inductive miner:
Figure 15. Result schema by inductive miner for T4.
Proposed method:
Figure 16. Result schema by the proposed method for T4.
Evaluation: Both the alpha miner and the heuristic miner failed to recognize the loop (
Figure 13 and
Figure 14). Only the inductive miner (
Figure 15) and the proposed method (
Figure 16) generated the correct results. The related measures for our proposed method are as follows:
The heuristic miner yields the following measurement values:
Alpha miner:
Figure 17. Result schema by alpha miner for T5.
Inductive miner:
Figure 18. Result schema by inductive miner for T5.
Heuristic miner:
Figure 19. Result schema by heuristic miner for T5.
Proposed method:
Figure 20. Result schema by the proposed method for T5.
Evaluation: As the resulting schema graphs show, the alpha miner (Figure 17) and the inductive miner (Figure 18) produced very imprecise schemas, while the heuristic miner yields invalid loop positions (Figure 19). The heuristic miner yields the following measurement values:
Only the proposed method (
Figure 20) could provide the perfect schema.
4.4. Tests on Real Data
In this section, we use the same benchmark datasets as the paper on GNN-based process discovery [66]. The source of the real datasets is the BPI Challenge. BPI Challenge datasets (
https://www.tf-pm.org/competitions-awards/bpi-challenge, accessed on 3 January 2025) have become important benchmarks in the community. For the presentation, we selected the following three datasets:
- -
BPI_2012_O
- -
BPI_2017_O
- -
BPI_2020_Permit_log
The size parameters of the data sets are summarized in
Table 3.
The key problem in the resulting schema of the inductive miner (
Figure 21) is that it is very over-generalized; for example, it allows a loop f → f or a → a, which is not part of the incoming traces. As the generated schema shows, the GNN miner (Figure 22) performed a very high level of reduction; some frequent traces are not covered by the schema. For example, there are traces where the last step is O_CANCELED (code: f). Our proposed method (Figure 23) generates a more compact schema; it distinguishes four subtypes of the activity d (d_3, d_6, d_7, and d_8) because these occurrences have very different neighborhood contexts.
The result of the heuristic method (
Figure 24) contains some incorrect structural sections, and the schema generated by the GNN method (Figure 25) is oversimplified. For example, neither of these methods discovers the repetition of the sequence CREATE_OFFER (code: a) → CREATED (code: b). On the other hand, the proposed method discovers that the event CANCELED (code: f) can be repeated only in specific contexts; thus, it generates more subtypes of this event. According to the test results, the proposed method (
Figure 26) provided the best matching with the input trace database.
The standard methods generated over-generalized schemas. The heuristic miner produced a complex and incorrect schema (
Figure 27). As this experiment shows, the GNN-based process discovery method (
Figure 28) cannot manage the multiple occurrences of the same event type in the expected way. Only the proposed method (
Figure 29) could distinguish the different occurrence contexts of the event types. Although the number of nodes is larger in the output schema than in the baseline schemas, the node simplicity level (number of related edges) is lower.
In
Table 4, we summarize the test results presenting the key efficiency measures of the investigated methods.
5. Discussion
We have tested our proposed method in two phases. In the first phase, we investigated the equivalence miner, which predicts which nodes from the different traces belong to the same node in the source schema graph. In the tests, we also analyzed two benchmark methods for the labeled graphs: Hin2Vec and GraphSAGE. As the test results show, our proposal could achieve significantly better accuracy in all test cases.
Having a precise equivalence classifier, we can construct the schema graph using a straightforward method: we discover the maximal quasi-cliques in the equivalence graph, and the cliques are converted into the nodes of the resulting schema. The comparison tests performed show that the proposed method provides similar or better results in all the investigated cases. The main benefit of the proposed method is that it can also be used in cases where the schemas and the traces contain multi-occurrence event types. The standard process discovery methods cannot separate the different occurrences of the same type, which results in incorrect and inaccurate outcomes.
The results of the performed efficiency tests demonstrate that the proposed method can manage complex cases which are not covered by the standard process discovery methods, and it provides more compact and more precise schema graphs. As
Table 4 shows, the proposed method can distinguish the different occurrence contexts of the same event type. The main consequences of this feature are that the resulting schema is less over-generalized, the precision related to the training set is better, and the simplicity level is improved.
Regarding more complex schemas, we can summarize the gained experiences in the following points:
The quality of the output significantly depends on the accuracy of the equivalence mining.
Increasing the size of the training dataset improves the accuracy of the prediction. For example, considering Example T5, using only 100,000 items in the training set, the engine generated some incorrect edges, as shown in
Figure 30. With a training set of double the size, the neural network system was able to perform a perfect prediction.
The precise detection of the nested loops requires additional investigations, as with increased nesting depth, the uncertainty level also increases.
The tests also revealed areas where further research is needed in the next phase. During the tests, we experienced the following limitations of the proposed method:
- -
The quality of the training set is an important factor in the equivalence prediction. If the test dataset differs significantly from the training set, the accuracy level degrades.
- -
The discovery of deeply nested loops requires further work, as does a more detailed analysis of the optimal neural network architecture and training patterns for the equivalence prediction.
- -
The presented loop detection CNN module provides weaker results in the case of complex nested structures; it is worth investigating alternative approaches, too.
- -
The presented method may be extended with additional configuration parameters to implement a more flexible process discovery engine. The set of possible parameter extensions covers, among others, frequency filtering, the decision threshold in the equivalence prediction, and the clique detection density threshold. A more flexible process discovery mechanism would increase the usability of the proposed method.
6. Conclusions
Process discovery is an important research field in the development of efficient tools for process automation. Current technologies are dominantly based on automata and pattern-matching approaches. Current industrial standard workflow schema induction methods impose certain limitations on the system being examined. For instance, log traces cannot contain identical event identifiers at different positions within the schema. Moreover, in cases of complex, multi-level nesting, the generated model becomes increasingly inaccurate. To address the aforementioned shortcomings, this article proposes a novel solution that employs graph neural networks to perform schema discovery.
In the developed procedure, we introduce a multi-aspect embedding format for characterizing nodes, and we apply a new two-dimensional similarity measure. In the first phase of schema generation, we perform equivalence prediction, implemented as an edge prediction task. From the obtained equivalence network, we identify the target schema nodes, which correspond to the maximal quasi-cliques of this network. The developed method has been implemented in a Python environment. Based on the conducted tests, we can see that the proposed context descriptor provides significantly more accurate results for the examined tasks than the other widely used methods involved (i.e., GraphSage, Hin2Vec). To evaluate the efficiency of schema induction, our method was compared with industrial standard approaches such as the alpha miner, inductive miner, heuristic miner, and a recent GNN-based algorithm.
The results of the efficiency tests show that the proposed method can handle complex cases that standard process discovery methods cannot address. Moreover, it produces a more compact and precise schema graph. Based on the obtained efficiency results, the proposed method appears to be a strong candidate for solving real-world process mining challenges. Future work will focus on fine-tuning the internal modules of the schema induction engine.