1. Introduction
The efficient management of production and business processes has a high priority in every organization. Activities are composed into different workflows that are executed regularly to achieve the defined goals. Efficient workflow design and control is of particularly high importance for today's complex production processes.
Process mining is an umbrella term covering different activities related to workflow control. Process mining is used to discover, monitor, or improve processes and process models by extracting knowledge from the measured event logs. One of the key functionality fields is process discovery [
1], which is used to build up the workflow process model from the available event logs. This tool is especially useful when manual workflow schema construction is too expensive or impossible. In these cases, automatic workflow schema induction provides a feasible and cost-effective solution. The generated workflow model can later be used, among others, for process automation or conformance checking.
According to the classification of the dominating business process-modelling tools presented in Mili et al. [
2], the main approaches in process modelling are as follows:
Automaton-based methods like Petri nets [
3], IDEF [
4], EPC (event-driven process chain) [
5], and role activity diagrams [
6];
Data flow diagrams and value-stream mapping;
Workflow modelling languages, like WPDL [
7] and BPMN [
8];
Programming language-oriented methods, including unified modelling languages (UML).
The Petri net model plays a key role as one of the oldest and most abstract models having many higher-level extensions. Regarding practical expressive power, the Business Process Model and Notation (BPMN) [
9] is the dominating tool. These models usually support XML-based data interchange formats. One of the most powerful modelling languages is the YAWL language [
10]. This model is the result of a systematic analysis by the Workflow Patterns Initiative [
11] on the existing process modelling notations and workflow languages. In the practical applications, the workflow structure can be modelled with the following control elements:
Sequence: linear structure, where a task in a process can be executed after the completion of a preceding task;
Branching (XOR): only one of the several outgoing paths is enabled;
Fork (AND): the process path is split into two or more branches, which are activated concurrently;
Split (OR): one or more of the several outgoing paths are enabled;
Merging: synchronization of the several incoming threads.
Process mining tools can be used also for performance analysis [
12], where the process mining is used to automate finding or calculating the performance metrics. This solution provides faster, immediate feedback on the current process status. Another application field is conformance checking [
13]. The engine compares the running or terminated processes with the given workflow model to discover differences and similarities between the model and event log. Most of the approaches perform the analysis on the complete terminated trace, which does not enable the prompt interaction of the control module. To enhance this important feature, current works focus on the development of on-line conformance analysis [
14]. The fourth pillar of process mining is the model enhancement domain [
15]. Having an existing workflow model, the enhancement module can manage dynamic workflow behavior; it can update the model with new elements and dependencies.
In all components of process mining, the workflow model plays a central role. The manual construction of the model usually requires the cooperation of several experts and considerable time to finalize; thus, manual model construction requires a high-cost investment. Driven by the requirements of cost-effective workflow model construction, automatic workflow model discovery has become a highly investigated research domain.
The analysis of widely used process discovery methods in industry reveals that classical automata or pattern-matching-based approaches impose significant limitations on their applicability. The goal of this paper is to present a novel approach capable of alleviating these constraints. The proposed method advances the edge prediction approach utilized in graph neural networks by introducing a new multi-aspect embedding with a two-dimensional similarity matrix. Comparative tests clearly demonstrate the efficiency of the proposed method and its potential for broader applicability.
In the
Section 2 of the paper, we present a survey on the existing approaches in process discovery and graph link prediction domains. In
Section 3, we introduce a special equivalence relationship on the node set of the event log traces. Here, we propose a novel approach for equivalence mining and describe the schema induction process. The presented approach utilizes the discovered equivalence links. For the detection of the schema nodes, we apply a quasi-clique detection algorithm. In
Section 4, we present the test environment and analyze the test results. We included two benchmark methods for the link prediction task and three industry-standard methods for the schema induction task. As the test results show, the proposed method dominates all the benchmark methods involved in the tests.
2. Related Works
2.1. Process Discovery
The automatic workflow schema induction is a complex process; it requires the integration of several processing modules. One of the key components is the mining of frequent subsequences in the event log. Frequent path mining is also of interest in many other domains. For example, Laxman et al. developed an algorithm for sequence prediction over long categorical event streams [
16]. Karoly and Abonyi [
17] and Weiss [
18] propose methods to extract temporal patterns in industrial alarm management systems. Analyzing study paths for the prediction of student dropout [
19] and identifying efficient learning patterns in e-learning environments [
20,
21] are key concepts in educational data mining. In these examples, frequent patterns are selected based on their utility. In [
22], the authors argue that the cost perspective should also be considered when mining event sequences.
The drawbacks of the traditional frequent pattern mining solutions come to the surface when dealing with massive datasets [
23]. The algorithms have problems with runtime and accuracy [
24], and their output data are difficult to interpret and handle [
25]. In order to efficiently mine frequent patterns, a tree-based representation has proved to be more compact and, practically, more usable [
26]. For this purpose, Lin et al. proposed the FUSP-tree structure [
27], which gave rise to numerous variations of incremental tree-based pattern mining algorithms [
28,
29,
30,
31].
A more efficient approach is to apply incremental discovery, which derives event patterns recursively. Another representation model was proposed by Chen [
32], who introduced a directed acyclic graph representation, which allows for pattern growth along both ends of the detected patterns. This algorithm ensures a faster pattern discovery, too. The work of Patel and Patel [
33] combines the graph-based approach with the clustering of patterns to avoid the recursive reconstruction of intermediate trees. Singh et al. [
34] use a graph-based approach to extract frequent sequential web access patterns, while Dong et al. [
35] present a new weighted graph structure and a method to find variable-length sequential patterns.
The first proposals on process discovery were published in the late 1990s [
36]. The first widely used method is the alpha miner [
37]. The alpha-miner algorithm generates a Petri net schema in the following steps:
Discovering distinct events;
Discovering starting and final events;
Grouping the events and constructing a footprint matrix to show direct succession, parallel, and choice control structures;
Determining the schema nodes and connecting them.
In the literature, we can find many application use cases of the alpha miner, including modeling educational processes [
38], clinical workflow analysis [
39], and software engineering [
40].
The inductive miner [
41] is an improved, powerful process discovery algorithm. It applies a structured, divide-and-conquer approach. The method produces correct, valid process models in the form of block-structured models (process trees and Petri nets). The approach processes the trace sequence recursively; the method splits the large input log into smaller segments, generates models for the segments, and finally merges the local models into a global final model. Regarding recent research directions, the main focus areas relate to the adaptation of the method to large-scale problems using a simplified, approximated model [
42].
The heuristic miner generates the workflow schema based on probabilistic models using dependency frequency measures [
43]. One key benefit of this approach is that the model can handle small amounts of noise in the event logs.
A recent comparative analysis of the different process discovery algorithms was presented in [
44]. In the tests on healthcare datasets, the inductive miner achieved the best result. Based on the author’s experiences, the PM4Py [
45] framework is ideal for professionals with knowledge in the area, while Disco [
46] is the simplest and most intuitive tool for discovering processes.
2.2. Graph Neural Networks in Process Mining
As neural network technology has proven its efficiency in many application areas, there is active research on developing process discovery methods that apply neural networks. The main strength of the neural network technology over the automaton and pattern-matching approaches is that it can also process complex structures and noisy datasets.
To analyze interactions and dependencies within event-based process flows, researchers have explored various approaches, including both graph neural network (GNN) [
47] models and alternative machine learning techniques. GNNs have proven effective for link prediction in event sequence analysis, where they capture complex graph-based dependencies. For instance, Kipf and Welling’s work on semi-supervised GNNs [
48] laid the groundwork for using these networks to infer missing links, leveraging node features and graph structures. Their approach achieved notable accuracy in static graphs but faced challenges in adapting to dynamic, trace-based datasets, where graph structures emerge over time. Hamilton et al.’s GraphSAGE model introduced an efficient node-embedding technique via sampling [
49], which is particularly useful in event traces with sparse connections between events. Compared to Kipf and Welling’s approach, GraphSAGE is more adaptable to dynamic graphs due to its inductive nature. However, its reliance on sampling can lead to information loss in rare but significant transitions. Velickovic et al.’s graph attention networks (GATs) assign varying importance to edges, enhancing the interpretability of high-probability links [
50]. Unlike GraphSAGE, which emphasizes embedding efficiency, GATs provide edge-level attention, improving interpretability. However, their reliance on predefined edge weights makes them less effective in trace-based datasets with uncertain or evolving connections. Pattern-matching techniques have also been used to identify recurring structures or motifs within event sequences [
16]. These methods excel in rigid environments with well-defined rules but struggle to adapt to dynamic datasets. In contrast, deep learning methods like GNNs provide greater adaptability in scenarios with evolving patterns. Sequence-based models, such as long short-term memory networks (LSTMs), excel at capturing temporal dependencies within sequential data [
23]. However, they focus primarily on the temporal order of events, often overlooking the relational structures critical for graphs. Compared to GNNs, LSTMs are less effective in capturing the interplay of relational and temporal dependencies in event-based datasets. DeepWalk, a random-walk-based model, generates low-dimensional embeddings through graph connectivity [
51]. While effective for static graphs, DeepWalk struggles with dynamically evolving structures, as its embeddings rely on fixed connectivity patterns. Statistical methods, such as those by Yao et al. [
52], provide insights into graph properties using metrics like clustering coefficients and edge density. Although these methods enhance interpretability, they lack predictive power compared to GNN-based techniques. When comparing these methods, GNNs stand out for their ability to model relational dependencies, with GraphSAGE offering scalability and GATs enhancing interpretability. LSTMs are strong in temporal modeling but fall short in capturing structural relationships, while pattern-matching techniques and statistical methods provide useful insights but lack adaptability. DeepWalk bridges connectivity-based learning and dimensionality reduction but struggles with dynamic graphs.
Graph neural networks (GNNs) offer a mechanism to embed graphs explicitly [
53]. These networks process graph data to produce vector representations, which can be used for tasks such as classification. GNNs achieve this by learning graph embeddings guided by specific training objectives.
A key challenge in GNNs is adapting the concept of convolution to graph structures, inspired by traditional convolutional neural networks. Unlike Euclidean domains, defining convolution on graphs involves complex theoretical considerations due to the lack of a straightforward definition. Two main approaches have emerged: spectral and spatial methods.
Spectral Methods: These rely on the convolution theorem, which equates convolution in the spatial domain to a product in the frequency domain. Although the theorem is derived for Euclidean spaces, some approaches extend it to graphs by utilizing the spectrum of the graph Laplacian. However, spectral methods face some limitations: (a) they are sensitive to changes in graph topology, as minor structural variations can significantly impact convolution results; and (b) the computation requires matrix diagonalization, which is computationally expensive due to the lack of an efficient graph-based fast Fourier transform (FFT).
Spatial Methods: These approaches attempt to replicate the classical convolution operation directly on the graph’s spatial domain. While they aim to preserve the graph’s structural information, many methods simplify the graph too much, potentially overlooking critical details.
2.3. Link Prediction with GNN
We selected the GNN approach for the equivalence link prediction, as the GNN has superior efficiency in the standard link-prediction problems. The input feature vectors for the link prediction are generated as low-dimensional embedding vectors describing the context and feature parameters of the nodes and edges.
In the GNN, for each node, the engine collects the context description vectors found in the neighboring nodes. The usual approach is to perform random walks and to collect the context vectors of the nodes touched in the walk. In the next phase, the collected context vectors are aggregated into an updated feature vector of the current node. The generated status vectors can be used as input feature vectors of the graph components in the required classification problems.
Regarding the concrete embedding generation approaches, we selected the following popular benchmark methods: Node2Vec, Hin2Vec, GraphSage, and HinSage.
The simplest approach is the Node2Vec [
54] method, which can be applied to homogeneous graphs. In this case, there is only one type of node and edge. The method uses a biased random walk to collect the local status vectors from the neighborhood nodes. The sampled node sequences are processed with the Skip-Gram model, which learns embeddings by maximizing the likelihood of co-occurring nodes within a defined context window. The simplicity and scalability of Node2Vec make it effective for homogeneous graphs, but it is limited in its ability to capture semantic relationships in heterogeneous information networks.
The Hin2Vec [
55] approach is an extension of the Node2Vec framework to heterogeneous graphs. In heterogeneous graphs, nodes and edges are associated with types; different nodes or edges may belong to different types. This type information can be used to impose constraints on the generated walks during the embedding calculations. Like Node2Vec, Hin2Vec uses the Skip-Gram model to learn embeddings, but it outputs not only node embeddings but also link embeddings.
GraphSage [
56] is a general inductive framework that uses node feature data to generate the embedding vector. The resulting embedding is calculated by sampling and aggregating features from a node’s local neighborhood. The proposed method supports both supervised and unsupervised learning; it can be used for node classification/regression, and link prediction for homogeneous networks. The pseudo code of embedding generation is shown in Algorithm 1.
Algorithm 1: GraphSAGE embedding generation algorithm
Input: graph G(V, E); input features {x_v}; depth K; weight matrices W^k; non-linearity σ; differentiable aggregator functions AGGREGATE_k; neighborhood function N
Output: vector representation z_v
1. h^0_v ← x_v, for all v ∈ V
2. for k = 1 … K do
3.   for v ∈ V do
4.     h^k_{N(v)} ← AGGREGATE_k({h^{k−1}_u : u ∈ N(v)})
5.     h^k_v ← σ(W^k · concat(h^{k−1}_v, h^k_{N(v)}))
6.   end
7.   h^k_v ← h^k_v / ‖h^k_v‖₂, for all v ∈ V
8. end
9. z_v ← h^K_v, for all v ∈ V
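To make the aggregation loop of Algorithm 1 concrete, the following minimal NumPy sketch performs a forward pass with a mean aggregator and a ReLU non-linearity; the function and variable names are illustrative, and each weight matrix W^k is assumed to have 2·d_{k−1} columns so that it matches the concatenation in line 5.

import numpy as np

def graphsage_embed(features, neighbors, weights, K):
    # features : dict node -> initial feature vector x_v (NumPy array)
    # neighbors: dict node -> list of neighbor nodes (the function N)
    # weights  : list of K weight matrices W^k
    h = {v: np.asarray(x, dtype=float) for v, x in features.items()}  # h^0_v = x_v
    for k in range(K):
        h_new = {}
        for v in h:
            # AGGREGATE_k: mean of the neighbors' previous-layer vectors
            neigh = [h[u] for u in neighbors[v]] or [np.zeros_like(h[v])]
            h_neigh = np.mean(neigh, axis=0)
            # combine own and neighborhood vectors, apply the non-linearity
            z = weights[k] @ np.concatenate([h[v], h_neigh])
            h_new[v] = np.maximum(z, 0.0)
        # normalize every vector to unit L2 norm (line 7 of Algorithm 1)
        h = {v: x / (np.linalg.norm(x) + 1e-12) for v, x in h_new.items()}
    return h  # z_v = h^K_v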
The HinSage [
57] method is an enhancement of the inductive GraphSAGE model for heterogeneous graph representation learning. The method employs a neural aggregation mechanism to generate embeddings using sampling and aggregation steps. For each node, type-specific neighbors are sampled, and separate aggregations are performed for each type. The aggregated embeddings are then combined to update the node's representation. Unlike Node2Vec and Hin2Vec, HinSage does not require manually defined meta-paths, as it learns the relationships directly from the data. Additionally, HinSage supports inductive learning, making it suitable for dynamic or partially observed graphs where new nodes or edges are introduced. We remark that most embedding frameworks found in the literature use a transductive approach; thus, they can only generate embeddings for a single fixed graph. As experience shows, transductive approaches do not generalize efficiently to unseen nodes and, consequently, cannot learn to generalize across different graphs.
2.4. Limitations of Current Process Discovery Methods
Considering the traditional, automaton-based process discovery methods, all the leading methods are based on some form of the directly-follows graph (DFG), which is a directed graph that represents the sequence of activities. Process mining practitioners tend to use such simplified DFGs actively. In order to cope with complexity, DFGs are usually simplified in many ways [58]. One standard method is the removal of infrequent nodes and edges. Another approach is to assume that each event node type is identified uniquely and, thus, every node type has a unique node in the DFG and in the generated schema graph. In some real application fields, especially with an imperfect information background [59], this uniqueness assumption cannot be guaranteed. In these cases, the resulting schemas are incorrect and misleading. Another source of confusion is the multiple occurrence of a given event type inside the traces. The same event type may have different prefix and postfix sequences at different positions. Mapping the different subtypes to the same event type causes over-generalization; in this case, the schema generates hallucinations and invalid traces. Besides these simplifying assumptions, another key issue is the complexity limitation of the standard methods. Due to the inherent algorithmic nature of the automaton approaches, the detection of complex, deeply nested structures is a great challenge, and the current methods are not able to manage these complex structures within an acceptable time.
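As a small illustration of the uniqueness assumption discussed above, the following sketch builds a plain directly-follows graph from an event log; because every event label is mapped to exactly one node, two occurrences of the same event type in different contexts inevitably collapse into a single DFG node (the function name and data layout are chosen only for this example).

from collections import Counter

def build_dfg(traces):
    # traces: list of traces, each a list of event labels, e.g. ['a', 'b', 'c']
    # returns a Counter mapping (label_i, label_j) -> directly-follows frequency
    dfg = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            dfg[(src, dst)] += 1
    return dfg

# The two 'b' events play different roles in the process, yet they share one DFG node.
print(build_dfg([['a', 'b', 'c', 'b', 'd'], ['a', 'b', 'd']]))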
Based on these shortcomings, there has been intensive research in recent years on improving the efficiency of process discovery methods, focusing mainly on the application of machine learning techniques. Although there are many promising results, the current technology is far from perfect. Summarizing the current achievements, we can highlight the following approaches:
Current industry-level solutions (like the alpha miner, inductive miner, or heuristic miner) are universal tools, but they cannot cope with event type multi-occurrences and deeper nesting.
Application of recurrent neural networks (LSTM) for process discovery [
60]. The model is strong at finding XOR patterns, but weaker at discovering loops or event type multi-occurrences.
There are only a few methods for process mining using graph neural networks technology. The most recent approach [
61] applies the GNN for an extension of the DFG approach; the proposed method is not suitable for managing the problem of event type multi-occurrences. The main reason for the lack of more GNN-based solutions is that the GNN architecture mainly focuses on the manipulation of edges and nodes inside a fixed input graph. The generation of a new graph from very different graph pieces requires a novel GNN architecture.
The main goal of our investigation is to address these shortcomings and to propose a new process discovery approach which integrates both machine learning components and traditional graph methods. The proposed method applies convolutional neural networks (CNNs) for fuzzy loop detection and similarity calculation. As the GNN is the standard tool for link prediction in graphs, we apply the GNN approach to find the equivalent event instances of a given event type and then employ standard quasi-clique detection methods to determine the event nodes of the resulting schema graph.
3. Proposed Schema Induction Method
The set of event types is given as a finite alphabet of activity symbols. The workflow model is defined by a directed graph containing action nodes and control nodes, where each action node has one input and one output edge; a fork node has one input edge and several output edges; a merge node has several input edges and one output edge; the start node has no input edge and one output edge; and the terminal node has one input edge and no output edge.
Each node in the model is uniquely identified by its index value. We assign an event type to each action node; in our model, different nodes of the workflow model may have the same event type. Two activity nodes are in adjacency relation if they are connected by a direct edge or by a path in the model in which every intermediate node is a control node.
A trace related to model M is defined as a sequence of elementary events, where each event corresponds to an action node of the model. The corresponding model node for a trace event e is denoted by its m-identifier. The input dataset, an event log L, is given by a set of traces related to the same model, and the set of events on L is the union of the events occurring in its traces. We define an equivalence relation Λ on this event set, where two events are equivalent if and only if they correspond to the same node of the workflow model.
Regarding the investigated task, we can see that our process discovery problem domain differs from the standard graph mining methods in many aspects:
The input graph and the output graph are disjoint;
The resulting graph is a new graph generated during the prediction process;
The input graph is a set of traces, where each trace is a sequence of nodes;
Each node in the input graph has a one-dimensional neighborhood.
Thus, our task is not a standard node or edge classification task; rather, it is the clustering of the input nodes where each cluster corresponds to a node in the resulting schema graph. The outline of the proposed method consists of the following steps:
Develop a link prediction method for mining the equivalence pairs on the nodes of the incoming traces;
Determine the best fitting equivalence classes on the event set;
Link prediction between the equivalence classes;
Finalization of the workflow graph model.
The equivalence relation plays a crucial role in the schema induction, as each node in the workflow schema model corresponds to an equivalence class, and the links in the schema are generated from the links between the equivalence classes.
3.1. Loop Detection
Loops cause special difficulties in process discovery methods; the detection of loops is a hard task. Although finding exact repetitions can be solved with direct sequence matching algorithms, finding nested loops or fuzzy (not exactly repeating) loops is a complex problem. To enhance the efficiency of schema discovery, we also apply a loop detection mechanism as a pre-processing step, which identifies subsequences that are likely to correspond to loops. This information is used as a component of the calculated context description matrix for equivalence mining.
In the loop detection module, we apply a loop prediction mechanism using a neural network approach. Given a sequence of length n, the input context is a similarity matrix of size n × n, which indicates which positions contain the same value. A matrix element is equal to 1 if the corresponding elements are the same; otherwise, the matrix element is 0.
Figure 1 shows examples for the construction of the similarity matrix. The applied convolutional network takes this similarity matrix as input, and the output is the prediction of the loop sections within the input sequence. With this method, we convert the loop detection problem into an image classification problem, namely, the similarity matrix can be considered as a two-dimensional image. In the case of loop segments, this image contains some characteristic lines parallel to the main diagonal. The applied convolutional network can detect these characteristic lines in the image.
We remark that, as this approach requires an input matrix of quadratic size, it can be applied to moderate-sized sequences. In our problem domain, this condition is met; the maximal trace length in practice was always moderate.
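A minimal sketch of the similarity-matrix construction used as the CNN input is given below; repetitions of a subsequence appear as lines parallel to the main diagonal of this 0/1 matrix, which is exactly the pattern the convolutional network is trained to detect (the function name is illustrative).

import numpy as np

def self_similarity_matrix(trace):
    # binary self-similarity matrix: 1 where the event types at two positions match
    n = len(trace)
    sim = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            sim[i, j] = 1.0 if trace[i] == trace[j] else 0.0
    return sim

# The repeated segment ('b', 'c') yields off-diagonal lines parallel to the main diagonal.
print(self_similarity_matrix(['a', 'b', 'c', 'b', 'c', 'd']))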
3.2. Equivalence Prediction
If two nodes in the input traces are Λ-equivalent, then they share the same neighborhood context in the workflow schema model. Thus, they will share similar neighborhood contexts at the trace level, too. This means that in the trace-level neighborhood, the values and positions of the event types should be similar. It is important to preserve not only the value distribution but also the position parameters. For example, consider the following two traces:
- trace1:
t0, t1, t2, t3, t4
- trace2:
t0, t4, t3, t2, t1
The right-side contexts of event type t0 share the same set of event types, but the positions are very different. To preserve both the value and position information in a more efficient way, we propose to use a similarity matrix to describe context similarity.
Definition 1 (Multi-perspective embedding).
Having M different aspects, the multi-perspective embedding feature description of a node n is represented by a set of embedding vectors instead of a single vector, E(n) = {e_1(n), e_2(n), …, e_M(n)}, where e_i(n) is the embedding vector of node n related to aspect i.
Example 1. In a graph, for the context description of a node, we can generate distinct description vectors for the different distance positions. Thus, e_i(n) denotes the subcontext description related to nodes with a distance value i from the current node n. The perspective here is based on the distance value in the graph, and each aspect has a corresponding subcontext. In general, the different aspects may also have overlapping subcontexts. Having the multi-perspective embedding, we apply a novel similarity operator which generates a similarity matrix instead of a scalar similarity value.
Definition 2 (Context similarity matrix).
The context similarity matrix S for two contexts is an M × M matrix with S_{i,j} = sim(e_i(n_1), e_j(n_2)), where we assume that all aspects are comparable and a similarity value can be calculated. In the binary case, S_{i,j} takes the value 1 for matching elements and 0 otherwise. In our problem domain, having only directed sequence contexts, we apply the distance-based aspect definition. In this formula, the aggregation collects all element pairs from the graph neighborhood having the distances i and j. The aggregation function used for feature calculation can be any of the usual operators, such as the average or max operators. In the case of directed graphs, we propose two S matrices for a given node, one for the context of incoming edges and one for the outgoing edges.
Example 2. Let us take the following three traces:
- trace1 context:
t1, t2, t3, t4
- trace2 context:
t4, t3, t2, t1
- trace3 context:
t1, t2, t5, t4
The corresponding similarity matrices indicate, position by position, where the contexts contain matching event types (see the sketch below for their explicit computation). Thus, we use matrices for the context similarity instead of the usual cosine similarity. In our current problem domain, as the context is always a simple directed sequence, we apply the matrices as similarity representations of two nodes.
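The binary context similarity matrices of Example 2 can be reproduced with the short sketch below, which compares two directed sequence contexts position by position (the distance-based aspects); pairing trace1 with trace2 and with trace3 is an assumption made for the illustration, and the helper name is not part of the proposed framework.

import numpy as np

def context_similarity_matrix(ctx_a, ctx_b):
    # binary similarity matrix S with S[i, j] = 1 iff ctx_a[i] == ctx_b[j]
    return np.array([[1.0 if a == b else 0.0 for b in ctx_b] for a in ctx_a],
                    dtype=np.float32)

trace1 = ['t1', 't2', 't3', 't4']
trace2 = ['t4', 't3', 't2', 't1']
trace3 = ['t1', 't2', 't5', 't4']

# trace1 vs. trace2: the matches form the anti-diagonal (same values, reversed order)
print(context_similarity_matrix(trace1, trace2))
# trace1 vs. trace3: the matches lie on the main diagonal, except the third position (t3 vs. t5)
print(context_similarity_matrix(trace1, trace3))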
For the prediction of the equivalence of two nodes, we utilize a neural network architecture. As the input is a square matrix, we employ a convolutional neural network (CNN). The main motivation for using the similarity matrix instead of the usual similarity vector was the fact that we must discover fuzzy similarity among the different subsections. In this case, any two different subsections may relate to each other; thus, a rigid linear structure is not flexible enough. Treating the similarity matrix as a special image matrix, we can use a CNN to discover the similarity relationships among the different image sections.
The general structure of the proposed equivalence prediction framework is presented in
Figure 2. The inputs are the context descriptions of two events from the same or different traces. We apply an embedding module to generate the context feature vectors. In the next step, we generate a similarity matrix that provides a multi-aspect context description. This matrix is then processed by a CNN module to predict the similarity level.
For the training of the CNN module, the input dataset was generated with the following procedure:
Generation of synthetic workflow models of different complexity;
Generating random traces, where each node has a label (i.e., event type, visible attribute) and a parent ID (i.e., the identification of the corresponding node in the schema, hidden attribute);
Clustering the nodes in the trace pool based on the label;
Generating node pairs where both members belong to the same cluster (same visible parameter);
Generation of a balanced training dataset from the pairs (binary category: matching parent ID or not);
In the prediction phase, the trained CNN will be used to find those node pairs in the trace pool, which are in equivalence. As the prediction is not perfect, we employ a post-processing phase, where we build the equivalence classes for the nodes using additional consistency optimization.
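A hedged sketch of the balanced pair-generation step described above is shown next; it assumes that each synthetic trace node carries a visible label (event type) and a hidden parent ID, and all names and the sampling strategy are illustrative only.

import random
from collections import defaultdict

def make_training_pairs(nodes, pairs_per_label=1000, seed=0):
    # nodes: list of dicts such as {'label': 'a', 'parent': 3, 'context': ...}
    # positive pairs share label and parent ID; negative pairs share only the label
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for n in nodes:
        by_label[n['label']].append(n)
    positives, negatives = [], []
    for group in by_label.values():
        if len(group) < 2:
            continue
        for _ in range(pairs_per_label):
            a, b = rng.sample(group, 2)
            (positives if a['parent'] == b['parent'] else negatives).append((a, b))
    k = min(len(positives), len(negatives))  # keep the two classes balanced
    return positives[:k] + negatives[:k]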
3.3. Building Process Schema Graph
Having an imperfect equivalence prediction among the nodes, we can construct a graph where the node set is equal to the nodes in the trace pool and two nodes are connected if and only if the link is predicted by the CNN module. As the prediction is not perfect, the initial graph is not well separable into disjoint components. In order to determine the partitioning with the best separation, we utilize an optimization process.
To find the best separated clusters, we apply a quasi-clique detection algorithm. A quasi-clique in an undirected graph is defined as a subset of vertices whose edge density reaches at least a prescribed threshold. The maximum quasi-clique problem consists in finding quasi-cliques of the largest cardinality, and it is a known NP-hard problem [62]. The task of finding the partition with the best separation is equivalent to finding the maximal quasi-cliques.
In the literature, there are three method families for finding the maximal quasi-cliques. The first group covers the methods providing exact solutions; the linear programming and dynamic programming methods belong to this family. The main drawback of these solutions is that they cannot solve large-scale problems; their execution complexity is too large. For practical applications, general heuristics or customized heuristics are the preferred approaches. One of the candidate methods for general heuristics is the greedy algorithm [
63].
Usually, the greedy algorithm is extended with some special pre-processing or post-processing modules [
64]. In the literature, there are only a few proposals on customized heuristics; we can highlight the work of Guo et al. [
65] as the most recent novel approach. Their work is designed for large-scale quasi-clique mining problems; it uses a divide-and-conquer approach implemented in a parallel architecture.
In our test framework, we developed a heuristic method that works in a greedy way to extend the current clique. The main steps of the proposed algorithm can be summarized in the following list:
Calculating the degrees of the graph nodes (deg(n));
Ordering the nodes in descending degree values (list D); initially, each node in D is free (not assigned to a clique);
Taking the first free node in D (n0), open a new clique C (n0); the status of n0 is set to “processed”;
Processing the free nodes in D (node n) in a loop;
Selecting the candidate node ns with maximal density with respect to the current clique;
If dens(ns) > alpha threshold, then
ns is assigned to C(n0), and the status of ns is set to “processed”;
If there are free nodes in D, then go to Step 4; otherwise, terminate the process;
Close the current clique;
If there are free nodes in D, then, go to Step 3; otherwise, terminate the process.
The result of the presented algorithm is the set of quasi-cliques. The resulting quasi-cliques are treated as nodes in the final schema graph.
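The greedy heuristic above can be sketched as follows; here dens(n) is taken as the fraction of current clique members adjacent to the candidate node, and alpha is the acceptance threshold (this density definition and the names are assumptions made for the illustration).

def greedy_quasi_cliques(adj, alpha=0.6):
    # adj: dict node -> set of neighbor nodes (the predicted equivalence links)
    # returns a list of node sets, i.e., the quasi-cliques / schema-node candidates
    free = set(adj)
    order = sorted(adj, key=lambda n: len(adj[n]), reverse=True)  # Steps 1-2
    cliques = []
    for seed in order:
        if seed not in free:
            continue
        clique = {seed}            # Step 3: open a new clique with the seed node
        free.discard(seed)
        while True:
            best, best_dens = None, 0.0
            for n in free:         # Steps 4-5: candidate with maximal density
                dens = len(adj[n] & clique) / len(clique)
                if dens > best_dens:
                    best, best_dens = n, dens
            if best is None or best_dens <= alpha:
                break              # Step 9: close the current clique
            clique.add(best)       # Steps 6-7: extend the clique
            free.discard(best)
        cliques.append(clique)
    return cliques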
For the generation of the edges in the workflow schema model, we calculate the connection strength (st) for every directed cluster pair. To accept a link e in the schema graph, its connection strength must exceed a given acceptance threshold.
In the post-processing stage, we perform a finalization of the schema graph, where we adjust the links so that the graph satisfies the integrity constraints of a valid workflow model.
The resulting graph will be presented as the induced workflow model schema graph.
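One possible reading of the connection-strength criterion is sketched below: the strength of a directed cluster pair is estimated from how often members of the first quasi-clique are directly followed by members of the second one in the traces, and an edge is kept when its relative strength exceeds a threshold. The exact formula of the paper is not reproduced here, so both the normalization and the names are assumptions.

from collections import Counter

def schema_edges(traces, node_cluster, threshold=0.2):
    # traces: list of traces given as lists of trace-node identifiers
    # node_cluster: dict mapping each trace node to its quasi-clique index
    strength = Counter()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            strength[(node_cluster[src], node_cluster[dst])] += 1
    total_out = Counter()
    for (c_src, _), cnt in strength.items():
        total_out[c_src] += cnt
    # keep a directed edge if its relative outgoing strength reaches the threshold
    return {edge: cnt / total_out[edge[0]]
            for edge, cnt in strength.items()
            if cnt / total_out[edge[0]] >= threshold}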
3.4. Complexity Analysis
For the execution complexity analysis of the proposed method, we decompose the algorithm into the following key modules:
Pre-processing for eliminating trace duplications;
Pre-processing for fuzzy loop detection;
Calculating the similarity matrices;
Quasi-clique detection;
Schema generation;
In our model, the cost complexity value depends on the following parameters:
- -
N: number of traces in the train dataset.
- -
L: average length of the traces.
- -
E: number of distinct event types.
- -
W: window size for the graph context description.
In Step 1, we convert each trace into a string and use a hash structure or set to eliminate the duplications. The cost estimation is proportional to N · L, scaled by the bucket density factor of the hash table.
The cost of fuzzy loop detection is proportional to the number of traces, scaled by the cost factor of the CNN1 module.
Step 3 requires more resources, as we must test all the possible event pairs belonging to the same event type. In this case, the number of pairs to be tested grows quadratically with the number of events per event type, and the total cost of calculating the similarity values for all pairs is this pair count multiplied by the cost factor of the CNN2 module. Having the similarity graph, the next step is to detect the quasi-cliques inside the graph. We have a separate graph for each event type, whose node size is the number of events of that type, and the general cost of clique detection grows polynomially with this node size. In the last step, we construct a graph from the cliques; as the number of cliques is less than the number of events, the cost of this step is dominated by the previous ones. Thus, taking only the dominant components, the final cost approximation is determined by the pairwise similarity calculation (Step 3) and the quasi-clique detection (Step 4).
This cost analysis shows that there are two critical components in the proposed methodology. The first is calculating the similarity values for each related event pair, and the second is mining the quasi-cliques. In this version, we used the baseline implementation of both modules. The further optimization of these units will be the next step of the project.
4. Test Experiments
For the tests, we generated synthetic event logs in order to cover a wide range of complexity scales. For the test data, we generated schemas of hierarchical structure, as industrial process models are also typically restricted to this kind of schema family. Many of the existing schema mining models apply an event tree structure for the schema representation. We introduced a well-nested linear schema description language with the following components:
[S1, S2, …] : sequence
‘X’, [S1, S2, …] : XOR branching
‘L’, [S1] : LOOP with kernel S1
‘L’, [S1, S2] : LOOP with a XOR kernel
‘%’ : start node
‘#’ : end node
‘t’ : activity event of type t
For example, the description [‘%’, ‘a’, ‘X’, [[‘a’], [‘b’]], ‘d’, ‘e’, ‘#’] denotes a schema containing sequences and a simple XOR branching, where one of branches contains the event ‘a’, while the other contains event ‘b’.
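To illustrate the linear schema description language, the sketch below generates one random trace from such a nested-list description; the loop repetition count and the treatment of the XOR kernel are assumptions made for the example, not the exact generator used in our test framework.

import random

def generate_trace(schema, rng=None, max_loop=3):
    # generate one random trace from the nested-list schema description
    rng = rng or random.Random(0)
    trace = []
    i = 0
    while i < len(schema):
        item = schema[i]
        if item == 'X':                        # XOR: pick exactly one branch
            trace += generate_trace(rng.choice(schema[i + 1]), rng, max_loop)
            i += 2
        elif item == 'L':                      # LOOP: repeat the kernel several times
            for _ in range(rng.randint(1, max_loop)):
                trace += generate_trace(rng.choice(schema[i + 1]), rng, max_loop)
            i += 2
        elif item in ('%', '#'):               # start / end markers carry no event
            i += 1
        else:                                  # plain activity event of type item
            trace.append(item)
            i += 1
    return trace

print(generate_trace(['%', 'a', 'X', [['a'], ['b']], 'd', 'e', '#']))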
To show the schema complexity, we introduced a complexity measure. The measure depends on the maximal depth of the nesting and is denoted by a symbol pair, where n denotes the depth of the XOR components and m denotes the depth of the LOOP components in the maximal nesting element.
Table 1 shows some examples for the complexity measure.
The test environment was implemented in the Python language using the TensorFlow/Keras 2.10 framework.
4.1. Tests on Prediction of Equivalence
For the prediction of the Λ equivalence, we built an architecture containing four main modules. Three modules process the context features related to the event types: one module for the input context, one for the output context, and one that combines both directions. We have four input contexts; three of them are assigned to the mentioned three modules. The fourth input contains a context generated by the loop detection module; it indicates the likelihood of loop existence. The main neural network block processes the concatenation of the three outputs plus the fourth context as input for the link prediction. The code outline is presented in Listing 1.
Listing 1. Architecture of the neural network for equivalence prediction. |
input_shape1 = (2*wsize, 2*wsize, 1)
input_shape2 = (wsize, wsize, 1)
input_shape3 = (2, )

I1 = Input(shape=input_shape1, name="I1")
L11 = Conv2D(32, (2, 2), activation='relu', padding='same', name="L11")(I1)
L12 = MaxPooling2D((2, 2), name="L12")(L11)
L13 = Conv2D(64, (3, 3), activation='relu', name="L13")(L12)
L14 = MaxPooling2D((2, 2), name="L14")(L13)
L15 = Flatten(name="L15")(L14)
L16 = Dense(128, activation='relu', name="L16")(L15)
L17 = Dropout(0.5, name="L17")(L16)
O1 = Dense(2, activation='softmax', name="O1")(L17)

I2 = Input(shape=input_shape2, name="I2")
L21 = Conv2D(32, (2, 2), activation='relu', padding='same', name="L21")(I2)
L22 = MaxPooling2D((2, 2), name="L22")(L21)
L23 = Conv2D(64, (3, 3), activation='relu', name="L23")(L22)
L25 = Flatten(name="L25")(L23)
L26 = Dense(128, activation='relu', name="L26")(L25)
L27 = Dropout(0.5, name="L27")(L26)
O2 = Dense(2, activation='softmax', name="O2")(L27)

I3 = Input(shape=input_shape2, name="I3")
L31 = Conv2D(32, (2, 2), activation='relu', padding='same', name="L31")(I3)
L32 = MaxPooling2D((2, 2), name="L32")(L31)
L33 = Conv2D(64, (3, 3), activation='relu', name="L33")(L32)
L35 = Flatten(name="L35")(L33)
L36 = Dense(128, activation='relu', name="L36")(L35)
L37 = Dropout(0.5, name="L37")(L36)
O3 = Dense(2, activation='softmax', name="O3")(L37)

I4 = Input(shape=input_shape3, name="I4")

X1 = Concatenate(name="X1")([O1, O2, O3, I4])
X2 = Dense(16, activation='relu', name="X2")(X1)
O = Dense(2, activation='sigmoid', name="O")(X2)

model1 = Model(inputs=I1, outputs=O1)
model2 = Model(inputs=I2, outputs=O2)
model3 = Model(inputs=I3, outputs=O3)
model = Model(inputs=[I1, I2, I3, I4], outputs=O)
In the training phase, we first trained the model1, model2, and model3 components separately. In the second phase, we froze these models and trained only the main model.
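The two-phase training described above can be expressed in Keras with a few lines; this is a minimal sketch that reuses the model objects of Listing 1, while the optimizer, losses, and the commented fit calls (with assumed data variables) are illustrative only.

# Phase 1: train the three context submodels separately.
for sub in (model1, model2, model3):
    sub.compile(optimizer='adam', loss='categorical_crossentropy')
    # sub.fit(x_sub, y_sub, epochs=30, batch_size=64)   # per-submodel data assumed

# Phase 2: freeze the submodels and train only the main model.
for sub in (model1, model2, model3):
    for layer in sub.layers:
        layer.trainable = False

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit([x1, x2, x3, x4], y, epochs=30, validation_split=0.2)  # assumed data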
In the first tests, we investigated the accuracy of the proposed embedding and similarity methods. We involved two widely used benchmark embedding methods used in heterogeneous graphs containing labeled nodes. These two methods are the Hin2Vec and GraphSAGE methods. We applied the implementation of Hin2Vec found at the Github repository
https://github.com/csiesheep/hin2vec/tree/master (accessed on 10 December 2024). Regarding GraphSAGE, we used the implementation developed in the Stellargraph project, the code is available at
https://pypi.org/project/stellargraph-mvisani/ (accessed on 12 December 2024).
In the efficiency tests, we used a wide range of parameters, including the following:
- -
Epoch number (10, 20, 30, or 40)
- -
Window size of the context (4, 6, or 8)
- -
Train/test split ratio (10%, 20%, or 30%)
Based on the test results with the proposed module, the following setting provided the best results: (epoch=30, window size: 6, train/test split ratio: 20%). The test results are summarized in
Table 2. The table shows the measured accuracy values; the first number is the validation accuracy, while the second is the test accuracy. The values are aggregated values of five runs. These results show the significant dominance of the proposed method for equivalence link prediction. The dominance of the proposed method was visible in all parameter settings; thus, we applied this method in the main process discovery.
In the tests, we first employed training data of homogeneous complexity; all traces are related to schemas of the same complexity level. As the test results given in
Table 2 show, the proposed method significantly dominates the other benchmark methods. In these tests, we processed training sets of moderate size (10,000 items). In the second phase of the test experiments, we generated heterogeneous training sets containing traces from schemas of different complexity levels. Here, the size of the training sets ranges from 100,000 items (dataset ALL_2) to 200,000 items (dataset ALL_3b). We can see that the proposed method also provides a significantly better classification accuracy here.
4.2. Tests on Schema Induction
In this subsection, we present the comparison test results for the proposed schema induction method. In the tests, we also involved the industry-standard PM4PY methods, namely the alpha miner, the heuristic miner, and the inductive miner. The implementation of all these methods is available at the Github repository
https://github.com/process-intelligence-solutions/pm4py (accessed on 28 November 2024). In the tests, we applied the simple invocation of these methods, as presented in Listing 2.
Listing 2. Testing the PM4PY methods. |
(event_log, dataf) = conv_to_df(traces_1)

pnet, initial_marking, final_marking = alpha_miner.apply(event_log)
pm4py.view_petri_net(pnet, initial_marking, final_marking, format='png')

tree = pm4py.discover_process_tree_inductive(event_log)
pnet, initial_marking, final_marking = pm4py.convert_to_petri_net(tree)
pm4py.view_petri_net(pnet, initial_marking, final_marking, format='png')

heu_net = pm4py.discover_heuristics_net(event_log, dependency_threshold=0.0)
pm4py.view_heuristics_net(heu_net)
In addition to the PM4PY methods, the efficiency comparison tests also cover the GNN-based process mining method proposed in [
61]. Unlike the PM4PY methods, this method applies machine learning engines for enhanced efficiency in schema induction.
Regarding the test datasets, we involved both synthetic and real-world benchmark datasets. For the generation of synthetic datasets, we developed a log data generation framework in which we can build trace datasets of different complexity levels. The complexity level is measured by the nesting level of the control structures and by the size of the schema graph.
The input for the schema induction task is a list of traces, which all belong to the same hidden schema. For the generation of the input trace list, first, a target schema was constructed, which was used for the generation of the traces. The events in the traces were identified by the event type. As the schema may contain the same event type at different nodes, the trace may also contain the repetition of the event types.
In the resulting graph of our proposed method, the nodes are labeled as t_i, where t is the event type and i denotes the index of the corresponding quasi-clique. The event type is input data, while the clique indexes are generated during the induction process.
The quality of schema generation is evaluated with the process mining measures proposed in [
66]. In the efficiency comparison tests, we applied the following measures.
- -
Simplicity (mS): This metric is calculated as the average number of incoming and outgoing arcs per node. A higher mS value means larger complexity, and it may indicate over-generalization.
- -
Structural similarity (mT): This metric shows the relative difference of the process matrices that contain the directly-follows relations between activities of the models to be compared. This metric returns a value between 0 and 1, where a high value indicates that the mined and the original models are similar.
- -
Capacity (mC): The number of different traces that can be generated from the schema. In the tests of complex schemas, we applied a random sampling approach to estimate this parameter.
- -
Precision (mP): This metric checks whether the traces enabled by the model correspond to the observed traces in the log; the higher the difference, the less precise the mined model is.
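As an example of these metrics, the simplicity measure mS can be computed directly from the node and edge lists of a schema graph, as in the sketch below; the structural-similarity and precision formulas are not reproduced here.

def simplicity(nodes, edges):
    # mS: average number of incoming plus outgoing arcs per node
    degree = {n: 0 for n in nodes}
    for src, dst in edges:
        degree[src] += 1   # outgoing arc of src
        degree[dst] += 1   # incoming arc of dst
    return sum(degree.values()) / len(nodes)

# small example schema: a -> b, a -> c, b -> d, c -> d
print(simplicity(['a', 'b', 'c', 'd'],
                 [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]))  # 2.0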
4.3. Tests on Synthetic Data
In the following, we first present some test examples of the different complexity levels. In the examples, we show the ground truth schema and the resulting schemas generated by the involved methods.
For this simple case, all the PM4PY methods (alpha miner, heuristic miner, and inductive miner) generated the same correct schema which is shown in
Figure 3.
The output of the proposed method is represented in
Figure 4.
Alpha miner:
Figure 5. Result schema by alpha miner for T2.
Inductive miner:
Figure 6. Result schema by inductive miner for T2.
Heuristic miner:
Figure 7. Result schema by heuristic miner for T2.
Proposed method:
Figure 8. Result schema by the proposed method for T2.
Evaluation: Only the proposed method was able to distinguish the two events with the same event type (b). The results of the alpha miner (Figure 5), the inductive miner (Figure 6), and the heuristic miner (Figure 7) contain invalid edges in the resulting schema graph. The related measures for our proposed method (Figure 8) are as follows:
The heuristic miner yields the following measurement values:
Alpha miner:
Figure 9. Result schema by alpha miner for T3.
Inductive miner:
Figure 10. Result schema by inductive miner for T3.
Heuristic miner:
Figure 11. Result schema by heuristic miner for T3.
Proposed method:
Figure 12. Result schema by the proposed method for T3.
Evaluation: All the benchmark methods failed to detect the different occurrence contexts of the same event type (see
Figure 9,
Figure 10 and
Figure 11). Only the proposed method (
Figure 12) induced the expected schema. The related measures for our proposed method are as follows:
The heuristic miner yields the following measurement values:
Alpha miner:
Figure 13. Result schema by alpha miner for T4.
Heuristic miner:
Figure 14. Result schema by heuristic miner for T4.
Inductive miner:
Figure 15. Result schema by inductive miner for T4.
Proposed method:
Figure 16. Result schema by the proposed method for T4.
Evaluation: Both the alpha miner and the heuristic miner failed to recognize the loop (
Figure 13 and
Figure 14). Only the inductive miner (
Figure 15) and the proposed method (
Figure 16) generated the correct results. The related measures for our proposed method are as follows:
The heuristic miner yields the following measurement values:
Alpha miner:
Figure 17. Result schema by alpha miner for T5.
Inductive miner:
Figure 18. Result schema by inductive miner for T5.
Heuristic miner:
Figure 19. Result schema by heuristic miner for T5.
Proposed method:
Figure 20. Result schema by the proposed method for T5.
Evaluation: As the resulting schema graphs show, the alpha miner (Figure 17) and the inductive miner (Figure 18) produced very imprecise schemas, while the heuristic miner yields invalid loop positions (Figure 19). The heuristic miner yields the following measurement values:
Only the proposed method (
Figure 20) could provide the perfect schema.
4.4. Tests on Real Data
In this section, we use the same benchmark datasets as the paper on GNN-based process discovery [66]. The source of the real datasets is the BPI Challenge. BPI Challenge datasets (
https://www.tf-pm.org/competitions-awards/bpi-challenge, accessed on 3 January 2025) have become important benchmarks in the community. For the presentation, we selected the following three datasets:
- -
BPI_2012_O
- -
BPI_2017_O
- -
BPI_2020_Permit_log
The size parameters of the data sets are summarized in
Table 3.
The key problem in the resulting schema of the inductive miner (
Figure 21) is that it is very over-generalized; for example, it allows a loop f → f or a → a, which is not part of the incoming traces. As the generated schema shows, the GNN miner (Figure 22) performed a very high level of reduction; some frequent traces are not covered by the schema. For example, there are traces where the last step is O_CANCELED (code: f). Our proposed method (Figure 23) generates a more compact schema; it distinguishes four subtypes of the activity d (d_3, d_6, d_7, and d_8) because these occurrences have very different neighborhood contexts.
The result of the heuristic method (
Figure 24) contains some incorrect structural sections, and the schema generated by the GNN method (Figure 25) is oversimplified. For example, neither of these methods discovers the repetition of the sequence CREATE_OFFER (code: a) → CREATED (code: b). On the other hand, the proposed method discovers that the event CANCELED (code: f) can be repeated only in specific contexts; thus, it generates more subtypes of this event. According to the test results, the proposed method (
Figure 26) provided the best matching with the input trace database.
The standard methods generated over-generalized schemas. The heuristic miner produced a complex and incorrect schema (
Figure 27). As this experiment shows, the GNN-based process discovery method (
Figure 28) cannot manage the multiple occurrences of the same event type in the expected way. Only the proposed method (
Figure 29) could distinguish the different occurrence contexts of the event types. Although the number of nodes is larger in the output schema than in the baseline schemas, the node simplicity level (number of related edges) is lower.
In
Table 4, we summarize the test results presenting the key efficiency measures of the investigated methods.
5. Discussion
We have tested our proposed method in two phases. In the first phase, we investigated the equivalence miner, which predicts which nodes from the different traces belong to the same node in the source schema graph. In the tests, we also analyzed two benchmark methods for the labeled graphs: Hin2Vec and GraphSAGE. As the test results show, our proposal could achieve significantly better accuracy in all test cases.
Having a precise equivalence classifier, we can construct the schema graph using a straightforward method: we discover the maximal quasi-cliques in the equivalence graph, and the cliques are converted into the nodes of the resulting schema. The comparison tests performed show that the proposed method provides similar or better results in all the investigated cases. The main benefit of the proposed method is that it can also be used in cases where the schemas and the traces contain multi-occurrence event types. The standard process discovery methods cannot separate the different occurrences of the same type, which results in incorrect and inaccurate outcomes.
The results of the performed efficiency tests demonstrate that the proposed method can manage complex cases which are not covered by the standard process discovery methods, and it provides more compact and more precise schema graphs. As
Table 4 shows, the proposed method can distinguish the different occurrence contexts of the same event type. The main consequences of this feature are that the resulting schema is less over-generalized, the precision related to the training set is better, and the simplicity level is improved.
Regarding more complex schemas, we can summarize the gained experiences in the following points:
The quality of the output significantly depends on the accuracy of the equivalence mining.
Increasing the size of the training dataset improves the accuracy of the prediction. For example, considering Example T5, using only 100,000 items in the training set, the engine generated some incorrect edges, as shown in
Figure 30. With a training set of double the size, the neural network system was able to perform a perfect prediction.
The precise detection of the nested loops requires additional investigations, as with increased nesting depth, the uncertainty level also increases.
The tests also revealed areas where further research is needed in the next phase. During the tests, we experienced the following limitations of the proposed method:
- -
The quality of the training set is an important factor in the equivalence prediction. If the test dataset differs significantly from the training set, the accuracy level degrades.
- -
The discovery of deeply nested loops requires further work, as does a more detailed analysis of the optimal neural network architecture and training patterns for the equivalence prediction.
- -
The presented loop detection CNN module provides weaker results in the case of complex nested structures; it is worth investigating alternative approaches, too.
- -
The presented method may be extended with additional configuration parameters to implement a more flexible process discovery engine. The set of possible parameter extensions covers, among others, frequency filtering, the decision threshold in the equivalence prediction, and the clique detection density threshold. A more flexible process discovery mechanism would increase the usability of the proposed method.
6. Conclusions
Process discovery is an important research field in the development of efficient tools for process automation. Current technologies are dominantly based on automata and pattern-matching approaches. Current industrial standard workflow schema induction methods impose certain limitations on the system being examined. For instance, log traces cannot contain identical event identifiers at different positions within the schema. Moreover, in cases of complex, multi-level nesting, the generated model becomes increasingly inaccurate. To address the aforementioned shortcomings, this article proposes a novel solution that employs graph neural networks to perform schema discovery.
In the developed procedure, we introduce a multi-aspect embedding format for characterizing nodes, and we apply a new two-dimensional similarity measure. In the first phase of schema generation, we perform equivalence prediction, implemented as an edge prediction task. From the obtained equivalence network, we identify the target schema nodes, which correspond to the maximal quasi-cliques of this network. The developed method has been implemented in a Python environment. Based on the conducted tests, we can see that the proposed context descriptor provides significantly more accurate results for the examined tasks than the other widely used methods involved (i.e., GraphSage, Hin2Vec). To evaluate the efficiency of schema induction, our method was compared with industrial standard approaches such as the alpha miner, inductive miner, heuristic miner, and a recent GNN-based algorithm.
The results of the efficiency tests show that the proposed method can handle complex cases that standard process discovery methods cannot address. Moreover, it produces a more compact and precise schema graph. Based on the obtained efficiency results, the proposed method appears to be a strong candidate for solving real-world process mining challenges. Future work will focus on fine-tuning the internal modules of the schema induction engine.