1. Introduction
With the capabilities of modern computer hardware, companies can record every event of their operational activities in information systems, and these recorded events can be used for further analysis. One possible use of recorded event data is to create a data-driven simulation model; a simulation model derived directly from event data is very useful for users who are unfamiliar with simulation. Discrete-event simulation (DES) is one of the most popular simulation types [1]. In DES, an event is typically defined as a specific change in the state of the system at a particular time [1]. Another possible use of recorded event data is to analyze the discrepancy between the planned operational activity and the actual activity in the field. A suitable technique for this type of analysis is process mining.
Process mining is both data-driven and process-centric [
2]. Although process mining is generally used for automated process discovery, it can be used for other purposes such as checking compliance, diagnosing deviations (base and real), highlighting bottlenecks, improving performance, predicting flow times, and recommending action [
2]. The input for process mining is an event log [
2]. This event log may contain all the events related to operational activity. In process mining, each event generally refers to an activity that occurs at a particular time for specific cases [
3]. Evidently, the definition of events in process mining and DES is related. Moreover, event data can be generated as in process mining based on the rules defined in the simulation model. Hence, combining process mining and simulation such that the results of the two can complement each other is reasonable, as suggested by Aalst [
3].
To the best of our knowledge, the first practical approach that combined process mining and simulation was proposed in 2009 [
4]. The author used a process model based on a Petri net representation discovered from an event log using a process discovery algorithm [
4]. This Petri net comprises places, transitions, and arcs. The process discovery algorithm used in this study is an alpha algorithm used to mine the workflow process from the event log [
5]. The discovered process model represents a control-flow perspective, combined with other analysis techniques such as decision point analysis, organizational miner, and statistical analysis [
4]. Decision point analysis is used to mine decisions based on the attribute data that influence the choices made in the process based on past process executions [
6]. The decision is triggered when the next execution occurs at a place where more than one possible output (activities) can be obtained. An organizational miner is conducted to mine resource allocations in a specific activity. Several algorithms such as default mining, doing similar task, and agglomerative hierarchical clustering can be used to perform this resource allocation [
7]. This aspect is also important in simulation because the execution of an activity in simulation is generally constrained by the resources. Statistical analysis is used to analyze the performance based on the event log. This analysis can also be used to fit the distribution of several parameters for simulation such as the execution time of activity and the arrival time of cases [
4]. The results from the organizational miner, decision point analysis, and statistical analysis are then merged and transformed into a colored Petri net (CPN) representation. If a considerable amount of data in the event log required to determine the parameters of the simulation is missing, the missing data can be predicted using machine learning by first changing the representation of the event log into an image [
8]. In related work on mining event logs, another interesting study in process mining proposed a verification method for the message-passing behavior of IoT systems, extracting event relations from the log and checking for minor deviations in the system [
9]. Visual filtering approaches for event logs, which make process analysis tasks more feasible and tractable, have been developed and evaluated on the event log data from the Business Process Intelligence (BPI) Challenge 2017 [
10].
Although the idea of creating a CPN model based on process mining is promising, the first practical implementation still uses general/incorrect assumptions—for instance, all the resources are available if no task is being performed, and the processing time of the resources when executing tasks is always based on a specific underlying distribution in any condition [
4]. Several studies have attempted to mitigate this limitation. In particular, the resource behavior in process mining has been analyzed to explore the effect of workload on service times and subsequently incorporate it into the simulation model [
11]. This study assumed that a worker under time pressure may become more efficient and can consequently finish tasks faster [
11]. Another study calculated the batch process and resource availability for subsequent incorporation into the CPN model [
12]. The event logs were analyzed to calculate the percentage of resource availability in a specific period [
12]. Our study attempts to complement the previous studies on resource availability by incorporating shift times. The previous study did not set resource availability at specific times; i.e., it did not consider that resources, particularly humans, generally work in shifts. Although that simulation model incorporates resource availability as a percentage, a gap from reality remains because shift times are not considered. Without information on shift times, the generated simulation model still cannot imitate reality. We attempt to use an event log to analyze and cluster the shift times of resources so that this information can be used in the simulation model and improve its accuracy.
The main contributions of our research are as follows:
We propose a clustering method for mining the shift work patterns of resources, which can be incorporated into the simulation model to generate a more accurate model.
We propose a new distance function to handle the problems that can arise when clustering time data.
We propose a new representation of activity that can handle the resource shift work rule in CPN.
The remainder of this paper is structured as follows.
Section 2 discusses some terminology, provides an overview of the related work, and presents a running example.
Section 3 outlines the data-driven method to retrieve shift work information.
Section 4 explains how we incorporate this shift work information into the simulation model.
Section 5 describes the mechanism we use to generate the artificial event logs. The method of
Section 3 is empirically evaluated in
Section 6. The paper ends with a discussion in
Section 7 and a conclusion in
Section 8.
2. Preliminary
In this section, we present the preliminaries of our study. The definitions and overviews of process mining, event logs, process models, Petri nets, CPN, resource allocation, and resource shift work are described below.
2.1. Process Mining
Process mining has been used in the field of process management to support the analysis of business processes based on event logs. In process mining, specific data mining algorithms are applied to event log data to identify the trends, patterns, and details contained in event logs recorded by an information system. Process mining aims to improve process efficiency and the understanding of processes.
There are three classes of process mining techniques. This classification is based on whether there is a prior model and, if so, how the prior model is used during process mining [
2].
Discovery
The discovery technique is used in process mining when a prior model does not exist. A new model is constructed from the low-level events in the event log.
Conformance Checking
Conformance checking is used when there is a prior model. The existing model is compared with the process event log. Discrepancies between the log and model are analyzed.
Performance Mining
The process model based on a prior model or generated from a discovery technique is extended with performance information such as processing times, cycle times, waiting times, etc.
2.2. Event Logs
This section formally defines an event log, describing events, event properties, and traces. Let E be the event universe, i.e., the set of all possible event identifiers; A be the set of all possible activity names; R be the set of all possible resource names; and T be the time domain. An event has a number of properties, and we can define functions assigning properties to events:
act: E → A, assigning activities to events;
type: E → {schedule, assign, start, suspend, resume, abort activity, complete}, assigning event types to events;
time: E → T, assigning timestamps to events; and
res: E → R, assigning resources to events.
An event
e is described by some unique identifier and can have several properties. An event log is a set of events. Each event in the log is linked to a particular trace which represents a process instance and is globally unique, i.e., the same event cannot occur twice in a log [
11].
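For illustration, the event structure defined above can be sketched in Python as follows; the field names and values are hypothetical, not taken from any particular log:

```python
from datetime import datetime

# Illustrative events with the properties defined above: act, type, time,
# and res, plus a unique identifier and a reference to the case (trace).
events = [
    {"id": "e1", "case": "c1", "act": "Register", "type": "start",
     "time": datetime(2021, 3, 1, 8, 0), "res": "Alice"},
    {"id": "e2", "case": "c1", "act": "Register", "type": "complete",
     "time": datetime(2021, 3, 1, 8, 30), "res": "Alice"},
]

# A trace collects the events of one process instance (case).
trace_c1 = [e for e in events if e["case"] == "c1"]
```

Each event identifier appears only once, matching the requirement that the same event cannot occur twice in a log.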
2.3. Process Model
A process model is a graphical representation of a process. Models can be expressed in various notations and standards, such as Business Process Model and Notation (BPMN) 2.0, Petri nets, Event-driven Process Chains (EPC), or Unified Modeling Language (UML) activity diagrams. A process model can be created manually or discovered from event logs using one of several process discovery algorithms from the process mining field.
2.4. Petri Net
A (classical) Petri net is a directed bipartite graph with two types of nodes called places and transitions. The places and transitions are connected to each other with directed arcs. Arcs connecting two nodes of the same type to each other are forbidden. In a Petri net, places are graphically represented by circles, and transitions are represented by rectangles. The elements of places and transitions are referred to as nodes. A node is an input node of a second node if and only if an arc connects this node to the second node. The representation of a Petri Net is shown in
Figure 1 below.
2.5. Colored Petri Nets
Colored Petri nets (CPN) provide a modeling language that combines Petri nets with a functional programming language [
13]. Originally, CPN models were executed using Standard ML; currently, several tools execute CPN models using other programming languages. Petri nets model the basic behavior of a process, while the programming language handles the definition and manipulation of data values. In a CPN model, all tokens are distinguishable because each token is assigned a certain value, referred to as a color.
Figure 2 shows a fragment of a CPN model, which consists of two places and one transition. The circles in the model indicate the places in a Petri net, and the boxes indicate the transitions. Each place in a colored Petri net has a certain type. Each transition can have certain properties such as guard, action, and priority. Guard is represented as the conditional execution of the transition.
CPN is a discrete-event modeling language that combines the capabilities of Petri nets with those of a high-level programming language [
13]. CPN belongs to the class of high-level Petri nets owing to the inclusion of several aspects, such as time, color (token attributes), and hierarchical representation. CPN models are executable; hence, they can be used to simulate different scenarios of the system. CPN Tools is a computer tool for CPN modeling that can perform simulation-based performance analysis. Dealing with CPN from the process mining perspective (process models as Petri nets) is easy because the underlying representation is the same. Moreover, the hierarchical representation in CPN can hide the details of the model in one module and provide the detailed implementation in another. Therefore, a module in CPN can be created based on the Petri net generated by process discovery, with the detailed implementation of each transition placed in a separate module. An example CPN representation is shown in
Figure 2 below.
2.6. Resource Allocation
Resource allocation determines which specific employee or resource will execute a work item. From the process mining perspective, the execution of a work item can be related to an activity.
Resource allocation has been studied extensively. It involves investigating the allocation of work items to specific staff members while taking into account, for example, the relations between resources and emergency circumstances. In the process mining field, the organizational miner is one of the algorithms that takes this kind of approach, clustering resources into several organizational units and then assigning each organizational unit to specific activities.
2.7. Resource Shift Time Work of Operation
Humans usually work daily in shifts. Shift work can basically be divided into two categories: fixed and rotational. Fixed shift work follows the same pattern every day at fixed times, whereas rotational shift work changes the working time of a resource at regular intervals. It is important to distinguish between the two types of shift work because their characteristics are distinctly different. For example, fixed shift workers should be able to keep their daily rhythms as consistent as possible, so such resources follow this pattern consistently.
This information is very important to consider when we want to simulate the operation of a process. Performing a simulation without considering this information in the model can lead to less useful results because the model does not reflect reality. This is why we attempt to mine the shift work of human resources based on the event log.
3. Shift Time Clustering on Event Logs
To the best of our knowledge, clustering shift times based on event logs has not been studied thus far. Although regular event logs may not contain resource data, the event log in this study is required to include data about resources. Before clustering, we first need to preprocess the event log data. The preprocessing mechanism is based on the assumption that resources, i.e., humans, have a certain amount of rest time between consecutive working shifts. This assumption is based on data from the International Labour Organization, which released a report and standard guidelines for the working times of workers employed by a company [
14]. The guideline states not only the limitations of daily and weekly working times but also the rest period after work [
15]. It states that the rest period after work must follow the following rules [
15]:
After a working day, you are entitled to a rest period of at least 11 consecutive hours. However, this may be reduced to eight hours once every seven days if the nature of the work or the company circumstances make this necessary.
After a working week of five days, you are entitled to a rest period of at least 36 consecutive hours.
A longer working week is also possible, provided you have a rest period of at least 72 h once every 14 days. This 72 h period may be split into two separate periods, neither of which may be shorter than 32 h.
We need to find a rest period between the working times of each resource to cluster the shift time of an event log. In the event logs, data are grouped based on case ID. Here, we need to rearrange the event logs so that we can easily map each resource across multiple cases. Based on the rules defined in [
15], we preprocess the event log to merge and split the row of each resource in the event logs. First, we group the event log by each resource and sort the row of resources of each group by the start time of the event in ascending order. Subsequently, we calculate the gap between the completion time of the previous event and the starting time of the subsequent event. We split two events if the calculated gap is equal to or longer than eight hours. This split creates another subgroup consisting of rows of events. Subsequently, we take the starting time of the first event, the completion time of the last event, and the distance between the two for each subgroup consisting of rows of events. The data here are not represented by the shift time of the resource directly because the recorded event of each resource may not consist of a complete event from the start to the end of the shift time. Hence, we use clustering for this problem, not only to cluster the resources into a specific group of shift times but also to be more certain of the shift time range. The preprocessing mechanism is shown in Algorithm 1.
Algorithm 1: Event log preprocessing.
Input: event logs. Output: array of objects.
1. Read the event log data.
2. Group the rows of events by resource.
3. Order the rows in each group by the start time attribute, ascending.
4. Calculate the gap (between the completion time of the first event and the starting time of the second event) for every two consecutive events.
5. Split two consecutive events if the gap ≥ 8 h. Each split creates a new subgroup.
6. For each subgroup, take the starting time of the first event and the completion time of the last event, and calculate the duration between the two.
7. From the previous step, take only the time of day of the starting and completion times of each subgroup, without considering the date.
8. Make an object consisting of the starting time (without date information), the completion time (without date information), and the duration.
9. Flatten the objects into one array.
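The preprocessing steps of Algorithm 1 can be sketched in Python as follows; this is a minimal illustration, and the field names (res, start, complete) are assumptions rather than the authors' implementation:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def preprocess(events, gap_hours=8):
    by_resource = defaultdict(list)
    for e in events:                         # group by resource
        by_resource[e["res"]].append(e)
    out = []
    for res, evs in by_resource.items():
        evs.sort(key=lambda e: e["start"])   # sort by start time, ascending
        groups = [[evs[0]]]
        for prev, cur in zip(evs, evs[1:]):  # split at rest gaps >= gap_hours
            if cur["start"] - prev["complete"] >= timedelta(hours=gap_hours):
                groups.append([cur])
            else:
                groups[-1].append(cur)
        for g in groups:                     # one object per subgroup
            start, complete = g[0]["start"], g[-1]["complete"]
            out.append({"res": res,
                        "start": start.time(),       # drop the date
                        "complete": complete.time(),
                        "duration_h": (complete - start).total_seconds() / 3600})
    return out

# Illustrative log: resource "A" works two days with a 15 h rest in between.
log = [
    {"res": "A", "start": datetime(2021, 3, 1, 8), "complete": datetime(2021, 3, 1, 12)},
    {"res": "A", "start": datetime(2021, 3, 1, 13), "complete": datetime(2021, 3, 1, 17)},
    {"res": "A", "start": datetime(2021, 3, 2, 8), "complete": datetime(2021, 3, 2, 16)},
]
shifts = preprocess(log)  # two shift objects: 08:00-17:00 (9 h) and 08:00-16:00 (8 h)
```

The one-hour lunch break on the first day does not split the shift because it is shorter than the eight-hour rest threshold.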
The formalized representation of Algorithm 1 can be described by the following mathematical definitions:
Event logs.
Based on the definition of event logs in the Preliminary section, an event is an atomic representation of an activity instance and thus contains a single timestamp. However, in Algorithm 1, we refer to events with a start and complete lifecycle. An event has a number of properties, such as case id, activity, resource, start timestamp, and complete timestamp.
Resource.
Let R be the set of all possible resource names, with res: E → R assigning resources to events.
Group rows of events by resource and order by start time.
G is the set of subsets of events grouped by resource. Each subset of events in G is ordered by the start time attribute.
Splitting the sets in G into further subsets.
G′ is the set of event subsets after splitting: two consecutive events are placed in different subsets when the gap between them is ≥ 8 h.
The array of object data.
O is the set of object data (an array of objects), where each object contains a start time, a completion time, and a duration. The start and completion times in O contain only the time of day (hours/minutes/seconds) without any date information. They are gathered from the event subsets explained previously: take the start time of the first event in each subset as the start, and the completion time of the last event in each subset as the completion.
As explained in the above algorithm, we consider only the time of day, without the date, when clustering the data. The date is neglected because, in shift time clustering, working hours that start today at, for example, 08:00 should be considered the same as working hours that started yesterday or last week at 08:00. This is reasonable for shift time clustering because the critical aspect is the time, not the date. In addition to the time information, we include the duration of the working hours in the input. Clustering the data considering only the time but not the date is not a trivial task and should be performed carefully. The distance between two different times cannot be calculated using the ordinary distance function, as shown in
Figure 3. The distance between time 23 and time 18 is 5 based on
Figure 3, and also based on the normal calculation |23 − 18| = 5. However, the distance between time 23 and time 1 is 2 based on
Figure 3, but if we calculate the distance normally, it becomes |23 − 1| = 22, which is not correct. We need to modify the distance function between times t_i and t_j to calculate the distance accurately. The symbol θ here refers to the time threshold for a specific time unit. For example, if the time unit is hours, θ is 24, and if the time unit is minutes, θ is 1440. The distance function of time, which always generates a positive value, is shown in the following equation:
D(t_i, t_j) = min(|t_i − t_j|, θ − |t_i − t_j|)  (1)
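A minimal Python sketch of this circular time distance is:

```python
def time_distance(t1, t2, theta=24):
    """Equation (1): the shorter of the direct difference and the
    wrap-around difference on a clock of period theta (24 for hours,
    1440 for minutes)."""
    d = abs(t1 - t2)
    return min(d, theta - d)

print(time_distance(23, 18))  # 5, the same as the ordinary |23 - 18|
print(time_distance(23, 1))   # 2, crossing midnight instead of 22
```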
Theorem 1.
D in Equation (1) is a distance metric. Equation (1) is a distance metric if it is a non-negative function and satisfies the following properties based on the theory of metrics [16]:
Non-negativity: D(t_i, t_j) ≥ 0;
Identity: D(t_i, t_j) = 0 if and only if t_i = t_j;
Symmetry: D(t_i, t_j) = D(t_j, t_i); and
Triangle inequality: D(t_i, t_k) ≤ D(t_i, t_j) + D(t_j, t_k).
D in Equation (1) is a non-negative function because no input combination can yield a negative result, and D(t_i, t_j) = 0 only when t_i = t_j. Therefore,
D satisfies the identity property. As we use the absolute value and the minimum between the two candidate values, changing the order of the inputs will not yield different results. This indicates that
D also satisfies the symmetry property. We now prove that
D satisfies the triangle inequality as well. For any three times t_i, t_j, and t_k on a circle of circumference θ, D gives the length of the shortest arc between two times, and the shortest arc from t_i to t_k can never be longer than the path that passes through t_j, so D(t_i, t_k) ≤ D(t_i, t_j) + D(t_j, t_k). Evidently, this satisfies the triangle inequality, as illustrated in
Figure 4 below.
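For the hour clock (θ = 24), the four metric properties can also be checked exhaustively by brute force over all integer hour values:

```python
def time_distance(t1, t2, theta=24):
    # Circular time distance as in Equation (1).
    d = abs(t1 - t2)
    return min(d, theta - d)

# Exhaustive check of the metric properties over all integer hours.
hours = range(24)
for a in hours:
    for b in hours:
        assert time_distance(a, b) >= 0                    # non-negativity
        assert (time_distance(a, b) == 0) == (a == b)      # identity
        assert time_distance(a, b) == time_distance(b, a)  # symmetry
        for c in hours:                                    # triangle inequality
            assert time_distance(a, c) <= time_distance(a, b) + time_distance(b, c)
```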
We use the distance function described in Equation (1) in the self-organizing map (SOM) and
k-means algorithms to cluster the resources based on shift work.
k-means is used because of its simplicity and its computational speed. SOM is used because the number of clusters then does not have to be predetermined; the algorithm performs dimensionality reduction, which can be used to determine the number of clusters automatically.
k-means is more restricted in that the number of clusters needs to be defined. However,
k-means can also determine the number of clusters automatically by incorporating the elbow method. The basic idea of SOM and
k-means is to group the data by calculating the Euclidean distance between several attributes of the data and the groups (weights in SOM or centroids in
k-means). To measure the Euclidean distance correctly, the distance between the group and each attribute of the data should be on the same scale to avoid bias toward a specific attribute. Thus, researchers generally normalize or standardize the attributes to ensure that all attribute data are on the same scale. In this study, we normalize the data to [0, 1] using the equation given in [16]. The limitation of the normalization based on the equation in [16] is that, if we consider the input as a time value (without date), the result will not be useful, and an incorrect distance may be generated, because the distance between two different time values must be calculated considering the threshold value, as shown in Equation (1). To cope with this problem, we normalize the time attribute after the distance between two time values has been calculated:
D_norm(t_i, t_j) = D(t_i, t_j) / D_max  (2)
where D is the distance function from Equation (1), D_max is the maximum distance among the data (it plays the same role as the maximum in ordinary normalization, but the formula is different for time values), and D_norm is the normalized distance value. For example, if the data are 1 and 23, then D_max is 2, but if the data are 1, 13, and 23, then D_max is 12. Indeed, the largest D_max that can be obtained for any combination of time values is 12.
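The normalization of Equation (2) can be sketched as follows, with D_max computed over all pairs in the data (function names are illustrative):

```python
from itertools import combinations

def time_distance(t1, t2, theta=24):
    # Circular time distance, Equation (1).
    d = abs(t1 - t2)
    return min(d, theta - d)

def normalized_time_distance(t1, t2, data, theta=24):
    """Equation (2) sketch: divide the circular distance by D_max, the
    largest pairwise circular distance found in the data."""
    d_max = max(time_distance(a, b, theta) for a, b in combinations(data, 2))
    return time_distance(t1, t2, theta) / d_max

# With data {1, 23}, D_max = 2; with data {1, 13, 23}, D_max = 12.
```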
SOM is a neural-network-based clustering technique widely used for visualization and data reduction of high-dimensional data [17,18]. An SOM comprises
m neurons located on a low-dimensional map, generally spread across a 2-D map in
x and
y coordinates [19]. The two topologies commonly used in SOM are the rectangular and hexagonal topologies [19]. Each neuron
i has a
p-dimensional weight vector w_i = (w_i1, ..., w_ip), which has the same dimension as the input space.
If we have
N input data with
p-dimensional attribute vectors, then the following steps describe the modified SOM algorithm based on the conventional SOM algorithm [20]:
(a) Set the maximum number of iterations.
(b) Initialize the weight vectors of each neuron.
(c) Randomly select an input vector v among the N data and calculate its distance from the weight vector w_i of each neuron. We modified the Euclidean distance by separating the calculation for each attribute based on its characteristics. The attributes that need to be evaluated as time values (time-dependent) are grouped together, whereas the attributes that are not time-dependent form a different group. Here, D_norm is the normalized distance function in Equation (2), v_p is the input value for a specific attribute p of the p-dimensional attribute vector, and w_ip(s) is the weight of neuron i for attribute p at a specific iteration s.
(d) Determine the best matching unit (BMU), i.e., the winner neuron c, by comparing the previously calculated distances.
(e) Calculate the learning rate and neighborhood function. Several learning rates and neighborhood functions have been proposed in the literature. The learning rate α(s) and neighborhood function σ(s) are functions that decrease with the iteration s; the symbols α_0 and σ_0 indicate the initial learning rate and initial neighborhood radius, respectively. In this study, we use the learning rate and neighborhood function based on [20,21].
(f) Calculate the neighborhood distance weight h_ci(s) between the winner neuron c and every other neuron i, where the coordinates of the neurons in the lattice or map are used as follows.
(g) Update the weight vectors of the neurons within the local neighborhood of the BMU (including the winner neuron c). We modified the weight update mechanism shown in Equation (4) to satisfy the time value constraint. Equation (5) gives the new weight of attribute p at iteration s when p is a non-time value. Equation (6) gives the new temporary weight of a time value at iteration s; we use the term "temporary" because this value is not the final new weight (we need to evaluate it against the threshold first). Equation (7) gives the displacement between the input value and the weight. The terms displacement and distance are not the same: the displacement value can be negative, whereas the distance value is always positive.
(h) Repeat steps (c) to (g) until the maximum number of iterations is reached or another stopping criterion is met.
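A compact sketch of steps (a) to (h) is given below. It is an illustration only: the exponential decay schedules, the Gaussian neighborhood, the rectangular grid, and all function names are assumptions, not the exact formulas of Equations (3)-(7):

```python
import math, random

THETA = 24  # period of the clock for time-valued attributes (hours)

def circ_dist(a, b):
    """Circular distance as in Equation (1)."""
    d = abs(a - b)
    return min(d, THETA - d)

def circ_displacement(a, b):
    """Signed shortest displacement from a to b on the clock; unlike the
    distance, it can be negative (cf. the displacement of Equation (7))."""
    d = (b - a) % THETA
    return d - THETA if d > THETA / 2 else d

def som_fit(data, time_attrs, grid_w, grid_h, iters=1000, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    p = len(data[0])
    sigma0 = max(grid_w, grid_h) / 2
    # (b) one weight vector per neuron on a rectangular 2-D grid
    neurons = [[(x, y),
                [rng.uniform(0, THETA) if k in time_attrs else rng.random()
                 for k in range(p)]]
               for x in range(grid_w) for y in range(grid_h)]
    for s in range(iters):
        alpha = alpha0 * math.exp(-s / iters)   # (e) decaying learning rate
        sigma = sigma0 * math.exp(-s / iters)   # (e) decaying neighborhood radius
        v = rng.choice(data)                    # (c) pick a random input vector
        def dist_sq(w):
            # (c) mixed distance: circular for time attributes, else Euclidean
            return sum((circ_dist(v[k], w[k]) if k in time_attrs
                        else v[k] - w[k]) ** 2 for k in range(p))
        c_pos, _ = min(neurons, key=lambda n: dist_sq(n[1]))   # (d) BMU
        for pos, w in neurons:
            # (f) Gaussian neighborhood weight on the grid
            g = math.exp(-((pos[0] - c_pos[0]) ** 2 + (pos[1] - c_pos[1]) ** 2)
                         / (2 * sigma ** 2 + 1e-12))
            for k in range(p):                  # (g) weight update
                if k in time_attrs:
                    w[k] = (w[k] + alpha * g * circ_displacement(w[k], v[k])) % THETA
                else:
                    w[k] += alpha * g * (v[k] - w[k])
    return neurons
```

For example, fitting a single neuron to the hour values 23 and 1 pulls its weight toward midnight (near 0) via the signed displacement, whereas an ordinary Euclidean update would drift toward 12.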
We must first determine the number of neurons in the SOM before executing it. We determine the number of neurons using Equation (8), based on a previous study [22]. The equation calculates the number of neurons from the observations in the training data, where
M is the number of neurons, and
N is the number of observations in the training data.
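Assuming the widely used rule of thumb M = 5√N for Equation (8) (the exact constant is not reproduced in this excerpt), the neuron count can be computed as:

```python
import math

def neuron_count(n_obs):
    """Number of SOM neurons for N observations, assuming the common
    heuristic M = 5 * sqrt(N) as Equation (8); the constant 5 is an
    assumption here."""
    return math.ceil(5 * math.sqrt(n_obs))
```

For instance, 100 observations would give a map of 50 neurons under this assumption.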
To compare the effect of the modified time distance function across different algorithms, we also implement it using
k-means clustering.
k-means is a centroid-based partitioning technique that uses the centroid of a cluster, c_i, to represent that cluster [16]. Conceptually, the centroid of a cluster is its center point. The quality of a cluster C_i can be measured using the within-cluster variation, which is the sum of squared errors (SSE) between all the objects in C_i and the centroid c_i, defined as follows:
If the dataset contains N observations and k is the number of clusters, the k-means algorithm operates according to the following procedure, where x_j denotes a datum in the dataset and c_i denotes the centroid of the cluster assigned to x_j:
(a) Arbitrarily choose k objects from the data as the initial cluster centers.
(b) Assign each object to the cluster to which it is most similar, based on the mean value (centroid) of the objects in the cluster. We use the same distance function described in Equation (3), but the term weight (w) is replaced by the mean (c_i).
(c) Update the cluster means, i.e., calculate the mean value of the objects in each cluster. For each cluster i, we modified the cluster mean update by incorporating an incremental average. The incremental average is used because, when averaging time values, we cannot compute the mean over all the values directly; we must fold each element into the average one by one, as shown in Equations (10) and (11), where x_n refers to the nth element among the elements being averaged.
(d) Repeat steps (b) and (c) until a stopping criterion is met (for example, cluster membership no longer changes).
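The incremental averaging of step (c) can be sketched as follows for time values; folding each element in via the signed shortest displacement is an illustration of the idea behind Equations (10) and (11), not the authors' exact formulas:

```python
def circ_incremental_mean(values, theta=24):
    """Incremental average of time values on a clock of period theta.
    Each element is folded into the running mean one by one, so values
    straddling midnight average correctly."""
    m = float(values[0])
    for n, x in enumerate(values[1:], start=2):
        disp = (x - m) % theta          # displacement from current mean to x
        if disp > theta / 2:
            disp -= theta               # take the signed shortest direction
        m = (m + disp / n) % theta      # fold the nth element into the mean
    return m
```

For example, averaging the hours 23 and 1 yields 0 (midnight), whereas the ordinary arithmetic mean would yield 12. As the text notes, the elements must be evaluated one by one rather than summed directly.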
Several methods exist to define the number of clusters in
k-means, such as the elbow method, the gap statistic, the silhouette coefficient, canopy clustering, Akaike’s information criterion, and the Monte Carlo technique [23,24]. As mentioned previously, we employ the elbow method to determine the number of clusters
k in
k-means clustering. Generally, the elbow method is based on the subjective visual judgment of the user. However, in this study, we perform the elbow method without subjective visualization. The elbow method used in this study is based on the method described in [24], using the square of the distance between the sample points in each cluster and the centroid of the cluster. In our case, this quantity is the SSE given in Equation (9). Our program monitors the SSE value for each increment of
k and detects the point at which a rapid decline occurs, similar to [25], but using Heron’s formula.
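The automatic elbow detection described above can be sketched as follows: each candidate point (k, SSE) forms a triangle with the first and last points of the curve, Heron's formula gives the triangle's area, and the height over the base line is the point-to-line distance. The k with the largest height is chosen. The function name is illustrative:

```python
import math

def elbow_k(sse, ks=None):
    """Pick the elbow: the k whose (k, SSE) point lies furthest from the
    line joining the first and last points, with the distance computed as
    the triangle height obtained from Heron's formula."""
    ks = ks or list(range(1, len(sse) + 1))
    ax, ay = ks[0], sse[0]
    bx, by = ks[-1], sse[-1]
    ab = math.hypot(bx - ax, by - ay)           # base of every triangle
    best_k, best_h = ks[0], -1.0
    for x, y in zip(ks[1:-1], sse[1:-1]):
        pa = math.hypot(x - ax, y - ay)
        pb = math.hypot(x - bx, y - by)
        s = (ab + pa + pb) / 2                  # semi-perimeter
        area_sq = max(s * (s - ab) * (s - pa) * (s - pb), 0.0)
        h = 2 * math.sqrt(area_sq) / ab         # height over the base
        if h > best_h:
            best_k, best_h = x, h
    return best_k
```

On an SSE curve such as [100, 40, 20, 18, 17, 16], the sharp bend at k = 3 produces the largest height and is selected.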
Our approach generates the shift work information using a clustering method. Interesting research by other authors has also tried to mine resource availability from a temporal perspective [26]. Our approach differs from the existing methods in that it does not focus on a single resource but aggregates the resource shift work using clustering. Our work focuses on human resources, who usually work in shifts. Because we use a clustering method, a resource that is recorded only for a short time in the event log, owing to the limited jobs assigned to him/her, is automatically grouped into a sufficiently similar cluster. Nakatumba [12] and Martin [26] do not use this kind of mechanism but instead focus on every single resource, and they use statistical analysis approaches, whereas we use a machine learning technique based on a clustering algorithm. We also propose an activity representation that can incorporate the shift work rule in the simulation model, whereas Nakatumba’s approach used resource availability percentages.
6. Results
As explained previously, we used an artificial dataset generated by simulation using a predefined rule for the shift work of resources. The rule is embedded in the CPN model shown in
Figure 6. From this CPN model representation, we created a matching model in the Scala programming language to generate the output more flexibly. Using this model, we performed the simulation over a three-day time window, as explained in
Section 5, and obtained the result by marking a token that was transformed into a comma-separated values (CSV) event log containing life cycle information (start and completion times). In the token generator explained in the previous section, we set the number of cases arriving each day to 90. These 90 cases arrive at the beginning of the day and are processed later by the resources. This condition is handled by the arc function
MATH_ROUND_GEN in the activity Gen, as shown in
Figure 7.
The CSV event log is then uploaded into the database and later used as input for resource mining. The log data in the database are preprocessed based on the algorithm described in Figure 1 of Section 3. A screenshot of the preprocessed data is shown in Figure 8.
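The preprocessing itself is specified by the algorithm in Figure 1; purely as an illustration of the kind of transformation involved, a sketch that reduces each event to a per-resource (start hour, completion hour) pair, the object data used later for clustering, might look like this (the CSV column names are assumptions):

```python
import csv
from collections import defaultdict

def preprocess(path="event_log.csv"):
    """Reduce each event to a (start_hour, completion_hour) pair per resource.

    Illustrative only: the actual algorithm is the one defined in Figure 1.
    Hours are taken modulo 24 so that only the time of day matters, which is
    what the shift-work clustering later operates on.
    """
    per_resource = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            start_h = float(row["start"]) % 24.0
            complete_h = float(row["complete"]) % 24.0
            per_resource[row["resource"]].append((start_h, complete_h))
    return dict(per_resource)
```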
We conducted the
k-means clustering experiment using the elbow method. As explained in the previous section, the elbow method automatically detects the best number of clusters
k without subjective visualization. This step involves the mechanism to determine the best number of clusters,
k, based on the data shown in
Figure 7. First, we set the value of k_max to 8; k_max in this experiment is the maximum number of clusters k that will be analyzed. We chose k_max = 8 because the number of different operational shifts in a day is generally relatively small. For each k, we evaluated the SSE based on Equation (13). The SSEs per k are listed in Table 1 below for scenario 1 and scenario 2.
Based on these SSEs, we can determine the point furthest from the line (marked with the red line in Figure 9) to decide the best value of k. This furthest point can be determined easily using Heron's formula if the SSE for each k is known.
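A minimal sketch of this elbow detection, assuming the standard formulation in which the elbow is the (k, SSE) point lying furthest from the straight line joining the first and last points of the SSE curve:

```python
import math

def elbow_k(sse_by_k):
    """Pick the k whose (k, SSE) point lies furthest from the line joining
    the first and last points of the SSE curve.

    The point-to-line distance is obtained via Heron's formula: for the
    triangle formed by the two endpoints and a candidate point, the area is
    sqrt(s(s-a)(s-b)(s-c)), and the distance is 2 * area / base.
    """
    ks = sorted(sse_by_k)
    p1 = (ks[0], sse_by_k[ks[0]])      # first point of the curve
    p2 = (ks[-1], sse_by_k[ks[-1]])    # last point of the curve
    base = math.dist(p1, p2)
    best_k, best_d = ks[0], -1.0
    for k in ks[1:-1]:
        p = (k, sse_by_k[k])
        a, b = math.dist(p1, p), math.dist(p2, p)
        s = (a + b + base) / 2.0                       # semi-perimeter
        area = math.sqrt(max(s * (s - a) * (s - b) * (s - base), 0.0))
        d = 2.0 * area / base                          # height = distance to line
        if d > best_d:
            best_k, best_d = k, d
    return best_k
```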
We conducted an experiment with both algorithms, SOM and k-means, using the data described above to compare their results when mining the shift work. Implementing the new distance function in the k-means algorithm always yielded better results than the SOM. In the experiment, k-means with the new distance always divided the data into two clusters for scenario 1 and three clusters for scenario 2, which matched the baseline rule used in the simulation model. SOM tended to produce more than three clusters: six for scenario 1 and seven for scenario 2 using the neuron map size based on Equation (8), which was far from the expected result.
Figure 10 and Figure 11 show the results of k-means without and with the new distance function, respectively. The left side of Figure 10 shows that k-means generates three groups of resources using the data from scenario 1, and the right side of Figure 11 shows that k-means also generates three groups of resources using the data from scenario 2. Each cluster is separated by an empty space along the horizontal axis. Different colors in the chart represent different resources. Each vertical line spread along the horizontal axis represents one object data item, as explained in Algorithm 1; the bottom and top of each vertical line mark its start and completion times, respectively. The vertical axis on the left side of the chart shows the time information, separated into current-day and next-day time. We use current-day and next-day time to visualize data whose start time can be greater than its completion time, because we do not consider the date (for example, an object data item with a start time of 21 and a completion time of 9 represents a work shift that started at 9 p.m. and ended at 9 a.m.). The left side of Figure 10 shows that, although several object data items share the same start time (hour 12), items whose completion times fall below and above hour 00/24 of the next day are clustered into different groups. This happens because the distance between a time above hour 24 and a time below hour 24 is large when calculated with an ordinary distance function.
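One plausible formulation of such a distance function (not necessarily the authors' exact definition) treats each hour component circularly on a 24-hour clock, so that times just before and just after midnight are close:

```python
def circular_hour_diff(a, b):
    """Smallest difference between two times of day on a 24-hour circle."""
    d = abs(a - b) % 24.0
    return min(d, 24.0 - d)

def shift_distance(x, y):
    """Distance between two (start_hour, completion_hour) object data points,
    treating each hour component circularly so that, e.g., 23:00 and 01:00
    are 2 hours apart rather than 22.

    This is only one plausible formulation of the 'new distance function'
    described in the text, not the authors' exact definition.
    """
    return (circular_hour_diff(x[0], y[0]) ** 2 +
            circular_hour_diff(x[1], y[1]) ** 2) ** 0.5
```

With an ordinary Euclidean distance, the night shifts (23.0, 7.0) and (1.0, 9.0) would appear far apart; under this circular distance they are close, so they land in the same cluster.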
The simulation model previously used to generate the event log was used again (with all parameters kept), but this time the predefined resource shift work rule was removed from the model. We then used the minimum and maximum times in each cluster generated by the clustering algorithm as the start and end times of the shift work, respectively, and embedded this resource shift work information in the simulation model. Another simulation was run using the same simulation model but without any shift work rule embedded. To compare the effectiveness of this approach in generating a better simulation model, we used basic key performance indicators (KPIs) commonly employed to measure the simulation accuracy of a process: flow time and throughput. Evidently, the KPIs generated from the simulation model that incorporates the shift work mined by the clustering algorithm fit the KPIs from the baseline event log better. If the shift work is not incorporated into the simulation model, all the resources are assumed to be available from the start of the simulation, which leads to inaccurate KPIs: all the resources can work at the same time, which tends to shorten the average flow time of cases. The flow time used here is measured as the duration of each case from its generation time to its finishing time in the process. The throughput is calculated by dividing the number of finished tokens by the simulation time. The comparison of simulation performance results using the configurations of scenarios 1 and 2 is shown in Table 2 and Table 3, respectively.
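A minimal sketch of how these two KPIs could be computed from a CSV event log (the column names, and the approximation of a case's generation time by its first recorded start, are assumptions):

```python
import csv

def kpis(path="event_log.csv", sim_duration_hours=72.0):
    """Compute the two KPIs used in the comparison: average flow time
    (case generation to completion) and throughput (finished cases per
    unit of simulation time).

    Illustrative sketch: generation time is approximated here by the first
    recorded start of the case, and completion by its last recorded
    completion; column names are assumed.
    """
    first_start, last_complete = {}, {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            cid = row["case_id"]
            s, c = float(row["start"]), float(row["complete"])
            first_start[cid] = min(s, first_start.get(cid, s))
            last_complete[cid] = max(c, last_complete.get(cid, c))
    flow_times = [last_complete[cid] - first_start[cid] for cid in first_start]
    avg_flow_time = sum(flow_times) / len(flow_times)
    throughput = len(flow_times) / sim_duration_hours
    return avg_flow_time, throughput
```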
7. Discussion
In this study, we attempted to mine shift work information from event logs. We then incorporated this information into the simulation model to obtain results that represent the operational event log more accurately. Based on our experiment, implementing this method can improve the results of the simulation model. However, several limitations should be noted. The clustering algorithm in this article assumes that the data are separable; if the event log data about the shift work of the resources are not well separable, obtaining good results could be difficult. We also observed that the SOM algorithm tends to generate too many clusters, even after the new distance function is incorporated.
A more thorough analysis of the result indicated that this phenomenon occurs because of the formula for determining the size of the neuron map in Equation (8) and the method for initializing the neuron weights at the beginning of execution. The size of the data determines the size of the neuron map in Equation (8). However, if the number of data points is small (for example, fewer than 25), a neuron map larger than the data set itself will be generated.
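Assuming Equation (8) is the widely used heuristic M = 5·sqrt(N) for the number of SOM neurons (an assumption; the equation itself is not reproduced in this section), the problem for small data sets is easy to see: for N < 25, 5·sqrt(N) exceeds N, so the map contains more neurons than data points:

```python
import math

def som_neuron_count(n_data):
    """Heuristic number of SOM neurons, M = ceil(5 * sqrt(N)).

    Assumed here to be what Equation (8) specifies (it is a widely used
    rule of thumb). For N < 25 it yields more neurons than data points,
    which matches the over-clustering problem discussed in the text.
    """
    return math.ceil(5 * math.sqrt(n_data))
```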
A more accurate simulation model can be generated by combining process model discovery, decision point analysis, organizational mining, and our proposed method. Our proposed method mines the shift work of the resources; the mined shift work from the clustering algorithm can then be embedded in the simulation model.
Basically, to apply the proposed method in the real world, we first need an event log that records human resource information for each event. Using that event log, we can generate the simulation model. Generating the simulation model starts with a process discovery algorithm, whose output process model serves as the baseline process for the simulation. Next, we apply the organizational miner to allocate resources to activities. We then perform decision point analysis using the process model and the event log as input. Finally, we use shift work mining to obtain the shift times of the resources and add this information as a resource rule in the simulation model. This shift work rule adds a constraint on the resources in the simulation model. By adding more constraints that reflect the real constraints observed in reality, we can generate a more accurate simulation model; however, adding more constraints can also make the simulation run longer.
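The sequence of steps above can be sketched as follows; every helper name here is a hypothetical placeholder for the corresponding mining component, not a real library API:

```python
# Hypothetical placeholders for each mining component. In practice these
# would be backed by an actual process discovery algorithm, organizational
# miner, decision point analysis, and the proposed shift-work clustering.
def discover_process_model(log):
    return {"kind": "process model"}          # baseline process structure

def mine_organizational_model(log):
    return {"kind": "resource allocation"}    # resources assigned to activities

def analyze_decision_points(model, log):
    return {"kind": "gateway rules"}          # branching probabilities/rules

def mine_shift_work(log):
    return {"kind": "shift rules"}            # proposed method: clustered shifts

def build_simulation_model(event_log):
    """Assemble a simulation model from an event log, following the order
    of steps described in the text: discovery, organizational mining,
    decision point analysis, then shift work mining."""
    process = discover_process_model(event_log)
    return {
        "process": process,
        "resources": mine_organizational_model(event_log),
        "decisions": analyze_decision_points(process, event_log),
        "shifts": mine_shift_work(event_log),  # adds shift constraints
    }
```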
Several directions for future work can be pursued from this research. Combining the proposed method with other resource mining results, for example, the resource availability percentage described by Nakatumba [12], could be useful. The currently proposed method could also be improved by combining it with another clustering algorithm or another distance function to handle the inseparable data discussed in the limitations above. Another interesting direction is the automatic optimization of resource allocation and execution based on simulation models automatically generated from event logs.