This section introduces two methods for filtering chaotic activities from an event log: direct chaotic activity filtering and indirect chaotic activity filtering. Both identify the chaotic activities in the event log and delete them.
3.2. Direct Chaotic Activity Filtering
Algorithm 1 filters chaotic activities iteratively. It builds a list of event logs whose first element is the original event log L. Each subsequent element of the list is a filtered version of L from which additional activities have been removed compared with the previous element.
The algorithm proposed in this section proceeds as follows. First, the relations between the activities in the input event log are extracted to form a relation table. Statistics over this table are then used to determine which activities are chaotic. Generally speaking, a chaotic activity has no clear position in the log, so its direct following and direct preceding relationships with other activities are disordered: it directly follows, and is directly followed by, most of the other activities in the log. Therefore, by counting the distinct follows relations between activities, we obtain for each activity the set of activities that satisfies a given condition, calculate the chaos degree of the activity from that set, and judge whether the activity is chaotic according to its chaos degree.
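As a minimal sketch of the first step, the relation table can be built by counting adjacent pairs in every trace. The function name `directly_follows`, the (trace, frequency) encoding of the log, and the single-letter activity names are assumptions made for illustration. The example log also assumes that the fourth trace is <a,b,c,x,d>, as implied by the filtered logs shown in Section 3.3.

```python
from collections import Counter

def directly_follows(log):
    """Count #(<a,b>, L): how often b occurs immediately after a in the log."""
    counts = Counter()
    for trace, freq in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += freq
    return counts

# Assumed example log: each entry is (trace, frequency).
LOG = [("abcdx", 10), ("abxcd", 10), ("axbcd", 10), ("abcxd", 10)]

dfs = directly_follows(LOG)
# The direct-preceding count seen from b towards a is the
# direct-following count of the pair (a, b).
dps = {(b, a): n for (a, b), n in dfs.items()}
```

Here `dfs[(i, j)]` and `dps[(i, j)]` play the role of Dij.dfs and Dij.dps in Algorithm 1.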
Define the chaos degree of activity y in log L under the i-th condition as CHi(y).
The pseudocode is as follows:
Algorithm 1: Direct Chaotic Activity Filtering
Input: event log L
Output: event log list QLS
L’ ← L
QLS ← <L’>
while |ActSet(L’)| > 2 do
    actset ← ActSet(L’)
    counts ← ∅, countms ← ∅, countns ← ∅, countmns ← ∅
    for i in ActSet(L’) do
        count1 ← 0, countm ← 0, countn ← 0, countmn ← 0
        for j in ActSet(L’) do
            if (Dij.dfs > 0) then count1++
            if (Dij.dps > 0) then count1++
            if (Dij.dfs > 0 and Dij.dps > 0) then
                countm++
                if (|Dij.dfs − Dij.dps| < (Dij.dfs + Dij.dps)/2) then countn++
        end for
        if (countm ≠ 0) then countmn ← countn/countm
        counts ← {counts∪(i,count1)}
        countms ← {countms∪(i,countm)}
        countns ← {countns∪(i,countn)}
        countmns ← {countmns∪(i,countmn)}
    end for
    rem ← ∅
    for a in ActSet(L’) do
        if (counts.get(a) > average(counts) and countms.get(a) > average(countms)) then
            if (countns.get(a) > average(countns) and countmns.get(a) > average(countmns)) then
                rem ← {rem∪a}
    end for
    L’ ← L’↑actset\rem
    QLS ← QLS∙<L’>
end while
return QLS
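The loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `project`, `activities`, `chaos_degrees`, and `direct_filter` are hypothetical, the "small difference" test is assumed to be "below the average of the two frequencies", the example log assumes a fourth trace <a,b,c,x,d>, and the early exit when no activity is selected is an addition of this sketch to guarantee termination.

```python
from collections import Counter
from statistics import mean

def project(log, keep):
    """Restrict the log to the activities in keep, merging identical traces."""
    out = Counter()
    for trace, freq in log.items():
        out[tuple(a for a in trace if a in keep)] += freq
    return dict(out)

def activities(log):
    return {a for trace in log for a in trace}

def chaos_degrees(log):
    """Return {activity: (CH1, CH2, CH3, CH4)} for the four conditions."""
    dfs = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            dfs[(a, b)] += freq
    result = {}
    for i in activities(log):
        count1 = countm = countn = 0
        for j in activities(log):
            f, p = dfs[(i, j)], dfs[(j, i)]  # i -> j and j -> i frequencies
            count1 += (f > 0) + (p > 0)
            if f > 0 and p > 0:
                countm += 1
                # Assumed threshold: difference below the average of f and p.
                countn += abs(f - p) < (f + p) / 2
        result[i] = (count1, countm, countn, countn / countm if countm else 0)
    return result

def direct_filter(log):
    current, qls = dict(log), [dict(log)]
    while len(activities(current)) > 2:
        deg = chaos_degrees(current)
        avgs = [mean(v[k] for v in deg.values()) for k in range(4)]
        rem = {a for a, v in deg.items() if all(v[k] > avgs[k] for k in range(4))}
        if not rem:  # guard added in this sketch: nothing exceeds all averages
            break
        current = project(current, activities(current) - rem)
        qls.append(current)
    return qls

# Assumed example log: trace (tuple of activities) -> frequency.
LOG = {tuple("abcdx"): 10, tuple("abxcd"): 10, tuple("axbcd"): 10, tuple("abcxd"): 10}
qls = direct_filter(LOG)
```

On this log the sketch removes x in the first iteration and then stops, because no remaining activity exceeds the average under all four conditions.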
Condition (1): The chaos degree of an activity is represented by the sum of the number of direct following relationships and the number of direct preceding relationships between that activity and all other activities in the log. That is, for activity p in log L, let S1 be the set of activities that directly follow p and S2 the set of activities that directly precede p; the total number of elements in S1 and S2, CH1(p) = |S1| + |S2|, is taken as the chaos degree of activity p. If the chaos degree of activity p exceeds the set threshold, p is judged to be a candidate chaotic activity.
Take the log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10] as an example, and use Table 1 to calculate the chaos degree of activity x. First, find the activities that have a direct following relationship with x in the relation table: the values of #(<x,b>,L), #(<x,c>,L), and #(<x,d>,L) are all greater than 0, so the set of activities that directly follow x is {b,c,d}, which has 3 elements. Second, find the activities that have a direct preceding relationship with x: the values of #(<a,x>,L), #(<b,x>,L), #(<c,x>,L), and #(<d,x>,L) are all greater than 0, so the set of activities that directly precede x is {a,b,c,d}, which has 4 elements. The two sets together contain 7 elements, so CH1(x) = 7. This calculation is performed for every activity in the log, and the average of all results is used as the threshold. In this log, the average value is 4. The result for activity x is greater than the average value, so x is judged to be a candidate chaotic activity.
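The calculation for condition (1) can be reproduced with a short sketch. The function name `ch1` and the string encoding of traces are assumptions for illustration, and the log is assumed to contain the fourth trace <a,b,c,x,d>, as implied by the filtered logs in Section 3.3. Frequencies are omitted because condition (1) only asks whether a relation exists.

```python
from statistics import mean

# Assumed example log, one string per distinct trace.
TRACES = ["abcdx", "abxcd", "axbcd", "abcxd"]

def ch1(traces):
    """CH1(p) = |S1| + |S2| for every activity p."""
    pairs = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    degrees = {}
    for p in sorted(set("".join(traces))):
        s1 = {b for a, b in pairs if a == p}  # activities that directly follow p
        s2 = {a for a, b in pairs if b == p}  # activities that directly precede p
        degrees[p] = len(s1) + len(s2)
    return degrees

degrees = ch1(TRACES)
threshold = mean(list(degrees.values()))
candidates = [p for p, v in degrees.items() if v > threshold]
```

Using the average as the threshold mirrors the comparison against average(counts) in Algorithm 1.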
Condition (2): A chaotic activity often has both a direct following relationship and a direct preceding relationship with many activities in the log. Accordingly, the number of activities in the event log that have both a direct following relationship and a direct preceding relationship with a given activity can be counted to calculate the chaos degree of that activity. That is, for activity p in event log L, the number of elements in the set S3 of such activities is taken as the chaos degree of activity p, CH2(p) = |S3|. If the chaos degree of activity p exceeds the set threshold, p is judged to be a candidate chaotic activity.
Take the log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10] as an example, and use Table 1 to calculate the chaos degree of activity x. Find the activities that have both a direct following relationship and a direct preceding relationship with x in the relation table: #(<x,b>,L) and #(<b,x>,L) are both greater than 0, #(<x,c>,L) and #(<c,x>,L) are both greater than 0, and #(<x,d>,L) and #(<d,x>,L) are both greater than 0. So for activity x, the set of activities that satisfy the condition is {b,c,d}, and CH2(x) = 3. This calculation is performed for every activity in the log, and the average of all results is used as the threshold. In this log, the average value is 1.2. The result for activity x is greater than the average, so x is judged to be a candidate chaotic activity.
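A corresponding sketch for condition (2) counts, for each activity, the partners with which it has relations in both directions. The function name `ch2` and the string encoding of traces are assumptions for illustration, as is the fourth trace <a,b,c,x,d> of the example log.

```python
from statistics import mean

# Assumed example log, one string per distinct trace.
TRACES = ["abcdx", "abxcd", "axbcd", "abcxd"]

def ch2(traces):
    """CH2(p): number of activities q with both p -> q and q -> p in the log."""
    pairs = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    acts = sorted(set("".join(traces)))
    return {
        p: sum((p, q) in pairs and (q, p) in pairs for q in acts)
        for p in acts
    }

degrees = ch2(TRACES)
threshold = mean(list(degrees.values()))
candidates = [p for p, v in degrees.items() if v > threshold]
```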
Condition (3): If an activity has both direct following and direct preceding relationships with other activities, and the difference between the direct following frequency and the direct preceding frequency is small, the activity is judged to be chaotic. The average of the direct following frequency and the direct preceding frequency between two activities can be used as the threshold for judging the gap between them. Therefore, the activities that have both a direct following and a direct preceding relationship with a given activity are first collected into an activity set. Then, within this set, the activities for which the difference between the direct following frequency and the direct preceding frequency is small are collected into the set S4. Finally, the number of activities in S4 is taken as the chaos degree of the activity, i.e., for activity p in event log L, CH3(p) = |S4|. If the chaos degree of activity p exceeds the set threshold, p is judged to be a candidate chaotic activity.
Take the log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10] as an example, and use Table 1 to calculate the chaos degree of activity x. Find the activities that have both a direct following relationship and a direct preceding relationship with x in the relation table: #(<x,b>,L) and #(<b,x>,L) are both greater than 0, #(<x,c>,L) and #(<c,x>,L) are both greater than 0, and #(<x,d>,L) and #(<d,x>,L) are both greater than 0, and in each pair the difference between the two frequencies is below the threshold. So for activity x, the set of activities that satisfy the condition is {b,c,d}, and CH3(x) = 3. This calculation is performed for every activity in the log, and the average of all results is used as the threshold. In this log, the average value is 1.2. The result for activity x is greater than the average, so x is judged to be a candidate chaotic activity.
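Condition (3) needs the frequencies as well as the existence of the relations. The sketch below is one possible reading: it assumes that "small" means the absolute difference is strictly below the average of the two frequencies, which the text suggests but does not state as a formula. The function name `ch3`, the (trace, frequency) encoding, and the fourth trace <a,b,c,x,d> are likewise assumptions.

```python
from collections import Counter

# Assumed example log: each entry is (trace, frequency).
LOG = [("abcdx", 10), ("abxcd", 10), ("axbcd", 10), ("abcxd", 10)]

def ch3(log):
    """CH3(p): partners q in both directions whose frequencies are close."""
    dfs = Counter()
    for trace, freq in log:
        for a, b in zip(trace, trace[1:]):
            dfs[(a, b)] += freq
    acts = sorted({a for trace, _ in log for a in trace})
    degrees = {}
    for p in acts:
        degrees[p] = sum(
            1
            for q in acts
            if dfs[(p, q)] > 0
            and dfs[(q, p)] > 0
            # Assumed threshold: average of the two frequencies.
            and abs(dfs[(p, q)] - dfs[(q, p)]) < (dfs[(p, q)] + dfs[(q, p)]) / 2
        )
    return degrees

degrees = ch3(LOG)
```

In this log every bidirectional pair has equal frequencies (10 and 10), so the result coincides with that of condition (2).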
Condition (4): The chaos degree is the value obtained from condition (3) divided by the value obtained from condition (2). Condition (2) yields the set of activities that have both a direct following and a direct preceding relationship with an activity, and condition (3) yields the subset of those activities for which the difference between the direct following frequency and the direct preceding frequency is small. The proportion of the latter to the former is the chaos degree of the activity, CH4(p) = CH3(p)/CH2(p). If the value calculated for an activity under condition (2) is 0, its value under condition (4) is set to 0. If the chaos degree of activity p exceeds the set threshold, p is judged to be a candidate chaotic activity.
Take the log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10] as an example, and use Table 1 to calculate the chaos degree of activity x. For activity x, condition (2) gives CH2(x) = 3 and condition (3) gives CH3(x) = 3, so for condition (4) the result is CH4(x) = 1. This calculation is performed for every activity in the log, and the average of all results is 0.8. The results for activities b, c, d, and x are greater than the average, so the candidate chaotic activities are b, c, d, and x.
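The ratio in condition (4) can be checked directly from the per-activity values reported for conditions (2) and (3) on this log. The variable names below are illustrative, and the guard for CH2 = 0 follows the rule stated in the definition of condition (4).

```python
from statistics import mean

# Values for conditions (2) and (3) as reported for the example log.
ch2 = {"a": 0, "b": 1, "c": 1, "d": 1, "x": 3}
ch3 = {"a": 0, "b": 1, "c": 1, "d": 1, "x": 3}

# CH4 = CH3 / CH2, defined as 0 when CH2 is 0.
ch4 = {p: (ch3[p] / ch2[p] if ch2[p] else 0.0) for p in ch2}

threshold = mean(list(ch4.values()))
candidates = sorted(p for p in ch4 if ch4[p] > threshold)
```

Because the ratio is 1 for every activity that has any bidirectional relation, condition (4) alone is not selective on this log; it becomes useful only in combination with the other three conditions.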
In summary, take the event log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10] as an example. For condition (1), we obtain {CH1(a) = 2, CH1(b) = 4, CH1(c) = 4, CH1(d) = 3, CH1(x) = 7}, with an average of 4, so the candidate chaotic activity is x. For condition (2), we obtain {CH2(a) = 0, CH2(b) = 1, CH2(c) = 1, CH2(d) = 1, CH2(x) = 3}, with an average of 1.2, so the candidate chaotic activity is x. For condition (3), we obtain {CH3(a) = 0, CH3(b) = 1, CH3(c) = 1, CH3(d) = 1, CH3(x) = 3}, with an average of 1.2, so the candidate chaotic activity is x. For condition (4), we obtain {CH4(a) = 0, CH4(b) = 1, CH4(c) = 1, CH4(d) = 1, CH4(x) = 1}, with an average of 0.8, so the candidate chaotic activities are b, c, d, and x. Activity x is the only candidate under all four conditions, so the chaotic activity is x.
3.3. Indirect Chaotic Activity Filtering
An alternative to the approach proposed in Algorithm 1 is to filter out the activities whose removal reduces the total chaos degree of the log. Define the overall chaos degree of the event log as the sum of the chaos degrees of all activities in the event log, that is, CHi(L) = Σ_{y∈ActSet(L)} CHi(y). Algorithm 2 iteratively filters from the event log the activities whose removal results in a significant reduction in log chaos. In contrast to Algorithm 1, which judges each activity by its own chaos degree, Algorithm 2 selects the activities to filter based on the overall chaos of the log after the activity has been deleted.
Algorithm 2 computes the relation table of the log obtained after filtering each activity, derives from it the change in the overall chaos degree of the filtered event log, and identifies the chaotic activities on the basis of this value. First, each activity in turn is deleted from the input event log. After each deletion, the relations between the remaining activities are re-extracted to form a new relation table, and the total chaos degree of the filtered event log is calculated from this table. The total chaos degree of the log is influenced by every activity in it, but chaotic activities have a particularly large impact, and the main purpose of filtering chaotic activities is to improve the quality of process models by reducing the level of chaos in the event log. Consequently, the change in the total chaos degree after deleting an activity measures the influence of that activity on the total chaos degree; from it we obtain the set of activities that satisfy each condition and judge whether an activity is chaotic. Algorithm 1 uses four conditions to calculate the chaos degree of activities; Algorithm 2 uses only the first three, omitting condition (4). The total chaos degree under the i-th condition of the event log L’ obtained by deleting activity y∈ActSet(L) from log L is defined as CHi(L’) = Σ_{z∈ActSet(L’)} CHi(z), where L’ = L↑ActSet(L)\{y} and each CHi(z) is calculated from the relation table of L’.
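For condition (1), the effect of deleting each activity can be sketched compactly. Every distinct directly-follows pair contributes one following relation and one preceding relation, so the total chaos degree is twice the number of distinct pairs. The helper names `total_ch1` and `remove`, the string encoding of traces, and the fourth trace <a,b,c,x,d> are assumptions for illustration.

```python
from statistics import mean

# Assumed example log, one string per distinct trace.
TRACES = ["abcdx", "abxcd", "axbcd", "abcxd"]

def total_ch1(traces):
    """Total chaos degree under condition (1): 2 x distinct directly-follows pairs."""
    pairs = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    return 2 * len(pairs)

def remove(traces, activity):
    """Delete one (single-letter) activity from every trace."""
    return [t.replace(activity, "") for t in traces]

# Total chaos degree of the filtered log, for each deleted activity y.
totals = {y: total_ch1(remove(TRACES, y)) for y in "abcdx"}
threshold = mean(list(totals.values()))
candidates = [y for y, v in totals.items() if v < threshold]
```

An activity is a candidate here when the total after its removal falls below the average, which is the comparison used in the worked example for Algorithm 2.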
The pseudocode is as follows:
Algorithm 2: Indirect Chaotic Activity Filtering
Input: event log L
Output: event log list QLS
L’ ← L
QLS ← <L’>
while |ActSet(L’)| > 2 do
    actset ← ActSet(L’)
    countos ← ∅, countmos ← ∅, countnos ← ∅
    for a in actset do
        L’’ ← L’↑actset\{a}
        counts ← ∅, countms ← ∅, countns ← ∅
        for i in ActSet(L’’) do
            count1 ← 0, countm ← 0, countn ← 0
            for j in ActSet(L’’) do
                if (Dij.dfs > 0) then count1++
                if (Dij.dps > 0) then count1++
                if (Dij.dfs > 0 and Dij.dps > 0) then
                    countm++
                    if (|Dij.dfs − Dij.dps| < (Dij.dfs + Dij.dps)/2) then countn++
            end for
            counts ← {counts∪(i,count1)}
            countms ← {countms∪(i,countm)}
            countns ← {countns∪(i,countn)}
        end for
        counto ← 0, countmo ← 0, countno ← 0
        for ac in ActSet(L’’) do
            counto ← counto + counts.get(ac)
            countmo ← countmo + countms.get(ac)
            countno ← countno + countns.get(ac)
        end for
        countos ← {countos∪(a,counto)}
        countmos ← {countmos∪(a,countmo)}
        countnos ← {countnos∪(a,countno)}
    end for
    rem ← ∅
    for a in actset do
        if (countos.get(a) < average(countos) and countmos.get(a) < average(countmos) and countnos.get(a) < average(countnos)) then
            rem ← {rem∪a}
    end for
    L’ ← L’↑actset\rem
    QLS ← QLS∙<L’>
end while
return QLS
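A Python sketch of the indirect procedure is given below. It is an illustration under stated assumptions rather than the authors' implementation: the names `project`, `activities`, `total_chaos`, and `indirect_filter` are hypothetical, an activity is selected when the totals after its removal are below the average under all three conditions (as in the worked example that follows), the "small difference" test is assumed to be "below the average of the two frequencies", the example log assumes a fourth trace <a,b,c,x,d>, and the early exit when nothing is selected is an addition of this sketch.

```python
from collections import Counter
from statistics import mean

def project(log, keep):
    """Restrict the log to the activities in keep, merging identical traces."""
    out = Counter()
    for trace, freq in log.items():
        out[tuple(a for a in trace if a in keep)] += freq
    return dict(out)

def activities(log):
    return {a for trace in log for a in trace}

def total_chaos(log):
    """Return the total chaos degree of the log under conditions (1) to (3)."""
    dfs = Counter()
    for trace, freq in log.items():
        for a, b in zip(trace, trace[1:]):
            dfs[(a, b)] += freq
    acts = activities(log)
    t1 = t2 = t3 = 0
    for i in acts:
        for j in acts:
            f, p = dfs[(i, j)], dfs[(j, i)]  # i -> j and j -> i frequencies
            t1 += (f > 0) + (p > 0)
            if f > 0 and p > 0:
                t2 += 1
                # Assumed threshold: difference below the average of f and p.
                t3 += abs(f - p) < (f + p) / 2
    return (t1, t2, t3)

def indirect_filter(log):
    current, qls = dict(log), [dict(log)]
    while len(activities(current)) > 2:
        acts = activities(current)
        # Total chaos of the log that remains after deleting each activity.
        totals = {a: total_chaos(project(current, acts - {a})) for a in acts}
        avgs = [mean(t[k] for t in totals.values()) for k in range(3)]
        rem = {a for a, t in totals.items() if all(t[k] < avgs[k] for k in range(3))}
        if not rem:  # guard added in this sketch: no removal lowers the chaos enough
            break
        current = project(current, acts - rem)
        qls.append(current)
    return qls

# Assumed example log: trace (tuple of activities) -> frequency.
LOG = {tuple("abcdx"): 10, tuple("abxcd"): 10, tuple("axbcd"): 10, tuple("abcxd"): 10}
qls = indirect_filter(LOG)
```

The projection is computed into a temporary log for each candidate activity, so the working log is only changed once per iteration, after the set of activities to remove is known.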
Take the log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10] as an example.
For condition (1), take activity x as an example. After deleting activity x, the event log is L’ = [<a,b,c,d>^40]. Calculating the relations between activities gives #(<a,b>,L’) = 40, #(<b,c>,L’) = 40, and #(<c,d>,L’) = 40. From the resulting relation table we obtain {CH1(a) = 1, CH1(b) = 2, CH1(c) = 2, CH1(d) = 1}, so the total chaos degree of L’ under condition (1) is 6. This calculation is performed for every activity in the original log, and the average of all results is obtained. For this log, the average value is 12.8. The result for activity x is less than the average value, so x is judged to be a candidate chaotic activity.
For condition (2), take activity b as an example. After deleting activity b, the event log is L’ = [<a,c,d,x>^10, <a,x,c,d>^20, <a,c,x,d>^10]. Calculating the relations between activities gives #(<a,c>,L’) = 20, #(<a,x>,L’) = 20, #(<c,d>,L’) = 30, #(<c,x>,L’) = 10, #(<d,x>,L’) = 10, #(<x,c>,L’) = 20, and #(<x,d>,L’) = 10. From the resulting relation table we obtain {CH2(a) = 0, CH2(c) = 1, CH2(d) = 1, CH2(x) = 2}, so the total chaos degree of L’ under condition (2) is 4. This calculation is performed for every activity in the original log, and the average of all results is obtained. For this log, the average value is 3.6. The result for activity x, which is 0, is less than the average value, so x is judged to be a candidate chaotic activity.
For condition (3), take activity a as an example. After deleting activity a, the event log is L’ = [<b,c,d,x>^10, <b,x,c,d>^10, <x,b,c,d>^10, <b,c,x,d>^10]. Calculating the relations between activities gives #(<b,c>,L’) = 30, #(<b,x>,L’) = 10, #(<c,d>,L’) = 30, #(<c,x>,L’) = 10, #(<d,x>,L’) = 10, #(<x,b>,L’) = 10, #(<x,c>,L’) = 10, and #(<x,d>,L’) = 10. From the resulting relation table, the calculation for condition (3) gives {CH3(b) = 1, CH3(c) = 1, CH3(d) = 1, CH3(x) = 3}, so the total chaos degree of L’ under condition (3) is 6. This calculation is performed for every activity in the original log, and the average of all results is obtained. For this log, the average value is 3.6. The result for activity x, which is 0, is less than the average value, so x is judged to be a candidate chaotic activity.
In summary, for the event log L = [<a,b,c,d,x>^10, <a,b,x,c,d>^10, <a,x,b,c,d>^10, <a,b,c,x,d>^10], the total chaos degrees of the log after deleting a, b, c, d, and x, respectively, are as follows. For condition (1), the totals are 16, 14, 14, 14, and 6, with an average of 12.8, so the candidate chaotic activity is x. For condition (2), the totals are 6, 4, 4, 4, and 0, with an average of 3.6, so the candidate chaotic activity is x. For condition (3), the totals are 6, 4, 4, 4, and 0, with an average of 3.6, so the candidate chaotic activity is x. Activity x is the candidate under all three conditions, so the chaotic activity is x.