1. Introduction
Real-world processes involve internal and external determinants that act together to produce a certain expected output. These determinants, independently or in response to other factors, change over time, causing systematic and long-term shifts in the actual output patterns of the processes. Discovering these shifts and their determinants is thus vital to understanding and managing the behavior of such processes. We note here that the term process is used in a broad sense, not restricted to business or technological systems [
1,
2,
3] that conform to well-defined aims such as producing specific quantities or qualities [
4]. We also consider processes that have complex social, physical, economic or other determinants and interactions within or across their environments [
5]. For instance, commodity price or demand generation is a process that has socio-economic determinants such as collective human behavior and geopolitical, economic or even health-related events [
6,
7,
8]. Although there is no well-defined central control in such processes, stakeholders still seek to discover behavioral shifts for obvious reasons. It may also refer to socio-technical processes such as in information systems development and/or adoption [
9,
10], in timetabling, transportation and healthcare [
11,
12] and in road traffic systems, in which the output may be in the form of accidents or injuries [
13]. In such systems, the concerned stakeholders are interested in discovering process dynamics and their conformity to intended behavior while taking appropriate enhancement measures [
14].
In this context, we observe that the related process-mining literature has taken a predominantly technological perspective. More precisely, the evaluated processes are well-defined, centrally managed, isolated from their wider environment (e.g., social or economic) and mechanistic in their behavior. The literature is rich in this direction, covering process discovery, conformance and enhancement [
15], where significant progress has been made in algorithm development, improvement and scalability [
16,
17,
18]. A gap thus exists, as process mining of socio-economic or socio-technical systems has received minimal attention. Understanding overall behavioral dynamics is vital for such systems, in which behavioral regime shift discovery and analysis are crucial unaddressed questions. This has been recently highlighted by Zerbino, Stefanini and Aloini [
15] in the business context, who pinpoint a lack of coverage of the social side of business problems.
In this work, we aim to address this gap, and we propose a methodology that applies to studying behavioral regime shifts or changes in processes of such varied nature. Specifically, we aim to deal with these dynamic and temporally shifting processes at two levels. The
first level involves finding all the behavioral regime shifts [
19]. To illustrate, we use the time series event log of two different processes (
Figure 1), in which regime changes are evident. In the first example (
Figure 1a), we see three distinct regimes: A, B and C, in which the process output
level significantly and systematically increased between regimes A and B. Later, between regimes B and C, the level significantly decreased. In the direction or trend example (Figure 1b), regime A similarly shows stable behavior, whereas regime B shows a systematic rising trend. We also note that we focus on significant, systematic and potentially explainable shifts, not on small random fluctuations. Once all regime shifts are discovered at the first level, the
second level focuses on
statistically finding their determinants. Although details are presented later, in summary, this phase involves identifying potential determinants, which are then statistically evaluated for their role in the formation of these regime shifts. Clearly, a lack of such knowledge leads to inaccurate process understanding and predictability and, inevitably, to poor process management and control.
Formally, we thus translate our aim in this work to the following three linked research questions:
Are there any regime shifts in the process output?
What are the potential determinants that may have affected the regime change?
What role do these potential determinants play in forming these regimes?
We propose a novel algorithm that addresses the above questions in a stepwise fashion. Before discussing it in detail, we first refer to a common process modeling framework that translates different process views into a commonly agreed model, which is then used in the regime shift analysis. This framework, which is theoretically grounded, well-defined and applicable to most process scenarios, is borrowed from the recent process modeling literature [
20], and it is briefly discussed in
Section 3.1.
Once a process is mapped to a model, the algorithm’s first step defines its modeling basis, referring to the aspect of the process output being analyzed. For example, in a particular manufacturing process, the production line analyst may be interested in the output levels. In contrast, the quality control analyst may look for any systematic trend (positive or negative) in a quality characteristic. These two distinct aims can be captured via the respective level and trend regression modeling bases. Once a statistical modeling basis is agreed upon, it is used with the output event log time series data to test for any regime changes. Since, at this stage, we are concerned only with discovering all regime shifts, the modeling basis is kept in its minimal form, i.e., the level model has the intercept term only, and the trend model has an intercept and a proxy time-index term capturing any trend element in the data. Because this form contains no explicit determinant variables, it allows the maximum number of regime shifts to be unearthed.
Once all regime shifts are discovered, the methodology addresses the second question: identifying potential process determinants. This question is mainly dealt with at the modeling stage mentioned earlier, in which a process model is built to identify any related elements. As these determinants are found, the relevant time series data are further gathered from event logs. A key aspect to note here is that a record of the changes in these determinants over the timeline is needed, which can then be used in the next phase to statistically link any change in the process output to a change in a particular determinant.
The rest of the methodology deals with statistically relating changes in determinants to regime shifts in process output data (the third question). The idea is to sequentially introduce determinants as independent variables to the base model, one at a time. The introduction sequence is based on the first appearance of change on the timeline in these determinants. Each time a determinant variable is introduced, the revised model is statistically fitted and tested again for regime shifts. If the statistical tests now fail to recognize a regime shift discovered in the first stage, we deduce that this determinant played a causal role in forming this shift. This step is repeated until all potential determinants are introduced and tested. Finally, the fitted models can then be analyzed for their significance or impact on the process output. It is also easy to see that this approach can be used in conformance analysis when a determinant is intentionally adjusted for a potential enhancement.
To demonstrate the merits of our proposed approach, we use three case studies drawn from manufacturing, commodity price generation and road traffic safety analysis. These cases vary significantly in their technical, socio-economic and socio-technical nature and in their levels of complexity. The results indicate regime shifts in all cases, with differing relationships to their respective determinants. The first case is used as a validation or test case, in which the actual relationships and regime structure are known in advance, whereas the other two cases demonstrate the use of the methodology in distinct and complex scenarios.
In the rest of the paper, we present a discussion in
Section 2 on the recent process mining literature and how our work is positioned within the extant literature.
Section 3 then presents the proposed methodology, where we first present the process modeling framework, followed by the methodology and its details. The three case studies and their analyses are presented in
Section 4, and conclusions of our study are presented in
Section 5.
2. Literature Review
Process mining initially emerged as a workflow mining technique to extract a process model from software engineering data [
21,
22]; other aspects, such as process cost, risk reduction, productivity maximization, resource utilization and quality improvement, have progressively gained attention [
Although this makes the process mining landscape quite extensive, with multiple recent literature reviews providing a comprehensive account [
15,
23,
24], we mainly discuss those works that help us position our contribution within the extant literature. Moreover, we focus on identifying the application context, the methodology and/or tools used and the aim in terms of process discovery, conformance or enhancement within those works.
Bernardi et al. [
25] used smart meter readings to discover anomalous customer behavior in energy usage over time using Hamming distance and cosine similarity. In another paper, Bernardi et al. [
26] applied process mining on call traces to characterize smartphone applications for malware detection. Myers et al. [
27] investigated cyberattacks in industrial control systems by comparing five algorithms to create an accurate yet simple process model and recommended Inductive Miner as the most suitable algorithm. Sahlabadi et al. [
28] introduced genetic process mining to detect deviant user behavior on social media websites. They applied their approach to Facebook users, first generating a process model of normal user behavior and then identifying abnormal behavior through conformance checking. For web application security, Compagna et al. [
29] proposed Aegis—a tool to process mine a target web application to discover its workflow model and consequently enhance its security policies. In the same context, Bernardi et al. [
30] combined model-driven engineering and process mining, applying the Unified Modeling Language to generate the formal model and ProM visualization techniques to identify deviations [
31].
In the software development arena, Leppäkoski and Hämäläinen [
32] used process mining as an aid for agile (human-centric) development. In contrast, Gupta et al. [
33] used it for software maintenance or enhancement. In pre-production software quality assurance, a common process-mining-based research trend is the minimization or detection of errors and bugs. To enhance software reliability, the recent works of Rubin et al. [
34] and Xu et al. [
35] focused on identifying errors or bugs, and Lübke [
36] focused on generating test cases for conformance checking.
In healthcare, one of the early papers by Ciccarese et al. [
37] used process discovery for resource analysis and conformance to clinical guidelines. Later, for role description and collaboration among hospital emergency department professionals, Günther and Van Der Aalst [
38] and Li et al. [
39] demonstrated the usefulness of the Fuzzy Miner and MinAdept algorithms, respectively, in dealing with the inherently unstructured characteristics of healthcare process models. In emergency medicine, process mining has recently been applied to the initial evaluation and diagnosis of accidental injuries and unexpected diseases and to coordination among care providers for surgical and non-surgical treatments [
40,
41,
42].
Trcka et al. [
43] demonstrated process discovery and conformance checking in education to identify student behavior using the ProM tool. In the same domain, Okoye et al. [
44] discovered behavioral patterns and rules for personalized learning with Web Ontology Language and Semantic Web Rule Language. For improving student learning in a university environment, Groba et al. [
45] used the SoftLearn tool to analyze student flows based on social networks with a graphical interface.
In the business domain, process mining aims to fill the gap between data mining and business process management [
46]. As per Porter’s Value Chain, the literature coverage in this context is divided into primary processes, including logistics, service, marketing and sales, and operations, and into secondary processes, such as infrastructure, procurement, research and development (R&D), and human resource management (HRM) [
15]. In business services, Cho et al. [
47] redesigned the customer reservation process of the largest travel agencies in Korea. Moreover, Syamsiyah et al. [
48] compared form handling process variants within Xerox Services for actionable insights on the most common process behaviors. Similarly, marketing and sales have previously been analyzed using data mining techniques such as clustering, profiling and predictive modeling [
49]. The latest research in this area was reported by Măruşter and van Beest [
50], who redesigned the booking process of a utility company considering process and time perspectives.
In operations, Roldán et al. [
51] exploited process, time and organizational perspectives in an emergency context to detect bottlenecks and inefficiencies in multi-robot missions. Ruschel et al. [
52] integrated process mining and Bayesian networks for predicting maintenance intervals for manufacturing equipment. In logistics, Paszkiewicz [
53] investigated a warehouse management system of a manufacturing company with conformance checking, and Sutrisnowati et al. [
54] assessed lateness probability in container handling. In another work, Repta et al. [
55] analyzed event data of a warehouse to reconstruct a process model using Global Positioning System devices and Radio Frequency Identification readers. Finally, in the procurement process, Jans et al. [
56], Outmazgin and Soffer [
57] and Reijers et al. [
58] detected internal control violations or workarounds, and Fleig et al. [
59] streamlined and standardized the IS-supported procurement process of three manufacturing companies with a process-mining-enabled decision support system.
It is clear from the above discussion that the bulk of the process mining literature has focused on problems that may be conceived of as technological systems, in the sense that their expected behavior is mechanistic owing to their well-defined structure and that they do not involve substantially intertwined social and economic determinants. Therefore, a major gap exists in mining the socio-economic or socio-technical systems that are prevalent in the wider technological, social, political and economic spheres. Zerbino, Stefanini and Aloini [
15] have recently highlighted the same issue of the social side of business problems being ignored. Consequently, a general approach is needed that can mine the complex process behaviors of such systems. We thus propose a methodology that seeks to capture behavioral changes (regime shifts) in such complex systems, knowledge of which can be employed in process conformance and enhancement.
3. Methodology
The proposed methodology sequentially addresses the three research questions stated in
Section 1, i.e., revealing any regime shifts present in the process output data, identifying potential determinants and statistically linking these shifts to those determinants.
The methodology comprises the following steps: (1) translating different process views to a commonly agreed model. This translation is performed using the generally applicable Stochastic System and Process Modeling (SSPM) Framework [
20], built upon the theoretical foundation of Bunge’s Ontology [
60,
61]. (2) Using this model, a statistical modeling basis is selected as the process output evaluation criterion in the regime change analysis. We note that this criterion may vary according to the output of the process being evaluated and the analyst’s choice. (3) By using a minimal form of this basis (i.e., the statistical model with no explicit determinant variable included), statistical tests are performed to identify and date all regime shifts present in the data. (4) Process determinant variables are then sequentially introduced into the base model according to their change on the timeline and are statistically tested again for regime shifts. If an earlier regime shift disappears, it shows that the introduced determinant has a causal relationship with the regime shift. This step is repeated until all potential determinants are introduced and tested. (5) Lastly, the fitted model is analyzed for its significance or impact on the process output. A detailed account of all these steps is presented in the following sub-sections.
3.1. Process Modeling and Mapping
The most primitive SSPM concept is a Thing, which is an individual x that exists independently and has some defining properties p(x) ∈ P (P being the set of all possible properties of x). At any time t, every thing possesses a state s ∈ S (where S is the set of all possible states), determined by the current values of its perceived properties, called attributes. These states are governed by some state function following state laws (probabilistic or deterministic). Using these constructs, we define a system Y as a coupled collection of interacting things (y) that demonstrates basic properties (present in constituent things) or emergent properties (not present in constituent things), i.e., p(y). Other things or systems affecting the system shape its environment.
Such a system is expected to execute some process, i.e., a sequence of unstable system states leading to some stable state (ideally its goal). This is reflected in a change in system attributes, including the considered output (i.e., the attribute of interest). Here, process validity is a key issue, as the process may or may not lead to a stable state, or to the intended state, within a reasonable time. In this context, a valid process has a process path (i.e., a change in states leading to a stable state) that ends within its generally defined goal states (e.g., a success) and within a finite time bound. Moreover, the goal state set must be reachable via at least one valid path. We must mention here that the sequence of states depends on internal and external triggers and events and on the governing transition laws. We further distinguish between the properties and system couplings that are directly affected or changed by these triggers and the properties that, as a result, adjust to the change. We refer to the latter as dependent variables and to the former as independent variables (only in relation to the latter). In this sense, a well-designed process is one that ends in its goal state, whereas a failed process does not.
In reality, a system may face varying internal or external environmental triggers, potentially leading to varying resulting processes. In the process mining context, the question is thus to empirically test if the enacted process is valid and has led to its intended goals. Thus, we can generally represent the problem of interest as shown in
Figure 2, where we can identify the System of Interest (SoI) producing the processes, and its properties may allow us to identify both key determinants and the property of interest (a dependent process output) to analyze. Moreover, external determinants can be identified based on external systems and things in the environment.
Finally, we note that the triggers can be intentional, unintentional, visible, or invisible. Accordingly, the process mining task here can be of discovery, conformance, or enhancement. In all cases, the above modeling approach holds.
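To make the mapping concrete, the SSPM constructs can be captured in simple data structures before any mining is performed. The sketch below is our own illustrative Python encoding (the class and field names are assumptions, not part of the SSPM specification), instantiated for the lathe system of Case 1 in Section 4.1.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Thing:
    """An individual x with its perceived properties (attributes)."""
    name: str
    attributes: Dict[str, float] = field(default_factory=dict)

@dataclass
class System:
    """A coupled collection of interacting things with basic/emergent properties."""
    name: str
    things: List[Thing] = field(default_factory=list)
    emergent_properties: List[str] = field(default_factory=list)

@dataclass
class ProcessModel:
    """Mapping used by the mining procedure: the system of interest (SoI),
    the dependent output property p'(y) and the candidate determinants in D."""
    soi: System
    output_property: str          # p'(y): the attribute of interest (dependent)
    determinants: List[str]       # D: internal/external candidate determinants

# Illustrative instantiation for the lathe system of Case 1 (values are arbitrary)
lathe = System(
    name="lathe",
    things=[Thing("cutting_tool", {"tool_length": 50.0}),
            Thing("steel_rod", {"diameter": 105.0})],
)
model = ProcessModel(soi=lathe, output_property="diameter",
                     determinants=["tool_length"])
print(model.output_property, model.determinants)
```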
3.2. Algorithm and Statistical Tests
We now present the method to identify regime shifts in the SoI’s process output (i.e., the dependent attribute of interest) and to find its causal relationship with potential determinants. The high-level algorithm is first presented (
Figure 3) and discussed, and details of statistical modeling and testing are discussed in the latter part of this section.
As shown in
Figure 3, the procedure requires an
initialization phase, in which the first step is to model the system and its properties (denoted as
Y and
p(
y), respectively) via the SSPM framework. The modeling approach is already discussed in
Section 3.1 above. Based on
p(
y), a dependent property
p′(
y) ∈
p(
y) of interest is identified, whose time series event log data are considered for process mining. Furthermore, all determinants
d ∈
D are identified, which may be related properties and/or couplings (internal or environmental). Once the model is formed, the evaluation or the mining criterion is set for
p′. This criterion is based on the analyst’s preference and on the attribute of interest. Accordingly, the criterion is translated to a minimally described model
B, i.e., with no determinants initially considered. This minimal model allows the discovery of all regime shifts, irrespective of their causes. Thus, in the first
post initialization step, model
B is fitted and tested for all regime shifts using the time series data for
p′ with a statistical procedure. This procedure involves determining and dating all present regime shifts, the details of which are discussed later in this section. If no regime change is detected, the procedure is terminated; otherwise, the procedure moves to the phase for
modeling and testing the impact of determinants. In this phase, a potential (or even exhaustive) subset is extracted from
D (i.e.,
d′ ⊆
D). This list is then ranked (
D′: the ranked list) based on the earliest chronological change in the respective determinants, i.e., the determinant whose state changes first on the timeline is placed as the first item in the list, and so on. Notably, ranking and testing in this sequential and incremental manner allows the actual timeline of events that may have caused the various regime shifts to be simulated. In this step, when an independent determinant variable, based on its rank, is added to the base model, the updated model (
B′) is then fitted and tested again for regime shifts. During testing, if a regime shift found in the previous phase remains undetected, it is inferred that the determinant has (at least partially) caused the formation of this shift. Finally, the fitted model is used to
compare the impact of determinants having a causal relationship with p′.
Statistical Modeling and Testing
The above procedure requires an appropriate statistical modeling approach and a general testing procedure accommodating varied modeling bases. We discuss both issues in this section; first, however, we introduce the notation used in our discussion:
$t_i$: the time index used with the time series event log data, $i = 1, \dots, n$ (i.e., for $n$ observations)
$y_i$: the time series event log value of the evaluated dependent property at time $i$
$\mathbf{d}_i$: a vector of determinant regressors of size $k$ at time $i$
$\boldsymbol{\beta}$: a vector of model coefficients for all regressors
$\varepsilon_i$: random noise at time $i$
$m$: the number of regime shifts in the data
$T = \{\tau_1, \dots, \tau_m\}$: the set of all regime shift points
$j$: the regime index; with $m$ regime shift points, we have $m + 1$ regimes.
In the above notation, $y_i$ is the time series event log for the dependent system property and $\mathbf{d}_i$ is the same for the considered determinants. As noted in the initialization phase of the algorithm (Figure 3), an evaluation criterion is needed to form the base statistical model. Specifically, the form of the base model depends on the nature of the considered properties, the determinants and the evaluation criterion.
To illustrate, we present two scenarios and their corresponding general model forms. The first scenario involves a property $y_i$ that is assumed to be normally distributed; this may be the case for a product quality characteristic produced by a manufacturing process. The second is the case in which $y_i$ represents count data, which thus follow a Poisson or a negative binomial distribution; this case is prevalent in, e.g., traffic accident data involving accident counts. For the two cases, we require separate modeling bases, which in generalized terms are represented, respectively, as:

$y_i = \beta_0 + \boldsymbol{\beta}^\top \mathbf{d}_i + \varepsilon_i$   (1a)

$\ln(E[y_i]) = \beta_0 + \boldsymbol{\beta}^\top \mathbf{d}_i$   (1b)
As demonstrated above, modeling bases for other situations can be formed analogously. Although the above generalized, fully defined models show a relationship structure between the dependent variable and all the determinant variables, the initial requirement of the procedure is the minimal model form, which is used to discover all regime changes in $y_i$. Thus, the above generalized but fully defined model forms can be rewritten in their minimal forms as:

$y_i = \beta_0 + \beta_1 t_i + \varepsilon_i$   (2a)

$\ln(E[y_i]) = \beta_0 + \beta_1 t_i$   (2b)

Here, we removed the variables for all considered determinants from Equation (2a) and Equation (2b) and replaced them with a simple time-index term $t_i$. This term in the minimal model captures any trend in the data; for the level basis, it can be dropped as well, leaving the intercept term only. Later, the respective determinant regressor variables are added back when evaluating their roles.
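As a minimal illustration of the two bases, the sketch below fits Equation (2a) by ordinary least squares and Equation (2b) by a Poisson GLM using the Python statsmodels package; the synthetic series and coefficient values are arbitrary and serve only to show the model forms.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
t = np.arange(1, n + 1)                      # proxy time-index term t_i
X = sm.add_constant(t.astype(float))         # intercept + trend regressors

# Minimal level/trend basis of Eq. (2a): y_i = b0 + b1*t_i + e_i
y_cont = 10 + 0.05 * t + rng.normal(0, 1, n)
level_trend_fit = sm.OLS(y_cont, X).fit()

# Minimal basis of Eq. (2b) for count output: log E[y_i] = b0 + b1*t_i
y_count = rng.poisson(np.exp(1.0 + 0.01 * t))
count_fit = sm.GLM(y_count, X, family=sm.families.Poisson()).fit()

print(level_trend_fit.params)                # estimates of b0, b1 (level/trend model)
print(count_fit.params)                      # estimates of b0, b1 (count model)
```

The level basis used later in Case 1 corresponds to dropping the trend column, leaving the intercept only.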
Statistical testing: For any of the above model structures, the regime shift problem is simply one of hypothesizing a change in the coefficient vector $\boldsymbol{\beta}$ of the determinants. In other words, if the coefficients remain stable throughout the timeline, there is no regime change; conversely, a change in even one of the coefficients implies a regime change. Accordingly, we define the null and the alternative hypotheses as follows:

$H_0$: $\boldsymbol{\beta}_i = \boldsymbol{\beta}$ for all $i = 1, \dots, n$ (the coefficients are stable over time)

$H_1$: $\boldsymbol{\beta}_i \neq \boldsymbol{\beta}$ for at least one $i$ (the coefficients change at one or more points in time)
For the general case of $m$ regime shifts, we have $m$ changes in otherwise stable coefficients or, simply, $m + 1$ regimes. In this case, the minimal models presented in Equation (2a) and Equation (2b) can be rewritten as:

$y_i = \beta_{0,j} + \beta_{1,j} t_i + \varepsilon_i, \quad \tau_{j-1} < i \le \tau_j, \quad j = 1, \dots, m + 1$

$\ln(E[y_i]) = \beta_{0,j} + \beta_{1,j} t_i, \quad \tau_{j-1} < i \le \tau_j, \quad j = 1, \dots, m + 1$

where $\tau_1 < \dots < \tau_m$ are the regime shift points in $T$, with $\tau_0 = 0$ and $\tau_{m+1} = n$.
We also note that the number and locations of regime shifts are generally unknown in this scenario; thus, all shifts need to be determined empirically. Moreover, as we are not limiting ourselves to a particular model form, a generally applicable statistical testing approach is needed that is valid under the respective assumptions applied to these models. Accordingly, we resort to a statistical procedure that tests for an unknown number of regime changes and finds both their number and their locations. These shift dates are then used to fit the model on the stable regime segments. The key to this procedure is two statistical tests developed by Bai and Perron [62,63] and Zeileis et al. [64]. The tests are quite robust in terms of assumptions on the nature and distribution of both the regressors and the noise. To explain the nature of these tests, we first consider the tests developed by Bai and Perron [62] that are applicable to a single regime change. The first is an F-test [65], in which only one shift, of unknown timing, is assumed. The F-statistic used is:

$F_i = \dfrac{\hat{u}^\top \hat{u} - \hat{u}(i)^\top \hat{u}(i)}{\hat{u}(i)^\top \hat{u}(i)/(n - 2k)}$

where $\hat{u}$ denotes the residuals of the model fitted without a shift and $\hat{u}(i)$ the residuals of the model fitted with a shift at time $i$. The test assumes an alternatively hypothesized regime shift at each considered time $i$ and thus uses the sequence of statistics $F_i$, $i = h, \dots, n - h$; the residuals are thereby compared for the competing models, i.e., with and without regime segments. The null hypothesis is rejected when the supremum of these statistics, $\sup_{h \le i \le n-h} F_i$, exceeds a critical threshold, and the shift date is then estimated as the maximizer $\hat{\imath} = \arg\max_{h \le i \le n-h} F_i$. We note here that the index $h$ in the supremum represents a parameter for a reasonable minimum time segment length; in other words, based on the system and process scenario, a minimum period length is defined for a shift to be considered. For the generalized case of $m$ regimes, we use a test variant developed by Bai and Perron [62,63], in which the same evaluation is performed for $m$ vs. $m + 1$ regime shifts. We also make use of a second, generalized fluctuation testing framework. In this test, the model is fitted to the data, and an empirical fluctuation process of its residuals, governed by a functional central limit theorem, is determined [64]. Any excessive fluctuation of this process trajectory suggests a deviation from the null hypothesis. We refer the reader to [62,63,64] for full details of both procedures.
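To make the two ideas concrete, the following sketch runs an explicit sup-F search and an OLS-based CUSUM test (one member of the generalized fluctuation family) on a synthetic series containing a single level shift. The synthetic data, the trimming value h and the use of Python are our own illustrative choices; the analyses reported in this paper follow the Bai and Perron [62,63] and Zeileis et al. [64] procedures, as implemented, e.g., in the R strucchange package.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import breaks_cusumolsresid

rng = np.random.default_rng(1)
n, h = 120, 15                                   # h: minimum admissible segment length
t = np.arange(1, n + 1)
y = np.where(t <= 80, 100.0, 105.0) + rng.normal(0, 0.3, n)   # one level shift
X = sm.add_constant(t.astype(float))
k = X.shape[1]

pooled = sm.OLS(y, X).fit()                      # model without any regime segmentation

# sup-F search: compare the pooled fit with a fit segmented at each candidate i
F = []
for i in range(h, n - h):
    ssr_seg = sm.OLS(y[:i], X[:i]).fit().ssr + sm.OLS(y[i:], X[i:]).fit().ssr
    F.append((pooled.ssr - ssr_seg) / (ssr_seg / (n - 2 * k)))
i_hat = h + int(np.argmax(F))
print("sup F =", round(max(F), 1), "with the most likely shift at observation", i_hat + 1)

# OLS-based CUSUM test on the pooled residuals (generalized fluctuation framework)
stat, pval, _ = breaks_cusumolsresid(pooled.resid, ddof=k)
print("OLS-CUSUM statistic:", round(stat, 2), "p-value:", round(pval, 4))
```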
Once these tests confirm the presence of regime shifts, the points at which the shifts occurred have to be dated, and their confidence intervals determined. The dates are needed to specify the segments’ timeline ranges and to fit the corresponding segment models. Assuming an arbitrary segmentation, i.e., a candidate set of dates $(\tau_1, \dots, \tau_m)$, the resulting sum of squared residuals can easily be determined as $R(\tau_1, \dots, \tau_m) = \sum_{j=1}^{m+1} r_j$ ($r_j$ being the sum of squared residuals of regime segment $j$). The actual regime shift dates are then the ones that globally minimize this function, i.e., $(\hat{\tau}_1, \dots, \hat{\tau}_m) = \arg\min_{\tau_1, \dots, \tau_m} R(\tau_1, \dots, \tau_m)$ [63]. Due to the computational complexity of the problem when the number of segments is large, Bai and Perron [63] proposed a dynamic programming method that finds this optimal segmentation efficiently. They also suggested a method for finding the corresponding confidence intervals. Their method puts minimal restrictions on the distributions of the data and the residuals and is thus suitable for varied cases. Once the regime segments are found, the potential determinants can be evaluated for their role in their formation. The same statistical tests hold in this case as well; the only difference is that a determinant variable is introduced based on its rank and the data are then tested again for regime shifts using the above procedure. If the tests fail to find an earlier shift, it is implied that the introduced determinant has played a role in its formation. This procedure is repeated every time a determinant variable is introduced. When all potential determinants have been tested, the fitted models are analyzed for their scale of impact on $y_i$.
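The dating and determinant-evaluation steps can be sketched as follows. Here the dynamic-programming search of the Python ruptures package stands in for the Bai-Perron segmentation, and an OLS-based CUSUM test stands in for the re-testing step; the data, the shift_pvalue helper and the 0.05 threshold are illustrative assumptions rather than part of the published procedure.

```python
import numpy as np
import pandas as pd
import ruptures as rpt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import breaks_cusumolsresid

def shift_pvalue(y, X):
    """OLS-CUSUM p-value for the null of 'no regime shift' in the fitted model."""
    res = sm.OLS(y, sm.add_constant(X)).fit()
    return breaks_cusumolsresid(res.resid, ddof=X.shape[1] + 1)[1]

# Synthetic output whose level shifts when a known determinant switches state
# at observation 90 (all values are illustrative).
rng = np.random.default_rng(2)
n = 200
t = np.arange(1, n + 1)
policy = (t >= 90).astype(float)                   # determinant: first change at t = 90
y = 20 + 4 * policy + rng.normal(0, 0.5, n)

# Date the regime shift(s): dynamic programming minimizing the summed
# within-segment squared residuals, in the spirit of the Bai-Perron procedure.
bkps = rpt.Dynp(model="l2", min_size=20).fit(y).predict(n_bkps=1)
print("estimated shift point(s):", bkps[:-1])      # the trailing element is n itself

# Sequentially introduce ranked determinants and re-test for regime shifts.
base = pd.DataFrame({"t": t})
ranked_determinants = {"policy": policy}           # ordered by first change on the timeline
for name, series in ranked_determinants.items():
    p_before = shift_pvalue(y, base)
    p_after = shift_pvalue(y, base.assign(**{name: series}))
    if p_before < 0.05 <= p_after:
        print(f"{name}: earlier shift no longer detected -> causal role inferred")
    base[name] = series
```

In practice, the number of shifts would be selected empirically (e.g., via sequential m vs. m + 1 tests) rather than fixed in advance, as discussed above.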
4. Case Studies
In this section, we analyze three cases that are of a technical, socio-economic and socio-technical nature, respectively. The first case involves a steel rod cutting process producing several rods per day to a particular design specification. The objective is to discover if and when the process systematically deviates (perhaps due to tool breakage) from its quality objective, i.e., a specified rod diameter. The second case involves the oil price generation process. Here, the mining task aims to discover all regime shifts in the oil price data and to statistically test whether deaths caused by COVID-19 produced any regime changes. Finally, we present the case of deaths caused by road accidents in a state of a Middle Eastern country. The state underwent a road safety program, and the government wanted to evaluate whether three of its key initiatives had any significant impact on improving the existing situation.
4.1. Case 1: Manufacturing Quality
The first case involves a technological process. More precisely, we evaluate the output of a
turning process, which originates from a simple
lathe machine system. This system comprises several interacting components (or
things) that include structural body parts, a cutting tool, a motor and a steel rod vised and rotated for cutting. Although several basic and emergent properties can be identified for this system, the
dependent property of interest is the rod diameter that changes when the turning process is enacted. The
aim of the process is to reduce the diameter of the rod to a specified level (
Figure 4). This process is continuously repeated to manufacture several hundred rods. After a rod is produced, the reduced diameter is logged and checked for acceptability. The process is prone to tool breakage (tool length being an internal determinant property), which causes a sudden increase in the diameters produced. The intention is to catch such a regime shift in the diameter data so that the needed tool replacement can follow. We ignore any external environmental determinants (e.g., room temperature) in this analysis for simplicity.
This case is selected mainly for two reasons. First, it is a simple, easily visualized technological process involving just one dependent property, i.e., the diameter of the rod, and one independent internal process determinant, i.e., the tool length. Second, we know exactly when the tool broke (nearly 5 mm of the tool tip was lost while the 80th rod was being processed), causing a regime change in our diameter log at the corresponding time, and we therefore use these data to validate the efficacy of the methodology.
The time series event log for this case is presented in
Figure 5, which reports data for 120 steel rods produced in sequence. The intended reduced diameter is 100 mm. Following the tool breakage, an increase in the diameter can be seen at the 80th rod. As the tool continues to be used, an increase of approximately 5 mm remains clearly visible in the diameters of the remaining rods, matching the length lost from the broken tool. For this problem, the determinant data are the tool length, recorded after every cutting run.
To test and validate the proposed methodology, the criterion used is the level basis, for which the minimal base model of Equation (1a) and Equation (1b) is simply reduced to $y_i = \beta_0 + \varepsilon_i$, where $y_i$ represents the diameter of the i-th rod. Using this minimal base model, we tested the rod diameter time series data for regime changes. The results from both statistical tests are shown in
Figure 6a,b, respectively, which clearly show a deviation and a peak well outside the critical band (indicated by red lines) at around the 80th data point, indicating at least one regime shift.
We then moved to the steps for dating the shift(s) and producing the corresponding segmented base minimal models. The results for both these aspects are shown in
Figure 7,
Table 1 and
Table 2. First, they indicate that, at a confidence level of 97.5%, the shift occurred between rod numbers 79 and 81 (most likely at the 80th rod). Second, they show that two data segments are created, i.e., the first time series segment covering rods 1 to 79 and the second covering rods 80 to 120. The fitted model segments (Table 2) are also graphically depicted in Figure 7, which shows the average diameter to be around 100.05 mm in the first segment and 105.12 mm in the second. A single level model fitted to the whole data is also shown (grey dashed line) in the figure to demonstrate why it is important to identify the various regimes and to fit a separate model for each.
We then evaluated the role of the tool-tip length, and any breakage thereof, in forming the new regime. Accordingly, the minimal model is revised to include the tool-tip length as an internal determinant variable, and the base model changes to $y_i = \beta_0 + \beta_1 d_i + \varepsilon_i$, where $d_i$ (ToolTip, the variable name) is the new independent variable for the tool-tip length. The updated model is tested again. The statistical tests failed to detect the regime change found earlier (Figure 8a,b). The details of the new fitted model are shown in Table 3, showing that the ToolTip variable is highly significant (Pr(>|z|) < 2 × 10⁻¹⁶ ***). The graphical plot of the fitted model, with no regime changes, is shown in
Figure 9.
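In outline, the revised Case 1 model can be fitted as in the sketch below, which uses a synthetic stand-in for the rod event log (the real log is not reproduced here) together with the statsmodels formula API; only the variable name ToolTip and the general setup are taken from the text, and all numerical values are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Case 1 event log: 120 rods, target 100 mm,
# with roughly 5 mm of the tool tip lost at the 80th rod (illustrative values).
rng = np.random.default_rng(3)
rod = np.arange(1, 121)
tooltip = np.where(rod < 80, 30.0, 25.0)            # recorded tool-tip length (mm)
diameter = 100.0 + (30.0 - tooltip) + rng.normal(0, 0.1, rod.size)
log = pd.DataFrame({"Diameter": diameter, "ToolTip": tooltip})

# Revised base model with the internal determinant: y_i = b0 + b1*d_i + e_i
fit = smf.ols("Diameter ~ ToolTip", data=log).fit()
print(fit.summary().tables[1])                      # ToolTip coefficient and its p-value
```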
4.2. Case 2: Crude Oil Price and Impact of COVID-19
The second case involves the
process of
crude oil spot price generation. This process is generated from a
complex socio-economic system involving oil producers, transporters, governments, end-users and several other actors as its components or things. It also includes complex interactions between these things, which may take the form of politics, trade or conflicts. The process is socially and economically complex, with many identified or unidentified determinants, which makes the system boundaries and its environment unclear. For now, we treat this system as a black box, in which the main output or dependent variable of interest is the weekly WTI crude oil spot price. The event log analyzed covers the period from 1 January 2015 to 31 December 2021 (
Figure 10), which is obtained via the U.S. Energy Information Administration website (
https://www.eia.gov/ accessed on 1 February 2022).
The aim of this mining exercise is twofold. First, we seek to identify all regime shifts. Second, out of the many determinants, we aim to evaluate the role of COVID-19 deaths in the oil price formation. As the planning horizon is long, we can safely assume several time-dependent determinants, such as population and industrial growth. Accordingly, the minimal base model used has both the level and the trend components, i.e., we use $y_i = \beta_0 + \beta_1 t_i + \varepsilon_i$, as suggested in Equation (2a) and Equation (2b). This model, as we recall, has a proxy time-index term to capture any trend in the data. We thus used this minimal base model and tested the price data for regime changes. For brevity, we directly present the graphical results in
Figure 11. The results indicate twelve regime shifts during the analysis period, the location and confidence intervals of which are indicated in
Table 4.
To test whether the deaths caused by COVID-19 impacted the formation of any of these breaks, we used the COVID deaths log (
Figure 12) obtained from Our World in Data (
https://ourworldindata.org/ accessed on 1 February 2022). We tested the model again with the added determinant variable for COVID-19 deaths and found that the regime shift located at week 262 was no longer recognized by the statistical tests. This week corresponds to the second week of January 2020, in which a sharp rise in deaths started to appear. The refitted model, with one less segment, is shown in
Figure 13. Similarly, other determinants can be tested for regime changes.
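A sketch of this step is given below. The file names, column names and weekly alignment are assumptions about local extracts of the cited EIA and Our World in Data sources, and an OLS-based CUSUM test is used as a simple stand-in for the full re-testing procedure described in Section 3.2.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import breaks_cusumolsresid

# Hypothetical local extracts of the cited sources (file and column names assumed):
# wti_weekly.csv   -> columns: week_ending, price   (EIA weekly WTI spot price)
# covid_deaths.csv -> columns: date, new_deaths     (Our World in Data, daily)
prices = pd.read_csv("wti_weekly.csv", parse_dates=["week_ending"])
deaths = pd.read_csv("covid_deaths.csv", parse_dates=["date"])

# Aggregate daily deaths to a weekly grid and align it with the price log
weekly_deaths = (deaths.set_index("date")["new_deaths"]
                 .resample("W-FRI").sum().rename("covid_deaths"))
log = prices.set_index("week_ending").join(weekly_deaths).fillna({"covid_deaths": 0})
log["t"] = range(1, len(log) + 1)

def shift_pvalue(df, regressors):
    X = sm.add_constant(df[regressors])
    resid = sm.OLS(df["price"], X).fit().resid
    return breaks_cusumolsresid(resid, ddof=X.shape[1])[1]

# Level + trend base model, then the COVID-deaths determinant added
print("base model p-value:", shift_pvalue(log, ["t"]))
print("with COVID deaths :", shift_pvalue(log, ["t", "covid_deaths"]))
```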
4.3. Case 3: Deaths in Road Accidents and Impact of the Safety Program
The third case belongs to the road traffic system, which can be classified as a socio-technical system. This system involves several things, including the road network infrastructure, road safety rules, cars and drivers, among others. Several environmental factors, such as the weather, may also be important. For this system, we are interested in mining the process generating major road accidents leading to deaths. We mined a ten-year log obtained for a major Middle Eastern region (Figure 14). The main determinants considered are the road safety measures implemented during the same period. These measures include strict seatbelt penalty laws, which came into full effect by January 2015. Following this, cameras for detecting red-light running were introduced in March 2015. Finally, automatic detection cameras with heavy penalties for speeding were installed by 1 March 2017.
The mining objectives are again twofold, i.e., determining whether there are any positive or negative regime changes in road accident deaths and whether the safety measures played any role in enhancing the existing traffic safety situation. In contrast to the first two cases, we are here dealing with count data, which may follow a Poisson or a negative binomial distribution depending on the relationship between the mean and the variance of the data. We found the mean to be 60.25 deaths/month and the variance to be 406.91, indicating overdispersion, and thus we considered a generalized linear regression model with a negative binomial distribution. Considering both the level and the trend, the minimal form of the model turns out to be $\ln(E[y_i]) = \beta_0 + \beta_1 t_i$ (as suggested via Equation (2a) and Equation (2b)). The model was tested for regime changes, and the results are shown in
Figure 15. The results show two regime changes: a sharp rise in deaths is evident in the first regime, followed by a lowering of the death rate in the second regime and a clear negative trend in the third regime.
Since the new safety measures were introduced at different times, we ranked these measures by their dates of introduction and tested them in that sequence. The results of this incremental procedure are shown in
Table 5.
The first iteration shows the results for the base model with level and trend only. Both the level and the trend term turn out to be significant in all three segments. When the seatbelt variable was introduced, none of the detected regime shifts disappeared; this determinant seems to have only a mild effect, confined to the second regime segment. A similar outcome is evident in the case of red-light cameras. However, when speed cameras were introduced, the second shift was no longer detected, showing that the last segment’s negative trend is attributable to this measure.
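The structure of the Case 3 model can be sketched as follows, with synthetic monthly counts standing in for the actual regional log. The intervention dates and the negative binomial choice follow the text, while the baseline parameters and generated counts are purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic monthly death counts over ten years (2010-2019); illustrative only,
# not the actual regional event log.
rng = np.random.default_rng(4)
months = pd.date_range("2010-01-01", periods=120, freq="MS")
t = np.arange(1, 121)
mu = np.exp(3.9 + 0.004 * t)                       # slowly rising baseline level
deaths = rng.negative_binomial(n=10, p=10 / (10 + mu))
df = pd.DataFrame({"deaths": deaths, "t": t}, index=months)

# Overdispersion check guiding the choice of the negative binomial family
print("mean:", df.deaths.mean(), "variance:", df.deaths.var())

# Intervention dummies switched on at the dates given in the text
df["seatbelt"] = (df.index >= "2015-01-01").astype(int)
df["redlight"] = (df.index >= "2015-03-01").astype(int)
df["speedcam"] = (df.index >= "2017-03-01").astype(int)

# Level + trend + ranked determinants, negative binomial GLM (log link)
fit = smf.glm("deaths ~ t + seatbelt + redlight + speedcam",
              data=df, family=sm.families.NegativeBinomial()).fit()
print(fit.summary().tables[1])
```

A segmented variant of this model, with separate coefficients per regime as in Section 3.2, would then be compared across the iterations summarized in Table 5.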
5. Discussion
Regime shift detection is paramount for real-world processes, such as those in financial and economic planning and in manufacturing, when making operational and strategic decisions. However, as highlighted in Section 2, the extant literature is primarily focused on well-defined processes and lacks consideration of external factors. Additionally, we found a consistent, though implicit, assumption that the overall underlying behavioral mechanism of these processes remains unchanged. This assumption is highly questionable, as shifts in such mechanisms occur frequently, as exemplified by the cases discussed in this paper (
Section 1 and
Section 4). Thus, our work addresses this significant shortcoming in the process mining literature through a novel and generally applicable approach that detects the presence of shifts in process behavior, together with their locations and causal determinants. This contribution adds to the process mining literature in all three dimensions of discovery, conformance and enhancement. Although its role in process discovery is obvious, the proposed methodology is equally applicable to conformance and enhancement agendas, in terms of checking whether behavioral shifts conform to intended objectives.
This methodology was employed to analyze three distinct cases of technological, socio-economic and socio-technical nature. The varying nature of the cases demonstrates the applicability of the methodology in broader contexts. The analysis results show regime changes in all three cases, and various determinants were identified and analyzed for their role in the formation of these different regimes. In the first case of a manufacturing process, a regime shift was found where the tool breakage happened. In the second case of spot oil prices, we found twelve regime shifts, out of which the ninth regime shift turned out to be due to the occurrence of COVID-19. Finally, in the third road accident mortality case, we found that speed cameras have the most significant effect in reducing the occurrence of deaths in road accidents.
This work is useful for industry practitioners who seek to detect such shifts in the processes under their supervision and to remedy them based on the identified determinants. For academics, this methodology offers a new perspective, wherein processes can be evaluated on a more realistic, segmented view rather than as a whole.
6. Conclusions
In this paper, we presented a novel methodology for identifying regime shifts in processes of varied nature. Despite evidence of the use of process mining across a broad set of applications, we observed that its use is largely limited to well-defined processes that are centrally managed, isolated from their wider socio-economic environments and mechanistic in their behavior. Hence, there is an evident lack of attention to complex socio-economic or socio-technical processes. This motivated us to develop a generally applicable methodology that, firstly, focuses on identifying behavioral regime shifts from process event logs. Secondly, the methodology provides an approach for identifying potential determinants and statistically relating them to the formation of these regime shifts. Thirdly, the significant determinants are analyzed for their impact on the process output. We have demonstrated the application and use of this methodology via three case studies, which highlight the importance and criticality of detecting behavioral shifts for understanding and controlling the corresponding processes.
As a major limitation, the proposed methodology is currently developed only for the single dependent variable scenario. In the future, it may be extended so that multiple dependent variables are considered simultaneously. Furthermore, other modeling bases (e.g., volatility-based or AI-based) need to be integrated into the methodology, and composite modeling bases could also be developed. Another shortcoming of this work is that the methodology was tested on only three cases; a more extensive set of cases of varied nature needs to be investigated to fully justify and evaluate its performance. Finally, the proposed approach is problem-specific in its application, as the choice of determinants is contingent upon the problem being addressed and thus needs to be treated accordingly.