Article

A Smooth-Delayed Phase-Type Mixture Model for Human-Driven Process Duration Modeling

by Dongwei Wang 1, Sally McClean 1,*, Lingkai Yang 2, Ian McChesney 1 and Zeeshan Tariq 1
1 School of Computing, Ulster University, Belfast BT15 1AP, UK
2 Research Institute of Mine Artificial Intelligence, Chinese Institute of Coal Science, Beijing 100013, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(9), 575; https://doi.org/10.3390/a18090575
Submission received: 29 July 2025 / Revised: 4 September 2025 / Accepted: 7 September 2025 / Published: 11 September 2025

Abstract

Activities in business processes primarily depend on human behavior for completion. Due to human agency, the behavior underlying individual activities may occur in multiple phases and can vary in execution. As a result, the execution duration of such activities may exhibit complex multimodal characteristics. Phase-type distributions are useful for analyzing the underlying behavioral structure, which may consist of multiple sub-activities. A delayed start is also common in such activities, possibly due to a minimum task completion time or prerequisite tasks. As a result, the distribution of durations, or of certain components, does not start at zero but at a minimum value, below which the probability is zero. Fitting such distributions with phase-type models often requires a large number of phases, exceeding the actual number of sub-activities. This reduces the interpretability of the parameters and may also lead to optimization difficulties due to overparameterization. In this paper, we propose a smooth-delayed phase-type mixture model that introduces delay parameters to address the difficulty of fitting this kind of distribution. Since durations shorter than the delay should have zero probability, such hard truncation renders the delay parameter inestimable under the Expectation–Maximization (EM) framework. To overcome this, we design a soft-truncation mechanism that improves model convergence. We further develop an inference framework that combines the EM algorithm, Bayesian inference, and Sequential Least Squares Programming for comprehensive and efficient parameter estimation. The method is validated on a synthetic dataset and two real-world datasets. Results demonstrate that the proposed approach achieves fitting performance comparable to that of purely data-driven methods while providing good interpretability, revealing the potential underlying structure of human-driven activities.

Graphical Abstract

1. Introduction

A business process is a set of structured activities that people or systems perform to achieve a specific organizational goal. Operational efficiency is a key factor in economic success, and the duration of processes and of individual activities is the most direct measure for related applications, such as predicting completion time [1] and detecting bottlenecks [2]. Therefore, precisely modeling duration facilitates better analysis and is of significant importance.
Although automation technologies have been applied in some processes, many critical tasks are still driven by humans, especially those that involve actual executions [3]. While these processes are typically designed as sequences of distinct activities, some activities are not truly indivisible. Some human-driven activities can be broken down into multiple steps in actual performance or performed in different ways depending on individual habits or resource scheduling. The duration of such activities typically exhibits complex, multimodal distributions. However, only the aggregate duration is recorded, leaving the structure and duration of the underlying sub-activities latent and unobserved.
In such compound activities, delays are a frequently observed phenomenon and a major factor affecting efficiency. The delays discussed here are a characteristic of the distribution, specifically manifested as right-shifted probability peaks. These peaks can be caused by true delays or other factors that, while not delays in the strict sense, lead to similar effects. For example, in processes with timeout penalties, instances may be accelerated just before or after the timeout threshold, causing their durations to cluster around the timeout point. This creates a phenomenon resembling a delayed start, even though there may not be an actual delay in the strict sense. For simplicity, we collectively refer to all such factors as delays.
Delays in a process can arise for various reasons, such as inherent system latency, the minimum duration required to complete a task, the accumulation as deadlines approach, and limited resources. In the context of human-driven activities, delays can also result from work schedules, such as breaks and weekends. It should be clarified that the delay discussed in this paper refers to a common and systematic form that leads to a general shift in the observations, reflecting a delayed start. This is distinct from random and exceptional delays that produce only a small number of outliers. The occurrence of delays in processes is common and has been widely recognized and analyzed [4,5,6]. For example, Andrews et al. [7] found that delays are common in an insurance process, and the causes of these delays include claim handlers being occupied with other tasks or waiting for reports from medical examiners, among others.
While modeling the shifted aggregate duration may suffice for some applications, uncovering and analyzing the latent sub-activities and delays can provide deeper insights into the execution of activities, which facilitates precise bottleneck detection and efficiency analysis.
The phase-type (PH) distribution is a probabilistic model that represents the time until absorption in a finite Markov process [8], where each phase corresponds to a minimal activity with an exponentially distributed duration. In other words, the resulting PH parameter has the potential to uncover latent sub-activities. However, two common issues limit the applicability of the PH distribution in such modeling tasks.
The first issue is the fitting of peaks located far from the origin that are caused by significant delays. These peaks are characterized by relatively large means and relatively small variances. In this paper, we refer to these peaks as delayed narrow peaks. These delayed narrow peaks typically require a combination of many exponential distributions to achieve a close approximation [9]. In the representation of PH distributions, the dimension of the sub-generator matrix can be extremely high, which not only causes computational difficulties but also makes parameter estimation infeasible due to the excessive number of free parameters [10]. Moreover, the interpretation that a single activity is composed of too many sub-activities is also implausible.
In this paper, a delay function is introduced into the PH distribution corresponding to such delayed narrow peaks. The function serves as a smooth approximation of an explicit delay, which not only separates the delay to reduce the number of required phases but also eliminates the adverse effects on convergence caused by explicit delays. Furthermore, considering that the duration distribution of some activities may consist of multiple components with different delays, we adopt a mixture model. It should be noted that while the PH distributions are theoretically closed under mixtures, allowing multiple PH components to be merged into a single PH distribution [11], the use of different delay functions in the components prevents such merging. Here, the delay functions are adaptive to ensure consistent convergence across different components.
Identifiability is another issue. PH distributions are generally not identifiable, as the same distribution can often be represented by multiple distinct parameterizations [12]. In other words, the Markov process inferred from the PH distribution may differ from the actual underlying process that we aim to reveal. Moreover, mixture models are usually not identifiable [13,14], especially when the component distributions overlap significantly. The identifiability issue makes the parameter estimation challenging, as the point estimation obtained from the frequentist optimization does not necessarily reflect the true rates and structure of the underlying sub-activities.
To address this issue, Bayesian inference is used in this paper to estimate the posterior distributions of parameters, where Markov Chain Monte Carlo (MCMC) is used to sample from the posteriors. The priors are incorporated in the estimation to guide the sampling toward domain knowledge. Moreover, considering the high dimensionality of the parameters in mixture models, which makes direct MCMC sampling over the entire parameter space challenging, expectation–maximization (EM) is employed to enable separate sampling of each component. Additionally, frequentist point estimation is used in the early stages of the EM iterations to accelerate the computation. A detailed derivation for the parameter estimation is provided later in this paper.
In summary, this paper proposes a smooth-delayed phase-type mixture model (sDPHMM) and presents the following key contributions.
  • We employ delayed PH mixture models to uncover the potential structure of latent sub-activities and quantify the associated uncertainty through Bayesian inference.
  • We introduce an adaptive delay function into the mixture of PH distributions to improve convergence.
  • The proposed sDPHMM allows us to better understand the nature of these delays and the underlying human and technical behaviors toward eliminating or, at least, managing their impact.
The remainder of this paper is organized as follows: Related work is discussed in Section 2. The formulation of the sDPHMM and its parameter estimation method are described in detail in Section 3. In Section 4, evaluations are performed on a synthetic dataset and two real-world datasets. Conclusions are given in Section 5.

2. Related Work

2.1. Data-Driven Duration Modeling

A data-driven approach relies on observed data to discover the model for duration, without assuming predefined rules. It has shown strong performance in capturing complex patterns and adapting to real-world variability. Therefore, data-driven methods have been widely applied in duration modeling [15]. For composite activities with complex duration distributions, mixtures and convolutions of simple distributions are often used. Yang et al. [16] used mixtures of gamma distributions to model the duration of a hospital billing process, which exhibited a multimodal distribution. Each of these peaks was captured by at least one gamma component, and the number of components in the mixture model was selected by the Bayesian Information Criterion (BIC). Nogayama and Takahashi [17] used the convolution of two gamma distributions to model the durations of service activities consisting of latent waiting and service time, but only total durations were recorded. Ogunwale et al. [18] proposed a continuous probability distribution that combines gamma distributions and exponential distributions obtained using a transformed transformer method. However, data-driven models tend to lack interpretability, making it difficult to extract domain insight from the model structure [19].

2.2. Phase-Type Distribution

The nature of the PH distribution gives it the potential to fit staged processes and uncover the underlying sub-activity structure. It has attracted considerable attention. McClean et al. [20] used the PH distribution to model the length of stroke patient care with multiple outcomes, and on the basis of that, compliance was measured with given targets. The parameters were optimized under the condition that the underlying process structure was already known. Le et al. [21] chose the Coxian distribution, which is a special case of the PH distribution, to model the length of stay in hospitals. Despite the fact that the Coxian distribution applies specific constraints on the transition probabilities, the true underlying structure was unknown in this work. The number of phases was selected according to the Akaike Information Criterion (AIC). Kim et al. [22] applied the PH distribution to model the time to failure of a system with series and parallel components, where transitions were governed by the operating mechanism. These studies primarily emphasized fitting the observed data, although they typically involve certain assumptions or restrictions on the transitions among phases.
The issue of the identifiability and uniqueness of PH distributions has been recognized and studied since the early stages of theoretical development [8,23]. In other words, multiple sets of parameters and structures of the PH model can yield the same duration distribution. In our previous work [19], this phenomenon was also observed: (1) When the number of phases in the PH distribution is not fixed in advance but determined using model selection criteria such as the AIC, BIC, or likelihood ratio test (LRT), the resulting number of phases can be less than the actual number of activity stages. However, the resulting likelihood is comparable to that of the true distribution. (2) Even when the number of phases is fixed and structural constraints are imposed on the transitions, the model may still produce different parameter sets that yield likelihoods similar to that of the true distribution. This issue can cause the optimization of the PH distribution, such as the maximum likelihood estimation (MLE) [24], to jump from one representation to another [12], and the discovered model can deviate from the truth.
Bayesian inference, in the context of revealing latent structures of activities, yields full posterior distributions rather than point estimates, which may deviate from the truth. This comprehensive representation improves the interpretation and practical relevance of the inference result. Moreover, by integrating prior knowledge with observed data, Bayesian inference constructs a posterior distribution that not only reflects the uncertainty but also guides the inference toward more reasonable and realistic outcomes, especially in data-scarce scenarios. One of the pioneering works in applying Bayesian inference to the PH distribution appeared in the early 2000s [25]. This study proposed using the PH distribution to approximate the general service time in the M/G/1 queue and developed a Bayesian framework for parameter estimation. In Bayesian inference for the PH distribution, the posterior distribution is typically intractable due to the difficulty of computing the marginal likelihood, making the direct solution infeasible. MCMC methods provide an effective solution by generating samples from the unnormalized posterior and provide an approximation of the true posterior distribution [26]. Although subsequent related studies adopted different sampling strategies, such as Metropolis–Hastings [27] and Gibbs sampling [28], they generally remain within the same overarching inferential framework. Jose et al. [29] modeled medical service data, which is right-censored, with the PH distribution, and a Bayesian estimate was obtained using Gibbs sampling. In [30], Metropolis–Hastings was adopted in a Coxian distribution parameter inference for a time-based reliability analysis.

3. Methodology

3.1. Introducing a Delay to a PH Distribution

If the duration of each sub-activity within a complex activity follows an independent exponential distribution, the overall duration can be represented by a PH distribution, which is specified by a pair (α, S), where α is a vector representing the initial distribution over transient states, and S is a sub-generator matrix describing the transition rates among these states. A PH distribution with these parameters is denoted PH(α, S). The probability density function (PDF) is given by:
f(x) = \alpha e^{S x} (-S \mathbf{1}), \quad x \ge 0, \tag{1}
where 1 is a column vector of ones. The PH distribution naturally suits the duration of staged processes and provides better interpretation. Theoretically, the PH distribution can approximate any positive-valued distribution, but in practice it has difficulties in modeling a distribution or component with a large mean and a small variance.
Using the property of the coefficient of variation (CV) of a PH distribution [9], the minimum number of phases d required to fit a distribution can be determined.
d \ge \frac{E(X)^2}{\operatorname{Var}(X)} = \frac{1}{CV^2}. \tag{2}
When E ( X ) is large and Var ( X ) is small, the number of phases required in a PH distribution should be large. In real-world applications, as discussed in Section 1, this type of distribution is frequently observed. However, the underlying process is not composed of that many stages but involves a delay.
According to Equation (2), by subtracting a fixed value δ from all data, the mean of the distribution decreases while the variance remains unchanged, thus reducing the minimum required number of phases:
d_{\mathrm{new}} \ge \frac{E(X-\delta)^2}{\operatorname{Var}(X-\delta)} = \frac{E(X-\delta)^2}{\operatorname{Var}(X)}. \tag{3}
The PDF of the delayed PH distribution is given by:
f(x) = \alpha e^{S (x-\delta)} (-S \mathbf{1}), \quad x \ge \delta. \tag{4}
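To make Equations (1)–(4) concrete, the following minimal sketch (in Python, not taken from the authors' implementation) evaluates a PH density, its delayed variant, and the CV-based lower bound on the number of phases from Equation (2); the example parameters are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

def ph_pdf(x, alpha, S):
    """PH density f(x) = alpha * exp(S x) * (-S 1) for x >= 0 (Equation (1))."""
    if x < 0:
        return 0.0
    ones = np.ones(S.shape[0])
    return float(alpha @ expm(S * x) @ (-S @ ones))

def delayed_ph_pdf(x, alpha, S, delta):
    """Delayed PH density: the same density shifted right by delta (Equation (4))."""
    return ph_pdf(x - delta, alpha, S) if x >= delta else 0.0

def min_phases(samples):
    """CV-based lower bound on the number of phases, d >= E(X)^2 / Var(X) (Equation (2))."""
    m, v = np.mean(samples), np.var(samples)
    return int(np.ceil(m ** 2 / v))

# Hypothetical two-phase component with sequential exit rates 0.05 and 0.025 and a delay of 200.
alpha = np.array([1.0, 0.0])
S = np.array([[-0.05, 0.05],
              [0.00, -0.025]])
print(delayed_ph_pdf(250.0, alpha, S, delta=200.0))
```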

3.2. Mixture of Delayed PH Distributions

This subsection explains the rationale for adopting the mixture model formulation.
Assume that a distribution is composed of two PH components with a mixing weight. Since PH distributions are closed under weighted sums, the components can be merged into a single PH expression, which is the most common practice. However, if the first component PH(\alpha_1, S_1) needs a delay parameter \delta_1 > 0 to reduce the number of phases, while the second component does not require a delay (\delta_2 = 0) or requires a different delay (\delta_2 \neq \delta_1), it is evident that the two components cannot be merged due to their different delay parameters.
Therefore, we adopt a mixture model formulation to allow for the coexistence of distinct delays. More generally, this can be extended to accommodate a wider range of delays. The parameters of the k-th component are denoted by \phi_k = (\alpha_k, S_k, \delta_k), and \phi = (\phi_1, \phi_2, \ldots, \phi_K), where K is the number of components. Let \pi = (\pi_1, \pi_2, \ldots, \pi_K) denote the mixing weights, satisfying \sum_{k=1}^{K} \pi_k = 1 and \pi_k \ge 0. Let \theta = (\pi, \phi) denote all parameters. Then, the PDF of the delayed PH mixture model (DPHMM) is given by:
f(x; \theta) = \sum_{k=1}^{K} \pi_k f(x; \phi_k) = \sum_{k=1}^{K} \pi_k \, \alpha_k e^{S_k (x - \delta_k)} (-S_k \mathbf{1}), \tag{5}
where f(x; \phi_k) = 0 for x < \delta_k.
From the perspective of distribution representation, there is no problem with the DPHMM. However, δ k is not optimizable in the EM framework. This will be explained in Section 3.5, which introduces an improved and smoothed version of the DPHMM.
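For illustration only, the DPHMM density in Equation (5) can be evaluated as a weighted sum of delayed PH components, reusing the hypothetical delayed_ph_pdf helper sketched in Section 3.1; as noted above, this hard-truncated form is a valid density even though its delays cannot be estimated within EM.

```python
def dphmm_pdf(x, weights, components):
    """DPHMM density (Equation (5)).
    `components` is a list of (alpha, S, delta) tuples and `weights` the mixing weights pi_k;
    each component contributes zero below its own delay delta_k."""
    return sum(pi_k * delayed_ph_pdf(x, alpha_k, S_k, delta_k)
               for pi_k, (alpha_k, S_k, delta_k) in zip(weights, components))
```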

3.3. Maximum a Posteriori Estimation

Knowledge of business processes is typically available, which is especially valuable on a small dataset. When the amount of data is limited, the samples may deviate from the true distribution, and the prior can help estimate the true distribution more accurately. Moreover, in models with identifiability and uniqueness issues, the prior can guide the posterior toward a more reasonable distribution under uncertainty [25,26]. Therefore, this paper adopts the maximum a posteriori (MAP) estimation instead of MLE.
The prior distribution p ( θ ) can be factorized as:
p(\theta) = p(\pi) \prod_{k=1}^{K} p(\alpha_k)\, p(S_k)\, p(\delta_k). \tag{6}
Based on widely adopted prior distributions for PH distributions [28,29,30] and parameter ranges, we choose a Dirichlet distribution as the prior for α k , a truncated normal distribution as the prior for δ k , and a gamma distribution for each rate in S k . The prior distribution over π is given by a Dirichlet distribution:
p(\pi) = \frac{1}{B(\beta)} \prod_{k=1}^{K} \pi_k^{\beta_k - 1}, \tag{7}
where \beta = (\beta_1, \beta_2, \ldots, \beta_K) are the parameters of the prior, and B(\cdot) is the multivariate Beta function.
The likelihood can be defined as:
p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k\, p(x_i \mid \phi_k), \tag{8}
where X = \{x_1, x_2, \ldots, x_N\} is the collection of observations. According to Bayes’ theorem, the posterior distribution is given by:
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}, \tag{9}
where p ( X ) is the marginal likelihood that does not depend on θ . Therefore, the posterior is proportional to the product of the likelihood and the prior:
p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta). \tag{10}
Then, the objective of the optimization is to find the parameter \hat{\theta} that maximizes p(X \mid \theta)\, p(\theta).
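As a sketch of how this MAP objective can be assembled in code (illustrative only; the prior hyperparameters below are placeholders, not the values used in the paper), the unnormalized log-posterior is the log-likelihood of Equation (8) plus the log-prior of Equations (6) and (7):

```python
import numpy as np
from scipy.stats import dirichlet, gamma, truncnorm

def log_prior(pi, components, beta, delta_loc=100.0, delta_scale=50.0):
    """Log-prior p(theta) of Equation (6): Dirichlet priors on pi and alpha_k,
    gamma priors on the rates in S_k, and a truncated normal prior on delta_k.
    Assumes pi and alpha_k lie strictly inside the simplex."""
    lp = dirichlet.logpdf(pi, beta)
    for alpha_k, S_k, delta_k in components:
        lp += dirichlet.logpdf(alpha_k, np.ones_like(alpha_k))
        exit_rates = -np.diag(S_k)                      # diagonal magnitudes of the sub-generator
        lp += gamma.logpdf(exit_rates, a=2.0, scale=50.0).sum()
        a = (0.0 - delta_loc) / delta_scale             # truncate the normal prior at zero
        lp += truncnorm.logpdf(delta_k, a, np.inf, loc=delta_loc, scale=delta_scale)
    return lp

def log_posterior_unnormalized(data, pi, components, beta):
    """log p(theta | X) up to the constant log p(X), Equations (8)-(10)."""
    loglik = sum(np.log(dphmm_pdf(x, pi, components)) for x in data)
    return loglik + log_prior(pi, components, beta)
```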

3.4. Expectation–Maximization

Direct estimation of the high-dimensional parameter θ is challenging, as optimization in high dimensions is prone to numerical instability and local optima. Therefore, the EM algorithm is applied, with components updated separately in each M-step, which simplifies high-dimensional optimization and stabilizes convergence. The log-posterior is given by:
\log p(\theta \mid X) = \log p(X \mid \theta) + \log p(\theta) - \log p(X), \tag{11}
in which \log p(X) is a constant. A latent variable Z is introduced to represent the underlying factors or categories of the observed data. By introducing an arbitrary distribution q(Z), satisfying \sum_{Z} q(Z) = 1, and applying Jensen’s inequality, the objective function, which is the evidence lower bound (ELBO) of the log-posterior, can be obtained:
\mathcal{L}(q(Z), \theta) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} + \log p(\theta) - \log p(X). \tag{12}
In the E-step, given a fixed \theta, the objective is to find q(Z) that makes the bound tight: \log p(\theta \mid X) = \mathcal{L}(q(Z), \theta). After simplification, this yields q(Z) = p(Z \mid X, \theta). Then, \gamma_{ik}, known as the responsibility that the k-th component takes for data point x_i, can be calculated as:
\gamma_{ik} = q(z_i = k) = p(z_i = k \mid x_i, \theta) = \frac{\pi_k\, p(x_i \mid \phi_k)}{\sum_{j=1}^{K} \pi_j\, p(x_i \mid \phi_j)}. \tag{13}
In the M-step, q(Z) is fixed, and the objective is to maximize \mathcal{L}(\theta \mid q(Z)). Starting from the identity p(X, Z \mid \theta) = p(X \mid Z, \theta)\, p(Z \mid \theta), replacing X and Z with the full data, and simplifying, we arrive at:
\mathcal{L}(\theta \mid q(Z)) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} \log p(x_i \mid \phi_k) + \log p(\pi) + \log p(\phi) + \text{const}. \tag{14}
By applying the Lagrange multiplier method and simplifying the expression, an analytical solution for π k can be derived.
\pi_k = \frac{\sum_{i=1}^{N} \gamma_{ik} + \beta_k - 1}{N + \sum_{k=1}^{K} (\beta_k - 1)}. \tag{15}
The component parameters \phi_k are independent of each other. Therefore, each component can be optimized separately. Here, the terms related to \phi_k in \mathcal{L}(\theta \mid q(Z)) are extracted and denoted as Q(\phi_k):
Q(\phi_k) = \sum_{i=1}^{N} \gamma_{ik} \log p(x_i \mid \phi_k) + \log p(\phi_k). \tag{16}
The objective is to find the optimal ϕ ^ k to maximize Q ( ϕ k ) . However, there is no closed-form solution for ϕ ^ k . Hence, we combine the optimization method and Bayesian inference to analyze ϕ k , which will be discussed in Section 3.6.
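The EM machinery above can be summarized in a short sketch (again illustrative, with `component_pdf` and `log_prior_k` left as placeholders): the E-step computes the responsibilities of Equation (13), the M-step updates the mixing weights analytically via Equation (15), and the remaining component parameters are handled by maximizing Q(\phi_k) from Equation (16).

```python
import numpy as np

def e_step(data, pi, components, component_pdf):
    """Responsibilities gamma_ik from Equation (13); component_pdf(x, phi_k) is a placeholder."""
    dens = np.array([[pi_k * component_pdf(x, phi_k)
                      for pi_k, phi_k in zip(pi, components)] for x in data])
    return dens / dens.sum(axis=1, keepdims=True)

def update_mixing_weights(gamma, beta):
    """MAP update of pi_k, Equation (15), with Dirichlet hyperparameters beta."""
    numerator = gamma.sum(axis=0) + beta - 1.0
    return numerator / (len(gamma) + np.sum(beta - 1.0))

def Q_component(phi_k, data, gamma_k, component_pdf, log_prior_k):
    """Per-component M-step objective Q(phi_k), Equation (16)."""
    log_dens = np.log([component_pdf(x, phi_k) for x in data])
    return float(np.sum(gamma_k * log_dens)) + log_prior_k(phi_k)
```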

3.5. Smooth-Delayed PH Distribution

The DPHMM in Equation (5) does not have a problem representing a probability distribution but has an issue in optimization or inference with EM, as, in this case, the delay parameter δ k is not optimizable. Taking into account a component f ( x ; ϕ k ) of the DPHMM, which has a delay δ k as defined before, the PDF of this component is zero for any x i < δ k :
p(x_i \mid \phi_k) = f(x_i; \phi_k) = 0, \quad x_i < \delta_k. \tag{17}
Then:
\gamma_{ik} = \frac{\pi_k\, p(x_i \mid \phi_k)}{\sum_{j=1}^{K} \pi_j\, p(x_i \mid \phi_j)} = 0, \quad x_i < \delta_k. \tag{18}
Therefore, as shown in Equation (16), any x i < δ k is excluded from Q ( ϕ k ) . In other words, the optimization or inference of ϕ k only considers data beyond its delay δ k . As a result, δ k is not adjusted toward smaller values.
On the other hand, \delta_k is also not adjusted toward higher values. For any x_i \ge \delta_k, \gamma_{ik} > 0, and if \delta_k is adjusted to a new value \delta_k^{\mathrm{new}} > \delta_k, then:
f(x_i; \alpha_k, S_k, \delta_k^{\mathrm{new}}) = 0, \quad \delta_k < x_i < \delta_k^{\mathrm{new}}, \tag{19}
\gamma_{ik} \log f(x_i; \alpha_k, S_k, \delta_k^{\mathrm{new}}) = -\infty, \quad \delta_k < x_i < \delta_k^{\mathrm{new}}. \tag{20}
Since Equation (20) is part of Q(\phi_k), this makes Q(\phi_k) = -\infty, which contradicts the objective of maximization; consequently, the delay \delta_k will not increase.
The simplest solution is to let f ( x i ; ϕ k ) = ϵ , x i < δ k , where ϵ > 0 is a small constant. However, this solution is not ideal. First, the new function is discontinuous, resulting in unstable optimization or inference. Second, this slightly violates the normalization required by probability distributions. Moreover, the value of ϵ is not transferable across different datasets and must be chosen according to the specific distribution. If ϵ is too large, the delay may increase abnormally, while if it is too small, the delay becomes nearly impossible to optimize.
In general, in the delayed PH distribution, the input is essentially transformed using \max(0, x - \delta). This transformation introduces an optimization issue in the EM framework. To address this, we adopt a parameterized softplus function [31] as a differentiable, monotonic, and positive-valued delay function to approximate \max(0, x - \delta):
g(x; \delta) = \frac{1}{\kappa} \log\left(1 + e^{\kappa (x - \delta)}\right), \tag{21}
where κ is a smoothness parameter that controls how sharply the transition occurs around x = δ . We use this hyperparameter because we need to control how closely the original function is approximated, as an insufficient approximation would undermine the interpretability of the parameters. The derivative of g ( x ; δ ) is:
g'(x; \delta) = \frac{e^{\kappa (x - \delta)}}{1 + e^{\kappa (x - \delta)}}. \tag{22}
Then, the modified PDF, which serves as a smoothed form of f(x; \phi), is obtained by a change of variable with the Jacobian correction:
h(x; \phi) = f(g(x; \delta); \alpha, S) \cdot g'(x; \delta), \tag{23}
where the term g'(x; \delta) accounts for the Jacobian adjustment due to the transformation x \mapsto g(x; \delta). The function is first examined in the regimes x \ll \delta and x \gg \delta to analyze how well h(x; \phi) approximates f(x; \phi):
h(x; \phi) = f(g(x; \delta); \alpha, S) \cdot g'(x; \delta) \approx \begin{cases} 0, & x \ll \delta, \\ f(x; \alpha, S, \delta), & x \gg \delta. \end{cases} \tag{24}
The integral of h(x; \phi) over [0, \infty) equals 1 up to a negligible boundary term, since g(0; \delta) \approx 0 for sufficiently large \kappa\delta, which ensures the normalization of the distribution:
\int_{0}^{\infty} f(g(x; \delta); \alpha, S)\, g'(x; \delta)\, dx = \int_{g(0;\delta)\,\approx\,0}^{g(\infty;\delta)\,=\,\infty} f(u; \alpha, S)\, du \approx 1. \tag{25}
Moreover, the new function h ( x ; ϕ ) is continuous and smooth, which is good for optimization and inference.
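A minimal sketch of the smoothed density (reusing the hypothetical ph_pdf helper from the Section 3.1 sketch; not the authors' code) implements the parameterized softplus of Equation (21), its derivative in Equation (22), and the change-of-variable density of Equation (23):

```python
import numpy as np

def g(x, delta, kappa):
    """Smooth delay of Equation (21); logaddexp(0, z) = log(1 + e^z) avoids overflow."""
    return np.logaddexp(0.0, kappa * (x - delta)) / kappa

def g_prime(x, delta, kappa):
    """Derivative of g, Equation (22): a logistic sigmoid centered at delta."""
    return 1.0 / (1.0 + np.exp(-kappa * (x - delta)))

def smooth_delayed_ph_pdf(x, alpha, S, delta, kappa):
    """Smoothed delayed PH density h(x; phi) of Equation (23):
    the PH density evaluated at g(x; delta), multiplied by the Jacobian g'(x; delta)."""
    return ph_pdf(g(x, delta, kappa), alpha, S) * g_prime(x, delta, kappa)
```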
Figure 1 shows the curves of h ( x ; ϕ ) with different κ . It can be observed that the larger the value of κ , the closer the curve approximates the delayed PH distribution. However, even with the same value of κ , the degree of approximation varies across different distributions. The larger the width of the peak, the better the approximation. This raises a concern: inconsistent approximation among the components of a mixture model may negatively impact the convergence.
To address the issue of inconsistent approximation quality, we assign a unique smoothness value \kappa_k to each component according to its variance:
\kappa_k = \frac{c^{(t)}}{2 \alpha_k S_k^{-2} \mathbf{1} - (\alpha_k S_k^{-1} \mathbf{1})^2}, \tag{26}
where c ( t ) is a uniform parameter for all components in the t-th EM iteration, and κ k is the adaptive parameter for the k-th component. Figure 2 shows the approximation under different c. The higher the variance of the distribution, the lower the value of κ . Consequently, it can be observed that the degree of approximation is similar under the same c, which ensures the consistency of the parameter convergence among the components of a mixture model. As c increases sufficiently, the approximation becomes sufficiently close.
However, a large κ reinstates the original issue, namely that δ is not optimizable in EM. On the other hand, a small κ introduces a non-negligible deviation from the delayed PH distribution, compromising the interpretability of the parameters. Therefore, we want κ to be small in the early stage of the EM iterations, so that the curve h(x; \theta) is smooth enough to facilitate convergence. As the parameters converge, κ gradually increases, making h(x; \theta) approach f(x; \theta). To achieve this, we set an initial value c^{(1)} and increase it at a rate η across EM iterations:
c^{(t+1)} = (1 + \eta)\, c^{(t)}, \quad \eta > 0. \tag{27}
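Under the reconstruction of Equation (26) used here (c divided by the component variance, which is an assumption about the exact functional form), the adaptive smoothness and the annealing of c can be sketched as follows:

```python
import numpy as np

def ph_variance(alpha, S):
    """Variance of PH(alpha, S): 2 alpha S^{-2} 1 - (alpha S^{-1} 1)^2."""
    ones = np.ones(S.shape[0])
    S_inv = np.linalg.inv(S)
    mean = -alpha @ S_inv @ ones
    second_moment = 2.0 * alpha @ S_inv @ S_inv @ ones
    return float(second_moment - mean ** 2)

def adaptive_kappa(alpha, S, c):
    """Adaptive smoothness kappa_k (Equation (26)): higher-variance components get a smaller kappa."""
    return c / ph_variance(alpha, S)

def anneal_c(c, eta):
    """Increase the smoothness scale after each EM iteration (Equation (27))."""
    return (1.0 + eta) * c
```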
As both c and κ increase, the limiting behavior of h ( x ; ϕ ) is analyzed.
\lim_{\kappa \to \infty} h(x; \phi) = \begin{cases} 0 = f(x; \phi), & x < \delta, \\ \tfrac{1}{2}\, \alpha e^{S \cdot 0} (-S \mathbf{1}) = \tfrac{1}{2} f(x; \phi), & x = \delta, \\ \alpha e^{S (x - \delta)} (-S \mathbf{1}) = f(x; \phi), & x > \delta. \end{cases} \tag{28}
As in Equation (28), if κ is large enough, h ( x ; ϕ ) and f ( x ; ϕ ) are equivalent except when x = δ . However, the probability that x hits a specific point δ in a continuous one-dimensional space is zero. Therefore, as the EM iterates and κ becomes sufficiently large, h ( x ; ϕ ) converges to f ( x ; ϕ ) . Consequently, the estimated parameter θ preserves the original interpretability of the PH distribution for further analysis.
In this paper, c ( 1 ) = 1 and η = 1.2 , with the total number of iterations fixed at 30. Under this set of hyperparameters, c ( 30 ) = 237 , which is sufficiently large to ensure that h ( x ; ϕ ) closely approximates f ( x ; ϕ ) . All three case studies presented in this paper adopt this setting and achieve stable convergence. These hyperparameters can be adjusted if necessary. For example, iterations may be terminated early if the objective function exhibits minimal change. However, when parameter interpretability is concerned, it is recommended to ensure that the final value of c is sufficiently large, preferably above 200.
Since κ is not a parameter to be fitted in EM, h ( x ; ϕ ) and f ( x ; ϕ ) are formally equivalent in EM. Moreover, equations in MAP (Section 3.3) and EM (Section 3.4) are independent of the specific expression of p ( x i ϕ k ) . Thus, h ( x i ; ϕ k ) can replace f ( x i ; ϕ k ) in MAP and EM without affecting the validity of the equations. To distinguish from the original DPHMM in Equation (5), we name the new form the smooth-delayed PH mixture model (sDPHMM), whose formula is given by:
h(x; \theta) = \sum_{k=1}^{K} \pi_k\, h(x; \phi_k) = \sum_{k=1}^{K} \pi_k\, \alpha_k \exp\!\left( \frac{S_k}{\kappa_k} \log\!\left(1 + e^{\kappa_k (x - \delta_k)}\right) \right) (-S_k \mathbf{1}) \cdot \frac{e^{\kappa_k (x - \delta_k)}}{1 + e^{\kappa_k (x - \delta_k)}}. \tag{29}
In summary, the sDPHMM is proposed based on the DPHMM and serves as a smoothed approximation to the DPHMM, whose parameters have precise physical meanings. This approximation addresses the issue of parameter estimation but compromises the parameter interpretability. However, when the approximation is sufficiently close, which can be achieved by increasing the value of c, the parameters of the sDPHMM can be considered to possess the same interpretability as those of the DPHMM. Here, the delay function serves to smooth the explicit delay, allowing the sDPHMM to be optimized within the EM framework. Moreover, it adaptively adjusts the parameter based on the variance of each component to balance the convergence across the components.

3.6. Optimization and Bayesian Inference

As mentioned in Section 3.4, Q ( ϕ k ) in Equation (16) is the objective to be maximized for each component of the sDPHMM in the M-step. Optimization methods are the most commonly used approach for solving problems of this kind. However, in addition to the identifiability and uniqueness problems discussed in Section 2.2, it is typically difficult to optimize a PH distribution in its relatively large parameter space, as the parameters often become trapped in local optima.
Compared to optimization methods, which provide point estimates, Bayesian inference explores the parameter space and provides the distributions of parameters, which is more comprehensive and provides much better interpretability. As shown in Equation (9), to obtain the posterior of θ , the marginal likelihood p ( X ) = p ( X θ ) p ( θ ) d θ should be calculated first.
However, in a mixed PH distribution, θ lies in a high-dimensional parameter space; thus, p ( X ) is intractable. Instead, the MCMC method, specifically the Metropolis–Hastings algorithm, is employed to approximate p ( θ X ) . Due to the typically high dimensionality of θ , directly sampling the full parameter set is challenging. Therefore, within the EM framework, a strategy of sampling the parameters of each component in the sDPHMM separately is adopted, i.e., ϕ k . As discussed in Section 3.4, the mixing parameter π has been obtained analytically, while the target function Q ( ϕ k ) for each component parameter ϕ k is given by Equation (16). The acceptance of the Metropolis–Hastings sampling is given by:
p(\phi_k^{*} \mid \phi_k) = \min\left(1,\; e^{Q(\phi_k^{*}) - Q(\phi_k)} \cdot \frac{r(\phi_k \mid \phi_k^{*})}{r(\phi_k^{*} \mid \phi_k)}\right), \tag{30}
where r(\cdot) is the proposal distribution used to generate candidates for \phi_k.
In each sampling step, a new parameter \phi_k^{*} is proposed from the current parameter \phi_k through the proposal distribution r(\phi_k^{*} \mid \phi_k). The probability of accepting this new parameter, denoted p(\phi_k^{*} \mid \phi_k), is then calculated. If \phi_k^{*} is accepted, the current parameter is updated to \phi_k^{*}; otherwise, the parameter remains at \phi_k. After each sampling step, the value of \phi_k is collected to form a set \{\phi_k^{(m)}\}_{m=1}^{M}. If the number of samples M is sufficiently large, the distribution of \{\phi_k^{(m)}\}_{m=1}^{M} approximates the posterior distribution of \phi_k.
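A random-walk Metropolis sketch of this sampler is given below (illustrative; it assumes \phi_k has been flattened into an unconstrained real vector and uses a symmetric Gaussian proposal, so the proposal ratio in Equation (30) cancels; Q should return minus infinity for invalid parameters):

```python
import numpy as np

def metropolis_hastings(phi_init, Q, n_samples, step=0.05, seed=0):
    """Sample phi_k from a target proportional to exp(Q(phi_k)) (Equation (30))."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi_init, dtype=float)
    q_current = Q(phi)
    samples = []
    for _ in range(n_samples):
        proposal = phi + step * rng.normal(size=phi.shape)       # propose phi_k*
        q_proposal = Q(proposal)
        # Accept with probability min(1, exp(Q(phi*) - Q(phi))); the symmetric proposal cancels.
        if np.log(rng.uniform()) < q_proposal - q_current:
            phi, q_current = proposal, q_proposal                # accept
        samples.append(phi.copy())                               # otherwise keep the current state
    return np.array(samples)
```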
Bayesian inference, while capable of capturing full posterior distributions, often incurs substantially higher computational costs, particularly in the case of PH distributions or when the dataset is large.
However, it is evident that analyzing the posterior distribution of ϕ k in the early stage of the EM iterations, when the responsibilities γ i k in Q ( ϕ k ) are still rapidly changing, is neither meaningful nor necessary. Given that the parameter estimation involves both boundaries and linear constraints in this case, we adopt Sequential Least Squares Programming (SLSQP), an efficient optimization method, to update the parameters from the beginning of EM and defer Bayesian inference until γ i k stabilizes. In practice, only a few EM iterations with Bayesian inference are required to obtain stable and well-behaved posterior distributions, which significantly alleviates the computational burden.
Let the number of phases in the k-th PH component be d_k, with a computational complexity of O(d_k^3) for a single density evaluation. In one EM iteration, the MCMC sampling dominates the complexity, resulting in O(M N d_k^3), where M is the number of MCMC samples and N is the size of the data. If an optimization-based method is used instead, the complexity of one EM iteration is O(I N d_k^3), where I denotes the number of iterations of the optimization algorithm, typically on the order of tens, whereas MCMC sampling requires thousands of samples to obtain a reliable posterior distribution.
We acknowledge that even with this approach, the computational cost is only reduced from unbearably high to barely acceptable. Due to the inherent complexity of PH distributions and MCMC sampling, the method is not computationally fast. However, the main advantage lies in obtaining interpretable posterior distributions and analyzing the potential underlying structure of complex activities, which is the primary objective of the sDPHMM and is typically performed offline.
Finally, the samples from the posterior distribution of the parameters, \{\phi^{(m)}\}_{m=1}^{M} = \{\{\phi_1^{(m)}\}_{m=1}^{M}, \ldots, \{\phi_K^{(m)}\}_{m=1}^{M}\}, are obtained. Along with the mixing weights \pi, the PDF of the posterior predictive distribution is given by:
p(x \mid X) = \sum_{k=1}^{K} \pi_k \frac{1}{M} \sum_{m=1}^{M} h\!\left(x; \phi_k^{(m)}\right). \tag{31}
Equation (31) is based on the full posterior of the parameters, which entails a high computational cost. Alternatively, one can use the mode or the expectation of the posterior distribution to simplify the computation.
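As a small illustration of Equation (31) (placeholders only), the posterior predictive density averages the smoothed component density over the M posterior samples of each component and then mixes the results with the weights \pi:

```python
import numpy as np

def posterior_predictive_pdf(x, pi, posterior_samples, component_pdf):
    """Posterior predictive density of Equation (31).
    posterior_samples[k] is the list of M samples of phi_k; component_pdf(x, phi) is a placeholder."""
    return sum(pi_k * np.mean([component_pdf(x, phi_m) for phi_m in samples_k])
               for pi_k, samples_k in zip(pi, posterior_samples))
```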
The complete procedure of the proposed method is presented in Algorithm 1.
Algorithm 1 Bayesian inference and MAP-based EM algorithm for sDPHMM parameter posterior distribution estimation
 Input: Observed data X = {x_1, ..., x_N};
        Prior distribution of the mixing weights p(π);
        Prior distribution of each component p(φ_k) = {p(α_k), p(S_k), p(δ_k)};
        Proposal distribution r(·)
 Output: Estimated mixing weights π = {π_1, ..., π_K};
        Posterior samples {φ^{(m)}}_{m=1}^{M}, where φ = {φ_1, ..., φ_K} and φ_k = {α_k, S_k, δ_k}
 Define: κ_k = c / (2 α_k S_k^{-2} 1 − (α_k S_k^{-1} 1)^2)
 Define: g(x; δ_k) = (1/κ_k) log(1 + e^{κ_k (x − δ_k)})
 Define: p(x | φ_k) = α_k e^{S_k g(x; δ_k)} (−S_k 1) g′(x; δ_k)
 Define: Q(φ_k) = Σ_{i=1}^{N} γ_{ik} log p(x_i | φ_k) + log p(φ_k)
 Initialize the parameters π, φ and the hyperparameters c, η
 for t = 1 to T_0 + T_1 do ▹ T_0 iterations: SLSQP; T_1 iterations: Bayesian inference
     for i = 1 to N do ▹ E-step starts
         for k = 1 to K do
             if t ≤ T_0 + 1 then
                 γ_{ik} = π_k p(x_i | φ_k) / Σ_{j=1}^{K} π_j p(x_i | φ_j) ▹ E-step of the MAP EM iterations
             else
                 γ_{ik} = π_k Σ_{m=1}^{M} p(x_i | φ_k^{(m)}) / Σ_{j=1}^{K} π_j Σ_{m=1}^{M} p(x_i | φ_j^{(m)}) ▹ E-step of the Bayesian inference EM iterations
             end if
         end for
     end for
     for k = 1 to K do ▹ M-step starts
         π_k = (Σ_{i=1}^{N} γ_{ik} + β_k − 1) / (N + Σ_{k=1}^{K} (β_k − 1)) ▹ Update π
     end for
     if t ≤ T_0 then
         for k = 1 to K do
             φ̂_k = arg max Q(φ_k) using SLSQP ▹ MAP using SLSQP
         end for
     else ▹ Bayesian inference
         for k = 1 to K do
             for m = 1 to M do
                 Propose φ_k^* ~ r(φ_k^* | φ_k^{(m−1)})
                 p(φ_k^* | φ_k^{(m−1)}) = min(1, e^{Q(φ_k^*) − Q(φ_k^{(m−1)})} · r(φ_k^{(m−1)} | φ_k^*) / r(φ_k^* | φ_k^{(m−1)})) ▹ Acceptance probability
                 Sample μ ~ Uniform(0, 1)
                 if p(φ_k^* | φ_k^{(m−1)}) > μ then
                     φ_k^{(m)} = φ_k^* ▹ Accept
                 else
                     φ_k^{(m)} = φ_k^{(m−1)} ▹ Reject
                 end if
             end for
         end for
     end if
     c ← (1 + η) c
 end for
 return π, {φ^{(m)}}_{m=1}^{M}

4. Experiments

4.1. Experimental Design

The performance of the sDPHMM is evaluated with three case studies: one based on synthetic data and two based on real-world data. For the synthetic data, the ground truth of the activities is known, allowing quantitative analysis of interpretability. In contrast, for the two real-world datasets, the ground truth is unknown, so only qualitative analysis can be performed, i.e., assessing whether the revealed structure is reasonable.
In selecting the comparative baselines for these case studies, since the PH distribution or its mixtures have difficulty fitting the narrow-peaked data used in this paper, as proved in Section 3.1, Erlang mixture models and gamma mixture models are used as alternatives, denoted as M ( E ) and M ( Γ ) for brevity. Both M ( E ) and M ( Γ ) are optimized under the EM framework to ensure consistency in conditions. The gamma mixture model and the Weibull mixture model with explicit delays are also included as baselines, denoted as M D ( Γ ) and M D ( W ) , respectively. Models with explicit delays optimize all parameters jointly using SLSQP due to the optimization issue in the EM algorithm.
To maintain consistency, the sDPHMM and DPHMM are denoted as M s D ( P H ) and M D ( P H ) , respectively. Although a mixture of general PH distributions is denoted as M ( P H ) , it can be essentially merged into a larger phase-type representation.
In selecting metrics, the Kolmogorov–Smirnov (KS) statistic D K S is used to measure the maximum difference in cumulative distribution functions (CDFs) between the model and the empirical data, while the Wasserstein distance D W is used to measure the average difference between CDFs.
Each case study broadly follows the same structure. The dataset and the choice of priors are first introduced. Subsequently, the goodness of fit of M s D ( P H ) is compared with the selected baselines. Finally, an interpretability assessment is given. In the interpretability assessment, the approximation of M s D ( P H ) to M D ( P H ) is validated with Kullback–Leibler (KL) divergence, where the parameters obtained from M s D ( P H ) are used directly in M D ( P H ) . A sufficiently close approximation ensures that the parameters of M s D ( P H ) retain a general phase-type interpretation.
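One way to compute the two reported metrics (not necessarily the authors' exact procedure) is via SciPy, using the fitted model's CDF for the KS statistic and samples drawn from the fitted model for the Wasserstein distance:

```python
import numpy as np
from scipy.stats import kstest, wasserstein_distance

def fit_metrics(data, model_cdf, model_samples):
    """D_KS: maximum CDF difference between the data and the model CDF (a callable);
    D_W: Wasserstein distance, approximated here from samples drawn from the fitted model."""
    d_ks = kstest(data, model_cdf).statistic
    d_w = wasserstein_distance(np.asarray(data), np.asarray(model_samples))
    return d_ks, d_w
```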

4.2. Convergence Assessment

Before presenting the case studies, we first evaluate the benefits of the adaptive delay function in terms of convergence on synthetic data.
For the convergence evaluation, the data are sampled from two distributions. Specifically, 1500 samples are drawn from the convolution of two exponential distributions with rates 0.05 and 0.025, and 2000 samples are drawn from the convolution of two other exponential distributions with rates 0.01 and 0.02. For delayed models, namely M s D ( P H ) , M D ( P H ) , M D ( Γ ) , and M D ( W ) , the 1500 samples from the first convolution were uniformly shifted by 200 to introduce a delay. The data for models without delays, namely M ( P H ) , M ( Γ ) , and M ( W ) , are used without modification.
To ensure consistency of the validation, M s D ( P H ) employs non-informative priors, and only SLSQP is used during the EM iterations. Other models are used as comparative baselines, and their components are jointly optimized via SLSQP. All methods adopt the same random initialization strategy for the parameters.
Figure 3 shows the KS distances between the model CDFs and the empirical CDF over 30 runs. It can be concluded that the introduction of explicit delay clearly leads to reduced convergence, or even failure. This may be caused by discontinuities and the absence of gradients before the explicit delay. In contrast, the adaptive delay function used in M s D ( P H ) significantly improves convergence. Benefiting from the advantages of the EM algorithm, its convergence can even outperform that of non-delayed models. A direct comparison can be made between M s D ( P H ) and M ( P H ) . Moreover, it is precisely the delay function that resolves the iteration issues of M D ( P H ) in the EM algorithm, making EM applicable.
Given the convergence instability of M D ( Γ ) and M D ( W ) , as baselines in the following case studies, they are assigned better initial parameters and run multiple times until successful convergence.

4.3. Case Study 1: Synthetic Data

4.3.1. Dataset and Priors

The structure of the synthetic data is shown in Figure 4, which can be completed through two distinct paths with probabilities of 3/7 and 4/7, respectively. Each path consists of two sequential sub-activities, and the duration of each sub-activity follows an exponential distribution. For the first path, the two sub-activities are governed by exponential distributions with rate parameters 0.05 and 0.025, respectively. In addition, a fixed delay of 200 is introduced in this path. In the second path, the rate parameters for the two exponential distributions are 0.01 and 0.02. This structure simultaneously reflects multimodality, multistage progression, and partial delays. Although it is a relatively simple example, it encompasses behavior patterns commonly observed in real processes. Here, 3500 samples are generated.
In the specification of priors, this case study uses subjective priors to simulate the common practice of incorporating domain knowledge. Deterministic information is typically straightforward to summarize and is associated with high confidence as it remains consistent across observations. Consequently, the initial probability α k and the off-diagonal elements of the transition matrix S k are assigned strongly informative priors, thereby capturing the belief that instances along each path are executed in a fixed order. In contrast, probabilistic knowledge is generally less precise and carries lower confidence since the observations are subject to variation. Consequently, the mixing weights π , the diagonal elements of S k , and the delay parameters δ k are specified with moderately informative priors.
In addition, a model using non-informative priors for probabilistic knowledge is also included and is denoted M s D N I ( P H ) to assess the impact of priors on performance.

4.3.2. Goodness-of-Fit Assessment

The goodness of fit of M s D ( P H ) is validated and compared to baselines. As the distribution of the synthetic data is bimodal, all mixture models comprise two components.
Table 1 reports the metrics that evaluate the difference in CDFs between models and the generated data, in which M ( G T ) denotes the ground truth. The reported metrics of M s D ( P H ) and M s D N I ( P H ) are calculated based on the full posterior distribution of the model parameters. Since each component of M s D ( P H ) and M s D N I ( P H ) is sampled independently under the EM framework, samples from different components cannot be paired. Therefore, it is not possible to provide a reliable credible interval (CI) for metrics.
Two notes on Table 1: (1) M ( E ) and M ( Γ ) underfit the data because they naturally cannot fit the shifted skewed distribution well. (2) Despite adopting minimal models, M s D ( P H ) , M s D N I ( P H ) , M D ( Γ ) , and M D ( W ) exhibit a slight overfitting. The inclusion of priors reduces the fit but also mitigates overfitting.

4.3.3. Interpretability Assessment

Table 2 reports the medians and 95% CIs of the posteriors of M s D ( P H ) and M s D N I ( P H ) , along with the ground truth, while Table 3 reports the parameters of M ( E ) , M ( Γ ) , M D ( Γ ) , and M D ( W ) . The expected duration of each component, excluding the delay, is also provided.
The KL divergences of M D ( P H ) from M s D ( P H ) and M s D N I ( P H ) are 3.26 × 10⁻⁶ and 1.54 × 10⁻⁷, respectively, suggesting that the parameters of M s D ( P H ) and M s D N I ( P H ) have a general phase-type interpretation.
When comparing M s D ( P H ) and M s D N I ( P H ) , appropriate priors not only lead the posteriors toward convergence with domain knowledge but also result in more concentrated posteriors, producing parameter estimates that are closer to those of M ( G T ) and with reduced uncertainty. The estimation errors of the parameters in M s D ( P H ) are generally acceptable, indicating that M s D ( P H ) effectively identifies the potential structure and quantifies the uncertainty.
The shape parameters of Component 1 of M ( E ) and M ( Γ ) are 46, suggesting that it comprises 46 sub-activities. This deviates from M ( G T ) , and such an interpretation would generally be unreasonable in real-world processes. It also indicates that a general PH distribution requires a large number of phases to fit Component 1, making it almost impossible to estimate. For M D ( Γ ) and M D ( W ) , their delay parameters are basically correct. However, due to intrinsic limitations, M D ( Γ ) cannot adequately represent a sequence of sub-activities with different rates since the gamma distribution is interpreted as the convolution of multiple independent exponential distributions sharing the same rate. The interpretation of the parameters of M D ( W ) is not applicable in this context.
The complete posteriors of the parameters of M s D ( P H ) and M s D N I ( P H ) are provided in Appendix A (Figure A1 and Figure A2).

4.4. Case Study 2: Hospital Billing Dataset

4.4.1. Dataset and Priors

The hospital billing dataset [32] is an event log that records the activities carried out to bill a package of medical services. This dataset contains 100,000 instances covering 18 unique activities. Among them, the activity “FIN” represents that the billing package is closed and cannot be changed anymore. It is the most important activity in the process, accounting for the vast majority of the process duration, with an average of 113.7 days, compared to the overall process average of 127.4 days. Moreover, the duration of the activity “FIN” exhibits a substantially more intricate distribution than other activities. Therefore, this case study is based on the duration of the activity “FIN”. There is an attribute called “close code” that indicates different types of reasons for closing a billing package, with 36 unique statuses. The duration distribution for different closing reasons varies significantly and the code “E” is chosen in this case study, which has 4233 instances.
Since domain knowledge related to the underlying structure of “FIN” is not available, we adopt empirical priors, which is a common practice when subjective priors are not applicable [33,34]. To simulate real-world conditions as realistically as possible, the earliest 20% of instances are treated as historical records and are used for empirical prior estimation, while the remaining instances are used for posterior estimation. The histogram of these 20% of instances is shown in Figure 5.
In empirical prior estimation and model size determination:
  • For the number of components: There are two prominent sharp components starting at 0 and approximately 40, denoted as Component 1 and Component 2, respectively. In addition, there should be another peak, denoted as Component 3, to the right of Component 2 at around 80, as well as a low-probability-density component starting near 0, denoted as Component 4.
  • For the delays of the four components: Components 1 and 4 clearly start near 0, for which strongly informative priors are used. The starting positions of Components 2 and 3 can be estimated but not precisely. Therefore, priors are specified with appropriate strength.
  • For the rates of sub-activities and the mixing weights: Priors cannot be clearly estimated from the histogram, and hence, non-informative priors are adopted.
  • For the number of phases: Except for Component 1, the number of phases, as well as the number of free parameters in S k , cannot be estimated directly and are determined using the Watanabe–Akaike Information Criterion (WAIC).

4.4.2. Goodness-of-Fit Assessment

Table 4 reports the metrics that evaluate the difference in CDFs between models and observations.
M ( Γ ) reports the best goodness of fit, while the others demonstrate similar fit performances. Figure 6a shows the PDF of M s D ( P H ) . The Quantile–Quantile (Q-Q) plot is provided in Figure 6b.
A slight deviation can be observed in the tail fitting of M s D ( P H ) . Since the potential structure of the sub-activities in this case study is mainly manifested before around 200, the tail deviation has a limited influence on parameter interpretation.

4.4.3. Interpretability Assessment

Table 5 reports the medians and 95% CIs of the posteriors of M s D ( P H ) , while Table 6 reports the parameters of M ( E ) , M ( Γ ) , M D ( Γ ) , and M D ( W ) . The expected duration of each component, excluding the delay, is also provided.
The KL divergence of M D ( P H ) from M s D ( P H ) is 4.38 × 10⁻⁴, indicating that M s D ( P H ) has a general phase-type interpretation. Here is a potential structure of activity “FIN” discovered by M s D ( P H ):
Component 1: Approximately 3% of instances show rapid responses of about two days, likely reflecting corrective actions triggered by erroneous information. These instances are indeed present in the event logs.
Component 2: Approximately 32% of instances exhibit a delay of around six weeks, followed by an accelerated processing of one week. This pattern is indicative of a typical timeout, implying that the process enforces a time limit for certain instances.
Components 3 and 4: Approximately 65% of instances go through two phases, both including a stage of approximately half a month. The difference lies in the other phase: for 40% of the instances, it lasts about half a month with a delay of approximately one month, whereas for the remaining 25%, it lasts around 80 days without any delay. A possible explanation is that two different types of treatment take different durations, while the approval process is similar for both of them.
Similarly to Case Study 1, M ( E ) and M ( Γ ) tend to increase the shape parameter to fit the delayed Components 2 and 3. Such parameter interpretation is unreasonable, which also indicates that, without incorporating delays, PH distributions encounter difficulties in fitting Components 2 and 3.
By introducing delays, M D ( Γ ) provides much better interpretability than M ( Γ ) and reveals a different structure. The identified structure is largely reasonable. However, the leftmost peak is not reflected in the parameters, and the shape parameters of 8.58 and 5.58 in Components 1 and 2 are still too large to be meaningful for interpreting the activity structure.
At a macro level in terms of delays and expectations, the structure revealed by M D ( W ) is similar to that of M s D ( P H ) . However, the parameters of the Weibull distribution are inherently less suitable in this scenario.
The complete posteriors of the parameters of M s D ( P H ) are provided in Appendix A (Figure A3).

4.5. Case Study 3: BPIC 2020

4.5.1. Dataset and Priors

BPIC 2020 [35] is a dataset that records events of travel expense claims from a university across two years. In this dataset, although the duration of most activities follows or approximates an exponential distribution, some activities exhibit a regular periodic pattern. We select an activity named “declaration approved by budget owner” in the process “Domestic Declarations” as a representative case to investigate the underlying behavior. The “declaration approved by the budget owner” activity has 2801 instances in total. As in Case Study 2, domain knowledge is not provided and empirical priors are estimated with the earliest 20% of instances, with the remaining 80% for posterior estimation. The histogram of the 20% of instances is shown in Figure 7.
In empirical prior estimation and model size determination:
  • For the number of components: Seven distinct peaks are observed to the left of 150, whereas to the right the multi-peak pattern persists but with sparse data. Accordingly, the mixture model is designed with eight components: seven representing the peaks on the left and one capturing the limited data on the right.
  • For delays: Each component has a clearly identifiable range for its starting position. Hence, the priors for the delays are set to be informative, with the strength determined by the width of the valleys.
  • For the rates of sub-activities and the mixing weights: They can be roughly estimated, but not precisely, and therefore, weakly informative priors are employed.
  • For the number of phases: It cannot be estimated directly and is determined using the WAIC.

4.5.2. Goodness-of-Fit Assessment

Table 7 reports the metrics that evaluate the difference in CDFs between models and observations.
M ( Γ ) achieves the best fit, slightly outperforming M s D ( P H ) and M ( E ) . Figure 8a shows the PDF of M s D ( P H ) . The Q-Q plot is provided in Figure 8b.
The fitting deviations of M s D ( P H ) occur mainly in the tail of the distribution, which arises from the use of a single component to approximate the sparse but multimodal tail for the sake of model simplification. However, this deviation does not affect the interpretation of the seven main peaks. In addition, M s D ( P H ) shows a slight deviation in fitting the second peak, as the shape of this peak tends to be normal, while approximating a normal distribution with a PH distribution requires a sufficient number of phases. This limitation comes from the inherent characteristics of the PH distribution.
As the number of explicit delays increases to eight, the convergence of M D ( Γ ) and M D ( W ) becomes extremely difficult. Even with multiple initializations and good initial values, none of the 100 fitting attempts achieve good convergence. Figure 9 shows D K S of 100 fittings of M D ( Γ ) and M D ( W ) , along with 30 fittings of M s D ( P H ) .

4.5.3. Interpretability Assessment

Table 8 reports the medians and 95% CIs of the posteriors of M s D ( P H ) , while Table 9 reports the parameters of M ( E ) and M ( Γ ) . The expected duration of each component, excluding the delay, is also provided. As M D ( Γ ) and M D ( W ) fail to converge, their parameter estimates are not meaningful and are thus omitted.
The KL divergence from M D ( P H ) to M s D ( P H ) is 9.48 × 10⁻³, indicating that M s D ( P H ) has a general phase-type interpretation.
Component 1 indicates that one-third of cases are responded to immediately. The expected durations of the other components except Component 8 are similar and all consist of comparable two-stage structures, but with delays increasing in increments of approximately 24 h, reflecting the typical daily work schedule. The two-stage structure can be interpreted as the waiting duration and the processing duration.
The structures obtained by M(E) and M(Γ) are similar to that of M_sD(PH), although without explicit delays. M(E) and M(Γ) can only fit the delayed peaks by increasing the shape parameters, which would imply that a single activity is composed of hundreds or even thousands of sub-activities, an interpretation that is clearly implausible.
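A back-of-the-envelope moment calculation illustrates why delayed peaks inflate the shape parameter. For a gamma or Erlang component with shape k and rate λ, the mean is k/λ and the standard deviation is √k/λ, so reproducing a narrow peak located far from zero without a delay parameter forces k ≈ (μ/σ)². Using the Component 7 values of M(E) from Table 9 as an illustration:

```latex
% Gamma/Erlang moments: a narrow peak far from the origin forces a large shape.
\mu = \frac{k}{\lambda}, \qquad
\sigma = \frac{\sqrt{k}}{\lambda}
\quad\Longrightarrow\quad
k = \left(\frac{\mu}{\sigma}\right)^{2}.

% Component 7 of M(E) in Table 9 (k = 996, \lambda = 5.46):
\mu \approx 182.4, \qquad \sigma \approx 5.8, \qquad
\left(\frac{182.4}{5.8}\right)^{2} \approx 1.0 \times 10^{3}.
```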
The complete posteriors of the parameters of M_sD(PH) are provided in Appendix A (Figure A4).

4.6. Summary of Case Studies

From the convergence analysis and case studies, the following conclusions can be drawn:
  • Explicit delays hinder model convergence, whereas the adaptive delay functions adopted in M_sD(PH) effectively mitigate this adverse impact.
  • The fitting performance of M_sD(PH) is not significantly worse than that of purely data-driven methods, while it provides better interpretability of the sub-activity structure.
  • Bayesian inference effectively quantifies the uncertainty of the revealed structure, offering a more comprehensive perspective compared to point estimates.
  • Due to identifiability issues, the posteriors can be relatively wide, and some parameters may differ markedly from domain knowledge. Appropriate priors help guide the posteriors toward reasonable intervals.

5. Conclusions

This paper proposes the sDPHMM for modeling multimodal durations with delay characteristics and for providing potential explanations of their sub-activity structures. A PH mixture model with delays is adopted because it naturally captures the structure of such durations. The adaptive smoothed delay function, combined with the EM algorithm, improves the convergence of the sDPHMM. Bayesian inference is employed to address identifiability issues, providing full posterior distributions of the parameters and quantifying the uncertainty of the sub-activity structure. Experimental results demonstrate that the sDPHMM fits these durations stably and provides interpretable parameters.
In the context of business processes, revealing the latent underlying structure of complex human or technical behaviors is of significant importance for efficiency improvement, as it allows for more fine-grained analysis of bottleneck detection and delay diagnosis at the sub-activity level. The experiments in this paper are primarily based on human-driven business activities, but the approach is also applicable to other datasets with similar characteristics.
The proposed sDPHMM has some limitations. First, its computational cost is relatively high: both the MCMC sampling used in the sDPHMM and the PH distribution itself are computationally intensive, and combining them exacerbates this issue, resulting in slow computation on large-scale datasets. In such circumstances, subsampling the data is an alternative strategy. Moreover, if the duration distribution is not well aligned with the PH family, the fit may deteriorate, thereby reducing the accuracy of parameter interpretation. This limitation, however, originates from the PH distribution itself; process data of this kind is highly complex, and no single model can fit every situation.
Future work will focus on algorithmic applications, such as employing the sDPHMM for delay diagnosis and drift detection at the sub-activity level, while also exploring potential computational optimizations for the sDPHMM.

Author Contributions

Conceptualization, D.W. and S.M.; Methodology, D.W.; Validation, D.W. and S.M.; Formal analysis, D.W. and S.M.; Investigation, D.W.; Data curation, D.W.; Writing—original draft, D.W.; Writing—review & editing, S.M., L.Y. and I.M.; Visualization, D.W.; Supervision, S.M., I.M. and Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Invest NI through the Advanced Research and Engineering Centre and was part-financed by the European Regional Development Fund under the Investment for Growth and Jobs Programme 2014–2020.

Data Availability Statement

The two data sets presented in this study are openly available as follows: (1) Hospital Billing-Event Log: Mannhardt, Felix (2017): Hospital Billing-Event Log. Version 1. 4TU.ResearchData. dataset. https://doi.org/10.4121/uuid:76c46b83-c930-4798-a1c9-4be94dfeb741 [32]. (2) BPI Challenge 2020: Domestic Declarations: van Dongen, Boudewijn (2020): BPI Challenge 2020: Domestic Declarations. Version 1. 4TU.ResearchData. dataset. https://doi.org/10.4121/uuid:3f422315-ed9d-4882-891f-e180b5b4feb5 [35].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The posteriors of M_sD(PH) in the three case studies and of M_sDNI(PH) in Case Study 1 are provided as probability density plots. Here, α_{k,i} denotes the i-th entry of the initial probability vector of Component k, S_{k,ij} denotes the (i,j)-th entry of the sub-generator matrix of Component k, and δ_k denotes the delay of Component k. The x-axis represents the parameter values, and the y-axis represents the probability density.
Figure A1. Case Study 1: posteriors of M_sD(PH).
Figure A2. Case Study 1: posteriors of M_sDNI(PH).
Figure A3. Case Study 2: posteriors of M_sD(PH).
Figure A4. Case Study 3: posteriors of M_sD(PH).

References

  1. Zou, M.; Zeng, Q.; Liu, C.; Cao, R.; Chen, S.; Zhao, Z. Prediction of Remaining Execution Time of Business Processes with Multiperson Collaboration in Assembly Line Production. IEEE Trans. Comput. Soc. Syst. 2025, 12, 1279–1295. [Google Scholar] [CrossRef]
  2. Bemthuis, R.; van Slooten, N.; Jayasinghe Arachchige, J.; Piest, J.; Bukhsh, F.A. A Classification of Process Mining Bottleneck Analysis Techniques for Operational Support. In Proceedings of the 18th International Conference on e-Business (ICE-B 2021), Amsterdam, The Netherlands, 7–9 July 2021; pp. 127–135. [Google Scholar]
  3. van Zelst, S.J.; Mannhardt, F.; de Leoni, M.; Koschmider, A. Event abstraction in process mining: Literature review and taxonomy. Granul. Comput. 2021, 6, 719–736. [Google Scholar] [CrossRef]
  4. Lashkevich, K.; Milani, F.; Chapela-Campa, D.; Suvorau, I.; Dumas, M. Why Am I Waiting? Data-Driven Analysis of Waiting Times in Business Processes. In Proceedings of the 31st International Conference on Advanced Information Systems Engineering (CAiSE 2023), Utrecht, The Netherlands, 12–16 June 2023; pp. 174–190. [Google Scholar]
  5. Martin, N.; Pufahl, L.; Mannhardt, F. Detection of Batch Activities from Event Logs. Inf. Syst. 2021, 95, 101642. [Google Scholar] [CrossRef]
  6. Sato, D.M.V.; De Freitas, S.C.; Barddal, J.P.; Scalabrin, E.E. A survey on concept drift in process mining. ACM Comput. Surv. 2021, 54, 1–38. [Google Scholar] [CrossRef]
  7. Andrews, R.; Wynn, M. Shelf Time Analysis in CTP Insurance Claims Processing. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2017), Jeju, Republic of Korea, 23–26 May 2017; pp. 151–162. [Google Scholar]
  8. O’Cinneide, C.A. Characterization of phase-type distributions. Commun. Stat. Stoch. Models 1990, 6, 1–57. [Google Scholar] [CrossRef]
  9. David, A.; Shepp, L. The least variable phase type distribution is Erlang. Stoch. Model. 1987, 3, 467–473. [Google Scholar] [CrossRef]
  10. Bladt, M. A review on phase-type distributions and their use in risk theory. Astin Bull. J. IAA 2005, 35, 145–161. [Google Scholar] [CrossRef]
  11. Bladt, M.; Nielsen, B.F. Matrix-Exponential Distributions in Applied Probability; Springer: New York, NY, USA, 2017; Volume 81, pp. 125–197. [Google Scholar]
  12. Lindqvist, B.H. Phase-type models for competing risks, with emphasis on identifiability issues. Lifetime Data Anal. 2023, 29, 318–341. [Google Scholar] [CrossRef]
  13. Mena, R.H.; Walker, S.G. On the Bayesian mixture model and identifiability. J. Comput. Graph. Stat. 2015, 24, 1155–1169. [Google Scholar] [CrossRef]
  14. Kéry, M. Identifiability in N-mixture models: A large-scale screening test with bird data. Ecology 2018, 99, 281–288. [Google Scholar] [CrossRef]
  15. Van der Aalst, W.M.; Schonenberg, M.H.; Song, M. Time Prediction Based on Process Mining. Inf. Syst. 2011, 36, 450–475. [Google Scholar] [CrossRef]
  16. Yang, L.; McClean, S.; Donnelly, M.; Khan, K.; Burke, K. Detecting process duration drift using gamma mixture models in a left-truncated and right-censored environment. ACM Trans. Knowl. Discov. Data 2024, 18, 1–24. [Google Scholar]
  17. Nogayama, T.; Takahashi, H. Estimation of average latent waiting and service times of activities from event logs. In Proceedings of the International Conference on Business Process Management, Innsbruck, Austria, 13 August 2015; pp. 172–179. [Google Scholar]
  18. Ogunwale, O.D.; Ajewole, K.P.; Olayinka, K.P.; Omole, E.O.; Amoyedo, F.E.; Akinremi, O.J. A new family of continuous probability distribution: Gamma-exponential distribution (GED) theory and its properties. In Proceedings of the 2024 International Conference on Science, Engineering and Business for Driving Sustainable Development Goals (SEB4SDG), Omu-Aran, Nigeria, 2–4 April 2024; pp. 1–8. [Google Scholar]
  19. Wang, D.; McClean, S.; McChesney, I.; Tariq, Z. Process Duration Modeling and Concept Drift Detection using Phase-Type Distributions. In Proceedings of the 12th International Conference on Information Technology (ICIT), Amman, Jordan, 27–30 May 2025; pp. 659–664. [Google Scholar]
  20. McClean, S.; Barton, M.; Garg, L.; Fullerton, K. A modeling framework that combines Markov models and discrete-event simulation for stroke patient care. ACM Trans. Model. Comput. Simul. 2011, 21, 25. [Google Scholar] [CrossRef]
  21. Le, T.V.; Kwoh, C.K.; Teo, E.S.; Lee, K.H. Analyzing trends of hospital length of stay using Phase-type distributions. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining Workshops, Vancouver, BC, Canada, 11 December 2011; pp. 1026–1033. [Google Scholar]
  22. Kim, H.; Kim, P. Reliability models for a nonrepairable system with heterogeneous components having a phase-type time-to-failure distribution. Reliab. Eng. Syst. Saf. 2017, 159, 37–46. [Google Scholar] [CrossRef]
  23. O’Cinneide, C.A. On non-uniqueness of representations of phase-type distributions. Commun. Stat. Stoch. Models 1989, 5, 247–259. [Google Scholar] [CrossRef]
  24. Cheng, B.; Mamon, R. A uniformisation-driven algorithm for inference-related estimation of a phase-type ageing model. Lifetime Data Anal. 2023, 29, 142–187. [Google Scholar] [CrossRef]
  25. Ausın, M.C.; Wiper, M.P.; Lillo, R.E. Bayesian estimation for the M/G/1 queue using a phase-type approximation. J. Stat. Plan. Inference 2004, 118, 83–101. [Google Scholar] [CrossRef]
  26. Gamerman, D.; Lopes, H.F. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference; Chapman and Hall/CRC: Boca Raton, FL, USA, 2006. [Google Scholar]
  27. Karras, C.; Karras, A.; Avlonitis, M.; Sioutas, S. An Overview of MCMC Methods: From Theory to Applications. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Crete, Greece, 17–20 June 2022; pp. 319–332. [Google Scholar]
  28. Bladt, M.; Gonzalez, A.; Lauritzen, S.L. The Estimation of Phase-Type Related Functionals Using Markov Chain Monte Carlo Methods. Scand. Actuar. J. 2003, 4, 280–300. [Google Scholar] [CrossRef]
  29. Jose, J.K.; M, D.; Sangita, K.; George, S. Phase-type stress-strength reliability models under progressive type-II right censoring. Commun. Stat. Theory Methods 2024, 53, 8498–8524. [Google Scholar] [CrossRef]
  30. Li, J.; Chen, J.; Zhang, X.; Wu, Z.; Yu, L. A new phase-type distribution-based method for time-dependent system reliability analysis. Qual. Technol. Quant. Manag. 2025, 22, 577–604. [Google Scholar] [CrossRef]
  31. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  32. Mannhardt, F. Hospital Billing-Event Log; Eindhoven University of Technology: Eindhoven, The Netherlands, 2017. [Google Scholar]
  33. Nabi, S.; Nassif, H.; Hong, J.; Mamani, H.; Imbens, G. Bayesian meta-prior learning using Empirical Bayes. Manag. Sci. 2022, 68, 1737–1755. [Google Scholar] [CrossRef]
  34. Tran, B.H.; Rossi, S.; Milios, D.; Filippone, M. All you need is a good functional prior for Bayesian deep learning. J. Mach. Learn. Res. 2022, 23, 1–56. [Google Scholar]
  35. van Dongen, B. BPI Challenge 2020; 4TU.Centre for Research Data: Enschede, The Netherlands, 2020. [Google Scholar]
Figure 1. Approximation under different values of κ.
Figure 2. Approximation under different adjusted κ.
Figure 3. D_KS over 30 runs. The box represents the interquartile range, the orange line shows the median, and the whiskers indicate the full range.
Figure 4. Process of synthetic data.
Figure 5. Case Study 2: histogram for empirical prior estimation.
Figure 6. Case Study 2: fit performance of M_sD(PH).
Figure 7. Case Study 3: histogram for empirical prior estimation.
Figure 8. Case Study 3: fit performance of M_sD(PH).
Figure 9. Case Study 3: D_KS of M_sD(PH), M_D(Γ), and M_D(W). The box represents the interquartile range, the orange line shows the median, and the whiskers indicate the full range.

Table 1. Case Study 1: metrics of goodness of fit.

| Model | D_KS (×10⁻²) | D_W |
| M(GT) | 1.54 | 1.81 |
| M_sD(PH) | 1.13 | 1.16 |
| M_sDNI(PH) | 0.78 | 0.94 |
| M(E) | 3.39 | 5.13 |
| M(Γ) | 3.05 | 3.85 |
| M_D(Γ) | 1.11 | 0.88 |
| M_D(W) | 1.10 | 1.07 |

Table 2. Case Study 1: parameters of M(GT), M_sD(PH), and M_sDNI(PH).

| Model | Component | Mixing | Rate 1 (×10⁻²) | Rate 2 (×10⁻²) | Delay | Expectation |
| M(GT) | 1 | 0.43 | 5 | 2.5 | 200 | 60 |
| M(GT) | 2 | 0.57 | 1 | 2 | 0 | 150 |
| M_sD(PH) | 1 | 0.45 | 5.36 [4.16, 6.54] | 2.23 [2.04, 2.52] | 198.57 [197.76, 199.58] | 63.46 [60.97, 66.24] |
| M_sD(PH) | 2 | 0.55 | 1.09 [0.97, 1.31] | 1.88 [1.44, 2.34] | 0.39 [0.37, 0.39] | 144.43 [140.36, 149.16] |
| M_sDNI(PH) | 1 | 0.46 | 6.44 [5.38, 9.16] | 2.02 [1.87, 2.19] | 199.40 [198.53, 200.20] | 64.26 [61.67, 67.03] |
| M_sDNI(PH) | 2 | 0.54 | 1.25 [1.03, 2.06] | 1.67 [1.08, 2.28] | 0.28 [0.26, 0.29] | 140.81 [137.01, 145.93] |

Table 3. Case Study 1: parameters of M(E), M(Γ), M_D(Γ), and M_D(W).

| Model | Component | Mixing | Shape | Rate (×10⁻²) | Delay | Expectation |
| M(E) | 1 | 0.41 | 46 | 18.31 | / | 251.25 |
| M(E) | 2 | 0.59 | 2 | 1.25 | / | 159.94 |
| M(Γ) | 1 | 0.42 | 46.26 | 18.46 | / | 250.66 |
| M(Γ) | 2 | 0.58 | 1.78 | 1.11 | / | 159.48 |
| M_D(Γ) | 1 | 0.45 | 1.87 | 3.01 | 198.07 | 62.32 |
| M_D(Γ) | 2 | 0.55 | 1.68 | 1.18 | 5.04 | 142.63 |
| M_D(W) | 1 | 0.42 | 1.34 | 1.63 | 202.16 | 56.28 |
| M_D(W) | 2 | 0.58 | 1.40 | 0.60 | 2.97 | 152.15 |

Table 4. Case Study 2: metrics of goodness of fit.

| Model | D_KS (×10⁻²) | D_W |
| M_sD(PH) | 2.07 | 1.14 |
| M(E) | 1.34 | 1.25 |
| M(Γ) | 1.33 | 0.76 |
| M_D(Γ) | 2.02 | 1.36 |
| M_D(W) | 1.93 | 1.55 |

Table 5. Case Study 2: parameters of M_sD(PH).

| Component | Mixing | Rate 1 (×10⁻²) | Rate 2 (×10⁻²) | Delay | Expectation |
| 1 | 0.03 | 50.83 [42.89, 58.23] | / | (6.12 [0.20, 25.91]) × 10⁻⁴ | 1.97 [1.72, 2.33] |
| 2 | 0.32 | 13.40 [12.70, 14.11] | / | 43.21 [43.16, 43.26] | 7.47 [7.08, 7.87] |
| 3 | 0.40 | 5.94 [4.73, 7.79] | 5.65 [4.54, 7.34] | 32.21 [31.76, 32.64] | 34.42 [33.02, 35.80] |
| 4 | 0.25 | 7.41 [5.74, 9.51] | 1.24 [1.14, 1.34] | (1.30 [0.05, 2.71]) × 10⁻² | 94.03 [88.60, 100.18] |

Table 6. Case Study 2: parameters of M(E), M(Γ), M_D(Γ), and M_D(W).

| Model | Component | Mixing | Shape | Rate (×10⁻²) | Delay | Expectation |
| M(E) | 1 | 0.05 | 1 | 33.58 | / | 2.98 |
| M(E) | 2 | 0.30 | 255 | 543.04 | / | 46.96 |
| M(E) | 3 | 0.35 | 14 | 22.31 | / | 62.75 |
| M(E) | 4 | 0.30 | 2 | 2.00 | / | 100.00 |
| M(Γ) | 1 | 0.09 | 0.26 | 2.68 | / | 9.70 |
| M(Γ) | 2 | 0.31 | 218.49 | 463.97 | / | 47.09 |
| M(Γ) | 3 | 0.34 | 12.92 | 20.00 | / | 64.60 |
| M(Γ) | 4 | 0.27 | 1.54 | 1.53 | / | 100.65 |
| M_D(Γ) | 1 | 0.45 | 8.58 | 14.48 | 3.75 | 59.25 |
| M_D(Γ) | 2 | 0.31 | 5.58 | 69.56 | 39.51 | 8.02 |
| M_D(Γ) | 3 | 0.16 | 1.17 | 1.42 | 51.50 | 82.39 |
| M_D(Γ) | 4 | 0.08 | 0.81 | 3.46 | 1.46 | 23.41 |
| M_D(W) | 1 | 0.03 | 0.82 | 12.79 | 1.23 | 8.68 |
| M_D(W) | 2 | 0.29 | 1.84 | 15.59 | 41.86 | 5.70 |
| M_D(W) | 3 | 0.42 | 2.00 | 2.63 | 29.93 | 22.70 |
| M_D(W) | 4 | 0.26 | 1.33 | 0.94 | 3.66 | 97.68 |

Table 7. Case Study 3: metrics of goodness of fit.

| Model | D_KS (×10⁻²) | D_W |
| M_sD(PH) | 2.91 | 1.42 |
| M(E) | 3.02 | 1.45 |
| M(Γ) | 1.64 | 1.09 |
| M_D(Γ) | 7.22 | 7764.70 |
| M_D(W) | 6.09 | 14.03 |

Table 8. Case Study 3: parameters of M_sD(PH).

| Component | Mixing | Rate 1 | Rate 2 | Delay | Expectation |
| 1 | 0.34 | 0.41 [0.39, 0.42] | / | (5.49 [0.16, 17.15]) × 10⁻³ | 2.46 [2.40, 2.56] |
| 2 | 0.31 | 0.22 [0.20, 0.24] | 0.20 [0.18, 0.22] | 12.29 [12.12, 12.41] | 9.57 [9.17, 10.16] |
| 3 | 0.08 | 0.19 [0.17, 0.21] | 0.25 [0.22, 0.27] | 37.43 [36.95, 37.74] | 9.39 [8.85, 10.00] |
| 4 | 0.10 | 0.21 [0.18, 0.25] | 0.17 [0.15, 0.19] | 60.14 [59.74, 60.37] | 10.42 [9.76, 11.21] |
| 5 | 0.06 | 0.21 [0.18, 0.25] | 0.17 [0.14, 0.21] | 85.25 [84.67, 85.53] | 10.59 [9.68, 11.76] |
| 6 | 0.05 | 0.29 [0.25, 0.32] | 0.27 [0.19, 0.30] | 110.93 [110.55, 111.19] | 7.19 [6.69, 8.58] |
| 7 | 0.03 | 0.21 [0.17, 0.25] | 0.26 [0.20, 0.30] | 133.00 [131.62, 133.63] | 8.67 [7.95, 9.68] |
| 8 | 0.04 | (1.28 [1.04, 1.57]) × 10⁻² | 1.09 [1.05, 1.11] | 157.55 [156.53, 158.71] | 78.00 [63.95, 95.67] |

Table 9. Case Study 3: parameters of M(E) and M(Γ).

| Model | Component | Mixing | Shape | Rate | Delay | Expectation |
| M(E) | 1 | 0.32 | 1 | 0.44 | / | 2.28 |
| M(E) | 2 | 0.31 | 20 | 0.97 | / | 20.58 |
| M(E) | 3 | 0.09 | 100 | 2.22 | / | 45.02 |
| M(E) | 4 | 0.10 | 199 | 2.87 | / | 69.38 |
| M(E) | 5 | 0.11 | 204 | 1.94 | / | 105.01 |
| M(E) | 6 | 0.03 | 767 | 5.33 | / | 143.98 |
| M(E) | 7 | 0.02 | 996 | 5.46 | / | 182.27 |
| M(E) | 8 | 0.02 | 1952 | 6.72 | / | 290.66 |
| M(Γ) | 1 | 0.35 | 0.89 | 0.40 | / | 2.23 |
| M(Γ) | 2 | 0.31 | 20.50 | 0.99 | / | 20.64 |
| M(Γ) | 3 | 0.09 | 98.42 | 2.18 | / | 45.06 |
| M(Γ) | 4 | 0.09 | 235.40 | 3.41 | / | 68.98 |
| M(Γ) | 5 | 0.06 | 383.91 | 4.09 | / | 93.83 |
| M(Γ) | 6 | 0.05 | 907.82 | 7.73 | / | 117.51 |
| M(Γ) | 7 | 0.03 | 1192.00 | 8.25 | / | 144.48 |
| M(Γ) | 8 | 0.02 | 2151.95 | 11.82 | / | 182.04 |

