3.3. Maximum a Posteriori Estimation
Knowledge of business processes is typically available, and it is especially valuable on a small dataset. When the amount of data is limited, the samples may deviate from the true distribution, and the prior can help estimate the true distribution more accurately. Moreover, in models with identifiability and uniqueness issues, the prior can guide the posterior toward a more reasonable distribution under uncertainty [25,26]. Therefore, this paper adopts maximum a posteriori (MAP) estimation instead of MLE.
The prior distribution can be factorized as:
Based on widely adopted prior distributions for PH distributions [28,29,30] and the parameter ranges, we choose a Dirichlet distribution as the prior for the probability-vector parameters, a truncated normal distribution as the prior for the delays, and a gamma distribution for each rate. The Dirichlet prior is given by: where the hyperparameters of the prior determine its shape, and the normalizing constant is a multivariate Beta function.
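As an illustrative sketch (not the paper's code), the Dirichlet log-density and its multivariate Beta normalizer can be evaluated directly; the hyperparameter vector `alpha` here is a hypothetical stand-in for the prior's parameters:

```python
import math

def log_dirichlet(pi, alpha):
    """Log-density of a Dirichlet prior; the normalizer is the
    log multivariate Beta function B(alpha)."""
    assert abs(sum(pi) - 1.0) < 1e-9, "probability vector must sum to 1"
    # log B(alpha) = sum(lgamma(a_k)) - lgamma(sum(a_k))
    log_beta = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(p) for p, a in zip(pi, alpha)) - log_beta
```

With all hyperparameters equal to one the prior is uniform over the simplex, which is a convenient sanity check.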
The likelihood can be defined as: where the data denote the collection of observations. According to Bayes’ theorem, the posterior distribution is given by: where the marginal likelihood does not depend on the parameters. Therefore, the posterior is proportional to the product of the likelihood and the prior:
Then, the objective of the optimization is to find the parameter that maximizes the posterior.
3.4. Expectation–Maximization
Direct estimation of the high-dimensional parameter vector is challenging, as high-dimensional optimization is prone to numerical instability and local optima. Therefore, the EM algorithm is applied, with the components updated separately in each M-step, which simplifies the high-dimensional optimization and stabilizes convergence. The log-posterior is given by:
in which the log marginal likelihood is a constant. A latent variable Z is introduced to represent the underlying factors or categories of the observed data. By introducing an arbitrary distribution over Z, normalized to sum to one, and applying Jensen’s inequality, the objective function, namely the evidence lower bound (ELBO) of the log-posterior, can be obtained:
In the E-step, given fixed parameters, the objective is to find the distribution over Z that makes the bound tight. After simplification, this yields the posterior of Z given the data and the current parameters. Then the responsibility that the k-th component takes for an observation can be calculated as:
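The responsibility computation can be sketched as follows; `comp_logpdfs` is a hypothetical placeholder for the per-component (PH) log-densities, and the log-sum-exp trick is used for numerical stability:

```python
import math

def responsibilities(x_n, weights, comp_logpdfs):
    """E-step: responsibility of each component for one observation x_n.
    `comp_logpdfs` maps x to a list of per-component log-densities
    (placeholders for the PH component densities in the paper)."""
    logs = [math.log(w) + lp for w, lp in zip(weights, comp_logpdfs(x_n))]
    # log-sum-exp normalization avoids underflow for small densities
    m = max(logs)
    norm = m + math.log(sum(math.exp(l - m) for l in logs))
    return [math.exp(l - norm) for l in logs]
```

By construction the returned responsibilities are nonnegative and sum to one.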
In the M-step, the distribution over Z is held fixed, and the objective is to maximize the bound. Starting from the corresponding identity, replacing X and Z with the full data, and simplifying, we arrive at:
By applying the Lagrange multiplier method and simplifying the expression, an analytical solution for the mixing weights can be derived.
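Under a Dirichlet prior, the Lagrange-multiplier solution takes the standard MAP form for mixture weights; the sketch below assumes this standard result, which may differ in detail from the paper's exact derivation:

```python
def update_weights(resp, alpha):
    """M-step MAP update of the mixing weights under a Dirichlet prior.
    `resp[n][k]` is the responsibility of component k for observation n.
    Derived with a Lagrange multiplier enforcing sum(weights) == 1:
    w_k = (sum_n resp[n][k] + alpha_k - 1) / (N + sum(alpha) - K)."""
    n_obs, n_comp = len(resp), len(alpha)
    counts = [sum(r[k] for r in resp) for k in range(n_comp)]
    denom = n_obs + sum(alpha) - n_comp
    return [(counts[k] + alpha[k] - 1.0) / denom for k in range(n_comp)]
```

With all hyperparameters equal to one, this reduces to the usual maximum-likelihood update; larger hyperparameters pull the weights toward uniform.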
The component parameters are independent of each other; therefore, each component can be optimized separately. Here, the terms related to the k-th component are extracted and collected into a per-component objective. The goal is to find the component parameters that maximize this objective. However, no closed-form solution exists, so we combine an optimization method with Bayesian inference to analyze it, as discussed in Section 3.6.
3.5. Smooth-Delayed PH Distribution
The DPHMM in Equation (5) has no difficulty representing a probability distribution, but it raises an issue for optimization or inference with EM: in this setting, the delay parameter is not optimizable. Consider a component of the DPHMM with its delay as defined before; the PDF of this component is zero at any point below the delay:
Therefore, as shown in Equation (16), any observation below the delay is excluded from the optimization objective. In other words, the optimization or inference of this component only considers data beyond its delay. As a result, the delay is not adjusted toward smaller values.
On the other hand, the delay is also not adjusted toward higher values. For any observation beyond the current delay, if the delay is increased to a new value, then: Since Equation (20) is part of the objective, the objective decreases, which contradicts the goal of maximization; consequently, the delay will not increase.
The simplest solution is to replace the zero density below the delay with a small constant. However, this solution is not ideal. First, the resulting function is discontinuous, leading to unstable optimization or inference. Second, it slightly violates the normalization required of a probability distribution. Moreover, the value of the constant is not transferable across datasets and must be chosen for each specific distribution: if it is too large, the delay may grow abnormally, while if it is too small, the delay remains nearly impossible to optimize.
In general, the delayed PH distribution essentially transforms its input by a hard shift truncated at zero. This transformation introduces the optimization issue in the EM framework. To address it, we adopt a parameterized softplus function [31] as a differentiable, monotonic, and positive-valued delay function that approximates the hard transformation.
where c is a smoothness parameter that controls how sharply the transition occurs around the delay. This hyperparameter is needed to control how closely the original function is approximated, since a poor approximation would undermine the interpretability of the parameters. The derivative of the delay function is:
Then, the modified PDF, which serves as a smoothed form of the delayed density, is given by a change of variables with the Jacobian correction: where the derivative term accounts for the Jacobian adjustment due to the transformation. Here, the function is first examined on both sides of the delay to analyze how well the smoothed form approximates the original:
The integral of the smoothed PDF over its support equals 1, ensuring the normalization of the distribution. Moreover, the new function is continuous and smooth, which benefits optimization and inference.
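A minimal sketch of the smooth delay transform and the Jacobian-corrected density, using a unit-rate exponential as a stand-in for a zero-delay PH component (the exact softplus parameterization is assumed):

```python
import math

def smooth_shift(x, d, c):
    """Parameterized softplus: a smooth version of max(x - d, 0).
    Larger c gives a sharper transition around the delay d."""
    z = c * (x - d)
    # overflow-safe evaluation of log(1 + exp(z)) / c
    return (max(z, 0.0) + math.log1p(math.exp(-abs(z)))) / c

def smooth_shift_deriv(x, d, c):
    """Derivative of smooth_shift: the logistic sigmoid of c*(x - d)."""
    return 1.0 / (1.0 + math.exp(-c * (x - d)))

def smoothed_pdf(x, d, c, base_pdf):
    """Change of variables with Jacobian correction:
    f~(x) = f(g(x)) * g'(x), where g is the smooth delay transform
    and `base_pdf` stands in for a zero-delay PH density."""
    return base_pdf(smooth_shift(x, d, c)) * smooth_shift_deriv(x, d, c)
```

Because the transform maps the real line monotonically onto the positive half-line, the smoothed density integrates to one, matching the normalization argument above.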
Figure 1 shows the curves of the smoothed density for different values of c. It can be observed that the larger the value of c, the more closely the curve approximates the delayed PH distribution. However, even for the same value of c, the degree of approximation varies across distributions: the wider the peak, the better the approximation. This raises a concern that inconsistent approximation among the components of a mixture model may negatively impact convergence.
To address the issue of inconsistent approximation quality, we assign a unique smoothness value to each component according to its variance: a uniform parameter is shared by all components in the t-th EM iteration, from which an adaptive parameter is derived for the k-th component.
Figure 2 shows the approximation under different values of c. The higher the variance of a distribution, the lower its assigned smoothness value. Consequently, the degree of approximation is similar under the same c, which ensures consistent parameter convergence among the components of a mixture model. As c grows sufficiently large, the approximation becomes correspondingly close.
However, a large c brings back the original issue, namely a non-optimizable delay in EM. On the other hand, a small c introduces a discrepancy that cannot be ignored, compromising the interpretability of the parameters. Therefore, we expect c to be small in the early EM iterations, so that the curve is smooth enough to facilitate convergence, and to increase gradually as the parameters converge, making the smoothed density approach the delayed one. To achieve this, we set an initial value that grows at a fixed rate over the EM iterations:
As c increases with the EM iterations, the limiting behavior of the smoothed density is analyzed.
As shown in Equation (28), if c is large enough, the smoothed and delayed densities are equivalent everywhere except at the delay point itself. However, the probability that x hits a specific point in a continuous one-dimensional space is zero. Therefore, as the EM iterations proceed and c becomes sufficiently large, the smoothed density converges to the delayed one. Consequently, the estimated parameters preserve the original interpretability of the PH distribution for further analysis.
In this paper, the initial value and the growth rate are fixed, with the total number of iterations set to 30. Under this set of hyperparameters, the final value of c is sufficiently large to ensure that the smoothed density closely approximates the delayed one. All three case studies presented in this paper adopt this setting and achieve stable convergence. These hyperparameters can be adjusted if necessary; for example, iterations may be terminated early if the objective function exhibits minimal change. However, when parameter interpretability is concerned, it is recommended to ensure that the final value of c is sufficiently large, preferably above 200.
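The schedule can be sketched as follows; the initial value, the growth rate, and the exact form of the variance-based adaptation are illustrative assumptions, not the paper's reported values:

```python
def c_schedule(c0, r, t):
    """Geometric growth of the smoothness parameter over EM iterations:
    c_t = c0 * r**t. The values of c0 and r are illustrative."""
    return c0 * r ** t

def component_c(c_t, variance):
    """Per-component smoothness (assumed form): scale the shared c_t by the
    component's standard deviation, so that wider components receive a
    smaller c and the approximation quality stays comparable."""
    return c_t / variance ** 0.5
```

For example, an initial value of 1 with a growth rate of 1.2 exceeds 200 after 30 iterations, consistent with the recommendation above.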
Since c is not a parameter to be fitted in EM, the smoothed and delayed forms are formally equivalent within EM. Moreover, the equations of MAP (Section 3.3) and EM (Section 3.4) are independent of the specific expression of the density, so the smoothed form can replace the delayed one in MAP and EM without affecting the validity of the equations. To distinguish the new form from the original DPHMM in Equation (5), we name it the smooth-delayed PH mixture model (sDPHMM), whose formula is given by:
In summary, the sDPHMM is proposed on the basis of the DPHMM and serves as a smoothed approximation to the DPHMM, whose parameters have precise physical meanings. The approximation resolves the parameter-estimation issue at the cost of some interpretability; however, when the approximation is sufficiently close, which can be achieved by increasing c, the parameters of the sDPHMM can be considered to possess the same interpretability as those of the DPHMM. Here, the delay function smooths the explicit delay, allowing the sDPHMM to be optimized within the EM framework, and its smoothness parameter is adapted to the variance of each component to balance convergence across components.
3.6. Optimization and Bayesian Inference
As mentioned in Section 3.4, the per-component objective in Equation (16) is maximized for each component of the sDPHMM in the M-step. Optimization methods are the most common approach to problems of this kind. However, in addition to the identifiability and uniqueness problems discussed in Section 2.2, a PH distribution is typically difficult to optimize over its relatively large parameter space, as the parameters often become trapped in local optima.
Compared with optimization methods, which provide point estimates, Bayesian inference explores the parameter space and provides distributions over the parameters, which is more comprehensive and offers much better interpretability. As shown in Equation (9), obtaining the posterior requires the marginal likelihood to be computed first. However, in a mixed PH distribution, the parameters lie in a high-dimensional space, so the marginal likelihood is intractable. Instead, the MCMC method, specifically the Metropolis–Hastings algorithm, is employed to approximate the posterior. Because directly sampling the full, high-dimensional parameter set is challenging, we adopt a strategy, within the EM framework, of sampling the parameters of each component of the sDPHMM separately.
As discussed in Section 3.4, the mixing parameter is obtained analytically, while the target function for each component’s parameters is given by Equation (16). The acceptance probability of the Metropolis–Hastings sampling is given by: where the proposal distribution generates candidate parameter values.
In each sampling step, a new parameter is proposed from the current one through the proposal distribution, and the probability of accepting it is calculated. If the proposal is accepted, the current parameter is updated to the proposed value; otherwise, it remains unchanged. After each step, the current value is collected into a sample set. If the number of samples M is sufficiently large, the empirical distribution of the collected samples approximates the posterior distribution of the parameters.
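A random-walk Metropolis–Hastings sketch for a single scalar parameter; the paper samples full component parameter vectors, and the Gaussian proposal here is symmetric, so the proposal terms cancel in the acceptance ratio:

```python
import math
import random

def metropolis_hastings(log_target, theta0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for one scalar parameter.
    Proposes theta' ~ Normal(theta, step^2); with a symmetric proposal
    the acceptance probability is min(1, target(theta')/target(theta))."""
    rng = random.Random(seed)
    theta = theta0
    log_t = log_target(theta)
    samples = []
    for _ in range(n_samples):
        prop = theta + rng.gauss(0.0, step)
        log_p = log_target(prop)
        # accept with probability min(1, exp(log_p - log_t))
        if rng.random() < math.exp(min(0.0, log_p - log_t)):
            theta, log_t = prop, log_p
        samples.append(theta)  # collect the current state after each step
    return samples
```

Note that `log_target` only needs to be known up to an additive constant, which is why the intractable marginal likelihood never has to be evaluated.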
Bayesian inference, while capable of capturing full posterior distributions, often incurs substantially higher computational costs, particularly in the case of PH distributions or when the dataset is large.
However, analyzing the posterior distribution in the early stage of the EM iterations, when the responsibilities are still changing rapidly, is neither meaningful nor necessary. Given that the parameter estimation involves both bounds and linear constraints in this case, we adopt Sequential Least Squares Programming (SLSQP), an efficient optimization method, to update the parameters from the beginning of EM and defer Bayesian inference until the responsibilities stabilize. In practice, only a few EM iterations with Bayesian inference are required to obtain stable and well-behaved posterior distributions, which significantly alleviates the computational burden.
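A toy illustration of an SLSQP update via `scipy.optimize.minimize`; the exponential log-likelihood and the positivity bound stand in for the paper's component objective, and linear constraints would be passed through the `constraints` argument:

```python
import numpy as np
from scipy.optimize import minimize

def fit_rate_slsqp(data, lam0=1.0):
    """Toy M-step stand-in: fit an exponential rate by minimizing the
    negative log-likelihood with SLSQP under a positivity bound."""
    x = np.asarray(data, dtype=float)
    nll = lambda lam: -(len(x) * np.log(lam[0]) - lam[0] * x.sum())
    res = minimize(nll, x0=[lam0], method="SLSQP", bounds=[(1e-6, None)])
    return float(res.x[0])
```

For exponential data the optimum is the reciprocal of the sample mean, which makes the result easy to verify.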
Let the k-th PH component have a given number of phases, with a corresponding cost per density evaluation. Within one EM iteration, the MCMC sampling dominates the complexity, which scales with the number of MCMC samples M and the data size N. If an optimization-based method is used instead, the complexity of one EM iteration scales with the number of optimizer iterations I, typically on the order of tens, whereas MCMC sampling requires thousands of samples to obtain a reliable posterior distribution.
We acknowledge that even with this approach, the computational cost is only reduced from unbearably high to barely acceptable. Due to the inherent complexity of PH distributions and MCMC sampling, the method is not computationally fast. However, the main advantage lies in obtaining interpretable posterior distributions and analyzing the potential underlying structure of complex activities, which is the primary objective of the sDPHMM and is typically performed offline.
Finally, samples from the posterior distribution of the component parameters are obtained. Together with the mixing weights, the PDF of the posterior predictive distribution can be given:
Equation (31) is based on the full posterior of the parameters, which entails a high computational cost. Alternatively, one can use the mode or the expectation of the posterior distribution to simplify the computation.
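The posterior predictive average can be sketched as follows; `comp_pdf` is a hypothetical stand-in for the sDPHMM component density:

```python
def posterior_predictive_pdf(x, weights, theta_samples, comp_pdf):
    """Posterior predictive density at x: the mixture density averaged
    over M posterior draws. `theta_samples[m][k]` is the m-th posterior
    sample of component k's parameters; `comp_pdf(x, theta)` is a
    stand-in for a single component's density."""
    total = 0.0
    for draw in theta_samples:
        total += sum(w * comp_pdf(x, th) for w, th in zip(weights, draw))
    return total / len(theta_samples)
```

Using only the posterior mode or mean amounts to passing a single "sample", which is the cheaper alternative mentioned above.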
The complete procedure of the proposed method is presented in Algorithm 1.
Algorithm 1 Bayesian inference and MAP-based EM algorithm for sDPHMM parameter posterior distribution estimation
Input: Observed data ;
    Prior distribution of mixing weights ;
    Prior distribution of each component ;
    Proposal distribution ;
Output: Estimated mixing parameter ;
    Samples of , where , ;
Define: ;
Define: ;
Define: ;
Define: ;
Initialize parameters and hyperparameters c,
for to do    ▹ iterations: SLSQP; iterations: Bayesian inference
    for to N do    ▹ E-Step starts
        for to K do
            if then
                ▹ E-Step of MAP EM iterations
            else
                ▹ E-Step of Bayesian inference EM iterations
            end if
        end for
    end for
    for to K do    ▹ M-Step starts
        ▹ Update
    end for
    if then
        for to K do
            using SLSQP    ▹ MAP using SLSQP
        end for
    else    ▹ Bayesian inference
        for to K do
            for to M do
                Propose
                ▹ Acceptance
                Sample
                if then
                    ▹ Accept
                else
                    ▹ Reject
                end if
            end for
        end for
    end if
end for
return ,