*2.3. Expectation-Maximization (EM) Algorithm*

In the original MCLUST method, the EM algorithm is used to estimate the unknown parameters and compute the membership probabilities. In this subsection, we will first review the EM algorithm under the general MCLUST framework, and then highlight the differences in implementation between the MCLUST method and the MCLUST-ME method.

**Complete data log likelihood** Given observations $\mathbf{y}_1, \ldots, \mathbf{y}_n$, suppose that each $\mathbf{y}_i$ is associated with one of $G$ states. Then there exist unobserved indicator vectors $\mathbf{z}_i = (z_{i1}, \ldots, z_{iG})$, where $\mathbf{z}_i \overset{\text{iid}}{\sim} \text{Mult}_G(1, \boldsymbol{\tau})$ with $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_G)$. The complete data then consist of $\mathbf{x}_i = (\mathbf{y}_i, \mathbf{z}_i)$. Assuming that the conditional probability density of $\mathbf{y}_i$ given $\mathbf{z}_i$ is $\prod_{k=1}^{G} f_k(\mathbf{y}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \boldsymbol{\Lambda}_i)^{z_{ik}}$, the complete data log likelihood can be derived as follows:

$$\begin{split} l_C &= \log \prod_{i=1}^{n} f(\mathbf{y}_i, \mathbf{z}_i) \\ &= \log \prod_{i=1}^{n} f(\mathbf{y}_i \mid \mathbf{z}_i) f(\mathbf{z}_i) \\ &= \log \prod_{i=1}^{n} \left[ \prod_{k=1}^{G} f_k(\mathbf{y}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \boldsymbol{\Lambda}_i)^{z_{ik}} \right] \left[ \prod_{k=1}^{G} \tau_k^{z_{ik}} \right] \\ &= \log \prod_{i=1}^{n} \prod_{k=1}^{G} \left[ \tau_k f_k(\mathbf{y}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \boldsymbol{\Lambda}_i) \right]^{z_{ik}} \\ &= \sum_{i=1}^{n} \sum_{k=1}^{G} z_{ik} \log \left[ \tau_k f_k(\mathbf{y}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k, \boldsymbol{\Lambda}_i) \right]. \end{split} \tag{11}$$
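As a concrete illustration, the final expression in (11) can be evaluated directly. The following is a minimal R sketch, assuming the mvtnorm package and the MCLUST-ME component density (a normal with covariance $\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i$); the argument names (`Y`, `Z`, `tau`, `mu`, `Sigma`, `Lambda`) are hypothetical. Setting every $\boldsymbol{\Lambda}_i$ to the zero matrix recovers the MCLUST case.

```r
## Evaluate the complete data log likelihood (11).
## Y: n x p data matrix; Z: n x G 0/1 indicator matrix; tau: length-G
## proportions; mu, Sigma: lists of G mean vectors and covariance matrices;
## Lambda: list of n measurement-error covariance matrices.
library(mvtnorm)

complete_loglik <- function(Y, Z, tau, mu, Sigma, Lambda) {
  ll <- 0
  for (i in seq_len(nrow(Y))) {
    for (k in seq_along(tau)) {
      if (Z[i, k] == 1) {  # only the indicated component contributes for y_i
        ll <- ll + log(tau[k]) +
          dmvnorm(Y[i, ], mean = mu[[k]],
                  sigma = Sigma[[k]] + Lambda[[i]], log = TRUE)
      }
    }
  }
  ll
}
```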

**EM iterations** The EM algorithm alternates between an *M step* and an *E step*. In the *M step*, the complete data log likelihood (11) is maximized with respect to the parameters $(\tau_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, $k = 1, \ldots, G$, with each unobserved $z_{ik}$ replaced by its current estimate $\hat{z}_{ik}$. In the *E step*, the membership probabilities are updated with their conditional expectations given the data and the current parameter estimates,

$$\hat{z}_{ik} = \frac{\hat{\tau}_k f_k(\mathbf{y}_i; \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k, \boldsymbol{\Lambda}_i)}{\sum_{j=1}^{G} \hat{\tau}_j f_j(\mathbf{y}_i; \hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}_j, \boldsymbol{\Lambda}_i)}. \tag{12}$$
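For concreteness, a single E step under the MCLUST-ME model might look as follows in R. This is a minimal sketch assuming the mvtnorm package, with hypothetical names for the data matrix `Y`, the list of error covariances `Lambda`, and the current estimates `tau`, `mu`, and `Sigma`.

```r
## One E step (12): returns the n x G membership probability matrix zhat.
library(mvtnorm)

e_step <- function(Y, Lambda, tau, mu, Sigma) {
  n <- nrow(Y); G <- length(tau)
  dens <- matrix(0, n, G)
  for (i in seq_len(n)) {
    for (k in seq_len(G)) {
      ## MCLUST-ME component density: normal with covariance Sigma_k + Lambda_i
      dens[i, k] <- tau[k] * dmvnorm(Y[i, ], mean = mu[[k]],
                                     sigma = Sigma[[k]] + Lambda[[i]])
    }
  }
  dens / rowSums(dens)  # normalize each row to obtain zhat
}
```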

The two steps alternate until the increment in the observed data log likelihood $l_O$ is small enough. Upon convergence, a membership probability matrix is produced and each observation is assigned to the most probable cluster, that is,

$$\text{membership of } \mathbf{y}_i = \operatorname{argmax}_k \{\hat{z}_{ik}\}, \tag{13}$$

and the classification uncertainty for *yi* is defined as

$$1 - \max_k \{\hat{z}_{ik}\}. \tag{14}$$

In two-group clustering, the classification uncertainty cannot exceed 0.5, since the two membership probabilities sum to 1 and the larger of them is therefore always at least 0.5.
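In R, with the converged membership probability matrix `zhat` (hypothetical name, one row per observation and one column per cluster), (13) and (14) amount to:

```r
## Most probable cluster (13) and classification uncertainty (14) per row.
membership  <- apply(zhat, 1, which.max)
uncertainty <- 1 - apply(zhat, 1, max)   # bounded by 0.5 when G = 2
```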

For MCLUST, the component density $f_k$ is the one defined in (2); for MCLUST-ME, $f_k$ is replaced by $g_k$ defined in (8).

**M-step implementation details** For likelihood maximization in the *M step*, a closed-form solution always exists for $\hat{\tau}_k$, $k = 1, \ldots, G$ (see [12] for more details):

$$\hat{\tau}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{ik}. \tag{15}$$
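With the membership matrix `zhat` from the E-step sketch above, this update is a column mean:

```r
## Mixing proportion update (15): average membership probability per column.
tau_hat <- colMeans(zhat)
```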

We can derive the estimation equations for $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ ($k = 1, \ldots, G$) by taking the partial derivatives of $l_C$ with respect to $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ and setting the derivatives to $\mathbf{0}$. For MCLUST-ME (see [13] for a summary of useful matrix calculus formulas, in particular (11.7) and (11.8)):

$$\frac{\partial l_C}{\partial \boldsymbol{\mu}_k} = \sum_{i=1}^{n} z_{ik} (\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_k) = \mathbf{0} \tag{16}$$

and

$$\frac{\partial l_C}{\partial \boldsymbol{\Sigma}_k} = \frac{1}{2} \sum_{i=1}^{n} z_{ik} (\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_k)(\mathbf{y}_i - \boldsymbol{\mu}_k)^T (\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1} - \frac{1}{2} \sum_{i=1}^{n} z_{ik} (\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1} = \mathbf{0}. \tag{17}$$

For the estimation equations under the MCLUST model, one sets all the $\boldsymbol{\Lambda}_i$'s to $\mathbf{0}$ in the above two equations. Note that under MCLUST, if there is no constraint on $\boldsymbol{\Sigma}_k$, there are closed-form solutions for $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$:

$$\hat{\boldsymbol{\mu}}_k = \frac{\sum_{i=1}^{n} z_{ik} \mathbf{y}_i}{\sum_{i=1}^{n} z_{ik}} \tag{18}$$

and

$$\hat{\boldsymbol{\Sigma}}_k = \frac{\sum_{i=1}^{n} z_{ik} (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{y}_i - \hat{\boldsymbol{\mu}}_k)^T}{\sum_{i=1}^{n} z_{ik}}. \tag{19}$$
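These closed-form updates are simple weighted averages. Here is a minimal R sketch for a single component $k$, with `zk` denoting column $k$ of the membership matrix (hypothetical names):

```r
## Closed-form MCLUST updates (18)-(19) for one component, assuming an
## unconstrained Sigma_k. Y: n x p data matrix; zk: length-n weight vector.
m_step_mclust <- function(Y, zk) {
  nk      <- sum(zk)
  mu_k    <- colSums(zk * Y) / nk          # weighted mean, eq. (18)
  R       <- sweep(Y, 2, mu_k)             # center the data at mu_k
  Sigma_k <- crossprod(R * sqrt(zk)) / nk  # weighted covariance, eq. (19)
  list(mu = mu_k, Sigma = Sigma_k)
}
```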

For MCLUST-ME, each $\mathbf{y}_i$ corresponds to a different $\boldsymbol{\Lambda}_i$, so the weight matrices $(\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1}$ vary with $i$ and the weighted sums no longer simplify; in general, there is no closed-form solution for $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. In our implementation of the MCLUST-ME M step, we solve (16) for $\boldsymbol{\mu}_k$,

$$\hat{\boldsymbol{\mu}}_k = \left[ \sum_{i=1}^{n} z_{ik} (\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1} \right]^{-1} \sum_{i=1}^{n} z_{ik} (\boldsymbol{\Sigma}_k + \boldsymbol{\Lambda}_i)^{-1} \mathbf{y}_i, \tag{20}$$

and plug it into (17), and then use the limited-memory BFGS algorithm, a quasi-Newton method available through the R function optim [14], to obtain an optimal solution for $\boldsymbol{\Sigma}_k$ numerically. We then obtain $\hat{\boldsymbol{\mu}}_k$ by substituting the resulting $\hat{\boldsymbol{\Sigma}}_k$ into (20).
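The profile structure of this step can be sketched in R as follows. This is a minimal illustration, not the paper's actual implementation: it assumes the mvtnorm package, parameterizes $\boldsymbol{\Sigma}_k$ through its Cholesky factor so that it stays positive semidefinite during optimization, and uses hypothetical argument names.

```r
## Sketch of the MCLUST-ME M step for one component: profile out mu_k via
## (20), then optimize the weighted log likelihood over Sigma_k with L-BFGS.
library(mvtnorm)

mu_from_sigma <- function(Y, zk, Sigma_k, Lambda) {  # eq. (20)
  p <- ncol(Y); A <- matrix(0, p, p); b <- numeric(p)
  for (i in seq_len(nrow(Y))) {
    W <- zk[i] * solve(Sigma_k + Lambda[[i]])
    A <- A + W
    b <- b + W %*% Y[i, ]
  }
  drop(solve(A, b))
}

m_step_mclustme <- function(Y, zk, Lambda, Sigma_init) {
  p <- ncol(Y)
  to_sigma <- function(theta) {  # rebuild Sigma_k from its Cholesky factor
    U <- matrix(0, p, p)
    U[upper.tri(U, diag = TRUE)] <- theta
    crossprod(U)                 # t(U) %*% U is positive semidefinite
  }
  negQ <- function(theta) {      # negated weighted log likelihood for component k
    Sigma_k <- to_sigma(theta)
    mu_k <- mu_from_sigma(Y, zk, Sigma_k, Lambda)
    ll <- 0
    for (i in seq_len(nrow(Y)))
      ll <- ll + zk[i] * dmvnorm(Y[i, ], mu_k, Sigma_k + Lambda[[i]], log = TRUE)
    -ll
  }
  theta0 <- chol(Sigma_init)[upper.tri(Sigma_init, diag = TRUE)]
  opt <- optim(theta0, negQ, method = "L-BFGS-B")
  Sigma_k <- to_sigma(opt$par)
  list(mu = mu_from_sigma(Y, zk, Sigma_k, Lambda), Sigma = Sigma_k)
}
```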

The complexity of the EM algorithm for MCLUST-ME increases with the number of clusters, the number of parameters (which is determined by the dimension of the data), and the number of observations. It is much slower than the original MCLUST algorithm because the M step requires a numerical optimization routine to find the maximum likelihood estimates (MLEs) of the $\boldsymbol{\mu}_k$'s and $\boldsymbol{\Sigma}_k$'s. (See Conclusion and Discussion for a brief summary of the running time of MCLUST-ME on the real data example.)
