#### 3.1.3. The Adjusted Rand Index

In this simulation study, because the true memberships of the observations are available, we can externally evaluate the performance of both clustering methods by calculating the Rand index [20]. Given *n* observations and two partitions *R* and *Q* of the data, their agreement can be summarized by classifying each of the $\binom{n}{2}$ pairs of observations into a contingency table (Table 2).

**Table 2.** 2 × 2 contingency table for comparing partitions *R* and *Q*. Each pair of observations falls into exactly one cell.

| | Same cluster in *Q* | Different clusters in *Q* |
|---|---|---|
| Same cluster in *R* | *a* | *b* |
| Different clusters in *R* | *c* | *d* |

The Rand index (RI) is defined as

$$\text{RI} = \frac{a+d}{a+b+c+d}.$$

The Rand index has some known pitfalls: for two random partitions, the expected value of RI is not zero, and RI tends to one as the number of clusters increases [21]. To overcome these problems, Hubert and Arabie [22] proposed the adjusted Rand index (ARI), whose expected value is zero for two random partitions. The ARI is defined as

$$\text{ARI} = \frac{\text{RI} - \text{Expected}(\text{RI})}{1 - \text{Expected}(\text{RI})} = \frac{\binom{n}{2}(a+d) - \left[ (a+b)(a+c) + (c+d)(b+d) \right]}{\binom{n}{2}^2 - \left[ (a+b)(a+c) + (c+d)(b+d) \right]}.$$

The ARI takes values between −1 and 1: an ARI of 1 indicates perfect agreement between the two partitions (i.e., RI = 1), and an ARI of 0 indicates agreement no better than expected by chance (i.e., RI = Expected(RI)).
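The pair counts *a*, *b*, *c*, *d* and both indices can be computed directly from two label vectors. The following is a minimal stdlib sketch (the function names are our own, not from any package), with the ARI computed term-by-term from the formula above:

```python
from itertools import combinations
from math import comb

def pair_counts(r, q):
    """Count the pair-agreement terms a, b, c, d for partitions r and q.

    a: pairs placed in the same cluster by both partitions
    b: same cluster in r, different clusters in q
    c: different clusters in r, same cluster in q
    d: pairs placed in different clusters by both partitions
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(r)), 2):
        same_r, same_q = r[i] == r[j], q[i] == q[j]
        if same_r and same_q:
            a += 1
        elif same_r:
            b += 1
        elif same_q:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(r, q):
    """RI = (a + d) / (a + b + c + d)."""
    a, b, c, d = pair_counts(r, q)
    return (a + d) / (a + b + c + d)

def adjusted_rand_index(r, q):
    """ARI = [C(n,2)(a+d) - E] / [C(n,2)^2 - E],
    where E = (a+b)(a+c) + (c+d)(b+d)."""
    a, b, c, d = pair_counts(r, q)
    n2 = comb(len(r), 2)
    expected = (a + b) * (a + c) + (c + d) * (b + d)
    return (n2 * (a + d) - expected) / (n2 ** 2 - expected)
```

For example, the partitions `[0, 0, 1, 1]` and `[0, 1, 0, 1]` disagree on every pair that either one groups together, giving RI = 1/3 and a negative ARI; identical partitions give RI = ARI = 1.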

Permutation tests can be used to assess whether the observed ARI is significantly greater than zero [23]. Keeping the number of clusters and the cluster sizes the same as in the original data, a large number of pairs of partitions are generated at random, and the ARI is computed for each generated pair. A randomization *p*-value can then be calculated from the distribution of the generated ARIs. Similarly, permutation *p*-values can be obtained for testing whether paired ARI values originating from two clustering methods are equal.
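One way to implement this randomization is to shuffle each label vector independently, which automatically preserves its number of clusters and cluster sizes. A self-contained stdlib sketch (function names are illustrative; the internal `ari` helper follows the pair-count formula above):

```python
import random
from itertools import combinations
from math import comb

def ari(r, q):
    """Adjusted Rand index computed from the pair counts a, b, c, d."""
    a = b = c = d = 0
    for i, j in combinations(range(len(r)), 2):
        same_r, same_q = r[i] == r[j], q[i] == q[j]
        if same_r and same_q:
            a += 1
        elif same_r:
            b += 1
        elif same_q:
            c += 1
        else:
            d += 1
    n2 = comb(len(r), 2)
    expected = (a + b) * (a + c) + (c + d) * (b + d)
    return (n2 * (a + d) - expected) / (n2 ** 2 - expected)

def ari_permutation_pvalue(r, q, n_perm=999, seed=0):
    """Randomization p-value for H1: ARI(r, q) > 0.

    Each permutation shuffles the two label vectors independently,
    preserving the number of clusters and the cluster sizes.
    """
    rng = random.Random(seed)
    observed = ari(r, q)
    exceed = sum(
        ari(rng.sample(r, len(r)), rng.sample(q, len(q))) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)
```

With two identical, well-separated partitions the observed ARI is 1 and random relabelings rarely reach it, so the *p*-value is small.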

#### 3.1.4. Simulation 1 Results

**Decision boundary** We first visualize the clustering results from both methods, along with the theoretical decision boundaries stated in Section 2.6. Figure 1 shows the groupings produced by both methods for the same data set, generated with *η* = 0.5 and random seed 7.

For MCLUST-ME, we identify two distinct decision boundaries: the dotted curve separates points measured *with* errors (solid) into two groups, whereas the dashed curve separates points *without* errors (empty). For MCLUST, a single boundary separates all points, regardless of their associated errors. This confirms our findings in Section 2.6.

For this particular simulation, we make two interesting observations. First, the two MCLUST-ME boundaries are relatively far apart. Second, none of the three boundaries intersect one another. As mentioned in Section 2.6.1, the shape and position of these boundaries depend entirely on the corresponding MLEs, which, in turn, are the end results of an iterative procedure (the EM algorithm). Additional plots similar to Figure 1, for other values of *η* and other random seeds, are available in [12].

**Figure 1.** Clustering result of the sample generated with random seed = 7 and *η* = 0.5. *Both plots*: empty points represent observations with no measurement errors; solid points represent those generated with error covariance Λ. Clusters are identified by different shapes. *Left*: clustering result produced by MCLUST-ME. Dashed line represents classification boundary for error-free observations; dotted line represents boundary for those with error covariance matrix Λ; solid line represents boundary produced by MCLUST. *Right*: clustering result produced by MCLUST. Solid line is the same as in the left plot.

**Classification uncertainty** In Figure 2, we visualize the classification uncertainty of each point produced by both methods. Observe that for MCLUST, highly uncertain points are found close to the decision boundary, regardless of error. For MCLUST-ME, points with measurement errors (solid) near the outer boundary (dotted) in the overlapping region tend to have high clustering uncertainties. Likewise, error-free points (empty) near the inner boundary (dashed) tend to have high uncertainties. This is consistent with our statement in Section 2.6.2.

**Figure 2.** Clustering uncertainty of the sample generated with random seed = 7 and *η* = 0.5. Data points of larger size have a higher clustering uncertainty. All other graph attributes are the same as Figure 1.

**Accuracy** We first evaluate the performance of MCLUST and MCLUST-ME individually using ARI (between true group labels and predicted labels) as their performance measure. Figure 3 shows that for both methods, clustering accuracy tends to decrease as error proportion *η* increases. This is intuitively reasonable, because points associated with errors are more easily misclassified due to their high variability, and a larger proportion of such points means a lower overall accuracy.

**Figure 3.** Adjusted Rand indices for MCLUST-ME and MCLUST. Five different proportions of erroneous observations (*η*) were considered. Magenta: MCLUST-ME; Dark Cyan: MCLUST.

Next, we compare the performances of MCLUST and MCLUST-ME by examining pairwise differences in ARI. Figure 4 shows that, on average, MCLUST-ME has a slight advantage in accuracy; this advantage appears greatest when *η* = 0.5 and shrinks as *η* approaches either zero or one. In the latter situations, the error covariances tend toward a constant value across all points (all equal to 36*I*<sub>2</sub> as *η* → 1, or to 0 as *η* → 0), so MCLUST-ME behaves more and more like MCLUST, diminishing its advantage in accuracy.

**Figure 4.** Pairwise difference in adjusted Rand indices between MCLUST-ME and MCLUST. Five different proportions of erroneous observations were considered.

Using a permutation test of the hypotheses *H*<sub>0</sub>: ARI<sub>MCLUST-ME</sub> = ARI<sub>MCLUST</sub> vs. *H*<sub>1</sub>: ARI<sub>MCLUST-ME</sub> > ARI<sub>MCLUST</sub>, the *p*-values for the five cases are shown in Table 3. With the exception of *η* = 0.1, MCLUST-ME produced a significantly higher ARI than MCLUST.
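A paired comparison of this kind is commonly carried out as a sign-flip permutation test on the paired ARI differences: under the null hypothesis the sign of each difference is exchangeable. A minimal stdlib sketch under that assumption (the function name is our own):

```python
import random

def paired_permutation_pvalue(ari_a, ari_b, n_perm=999, seed=0):
    """One-sided paired permutation (sign-flip) test.

    H0: the paired ARI values of methods A and B are equal.
    H1: method A's ARI exceeds method B's on average.
    Under H0 the sign of each paired difference is exchangeable,
    so we randomly flip signs and recompute the mean difference.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(ari_a, ari_b)]
    observed = sum(diffs) / len(diffs)
    exceed = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)
```

If method A's ARI is consistently higher across the simulated data sets, nearly all sign-flipped mean differences fall below the observed one and the *p*-value is small; if the paired values are identical, the *p*-value is 1.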



Taking a closer look at the pairwise comparison when *η* = 0.5, Figure 5 shows that when MCLUST's accuracy is low, MCLUST-ME outperforms MCLUST most of the time, and when MCLUST's accuracy is relatively high, the two methods are less distinguishable on average.

**Figure 5.** Pairwise difference in accuracy relative to MCLUST accuracy. *X-axis*: MCLUST ARI; *Y-axis*: Pairwise difference between MCLUST-ME and MCLUST ARI values.

## *3.2. Simulation 2: Clustering Uncertainties and Magnitudes of Error Covariances*

In this simulation, we focus on how clustering uncertainties differ between MCLUST-ME and MCLUST: in particular, we want to see how the magnitudes of the error covariances affect the uncertainty estimates. For this purpose, we let the magnitudes of the error covariances vary over a wide range.
