Article

Clustering Mixed-Type Data via Dirichlet Process Mixture Model with Cluster-Specific Covariance Matrices

by Nurul Afiqah Burhanuddin 1,2,*, Kamarulzaman Ibrahim 1, Hani Syahida Zulkafli 3 and Norwati Mustapha 4

1 Department of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
2 Institute for Mathematical Research, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
3 Department of Mathematics and Statistics, Faculty of Science, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
4 Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(6), 712; https://doi.org/10.3390/sym16060712
Submission received: 20 January 2024 / Revised: 8 April 2024 / Accepted: 12 April 2024 / Published: 8 June 2024
(This article belongs to the Section Mathematics)

Abstract

Many studies have shown successful applications of the Dirichlet process mixture model (DPMM) for clustering continuous data. Beyond continuous data, in practice, one can expect to see different data types, including ordinal and nominal data. Existing DPMMs for clustering mixed-type data assume a strict covariance matrix structure, resulting in an overfitted model. This article explores a DPMM for mixed-type data that allows the covariance matrix to differ from one cluster to another. We assume an underlying latent variable framework for the ordinal and nominal data, which is then modeled jointly with the continuous data. The identifiability issue concerning the covariance matrix poses computational challenges, thus requiring a nonstandard inferential algorithm. The applicability and flexibility of the proposed model are illustrated through simulation examples and real data applications.

1. Introduction

Cluster analysis aims to partition a dataset into disjoint groups. Various clustering algorithms have been proposed, yet most of them are designed to work with only a single type of data at a time. For instance, in the context of model-based clustering, the multivariate normal distribution is usually employed as the component distribution of a mixture model for clustering continuous data [1,2,3]. However, datasets containing different types of variables frequently arise in practice. In the current literature, model-based clustering of mixed-type data remains less explored [4].
The absence of a standard distribution for modeling the different data types simultaneously makes clustering mixed-type data challenging. A straightforward model-based approach to clustering mixed-type data is to assume local independence between the different types of data, so that each component of the mixture model is a product of the individual distributions of each data type, that is, a normal distribution for continuous data and a multinomial distribution for nominal data [5]. Nevertheless, this approach usually overestimates the number of clusters [6]. The location mixture model [7,8,9] is an alternative approach that accounts for the dependencies between different data types. These dependencies are captured by conditioning the normal distribution for the continuous data on the multinomial distribution for the nominal data. More recently, the location mixture model was revisited in [10] to accommodate missing data.
Another approach to dealing with mixed-type data is to use latent variables to underlie the non-continuous data [11]. In this approach, each ordinal variable is assumed to arise from thresholding a latent normal variable. Still, this approach is only feasible for up to two ordinal variables due to its computational complexity. The same approach has been used in [12] to cluster mixed continuous and binary data. A more recent attempt at leveraging the latent variable approach for clustering mixed-type data is reported in [13]. Their model can handle three data types, continuous, ordinal, and nominal, but allows only a limited covariance structure.
Within the Bayesian literature, Murray and Reiter [14] integrate normal mixtures with multinomial mixtures through a prior tensor factorization in an attempt to capture the dependence between continuous and nominal data. Then, DeYoreo et al. [15] extend this approach by focusing on only certain variables to improve model fit. The use of latent variables in the Dirichlet process mixture model (DPMM) for clustering continuous and ordinal data has been proposed in [16]. In a similar line of research, the latent variable approach is also adopted in [17], but with the addition of a nominal data type. However, the model in [17] imposes a common covariance assumption with a complex prior update.
This article presents a model for the joint clustering of continuous, ordinal, and nominal data using the DPMM. Our model utilizes a latent variable approach similar to that in [17] to handle ordinal and nominal data, yet it is built upon a different model construction. The significant differences in our proposed formulation concern the Dirichlet process (DP) representation and the prior specification on the covariance matrix. Compared to [17], our prior specification is more straightforward to implement, with fewer prior hyperparameters to be tuned. In addition, our model offers a more flexible assumption, allowing for a cluster-specific covariance structure. The rest of this article is organized as follows. In Section 2, we briefly review the DPMM with multivariate normal components, which forms the foundation of our model. The formulation of our proposed model, along with the algorithm involved, is introduced in Section 3. Then, the simulation examples and real data applications in Section 4 illustrate how the covariance assumption affects the clustering performance. Finally, the article concludes with Section 5.

2. Dirichlet Process Multivariate Normal Mixture Model

Let $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$ be a continuous dataset of dimension $N \times Q$, where $N$ denotes the sample size and $Q$ the number of variables. We assume that $\mathbf{X}$ arises from a Dirichlet process mixture model (DPMM) with multivariate normal components indexed by the mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ such that
$$p(\mathbf{x}_i) = \int N_Q(\mathbf{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, dG(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad i = 1, \ldots, N, \qquad (1)$$
where $G$ is a mixing measure. Then, $G$ is assumed to follow a Dirichlet process (DP) with concentration parameter $\alpha$ and base distribution $G_0$, or, in short, $G \sim \mathrm{DP}(\alpha, G_0)$. The DP draws distributions around $G_0$, which is usually set as a normal-inverse Wishart distribution, the natural conjugate prior of the data likelihood. Since samples drawn from a DP are almost surely discrete, this property causes ties among the observed component parameters $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Hence, data instances indexed by the same $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ are considered to belong to one cluster. Note that the parameter $\alpha$ controls the number of clusters produced. If $\alpha \to 0$, the DPMM reduces to a single-component mixture model in which the samples are i.i.d. from a fully Bayesian parametric model with $N_Q(\mathbf{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim G_0$. On the contrary, as $\alpha \to \infty$, the model forms a singleton cluster for each and every data instance, and we have $p(\mathbf{x}_i) = \int N_Q(\mathbf{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\, dG_0(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. From this, it follows that the DPMM can be presented equivalently in hierarchical form as
$$\begin{aligned} \mathbf{x}_i \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i &\sim p(\mathbf{x}_i \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \quad i = 1, \ldots, N,\\ (\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \mid G &\sim G, \quad i = 1, \ldots, N,\\ G &\sim \mathrm{DP}(\alpha, G_0), \end{aligned} \qquad (2)$$
with $G_0 \equiv NIW_Q(\mathbf{b}_0, \kappa_0, \mathbf{S}_0, \nu_0)$. Here, $NIW$ stands for the normal-inverse Wishart distribution, specified by four parameters: $\mathbf{b}_0$, $\kappa_0$, $\mathbf{S}_0$, and $\nu_0$. The hierarchical structure in (2) highlights the grouping induced by the DPMM through ties of the $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$'s.
We use the stick-breaking construction of Sethuraman [18] for the DP. This construction does not rely on being able to integrate out the infinite-dimensional $G$ and thus allows the direct specification of the prior. The model in (2) can then be represented in stick-breaking form as
$$\begin{aligned} \mathbf{x}_i \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \mathbf{c} &\sim N_Q(\boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}_{c_i}), \quad i = 1, \ldots, N,\\ c_i \mid \mathbf{W} &\sim \sum_{k=1}^{\infty} w_k \delta_k(\cdot), \quad i = 1, \ldots, N,\\ (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) &\sim NIW_Q(\mathbf{b}_0, \kappa_0, \mathbf{S}_0, \nu_0), \quad k = 1, \ldots, \infty, \end{aligned} \qquad (3)$$
where $\delta_k$ denotes a Dirac measure with a point mass at $k$. The vector of cluster assignment variables $\mathbf{c} = (c_1, \ldots, c_N)$ is introduced to index each $\mathbf{x}_i$ to $(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ such that $p(\mathbf{x}_i \mid c_i = k) = N_Q(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Following [18], the weights $w_k$ are defined such that $w_1 = \pi_1$ and $w_k = \pi_k \prod_{\ell=1}^{k-1}(1 - \pi_\ell)$ for $k = 2, 3, \ldots$, with $\pi_k \sim \mathrm{Beta}(1, \alpha)$. The representation in (3) shows that the DPMM can be viewed as an infinite extension of the standard multivariate normal mixture model. Even though the mixture model has infinitely many clusters, it should be noted that, for any sample of size $N$, at most $N$ clusters are realized. Therefore, we do not need to specify the number of clusters $K$, as it will be part of the output of the inferential algorithm. Compared to other clustering algorithms that require $K$ to be set in advance [19], the DPMM offers an advantage for cases where $K$ is not well defined.
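To make the weight construction in (3) concrete, the following is a minimal R sketch of the stick-breaking weights; the concentration parameter `alpha` and the truncation level `K_max` are illustrative assumptions only, and the actual sampler in Section 3.2 avoids a fixed truncation.

```r
# Stick-breaking weights of a DP(alpha, G0), truncated at K_max purely for illustration.
set.seed(1)
alpha <- 1                                       # concentration parameter (assumed fixed here)
K_max <- 25                                      # illustrative truncation level

pi_k <- rbeta(K_max, 1, alpha)                   # pi_k ~ Beta(1, alpha)
w    <- pi_k * cumprod(c(1, 1 - pi_k[-K_max]))   # w_k = pi_k * prod_{l < k} (1 - pi_l)

sum(w)                                           # approaches 1 as K_max grows
cl <- sample(K_max, size = 20, replace = TRUE, prob = w)  # ties among the labels induce clustering
table(cl)
```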

3. Methodology

3.1. Model Formulation

Let $\mathbf{X}$ be the input dataset containing three different types of data: continuous, ordinal, and nominal. We divide each data instance into three parts, $\mathbf{x}_i = (\mathbf{x}_i^{c}, \mathbf{x}_i^{o}, \mathbf{x}_i^{n})$ for $i = 1, \ldots, N$, where each part corresponds to one data type.
Suppose that, for the continuous part, there are $M_c$ continuous variables such that $\mathbf{x}_i^{c} = (x_{i1}^{c}, \ldots, x_{iM_c}^{c})$. We employ the standard assumption $\mathbf{x}_i^{c} \sim N_{M_c}(\boldsymbol{\mu}_k^{c}, \boldsymbol{\Sigma}_k^{c})$.
For ordinal data, we adopt the multivariate probit regression framework [20,21], which assumes each ordinal variable to be derived from an underlying normal latent variable. Let $j = 1, \ldots, M_o$ index the ordinal variables such that $\mathbf{x}_i^{o} = (x_{i1}^{o}, \ldots, x_{iM_o}^{o})$, each having $L_j^{o}$ levels with $x_{ij}^{o} \in \{0, \ldots, L_j^{o} - 1\}$. Then, we introduce latent variables $\mathbf{z}_i^{o} = (z_{i1}^{o}, \ldots, z_{iM_o}^{o})$ for $i = 1, \ldots, N$ such that
$$x_{ij}^{o} = \ell \quad \text{iff} \quad \gamma_{j,\ell} < z_{ij}^{o} \le \gamma_{j,\ell+1}, \qquad (4)$$
where $\mathbf{z}_i^{o} \sim N_{M_o}(\boldsymbol{\mu}_k^{o}, \boldsymbol{\Sigma}_k^{o})$ and the $\gamma_{j,\ell}$'s are the sets of cut-offs for $j = 1, \ldots, M_o$. If we treat the cut-offs as unknown parameters, the posterior inference becomes challenging due to the extra steps required to update the $\gamma_{j,\ell}$'s at each iteration. To avoid these complex updating steps, the $\gamma_{j,\ell}$'s can be treated as constants, since the DPMM is robust with respect to the choice of values for the $\gamma_{j,\ell}$'s [22]. Hence, for computational practicality, the cut-offs are set such that $\gamma_{j,\ell} = \Phi^{-1}(\ell/L_j^{o})$ for $\ell = 0, \ldots, L_j^{o}$, where $\Phi^{-1}(\cdot)$ is the inverse of the standard univariate normal distribution function. Figure 1 shows the histogram of the generated underlying latent data of ordinal variables with four levels. From the figure, we can see that ordinal data are essentially a discretization of continuous data. Here, plots (a) and (b) illustrate two ordinal variables with different proportions of instances across levels. Even for the same set of cut-offs, different proportions across levels induce different locations and shapes of the underlying data. These differences are what contribute to cluster discrimination. Another issue that requires attention is the identifiability of the component parameters. If any of the $x_{ij}^{o}$'s is binary, that is, when $L_j^{o} = 2$, the covariance matrix is not fully identifiable from the data likelihood. To address this, we restrict the diagonal elements of the covariance matrix that correspond to the binary $x_{ij}^{o}$'s to one, as demonstrated in [23].
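As a small illustration of the fixed cut-offs $\gamma_{j,\ell} = \Phi^{-1}(\ell/L_j^{o})$ and the thresholding in (4), the R sketch below discretizes latent draws into an ordinal variable with four levels; the standard normal latent scale is assumed purely for illustration.

```r
# Cut-offs gamma_{j,l} = Phi^{-1}(l / L) for an ordinal variable with L levels,
# then discretization of latent normal draws into levels 0, ..., L - 1.
L_ord <- 4
cuts  <- qnorm((0:L_ord) / L_ord)     # -Inf, Phi^{-1}(1/4), Phi^{-1}(1/2), Phi^{-1}(3/4), Inf

z_lat <- rnorm(1000)                                     # latent draws (illustrative)
x_ord <- cut(z_lat, breaks = cuts, labels = FALSE) - 1   # x = l iff gamma_l < z <= gamma_{l+1}
table(x_ord)                          # roughly equal level proportions under a standard normal
```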
To model nominal data, the multinomial probit regression framework [24,25] is used. Let $j = 1, \ldots, M_n$ index the nominal variables such that $\mathbf{x}_i^{n} = (x_{i1}^{n}, \ldots, x_{iM_n}^{n})$, each having $L_j^{n}$ categories with $x_{ij}^{n} \in \{0, \ldots, L_j^{n} - 1\}$. For each nominal instance $x_{ij}^{n}$, $i = 1, \ldots, N$, $j = 1, \ldots, M_n$, we introduce an $(L_j^{n} - 1)$-dimensional continuous latent vector $\mathbf{z}_{ij}^{n} = (z_{ij1}^{n}, \ldots, z_{ij(L_j^{n}-1)}^{n})$, defined as follows:
$$x_{ij}^{n} = \begin{cases} 0 & \text{iff } \max\{z_{ij\ell}^{n},\ \ell = 1, \ldots, L_j^{n} - 1\} < 0,\\ \ell & \text{iff } \max\{z_{ij\ell'}^{n},\ \ell' = 1, \ldots, L_j^{n} - 1\} = z_{ij\ell}^{n} > 0, \end{cases} \qquad (5)$$
where $x_{ij}^{n} = 0$ is the most frequent category. Then, we concatenate the latent vectors $\mathbf{z}_{ij}^{n}$ for $j = 1, \ldots, M_n$, forming a $\sum_{j=1}^{M_n}(L_j^{n} - 1)$-dimensional vector $\mathbf{z}_i^{n} = (\mathbf{z}_{i1}^{n}, \ldots, \mathbf{z}_{iM_n}^{n}) \sim N_{\sum_{j=1}^{M_n}(L_j^{n}-1)}(\boldsymbol{\mu}_k^{n}, \boldsymbol{\Sigma}_k^{n})$. The scatter plot in Figure 2 depicts the generated underlying latent data of nominal variables with three categories. Here, plots (a) and (b) compare two nominal variables with different proportions of instances across categories. As we can see from the figure, even though the underlying data of the first category are constrained to take values below zero in both dimensions, the shape of the underlying data can vary due to the spread of the generated data instances from the other categories. This can shift the overall location and shape of the data structure and thus contribute to cluster discrimination.
Further, observe that each $z_{ij}^{n}$ in (5) is equivalent to (4) with $L_j^{o} = 2$. Hence, in this case, the full covariance matrix is also not identifiable. For ease of notation and computation, we treat all binary ordinal variables as nominal. Therefore, we can leave the covariance matrix unrestricted for the ordinal data, but the covariance matrix for the nominal data is restricted to a correlation matrix with unit elements on its diagonal.
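The mapping in (5) from the $(L_j^{n}-1)$-dimensional latent vector to a nominal category can be sketched as below; the helper name and the example latent vectors are hypothetical.

```r
# Probit rule in (5): category 0 if every latent component is negative,
# otherwise the index of the largest (positive) component.
nominal_from_latent <- function(z) {
  if (max(z) < 0) 0L else which.max(z)
}

nominal_from_latent(c(-0.8, -0.1, -1.3))   # all negative -> category 0
nominal_from_latent(c(-0.4, 1.2, 0.3))     # largest component is the 2nd -> category 2
```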
Combining the vector representations of all data types, we arrive at one vector $\mathbf{z}_i$ of dimension $Q = M_c + M_o + \sum_{j=1}^{M_n}(L_j^{n} - 1)$ such that $\mathbf{z}_i = (\mathbf{x}_i^{c}, \mathbf{z}_i^{o}, \mathbf{z}_i^{n}) \sim N_Q(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k^{\#})$ for $i = 1, \ldots, N$, with mean and restricted covariance matrix of the forms
$$\boldsymbol{\mu}_k = (\boldsymbol{\mu}_k^{c}, \boldsymbol{\mu}_k^{o}, \boldsymbol{\mu}_k^{n}), \qquad \boldsymbol{\Sigma}_k^{\#} = \begin{pmatrix} \boldsymbol{\Sigma}_k^{c} & \boldsymbol{\Sigma}_k^{co} & \boldsymbol{\Sigma}_k^{cn}\\ \boldsymbol{\Sigma}_k^{co\,\top} & \boldsymbol{\Sigma}_k^{o} & \boldsymbol{\Sigma}_k^{on}\\ \boldsymbol{\Sigma}_k^{cn\,\top} & \boldsymbol{\Sigma}_k^{on\,\top} & \mathbf{R}_k^{n} \end{pmatrix}, \qquad (6)$$
where $\mathbf{R}_k^{n}$ is a correlation matrix. To complete the specification of our model, we require a prior for the component parameters $(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k^{\#})$. However, owing to the restriction on the component covariance matrix, we cannot use the standard conjugate normal-inverse Wishart prior. Instead, we adopt the following prior:
$$(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k^{\#}) \sim N_Q(\mathbf{b}_0, \mathbf{S}_0) \times \mathrm{PXW}_Q(\nu_0, \boldsymbol{\Omega}_0). \qquad (7)$$
Here, we use the multivariate normal distribution with mean $\mathbf{b}_0$ and covariance $\mathbf{S}_0$ as the prior for the component mean. For the restricted covariance matrix, the PXW distribution [26] with degrees of freedom $\nu_0$ and scale $\boldsymbol{\Omega}_0$ is used as the prior. Originally, the PXW distribution was introduced to sample a correlation matrix $\mathbf{R}_k$ through the decomposition $\boldsymbol{\Sigma}_k = \mathbf{D}_k^{1/2}\mathbf{R}_k\mathbf{D}_k^{1/2}$, where $\boldsymbol{\Sigma}_k$ is a standard unrestricted covariance matrix and $\mathbf{D}_k$ is a diagonal matrix of variances. Thus, multiplying the density of $\boldsymbol{\Sigma}_k$ by the Jacobian of the transformation from $\boldsymbol{\Sigma}_k$ to $(\mathbf{D}_k, \mathbf{R}_k)$, we arrive at the prior density of the PXW distribution,
$$p(\mathbf{D}_k, \mathbf{R}_k) \propto |\boldsymbol{\Sigma}_k|^{(\nu_0 - Q - 1)/2}\operatorname{etr}\!\left(-\tfrac{1}{2}\boldsymbol{\Omega}_0^{-1}\boldsymbol{\Sigma}_k\right) J(\boldsymbol{\Sigma}_k \to \mathbf{D}_k, \mathbf{R}_k), \qquad (8)$$
where $\boldsymbol{\Sigma}_k \sim W_Q(\nu_0, \boldsymbol{\Omega}_0)$ and $\operatorname{etr}(\cdot)$ denotes the exponential of the trace of a matrix. Here, $W_Q$ stands for the Wishart distribution with degrees of freedom $\nu_0$ and scale $\boldsymbol{\Omega}_0$. However, note that, in our case, we are interested in a restricted covariance matrix, not a correlation matrix. So, by setting some of the diagonal elements of $\mathbf{D}_k$ to one, such that $\mathbf{D}_k^{\#} = \mathrm{BD}(\mathbf{D}_k^{\#c}, \mathbf{D}_k^{\#o}, \mathbf{D}_k^{n})$ is a block-diagonal matrix with $\mathbf{D}_k^{\#c} = \mathbf{D}_k^{\#o} = \operatorname{diag}(1, \ldots, 1)$ and $\mathbf{D}_k^{n}$ the diagonal matrix of variances for the nominal part, we obtain $\boldsymbol{\Sigma}_k^{\#}$ in the desired form from the decomposition $\boldsymbol{\Sigma}_k = \mathbf{D}_k^{\#\,1/2}\boldsymbol{\Sigma}_k^{\#}\mathbf{D}_k^{\#\,1/2}$. Thus, using the same idea, the prior density of the PXW distribution for our case is defined as
$$p(\mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#}) \propto |\boldsymbol{\Sigma}_k|^{(\nu_0 - Q - 1)/2}\operatorname{etr}\!\left(-\tfrac{1}{2}\boldsymbol{\Omega}_0^{-1}\boldsymbol{\Sigma}_k\right) J(\boldsymbol{\Sigma}_k \to \mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#}).$$
The PXW distribution is easy to implement as we do not need to check the positive definiteness of the resulting matrix in each iteration as in [17].
Finally, the full hierarchical model can be written as:
$$\begin{aligned} \mathbf{z}_i = (\mathbf{x}_i^{c}, \mathbf{z}_i^{o}, \mathbf{z}_i^{n}) \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}^{\#}, \mathbf{c} &\sim N_Q(\boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}_{c_i}^{\#}), \quad i = 1, \ldots, N,\\ c_i \mid \mathbf{W} &\sim \sum_{k=1}^{\infty} w_k \delta_k(\cdot), \quad i = 1, \ldots, N,\\ (\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k^{\#}) &\sim N_Q(\mathbf{b}_0, \mathbf{S}_0) \times \mathrm{PXW}_Q(\nu_0, \boldsymbol{\Omega}_0), \quad k = 1, \ldots, \infty. \end{aligned} \qquad (9)$$
The model specification is concluded with the setting of the parameters $\alpha$, $\mathbf{b}_0$, $\mathbf{S}_0$, $\nu_0$, and $\boldsymbol{\Omega}_0$. We assign a $\mathrm{Gamma}(2, 4)$ prior to $\alpha$, following [27]. Next, the parameter $\mathbf{b}_0$ is set to be the approximate center of the data. In particular, the part of $\mathbf{b}_0$ that belongs to the continuous data is set to the sample mean, whereas the parts belonging to the ordinal and nominal data are set to zero. The matrix $\mathbf{S}_0$ is set such that $\mathbf{S}_0 = \operatorname{diag}\{(r_1/3)^2, \ldots, (r_Q/3)^2\}$, where $r_j$ denotes the range of the data; the $r_j$'s for the ordinal and nominal parts are set arbitrarily to 6 [28]. Lastly, we set $\nu_0 = Q + 2$ and $\boldsymbol{\Omega}_0 = \mathbf{S}_0^{*}/\nu_0$, with $\mathbf{S}_0^{*}$ being the matrix $\mathbf{S}_0$ in which the entries for the nominal part are replaced by one.
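For concreteness, a sketch of these default settings is given below. The function and argument names are assumptions: `z_cont` stands for the $N \times M_c$ continuous block, and `M_o` and `M_n_lat` for the numbers of ordinal and nominal latent columns, ordered as in (6).

```r
# Default hyperparameters of Section 3.1 (alpha itself receives a Gamma(2, 4) prior).
make_hyperparameters <- function(z_cont, M_o, M_n_lat) {
  Q  <- ncol(z_cont) + M_o + M_n_lat
  b0 <- c(colMeans(z_cont), rep(0, M_o + M_n_lat))        # data centre; zeros for latent parts
  r  <- c(apply(z_cont, 2, function(x) diff(range(x))),   # ranges of the continuous variables
          rep(6, M_o + M_n_lat))                          # ranges fixed at 6 for latent parts
  S0  <- diag((r / 3)^2)
  nu0 <- Q + 2
  d_star <- (r / 3)^2
  d_star[(ncol(z_cont) + M_o + 1):Q] <- 1                 # unit entries for the nominal block
  Omega0 <- diag(d_star) / nu0
  list(b0 = b0, S0 = S0, nu0 = nu0, Omega0 = Omega0)
}
```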

3.2. Inferential Algorithm

We use a Markov chain Monte Carlo (MCMC) method to approximate the posterior distribution for inferential purposes. To avoid the need for a finite truncation of the DP, the slice sampler [29] is adapted to our formulation. The idea behind the slice sampler is to introduce an auxiliary variable that makes the infinite-dimensional $G$ conditionally finite. Let $\mathbf{u} = (u_1, \ldots, u_N)$ be the auxiliary variables such that $0 < u_i < 1$. By augmenting the auxiliary variables into the model in (9), we have the following joint probability density function:
$$p(\mathbf{z}_i, u_i, c_i) = \mathbb{1}(u_i < w_{c_i})\, N_Q(\mathbf{z}_i; \boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}_{c_i}^{\#}).$$
Then, the rest of the sampler follows the framework of the standard slice sampler, where the component parameters $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k^{\#}$, together with the concentration parameter $\alpha$, are updated successively according to the prior specification in Section 3.1. Since the latent variables for the ordinal and nominal portions are conditional on the component parameters, the corresponding portions of the variable $\mathbf{Z}$ also need to be updated. In addition, two label-switching moves proposed in [30] are added as the last step of the sampler to help the MCMC chain mix better. The full MCMC sampler is implemented in Algorithm 1 as follows (a small R sketch of Steps 1–4 is given after the algorithm):
Algorithm 1 MCMC Algorithm
  • The state of the MCMC sampler consists of $\mathbf{w} = (w_1, \ldots, w_K)$, $\mathbf{u} = (u_1, \ldots, u_N)$, $\mathbf{c} = (c_1, \ldots, c_N)$, $\{\boldsymbol{\mu}_k : k \in \{c_1, \ldots, c_N\}\}$, $\{\boldsymbol{\Sigma}_k^{\#} : k \in \{c_1, \ldots, c_N\}\}$, $\alpha$, and $\mathbf{Z} = (\mathbf{z}_1, \ldots, \mathbf{z}_N)$. At each iteration, we execute the following steps:
    •   Step 1: Sample the stick-breaking variables $w_k$ for $k = 1, \ldots, K$ such that $w_1 = \pi_1$ and $w_k = \pi_k \prod_{\ell=1}^{k-1}(1 - \pi_\ell)$, with $\pi_k \sim \mathrm{Beta}\big(1 + N_k, \ \alpha + \sum_{\ell=k+1}^{K} N_\ell\big)$ and $N_k = \sum_{i=1}^{N} \mathbb{1}(c_i = k)$.
    •   Step 2: Update the latent variable $u_i$ for $i = 1, \ldots, N$ by sampling from $(u_i \mid \cdots) \sim U(0, w_{c_i})$.
    •   Step 3: Update the cluster assignment variable $c_i$ for $i = 1, \ldots, N$ with conditional probability $p(c_i = k \mid \cdots) \propto \mathbb{1}(k \in K_i)\, N_Q(\mathbf{z}_i; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k^{\#})$. Here, $K_i = \{k : w_k > u_i\}$ denotes a finite set of $k$'s generated by sampling as many of the $w_k$'s, $k = 1, \ldots, K$, as needed, with $K$ the smallest integer that satisfies $\sum_{k=1}^{K} w_k > 1 - u^{*}$ and $u^{*} = \min\{u_i, i = 1, \ldots, N\}$.
    •   Step 4: Update $\boldsymbol{\mu}_k$ for $k = 1, \ldots, K$ by sampling from $(\boldsymbol{\mu}_k \mid \cdots) \sim N_Q(\mathbf{b}_k, \mathbf{S}_k)$, where $\mathbf{b}_k = \mathbf{S}_k\big(\mathbf{S}_0^{-1}\mathbf{b}_0 + N_k \boldsymbol{\Sigma}_k^{\#\,-1}\bar{\mathbf{z}}_k\big)$, $\mathbf{S}_k = \big(\mathbf{S}_0^{-1} + N_k \boldsymbol{\Sigma}_k^{\#\,-1}\big)^{-1}$, and $\bar{\mathbf{z}}_k$ is the mean of $\mathbf{z}_k = (\mathbf{z}_i : i \text{ such that } c_i = k)$.
    •   Step 5: Update the restricted covariance matrix $\boldsymbol{\Sigma}_k^{\#}$ using the parameter-extended Metropolis–Hastings algorithm (PX-MH) [26]. First, we sample an unrestricted covariance matrix $\boldsymbol{\Sigma}_k^{*}$ from a Wishart distribution, $\boldsymbol{\Sigma}_k^{*} \sim W_Q(\eta, \boldsymbol{\Sigma}_k/\eta)$, where $\eta$ is tuned so that the total acceptance rate across all non-empty components is roughly 23% [31]. Then, the candidate restricted covariance matrix $\boldsymbol{\Sigma}_k^{\#*}$ is obtained from the decomposition $\boldsymbol{\Sigma}_k^{*} = \mathbf{D}_k^{\#*\,1/2}\boldsymbol{\Sigma}_k^{\#*}\mathbf{D}_k^{\#*\,1/2}$. The new $\boldsymbol{\Sigma}_k^{\#*}$ is accepted as $\boldsymbol{\Sigma}_k^{\#}$ with probability
      $$\min\left\{1, \ \frac{p(\mathbf{D}_k^{\#*}, \boldsymbol{\Sigma}_k^{\#*} \mid \cdots)}{p(\mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#} \mid \cdots)}\, \frac{q(\boldsymbol{\Sigma}_k \mid \boldsymbol{\Sigma}_k^{*})}{q(\boldsymbol{\Sigma}_k^{*} \mid \boldsymbol{\Sigma}_k)}\right\}.$$
      Here, the joint posterior $p(\mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#} \mid \cdots)$ is derived as
      $$\begin{aligned} p(\mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#} \mid \cdots) &\propto p(\mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#}) \times \prod_{i : c_i = k} p(\mathbf{z}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k^{\#})\\ &\propto |\boldsymbol{\Sigma}_k|^{(\nu_0 - Q - 1)/2}\operatorname{etr}\!\left(-\tfrac{1}{2}\boldsymbol{\Omega}_0^{-1}\boldsymbol{\Sigma}_k\right) J(\boldsymbol{\Sigma}_k \to \mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#}) \times \prod_{i : c_i = k} |\boldsymbol{\Sigma}_k^{\#}|^{-1/2}\exp\!\left\{-\tfrac{1}{2}(\mathbf{z}_i - \boldsymbol{\mu}_k)^{\top}\boldsymbol{\Sigma}_k^{\#\,-1}(\mathbf{z}_i - \boldsymbol{\mu}_k)\right\}\\ &\propto |\boldsymbol{\Sigma}_k^{\#}|^{\frac{\nu_0 - Q - 1 - N_k}{2}}\, |\mathbf{D}_k^{\#}|^{\frac{\nu_0}{2} - 1}\operatorname{etr}\!\left\{-\tfrac{1}{2}\left(\boldsymbol{\Omega}_0^{-1}\mathbf{D}_k^{\#\,1/2}\boldsymbol{\Sigma}_k^{\#}\mathbf{D}_k^{\#\,1/2} + \boldsymbol{\Sigma}_k^{\#\,-1}\mathbf{S}_{\mu_k}\right)\right\}, \end{aligned}$$
      where $\mathbf{S}_{\mu_k} = \sum_{i : c_i = k}(\mathbf{z}_i - \boldsymbol{\mu}_k)(\mathbf{z}_i - \boldsymbol{\mu}_k)^{\top}$ and $J(\boldsymbol{\Sigma}_k \to \mathbf{D}_k^{\#}, \boldsymbol{\Sigma}_k^{\#}) = |\mathbf{D}_k^{\#}|^{(Q-1)/2}$. The proposal density is defined as
      $$q(\boldsymbol{\Sigma}_k^{*} \mid \boldsymbol{\Sigma}_k) = W_Q(\boldsymbol{\Sigma}_k^{*}; \eta, \boldsymbol{\Sigma}_k/\eta)\, J(\boldsymbol{\Sigma}_k^{*} \to \mathbf{D}_k^{\#*}, \boldsymbol{\Sigma}_k^{\#*}),$$
      with $J(\boldsymbol{\Sigma}_k^{*} \to \mathbf{D}_k^{\#*}, \boldsymbol{\Sigma}_k^{\#*}) = |\mathbf{D}_k^{\#*}|^{(Q-1)/2}$.
    •   Step 6: Update the concentration parameter $\alpha$ using the move proposed in [27], sampling
      $$(\alpha \mid \cdots) \sim w_\zeta\, \mathrm{Gamma}\big(2 + K, 4 - \log(\zeta)\big) + (1 - w_\zeta)\, \mathrm{Gamma}\big(2 + K - 1, 4 - \log(\zeta)\big),$$
      where $(\zeta \mid \alpha, K) \sim \mathrm{Beta}(\alpha + 1, N)$. Here, the weight $w_\zeta$ is set to satisfy
      $$\frac{w_\zeta}{1 - w_\zeta} = \frac{2 + K - 1}{N\big(4 - \log(\zeta)\big)}.$$
    •   Step 7: Update the latent variable  Z  for ordinal and nominal data as follows:
          For each ordinal instance $x_{ij}^{o}$, $i = 1, \ldots, N$, $j = 1, \ldots, M_o$, sample $(z_{ij'} \mid \cdots) \sim \mathbb{1}\big(\gamma_{j, x_{ij}^{o}} < z_{ij'} \le \gamma_{j, x_{ij}^{o}+1}\big)\, N(z_{ij'}; \tilde{m}_{ij'}, \tilde{v}_{ij'})$, where $j' = M_c + j$, and $\tilde{m}_{ij'}$ and $\tilde{v}_{ij'}$ are the conditional mean and variance of $z_{ij'}$ given $\mathbf{z}_{i(j')}$ such that
      $$\tilde{m}_{ij'} = \mu_{kj'} + \boldsymbol{\Sigma}_{k\,j'(j')}\boldsymbol{\Sigma}_{k\,(j')(j')}^{-1}\big(\mathbf{z}_{i(j')} - \boldsymbol{\mu}_{k(j')}\big), \qquad \tilde{v}_{ij'} = \Sigma_{k\,j'j'} - \boldsymbol{\Sigma}_{k\,j'(j')}\boldsymbol{\Sigma}_{k\,(j')(j')}^{-1}\boldsymbol{\Sigma}_{k\,(j')j'}.$$
      Here, $\boldsymbol{\mu}_{k(j')}$ is the vector $\boldsymbol{\mu}_k$ excluding its $j'$th element, $\boldsymbol{\Sigma}_{k\,j'(j')}$ is the transpose of $\boldsymbol{\Sigma}_{k\,(j')j'}$, $\boldsymbol{\Sigma}_{k\,j'(j')}$ is the $j'$th row vector of the covariance matrix $\boldsymbol{\Sigma}_k$ excluding its $j'$th element, and $\boldsymbol{\Sigma}_{k\,(j')(j')}$ is the covariance matrix $\boldsymbol{\Sigma}_k$ excluding its $j'$th row and $j'$th column.
    •   For each nominal instance $x_{ij}^{n}$, $i = 1, \ldots, N$, $j = 1, \ldots, M_n$, sample $(z_{ij'} \mid \cdots) \sim \mathbb{1}(z_{ij'} \in A_{ij'})\, N(z_{ij'}; \tilde{m}_{ij'}, \tilde{v}_{ij'})$ for $j' = M_c + M_o + \sum_{h<j}(L_h^{n} - 1) + 1, \ldots, M_c + M_o + \sum_{h \le j}(L_h^{n} - 1)$, with
      $$A_{ij'} = \begin{cases} \{z_{ij'} < 0\} & \text{if } x_{ij}^{n} = 0,\\ \{z_{ij'} = \max\{z_{i\ell},\ \ell\} > 0\} & \text{if } x_{ij}^{n} = r,\\ \{z_{ij'} < \max\{z_{i\ell},\ \ell\}\} & \text{otherwise}, \end{cases}$$
      for which $\ell = M_c + M_o + \sum_{h<j}(L_h^{n} - 1) + 1, \ldots, M_c + M_o + \sum_{h \le j}(L_h^{n} - 1)$, $r = j' - \big(M_c + M_o + \sum_{h<j}(L_h^{n} - 1)\big)$, and $\tilde{m}_{ij'}$, $\tilde{v}_{ij'}$ are as before. In the case of a missing instance, the univariate normal distribution with the conditional mean and variance is used to impute the corresponding $z_{ij'}$.
    •   Step 8: We implement the two label-switching moves described in [30] to improve the mixing of the algorithm. The moves are given as follows:
    • (i) Randomly choose two non-empty clusters, say $i$ and $j$, and propose to exchange their assignment variables and cluster-specific parameters. The exchange is accepted with probability
      $$\min\left\{1, \ \left(\frac{w_j}{w_i}\right)^{N_i - N_j}\right\}.$$
    • (ii) Randomly choose two adjacent non-empty clusters, say $i$ and $i+1$, and propose to exchange their assignment variables and parameters. The exchange is accepted with probability
      $$\min\left\{1, \ \frac{(1 - \pi_{i+1})^{N_i}}{(1 - \pi_i)^{N_{i+1}}}\right\}.$$
    • These moves are complementary: the former has a high probability of switching two clusters with similar weights, while the latter has a high probability of switching two clusters with unequal weights.
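As mentioned before Algorithm 1, the following is a compressed R sketch of Steps 1–4 for a single sweep. It assumes `Z` is the N × Q matrix of observed and latent values, `cl` the current labels in 1..K, `mu` a K × Q matrix, `Sigma` a list of K covariance matrices, `alpha` the concentration parameter, and `b0`, `S0` the prior mean hyperparameters. Steps 5–8 (restricted covariance, α, latent Z, and label switching) are omitted, as is the extension of K with new components drawn from the prior in Step 3, so this is only a sketch of the idea, not the full sampler.

```r
library(mvtnorm)   # dmvnorm(), rmvnorm()

slice_sweep <- function(Z, cl, mu, Sigma, alpha, b0, S0) {
  N <- nrow(Z); K <- nrow(mu)
  Nk <- tabulate(cl, nbins = K)

  # Step 1: stick-breaking weights from their Beta full conditionals
  pi_k <- rbeta(K, 1 + Nk, alpha + rev(cumsum(rev(Nk))) - Nk)
  w    <- pi_k * cumprod(c(1, 1 - pi_k[-K]))

  # Step 2: slice (auxiliary) variables
  u <- runif(N, 0, w[cl])

  # Step 3: reassign each observation among the components its slice allows
  # (new components beyond K, drawn from the prior, are not added in this sketch)
  for (i in 1:N) {
    active <- which(w > u[i])
    logp   <- sapply(active, function(k) dmvnorm(Z[i, ], mu[k, ], Sigma[[k]], log = TRUE))
    cl[i]  <- active[sample(length(active), 1, prob = exp(logp - max(logp)))]
  }

  # Step 4: conjugate update of the occupied component means
  S0_inv <- solve(S0)
  for (k in unique(cl)) {
    Zk     <- Z[cl == k, , drop = FALSE]
    Sk_inv <- solve(Sigma[[k]])
    S_post <- solve(S0_inv + nrow(Zk) * Sk_inv)
    b_post <- S_post %*% (S0_inv %*% b0 + nrow(Zk) * Sk_inv %*% colMeans(Zk))
    mu[k, ] <- rmvnorm(1, mean = as.vector(b_post), sigma = S_post)
  }
  list(cl = cl, mu = mu, u = u, w = w)
}
```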
Since this study focuses on the clustering problem, we are only interested in the cluster assignment variable $\mathbf{c} = (c_1, \ldots, c_N)$, where each unique value of $c_i$ represents a cluster. The number of clusters is then derived as the number of unique values in $\mathbf{c}$. Both of these quantities are stored throughout the iterations.

4. Experimental Results

This section presents some empirical experiments with simulated and real data to illustrate the performance of the proposed model.

4.1. Simulated Data Examples

In this experiment, we simulate two different examples to assess the performance of the proposed model in determining the true cluster structure. Throughout this experiment, we consider a setup consisting of two continuous variables $(X_1^{c}, X_2^{c})$, two ordinal variables $(X_1^{o}, X_2^{o})$, each having four levels, and two nominal variables $(X_1^{n}, X_2^{n})$, each having four categories.
In the first example, we consider a case where a mixture model with a common covariance matrix is used to simulate the underlying latent vectors, with a sample size of 1000. Let $\mathbf{z}_i = (z_{i1}, \ldots, z_{i10})$ for $i = 1, \ldots, 1000$ be the underlying latent vectors simulated from an equal-weight mixture of three multivariate normals, $\tfrac{1}{3}N_{10}(\boldsymbol{\mu}_1, \mathbf{I}_{10}) + \tfrac{1}{3}N_{10}(\boldsymbol{\mu}_2, \mathbf{I}_{10}) + \tfrac{1}{3}N_{10}(\boldsymbol{\mu}_3, \mathbf{I}_{10})$, with the component means as follows:
$$\mu_{1j} = (1/j)(-1)^{j} + \Delta, \qquad \mu_{2j} = (1/j)(-1)^{j}, \qquad \mu_{3j} = (1/j)(-1)^{j} - \Delta, \qquad j = 1, \ldots, 10.$$
Here, $\Delta$ controls the separation between the component means, and the values $\Delta \in \{1.9, 2.1, 2.3\}$ are considered. These values of $\Delta$ are designed to cover a range of cases, from highly overlapping to well-separated cluster structures; the greater the value of $\Delta$, the less overlap between the generated clusters, as illustrated in Figure 3. Then, from the simulated latent vectors, we set the continuous data instances such that $(x_{i1}^{c}, x_{i2}^{c}) = (z_{i1}, z_{i6})$. For the ordinal data instances, we set $x_{i1}^{o}$ and $x_{i2}^{o}$ by discretizing the scaled $z_{i2}$ and $z_{i7}$, respectively, using the thresholds described in Section 3.1. Lastly, for the nominal data instances, we set $x_{i1}^{n}$ by transforming the scaled $(z_{i3}, z_{i4}, z_{i5})$ using the probit specification in Section 3.1, while for $x_{i2}^{n}$ we use $(z_{i8}, z_{i9}, z_{i10})$.
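A sketch of this data-generating scheme is given below; it follows the component-mean formula above, while the exact way each latent column is standardized before discretization and the probit mapping is an assumption about details not fully spelled out in the text.

```r
set.seed(2)
N <- 1000; Delta <- 2.3
base_mean <- (1 / (1:10)) * (-1)^(1:10)
mu_list   <- list(base_mean + Delta, base_mean, base_mean - Delta)
labels    <- sample(1:3, N, replace = TRUE)                 # equal mixture weights
Z <- t(sapply(labels, function(k) rnorm(10, mean = mu_list[[k]], sd = 1)))

x_cont <- Z[, c(1, 6)]                                      # two continuous variables
cuts   <- qnorm((0:4) / 4)                                  # ordinal cut-offs, four levels
x_ord  <- cbind(cut(as.vector(scale(Z[, 2])), cuts, labels = FALSE) - 1,
                cut(as.vector(scale(Z[, 7])), cuts, labels = FALSE) - 1)
to_nominal <- function(z) if (max(z) < 0) 0L else which.max(z)   # probit max rule of (5)
x_nom  <- cbind(apply(scale(Z[, 3:5]),  1, to_nominal),
                apply(scale(Z[, 8:10]), 1, to_nominal))
```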
To highlight the impact of different covariance structure assumptions on the clustering performance, we also run our model with the covariance matrix set to be common across components such that
$$\begin{aligned} \mathbf{z}_i = (\mathbf{x}_i^{c}, \mathbf{z}_i^{o}, \mathbf{z}_i^{n}) \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}^{\#}, \mathbf{c} &\sim N_Q(\boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}^{\#}), \quad i = 1, \ldots, N,\\ c_i \mid \mathbf{W} &\sim \sum_{k=1}^{\infty} w_k \delta_k(\cdot), \quad i = 1, \ldots, N,\\ \boldsymbol{\mu}_k &\sim N_Q(\mathbf{b}_0, \mathbf{S}_0), \quad k = 1, \ldots, \infty,\\ \boldsymbol{\Sigma}^{\#} &\sim \mathrm{PXW}_Q(\nu_0, \boldsymbol{\Omega}_0). \end{aligned} \qquad (10)$$
The algorithm for the model in (10) is nearly identical to Algorithm 1, except for Step 5, where the restricted covariance matrix $\boldsymbol{\Sigma}^{\#}$ is updated only once per iteration. The joint posterior $p(\mathbf{D}^{\#}, \boldsymbol{\Sigma}^{\#} \mid \cdots)$ is then defined based on the likelihood of all data instances as follows:
$$p(\mathbf{D}^{\#}, \boldsymbol{\Sigma}^{\#} \mid \cdots) \propto p(\mathbf{D}^{\#}, \boldsymbol{\Sigma}^{\#}) \times \prod_{i=1}^{N} p(\mathbf{z}_i \mid \boldsymbol{\mu}_{c_i}, \boldsymbol{\Sigma}^{\#}) \propto |\boldsymbol{\Sigma}^{\#}|^{\frac{\nu_0 - Q - 1 - N}{2}}\, |\mathbf{D}^{\#}|^{\frac{\nu_0}{2} - 1}\operatorname{etr}\!\left\{-\tfrac{1}{2}\left(\boldsymbol{\Omega}_0^{-1}\mathbf{D}^{\#\,1/2}\boldsymbol{\Sigma}^{\#}\mathbf{D}^{\#\,1/2} + \boldsymbol{\Sigma}^{\#\,-1}\mathbf{S}_{\mu}\right)\right\},$$
where $\mathbf{S}_{\mu} = \sum_{k=1}^{K}\sum_{i : c_i = k}(\mathbf{z}_i - \boldsymbol{\mu}_k)(\mathbf{z}_i - \boldsymbol{\mu}_k)^{\top}$.
Throughout this paper, the algorithm is initialized using $k$-means on a random $\mathbf{Z}$ with ten clusters for both the cluster-specific and common covariance models. The algorithm is then run for 100,000 iterations with 50,000 as burn-in, after which we retain samples of the cluster assignment variable $\mathbf{c}$ with a thinning of 30, resulting in 1667 samples of $\mathbf{c}$. As for the parameter settings, we use the default values described in Section 3.1. Our current implementation is written in the R environment [32]. The total run time varies depending on the data sample size and the number of variables involved; for instance, the simulation study takes around 3 h per run on a laptop with a 2.7 GHz quad-core processor.
Due to the weak identifiability of the mixture components, the posterior distribution exhibits multimodal behavior [30]. This may cause the algorithm to become stuck in one of the modes. In such situations, even if the algorithm is run for sufficiently many iterations, it will keep producing outputs around that particular mode. This issue has been noted in many Bayesian nonparametric models (see, for example, [30,33,34]), where, most of the time, full exploration of the MCMC sample space is not guaranteed. Figure 4 provides the trace plots of the number of clusters $K$ sampled by both models. These trace plots show that $K$ stabilizes to a common value in most cases. However, in the trace plot for the model with cluster-specific covariance when $\Delta = 1.9$, we can see that the algorithm is stuck in a mode that exhibits a lower number of clusters. To handle this issue, it is common practice to restart the algorithm multiple times to ensure the consistency of the results. Hence, the clustering results in this manuscript are reported based on ten runs of the algorithm with different initializations of $\mathbf{Z}$. It should be noted that, if the data sample size is too small relative to the perceived number of clusters, the algorithm may fail to converge since the data are not sufficient to dominate the prior.
In order to gauge the clustering performance, we adopt the following criteria as measures of agreement between the estimated partition and the true partition:
  • Normalized mutual information (NMI) [35]
    $$\mathrm{NMI} = \frac{\displaystyle\sum_{k}\sum_{\hat{k}} \frac{|S_k \cap \hat{S}_{\hat{k}}|}{N}\log\!\left(\frac{N\,|S_k \cap \hat{S}_{\hat{k}}|}{|S_k|\cdot|\hat{S}_{\hat{k}}|}\right)}{\sqrt{\left(\displaystyle\sum_{k}\frac{|S_k|}{N}\log\frac{|S_k|}{N}\right)\left(\displaystyle\sum_{\hat{k}}\frac{|\hat{S}_{\hat{k}}|}{N}\log\frac{|\hat{S}_{\hat{k}}|}{N}\right)}}$$
  • Adjusted Rand index (ARI) [36]
    $$\mathrm{ARI} = \frac{\displaystyle\sum_{i}\sum_{j}\binom{|S_i \cap \hat{S}_j|}{2} - \left[\displaystyle\sum_{i}\binom{|S_i|}{2}\sum_{j}\binom{|\hat{S}_j|}{2}\right]\Big/\binom{N}{2}}{\frac{1}{2}\left[\displaystyle\sum_{i}\binom{|S_i|}{2} + \sum_{j}\binom{|\hat{S}_j|}{2}\right] - \left[\displaystyle\sum_{i}\binom{|S_i|}{2}\sum_{j}\binom{|\hat{S}_j|}{2}\right]\Big/\binom{N}{2}}$$
    where $\hat{S} = (\hat{S}_1, \ldots, \hat{S}_{K})$ is the estimated partition such that $\hat{S}_k = \{j : c_j = k, \ j = 1, \ldots, N\}$, $S = (S_1, S_2, \ldots, S_K)$ is the true partition, and $|\cdot|$ denotes the number of instances in a cluster.
Both NMI and ARI take a value between zero and one; a higher value indicates a better clustering performance.
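As a minimal sketch, both indices can be computed in R from the contingency table of the true and estimated label vectors; the function names below are illustrative, the inputs are assumed to be integer label vectors of equal length, and the natural logarithm is used.

```r
nmi <- function(truth, est) {
  joint <- table(truth, est) / length(truth)
  px <- rowSums(joint); py <- colSums(joint)
  mi <- sum(ifelse(joint > 0, joint * log(joint / outer(px, py)), 0))
  hx <- -sum(px * log(px)); hy <- -sum(py * log(py))
  mi / sqrt(hx * hy)
}

ari <- function(truth, est) {
  ct  <- table(truth, est)
  s_c <- sum(choose(ct, 2))                   # pairs agreeing in both partitions
  s_r <- sum(choose(rowSums(ct), 2)); s_k <- sum(choose(colSums(ct), 2))
  e   <- s_r * s_k / choose(sum(ct), 2)       # expected agreement under independence
  (s_c - e) / (0.5 * (s_r + s_k) - e)
}
```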
Table 1 shows the average values of NMI and ARI (with their standard deviations in parentheses) for both models. Since the algorithm is run multiple times, the averages are taken over all 1667 × 10 sampled $\mathbf{c}$'s. The estimated number of clusters $\hat{K}$, defined as the most frequent number of clusters visited across all sampled $\mathbf{c}$'s, is also reported in Table 1. In general, the model with cluster-specific covariance performs worse than the model with common covariance. This result is anticipated since the simulated data are consistent with the common covariance assumption. The difference in performance is more evident when the clusters overlap strongly. Nevertheless, both models exhibit almost identical clustering performance when the clusters are well separated. In terms of estimating the true number of clusters, both models succeed in capturing the true cluster structure with $\hat{K} = 3$. We also attempted to apply the model proposed in [17] to the simulated data as a comparison. Unfortunately, since their algorithm involves many prior hyperparameters that need to be tuned specifically for each dataset, we failed to get their algorithm to mix properly, resulting in many singleton clusters. In addition, compared to their algorithm, our proposed algorithm requires only one tuning hyperparameter, which controls the acceptance rate of the Metropolis–Hastings update. Note that we also use the same model parameters and algorithm settings throughout the paper.
As for the second example, we use a similar experimental setup to the previous one but with a different covariance structure for each cluster. We set $\boldsymbol{\Sigma}_1 = \mathrm{CS}(0.9)$, $\boldsymbol{\Sigma}_2 = \mathrm{CS}(0.5)$, and $\boldsymbol{\Sigma}_3 = \mathrm{CS}(0.1)$, where CS refers to a compound symmetry matrix, that is, a matrix with equal variances on the main diagonal and equal correlations on all off-diagonal elements. The different parameters of the CS matrices introduce different variances and correlations between the underlying latent variables and thus change the geometric orientation of the covariance matrix of each cluster. From the scatterplot of the simulated data in Figure 5, especially the plot of $X_1^{c}$ versus $X_2^{c}$, we can see that each cluster exhibits a different shape. In this example, we consider $\Delta \in \{3.3, 3.5, 3.7\}$.
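For reference, a compound symmetry matrix of this kind can be built as follows; the unit variances and the dimension default are assumptions made for illustration.

```r
# CS(rho): equal (unit) variances on the diagonal, common correlation rho off it.
CS <- function(rho, d = 10) {
  m <- matrix(rho, d, d)
  diag(m) <- 1
  m
}
Sigma1 <- CS(0.9); Sigma2 <- CS(0.5); Sigma3 <- CS(0.1)
```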
We report the results for the second scenario in Table 2. It is clear that the model with cluster-specific covariance matrices performs better, with higher NMI and ARI scores than the model with common covariance. In addition, the model with the cluster-specific covariance assumption is able to recover the true three-cluster structure in all cases. On the contrary, the model with the common covariance assumption overestimates the number of clusters. Even for a highly separated cluster structure, the model with common covariance fails to infer the true $K$, and the overestimation of $K$ is still observed. This overestimation of the number of clusters is a result of the nonparametric feature of the DPMM, which adapts the number of components to optimize the model fit. For illustration, we plot the simulated data according to one of the sampled assignment variables from the inferential algorithm in Figure 6. Due to the strict assumption of a common covariance matrix, the model in Figure 6a is forced to introduce superfluous components to fit the data better while maintaining the same covariance shape across all clusters.

4.2. Real Data Applications

As real data applications, we consider two datasets: prostate cancer and chronic kidney disease data.

4.2.1. Prostate Cancer Data

We examine a dataset from a randomized clinical trial [37], which has been used in several studies (see [10,38]). The dataset involves 502 patients classified into clinical stage 3 and stage 4 prostate cancer. Stage 3 patients show only a local extension of the tumor with no distant metastasis, while stage 4 patients have distant metastasis. As listed in Table 3, twelve clinical variables were collected from the patients, eight of them continuous, one ordinal, and three nominal. Following the pre-processing steps outlined in [38], the variables size of primary lesion and prostatic acid phosphatase are transformed using the square root and log functions, respectively. In this study, we evaluate the ability of the proposed model to differentiate between the two stages of cancer.
The clustering results for the models with common covariance and cluster-specific covariance are reported in Table 4. The NMI and ARI scores are measured based on the agreement with the clinical stages. The model with cluster-specific covariance accurately selects $\hat{K} = 2$, which is in line with the clinical stages. On the other hand, as expected, the model with common covariance overestimates the number of clusters to better accommodate the data. It is thus evident that the common covariance assumption is too rigid in real applications, which impairs the clustering performance.
Given the posterior draws of the cluster assignment variable $\mathbf{c}$, it is often desirable to determine an optimal partition that best represents all the output of the algorithm. To obtain that single estimate, we use the method proposed in [39] to post-process all the sampled $\mathbf{c}$'s based on the variation of information (VI). This method involves calculating a loss function between each output $\hat{\mathbf{c}}$ of the inferential algorithm and the unknown true assignment variable $\mathbf{c}$ such that
$$\mathbf{c}_{\mathrm{opt}} = \operatorname*{argmin}_{\hat{\mathbf{c}}}\ \mathrm{E}\big[\mathrm{VI}(\mathbf{c}, \hat{\mathbf{c}}) \mid \mathcal{D}\big] = \operatorname*{argmin}_{\hat{\mathbf{c}}}\ \sum_{n=1}^{N}\log\!\left(\sum_{n'=1}^{N}\mathbb{1}(\hat{c}_{n'} = \hat{c}_n)\right) - 2\sum_{n=1}^{N}\log\!\left(\sum_{n'=1}^{N} p(c_{n'} = c_n \mid \mathcal{D})\,\mathbb{1}(\hat{c}_{n'} = \hat{c}_n)\right), \qquad (11)$$
where $\mathcal{D}$ denotes the data.
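A sketch of how (11) can be evaluated over the retained draws is given below; `draws` is assumed to be an S × N matrix whose rows are the sampled assignment vectors, and the co-clustering probabilities $p(c_n = c_{n'} \mid \mathcal{D})$ are estimated by their Monte Carlo frequencies.

```r
vi_optimal <- function(draws) {
  S <- nrow(draws); N <- ncol(draws)
  P <- matrix(0, N, N)                                 # posterior co-clustering probabilities
  for (s in 1:S) P <- P + outer(draws[s, ], draws[s, ], "==")
  P <- P / S
  score <- apply(draws, 1, function(ch) {
    same <- outer(ch, ch, "==")                        # 1(c_hat_n = c_hat_n')
    sum(log(rowSums(same))) - 2 * sum(log(rowSums(P * same)))
  })
  draws[which.min(score), ]                            # the sampled partition minimizing (11)
}
```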
Hence, the optimal partition is the $\mathbf{c}_{\mathrm{opt}}$ that minimizes (11), and the search for $\mathbf{c}_{\mathrm{opt}}$ is restricted to the $\mathbf{c}$'s sampled from the inferential algorithm. Table 5 reports the confusion matrix of the optimal partition under the model with common covariance. As the optimal partition is the one that best represents all the sampled $\mathbf{c}$'s, the number of clusters reported in Table 5 may not match the most frequent number of clusters $\hat{K}$ inferred by the algorithm (see Table 4). Upon initial observation, the model appears to perform satisfactorily in grouping stage 3 patients into Cluster #1, with only two stage 3 patients assigned to other clusters. Despite this, over half of the stage 4 patients are also grouped into Cluster #1. This suggests that the model with common covariance has failed to differentiate between stage 3 and stage 4 patients.
Table 6 reports the confusion matrix of the optimal partition under the model with cluster-specific covariance. There is a disparity between the optimal partition and the clinical classification for 47 patients. In general, however, the model is able to differentiate between the two stages of cancer, with the majority of stage 3 patients grouped into Cluster #1 and stage 4 patients into Cluster #2. It is also worth mentioning that our model seems to outperform the model in [10], which reports 67 patients grouped into the wrong cluster. Their study treats the variable performance rating as nominal instead of ordinal, since their location mixture model can only handle continuous and nominal data. Apart from that, the performance of our model also surpasses the model in [38], which groups the patients into three clusters. We believe this is due to the parsimonious covariance structure assumed by their model, which imposes restrictions on the shape of the clusters and thus requires more clusters to fit the data well.

4.2.2. Chronic Kidney Disease

To further showcase the utility of the proposed model, we apply the model to the chronic kidney disease dataset from [40]. The dataset consists of 400 instances of clinical data from patients with an age range from 2 to 90 years. Out of 400 samples, 250 of them are diagnosed with chronic kidney disease (CKD). Therefore, we will use the proposed model to discriminate CKD patients from non-CKD patients. All 24 clinical variables collected are listed in Table 7. As far as we know, these data have not yet been analyzed in previous related studies involving model-based clustering.
Upon first inspection of the data, we observed that some data instances appear abnormal. One data point reports an implausible sodium level of 4.5 mEq/L, lower than the lowest level ever recorded in the medical literature, 99 mEq/L [41]. In addition, we came across two other data points exhibiting extremely high potassium levels, at 39 and 47 mEq/L, respectively; these readings exceed the level reported in a documented case study of extreme hyperkalemia, 14 mEq/L [42]. To account for the possibility of data entry errors, these three data points are treated as missing values. Furthermore, we also observe significant skewness in the variables blood urea and serum creatinine; hence, a logarithmic transformation is applied to these variables.
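As a sketch of this pre-processing, the following R lines mark the implausible readings as missing and apply the log transformation; the data frame and column names (e.g. `ckd$sodium`) are assumptions about how the raw file is read, and the numeric cut-offs simply encode the documented extremes cited above.

```r
ckd$sodium[which(ckd$sodium < 99)]       <- NA   # below the lowest level documented in [41]
ckd$potassium[which(ckd$potassium > 14)] <- NA   # above the extreme case reported in [42]
ckd$blood_urea       <- log(ckd$blood_urea)      # reduce skewness
ckd$serum_creatinine <- log(ckd$serum_creatinine)
```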
From Table 8, a similar trend in performance can be seen: the model with common covariance overclusters the data, with an estimated number of clusters of 9, and thus shows poor clustering performance as measured by the NMI and ARI. On the other hand, the model with cluster-specific covariance matrices correctly identifies the true number of clusters as 2.
The confusion matrix of the optimal partition under the model with common covariance is reported in Table 9. We can see that all 150 non-CKD patients are grouped into one cluster, Cluster #1. However, Cluster #1 does not contain only non-CKD individuals; it also includes nine misclassified CKD patients. The way the remaining CKD patients are split across eight distinct, mostly small clusters suggests that a single covariance shape is insufficient to capture the cluster structure of the CKD group. Differences in cluster locations alone cannot adequately represent the differences in the shapes of the underlying data, which arise from the different proportions of instances across levels/categories together with the correlations between variables. The assumption of common covariance results in overfitting one particular cluster, which in turn hinders the ability to fit the rest of the data.
Table 10 shows the optimal partition under the model with cluster-specific covariance matrices. It is evident that all of the 250 CKD patients have been accurately classified into Cluster #1. Out of 150 individuals who are not diagnosed with CKD, only 23 are misclassified as having CKD and grouped into Cluster #1. While the model with cluster-specific covariance provides a better fit for the CKD groups, it comes at the cost of a reduced fit for the non-CKD group. The outcome contradicts the one obtained from the model with a common covariance structure, which demonstrates a superior fit for the non-CKD group in comparison to the CKD group. This is due to the difference in covariance structure that molds the underlying latent vector and thus causes changes in the cluster configuration.

5. Conclusions

This article presents a model for clustering mixed continuous, ordinal, and nominal data that allows for cluster-specific covariance matrices. Combining the continuous variables with the latent variable framework underlying the ordinal and nominal variables allows us to use a DPMM with multivariate Gaussian kernels. The flexibility of our model over the model with a common covariance matrix assumption is illustrated in a simulated example of clusters with different geometric orientations. This is often the case in real practice, as supported by the applications to prostate cancer staging and chronic kidney disease classification.
Because the cut-offs of the variables underlying the non-continuous data are fixed, covariance matrices of different shapes are needed to fully capture the variation in proportions across levels/categories that contributes to cluster discrimination. Applying the model with a common covariance assumption may occasionally improve the clustering performance in a dataset where every cluster is believed to have exactly the same shape. However, such situations are rare in real clustering problems, since a mere correlation between the variables is enough to give a different orientation to the covariance matrix.
One key issue that arises in sampling is the restricted covariance matrices. Taking advantage of developments in Bayesian prior construction, we have run our model using the PX-MH update [26] because of its relatively simple implementation. Nevertheless, other updating steps could in general be used, such as that in [17], although computational complexity is a factor that needs to be considered.
Our ongoing work considers how to incorporate other hierarchical prior structures in an attempt to select relevant variables when clustering mixed-type data.

Author Contributions

Conceptualization, N.A.B.; Formal analysis, N.A.B.; Methodology, N.A.B.; Software, N.A.B.; Supervision, K.I., H.S.Z. and N.M.; Writing—original draft, N.A.B.; Writing—review and editing, N.A.B., K.I., H.S.Z. and N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The prostate cancer and the chronic kidney disease data can be retrieved from https://hbiostat.org/data (accessed on 10 July 2023) and https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease (accessed on 10 July 2023), respectively. The code is available at https://drive.google.com/drive/folders/1ELQ7BhASmui7lNL3nEwG48o65fys21_E?usp (accessed on 10 July 2023).

Acknowledgments

The first author is sponsored by Universiti Kebangsaan Malaysia and Malaysian Ministry of Higher Education.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Banfield, J.D.; Raftery, A.E. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49, 803–821. [Google Scholar] [CrossRef]
  2. Celeux, G.; Govaert, G. Gaussian parsimonious clustering models. Pattern Recognit. 1995, 28, 781–793. [Google Scholar] [CrossRef]
  3. Burhanuddin, N.A.; Adam, M.B.; Ibrahim, K. Clustering with label constrained Dirichlet process mixture model. Eng. Appl. Artif. Intell. 2022, 107, 104543. [Google Scholar] [CrossRef]
  4. Hunt, L.; Jorgensen, M. Clustering mixed data. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery; Wiley: Hoboken, NJ, USA, 2011; Volume 1, pp. 352–361. [Google Scholar]
  5. Moustaki, I.; Papageorgiou, I. Latent class models for mixed variables with applications in Archaeometry. Comput. Stat. Data Anal. 2005, 48, 659–675. [Google Scholar] [CrossRef]
  6. Van Hattum, P.; Hoijtink, H. Market Segmentation Using Brand Strategy Research: Bayesian Inference with Respect to Mixtures of Log-Linear Models. J. Classif. 2009, 26, 297–328. [Google Scholar] [CrossRef]
  7. Krzanowski, W. The location model for mixtures of categorical and continuous variables. J. Classif. 1993, 10, 25–49. [Google Scholar] [CrossRef]
  8. Hunt, L.; Jorgensen, M. Theory & Methods: Mixture model clustering using the MULTIMIX program. Aust. N. Z. J. Stat. 1999, 41, 154–171. [Google Scholar]
  9. Willse, A.; Boik, R.J. Identifiable finite mixtures of location models for clustering mixed-mode data. Stat. Comput. 1999, 9, 111–121. [Google Scholar] [CrossRef]
  10. Hunt, L.; Jorgensen, M. Mixture model clustering for mixed data with missing information. Comput. Stat. Data Anal. 2003, 41, 429–440. [Google Scholar] [CrossRef]
  11. Everitt, B.S. A finite mixture model for the clustering of mixed-mode data. Stat. Probab. Lett. 1988, 6, 305–309. [Google Scholar] [CrossRef]
  12. Morlini, I. A latent variables approach for clustering mixed binary and continuous variables within a Gaussian mixture model. Adv. Data Anal. Classif. 2012, 6, 5–28. [Google Scholar] [CrossRef]
  13. McParland, D.; Gormley, I.C.; McCormick, T.H.; Clark, S.J.; Kabudula, C.W.; Collinson, M.A. Clustering South African households based on their asset status using latent variable models. Ann. Appl. Stat. 2014, 8, 747. [Google Scholar] [CrossRef]
  14. Murray, J.S.; Reiter, J.P. Multiple Imputation of Missing Categorical and Continuous Values via Bayesian Mixture Models with Local Dependence. J. Am. Stat. Assoc. 2016, 111, 1466–1479. [Google Scholar] [CrossRef]
  15. DeYoreo, M.; Reiter, J.P.; Hillygus, D.S. Bayesian Mixture Models with Focused Clustering for Mixed Ordinal and Nominal Data. Bayesian Anal. 2017, 12, 679–703. [Google Scholar] [CrossRef]
  16. Storlie, C.B.; Myers, S.M.; Katusic, S.K.; Weaver, A.L.; Voigt, R.G.; Croarkin, P.E.; Stoeckel, R.E.; Port, J.D. Clustering and variable selection in the presence of mixed variable types and missing data. Stat. Med. 2018, 37, 2884–2899. [Google Scholar] [CrossRef]
  17. Carmona, C.; Nieto-Barajas, L.; Canale, A. Model-based approach for household clustering with mixed scale variables. Adv. Data Anal. Classif. 2019, 13, 559–583. [Google Scholar] [CrossRef]
  18. Sethuraman, J. A Constructive Definition of Dirichlet Priors. Stat. Sin. 1994, 4, 639–650. [Google Scholar]
  19. Ali, I.; Rehman, A.U.; Khan, D.M.; Khan, Z.; Shafiq, M.; Choi, J.G. Model Selection Using K-Means Clustering Algorithm for the Symmetrical Segmentation of Remote Sensing Datasets. Symmetry 2022, 14, 1149. [Google Scholar] [CrossRef]
  20. Chib, S.; Greenberg, E. Analysis of Multivariate Probit Models. Biometrika 1998, 85, 347–361. [Google Scholar] [CrossRef]
  21. Albert, J.H.; Chib, S. Bayesian Analysis of Binary and Polychotomous Response Data. J. Am. Stat. Assoc. 1993, 88, 669–679. [Google Scholar] [CrossRef]
  22. Kottas, A.; Müller, P.; Quintana, F. Nonparametric Bayesian Modeling for Multivariate Ordinal Data. J. Comput. Graph. Stat. 2005, 14, 610–625. [Google Scholar] [CrossRef]
  23. DeYoreo, M.; Kottas, A. Bayesian Nonparametric Modeling for Multivariate Ordinal Regression. J. Comput. Graph. Stat. 2018, 27, 71–84. [Google Scholar] [CrossRef]
  24. McCulloch, R.; Rossi, P.E. An exact likelihood analysis of the multinomial probit model. J. Econom. 1994, 64, 207–240. [Google Scholar] [CrossRef]
  25. McCulloch, R.E.; Polson, N.G.; Rossi, P.E. A Bayesian analysis of the multinomial probit model with fully identified parameters. J. Econom. 2000, 99, 173–193. [Google Scholar] [CrossRef]
  26. Zhang, X.; Boscardin, W.J.; Belin, T.R. Sampling Correlation Matrices in Bayesian Models with Correlated Latent Variables. J. Comput. Graph. Stat. 2006, 15, 880–896. [Google Scholar] [CrossRef]
  27. Escobar, M.D.; West, M. Bayesian Density Estimation and Inference Using Mixtures. J. Am. Stat. Assoc. 1995, 90, 577–588. [Google Scholar] [CrossRef]
  28. Richardson, S.; Green, P.J. On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1997, 59, 731–792. [Google Scholar] [CrossRef]
  29. Walker, S.G. Sampling the Dirichlet mixture model with slices. Commun. Stat.-Simul. Comput. 2007, 36, 45–54. [Google Scholar] [CrossRef]
  30. Papaspiliopoulos, O.; Roberts, G.O. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika 2008, 95, 169–186. [Google Scholar] [CrossRef]
  31. Gelman, A.; Roberts, G.O.; Gilks, W.R. Efficient Metropolis jumping rules. In Bayesian Statistics 5; Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M., Eds.; Oxford University Press: Oxford, UK, 1996; pp. 599–608. [Google Scholar]
  32. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022. [Google Scholar]
  33. Papaspiliopoulos, O.; Roberts, G.O.; Skold, M. A general framework for the parametrization of hierarchical models. Stat. Sci. 2007, 22, 59–73. [Google Scholar] [CrossRef]
  34. Jain, S.; Neal, R.M. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Stat. 2004, 13, 158–182. [Google Scholar] [CrossRef]
  35. Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
  36. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
  37. Byar, D.P.; Green, S.B. The choice of treatment for cancer patients based on covariate information. Bull. Cancer 1980, 67, 477–490. [Google Scholar]
  38. McParland, D.; Gormley, I.C. Model based clustering for mixed data: clustMD. Adv. Data Anal. Classif. 2016, 10, 155–169. [Google Scholar] [CrossRef]
  39. Wade, S.; Ghahramani, Z. Bayesian Cluster Analysis: Point Estimation and Credible Balls. Bayesian Anal. 2018, 13, 559–626. [Google Scholar] [CrossRef]
  40. Jerlin Rubini, L.; Perumal, E. Efficient classification of chronic kidney disease by using multi-kernel support vector machine and fruit fly optimization algorithm. Int. J. Imaging Syst. Technol. 2020, 30, 660–673. [Google Scholar] [CrossRef]
  41. Gupta, E.; Kunjal, R.; Cury, J.D. Severe hyponatremia due to valproic acid toxicity. J. Clin. Med. Res. 2015, 7, 717–719. [Google Scholar] [CrossRef]
  42. Tran, H. Extreme hyperkalemia. South. Med. J. 2005, 98, 729–733. [Google Scholar] [CrossRef]
Figure 1. Generated underlying latent data of ordinal variables with four levels: (a) the proportions of each level are almost evenly distributed, and (b) the proportions of each level are concentrated on the upper levels. Different colors indicate different levels.
Figure 2. Generated underlying latent data of nominal variables with three categories: (a) the non-frequent categories are evenly distributed, and (b) similarly, the non-frequent categories are evenly distributed but with a smaller frequent category. Different colors (symbols) indicate different categories.
Figure 3. Pairwise scatterplots for Simulation 1 for different values of Δ: (a) Δ = 1.9, and (b) Δ = 2.3, depicting the different degrees of overlap between clusters. Different colors (symbols) represent different clusters.
Figure 4. The trace plot of the number of clusters K sampled in one of the algorithm runs for Simulation 1 for different values of Δ: (a) Δ = 1.9, (b) Δ = 2.1, and (c) Δ = 2.3. Here, the number of iterations recorded is after thinning, and the dashed red line divides the burn-in and sampling iterations.
Figure 5. Pairwise scatterplots for Simulation 2 for different values of Δ: (a) Δ = 3.3, and (b) Δ = 3.7, depicting different degrees of overlap between clusters. Different colors (symbols) represent different clusters.
Figure 6. Pairwise scatterplots for Simulation 2 with Δ = 3.3 according to one of the sampled assignment variables for both models: (a) common covariance, and (b) cluster-specific covariance. Different colors (symbols) represent different clusters.
Table 1. Results for Simulation 1.

                    Common Covariance    Cluster-Specific Covariance
Δ = 1.9   NMI       0.803 (0.015)        0.683 (0.088)
          ARI       0.850 (0.015)        0.625 (0.177)
          K̂         3                    3
Δ = 2.1   NMI       0.853 (0.014)        0.836 (0.018)
          ARI       0.894 (0.012)        0.873 (0.020)
          K̂         3                    3
Δ = 2.3   NMI       0.897 (0.012)        0.888 (0.015)
          ARI       0.932 (0.010)        0.920 (0.013)
          K̂         3                    3
Table 2. Results for Simulation 2.

                    Common Covariance    Cluster-Specific Covariance
Δ = 3.3   NMI       0.762 (0.044)        0.834 (0.014)
          ARI       0.757 (0.082)        0.873 (0.013)
          K̂         5                    3
Δ = 3.5   NMI       0.779 (0.041)        0.862 (0.012)
          ARI       0.778 (0.071)        0.897 (0.011)
          K̂         4                    3
Δ = 3.7   NMI       0.839 (0.032)        0.890 (0.011)
          ARI       0.861 (0.045)        0.923 (0.009)
          K̂         4                    3
Table 3. The variables in the prostate cancer data.

Variable                     Type of Data    Number of Levels/Categories
Age                          Continuous      –
Weight index                 Continuous      –
Systolic blood pressure      Continuous      –
Diastolic blood pressure     Continuous      –
Serum hemoglobin             Continuous      –
Size of primary lesion       Continuous      –
Index of tumor stage/grade   Continuous      –
Prostatic acid phosphatase   Continuous      –
Performance rating           Ordinal         4
Cardiovascular disease       Nominal         2
Electrocardiogram code       Nominal         7
Bone metastasis              Nominal         2
Table 4. Results for the prostate cancer data.

        Common Covariance    Cluster-Specific Covariance
NMI     0.291 (0.079)        0.477 (0.041)
ARI     0.287 (0.113)        0.584 (0.043)
K̂       5                    2
Table 5. The confusion matrix for the prostate cancer data under the model with common covariance matrix.

                          Clinical Classification
                          Stage 3    Stage 4
Estimated   Cluster #1    287        123
            Cluster #2    1          80
            Cluster #3    0          10
            Cluster #4    1          0
Table 6. The confusion matrix for the prostate cancer data under the model with cluster-specific covariance matrices.

                          Clinical Classification
                          Stage 3    Stage 4
Estimated   Cluster #1    267        25
            Cluster #2    22         188
Table 7. The variables in the chronic kidney disease data.

Variable                   Type of Data    Number of Levels/Categories
Age                        Continuous      –
Blood pressure             Continuous      –
Blood glucose random       Continuous      –
Blood urea                 Continuous      –
Serum creatinine           Continuous      –
Sodium                     Continuous      –
Potassium                  Continuous      –
Hemoglobin                 Continuous      –
Packed cell volume         Continuous      –
White blood cell count     Continuous      –
Red blood cell count       Continuous      –
Specific gravity           Ordinal         5
Albumin                    Ordinal         6
Sugar                      Ordinal         6
Red blood cells            Nominal         2
Pus cell                   Nominal         2
Pus cell clumps            Nominal         2
Bacteria                   Nominal         2
Hypertension               Nominal         2
Diabetes mellitus          Nominal         2
Coronary artery disease    Nominal         2
Appetite                   Nominal         2
Pedal edema                Nominal         2
Anemia                     Nominal         2
Table 8. Results for the chronic kidney disease data.

        Common Covariance    Cluster-Specific Covariance
NMI     0.577 (0.026)        0.704 (0.023)
ARI     0.449 (0.033)        0.767 (0.025)
K̂       9                    2
Table 9. The confusion matrix for the chronic kidney disease data under the model with common covariance matrix.

                          Clinical Classification
                          CKD    Non-CKD
Estimated   Cluster #1    9      150
            Cluster #2    138    0
            Cluster #3    42     0
            Cluster #4    28     0
            Cluster #5    15     0
            Cluster #6    13     0
            Cluster #7    2      0
            Cluster #8    2      0
            Cluster #9    1      0
Table 10. The confusion matrix for the chronic kidney disease data under the model with cluster-specific covariance matrices.

                          Clinical Classification
                          CKD    Non-CKD
Estimated   Cluster #1    250    23
            Cluster #2    0      127
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
