Article

Predicting Heart Disease Using Collaborative Clustering and Ensemble Learning Techniques

1 Department of Computer Science and Artificial Intelligence, College of Computer Science and Engineering, University of Jeddah, Jeddah 23442, Saudi Arabia
2 Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 23442, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13278; https://doi.org/10.3390/app132413278
Submission received: 20 October 2023 / Revised: 6 December 2023 / Accepted: 11 December 2023 / Published: 15 December 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Different data types are frequently included in clinical data. Applying machine learning algorithms to mixed data can be difficult and can affect output accuracy and quality. This paper proposes a hybrid model of unsupervised and supervised learning techniques, which can be used to model and process mixed data, with an application in heart disease diagnosis. The model consists of two main components: collaborative clustering and combining decisions (the ensemble approach). The mixed data clustering problem is considered as a multi-view clustering problem; each view is processed using specialised clustering algorithms. Since each algorithm operates on a different space of the data set’s features, a novel collaborative framework is proposed that promotes the clustering process through information exchange between the different clustering algorithms, thereby producing expert models that model different spaces of the data set’s features. The expectation maximisation algorithm forms the foundation of this optimisation process and improves the entropy-based collaborative term; good convergence characteristics are therefore ensured. An ensemble approach similar to the stacking approach was used: a logistic regression model was utilised as a meta-classifier, trained on the expert models’ prediction results, and subsequently used to predict the final output. The results prove the efficacy of this collaborative approach in optimising the outcomes of the different clustering algorithms and the meta-classifier.

1. Introduction

The heart is the second most critical human organ after the brain, as it pumps oxygen and nutrients to the body’s tissues and viscera via the circulatory system [1]. If cardiac function were to fail, the brain and other organs would stop operating, causing the individual to die within minutes. Heart-related disorders, or cardiovascular diseases (CVDs), have been a primary cause of global mortality for several decades, placing them amongst the most lethal diseases [2]. The prevalence of CVDs is currently rising worldwide, primarily due to lifestyle changes, work stress and poor eating habits. CVDs are thought to cause 17.9 million deaths annually, or 32% of all fatalities worldwide. Myocardial infarctions and cerebrovascular events comprise approximately 85% of deaths related to CVDs, and more than 75% of CVD deaths occur in low- and middle-income countries [3]. These statistics make the need to understand more about CVDs, their prevalence and their diagnosis all the more pressing.
Data clustering is an essential part of extracting information from databases. The aim is to identify patterns inherent in a set of items by grouping them based on common features. However, the number of clusters to be found is usually unknown, which makes this process more difficult than supervised classification; in particular, it becomes more challenging to assess the quality of the clustering solution [4]. Over the last two decades, the increasing availability of more complex data sets, including multi-view, distributed and multi-scale data, has made the clustering process even more difficult. This problem can be addressed efficiently by combining several different clustering algorithms, an approach referred to as collaborative clustering [4]. This is an unsupervised machine learning approach in which several different clustering algorithms work concordantly in order to identify structures within data sets. The process is characterised by frequent exchanges of information between collaborating members, direct action in relation to each member of the group and shared responsibility involving all group members, which affects the individual tasks of each of the collaborating members [4]. In comparison to a lone algorithm working autonomously on a data set, collaborative clustering can enhance the results and reliability of the clustering algorithms applied to the same data set. The process can be used in various applications, including distributed data clustering, multi-expert clustering, multi-scale clustering analysis and multi-view clustering [4].
Collaborative clustering comprises two basic steps: a local step, during which each member performs its task individually and produces a clustering solution, followed by a collaborative step, for which no fixed techniques exist. In the latter step, the collaborating members exchange their results and try to improve their models in order to achieve better clustering [5]. Each local computation, which may be performed on different data sets, gains from the efforts of the other contributors [4]. There are several ways to split the data, the most important of which are horizontal and vertical collaborative clustering [6].
Multi-expert analysis [4] is one such collaborative application: a type of collaborative clustering in which all algorithms work on the same objects and features of a difficult data set. Multiple algorithms are applied to the data set and share their predictions. This enables the merging of information from clusters identified by only some of the algorithms and refines results for clusters that have not yet been well classified. Multi-view clustering is a type of unsupervised learning in which multiple algorithms are used to analyse a set of objects. Each algorithm processes different attributes of the same objects, such as their geometry, text, colour and numerical data, in order to find clusters in the data set. The goal is to improve the clustering accuracy by combining predictions.
Multi-scale clustering is a further collaborative application: various algorithms examine identical objects and properties whilst seeking a different number of clusters. Such an approach is advantageous for data sets with an inherently multi-scale structure, such as satellite image data. However, collaborative clustering faces many challenges, including which algorithms to choose for the collaboration process, what information is exchanged, on what basis the collaborators make their decisions and when to stop the collaboration, since extending or curtailing the collaborative process inappropriately leads to negative collaboration. Thus, in this work, ensemble learning is utilised to create several clusterings and to combine them in order to obtain a final consensus clustering. This technique promotes classification accuracy and prediction performance. A group of individual models is referred to as an ensemble model [7]. Typically, alternative methods, variable algorithm parameters or random data sampling create the initial clusterings. Ensemble methods are simple supervised learning methods based on consensus techniques, as defining a set of predictive functions in order to produce an aggregated prediction function is straightforward; for example, a linear combination is used in boosting. It is also easy to gauge the effectiveness of the individual predictive functions, as well as the diversity of the group of candidate functions, for inclusion in the final combined global decision function.
The main issue in the current research is that all the work in the literature has used supervised learning algorithms for the prediction of heart disease; no work has utilised unsupervised learning algorithms or collaborative clustering for this application. In this paper, a hybrid model combining unsupervised and supervised learning techniques is proposed. A summary of this contribution is detailed below:
  • The presented model comprises two main components: collaborative clustering and combining decisions (ensemble approach).
  • The problem of mixed data clustering, i.e., of quantitative and qualitative data, is treated here as a multi-view clustering problem, whereby each view is processed using specialised clustering algorithms. Since each algorithm operates on a different area of the data set’s features, the proposed collaborative clustering framework is horizontally collaborative.
  • A clustering strategy similar to the stacking approach was employed, together with a logistic regression model as a meta-classifier. The latter was trained on the results of expert model predictions and then employed in order to predict the final decision.
  • The efficacy of the collaborative technique in optimising the different clustering algorithms and the meta-classifier results was proven. The strength of this proposed collaborative method is that it does not require the prototypes or models used by the different algorithms to be shared during the collaborative step; only the solution vectors produced by all the algorithms require sharing.
  • Controlling collaborative step continuity also avoids negative collaboration.
The remainder of this paper is divided into several sections. Section 2 reviews relevant work on predicting heart disease using supervised and unsupervised learning, together with some general studies related to collaborative clustering. In Section 3, the proposed collaborative clustering and ensemble approach is presented, and each stage of the methodology is explained in detail. Section 4 contains the results and discussion relating to a number of experiments for both the parallel and reinforcement scenarios. Section 5 concludes the article and offers thoughts on future work.

2. Literature Review

Several previous studies have been conducted in the field of CVD diagnosis. For instance, a new model was proposed for heart disease prediction and referred to as the Heart Disease Prediction Model Using a Hybrid Random Forest (RF) and Linear Model. This exploited several machine learning techniques, including a variety of feature combinations and numerous categorisation algorithms, and achieved an 88.7% accuracy [8].
Generally, research has focused on diagnosing heart disease based on historical data and information. The Smart Heart Disease Prediction model utilised naive Bayesian (NB) techniques to forecast risk factors for heart disease [9]. The results revealed that this diagnostic approach was highly successful at predicting risk factors for CVDs with an accuracy of 89.77% [9]. Fitriyani et al. [10] published an efficient heart disease prediction model for a clinical decision support system that utilised density-based spatial clustering of applications with noise (DBSCAN) in order to identify and eliminate outliers, in conjunction with a hybrid synthetic minority over-sampling technique-edited nearest neighbour technique. Two publicly available data sets, Statlog and Cleveland, were used to evaluate the model and to compare its performance against other widely used classifier models. The proposed model outperformed previous models, achieving an accuracy of up to 96%.
Various artificial intelligence strategies for coronary artery heart disease prediction were compared using seven computational intelligence approaches, namely logistic regression (LR), support vector machine (SVM), a deep neural network (DNN), decision tree (DT), NB techniques, RF, and k-nearest neighbour (K-NN) [11]. The Statlog and Cleveland heart disease data sets were used to assess each technique’s performance. DNNs were found to achieve the highest accuracy, i.e., 98.15%. Compared to the state-of-the-art methods, and focusing on heart disease prediction, their approach outperformed prior studies [11].
A novel method for predicting heart disease was also presented that included all techniques in a single algorithm, i.e., the hybridisation technique [12]. Several methods were listed and evaluated in order to determine their accuracy level, and the results showed that a correct diagnosis was possible when using a composite model comprising all approaches. An accuracy of 89.2% was attained.
A further paper utilised an optimum feature extraction method to generate a de novo clustering model for the prediction of CVD using numerical data and electrocardiographic (ECG) data [13]. Rather than directly grouping the numerical data, a principal component analysis was used for dimensionality reduction. The hybrid clustering technique consisted of improved DBSCAN combined with optimised k-means clustering (KMC). Four data sets were utilised simultaneously, comprising numerical data sets and an ECG data set; an ECG is a test used to detect and record the timing and intensity of human cardiac electrical activity. The findings indicated that the model effectively resolved the issues associated with using dual data for heart disease prediction.
A heart disease prediction system was also developed using the data mining approach by combining NB techniques and KMC [14]. This approach facilitated the prediction of CVD by incorporating numerous variables and providing output data in the prediction form, using the k-means and NB algorithms to group a range of factors and to perform prediction, respectively.
Unsupervised KMC has also been utilised in the clinical domain to detect anomalies as a method of CVD prediction [15]. The suggested model produced an ideal value for k by forming clusters to recognise anomalies using the silhouette approach. Identified anomalies were excluded from the data, and the resulting prediction model was created using the five most prominent machine learning classification approaches, namely KNN, RF, SVM, the NB technique and LR. This approach was justified by the efficacy of the method and substantiated using a common cardiovascular data set. Data charting was also used in order to determine the precision with which they found abnormalities in their experimental research.
The majority of the aforementioned studies, which produced good results in heart disease prediction, employed supervised and deep learning algorithms, both of which need extensive labelled data [16,17,18,19,20,21,22]. Several researchers have used collaborative clustering, which significantly improves the clustering outcomes [23,24,25,26,27].
All the above algorithms have the same limitations: firstly, they rely on prototypes based on fixed parameters and algorithms from the same family, and secondly, they find identical clusters. In one study [28], a horizontal collaborative clustering method referred to as SAMARA is used. This method uses only hard clustering; it does not deal with prototypes and is not limited to a specific number of clusters. The goal is to modify the clustering results for the same data iteratively and collaboratively, thereby lowering their diversity and facilitating the discovery of a consensus solution. After completing the local step, the clusters are mapped between the different clustering algorithms using a probabilistic confusion matrix.
A new method [5] was proposed that enhances the clustering process by exchanging information between several local results. This addresses the limitations of previous techniques, which required the algorithms to be homogeneous and from the same family. The data are split using a horizontal collaborative [4,5] method, either into subsets that represent the same data on different features, or into the same data with a search for a different number of clusters, or a mixture of the two. Several heterogeneous algorithms are used, including Self-Organizing Mapping (SOM) [23,25], the Generative Topographic Mapping algorithm (GTM) [26,27] and Expectation Maximisation (EM) [4], which can be applied to different probability distributions. These probability distributions include the Gaussian mixture distribution, the Dirichlet distribution [29] and the Bernoulli mixture model (BMM) [30]. The approach based on the SAMARA modification [4] avoids the limitations found in previous studies by removing the requirement for identical prototypes amongst the various collaborators and by allowing clustering algorithms from other families. It is based on an estimation process that operates on posterior distributions known as composite functions. These techniques handle data with a single feature data type, e.g., numeric or imaging data. One of the challenges in the currently proposed approach is handling mixed data.
Most clustering algorithms do not have the ability to deal with mixed data. This is because each clustering algorithm has a specific distance measure. It may be specialised in dealing with quantitative data or dealing with qualitative data. For example, k-means depends on the Euclidean distance measure for quantitative data, and k-modes uses a categorical measure to calculate distance. Therefore, there is no single algorithm that works with mixed data.
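As a brief illustration of this point, the sketch below (with hypothetical values, not taken from the data set) contrasts a Euclidean distance on a quantitative view with a simple matching dissimilarity on a qualitative view; neither measure is meaningful on the other view’s features.

```python
# Illustrative sketch (hypothetical values): why a single distance measure
# struggles with mixed data.
import numpy as np

# Two patient records: [age, cholesterol] (quantitative) and
# [chest-pain type, sex] (qualitative).
quant_a, quant_b = np.array([54.0, 239.0]), np.array([61.0, 211.0])
qual_a, qual_b = np.array(["ATA", "M"]), np.array(["ASY", "M"])

# A k-means-style Euclidean distance is meaningful for the quantitative view...
euclidean = np.linalg.norm(quant_a - quant_b)

# ...while a k-modes-style simple matching dissimilarity suits the qualitative
# view: count the attributes on which the two records disagree.
matching = np.sum(qual_a != qual_b)

print(f"Euclidean (quantitative view): {euclidean:.2f}")
print(f"Matching dissimilarity (qualitative view): {matching}")
```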
The current study investigates a different approach to optimising the clustering algorithm results. This strategy has broad advantages, including the ability to apply collaborative clustering techniques to medical applications or to alternative domains. Additionally, there is the option to predetermine the cluster number [31]. In this study, two clusters are predetermined in order to predict whether or not an individual has a CVD.

3. Methodology

In this paper, a hybrid model comprising unsupervised and supervised learning techniques is proposed for the modelling of mixed data and specifically, of cardiology data sets. The general framework of the proposed model is illustrated in Figure 1.
The model consists of two main components: (i) collaborative clustering and (ii) combining decisions (the ensemble approach). In collaborative clustering, the mixed data clustering problem is treated as multi-view clustering, where the data set is divided into two views representing the quantitative and qualitative features, respectively. Each view is processed using specialised clustering algorithms. Since each algorithm operates on a different area of the data set’s features, a new collaborative method was designed that enhances the clustering process by sharing information between the results obtained from the different clustering algorithms in a horizontal collaborative setting. The goal of the collaboration is to process each type of data set feature with specialised clustering algorithms whilst giving these algorithms a more comprehensive picture of the data set by allowing some information to be exchanged between them, thereby improving their results. This process leads to the generation of expert models that model different areas of the data set’s features. The models’ clusters are then mapped to the appropriate data set classes, resulting in a shift from the unsupervised to the supervised learning approach and enabling meaningful predictions from the clustering models. Finally, to produce a single decision from these expert models, an ensemble approach similar to the stacking method is applied. A logistic regression model is utilised as a meta-classifier, trained on the expert models’ prediction results, and then used to predict the final output. In the remainder of this section, each phase of the proposed framework is described in detail.

3.1. Data Pre-Processing

The cardiac data set available on Kaggle was used for this study [32]. This resulted from merging five cardiac data sets, i.e., Cleveland, Hungarian, Swiss, Long Beach VA and Statlog (heart), containing 11 common clinical features. Each data set is available within the cardiology data sets index of the UCI Machine Learning Repository [33]. The merged data set contains 918 observations, with the target variable representing whether or not a person has heart disease. The identification of heart disease is based on 11 clinical features of different types, 5 of which are quantitative and 6 of which are qualitative (Table 1).
Most machine learning models only work with numeric values, so qualitative values have to be converted into numeric values in order for the machine to learn from the data correctly. The OneHotEncoder technique was used to encode the qualitative features into one-hot binary features, i.e., the column representing the observed category receives a 1, and the other columns are assigned a 0. This helps avoid the ordering problem, which can occur when integer codes impose an artificial ordering on a qualitative variable. The quantitative features were then normalised in order to enhance the machine learning model input quality by rescaling the data to a standardised range. Although normalisation can be achieved using different methods, such as the Robust Scaler, MinMax Scaler and Standard Scaler [34], the latter proved to be the most effective.
The Robust Scaler scales features robustly with respect to outliers [35]. The approach is similar to the MinMax Scaler but uses the range between quartiles instead of the min–max range. It shifts and scales the data according to the quantile range, following Equation (1):
$$z_i = \frac{x_i - Q_1(x)}{Q_3(x) - Q_1(x)} \quad (1)$$
where $Q_1$ is the first quartile and $Q_3$ is the third quartile.
Finally, the data set was divided into two views, the first representing the quantitative features (n = 5) and the second the qualitative features, which after encoding become 10 binary features.
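A minimal pre-processing sketch is given below, assuming the scikit-learn library and the feature names of the Kaggle heart data set [32]; the column lists and file path are assumptions of this illustration rather than details taken from the paper.

```python
# Sketch: one-hot encoding of qualitative features, standard scaling of
# quantitative features, and splitting the data set into two views.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("heart.csv")  # hypothetical local copy of the data set [32]

quant_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
qual_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina",
             "ST_Slope", "FastingBS"]

# View 1: standardised quantitative features (n = 5).
X_quant = StandardScaler().fit_transform(df[quant_cols])

# View 2: one-hot-encoded qualitative features (binary columns);
# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False).
X_qual = OneHotEncoder(sparse_output=False).fit_transform(df[qual_cols])

y = df["HeartDisease"].to_numpy()  # target: 1 = heart disease, 0 = healthy
```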

3.2. Collaborative Clustering

In this subsection, the proposed collaborative method is presented. It can be applied to different types of general horizontal collaborative learning tasks and, specifically, to multi-view clustering tasks. The proposed method removes several limitations of previous collaborative frameworks: data need not be shared between the different algorithms, the number of clusters can vary and, notably, different algorithms can collaborate. The proposed work is similar to that published in [5], i.e., collaborative clustering with heterogeneous algorithms, where information is exchanged between the results of many different clustering algorithms. The currently proposed method differs in that it processes different types of data using specialised clustering algorithms and offers a more general approach by controlling the continuity of the collaborative step using a stopping criterion (entropy), thereby circumventing the problem of negative collaboration between the collaborating clustering algorithms.
Firstly, the principle of the proposed method and its theoretical basis are explained. The stopping criterion is examined, and the collaborative scenarios in which the collaborative process can be applied are presented. How these collaborative scenarios can enhance the heart disease data set compilation is then discussed. Finally, the clustering algorithms employed are introduced, and the way in which they are adapted to the collaborative method is demonstrated.

3.2.1. Formalism

In horizontal collaborative clustering, a limited set of algorithms, $A = \{A_1, A_2, \ldots, A_J\}$, is considered, which operate on the same data items, albeit with access to different features, and possibly looking for a different number of clusters. Let $X = \{x_1, x_2, \ldots, x_N\}$, with $x_n \in \mathbb{R}^d$, be a data set containing $N$ elements, each with real numeric properties.
Each clustering algorithm $A_i$ has parameters, $\theta_i$, describing its clusters or model, and produces a clustering solution, $S_i$, made of $K_i$ clusters based on the features of the subset $X_i \subseteq X$ to which it has access. In the case of hard clustering, $S_i$ can be translated into a solution vector of size $N$; in soft clustering, into a matrix of size $N \times K_i$. This matrix is denoted $S_i = [s_i(n, c)]$, where $1 \le n \le N$ and $1 \le c \le K_i$. Thus, the solutions $S_i$ generated by the algorithms are 2D matrices of size $N \times K_i$, where each element $s_i(n, c)$ expresses the responsibility (probability) assigned by algorithm $A_i$ to cluster $c$ for the data element $x_n$. This matrix is transformed into a solution vector of size $N$ by assigning each data set record to the cluster with the highest probability: $S_i(n) = \arg\max_c s_i(n, c)$.
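The sketch below illustrates, with randomly generated toy responsibilities, this conversion of a soft solution matrix into a hard solution vector.

```python
# Illustrative sketch: converting a soft N x K_i responsibility matrix S_i
# into a hard solution vector by assigning each record to the cluster with
# the highest probability. The toy responsibilities are randomly generated.
import numpy as np

rng = np.random.default_rng(0)
responsibilities = rng.dirichlet(alpha=[1.0, 1.0], size=8)  # toy N = 8, K_i = 2

solution_vector = responsibilities.argmax(axis=1)  # S_i(n) = argmax_c s_i(n, c)
print(solution_vector)  # a length-8 vector of cluster indices, e.g., [0 1 1 ...]
```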
The method assumes that any clustering algorithm whose model is collaboratively optimised attempts to optimise an objective function similar to Equation (2):
$$S = \arg\max_S P(S \mid X, \theta) = \arg\max_S P(X \mid S, \theta) \times P(S) \quad (2)$$
Equation (2) can be solved using a local optimisation process, as expressed in Equation (3):
$$P(X \mid S, \theta)\,P(S) = \prod_{t=1}^{N} P(x_t \mid s_t, \theta_{s_t}) \times P(s_t) \quad (3)$$
Each algorithm may be based on different statistical models, e.g., Gaussian or polynomial, amongst others. It is hypothesised that this collaborative method can only be applied to algorithms attempting to optimise an equation similar to Equation (2). Most of the symbols used in this section are summarised in Table 2.

3.2.2. Problem Formulation

The proposed method enhances the clustering process by sharing information between several results obtained from different clustering algorithms. The originality of the proposed approach is that the collaboration step can use the clustering results obtained from any algorithm during the local step. The main issue arises in Equation (3) when calculating $P(s)$, the probability of each cluster. This is not known in advance, and most clustering algorithms adopt an assumption in order to calculate $P(s)$. For example, the k-means algorithm and some versions of the EM algorithm assume that all clusters have the same occurrence probability; most unsupervised probabilistic classifiers follow this approach. The principal concept here is to measure the occurrence of all clusters after each iteration. The local probability hypothesis, primarily used in computer vision, determines $P(s)$ from the neighbourhood composition of the observed data, rather than measuring a global $P(s)$ over the entire data set; its value is thus determined from the clusters assigned to neighbouring data.
In the proposed collaborative method, a hypothesis similar to the local hypothesis is followed. During the collaborative step, the probability of occurrence of each cluster, $P(s)$, is not considered a global probability. Instead, $P(s)$ is associated with the clustering choices made by the other algorithms for the same data point. The objective is to modify Equation (3) such that the solutions of all the clustering algorithms are used to compute $P(s)$, as indicated by Equation (4).
$$P(X \mid S, \theta)\,P(S) = \prod_{t=1}^{N} P(x_t \mid s_t, \theta_{s_t}) \times C(x_t, s_t) \quad (4)$$
$C(x_t, s_t)$ is a function that determines the value of $P(s)$ from the consensus of all the other clustering algorithms for the same data point $x_t$.

3.2.3. Local Step

In the local step, each clustering algorithm will process the data it can access, as shown in Table 3, and produce a clustering result as a solution vector.
The solution vectors of all the clustering algorithms are collected into a two-dimensional matrix similar to that illustrated in Table 4. Each column represents a solution proposed by a particular clustering algorithm. This matrix is used to form the probability confusion matrices, which will be used in the collaborative step to compute the $C(x, s)$ function by consensus between the different clustering algorithms. This process is discussed in detail in the collaborative step.
In summary, the local step involves the processing of the individual data set views by the appropriate clustering algorithms in order to generate the solutions matrix.

3.2.4. Collaborative Step

The solutions matrix produced by the clustering algorithms in the local step is then used in the initial stage of the collaborative step, during which the probability confusion matrices are calculated in order to generate the $C(x, s)$ function. This idea was inspired by another work [28], in which the same confusion matrices are computed for a consensus-based method. The probability confusion matrix between two clustering algorithms is represented as shown in Equation (5).
$$\Psi^{i \to j} = \begin{pmatrix} \alpha_{1,1}^{i,j} & \cdots & \alpha_{1,K_j}^{i,j} \\ \vdots & \ddots & \vdots \\ \alpha_{K_i,1}^{i,j} & \cdots & \alpha_{K_i,K_j}^{i,j} \end{pmatrix}, \quad \text{where } \alpha_{k,l}^{i,j} = \frac{|S_k^i \cap S_l^j|}{|S_k^i|} \quad (5)$$
$\Psi^{i \to j}$ is a matrix of size $K_i \times K_j$ which maps the clusters of algorithm $A_i$ to the clusters of algorithm $A_j$. Each element of the matrix, $\alpha_{k,l}^{i,j}$, represents the probability of a data point being placed in cluster $l$ of algorithm $A_j$ given that it is already in cluster $k$ of algorithm $A_i$. $|S_k^i \cap S_l^j|$ is the number of data points in cluster $k$ of $A_i$ that are simultaneously in cluster $l$ of $A_j$, and $|S_k^i|$ is the number of data points belonging to cluster $k$ of $A_i$.
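A possible NumPy implementation of Equation (5) is sketched below; the helper name and the toy solution vectors are illustrative assumptions.

```python
# Sketch of Equation (5): the probability confusion matrix mapping the
# clusters of algorithm A_i to those of A_j, computed from two hard
# solution vectors. A small epsilon guards against empty clusters.
import numpy as np

def confusion_psi(s_i, s_j, k_i, k_j, eps=1e-12):
    psi = np.zeros((k_i, k_j))
    for k in range(k_i):
        in_k = (s_i == k)                        # members of cluster k of A_i
        for l in range(k_j):
            overlap = np.sum(in_k & (s_j == l))  # |S_k^i ∩ S_l^j|
            psi[k, l] = overlap / (np.sum(in_k) + eps)
    return psi  # rows sum to ~1: P(cluster l of A_j | cluster k of A_i)

s_i = np.array([0, 0, 1, 1, 1, 0])
s_j = np.array([1, 0, 1, 1, 1, 0])
print(confusion_psi(s_i, s_j, 2, 2))
```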
Once all the probability confusion matrices, $\Psi$, have been computed, the second stage of the collaboration step commences. Given the $\Psi$ matrices and the results of the other algorithms from the local step, for a given clustering algorithm $A_i$ the consensus function $C(x, s)$ to be estimated for cluster $s$ becomes as shown in Equations (6) and (7).
$$C(x, s) = \prod_{a_j \in A,\, j \neq i} P\left(s \mid s(x, a_j)\right) \quad (6)$$
$s(x, a_j)$ is the cluster assigned by algorithm $a_j$ to object $x$. At this juncture, the other algorithms’ solutions are incorporated to estimate the probability of cluster $s$. The terms $P(s \mid s(x, a_j))$ are assumed to be independent. This assumption enables the use of the $\Psi$ probability confusion matrices computed in the previous step, which yields Equation (7):
$$C(x, s) = \prod_{a_j \in A,\, j \neq i} \Psi^{j \to i}_{s(x, a_j),\, s} \quad (7)$$
Finally, Equation (3), which is maximised during the collaborative step, becomes Equation (8):
$$P(x \mid s, \theta_s)\,P(s) = \frac{1}{Z} P(x \mid s, \theta_s) \times C(x, s), \quad \text{where } Z = J - 1 \quad (8)$$
where $Z$ is a normalisation constant independent of $s$; $P(x \mid s, \theta_s)$ is the local term, a probability function that depends on the type of probability distribution used by the clustering algorithm; and $C(x, s)$ is the collaborative term, which takes the form of a global consensus function between the solutions of all the algorithms.
The concept of enhancing the objective function of the clustering algorithm is evident from Equation (8), which takes into account both the local solution and the solutions generated by the other algorithms. Only the solution vectors, $S$, are taken into consideration; the parameters, $\theta$, are excluded.
This change from $\theta$ to $S$ is made possible by the use of an alternating maximisation procedure, in which the partitions $S$ (solution vectors) are computed from the prototypes, which are then updated based on the partitions and the data. The partitions can therefore be considered to reflect an estimate of the distributions described by the prototypes. The EM strategy is then used to optimise Equation (8). The workflow in Algorithm 1 shows how EM can be implemented for a given clustering algorithm. During the E-step, the algorithm’s solutions, $S$, are updated using the algorithm’s fixed $\theta$ parameter values, together with information coming from the solutions of the other algorithms, as expressed by Equation (8). During the M-step, these parameters are updated based on the new solutions $S$. The solutions of the algorithm and the probability confusion matrices, $\Psi^{i \to j}$, are then revised based on the updated parameters.
Algorithm 1 Collaborative “EM”
1: Initialise θ with the local step
2: Retrieve the initial Ψ^{i→j} matrices
3: while the global entropy H decreases do
4:   E-Step: S = argmax_S P(X | S, θ) P(S):
5:   for each x ∈ X_i do
6:     s = argmax_s P(x | s, θ_s) P(s) using Equation (8)
7:   end for
8:   M-Step: θ = argmax_θ P(X | S, θ)
9:   Update all Ψ^{i→j} matrices
10: end while
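The sketch below gives one possible structure for Algorithm 1 in Python. The `model` object and its `predict`, `log_density` and `update_parameters` methods are hypothetical placeholders for whichever probabilistic clusterer (e.g., a GMM or BMM) is being optimised; the consensus term follows Equations (6)–(8), with logarithms used so that the product over collaborators becomes a sum.

```python
# Structural sketch of Algorithm 1 under the notation above (an assumption-
# laden illustration, not the authors' implementation).
import numpy as np

def collaborative_em(model, X, other_solutions, compute_psi, global_entropy):
    S = model.predict(X)                      # local-step solution vector
    psis = compute_psi(S, other_solutions)    # initial Psi matrices
    h_prev = np.inf
    while True:
        # E-step: re-assign each point using local term x collaborative term.
        # Working in log space, the product of Eq. (8) becomes a sum.
        log_local = model.log_density(X)              # shape (N, K)
        log_consensus = np.zeros_like(log_local)
        for j, S_j in enumerate(other_solutions):
            # psis[j][l, s]: P(cluster s | collaborator j chose cluster l)
            log_consensus += np.log(psis[j][S_j] + 1e-12)
        S = np.argmax(log_local + log_consensus, axis=1)

        # M-step: refit the parameters on the new hard assignment.
        model.update_parameters(X, S)
        psis = compute_psi(S, other_solutions)

        h = global_entropy(psis)              # Equation (9), stopping criterion
        if h >= h_prev:                       # stop once entropy stops decreasing
            break
        h_prev = h
    return S, model
```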

3.2.5. Stopping Criterion

One of the most challenging collaboration problems is knowing when to stop the collaboration. Prolonging the collaborative step often leads to incorrect results; in this case, the collaboration becomes negative, so defining a stopping criterion that controls the continuity of the collaborative step was necessary. It was defined as the entropy of the probabilistic confusion matrices [4] (Equation (9)).
$$H = -\sum_{i=1}^{J} \sum_{j \neq i}^{J} \frac{1}{K_i \log K_j} \sum_{m=1}^{K_i} \sum_{l=1}^{K_j} \alpha_{l,m}^{j,i} \log \alpha_{l,m}^{j,i} \quad (9)$$
This entropy evaluates the pairwise differences between the algorithms. In short, $H$ is the global entropy of the collaboration, given that all algorithms are independent. The advantage of this entropy is that it reuses the $\alpha_{k,l}^{i,j}$ values from the probability confusion matrices in Equation (5), which are already calculated during the collaborative step. On this basis, the global entropy, $H$, is much less expensive to calculate than any other measure of divergence or consensus. In addition, this type of entropy criterion is consistent with studies that have shown the importance of diversity and entropy in collaborative clustering [36,37,38].
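A direct transcription of Equation (9) is sketched below, assuming the pairwise $\Psi$ matrices are stored in a nested dictionary `psi[j][i]` holding the $K_j \times K_i$ matrix of $\alpha^{j,i}$ values; terms with zero probability are skipped, following the usual $0 \log 0 = 0$ convention.

```python
# Sketch of Equation (9): global collaboration entropy from the pairwise
# probability confusion matrices already available during the collaborative
# step. n_clusters[i] is K_i for each of the J algorithms.
import numpy as np

def global_entropy(psi, n_clusters):
    H, J = 0.0, len(n_clusters)
    for i in range(J):
        for j in range(J):
            if j == i:
                continue
            a = psi[j][i]          # alpha^{j,i} entries, shape (K_j, K_i)
            a = a[a > 0]           # 0 log 0 is taken as 0
            H += -np.sum(a * np.log(a)) / (n_clusters[i] * np.log(n_clusters[j]))
    return H
```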

3.2.6. Collaborative Scenarios

In this section, some collaborative scenarios are presented in which the collaborative method could be used, together with the most critical collaborative applications of these scenarios and a description of the collaborative workflow process in each case.
  • Parallel Scenario
Figure 2 illustrates the parallelism scenario in collaborative clustering; several clustering algorithms run in parallel and improve each other’s results. This process underlies the previously discussed horizontal collaborative method.
The most important applications of this type of collaboration are several algorithms that operate on (i) different feature spaces searching for a different number of clusters, i.e., multi-view clustering; (ii) distributed data sets; and (iii) a problematic data set, whether distributed or undistributed. In all three applications, mutual improvement in the results would be beneficial, i.e., the outcome of collaborative clustering.
This collaborative method can work with all previous applications, but in this research, the first application is of interest. Qualitative and quantitative features of the heart disease data are processed with suitable clustering algorithms in an attempt to enhance their results by sending and receiving information determined by the collaborative method. In Algorithm 2, the workflow demonstrates how the collaborative process can be implemented with clustering algorithms running in parallel.
Algorithm 2 Collaborative Clustering for Parallel Scenario
1: Local step:
2: for each A_j ∈ A do
3:   Apply the clustering algorithm A_j on the data X_j
4:   Initialise the local parameters θ_j
5: end for
6: Collaboration step:
7: Compute all Ψ matrices
8: while the global entropy H decreases do
9:   for each A_j ∈ A do
10:    Run the algorithm A_j, optimising it based on Equation (8)
11:  end for
12:  Update all solution vectors S
13:  Update all local parameters θ
14:  Update all Ψ matrices
15: end while
However, one should note that in this case, all collaborating algorithms must satisfy the collaborative method hypothesis, i.e., the optimisation of an objective function similar to Equation (2).
  • Reinforcement Scenario
Figure 3 shows the reinforcement scenario, which is another possible scenario that could be handled by the collaborative method. In this case, the information is transmitted from one side only.
In Algorithm 3, the performance of an EM clustering algorithm is enhanced by the use of information from other algorithms. One-sided information transfer would be beneficial if the latter were specialised algorithms capable of detecting particular elements of the observed data. In this case, the process would be analogous to the reinforcing rather than the collaborative learning process.
In the current context, since the observed heart disease data were processed using clustering algorithms which specialised in qualitative and quantitative features, the collaborative method could be used to reinforce a qualitative algorithm with information from other quantitative algorithms and vice versa (Figure 3).
However, in this case, the EM algorithm must satisfy the premise of the collaborative method, i.e., the optimisation of an objective function similar to Equation (2). The remaining clustering algorithms can optimise different objective functions; only their solution vectors are of interest. Equation (9) is computed for the algorithm being reinforced only, as shown in Equation (10):
$$H = -\sum_{j \neq i}^{J} \frac{1}{K_i \log K_j} \sum_{m=1}^{K_i} \sum_{l=1}^{K_j} \alpha_{l,m}^{j,i} \log \alpha_{l,m}^{j,i} \quad (10)$$
Algorithm 3 Collaborative Clustering for Reinforcement Scenario
1: Local step:
2: for each A_j ∈ A do
3:   Apply the clustering algorithm A_j on the data X_j
4:   Initialise the solution vectors S
5: end for
6: Determine the clustering algorithm A_i ∈ A to be optimised
7: Initialise the local parameter θ_i
8: Collaboration step:
9: Compute all Ψ^{i→j} matrices
10: while the global entropy H decreases do
11:  Run the algorithm A_i to be optimised, based on Equation (8)
12:  Update the solution vector S_i
13:  Update the local parameters θ_i
14:  Update all Ψ^{i→j} matrices
15: end while

3.2.7. Generate and Adapt Collaborative Members

As explained earlier, the majority of clustering algorithms are unable to process mixed data, so specialised clustering algorithms for each data type, selected according to the limitations of the method described previously, were used.
The clustering algorithms used in the proposed model, and the way in which they are adapted to the collaborative approach, are presented below.
  • Gaussian and Bernoulli Mixture Models
The EM algorithm for estimating the parameters of a mixture model from data makes essential use of these posterior probabilities [39]. In this paper, both Gaussian Mixture Models (GMMs) and BMMs were employed in order to model the observed data set. As the qualitative data are binary, a superior fit is obtained from the Bernoulli distribution than from a Gaussian distribution operating on binary data, i.e., 1 and 0; the BMM was therefore utilised to model the qualitative data after their conversion into binary form in the pre-processing phase.
  • K-Means Clustering
The k-means algorithm can itself be viewed as an EM-style procedure: the E-step maps the data points to the nearest cluster, and the M-step calculates the centroid of each cluster. The standard k-means algorithm is based on Euclidean similarity; a variant was also employed in which the similarity calculation was changed to the cosine similarity measure. The k-means algorithm is not probabilistic, but it can be considered a degenerate case of the GMM algorithm.
  • K-modes clustering
The k-modes clustering algorithm was proposed in [40]. This is an extension of the k-means algorithm that can handle categorical features. It inherits the characteristics of the KMC algorithm, is efficient and easy to implement, and is therefore widely used in various fields. Since the k-modes algorithm is suitable for clustering categorical data, it was utilised to process the qualitative heart disease data features. However, it can only be used in the reinforcement scenario, as it was not adapted for use with the collaborative method. In this paper, the collaborative method presented can only be applied to GMMs and BMMs; however, the technique can be adapted to algorithms based on alternative probabilistic models.
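A sketch of how the local step members could be instantiated is shown below. The scikit-learn estimators cover k-means and the GMM, and the third-party `kmodes` package is assumed for k-modes; the cosine-similarity k-means and the BMM used in the paper have no off-the-shelf scikit-learn implementation and would require custom code.

```python
# Sketch (under stated assumptions) of the local-step collaborative members.
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from kmodes.kmodes import KModes  # third-party package: pip install kmodes

def local_step(X_quant, X_qual_raw, n_clusters=2, seed=42):
    """Run each specialised algorithm on the view it can access."""
    solutions = {}
    solutions["kmeans"] = KMeans(n_clusters=n_clusters, n_init=10,
                                 random_state=seed).fit_predict(X_quant)
    solutions["gmm"] = GaussianMixture(n_components=n_clusters,
                                       random_state=seed).fit_predict(X_quant)
    solutions["kmodes"] = KModes(n_clusters=n_clusters, init="Huang",
                                 random_state=seed).fit_predict(X_qual_raw)
    return solutions  # one solution vector per collaborating member
```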

3.3. Mapping Clusters to Class Labels

After applying the clustering techniques to the data set, clusters 0 and 1 of each clustering model are mapped to the true prediction classes from the data set, i.e., cluster 0 is set to class 0 and cluster 1 to class 1.
The purpose of the mapping process is to shift from clustering to a supervised learning or classification approach. The prediction process for the clustering algorithms then becomes clear: when predicting a new case, if it belongs to cluster 0, it is classified as disease-free, and if it belongs to cluster 1, it is classified as diseased. After the mapping process, the performance of the clustering models can be evaluated using different classification performance metrics. It also becomes possible to apply ensemble techniques in order to combine multiple model predictions into an optimised composite model.
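The paper maps cluster 0 to class 0 and cluster 1 to class 1 directly; the sketch below shows one common way of choosing that orientation automatically, by assigning each cluster the majority true class of its members (an assumption of this illustration, not the authors’ procedure).

```python
# Sketch: aligning each model's cluster indices with the true class labels so
# that clustering output can be scored as classification.
import numpy as np

def map_clusters_to_classes(cluster_labels, y_true):
    mapped = np.empty_like(cluster_labels)
    for c in np.unique(cluster_labels):
        members = (cluster_labels == c)
        # Assign the cluster to the most frequent true class amongst its members.
        mapped[members] = np.bincount(y_true[members]).argmax()
    return mapped
```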

3.4. Combining Different Models (Ensemble Approach)

The transition to a supervised learning approach produces two sets of models specialised in quantitative and qualitative data modelling, respectively. Many decisions are made by these models. Undoubtedly, one of these models alone cannot be selected because it specialises in a particular feature set space. The ensemble approach, therefore, had to be adopted to combine the different models and to produce a single composite decision. An ensemble-based system is obtained by blending diverse models, subsequently referred to as classifiers or experts. These systems are also known as multiple classifier systems or ensemble systems [41].
There are many scenarios in which the ensemble approach makes statistical sense. In many applications requiring automated decision making, receiving data from various sources that may provide complementary information is not unusual. An appropriate combination of this information is known as data or information fusion [41]. It can improve the accuracy of a classification decision compared to a decision based on any of the individual data sources alone [41].
It is clear that the scenario of data fusion has an excellent fit with the proposed model. The observed heart data set is considered to be heterogeneous data, coming from two different sources which represent the quantitative and qualitative features, respectively. Clustering algorithms specialised in processing these data types are used, which collaborate to improve their results through the previously described collaborative method. Finally, clusters are assigned to the data set classes, i.e., the transition from clustering to classification. This produces a set of models that are experts in modelling the heart data set features. In order to create a single decision, the decisions made by each expert are grouped by a specific combination rule, which, for the purposes of the current model, is described below.
The ensemble approach used in the model consists of two levels of learning: (i) essential learning and (ii) meta-learning. At the first level, a set of primary models is trained on the features of the observed data sets in order to produce a group of expert models. Once the training is finished, the expert models create a new data set containing the results of their predictions. The meta-learner is then trained using this new data set, and ultimately used to classify the latest cases. Figure 4 illustrates the ensemble approach used in the model.
Any meta-classifier can be used. However, LR was used as this performs well with binary classification problems. It is trained on the results of expert models’ predictions for both quantitative and qualitative data types and used to predict the final decision.
The stacking approach is in keeping with the current strategy of using a stacking rule to train a meta-classifier. The main difference between them is that in stacking, cross-validation is usually used to prepare the first-level or base models, whereas in the current approach, the base models are trained on different subsets of the data set’s features, i.e., on different feature spaces.
The basic idea of using a two-level ensemble learning approach is to check whether or not the data were learned correctly. For example, if a given model learned a particular region of the feature space incorrectly and thus consistently misclassified cases coming from that region, a second-level meta-model could potentially learn that behaviour, together with the learned behaviours of the other models, and correct the improper training of the original model.
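A minimal sketch of this two-level scheme is given below, assuming `expert_predictions` is a dictionary of per-model prediction vectors produced after the mapping step; the train/test split parameters are illustrative.

```python
# Sketch of the two-level ensemble: the expert models' predictions form a
# new data set on which a logistic regression meta-classifier is trained.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_meta_classifier(expert_predictions, y):
    # Columns = one expert model each; rows = records of the data set.
    meta_X = np.column_stack(list(expert_predictions.values()))
    X_tr, X_te, y_tr, y_te = train_test_split(meta_X, y, test_size=0.2,
                                              random_state=42)
    meta = LogisticRegression().fit(X_tr, y_tr)
    print(f"Meta-classifier accuracy: {meta.score(X_te, y_te):.3f}")
    return meta
```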

4. Experimental Results

In this section, the experiments carried out to test the proposed framework for predicting heart disease in both the reinforcement and parallel scenarios are described. For each context, the experimental setting is given and the obtained outcomes are discussed. Since the proposed model is a hybrid of supervised and unsupervised learning techniques, evaluation metrics, presented below, were applied for both clustering and classification.

4.1. Performance Measures

The evaluation of the clustering result quality is referred to as the evaluation of its validity, for which external and internal indicators are utilised. The external validation index is the most common validation method used in clustering. It is based on prior knowledge of the data and measures the similarity of the clustering results to external ground truth information. Hence, any valid similarity metric suitable for partition comparison can be used as an external indicator [42].
In contrast, the internal validation index relies only on the information in the data, without any additional information. Most internal validation indicators are based on two criteria, i.e., cohesion and separation [43]. Cohesion (compactness) measures how close the objects within a cluster are; it is often measured by variance, with a lower variance indicating better compactness. Separation measures how distinct a cluster is from the other clusters, usually via the distance between the cluster centroids.
In order to assess the current model, the Rand index and purity score were utilised as external validation indicators, and the Davies–Bouldin (DB) index was used as the internal validation index.
The classification performance refers to how well a classification model can correctly predict the class labels of a given data set. The following metrics were applied as performance measures in order to appraise the proposed model:
  • Accuracy, which measures the proportion of correctly classified instances out of the total number of instances (Equation (11)):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (11)$$
  • Precision, which measures the proportion of true positives out of all the instances classified as positive (Equation (12)):
$$\text{Precision} = \frac{TP}{TP + FP} \quad (12)$$
  • Recall or sensitivity, which is defined as the proportion of true positives out of all the positive instances (Equation (13)):
$$\text{Recall} = \frac{TP}{TP + FN} \quad (13)$$
  • F1 score, which combines precision and recall into a single value and provides a balanced measure of a model’s accuracy (Equation (14)):
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (14)$$
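The sketch below shows how these measures, together with the clustering validation indices discussed above, could be computed with scikit-learn; purity is not built in, so it is derived from the contingency matrix in the usual way.

```python
# Sketch: clustering validation indices and classification metrics used in
# this section, computed with scikit-learn (rand_score requires >= 0.24).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, rand_score, davies_bouldin_score)
from sklearn.metrics.cluster import contingency_matrix

def purity_score(y_true, y_pred):
    cm = contingency_matrix(y_true, y_pred)
    return np.sum(np.max(cm, axis=0)) / np.sum(cm)

def evaluate(y_true, cluster_labels, mapped_predictions, X):
    return {
        "rand": rand_score(y_true, cluster_labels),        # external index
        "purity": purity_score(y_true, cluster_labels),    # external index
        "db": davies_bouldin_score(X, cluster_labels),     # internal index
        "accuracy": accuracy_score(y_true, mapped_predictions),
        "precision": precision_score(y_true, mapped_predictions),
        "recall": recall_score(y_true, mapped_predictions),
        "f1": f1_score(y_true, mapped_predictions),
    }
```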

4.2. Reinforcement Scenario Experiments

4.2.1. Experimental Setting

As mentioned earlier, in the reinforcement scenario, one of the clustering algorithms is improved by augmenting it with information from the remaining algorithms, such that only a single clustering algorithm is optimised. In order to be enhanced by the different clustering algorithms, this algorithm must satisfy the proposed collaborative method hypothesis, i.e., the optimisation of an objective function. However, the remaining clustering algorithms can optimise different objective functions, which enables the use of various clustering algorithms. In these experiments, the collaborative clustering framework was evaluated in this scenario. The experiments were divided according to the algorithm selected for reinforcement: algorithms specialised in each data type were chosen and then reinforced with the results of the other algorithms. The following experiments were performed:
  • Choosing an algorithm specialised in processing quantitative data and enhancing it through the results of algorithms specialised in processing qualitative data only.
  • Choosing an algorithm specialised in processing quantitative data and enhancing it through the results of algorithms specialised in processing both qualitative and quantitative data.
  • Choosing an algorithm specialised in processing qualitative data and enhancing it through the results of algorithms that only process quantitative data.
  • Choosing an algorithm specialised in processing qualitative data and enhancing it through the results of algorithms specialised in processing both quantitative and qualitative data.
The proposed model was initially evaluated after the local step. For all the clustering algorithms, the previously described internal and external validation indicators were measured, and their clusters were mapped to accurate class labels and assessed using the classification performance metrics. The meta-classifier (LR model) performance that collects decisions and produces the final decision was subsequently trained and evaluated. The efficacy of the collaborative method was finally compared with the output of the selected clustering algorithm and the meta-classifier both prior to and following the collaborative stage in order to identify any benefit from the collaboration.

4.2.2. Results Discussion

Table 5 shows the clustering algorithm validation indicators and the classification performance measures, which were assessed prior to the collaborations and following the mapping phase, respectively.
In Table 5, one model of each of the quantitative data algorithms, e.g., K-meanscosine (1), is presented, together with two models from each of the k-modes and BMM qualitative data algorithms.
It is evident that the clustering algorithms specialised in qualitative data have a superior performance to those specialised in quantitative data. This does not mean that these algorithms are more effective than the other algorithms. However, it does infer that with respect to recognising the incidence of heart disease, the qualitative variables have a higher weight. Consequently, there is the potential to enhance the performance of clustering algorithms specialised in quantitative data processing through the collaborative method, which gives them insight into qualitative data clustering, and vice versa.
It is also clear that the K-meanscosine (1) and BMM(1) algorithms are the weakest algorithms specialised in processing quantitative and qualitative data, respectively. The remainder of this subsection will therefore focus on the effectiveness of the collaborative method for improving these algorithms.
Table 6 shows the evaluation of the performance of the K-meanscosine (1) algorithm after enhancing it with the results of the algorithms specialised in qualitative data processing only, whilst Table 7 shows its performance after enhancement with the results of both the algorithms specialised in qualitative data and those specialised in quantitative data. In both tables, the performance of the meta-classifier is evaluated after the reinforcement process.
The results in Table 6 indicate that there is a significant improvement after the reinforcement process, as the percentage of progress in the purity index is 17%. In addition, there is a rise in the Rand index and a minimal increase in the DB index. The Rand index is better when it is close to 1, whereas the DB index is not normalised and is best when it is smaller. A significant improvement can also be seen in the classification performance metrics after the reinforcement process, as the percentage of progress in the F1 score reached 19%. Good accuracy was achieved following training once the reinforcement process was complete. The data in Table 7 indicate a slight further increase in performance, as the purity index improvement is 18% and there is an increase in the Rand index, although the DB index remained practically unchanged. The performance of the meta-classifier remained constant in both tables.
It can be concluded from both tables that enhancing the algorithm specialised in processing quantitative data with information from algorithms specialised in processing qualitative data, or by using inputs from algorithms specialised in different data altogether, has a significant impact on improving the result quality. This proves the strength of the proposed framework in the reinforcement scenario. The performance of the meta-classifier remains constant because it depends on the results of all the clustering algorithms shown in Table 5. Its performance is therefore unaffected by the improvement in the results of only a single algorithm.
Table 8 and Table 9 contain the data which describe the performance evaluation of the BMM(1) algorithm following its enhancement by algorithms specialised in processing quantitative data only and by both a combination of quantitative and qualitative algorithms, respectively. As in previous experiments, the performance of the meta-classifier is assessed after the reinforcement process.
The data in both tables demonstrate a significant improvement after the reinforcement process, with an improvement in the purity index of 9% and an increase in the Rand index in both cases. In contrast, the DB internal validation index deteriorates, as its value increased from its local-step value in both tables. There was a significant improvement in the classification performance metrics after the reinforcement process, as the percentage of progress in the F1 score reached 15%. A slight decrease in the performance of the meta-classifier model is also evident, with a 1% fall in the F1 score compared to the previous experiment.
It can be concluded from both tables that the process of reinforcing an algorithm specialised in qualitative data with quantitative algorithms only or with quantitative and qualitative algorithms together has a similar effect in terms of improving its results. Although the reinforcement process is based on the use of weak-performing quantitative algorithms in order to improve a higher-performing algorithm, the stopping factor in the collaborative method controls stopping the reinforcement process at the onset of negative collaboration, which proves the strength of the proposed reinforcement framework in detecting negative collaboration.

4.3. Parallel Scenario Experiments

4.3.1. Experimental Setting

As mentioned earlier, in the parallel scenario, all collaborating algorithms must satisfy the proposed collaborative method hypothesis. This section offers experiments performed to assess the collaborative approach of the proposed model in the parallel scenario. To this end, the experimental setup is similar to that described in the previous section, with the exception that the number of collaborating algorithms is increased by generating multiple algorithms with different random initialisation in order to achieve diversity in the solutions.
In this scenario, the experiments are divided according to the algorithm specialisation for collaboration, with the first and second experiments representing the collaboration of algorithms specialised in only quantitative and qualitative data processing, respectively, with each other. Finally, the third experiment represents the collaboration of all the qualitative and quantitative algorithms. The method used to assess the proposed model is identical to that described in the preceding section.

4.3.2. Results Discussion

The performance data for the clustering algorithms after the local step are presented in Table 10, and include six models specialising in quantitative data and three models specialising in qualitative data. There is diversity in the performance of all the algorithms, which achieves diversity in the solutions.
The evaluation of the quantitative clustering models after collaboration with each other is shown in Table 11; two models of each quantitative algorithm (K-means, K-means cosine and the GMM) and three BMM models were created in order to achieve diversity. The evaluation of the qualitative clustering models after their collaboration is presented in Table 12, and the assessment of all the quantitative and qualitative models after collaboration with each other is shown in Table 13.
Numerous differences can be observed in the DB internal validation index results. Although the proposed collaborative framework aims to optimise all outcomes, weaker algorithms often degrade the outcomes of the best collaborators. On average, however, the collaboration scores on the DB index remain positive, confirming the strength of the proposed framework. A second point highlighted by these experiments is that the framework does not, by itself, guarantee good results on the external indicators, here the Rand and purity indices, when completely unsupervised clustering algorithms are used. The lack of external knowledge explains this poor performance: there is no reason for the collaborative process to converge on the ground truth.
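For reference, the indices discussed here can be computed as in the following sketch: scikit-learn provides the Davies-Bouldin (internal) and Rand (external) indices, purity is implemented directly, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, rand_score

def purity(y_true, y_pred):
    # Credit each cluster with its majority ground-truth class.
    total = 0
    for k in np.unique(y_pred):
        members = y_true[y_pred == k]
        total += np.bincount(members).max()
    return total / len(y_true)

def evaluate(X_view, labels, y):
    """Internal index computed from the features; external indices
    computed against the ground truth (integer class labels)."""
    labels, y = np.asarray(labels), np.asarray(y)
    return {
        "DB": davies_bouldin_score(X_view, labels),  # internal: lower is better
        "Rand": rand_score(y, labels),               # external
        "Purity": purity(y, labels),                 # external
    }
```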
In all experiments, the collaborative step of the proposed method takes an average of 8 to 10 iterations before a stable global entropy is reached. This number decreases slightly when there are only two or three collaborators with extremely close solutions at the end of the local step, but remains largely constant as the number of collaborators or the diversity amongst the initial solutions increases. The highest meta-classifier accuracy, 85%, was obtained when the quantitative algorithms cooperated with each other; the remaining experiments attained almost the same performance, 83%. The accuracy of the meta-classifier can be enhanced by increasing the number of collaborating members, provided that diversity is achieved in the solutions, because the meta-classifier's training data set consists of the collaborating members' results: the more members there are, the richer the training data set and the better the meta-classifier learns.
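This last observation can be made concrete with a brief sketch of the stacking step: a logistic regression meta-classifier fitted on the collaborators' class-mapped predictions. The names (`S`, `y`) and the train/test handling are illustrative, not the paper's exact implementation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# S is the N x J solutions matrix (one class-mapped column per collaborator)
# and y the ground-truth diagnosis.
def stack(S, y):
    S_tr, S_te, y_tr, y_te = train_test_split(S, y, random_state=0)
    meta = LogisticRegression(max_iter=1000).fit(S_tr, y_tr)
    return meta, meta.score(S_te, y_te)  # held-out accuracy of the ensemble
```

Each additional (diverse) collaborator adds a column to `S`, which is precisely how a larger membership enriches the meta-classifier's training data.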

5. Conclusions

A hybrid model comprising unsupervised and supervised learning techniques has been proposed in this paper; it can be used for the general modelling and processing of mixed data and for heart disease data in particular. The model consists of two main components: (i) collaborative clustering and (ii) combining decisions (the ensemble approach). The mixed data clustering problem is treated as a multi-view clustering problem, with each view processed by specialised clustering algorithms. Since each algorithm operates on a different space of the data set's features, a new collaborative framework has been suggested that enhances the clustering process through information exchange between the different clustering algorithms, thereby producing expert models for the different feature spaces. A strength of this method is that the prototypes or models used by the different algorithms need not be shared during the collaboration step; only the solution vectors produced by the algorithms are exchanged. The optimisation process is based on the EM algorithm and improves the entropy-equivalent collaborative term, thus ensuring good convergence properties. The approach also avoids negative collaboration by controlling the continuity of the collaborative step. Finally, the models' clusters are mapped to the appropriate data set classes, shifting from the unsupervised to the supervised setting and making the clustering models' predictions meaningful. To produce a single decision from all these expert models, an ensemble approach similar to stacking is applied: an LR model is used as a meta-classifier, trained on the expert models' predictions and then utilised to predict the final decision.
A cardiac data set containing different feature types, i.e., quantitative (discrete and continuous) and qualitative (nominal and ordinal) data, was used to validate the proposed framework. The highest meta-classifier accuracy of 85% was obtained when the quantitative algorithms cooperated only with each other; the remaining experiments offered an equivalent performance level of 83%. This is because the meta-classifier's performance is not affected by changing the results of just one algorithm, as in the reinforcement scenario experiments, and because the number of cooperating members was small. The accuracy of the meta-classifier can nevertheless be enhanced by increasing the number of collaborating members, provided that diversity in the solutions is achieved, since the meta-classifier's training data set consists of the collaborating members' results; the more members there are, the richer the training data set and the better the meta-classifier learns. It should also be noted that in these experiments, the collaborative process never achieved superior results to the best algorithms generated by the local phase. Moreover, the notion of the best result in unsupervised learning depends heavily on the index considered, and it is impossible to know in advance which algorithm will give the best results; in most cases, the presented approach therefore remains valid. Given the issue of strong and weak collaborators, future work will focus on improving the collaborative process by balancing the collaborators' influence on each other according to quality and diversity criteria, and on incorporating external knowledge into the collaborative approach, with the objective of reducing instances of negative collaboration. The proposed algorithm can also be applied in other fields, such as health and engineering, and can be examined in an ablation setting using only two clustering algorithms.

Author Contributions

Formal analysis, A.A.-S.; Methodology, N.Z.; Software, A.A.-S.; Supervision, N.Z.; Writing—original draft, A.A.-S.; Writing—review and editing, M.M.K. and N.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant no. UJ-23-FR-41. Therefore, the authors thank the University of Jeddah for its technical and financial support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository. The data presented in this study are openly available in the Heart Failure Prediction data set on Kaggle at https://www.kaggle.com/fedesoriano/heart-failure-prediction [32], which was compiled from the UCI Machine Learning Repository [33].

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Marimuthu, M.; Abinaya, M.; Hariesh, K.; Madhankumar, K.; Pavithra, V. A review on heart disease prediction using machine learning and data analytics approach. Int. J. Comput. Appl. 2018, 181, 20–25.
2. Ramalingam, V.; Dandapath, A.; Raja, M.K. Heart disease prediction using machine learning techniques: A survey. Int. J. Eng. Technol. 2018, 7, 684–687.
3. Aggarwal, A. Cardiovascular Diseases. Available online: https://www.who.int/health-topics/cardiovascular-diseases (accessed on 1 September 2022).
4. Sublime, J.; Matei, B.; Cabanes, G.; Grozavu, N.; Bennani, Y.; Cornuéjols, A. Entropy based probabilistic collaborative clustering. Pattern Recognit. 2017, 72, 144–157.
5. Sublime, J.; Grozavu, N.; Bennani, Y.; Cornuéjols, A. Collaborative clustering with heterogeneous algorithms. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
6. Cornuéjols, A.; Wemmert, C.; Gançarski, P.; Bennani, Y. Collaborative clustering: Why, when, what and how. Inf. Fusion 2018, 39, 81–95.
7. Yekkala, I.; Dixit, S.; Jabbar, M. Prediction of heart disease using ensemble learning and particle swarm optimization. In Proceedings of the 2017 International Conference on Smart Technologies for Smart Nation (SmartTechCon), Bengaluru, India, 17–19 August 2017; pp. 691–698.
8. Mohan, S.; Thirumalai, C.; Srivastava, G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 2019, 7, 81542–81554.
9. Repaka, A.N.; Ravikanti, S.D.; Franklin, R.G. Design and implementing heart disease prediction using naives Bayesian. In Proceedings of the 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 23–25 April 2019; pp. 292–297.
10. Fitriyani, N.L.; Syafrudin, M.; Alfian, G.; Rhee, J. HDPM: An effective heart disease prediction model for a clinical decision support system. IEEE Access 2020, 8, 133034–133050.
11. Ayon, S.I.; Islam, M.M.; Hossain, M.R. Coronary artery heart disease prediction: A comparative study of computational intelligence techniques. IETE J. Res. 2022, 68, 2488–2507.
12. Tarawneh, M.; Embarak, O. Hybrid approach for heart disease prediction using data mining techniques. In Advances in Internet, Data and Web Technologies: The 7th International Conference on Emerging Internet, Data and Web Technologies (EIDWT-2019); Springer: Berlin/Heidelberg, Germany, 2019; pp. 447–454.
13. Sonawane, R.; Patil, H. Automated heart disease prediction model by hybrid heuristic-based feature optimization and enhanced clustering. Biomed. Signal Process. Control 2022, 72, 103260.
14. Shinde, R.; Arjun, S.; Patil, P.; Waghmare, J. An intelligent heart disease prediction system using k-means clustering and naïve Bayes algorithm. Int. J. Comput. Sci. Inf. Technol. 2015, 6, 637–639.
15. Ripan, R.C.; Sarker, I.H.; Hossain, S.M.M.; Anwar, M.M.; Nowrozy, R.; Hoque, M.M.; Furhad, M.H. A data-driven heart disease prediction model through k-means clustering-based anomaly detection. SN Comput. Sci. 2021, 2, 112.
16. Pedrycz, W.; Rai, P. Collaborative clustering with the use of fuzzy c-means and its quantification. Fuzzy Sets Syst. 2008, 159, 2399–2427.
17. Yu, F.; Tang, J.; Cai, R. Partially horizontal collaborative fuzzy c-means. Int. J. Fuzzy Syst. 2007, 9, 198–204.
18. Yu, F.; Yu, J.; Tang, J. The model of generalized partially horizontal collaborative fuzzy c-means. In Proceedings of the 2009 Chinese Control and Decision Conference, Guilin, China, 17–19 June 2009; pp. 6095–6099.
19. Yu, S.; Yu, F. Incorporating prototypes into horizontal collaborative fuzzy c-means. In Proceedings of the 2010 Chinese Control and Decision Conference, Xuzhou, China, 26–28 May 2010; pp. 3612–3616.
20. Jiang, Y.; Chung, F.-L.; Wang, S.; Deng, Z.; Wang, J.; Qian, P. Collaborative fuzzy clustering from multiple weighted views. IEEE Trans. Cybern. 2014, 45, 688–701.
21. Yang, M.-S.; Sinaga, K.P. Collaborative feature-weighted multiview fuzzy c-means clustering. Pattern Recognit. 2021, 119, 108064.
22. Gao, Y.; Wang, Z.; Li, H.; Pan, J. Gaussian collaborative fuzzy c-means clustering. Int. J. Fuzzy Syst. 2021, 23, 2218–2234.
23. Grozavu, N.; Bennani, Y. Topological collaborative clustering. Aust. J. Intell. Inf. Process. Syst. 2010, 12, 14.
24. Grozavu, N.; Ghassany, M.; Bennani, Y. Learning confidence exchange in collaborative clustering. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 872–879.
25. Ghassany, M.; Grozavu, N.; Bennani, Y. Collaborative clustering using prototype-based techniques. Int. J. Comput. Intell. Appl. 2012, 11, 1250017.
26. Sublime, J.; Grozavu, N.; Cabanes, G.; Bennani, Y.; Cornuéjols, A. From horizontal to vertical collaborative clustering using generative topographic maps. Int. J. Hybrid Intell. Syst. 2015, 12, 245–256.
27. Sublime, J.; Grozavu, N.; Bennani, Y.; Cornuéjols, A. Vertical collaborative clustering using generative topographic maps. In Proceedings of the 2015 7th International Conference of Soft Computing and Pattern Recognition (SoCPaR), Fukuoka, Japan, 13–15 November 2015; pp. 199–204.
28. Forestier, G.; Gançarski, P.; Wemmert, C. Collaborative clustering with background knowledge. Data Knowl. Eng. 2010, 69, 211–228.
29. Koochemeshkian, P.; Zamzami, N.; Bouguila, N. Flexible distribution-based regression models for count data: Application to medical diagnosis. Cybern. Syst. 2020, 51, 442–466.
30. Alalyan, F.; Zamzami, N.; Bouguila, N. A hybrid approach based on SVM and Bernoulli mixture model for binary vectors classification. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; pp. 1155–1160.
31. Gargiulo, F.; Silvestri, S.; Ciampi, M. A clustering based methodology to support the translation of medical specifications to software models. Appl. Soft Comput. 2018, 71, 199–212.
32. Fedesoriano. Heart Failure Prediction Dataset. 2021. Available online: https://www.kaggle.com/fedesoriano/heart-failure-prediction (accessed on 13 September 2022).
33. UCI Machine Learning Repository. Heart Disease Data Set. 1988. Available online: https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/ (accessed on 13 September 2022).
34. Ahsan, M.M.; Mahmud, M.P.; Saha, P.K.; Gupta, K.D.; Siddique, Z. Effect of data scaling methods on machine learning algorithms and model performance. Technologies 2021, 9, 52.
35. Izonin, I.; Tkachenko, R.; Shakhovska, N.; Ilchyshyn, B.; Singh, K.K. A two-step data normalization approach for improving classification accuracy in the medical diagnosis domain. Mathematics 2022, 10, 1942.
36. Azimi, J.; Fern, X. Adaptive cluster ensemble selection. In Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, 11–17 July 2009.
37. Grozavu, N.; Cabanes, G.; Bennani, Y. Diversity analysis in collaborative clustering. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 1754–1761.
38. Zarinbal, M.; Zarandi, M.F.; Turksen, I. Relative entropy collaborative fuzzy clustering method. Pattern Recognit. 2015, 48, 933–940.
39. Verbeek, J. Mixture Models for Clustering and Dimension Reduction. Ph.D. Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, 2004.
40. Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 1998, 2, 283–304.
41. Polikar, R. Ensemble learning. In Ensemble Machine Learning: Methods and Applications; Springer Science & Business Media: Berlin, Germany, 2012; pp. 1–34.
42. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. On clustering validation techniques. J. Intell. Inf. Syst. 2001, 17, 107–145.
43. Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining; Addison-Wesley: Boston, MA, USA, 2006.
Figure 1. The general framework of the proposed model.
Figure 2. Collaborative clustering for parallel scenario.
Figure 3. Collaborative clustering for reinforcement scenario.
Figure 4. Ensemble approach.
Table 1. Description of the data set.

Feature | Data Type | Description
Age | Quantitative | age of the patient (years)
Sex | Qualitative | sex of the patient (M: male, F: female)
ChestPainType | Qualitative | chest pain type (TA: typical angina, ATA: atypical angina, NAP: non-anginal pain, ASY: asymptomatic)
RestingBP | Quantitative | resting blood pressure (mm Hg)
Cholesterol | Quantitative | serum cholesterol (mg/dL)
FastingBS | Qualitative | fasting blood sugar (1: if FastingBS > 120 mg/dL, 0: otherwise)
RestingECG | Qualitative | resting electrocardiogram results (Normal: normal; ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria)
MaxHR | Quantitative | maximum heart rate achieved (numeric value between 60 and 202)
ExerciseAngina | Qualitative | exercise-induced angina (Y: yes, N: no)
Oldpeak | Quantitative | oldpeak = ST (numeric value measured in depression)
ST_Slope | Qualitative | the slope of the peak exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping)
HeartDisease | Qualitative | output class (1: heart disease, 0: normal)
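For concreteness, the view split implied by the Data Type column can be sketched as follows. The column names follow the public Kaggle file, the local filename `heart.csv` is assumed, and one-hot encoding of the qualitative view is one plausible preparation for the binary-data algorithms rather than the paper's exact preprocessing.

```python
import pandas as pd

# Assumed local copy of the Kaggle data set.
df = pd.read_csv("heart.csv")

quantitative_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
qualitative_cols = ["Sex", "ChestPainType", "FastingBS", "RestingECG",
                    "ExerciseAngina", "ST_Slope"]

X_num = df[quantitative_cols]                 # view for K-means / GMM
X_cat = pd.get_dummies(df[qualitative_cols])  # one-hot view for BMM / K-modes
# (FastingBS is already binary and passes through get_dummies unchanged.)
y = df["HeartDisease"]
```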
Table 2. Notation.

Notation | Development | Comment
$X_i$ | $X_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,N}\}$, $x_{i,n} \in \mathbb{R}^{d}$ | The subset of the data observed by algorithm $A_i$.
$X$ | $X = \{X_1, X_2, \dots, X_J\}$ | All data with all views.
$S_i$ | $S_i = \{S_1^i, S_2^i, \dots, S_N^i\}$, $S_n^i \in \mathbb{R}^{K_i}$ | The solution vector of algorithm $A_i$.
$S$ | $S = \{S^1, S^2, \dots, S^J\}$ | Solution vectors for all clustering algorithms.
$\theta$ | $\theta = \{\theta_1, \theta_2, \dots, \theta_J\}$ | The set of distribution parameters for all algorithms.
$A_i$ | $A_i = \{X_i, K_i, \theta_i, S_i\}$ | An algorithm looking for $K_i$ clusters of distribution parameters $\theta_i$ in the subset $X_i$ and finding a solution vector $S_i$.
$Z$ | | The normalisation constant, independent of $s$.
$\Psi^{i \to j}$ | $\Psi^{j} = \{\Psi^{1 \to j}, \Psi^{2 \to j}, \dots, \Psi^{J \to j}\}$ | The set of consensus matrices available to algorithm $A_j$.
$\alpha_{k,l}^{i,j}$ | $\alpha_{k,l}^{i,j} \in \Psi^{i \to j}$ | The consensus matrices map the other algorithms' clusters to the ones of the current algorithm $A_j$.
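One plausible way to estimate such a consensus matrix from two solution vectors is the normalised co-occurrence count sketched below; this follows the cluster-mapping idea in the table rather than reproducing the exact estimator of [4].

```python
import numpy as np

def consensus_matrix(s_i, s_j, K_i, K_j):
    # alpha[k, l] = fraction of the observations that algorithm A_i puts
    # in cluster k which algorithm A_j puts in cluster l (rows sum to 1).
    counts = np.zeros((K_i, K_j))
    for k, l in zip(s_i, s_j):
        counts[k, l] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)
```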
Table 3. Clustering algorithms.

Clustering Algorithm | Data Type | Description
K-means | Quantitative data | Cluster similarity is measured based on the mean value of the objects in the cluster.
Gaussian mixture model (GMM) | Quantitative data | EM algorithm using the Gaussian distribution.
K-modes | Qualitative data | Uses a dissimilarity measure based on the number of mismatches between the categorical values of objects and the cluster modes.
Bernoulli mixture model (BMM) | Qualitative data | EM algorithm using the Bernoulli distribution.
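As an illustration, three of these algorithms are available off the shelf (K-modes via the third-party kmodes package), a cosine variant of K-means can be approximated by L2-normalising the features, and the BMM would require a direct EM implementation, since scikit-learn does not provide one. The configuration below is a sketch, not the experimental setup used in the paper.

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import normalize
from kmodes.kmodes import KModes  # third-party: pip install kmodes

# Plain K-means and the GMM for the quantitative view:
kmeans = KMeans(n_clusters=2, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0)

# K-modes for the qualitative view:
kmodes = KModes(n_clusters=2, init="Huang", n_init=5)

def cosine_kmeans(X, n_clusters=2, seed=0):
    # L2-normalising the rows makes Euclidean K-means behave like
    # spherical (cosine) K-means; scikit-learn has no native variant.
    return KMeans(n_clusters=n_clusters, random_state=seed).fit(normalize(X))

# A Bernoulli mixture model is not provided by scikit-learn; its EM
# updates would be implemented directly for the binary-encoded view.
```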
Table 4. Solutions matrix.

$X \backslash A$ | $A_1$ | $A_2$ | $A_3$ | $\dots$ | $A_J$
$x_1$ | 0 | 1 | 0 | $\dots$ | 0
$x_2$ | 1 | 1 | 1 | $\dots$ | 1
$x_3$ | 0 | 1 | 0 | $\dots$ | 0
$\vdots$ | | | | |
$x_N$ | 0 | 0 | 1 | 0 | 1
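The matrix can be assembled once each model's clusters have been mapped to the data set's classes. The sketch below uses a common majority-vote mapping, which matches the cluster-to-class mapping described in this paper in spirit only; the names `models`, `views` and `S` are illustrative.

```python
import numpy as np

def map_clusters_to_classes(labels, y):
    # Relabel each cluster with the majority ground-truth class among its
    # members, turning a clustering into a class prediction.
    labels, y = np.asarray(labels), np.asarray(y)
    mapped = np.empty_like(labels)
    for k in np.unique(labels):
        mask = labels == k
        mapped[mask] = np.bincount(y[mask]).argmax()
    return mapped

# One column per collaborating model, as in Table 4:
# S = np.column_stack([map_clusters_to_classes(m.predict(X_v), y)
#                      for m, X_v in zip(models, views)])
```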
Table 5. Evaluation of the performance of all algorithms after the local step.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
Quantitative Clustering Algorithms
K-means(1) | 1.13915 | 0.51827 | 59.695% | 31.496% | 87.912% | 46.377% | 59.695%
K-meanscosine(1) | 0.62323 | 0.50237 | 55.338% | 54.528% | 58.936% | 56.646% | 53.813%
Gmm(1) | 0.79985 | 0.63859 | 76.362% | 83.858% | 75.936% | 79.701% | 76.362%
Qualitative Clustering Algorithms
K-modes(1) | 0.9349 | 0.6103 | 73.529% | 62.402% | 85.908% | 72.292% | 73.529%
K-modes(2) | 1.22162 | 0.69652 | 81.373% | 82.48% | 83.633% | 83.052% | 81.373%
Bmm(1) | 0.66657 | 0.59736 | 72.113% | 55.709% | 90.127% | 68.857% | 72.113%
Bmm(2) | 1.10022 | 0.69652 | 81.373% | 84.646% | 82.218% | 83.414% | 81.373%
Table 6. Reinforcement of the K-meanscosine(1) algorithm by algorithms specialised in qualitative data processing.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
K-meanscosine(1) | 0.65664 | 0.60028 | 72.44% | 75.197% | 75.049% | 75.123% | 72.44%
Meta-classifier | - | - | - | 90.551% | 82.883% | 86.547% | 84.423%
Table 7. Reinforcement of the K-meanscosine(1) algorithm by both qualitative and quantitative algorithms.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
K-meanscosine(1) | 0.62786 | 0.60724 | 73.203% | 76.378% | 75.486% | 75.929% | 73.203%
Meta-classifier | - | - | - | 90.551% | 82.883% | 86.547% | 84.423%
Table 8. Reinforcement of the BMM(1) algorithm by algorithms specialised in quantitative data processing.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
Bmm(1) | 1.10311 | 0.69652 | 81.373% | 84.055% | 82.592% | 83.317% | 81.373%
Meta-classifier | - | - | - | 87.598% | 83.022% | 85.249% | 83.224%
Table 9. Reinforcement of the BMM(1) algorithm by both quantitative algorithms and qualitative algorithms.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
Bmm(1) | 1.0987 | 0.69652 | 81.373% | 84.449% | 82.342% | 83.382% | 81.373%
Meta-classifier | - | - | - | 87.598% | 83.022% | 85.249% | 83.224%
Table 10. Evaluation of the performance of all algorithms after the local step.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
Quantitative Clustering Algorithms
K-means1 | 1.75751 | 0.5522 | 66.231% | 51.181% | 80.745% | 62.65% | 66.231%
K-means2 | 1.14254 | 0.51785 | 59.586% | 31.496% | 87.432% | 46.31% | 59.586%
K-meanscosine(1) | 0.55905 | 0.58887 | 71.133% | 72.441% | 74.645% | 73.526% | 71.133%
K-meanscosine(2) | 0.57749 | 0.58254 | 70.37% | 74.016% | 72.868% | 73.438% | 70.37%
Gmm(1) | 0.79985 | 0.63859 | 76.362% | 83.858% | 75.936% | 79.701% | 76.362%
Gmm(2) | 0.98261 | 0.60623 | 73.094% | 64.37% | 83.206% | 72.586% | 73.094%
Qualitative Clustering Algorithms
Bmm(1) | 1.10022 | 0.69652 | 81.373% | 84.646% | 82.218% | 83.414% | 81.373%
Bmm(2) | 0.63794 | 0.65883 | 78.214% | 93.504% | 73.988% | 82.609% | 78.214%
Bmm(3) | 2.03007 | 0.52717 | 61.765% | 85.827% | 60.979% | 71.3% | 61.765%
Table 11. Evaluation of quantitative algorithms after collaboration between them.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
K-means1 | 1.0891 | 0.51956 | 60.022% | 32.283% | 87.701% | 47.194% | 60.022%
K-means2 | 1.08917 | 0.51956 | 60.022% | 32.283% | 87.701% | 47.194% | 60.022%
K-meanscosine(1) | 0.72586 | 0.58077 | 70.153% | 69.685% | 74.684% | 72.098% | 70.153%
K-meanscosine(2) | 0.72387 | 0.58077 | 70.153% | 69.685% | 74.684% | 72.098% | 70.153%
Gmm(1) | 1.50585 | 0.51703 | 59.368% | 30.315% | 89.017% | 45.228% | 59.368%
Gmm(2) | 1.50557 | 0.51827 | 59.695% | 30.906% | 89.205% | 45.907% | 59.695%
Meta-classifier | - | - | - | 88.386% | 85.199% | 86.763% | 85.076%
Table 12. Evaluation of qualitative algorithms after collaboration between them.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
Bmm(1) | 1.0987 | 0.69652 | 81.373% | 84.449% | 82.342% | 83.382% | 81.373%
Bmm(2) | 2.25264 | 0.69652 | 81.373% | 84.449% | 82.342% | 83.382% | 81.373%
Bmm(3) | 2.84156 | 0.5522 | 66.231% | 86.22% | 64.602% | 73.862% | 66.231%
Meta-classifier | - | - | - | 91.339% | 81.547% | 86.166% | 83.769%
Table 13. Evaluation of all quantitative and qualitative algorithms after collaboration with each other.

Algorithm | DB Index | Rand Index | Purity | Sensitivity | Precision | F1 Score | Accuracy
Quantitative Clustering Algorithms
K-means(1) | 1.12136 | 0.52321 | 60.893% | 34.252% | 87.437% | 49.222% | 60.893%
K-means(2) | 1.11817 | 0.52181 | 60.566% | 33.465% | 87.629% | 48.434% | 60.566%
K-meanscosine(1) | 0.77607 | 0.59833 | 72.222% | 73.031% | 75.869% | 74.423% | 72.222%
K-meanscosine(2) | 0.69197 | 0.60523 | 72.985% | 74.409% | 76.21% | 75.299% | 72.985%
Gmm(1) | 1.16055 | 0.65273 | 77.669% | 81.102% | 79.079% | 80.078% | 77.669%
Gmm(2) | 1.14702 | 0.56893 | 68.627% | 50.394% | 87.671% | 64.0% | 68.627%
Qualitative Clustering Algorithms
Bmm(1) | 1.10022 | 0.69652 | 81.373% | 84.646% | 82.218% | 83.414% | 81.373%
Bmm(2) | 1.10373 | 0.70342 | 81.917% | 86.22% | 82.022% | 84.069% | 81.917%
Bmm(3) | 2.5694 | 0.56414 | 67.974% | 96.26% | 64.005% | 76.887% | 67.974%
Meta-classifier | - | - | - | 92.323% | 81.282% | 86.451% | 83.987%