Article

Genetic Algorithm-Based Optimization of Clustering Algorithms for the Healthy Aging Dataset

Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, Ranchi 835215, India
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5530; https://doi.org/10.3390/app14135530
Submission received: 18 May 2024 / Revised: 19 June 2024 / Accepted: 20 June 2024 / Published: 26 June 2024
(This article belongs to the Special Issue Predictive Analytics in Healthcare)

Abstract

Clustering is a crucial and, at the same time, challenging task in several application domains. It is important to incorporate optimal feature selection into clustering algorithms for better exploration of features and for drawing meaningful conclusions, but this is difficult when there is little or no information about the importance or relevance of features. To tackle this task efficiently, we employ the natural evolution process inherent in genetic algorithms (GAs) to find the optimum features for clustering the healthy aging dataset. To empirically verify the findings, genetic algorithms were combined with a number of clustering algorithms, including partitional, density-based, and agglomerative clustering algorithms. A variant of the popular KMeans algorithm, named KMeans++, gave the best results on all performance metrics when combined with GAs.

1. Introduction

A healthy aging society is the dream of any sensible human being. This research provides insight into the factors that could be important in achieving this goal. Our study focused on the National Poll on Healthy Aging (NPHA) dataset, which was created to gather insights on the health, healthcare, and health policy issues affecting Americans aged 50 and older. By focusing on the perspectives of older adults and their caregivers, the University of Michigan aimed to inform the public, healthcare providers, policymakers, and advocates about the various aspects of aging. This includes topics like health insurance, household composition, sleep issues, dental care, prescription medications, and caregiving, thereby providing a comprehensive understanding of the health-related needs and concerns of the older population. The target variable in this study was the Number_of_Doctors_Visited. In this work, we apply clustering algorithms and optimize them with genetic algorithms to find the most relevant set of features for deriving meaningful insights from the dataset. The primary objective of this research is to identify important factors that may affect the Number_of_Doctors_Visited by an individual. Furthermore, the clusters formed by the selected features serve to generate health-related recommendations for new patients. Several such recommendations are included in this paper, which may serve as guidelines for the patient and/or the caregiver to improve their overall health.
To this end, we frame the following research questions; by answering them, this paper aims to support the development of future clustering models with superior accuracy and efficiency. These research questions also offer corroborating evidence for the results of our empirical investigation.
  • RQ1: What is the efficacy of a genetic algorithm in the process of feature selection for enhancing clustering performance?
  • RQ2: Which clustering algorithm is most effective when applied to the selected NPHA dataset?
  • RQ3: Does the iterative process of selection, crossover, and mutation in a genetic algorithm have the potential to enhance clustering performance across numerous generations?
The major contributions in this study are outlined below:
  • By simulating the principles of natural evolution, the genetic algorithm utilized in this study optimizes feature selection for clustering.
  • The most relevant subset of features from the dataset is identified based on the “variance” score for feature selection.
  • Clustering algorithms, including KMeans++, DBSCAN, BIRCH, and agglomerative clustering, were applied to this set of features.
  • Finally, the outcomes of the clustering with the best performance metrics are reported.
The abstract view of our proposed methodology is presented in Figure 1.
Machine learning techniques have played an important role in healthcare applications, and several researchers have pursued this area [1,2,3,4,5], proposing approaches for detecting COVID-19 from chest X-ray images, image-based disease progression prediction, and diabetic retinopathy detection. In [6], Jha et al. propose another machine learning approach as a step towards better health for our society. Machine learning classifiers have also been applied by other researchers in solving healthcare problems [7,8,9]. Similarly, we see several researchers experimenting with clustering algorithms [10,11,12,13,14], applying them to distributed systems [15], or optimizing them [16]. Variants such as KMeans++ have also been active areas of research [17]. Clustering for mixed types of data [18], techniques for clustering algorithm selection [19], feature selection techniques [20,21], the application of the KMeans algorithm for customer segmentation [22], and improvements using genetic algorithms or other techniques [23,24,25,26,27,28,29,30] are some other examples of the continued interest of the research community in this topic.
The organization of the rest of this paper is as follows: Section 2 presents the methods used in the study, while Section 3 and Section 4 present the experimentation part and the analysis of the experimental results, respectively. The Discussion and Conclusion are given in Section 5 and Section 6, respectively.

2. Materials and Methods

2.1. Selecting Features Using Genetic Algorithms

Feature selection is a procedure in machine learning and data mining that aims to select the most representative or relevant subset of features from the original dataset, achieving dimensionality reduction and improving model performance.
The objective of feature selection is to enhance the performance of machine learning models by decreasing the number of features, thereby mitigating overfitting, reducing computational complexity, and improving model interpretability. Feature selection encompasses several strategies, such as filter methods, wrapper methods, and embedded methods. Filter methods choose features based on their statistical characteristics, such as their correlation with the target variable or their information gain. Wrapper approaches employ a dedicated machine learning algorithm to assess the significance of features by training and evaluating the model on various subsets of features. Embedded approaches integrate feature selection into the model training process by utilizing regularization techniques such as Lasso and Ridge regression. Genetic algorithms are a form of optimization method that can be employed to select features. Genetic algorithms evolve a population of potential feature subsets across numerous generations through genetic operations such as selection, crossover, and mutation. The fitness of each subset of features is assessed using a fitness function, which usually quantifies the performance of the model when using that subset of features. By iteratively evolving the population, genetic algorithms can systematically search for an optimal subset of features that maximizes the performance of a model. The variance metric has been used to evaluate the quality of selected features.
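As a concrete illustration, the following is a minimal sketch (our own, not code released with the paper) of a variance-based fitness function that could score a candidate feature subset encoded as a binary mask:

```python
import numpy as np

def variance_scores(X):
    """Per-feature sample variance; a higher value indicates that the
    feature spreads the data over a broader range of values."""
    return np.var(X, axis=0)

def subset_fitness(X, mask):
    """Fitness of a candidate feature subset encoded as a 0/1 mask:
    here, the mean variance of the selected columns."""
    selected = X[:, mask.astype(bool)]
    if selected.shape[1] == 0:        # an empty subset is unfit
        return float("-inf")
    return variance_scores(selected).mean()
```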

2.2. Genetic Algorithm

A genetic algorithm (GA) is an optimization technique that draws inspiration from the natural selection process. It is used to find near-optimal solutions to problems that require identifying the best combination of factors or attributes. The algorithm employs the principle of natural selection to iteratively generate increasingly good approximations to a solution, with each generation producing a fresh set of candidates. The technique involves selecting individuals based on their fitness in the problem domain and applying operators derived from natural genetics. This leads to a population of individuals better suited to the environment than those from which they originated, resembling the process of natural adaptation.
The main steps of this nature-inspired heuristic are listed below; a minimal code sketch follows the list.
  • Initialization: The method commences by generating an initial population of potential solutions (individuals) to the problem. Each individual is represented as a string of values, which may be binary, integer, or real, depending on the specific problem.
  • Selection: The algorithm chooses individuals from the population based on their fitness, which is a measure of how effectively an individual solves the task. Individuals with higher fitness are more likely to be chosen for the subsequent generation.
  • Crossover: The selected individuals are paired, and a crossover operation is performed to generate new offspring. One-point crossover selects a random point in the individuals’ string representations and exchanges the values beyond that point between the parents, producing two new children.
  • Mutation: Following crossover, a mutation operation induces minor random alterations in the offspring’s strings. This introduces novel genetic material into the population and hinders the algorithm from converging prematurely towards a suboptimal answer.
  • Replacement: Individuals in the present population are replaced with offspring according to a predetermined replacement scheme. This keeps the population size consistent across successive generations.
  • Termination: The algorithm repeats the selection, crossover, mutation, and replacement phases until a specified termination condition is satisfied, such as a maximum number of generations, a suitable fitness level, or a predefined time limit. The fundamental operational premise of a genetic algorithm encompasses these stages and is shown in Figure 2.
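For illustration, the steps above can be condensed into a minimal NumPy sketch. This is our own sketch, not the exact implementation used in the experiments; the default parameters mirror the settings reported in Section 4.3, and a binary string encoding and a caller-supplied fitness function are assumed:

```python
import numpy as np

rng = np.random.default_rng(42)

def run_ga(fitness, n_genes, pop_size=8, generations=8,
           elite_frac=0.4, mutation_rate=0.2):
    # Initialization: a population of random binary strings
    pop = rng.integers(0, 2, size=(pop_size, n_genes))
    for _ in range(generations):
        # Selection: rank by fitness; the fittest form the elite
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]
        n_elite = max(1, int(elite_frac * pop_size))
        elite = pop[order[:n_elite]]
        # Crossover: one-point crossover between two random elite parents
        children = []
        while len(children) < pop_size - n_elite:
            p1, p2 = elite[rng.integers(n_elite, size=2)]
            point = rng.integers(1, n_genes)
            children.append(np.concatenate([p1[:point], p2[point:]]))
        children = np.array(children)
        # Mutation: flip a fraction of the offspring's bits at random
        flip = rng.random(children.shape) < mutation_rate
        children = np.where(flip, 1 - children, children)
        # Replacement: elite survive unchanged; offspring fill the rest
        pop = np.vstack([elite, children])
    # Termination: generation budget exhausted; return the best individual
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()]
```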
Genetic algorithms are commonly employed in optimization situations where conventional approaches may be infeasible or inefficient, particularly in intricate optimization landscapes with numerous local optima. Gupta et al. applied GAs for optimizing machine learning models for software fault prediction [31].

2.3. Clustering/Cluster Analysis

Cluster analysis, often known as clustering, is a machine learning technique that involves grouping data points with similar characteristics into clusters. Clustering aims to divide a dataset into distinct groups, where the data points within each group, or cluster, exhibit greater similarity to one another compared to those in different groups. Clustering is an unsupervised learning method where the algorithm identifies patterns and structures in the data without receiving explicit instructions on grouping the data.
This study thoroughly examines the KMeans++, DBSCAN, BIRCH, and agglomerative clustering methods and the effect of genetic optimization on each.

2.4. Clustering Algorithms

Clustering algorithms aim to partition a given dataset into distinct groups, or clusters, based on the degree of similarity between data elements within the same cluster and those in other clusters. The following is a concise overview of the clustering algorithms that were implemented in this study:

2.4.1. KMeans/KMeans++

KMeans++ is a centroid-based clustering algorithm designed to partition a dataset into clusters, ensuring that each data point is assigned to the cluster with the closest centroid. KMeans++ is a modified version of the KMeans method that enhances the initial selection of cluster centroids. In the standard KMeans algorithm, the initial centroids are selected randomly from the dataset; this random selection can result in clustering outcomes that are not ideal. KMeans++ resolves this problem by employing a more sophisticated initialization technique. The process begins by randomly selecting a single initial centroid from the data points. Successive centroids are then selected with a probability proportional to the squared distance to the nearest existing centroid. This spreads the initial centroids apart and increases the likelihood of discovering better final centroids.
After the centroids are initialized, the next steps of the K-means algorithm follow the normal procedure. Data points are allocated to the centroid that is closest to them, and the centroids are updated by calculating the mean of the points in each cluster. This process continues until the centroids reach stable positions and the cluster assignments remain mostly unchanged.
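In scikit-learn, which the implementation uses (Section 4.3), k-means++ seeding is exposed through the `init` parameter. A hedged usage sketch, with `X_selected` standing in for the GA-chosen feature columns and three clusters following Section 5.3:

```python
from sklearn.cluster import KMeans

# init="k-means++" seeds centroids with probability proportional to the
# squared distance from the nearest already-chosen centroid
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X_selected)   # X_selected: GA-chosen features
centroids = kmeans.cluster_centers_
```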

2.4.2. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a clustering algorithm that groups data points together based on their density in the feature space. Density-based clustering algorithms, in contrast to centroid-based clustering algorithms such as K-means/Kmeans++, form clusters based on areas of high density that are separated by areas of low density. The DBSCAN algorithm initiates by randomly selecting a data point from the dataset. A point is defined as a core point if it has a minimum number of other points (MinPts) within a specified radius (ε). Points that fall within the ε-radius of a core point, including the core point itself, are said to be density-reachable from that core point.
The process proceeds by iteratively including all the points that are density-reachable from the core point, thus expanding the cluster. This continues until no further points can be added to the cluster. Border points are points that are not core points themselves but are located inside the ε-radius of a core point. These points are included in the cluster; however, they are not classified as core points. Points that are neither core points nor border points are categorized as noise points and do not belong to any cluster. DBSCAN is highly effective at detecting clusters with diverse shapes and is robust to noise.
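A short scikit-learn sketch of DBSCAN: `eps` is the ε-radius and `min_samples` the MinPts threshold described above (the values shown are illustrative, not the paper's tuned settings):

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_selected)

# Core points are listed in db.core_sample_indices_; the label -1
# marks noise points that belong to no cluster
n_noise = int((labels == -1).sum())
```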

2.4.3. Balanced Iterative Reducing and Clustering Utilizing Hierarchies

Balanced Iterative Reducing and Clustering utilizing Hierarchies (BIRCH) is a hierarchical clustering algorithm that constructs a cluster tree. Each node in the tree corresponds to a cluster of data points. The tree can be represented as a dendrogram, with the root being the cluster that includes all the data points and the leaves being the individual data points.
This algorithm initially condenses the data into a hierarchical structure known as the Clustering Feature Tree (CF Tree). This tree is constructed using a collection of clustering characteristics that condense the data points. The CF Tree is utilized to effectively partition the data points into subclusters in a hierarchical fashion.
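A corresponding scikit-learn sketch of BIRCH (parameter values illustrative): `threshold` bounds the radius of each leaf subcluster in the CF Tree, and `branching_factor` caps the number of subclusters per node:

```python
from sklearn.cluster import Birch

birch = Birch(n_clusters=3, threshold=0.5, branching_factor=50)
labels = birch.fit_predict(X_selected)
```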

2.4.4. Agglomerative

Agglomerative clustering is also a hierarchical clustering method that repeatedly combines the nearest pairs of clusters to create a hierarchy of clusters. The process begins by considering each data point as an individual cluster and then calculates the distance or similarity between every pair of clusters. The algorithm subsequently combines the two clusters that are closest to each other, forming a unified cluster, and adjusts the distance matrix accordingly to represent the updated clustering structure. This procedure is iterated until there is only one cluster left, or until a predetermined condition for stopping is satisfied. The outcome is a dendrogram, in which the terminal nodes symbolize the individual data points, while the internal nodes symbolize groups at various levels of the hierarchy. Agglomerative clustering is a method that does not necessitate the pre-specification of the number of clusters. It can identify clusters of different forms and sizes within the data.
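A scikit-learn sketch of agglomerative clustering, with a SciPy dendrogram of the full merge hierarchy (the linkage choice is illustrative; the paper does not state one):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Cut the hierarchy at three clusters using Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X_selected)

# Visualize the full merge tree as a dendrogram
dendrogram(linkage(X_selected, method="ward"))
plt.show()
```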

3. Workflow

The complete workflow presented in Figure 3 involves multiple essential steps that strive to discover the most effective subset of features for optimal clustering performance. The process commences with the importation of essential libraries and the dataset. Afterwards, the dataset is divided into training and testing sets to make it easier to evaluate the model.
Next, an initial population of individuals is created, with each individual representing a candidate subset of features. These subsets are evaluated using a range of clustering performance criteria, such as the silhouette score, Calinski–Harabasz score, Davies–Bouldin score, and Within Cluster Sum of Squares (WCSS), while the variance metric is used to evaluate the quality of the selected features. The performance measurements of each individual are recorded for reference, facilitating the comparison and selection of the best-performing subset.
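A sketch of how one candidate subset could be evaluated (our illustration; the clusterer, cluster count, and random seed are assumptions): cluster on the masked columns and report the four criteria, with WCSS taken from KMeans' `inertia_` attribute:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

def evaluate_subset(X, mask):
    """Cluster on the selected columns and return the four criteria;
    None signals a subset too small to cluster meaningfully."""
    Xs = X[:, mask.astype(bool)]
    if Xs.shape[1] == 0:
        return None
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(Xs)
    return {
        "silhouette": silhouette_score(Xs, labels),
        "calinski_harabasz": calinski_harabasz_score(Xs, labels),
        "davies_bouldin": davies_bouldin_score(Xs, labels),
        "wcss": km.inertia_,
    }
```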
The population of feature subsets is then improved by iterating over generations. This iteration continues until the stated maximum iteration count is reached, which in this case is eight. During each iteration, parents are chosen depending on their fitness, which is determined by their clustering performance. The population then undergoes one-point crossover to produce offspring, followed by mutations to introduce diversity.
Following each iteration, the newly generated population is evaluated; this iterative technique enables us to systematically investigate various feature subsets and improve them over successive generations. Upon reaching the maximum number of generations, the features of the most successful individual are extracted, and their corresponding measurements are displayed for evaluation.
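Wiring the earlier sketches together (names carried over from the code above; again an illustration, not the released pipeline), the whole workflow reduces to:

```python
def ga_fitness(individual):
    # Guide the search by silhouette score; unusable subsets score -inf
    metrics = evaluate_subset(X_train, individual)
    return metrics["silhouette"] if metrics else float("-inf")

best_mask = run_ga(ga_fitness, n_genes=X_train.shape[1],
                   pop_size=8, generations=8)
print("Selected features:", best_mask)
print("Metrics:", evaluate_subset(X_train, best_mask))
```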

4. Results

In this research work, various metrics are used to evaluate the performance of clustering algorithms. A brief explanation of each metric and its relevance is summarized below.

4.1. Performance Evaluation Parameters

The silhouette score is a metric that measures the degree of similarity between an object and its own cluster relative to other clusters. The range of values is from −1 to 1. A high value implies that the object is well matched to its own cluster and poorly matched to neighbouring clusters. It is a valuable measure for evaluating the quality of clustering when the true labels are unknown.
The Calinski–Harabasz score, sometimes referred to as the Variance Ratio Criterion, quantifies the ratio between the total dispersion among different clusters and the dispersion inside each individual cluster. A higher score signifies more distinct and well-defined clusters. This score is valuable for assessing the degree of compactness and distinctiveness of clusters. Greater scores suggest that the clusters exhibit high density and significant separation, which is advantageous for achieving effective clustering.
The Davies–Bouldin score quantifies the mean similarity between each cluster and its most similar cluster, with similarity defined as the ratio of within-cluster distances to between-cluster distances. Lower scores indicate superior clustering: a lower Davies–Bouldin score shows that the clusters are well separated and clearly distinguishable from one another.
The Within Cluster Sum of Squares (WCSS) is the sum of squared distances between each data point and the centroid of the cluster it belongs to. It quantifies cluster compactness: smaller WCSS values imply that the data points lie near their centroids, indicating tightly packed clusters.
Essentially, these metrics offer complementary views of clustering quality. A clustering algorithm that achieves a high silhouette score, a high Calinski–Harabasz score, a low Davies–Bouldin score, and a low WCSS is considered to produce superior clustering results, showing that the clusters are well-defined, compact, and well separated.
The variance score quantifies the dispersion of the datapoints from the average. A high variance score is an indicator that the selected features exhibit a broad spectrum of values.

4.2. Dataset

The dataset utilized in this study is a subset of the National Poll on Healthy Aging (NPHA) dataset, refined to construct and verify machine learning algorithms for forecasting the annual count of doctors visited by the respondents in a survey. The collection consists of records corresponding to elderly individuals who responded to the NPHA survey. The purpose of creating the National Poll on Healthy Aging dataset was to collect information about the health, healthcare, and health policy concerns that impact individuals in the United States who are 50 years old and above. Every row in the dataset corresponds to an individual survey respondent. There are a total of 14 health and sleep-related features available for inclusion in the clustering task and a total of 714 instances, which were split into train and test datasets with 614 and 100 instances, respectively. Table 1 presents the dataset and feature description. In our study, the dataset from Kaggle [32] has been utilized.
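A loading-and-splitting sketch with pandas and scikit-learn; the file name is a placeholder for the Kaggle download, and the target column follows the variable named in Section 1:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("NPHA-doctor-visits.csv")          # hypothetical file name
y = df["Number_of_Doctors_Visited"]                 # target variable
X = df.drop(columns=["Number_of_Doctors_Visited"]).to_numpy()  # 14 features

# 614 training and 100 test instances, as reported above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=614, test_size=100, random_state=42)
```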

4.3. Implementation Details

The proposed genetic algorithm with cluster modeling harnesses the power of Python 3.11.9 within the Anaconda environment, leveraging its rich ecosystem of libraries for data manipulation, machine learning, and visualization. Experiments were conducted on a system equipped with a 64-bit operating system and 8 GB of RAM, ensuring the algorithm’s performance and scalability. At the core of the implementation lie essential Python libraries, starting with Pandas, which facilitates structured data handling, including tasks like CSV file reading, missing value management, and data frame manipulation. Complementing Pandas is NumPy, providing efficient numerical array operations essential for various machine learning computations.
The scikit-learn library, often referred to as sklearn, is pivotal for machine learning tasks, offering a diverse range of algorithms and utilities. Among its functionalities, K-means clustering for unsupervised learning and metrics such as the silhouette score, Calinski–Harabasz score, and Davies–Bouldin score for assessing clustering performance stand out. Lastly, Matplotlib is employed for visualization, enabling the generation of informative plots to track clustering performance metrics across multiple generations of feature selection. Together, these libraries form a robust ecosystem for comprehensive exploration and analysis of clustering algorithms and feature selection strategies.
The genetic algorithm operates on a population of eight individuals, iterating through a maximum of eight generations. In each generation, 40% of the population, considered the elite, is preserved without mutation, while 20% of the genes in the remaining population are subject to mutation. The GA with the same settings was applied to all the mentioned clustering algorithms, of which the GA with KMeans++ clustering performed best.

4.4. Analysis

Table 2 displays the silhouette scores for several clustering methods and their corresponding versions augmented with a genetic algorithm (GA). Birch, DBSCAN, and agglomerative clustering demonstrate scores of 0.3816, 0.4653, and 0.2867, respectively, reflecting different levels of cluster overlap and separation. KMeans++ attains the highest score of 0.7284, indicating the presence of distinct clusters. When the GA is used for feature selection, enhancements are observed in all algorithms. The GA combined with DBSCAN produces a greatly improved silhouette score of 0.8844, suggesting a substantial improvement in cluster separation. The GA combined with KMeans++ demonstrates the strongest enhancement of all, with a score of 0.9473, emphasizing the efficacy of feature selection in improving clustering performance.
Furthermore, the evaluation of each clustering technique and its equivalent GA-enhanced variant is conducted using the Davies–Bouldin score. The Birch, DBSCAN, and agglomerative clustering algorithms obtain scores of 0.8433, 1.544, and 1.0995, respectively. These values indicate different levels of similarity among clusters, with larger scores suggesting worse quality clustering.
Kmeans++ achieves the lowest Davies–Bouldin score of 0.474, indicating superior clustering performance compared to the other algorithms. When the GA is used for feature selection, enhancements are observed in all algorithms. The genetic algorithm (GA) combined with the Kmeans++ algorithm obtains the lowest Davies–Bouldin score of 0.1061, which indicates a substantial improvement in cluster similarity and the presence of well-defined clusters.
Similarly, the Calinski–Harabasz (CH) score evaluates the quality of clustering by quantifying the ratio of dispersion between clusters to dispersion within clusters. This ratio indicates the extent to which clusters are distinct from each other in a dataset. Greater scores indicate clusters that are more distinct and tightly packed. Birch achieved a score of 68.67, indicating a moderate level of cluster separation but lower clarity compared to algorithms with higher scores. The DBSCAN algorithm achieved a score of 14.78, which suggests inadequate separation between clusters and the possibility of clusters overlapping. The agglomerative clustering algorithm achieved a score of 90.7, indicating enhanced but not yet perfect separation of clusters. Kmeans++ had superior performance compared to other methods, achieving a score of 397.46, which suggests the presence of distinct and closely packed clusters.
When utilized in conjunction with a genetic algorithm (GA) for the purpose of feature selection, all methods exhibited better CH scores, which signifies improved cluster separation and density. The GA with the Kmeans++ algorithm achieved the best score of 3709.51, indicating that the GA greatly enhanced clustering performance, leading to denser and more distinct clusters compared to other models.
Based on a range of evaluation indicators, Table 2 analyses the effectiveness of several algorithms on the NPHA dataset.
Figure 4, Figure 5 and Figure 6 present the graphs corresponding to the various performance parameters.
Also, the WCSS measure was computed for both the KMeans++ algorithm and the genetic algorithm with KMeans++. This measure quantifies the degree of compactness shown by clusters. The metric calculates the total of the squared distances between every data point and the centroid of its corresponding cluster. A decrease in WCSS signifies that the data points are in closer proximity to their respective centroids, indicating the presence of more tightly knit clusters.
Based on the data presented in Figure 6, Kmeans++ obtained a within cluster sum of squares (WCSS) value of 132.85. However, when the genetic algorithm (GA) was combined with Kmeans++, the clustering performance was greatly enhanced, resulting in a much lower WCSS value of 18.75. This reduction indicates that the genetic algorithm (GA) for feature selection improved Kmeans++ by producing more tightly packed and compact clusters. The reduced within cluster sum of squares (WCSS) indicates that the clusters generated by the genetic algorithm (GA) with Kmeans++ have data points that are in closer proximity to their respective centroids. This signifies an enhanced quality of clustering and potentially a more distinct separation between the clusters.
Based on these metrics, KMeans++ demonstrates superior clustering quality compared to the other clustering algorithms, resulting in more distinct and compact clusters.

5. Discussion

After the GA experiments were run for eight generations, we sorted the results of feature selection based on their “variance” score. In each generation, the features selected by the top four ranked individuals were listed. When this exercise was completed for all the generations, the features occurring most frequently were isolated; for KMeans++, these were as follows: Age, Gender, Trouble sleeping, Medication Keeps Patient from Sleeping, and Mental health. For this set of five features, clustering was performed, and the best performance metric was obtained with the feature set {Age, Medication Keeps Patient from Sleeping, Gender}.
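A small sketch of this tallying step; the nested structure `top4_per_generation` is assumed for illustration (per generation, the feature-name sets of the four top-ranked individuals, collected while the GA ran):

```python
from collections import Counter

# top4_per_generation: list (8 generations) of lists (top 4 individuals)
# of feature-name sets -- assumed structure, recorded during the GA run
counts = Counter(
    feature
    for generation in top4_per_generation
    for individual in generation
    for feature in individual
)
print(counts.most_common(5))
# e.g. Age, Gender, Trouble sleeping, Medication Keeps Patient from
# Sleeping, Mental health
```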

5.1. GA-KMeans ++ vs. Other GA-Based Clustering Algorithms

The GA with KMeans++ employs KMeans++ initialization, which selects balanced, well-dispersed centroids and thereby accelerates convergence towards superior solutions. The other GA-based clustering variants rely on less effective initialization strategies, leading to slower convergence and difficulty in locating optimal solutions.
In GA-KMeans++, genetic operations such as selection, crossover, and mutation refine the centroids obtained via KMeans++ initialization. This improves cluster assignments and the overall quality of clustering. Alternative GA-based clustering methods may struggle to traverse the search space or optimize cluster assignments effectively, producing weaker clustering outcomes.
The GA with KMeans++ attains excellent clustering performance metrics, including the silhouette score, Davies–Bouldin score, and Calinski–Harabasz score, signifying improved inter-cluster separation and compact, well-separated clusters. The alternative GA-based clustering methods attain lower values on these metrics, indicating less effective or less well-separated clusterings.
In comparison to alternative GA-based clustering algorithms, the GA with KMeans++ sets itself apart through a more resilient initialization strategy, a more efficient exploration of the search space, and superior clustering performance metrics. These benefits make it the more desirable choice for clustering tasks among the GA-based approaches considered.

5.2. KMeans++ vs. GA-KMeans++

The main differences between KMeans++ and GA-KMeans++ lie in their starting procedures and in how centroids are refined. KMeans++ uses smart initialization to choose centroids far apart, improving convergence, though it may still settle on suboptimal solutions. The GA with KMeans++ uses the same initialization approach to balance and evenly distribute centroids, and additionally relies on genetic operations of selection, crossover, and mutation to improve those centroids. Better cluster assignments and clustering quality result from this method. In contrast, KMeans++ alone does not use genetic operations, which may slow convergence and limit clustering performance.
The evaluation metrics of the GA with KMeans++ demonstrate its benefits. It achieves excellent silhouette, Davies–Bouldin, and Calinski–Harabasz scores, showing enhanced inter-cluster separation and cluster compactness. KMeans++ performs worse on these metrics, implying poorer clustering. The GA with KMeans++ offers better clustering performance, resilient initialization, and efficient search space exploration; it outperforms KMeans++ and other genetic algorithm-based clustering methods due to these advantages.

5.3. Insight into the Clustering Based on Features Selected by the Best-Performing Algorithms

From Table 2, it is seen that the GA with KMeans++ is the best-performing algorithm. The best performance metric was obtained for KMeans++ with the feature set {Age, Medication Keeps Patient from Sleeping, Gender}. Using training data, three clusters were formed with 323, 271, and 20 data points, respectively. The mean data points of the three clusters and the point nearest to the centroids, called the representative points for each cluster, are presented in Table 3.
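A sketch of how the representative points can be recovered from a fitted model (names carried over from the earlier sketches):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# For each centroid, pick the training instance closest to it
rep_idx = pairwise_distances_argmin(kmeans.cluster_centers_, X_selected)
representatives = X_selected[rep_idx]

cluster_sizes = np.bincount(kmeans.labels_)   # e.g. [323, 271, 20]
```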
Table 3 provides some general observations for patients across all clusters; however, a deeper insight into patient behavior can be obtained cluster-wise. The cluster-wise observations for the remaining features are presented in Table 4.
For cluster 1, physical health varied from very good to good, while mental and dental health were very good in most cases. This cluster had a balanced count of male and female patients aged between 50 and 64 who had some trouble sleeping. Cluster 2 contains more females aged 65–80, with good to fair physical health, very good mental health, and good dental health; they also have some trouble sleeping. Finally, cluster 3 is balanced between males and females aged 65–80, with average physical and dental health, good mental health, and some trouble sleeping.

5.4. Health-Related Recommendations for All Clusters

From Section 5.3, it can be observed that across all clusters, most of the patients are retired, and patients’ sleep is generally unaffected by pain, medication, or unknown factors. Further, approximately 50% of the patients have disturbed sleep due to bathroom needs, and only a few patients use sleep medication prescriptions. Based on these observations, we feel that general recommendations, such as regular health check-ups, moderate physical activity, regulating fluid intake after sunset, and good sleep hygiene practices, would help in achieving healthy aging. Additional recommendations, cluster-wise, are as follows:
  • For patients falling in cluster 1, to maintain dental health, regular dental check-ups are advised. Activities promoting mental health such as social engagements and hobbies should be included in regular routines. Even though sleep issues are minor, the patients can be advised to consider lifestyle adjustments like sleep environment improvements and relaxation techniques before bed.
  • Since there are more female patients in cluster 2, gentle exercise and physical therapy programs can be introduced to move patients from fair to good physical health. Gender-specific programs may be developed, tailored to older females, addressing specific health issues like osteoporosis and cardiovascular health. Also, for this cluster of patients, activities promoting mental health, such as social engagements and hobbies, may be included in the regular routine. Advice regarding improving dental health by regular dental check-ups and lifestyle changes to improve sleep quality may be given.
  • Cluster 3 comprises a balance of males and females. Activities such as tailored fitness classes and nutritional guidance to improve physical health should be promoted. To improve dental health, regular dental check-ups and education on oral hygiene is advised. Social interactions, mental health workshops, and mental stimulation activities can help patients in this cluster maintain good mental health.
We believe that by addressing both the common and specific health needs of these clusters, we can enhance the quality of life for retired patients, ensuring they remain healthy and active as they age.

6. Conclusions

In this research work, the impact of the feature selection technique on the NPHA dataset is comprehensively evaluated. Our experiments show that the GA with KMeans++ attains exceptional clustering performance metrics, including the silhouette score, Davies–Bouldin score, and Calinski–Harabasz score, indicating that improved inter-cluster separation and more compact clusters have been attained. This answers RQ1 and demonstrates the efficacy of the GA in enhancing clustering performance. The answer to RQ2 is that the GA with KMeans++ is most effective when applied to the NPHA dataset. Finally, the graphs in Figure 7 answer RQ3.
It can be seen from Figure 7 that the iterative processes of the GA indeed have the potential to enhance clustering performance across numerous generations. Finally, we would like to end our discussion with two open questions:
  • Q1: When genetic algorithms are used to select features, which clustering performance metric is most indicative of successful clustering outcomes, thus indicating the best parameters for healthy aging?
  • Q2: How does the feature selection technique contribute to enhancing performance parameters?
We would like to take up, as part of our ongoing work, more optimization techniques and further evaluation of clusters for successful clustering outcomes. We believe that studies of the kind presented in this research can play a vital role in predictive healthcare analytics and intend to continue our pursuit in this direction. From the clustering obtained with the best performance metric, it is also seen that patients in clusters 1 and 2 had visited one to three doctors in a year, while those in cluster 3 had visited four or more. As an outcome of our study, when a new patient’s data are received, the patient can be assigned a cluster based on the similarity of their features to the existing cluster means, as sketched below. Based on this, general and cluster-specific recommendations can be given to the patient and/or the caregiver to improve their overall health.
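A sketch of that assignment step; the feature values are hypothetical, and `kmeans` is the model fitted on {Age, Medication Keeps Patient from Sleeping, Gender} as above:

```python
# A hypothetical new patient: age group 2 (65-80), medication does not
# keep them from sleeping (0), gender 1 (male)
new_patient = [[2, 0, 1]]
cluster = int(kmeans.predict(new_patient)[0])
print(f"Assigned to cluster {cluster + 1}; issue that cluster's advice")
```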
Subsequently, conclusions can be drawn regarding the unknown parameters. Insurance or health agencies can draw inferences about the expenditures involved in imparting healthcare. Personalized care plans can be developed that take into account individual health statuses and preferences. To improve overall well-being and health, community support groups or clubs that promote social interaction may be established, reducing feelings of isolation.

Author Contributions

Conceptualization, V.B. and A.P.; methodology, M.G.; software, M.G. and S.K.; validation, V.B., A.P. and S.K.; formal analysis, V.B.; investigation, S.K.; resources, K.K. and M.G.; data curation, M.G.; writing—original draft preparation, K.K.; writing—review and editing, V.B., A.P. and M.G.; visualization, S.K.; supervision, V.B.; project administration, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Available at Kaggle.com (accessed on 11 January 2024).

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable suggestions, which greatly helped in preparing the manuscript in its present form.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bhattacharjee, V.; Priya, A.; Kumari, N.; Anwar, S. DeepCOVNet Model for COVID-19 Detection Using Chest X-ray Images. Wirel. Pers. Commun. 2023, 130, 1399–1416. [Google Scholar] [CrossRef] [PubMed]
  2. Foo, A.; Hsu, W.; Lee, M.L.; Tan, G.S. DP-GAT: A Framework for Image-based Disease Progression Prediction. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022. [Google Scholar]
  3. Nandy, J.; Hsu, W.; Lee, M.L. An Incremental Feature Extraction Framework for Referable Diabetic Retinopathy Detection. In Proceedings of the IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI), San Jose, CA, USA, 6–8 November 2016. [Google Scholar]
  4. Mishra, A.; Jha, R.; Bhattacharjee, V. SSCLNet: A Self-Supervised Contrastive Loss-Based Pre-Trained Network for Brain MRI Classification. IEEE Access 2023, 11, 6673–6681. [Google Scholar] [CrossRef]
  5. Kumari, N.; Anwar, S.; Bhattacharjee, V.; Sahana, S.K. Visually evoked brain signals guided image regeneration using GAN variants. Multimed. Tools Appl. 2023, 82, 32259–32279. [Google Scholar] [CrossRef]
  6. Jha, R.; Bhattacharjee, V.; Mustafi, A. Increasing the Prediction Accuracy for Thyroid Disease: A Step Towards Better Health for Society. Wirel. Pers. Commun. 2021, 122, 1921–1938. [Google Scholar] [CrossRef]
  7. Bhattacharjee, V.; Priya, A.; Prasad, U. Evaluating the Performance of Machine Learning Models for Diabetes Prediction with Feature Selection and Missing Values Handling. Int. J. Microsyst. IoT 2023, 1. Available online: https://www.ijmit.org/Photo/IJMIT20230028R1.pdf (accessed on 11 January 2024).
  8. Singh, S.; Aditi Sneh, A.; Bhattacharjee, V. A Detailed Analysis of Applying the K Nearest Neighbour Algorithm for Detection of Breast Cancer. Int. J. Theor. Appl. Sci. 2021, 13, 73–78. [Google Scholar]
  9. Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
  10. Jahwar, A.F.; Abdulazeez, A.M. Meta-heuristic algorithms for K-means clustering: A review. PalArch’s J. Archaeol. Egypt/Egyptol. 2020, 17, 12002–12020. [Google Scholar]
  11. Huang, J. Design of Tourism Data Clustering Analysis Model Based on K-Means Clustering Algorithm. In International Conference on Multi-Modal Information Analytics; Springer: Cham, Switzerland, 2022; pp. 373–380. [Google Scholar]
  12. Yuan, C.; Yang, H. Research on K-value selection method of K-means clustering algorithm. J 2019, 2, 226–235. [Google Scholar] [CrossRef]
  13. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210. [Google Scholar] [CrossRef]
  14. Yang, Z.; Jiang, F.; Yu, X.; Du, J. Initial Seeds Selection for K-means Clustering Based on Outlier Detection. In Proceedings of the 2022 5th International Conference on Software Engineering and Information Management (ICSIM), Yokohama, Japan, 21–23 January 2022; pp. 138–143. [Google Scholar]
  15. Han, M. Research on optimization of K-means Algorithm Based on Spark. In Proceedings of the 2023 IEEE 6th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 24–26 February 2023; pp. 1829–1836. [Google Scholar]
  16. Suryanarayana, G.; Swapna, N.; Bhaskar, T.; Kiran, A. Optimizing K-Means Clustering using the Artificial Firefly Algorithm. Int. J. Intell. Syst. Appl. Eng. 2023, 11, 461–468. [Google Scholar]
  17. Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable k-means++. arXiv 2012, arXiv:1203.6402. [Google Scholar] [CrossRef]
  18. Dinh, D.; Huynh, V.; Sriboonchitta, S. Clustering mixed numerical and categorical data with missing values. Inf. Sci. 2021, 571, 418–442. [Google Scholar] [CrossRef]
  19. Crase, S.; Thennadil, S.N. An analysis framework for clustering algorithm selection with applications to spectroscopy. PLoS ONE 2022, 17, e0266369. [Google Scholar] [CrossRef]
  20. Zheng, W.; Jin, M. Improving the Performance of Feature Selection Methods with Low-Sample-Size Data. Comput. J. 2023, 66, 1664–1686. [Google Scholar] [CrossRef]
  21. Pullissery, Y.H.; Starkey, A. Application of Feature Selection Methods for Improving Classification Accuracy and Run-Time: A Comparison of Performance on Real-World Datasets. In Proceedings of the 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 4–6 May 2023; pp. 687–694. [Google Scholar] [CrossRef]
  22. Tabianan, K.; Velu, S.; Ravi, V. K-means clustering approach for intelligent customer segmentation using customer purchase behavior data. Sustainability 2022, 14, 7243. [Google Scholar] [CrossRef]
  23. Ghezelbash, R.; Maghsoudi, A.; Shamekhi, M.; Pradhan, B.; Daviran, M. Genetic algorithm to optimize the SVM and K-means algorithms for mapping of mineral prospectivity. Neural Comput. Appl. 2023, 35, 719–733. [Google Scholar] [CrossRef]
  24. El-Shorbagy, M.A.; Ayoub, A.Y.; Mousa, A.A.; Eldesoky, I. An enhanced genetic algorithm with new mutation for cluster analysis. Comput. Stat. 2019, 34, 1355–1392. [Google Scholar] [CrossRef]
  25. Albadr, M.A.; Tiun, S.; Ayob, M.; AL-Dhief, F. Genetic Algorithm Based on Natural Selection Theory for Optimization Problems. Symmetry 2020, 12, 1758. [Google Scholar] [CrossRef]
  26. Zubair, M.; Iqbal, M.A.; Shil, A.; Chowdhury, M.; Moni, M.A.; Sarker, I.H. An improved K-means clustering algorithm towards an efficient data-driven modeling. Ann. Data Sci. 2022, 9, 1–20. [Google Scholar] [CrossRef]
  27. Al Shaqsi, J.; Wang, W. Robust Clustering Ensemble Algorithm. SSRN Electron. J. 2022. Available online: https://www.researchgate.net/publication/365606528_Robust_Clustering_Ensemble_Algorithm (accessed on 11 January 2024). [CrossRef]
  28. Yu, H.; Wen, G.; Gan, J.; Zheng, W.; Lei, C. Self-paced learning for k-means clustering algorithm. Pattern Recognit. Lett. 2018, 132, 69–75. [Google Scholar] [CrossRef]
  29. Sajidha, S.; Desikan, K.; Chodnekar, S.P. Initial seed selection for mixed data using modified k-means clustering algorithm. Arab. J. Sci. Eng. 2020, 45, 2685–2703. [Google Scholar] [CrossRef]
  30. Hua, C.; Li, F.; Zhang, C.; Yang, J.; Wu, W. A Genetic XK-Means Algorithm with Empty Cluster Reassignment. Symmetry 2019, 11, 744. [Google Scholar] [CrossRef]
  31. Gupta, M.; Rajnish, K.; Bhattacharjee, V. Software fault prediction with imbalanced datasets using SMOTE-Tomek sampling technique and Genetic Algorithm models. Multimed. Tools Appl. 2023, 83, 47627–47648. [Google Scholar] [CrossRef]
  32. National Poll on Healthy Aging (NPHA) Dataset. Available online: https://www.kaggle.com/ (accessed on 11 January 2024).
Figure 1. Abstract view of the proposed methodology.
Figure 2. A fundamental operational premise of a genetic algorithm.
Figure 3. Workflow of the proposed model.
Figure 4. Graph displaying the silhouette scores and Davies–Bouldin scores for all clustering algorithms.
Figure 5. Graph displaying the Calinski–Harabasz scores for all clustering algorithms.
Figure 6. Graph showing the WCSS metrics for KMeans++ and GA-KMeans++.
Figure 7. Graph representing the score of GA-KMeans++ clustering for 8 generations.
Table 1. NPHA Dataset Description.

Feature | Type | Description
Age | Categorical | The patient’s age group = {1: 50–64; 2: 65–80}
Physical Health | Categorical | A self-assessment of the patient’s physical well-being = {−1: Refused; 1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor}
Mental Health | Categorical | A self-evaluation of the patient’s mental or psychological health = {−1: Refused; 1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor}
Dental Health | Categorical | A self-assessment of the patient’s oral or dental health = {−1: Refused; 1: Excellent; 2: Very Good; 3: Good; 4: Fair; 5: Poor}
Employment | Categorical | The patient’s employment status or work-related information = {−1: Refused; 1: Working full-time; 2: Working part-time; 3: Retired; 4: Not working at this time}
Stress Keeps Patient from Sleeping | Categorical | Whether stress affects the patient’s ability to sleep = {0: No; 1: Yes}
Medication Keeps Patient from Sleeping | Categorical | Whether medication impacts the patient’s sleep = {0: No; 1: Yes}
Pain Keeps Patient from Sleeping | Categorical | Whether physical pain disturbs the patient’s sleep = {0: No; 1: Yes}
Bathroom Needs Keeps Patient from Sleeping | Categorical | Whether the need to use the bathroom affects the patient’s sleep = {0: No; 1: Yes}
Unknown Keeps Patient from Sleeping | Categorical | Unidentified factors affecting the patient’s sleep = {0: No; 1: Yes}
Trouble Sleeping | Categorical | General issues or difficulties the patient faces with sleeping = {−1: Refused; 1: No; 2: Mild; 3: Yes}
Prescription Sleep Medication | Categorical | Information about any sleep medication prescribed to the patient = {−1: Refused; 1: Use regularly; 2: Use occasionally; 3: Do not use}
Race | Categorical | The patient’s racial or ethnic background = {−2: Not asked; −1: Refused; 1: White, Non-Hispanic; 2: Black, Non-Hispanic; 3: Other, Non-Hispanic; 4: Hispanic; 5: 2+ Races, Non-Hispanic}
Gender | Categorical | The gender identity of the patient = {−2: Not asked; −1: Refused; 1: Male; 2: Female}
Number of Doctors Visited (target variable) | Categorical | The total count of different doctors the patient has seen = {1: 0–1 doctors; 2: 2–3 doctors; 3: 4 or more doctors}
Table 2. Performance evaluation metrics for the NPHA dataset.

Model | Silhouette Score | Davies–Bouldin Score | Calinski–Harabasz Score
Birch | 0.3816 | 0.8433 | 68.67
DBSCAN | 0.4653 | 1.544 | 14.78
Agglomerative (Agg) | 0.2867 | 1.0995 | 90.7
KMeans++ | 0.7284 | 0.474 | 397.46
GA with Birch (GA-B) | 0.6497 | 0.6024 | 229.007
GA with DBSCAN (GA-DB) | 0.8844 | 1.2082 | 140.69
GA with agglomerative (GA-Agg) | 0.7044 | 0.546 | 283.24
GA with KMeans++ (GA-KM++) | 0.9473 | 0.1062 | 3709.51
Table 3. Cluster means and representative datapoints for all of the features.

Feature | Cluster 1 Mean | Cluster 1 Rep. Point | Cluster 2 Mean | Cluster 2 Rep. Point | Cluster 3 Mean | Cluster 3 Rep. Point
Age | 1 | 1 | 2 | 2 | 2 | 2
Physical health | 2.37 | 2 | 3.01 | 3 | 3.37 | 3
Mental health | 1.61 | 2 | 2.17 | 2 | 2.51 | 2
Dental health | 2.16 | 2 | 3.28 | 3 | 4.25 | 4
Employment | 2.76 | 3 | 2.82 | 3 | 2.86 | 3
Stress keeps patient from sleeping | 0.23 | 0 | 0.34 | 0 | 0.24 | 0
Medication keeps patient from sleeping | 0.04 | 0 | 0.06 | 0 | 0.07 | 0
Pain keeps patient from sleeping | 0.17 | 0 | 0.21 | 0 | 0.28 | 0
Bathroom needs keeps patient from sleeping | 0.5 | 1 | 0.52 | 1 | 0.5 | 1
Unknown keeps patient from sleeping | 0.41 | 0 | 0.37 | 0 | 0.42 | 0
Trouble sleeping | 2.46 | 3 | 2.41 | 3 | 2.3 | 2
Prescription sleep medication | 2.85 | 3 | 2.88 | 3 | 2.76 | 3
Race | 1.1 | 1 | 4.1 | 4 | 1.12 | 1
Gender | 1.55 | 2 | 1.61 | 2 | 1.52 | 2
Table 4. Cluster-wise description of features.

Feature | Cluster 1 | Cluster 2 | Cluster 3
Age | 50–64 | 65–80 | 65–80
Physical health | Between very good and good | Between good and fair | Good but more towards fair
Mental health | Mostly very good | Mostly very good | Good
Dental health | Very good | Good | Fair
Trouble sleeping | Mild to yes | Mild to yes | Mild to yes
Gender | Balanced male and female | More female | Balanced male and female
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Cite as: Kouser, K.; Priyam, A.; Gupta, M.; Kumar, S.; Bhattacharjee, V. Genetic Algorithm-Based Optimization of Clustering Algorithms for the Healthy Aging Dataset. Appl. Sci. 2024, 14, 5530. https://doi.org/10.3390/app14135530
