1. Introduction
A galaxy is a vast and intricate system composed of stars and interstellar matter. Galaxy databases are complex and dynamic, and working with them effectively requires considerable effort. A repository of galaxy data covers many aspects of galaxies, including their morphological characteristics, photometric properties, spectral attributes, and more. While substantial research has been conducted on these specific aspects, a comprehensive exploration of the “physical properties” of galaxies remains relatively uncharted territory.
Statisticians and physicists concur that multivariate techniques are the most suitable approach for deriving meaningful insights from these astronomical databases. Among the partitioning techniques widely used in multivariate statistics, the k-means and k-medoids methods are notable contenders. A heuristic comparison between these partitioning techniques can illuminate their relative strengths, particularly with respect to the percentage of misclassification, given an assumed optimal number of clusters for this category of astronomical data. The dataset was assembled by Ogando et al. in 2008 [1,2] and comprises a set of parameters of paramount significance for our study. We have further enriched the dataset with supplementary parameters from the Hyperleda database.
2. Materials and Methods
2.1. Missing Value Imputations
To address missing values in the galaxy dataset, we employed the multiple imputation technique known as Predictive Mean Matching (PMM). For each missing entry, PMM computes the predicted value of the target variable Y from the specified imputation model, identifies observed cases whose predicted values are closest to it, and imputes the observed value of one of these similar cases. Because every imputed value is taken from the observed data, the original data distribution and relationships are preserved.
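The matching idea can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the procedure used in this study: it uses a single hypothetical predictor `xs`, an ordinary least-squares line as the imputation model, and produces one imputation only (full PMM also draws the regression parameters at random and repeats the imputation several times). The function names and the toy data are illustrative.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(xs, ys, n_donors=3, seed=0):
    """Fill None entries of ys with observed values borrowed from donors."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    # Predicted mean of every observed case, paired with its observed value.
    donors = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = intercept + slope * x
        # Donor pool: observed cases whose predicted mean is closest to target.
        pool = sorted(donors, key=lambda d: abs(d[0] - target))[:n_donors]
        # Impute the observed value of one randomly chosen donor.
        completed.append(rng.choice(pool)[1])
    return completed

# Toy data (made up for illustration): two values of y are missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, 9.8, None]
filled = pmm_impute(xs, ys)
```

Every entry of `filled` that replaces a `None` is one of the observed values of `ys`, which is what keeps the imputed data on the scale and distribution of the real measurements.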
2.2. Choice of Optimal Clusters
2.2.1. Elbow Plot
The Distortion Plot (elbow) method is a widely used technique for determining the optimal number of partitions, often denoted ‘k’, into which the data can be divided. For each candidate k, it computes the average of the squared distances from the data points to the centers of their partitions. The optimal number of clusters is then read off the plot of this quantity against k as a distinct ‘elbow-like’ point [3].
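As a small illustration of the quantity behind the elbow plot, the Python sketch below computes the average squared distance to the nearest center for k = 1, 2, 3 on a toy dataset. The data points and the centers are hypothetical and specified by hand; in practice the centers for each k would come from running the clustering algorithm.

```python
def distortion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total / len(points)

# Two well-separated toy groups in two dimensions.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]

# Hand-picked centers for k = 1, 2, 3 (an assumption for this sketch).
centers_by_k = {
    1: [(5.0, 5.0)],
    2: [(1.5, 1.5), (8.5, 8.5)],
    3: [(1.5, 1.5), (8.0, 8.5), (9.0, 8.5)],
}

curve = {k: distortion(points, c) for k, c in centers_by_k.items()}
# Improvement gained by moving from k - 1 to k clusters.
drops = {k: curve[k - 1] - curve[k] for k in (2, 3)}
```

Here the distortion falls from 25.0 at k = 1 to 0.5 at k = 2 and only to 0.375 at k = 3, so the curve bends sharply at k = 2, which is the elbow for this toy data.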
2.2.2. Dunn Index
The Dunn Index is a metric used to evaluate the quality of clustering results in unsupervised machine learning [4]. It helps assess the separation between clusters and the compactness of data points within each cluster.
3. Formula
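The Dunn Index is calculated using the following formula, given here in its standard form:
$$D = \frac{\min_{1 \le i < j \le k} \delta(C_i, C_j)}{\max_{1 \le l \le k} \Delta(C_l)}$$
where:
$k$ is the number of clusters.
$\delta(C_i, C_j)$ is the inter-cluster distance between clusters $C_i$ and $C_j$, commonly the smallest distance between a point of $C_i$ and a point of $C_j$.
$\Delta(C_l)$ is the diameter of cluster $C_l$, commonly the largest distance between two of its points.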
A high Dunn Index indicates a good clustering solution: the clusters are well separated (large inter-cluster distances) and compact (small intra-cluster distances).
Conversely, a low Dunn Index implies that clusters are either too close to each other (poor separation) or that data points within clusters are too spread out (low compactness).
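A minimal Python sketch of the index is given below. It assumes one common choice of distances, which may differ from the variant used in this study: separation is the smallest Euclidean distance between points of different clusters, and compactness is the largest Euclidean distance between two points of the same cluster. The function name and the toy clusters are illustrative.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest cluster diameter.

    `clusters` is a list of clusters, each a list of coordinate tuples.
    At least one cluster must contain two or more points.
    """
    # Separation: minimum distance between points of different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Compactness: maximum distance between two points of the same cluster.
    max_diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_separation / max_diameter

# Compact clusters that lie far apart give a high index.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
# Stretched clusters that lie close together give a low index.
loose = [[(0.0, 0.0), (0.0, 4.0)], [(2.0, 0.0), (2.0, 4.0)]]

dunn_tight = dunn_index(tight)
dunn_loose = dunn_index(loose)
```

For these toy clusters the index is 5.0 for `tight` and 0.5 for `loose`, so the better-separated, more compact partition receives the higher value.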
3.1. Clustering (Partitioning) Algorithms and Discriminant Analysis
Clustering is a method that involves categorizing individuals with diverse characteristics based on their similarities or dissimilarities. In this study, several renowned algorithms have been employed, including the following:
3.1.1. K-Means
K-means clustering is a popular unsupervised machine learning technique used for data clustering and segmentation. It is a simple yet effective algorithm for partitioning a dataset into K distinct, non-overlapping clusters. The goal is to group similar data points together based on their feature similarity.
3.1.2. Algorithm
The k-means algorithm works as follows (Algorithm 1) [3]:
Algorithm 1 k-means Clustering
Step 1: Initialize K cluster centroids randomly.
Step 2: Assign each data point to the nearest centroid.
Step 3: Recalculate the centroids as the mean of the data points in each cluster.
Step 4: Repeat steps 2 and 3 until convergence (centroids no longer change significantly).
k-means clustering is a versatile and straightforward technique for clustering data. It is easy to implement and can be applied in various domains; here we use it for the classification and clustering of galaxies, discovering hidden patterns and grouping similar data points together.
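The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the software used in this study; the function name, the `max_iter` and `seed` parameters, the choice of initial centroids from among the data points, and the toy data are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of four points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0)]
labels, centroids = kmeans(points, k=2)
```

On this toy data the two groups are recovered and the centroids settle at the group means, (1.5, 1.5) and (8.5, 8.5). Note that the centroids are averages and need not coincide with any observed point.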
3.1.3. K-Medoids
We use k-medoids as a second algorithm, for comparison with k-means. The method is as follows:
Step 1: Initialize k medoids randomly.
Step 2: Assign each data point to the nearest medoid.
Step 3: For each cluster, select as the new medoid the data point that minimizes the total distance to the other points in the same cluster.
Step 4: Repeat steps 2 and 3 until convergence.
k-medoid clustering is a valuable technique for partitioning data into meaningful clusters. It is particularly useful when dealing with noisy or non-linear data.
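The alternating procedure described above can be sketched in Python as follows. This is a minimal illustration with Euclidean distance and the standard library only, not the software used in this study; the function name, the `max_iter` and `seed` parameters, and the toy data are assumptions made for the example.

```python
import math
import random

def kmedoids(points, k, max_iter=100, seed=0):
    """Alternating k-medoids on a list of distinct coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial medoids.
    medoids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest medoid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, medoids[j]))
            for p in points
        ]
        # Step 3: in each cluster, the new medoid is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                # Defensive: an empty cluster keeps its previous medoid.
                new_medoids.append(medoids[j])
                continue
            new_medoids.append(
                min(members,
                    key=lambda c: sum(math.dist(c, q) for q in members))
            )
        # Step 4: stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids

# Toy data: two well-separated groups, each with one central point.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (1.5, 1.5),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0), (8.5, 8.5)]
labels, medoids = kmedoids(points, k=2)
```

Unlike a k-means centroid, each medoid returned here is one of the observed data points; on this toy data the medoids are the central points (1.5, 1.5) and (8.5, 8.5).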
3.1.4. The Linear Discriminant Analysis (LDA)
The primary objective of LDA is to find a linear combination of features that best separates two or more classes in a dataset. It aims to maximize the between-class variance while minimizing the within-class variance [5]. In LDA, key concepts include:
Scatter matrices: Within-class and between-class scatter matrices.
Eigenvectors and eigenvalues: Used to find the optimal linear transformation.
Decision boundaries: Separating classes based on discriminant functions.
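These concepts can be illustrated with a short Python sketch of the two-class special case, where the discriminant direction has the closed form w = S_w^{-1}(m_1 - m_0) and no eigen-decomposition is needed. This is a sketch under stated assumptions, not the analysis performed in this study: the helper names, the midpoint threshold, and the toy data are illustrative, and the final error count is only one plausible way of measuring misclassification.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length tuples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def within_scatter(groups):
    """Within-class scatter: sum over classes of (x - m)(x - m)^T."""
    d = len(groups[0][0])
    s_w = [[0.0] * d for _ in range(d)]
    for rows in groups:
        m = mean_vector(rows)
        for x in rows:
            diff = [a - b for a, b in zip(x, m)]
            for i in range(d):
                for j in range(d):
                    s_w[i][j] += diff[i] * diff[j]
    return s_w

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fisher_lda(class0, class1):
    """Return the discriminant direction w and a decision threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s_w = within_scatter([class0, class1])
    w = solve(s_w, [b - a for a, b in zip(m0, m1)])
    # Decision boundary: halfway between the projected class means.
    threshold = 0.5 * (dot(w, m0) + dot(w, m1))
    return w, threshold

def predict(x, w, threshold):
    """Class 1 if the projection exceeds the threshold, else class 0."""
    return 1 if dot(w, x) > threshold else 0

# Toy data (made up): two classes described by two features.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
w, threshold = fisher_lda(class0, class1)

# Count training points that the discriminant assigns to the wrong class.
labelled = [(x, 0) for x in class0] + [(x, 1) for x in class1]
errors = sum(predict(x, w, threshold) != y for x, y in labelled)
```

For this toy data the direction is w = (1.25, 1.25), the threshold is 11.25, and no training point is misclassified. With cluster labels in place of the class labels, the same count divided by the sample size gives a misclassification percentage.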
4. Results
Astronomy generates complex datasets, especially for galaxies. k-means and k-medoids are vital for:
Classification: Grouping galaxies by attributes.
Structure Detection: Identifying cosmic structures.
Outlier Detection: Finding rare celestial objects.
Dimensionality Reduction: Simplifying data for analysis.
These techniques help astronomers unveil patterns, understand celestial structures, and explore the universe’s mysteries.
According to the techniques used to find the optimal number of clusters, the Elbow plot and the Dunn Index, the optimal number is 4 for k-means and 3 for k-medoids. The Elbow plots and the values of the Dunn Index are given in Table 1, Figure 1 and Figure 2.
The clusters formed by k-means and k-medoids, taking the optimal number of clusters to be 3 and 4, are shown in Figure 3, Figure 4, Figure 5 and Figure 6.
5. Conclusions
From the results and findings of this work, we observe four distinct clusters of galaxies in the local-universe sample of Ogando et al. (2008), based on their collective physical characteristics. The approximate mean values of the parameters in those robust clusters are also included in the study; they give a heuristic idea of the physical characteristics of a newly observed galaxy, provided it falls into one of the three robust clusters. Additionally, the low misclassification in the data indicates the high accuracy of the clustering. From the misclassification observed for a given number of clusters (k = 3 and k = 4), it can be inferred that k-means performs better than k-medoids on this category of galaxy data. The misclassification at the optimal number of clusters for k-means (k = 4) and k-medoids (k = 3) likewise serves as a reasonable indication of the superiority of the k-means algorithm over k-medoids for galaxy data.
Author Contributions
Conceptualization, P.G. and S.C.; methodology, P.G. and S.C.; software, P.G.; formal analysis, S.C. and P.G.; investigation, P.G.; resources, S.C.; data curation, S.C. and P.G.; writing—original draft preparation, P.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
We selected the sample of 509 early-type galaxies in the local Universe of Ogando et al. (2008). To describe the galaxies, we took from Ogando et al. (2008) [1,6].
Acknowledgments
We are thankful to our Department for their constant support and motivation, resulting in the successful completion of this work.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Davoust, D.; Fraix-Burnet, T.; Chattopadhyay, A.K.; Chattopadhyay, E.; Thuillard, M. A six-parameter space to describe galaxy diversification. Astron. Astrophys. 2012, 545, A80.
- Nigoche-Netro, A.; Aguerri, J.A.L.; Lagos, P.; Ruelas-Mayorga, A.; Sánchez, L.J.; Muñoz-Tuñón, C.; Machado, A. The intrinsic dispersion in the Faber-Jackson relation for early-type galaxies as function of the mass and redshift. Astron. Astrophys. 2011, 534, A61.
- Ghosh, P.; Chakraborty, S. Classification and Distributional properties of Gamma Ray Bursts. In Proceedings of the 16th International Conference MSAST, Online, 21–23 December 2022; Volume 11, p. 148.
- Dunn, J.C. Well-Separated Clusters and Optimal Fuzzy Partitions. J. Cybern. 1974, 4, 95–104.
- Ghosh, P.; Chakraborty, S. Spectral Classification of Quasar Subject to Redshift: A Statistical Study. Comput. Sci. Math. Forum 2023, 7, 43.
- Guy, W.; Ottaviani, D.L. Hγ and Hδ absorption features in stars and stellar populations. Astrophys. J. Suppl. Ser. 1997, 111, 377.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).