Model Selection Using K-Means Clustering Algorithm for the Symmetrical Segmentation of Remote Sensing Datasets
Abstract
1. Introduction
- Distribute the data into disjoint/asymmetrical groups;
- At least one instance should be present in each cluster;
- Items within each cluster are homogeneous or symmetrical, while, externally, they are heterogeneous or asymmetrical.
2. Materials and Methods
2.1. The k-Means Algorithm
1. Select $k$ initial cluster center points randomly, such as $z_1(1), z_2(1), \ldots, z_k(1)$.
2. At the $n$th iteration, assign the data points $\{x\}$ to the $k$ clusters using the relation $x \in S_j(n)$ if $\|x - z_j(n)\| < \|x - z_i(n)\|$ $\forall$ (for all) $i = 1, 2, \ldots, k$, $i \neq j$, where $S_j(n)$ denotes the group of data points whose center is $z_j(n)$.
3. Recalculate the locations of the new cluster centers $z_j(n+1)$ so that the sum of squared distances from all points in $S_j(n)$ to the new cluster center is minimized. The measure that serves to minimize this distance is the sample mean of $S_j(n)$. Hence, the new cluster center is measured by Equation (3): $z_j(n+1) = \frac{1}{N_j} \sum_{x \in S_j(n)} x$, where $N_j$ is the number of points in $S_j(n)$.
4. If $z_j(n+1) = z_j(n)$ for $j = 1, 2, \ldots, k$, the algorithm stands converged and the process is deemed terminated; otherwise, repeat steps 2 and 3.
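The four steps above can be sketched in Python with NumPy (a minimal illustration; the function and variable names are ours, not the authors' code):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means following the four steps above (an illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centers z_1, ..., z_k at random from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the sample mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The `seed` and `max_iter` parameters are practical additions; the convergence test in step 4 is the same centre-stability criterion stated above.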
2.2. Calinski and Harabasz (CH) Index
2.3. Silhouette Index
2.4. Krzanowski and Lai (KL) Index
2.5. Gap Statistic
2.6. Dunn Index
- $d(G_i, G_j)$ denotes the dissimilarity between two groups/clusters $G_i$ and $G_j$, defined as $d(G_i, G_j) = \min_{x \in G_i,\, y \in G_j} d(x, y)$;
- $\mathrm{diam}(G)$ is the diameter of a group/cluster $G$, which measures cluster dispersion and is defined as $\mathrm{diam}(G) = \max_{x, y \in G} d(x, y)$.
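From these two quantities, the Dunn index takes the smallest between-cluster dissimilarity divided by the largest cluster diameter; larger values indicate compact, well-separated clusters. A small NumPy sketch (our own illustrative code, not the authors' implementation):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance / max cluster diameter (sketch)."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # d(G_i, G_j): smallest pairwise distance between points of two clusters.
    def d_between(A, B):
        return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2))
    # diam(G): largest pairwise distance within one cluster.
    def diam(G):
        return np.max(np.linalg.norm(G[:, None, :] - G[None, :, :], axis=2))
    min_sep = min(d_between(clusters[i], clusters[j])
                  for i in range(len(clusters))
                  for j in range(i + 1, len(clusters)))
    max_diam = max(diam(G) for G in clusters)
    return min_sep / max_diam
```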
2.7. Duda Index
2.8. Pseudot2 Index
2.9. Davies–Bouldin (DB) Index
2.10. Rubin Index
2.11. C-Index
- $S$ is the sum of the Euclidean distances between all pairs of objects $x_i$ and $x_j$ that belong to the same cluster;
- $S_{\min}$ is the sum of the $N_w$ smallest pairwise distances in the dataset, where $N_w$ is the number of within-cluster pairs;
- $S_{\max}$ is the sum of the $N_w$ largest pairwise distances in the dataset.
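With these quantities, the C-index is $C = (S - S_{\min})/(S_{\max} - S_{\min})$, which lies in $[0, 1]$, with small values indicating a good partition. A brief sketch (our own illustrative code, not the authors' implementation):

```python
import numpy as np
from itertools import combinations

def c_index(X, labels):
    """C-index = (S - S_min) / (S_max - S_min); lower is better (sketch)."""
    # All pairwise Euclidean distances, with a within-cluster indicator.
    pairs = list(combinations(range(len(X)), 2))
    dists = np.array([np.linalg.norm(X[i] - X[j]) for i, j in pairs])
    within = np.array([labels[i] == labels[j] for i, j in pairs])
    n_w = int(within.sum())          # N_w: number of within-cluster pairs
    s = dists[within].sum()          # S: sum of within-cluster distances
    sorted_d = np.sort(dists)
    s_min = sorted_d[:n_w].sum()     # sum of the N_w smallest distances
    s_max = sorted_d[-n_w:].sum()    # sum of the N_w largest distances
    return (s - s_min) / (s_max - s_min)
```

For a partition whose within-cluster distances are exactly the smallest distances in the dataset, the index attains its minimum of 0.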
3. The Proposed Method
Algorithm 1: Selection of the optimal number of clusters k.

1. Consider a dataset {x} to be partitioned into k clusters.
2. Define the range of clusters k in which its true value is presumed to lie, i.e., 1 ≤ k ≤ 20.
3. For each value of k, apply the k-means clustering algorithm multiple times (J times) to the given dataset {x}.
4. Calculate the centroid values $C_{jk}$ of the clusters J times as the sample mean of the points assigned to each cluster (Equation (15)): $C_{jk} = \frac{1}{n_k} \sum_{x_i \in S_k} x_i$, where $S_k$ is the set of points in the cluster and $n_k$ its size.
5. Check the uniqueness of the centroid values obtained in step 4.
6. If uniqueness is established, select the next value of k.
7. Repeat the process from step 3 to step 6 until a globally unique value of k is achieved.
8. The optimum number of clusters k is the value for which the uniqueness of the cluster centroids is observed the maximum number of times in the iterative procedure.
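The steps above can be sketched as follows (a simplified, illustrative implementation of the uniqueness idea; the helper `run_kmeans`, the rounding tolerance, and the defaults for `J` and `k_range` are our assumptions, not the authors' exact code):

```python
import numpy as np
from collections import Counter

def run_kmeans(X, k, rng):
    # Minimal k-means used only to illustrate the selection procedure.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(100):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    # Sort rows and round so centroid sets from different runs are comparable.
    return np.round(centers[np.lexsort(centers.T)], 3)

def select_k(X, k_range=range(2, 21), J=10, seed=0):
    """Pick the k whose centroids coincide across repeated runs most often."""
    rng = np.random.default_rng(seed)
    scores = {}
    for k in k_range:
        # Serialize each run's centroid set so identical sets compare equal.
        runs = [run_kmeans(X, k, rng).tobytes() for _ in range(J)]
        # Score k by how many runs reproduce the most common centroid set.
        scores[k] = max(Counter(runs).values())
    return max(scores, key=scores.get)
```

The intuition is that at the true k, repeated random initializations converge to the same centroid set, whereas over- or under-specified k yields unstable centroids across runs.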
4. Experiments and Results
4.1. Datasets and Experimental Procedure
4.2. Results
4.3. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Caraka, R.E.; Chen, R.C.; Huang, S.W.; Chiou, S.Y.; Gio, P.U.; Pardamean, B. Big data ordination towards intensive care event count cases using fast computing GLLVMS. BMC Med. Res. Methodol. 2022, 22, 77.
- Bhadani, A.K.; Jothimani, D. Big data: Challenges, opportunities, and realities. In Effective Big Data Management and Opportunities for Implementation; IGI Global: Hershey, PA, USA, 2016; pp. 1–24.
- Fahad, A.; Alshatri, N.; Tari, Z.; Alamri, A.; Khalil, I.; Zomaya, A.Y.; Foufou, S.; Bouras, A. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Trans. Emerg. Top. Comput. 2014, 2, 267–279.
- Silipo, R.; Adae, I.; Hart, A.; Berthold, M. Seven Techniques for Dimensionality Reduction; KNIME: Zurich, Switzerland, 2014; pp. 1–21.
- Martín-Fernández, J.D.; Luna-Romera, J.M.; Pontes, B.; Riquelme-Santos, J.C. Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering. In Proceedings of the International Workshop on Soft Computing Models in Industrial and Environmental Applications, Seville, Spain, 13–15 May 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 3–13.
- Tang, Y.; Ren, F.; Pedrycz, W. Fuzzy C-means clustering through SSIM and patch for image segmentation. Appl. Soft Comput. 2020, 87, 105928.
- Zhang, Y.; Bai, X.; Fan, R.; Wang, Z. Deviation-Sparse Fuzzy C-Means With Neighbor Information Constraint. IEEE Trans. Fuzzy Syst. 2019, 27, 185–199.
- Zhou, S.; Xu, Z. A novel internal validity index based on the cluster centre and the nearest neighbour cluster. Appl. Soft Comput. 2018, 71, 78–88.
- Ye, F.; Chen, Z.; Qian, H.; Li, R.; Chen, C.; Zheng, Z. New approaches in multi-view clustering. In Recent Applications in Data Clustering; IntechOpen: London, UK, 2018; p. 195.
- MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 27 December 1965–7 January 1966; Volume 1, pp. 281–297.
- Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666.
- Maldonado, S.; Carrizosa, E.; Weber, R. Kernel penalized k-means: A feature selection method based on kernel k-means. Inf. Sci. 2015, 322, 150–160.
- Du, L.; Zhou, P.; Shi, L.; Wang, H.; Fan, M.; Wang, W.; Shen, Y.D. Robust multiple kernel k-means using l21-norm. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
- Wang, S.; Gittens, A.; Mahoney, M.W. Scalable kernel k-means clustering with Nystrom approximation: Relative-error bounds. arXiv 2017, arXiv:1706.02803.
- Liu, X.; Zhu, X.; Li, M.; Wang, L.; Zhu, E.; Liu, T.; Kloft, M.; Shen, D.; Yin, J.; Gao, W. Multiple kernel k-means with incomplete kernels. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1191–1204.
- Di, J.; Gou, X. Bisecting K-means Algorithm Based on K-valued Selfdetermining and Clustering Center Optimization. J. Comput. 2018, 13, 588–595.
- Kingrani, S.; Levene, M.; Zhang, D. Estimating the number of clusters using diversity. Artif. Intell. Res. 2017, 7, 15.
- Zhou, S.; Xu, Z.; Liu, F. Method for Determining the Optimal Number of Clusters Based on Agglomerative Hierarchical Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 3007–3017.
- Milligan, G.W.; Cooper, M.C. An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985, 50, 159–179.
- Shafeeq, A.; Hareesha, K. Dynamic clustering of data with modified k-means algorithm. In Proceedings of the 2012 Conference on Information and Computer Networks, Singapore, 26–28 February 2012; pp. 221–225.
- Hamerly, G.; Elkan, C. Learning the k in k-means. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2004; pp. 281–288.
- Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B 2001, 63, 411–423.
- Feng, Y.; Hamerly, G. PG-means: Learning the number of clusters in data. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2007; pp. 393–400.
- Ray, S.; Turi, R.H. Determination of number of clusters in k-means clustering and application in colour image segmentation. In Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, Calcutta, India, 27–29 December 1999; pp. 137–143.
- Gupta, N.; Ujjwal, R. An efficient incremental clustering algorithm. World Comput. Sci. Inf. Technol. J. 2013, 3, 97–99.
- Zhang, Y.; Mańdziuk, J.; Quek, C.H.; Goh, B.W. Curvature-based method for determining the number of clusters. Inf. Sci. 2017, 415, 414–428.
- Kodinariya, T.M.; Makwana, P.R. Review on determining number of Cluster in K-Means Clustering. Int. J. 2013, 1, 90–95.
- Li, X.; Liang, W.; Zhang, X.; Qing, S.; Chang, P.C. A cluster validity evaluation method for dynamically determining the near-optimal number of clusters. Soft Comput. 2020, 24, 9227–9241.
- Shao, X.; Lee, H.; Liu, Y.; Shen, B. Automatic K selection method for the K-Means algorithm. In Proceedings of the 2017 4th International Conference on Systems and Informatics (ICSAI), Hangzhou, China, 11–13 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1573–1578.
- Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis; Wiley: New York, NY, USA, 1973; Volume 3.
- Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. Theory Methods 1974, 3, 1–27.
- Dunn, J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974, 4, 95–104.
- Hartigan, J.A. Clustering Algorithms; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1975.
- Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
- Krzanowski, W.J.; Lai, Y. A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 1988, 44, 23–34.
- Tou, J.T.; Gonzalez, R.C. Pattern Recognition Principles; Addison-Wesley Publishing Company: Boston, MA, USA, 1974.
- Gordon, A. Classification; Chapman and Hall: New York, NY, USA, 1999.
- Friedman, H.P.; Rubin, J. On some invariant criteria for grouping data. J. Am. Stat. Assoc. 1967, 62, 1159–1178.
- Hubert, L.J.; Levin, J.R. A general statistical framework for assessing categorical clustering in free recall. Psychol. Bull. 1976, 83, 1072.
- Dua, D.; Graff, C. UCI Machine Learning Repository; University of California Irvine: Irvine, CA, USA, 2017.
- Guyon, I.; Von Luxburg, U.; Williamson, R.C. Clustering: Science or art. In NIPS 2009 Workshop on Clustering Theory; NIPS: Vancouver, BC, Canada, 2009; pp. 1–11.
- Hijmans, R.J. Raster: Geographic Data Analysis and Modeling. R Package. 2021. Available online: https://CRAN.R-project.org/package=raster (accessed on 3 April 2012).
- Ullah, I.; Mengersen, K. Bayesian mixture models and their Big Data implementations with application to invasive species presence-only data. J. Big Data 2019, 6, 29.
| Dataset | Number of Instances | Number of Attributes | Number of Clusters |
|---|---|---|---|
| Iris | 150 | 5 | 3 |
| Wine | 178 | 13 | 3 |
| Tripadvisor review | 980 | 11 | 2 |
| Aggregation | 788 | 3 | 7 |
| Flame | 240 | 3 | 2 |
| Control | 600 | - | 6 |
| Glass | 214 | 9 | 7 |
| Pathbased | 300 | 3 | 3 |
| Breast | 699 | 10 | 2 |
| Vehicle | 946 | 18 | 4 |
| Band Number | Central Wavelength (nm) | Bandwidth (nm) | Spatial Resolution (m) |
|---|---|---|---|
| 2 | 490 | 65 | 10 |
| 3 | 560 | 35 | 10 |
| 4 | 665 | 30 | 10 |
| 5 | 705 | 15 | 20 |
| 6 | 740 | 15 | 20 |
| 7 | 783 | 20 | 20 |
| 8 | 842 | 115 | 10 |
| 8A | 865 | 20 | 20 |
| N | D | k* | Proposed | CH | KL | Silhouette | Gap | C-Index | Pseudot2 | DB | Duda | Dunn | Rubin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 2 | 3 | 62 | 39 | 21 | 38 | 36 | 11 | 34 | 35 | 34 | 20 | 11 |
| 1000 | 2 | 5 | 48 | 19 | 6 | 18 | 1 | 7 | 5 | 10 | 5 | 5 | 16 |
| 1000 | 2 | 7 | 38 | 8 | 5 | 6 | 0 | 7 | 0 | 6 | 0 | 3 | 11 |
| 1000 | 4 | 3 | 80 | 51 | 28 | 47 | 48 | 22 | 31 | 42 | 31 | 15 | 35 |
| 1000 | 4 | 5 | 51 | 28 | 15 | 24 | 0 | 12 | 5 | 22 | 5 | 11 | 20 |
| 1000 | 4 | 7 | 35 | 21 | 11 | 14 | 0 | 6 | 1 | 12 | 1 | 4 | 13 |
| 1000 | 6 | 3 | 74 | 48 | 29 | 48 | 46 | 26 | 31 | 49 | 31 | 19 | 38 |
| 1000 | 6 | 5 | 59 | 27 | 15 | 27 | 0 | 9 | 7 | 21 | 8 | 10 | 19 |
| 1000 | 6 | 7 | 34 | 15 | 5 | 19 | 0 | 5 | 1 | 9 | 1 | 5 | 18 |
| 5000 | 2 | 3 | 65 | 30 | 23 | 27 | 31 | 12 | 43 | 27 | 43 | 14 | 11 |
| 5000 | 2 | 5 | 46 | 25 | 18 | 28 | 2 | 16 | 3 | 25 | 3 | 3 | 18 |
| 5000 | 2 | 7 | 33 | 8 | 6 | 3 | 0 | 7 | 0 | 6 | 0 | 2 | 12 |
| 5000 | 4 | 3 | 70 | 50 | 22 | 44 | 37 | 22 | 31 | 40 | 31 | 21 | 18 |
| 5000 | 4 | 5 | 49 | 28 | 16 | 20 | 1 | 10 | 6 | 13 | 6 | 4 | 19 |
| 5000 | 4 | 7 | 39 | 15 | 17 | 15 | 0 | 3 | 2 | 12 | 2 | 3 | 18 |
| 5000 | 6 | 3 | 68 | 49 | 20 | 55 | 49 | 21 | 26 | 50 | 26 | 28 | 28 |
| 5000 | 6 | 5 | 55 | 31 | 15 | 23 | 0 | 13 | 13 | 16 | 13 | 4 | 22 |
| 5000 | 6 | 7 | 33 | 18 | 14 | 20 | 0 | 4 | 3 | 14 | 3 | 3 | 17 |
| 1000 | 2 | 3 | 66 | 30 | 21 | 26 | 30 | 14 | 28 | 24 | 28 | 15 | 14 |
| 1000 | 2 | 5 | 48 | 16 | 14 | 9 | 1 | 10 | 6 | 9 | 6 | 4 | 7 |
| 1000 | 2 | 7 | 36 | 7 | 7 | 5 | 0 | 7 | 0 | 5 | 0 | 4 | 12 |
| 1000 | 4 | 3 | 69 | 27 | 15 | 22 | 23 | 15 | 14 | 21 | 14 | 6 | 14 |
| 1000 | 4 | 5 | 49 | 17 | 14 | 12 | 0 | 6 | 7 | 6 | 7 | 3 | 11 |
| 1000 | 4 | 7 | 38 | 15 | 7 | 10 | 0 | 3 | 2 | 4 | 2 | 4 | 11 |
| 1000 | 6 | 3 | 64 | 28 | 14 | 22 | 25 | 13 | 12 | 23 | 12 | 18 | 19 |
| 1000 | 6 | 5 | 45 | 18 | 12 | 16 | 0 | 5 | 12 | 10 | 12 | 6 | 14 |
| 1000 | 6 | 7 | 36 | 14 | 6 | 12 | 0 | 3 | 4 | 8 | 4 | 6 | 12 |
| Datasets | k* | Proposed | CH | KL | Silhouette | Gap | C-Index | Pseudot2 | DB | Duda | Dunn | Rubin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Iris | 3 | 3 | 3 | 12 | 2 | 2 | 3 | 2 | 2 | 2 | 12 | 3 |
| Wine | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 3 | 3 | 3 | 19 | 3 |
| Tripadvisor review | 2 | 2 | 2 | 2 | 2 | 2 | 6 | 2 | 2 | 2 | 15 | 2 |
| Aggregation | 7 | 7 | 17 | 19 | 4 | 2 | 16 | 2 | 4 | 2 | 17 | 17 |
| Flame | 2 | 3 | 4 | 15 | 4 | 2 | 5 | 2 | 4 | 2 | 20 | 9 |
| Control | 6 | 4 | 3 | 12 | 2 | 2 | 20 | 2 | 2 | 2 | 13 | 3 |
| Glass | 7 | 6 | 6 | 18 | 3 | 2 | 7 | 2 | 7 | 2 | 2 | 8 |
| Pathbased | 3 | 3 | 19 | 9 | 3 | 2 | 11 | 2 | 2 | 2 | 20 | 12 |
| Breast | 2 | 2 | 2 | 18 | 2 | 2 | 14 | 3 | 2 | 3 | 2 | 2 |
| Vehicle | 4 | 4 | 3 | 10 | 2 | 2 | 2 | 3 | 2 | 3 | 19 | 8 |
| Success | | 7 | 4 | 2 | 4 | 3 | 2 | 3 | 4 | 3 | 1 | 4 |
| RMSE | | 0.77 | 6.10 | 9.38 | 2.27 | 2.70 | 7.17 | 2.64 | 1.87 | 2.64 | 12.32 | 5.06 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ali, I.; Rehman, A.U.; Khan, D.M.; Khan, Z.; Shafiq, M.; Choi, J.-G. Model Selection Using K-Means Clustering Algorithm for the Symmetrical Segmentation of Remote Sensing Datasets. Symmetry 2022, 14, 1149. https://doi.org/10.3390/sym14061149