Using K-Means Clustering in Python with Periodic Boundary Conditions
Abstract
1. Introduction
2. K-Means Clustering and Periodicity
Algorithm 1. K-means algorithm
Task: divide the set of data points $X$ into $k$ clusters $S_1, \dots, S_k$.
Inputs: data points $X = \{x_1, \dots, x_n\}$, initial centers $c_1, \dots, c_k$.
Output: final set of centers $c_1, \dots, c_k$.
Parameters: number of clusters $k$, stop requirements.
Begin
repeat
  for all $x \in X$ do
    Find index $i$ of the nearest center, i.e., $i = \arg\min_j d(x, c_j)$.
    Assign $x$ to cluster $S_i$.
  end for
  Calculate the new center $c_j$ of all clusters $S_j$.
until Centers are stable within the given parameters.
End
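The loop in Algorithm 1 can be sketched in plain Python as below. This is a minimal illustration, not the implementation used in the article: the function names `euclidean` and `kmeans`, the `max_iter` and `tol` stop parameters, and the choice to leave an empty cluster's center unchanged are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def kmeans(points, centers, distance=euclidean, max_iter=100, tol=1e-9):
    """Lloyd-style k-means sketch; returns (centers, labels).

    `points` and `centers` are sequences of equal-length tuples.
    `max_iter` and `tol` are hypothetical stop parameters.
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: distance(x, centers[j]))
            for x in points
        ]
        # Update step: each center becomes the arithmetic mean of its cluster.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(old)
        # Stop when no center moved by more than `tol`.
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels
```

The `distance` argument is left pluggable so that another measure can be substituted, but the update step here is still an arithmetic mean, which is only appropriate for non-periodic coordinates.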
- The distance measure
- The position of the centers, i.e., the mean position within the clusters.
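For the first item, a common way to measure distance along a single periodic coordinate is to take the shorter of the two arcs between the points. The sketch below assumes one coordinate with period `period`; the function name `periodic_distance` is hypothetical and is not taken from the article or from any library.

```python
def periodic_distance(a, b, period):
    """Shortest distance between a and b on a circle of circumference `period`."""
    d = abs(a - b) % period
    # Going "the other way round" may be shorter than the direct difference.
    return min(d, period - d)
```

For hours of the day (`period = 24`), 23:00 and 01:00 are two hours apart under this measure, whereas the plain absolute difference would report 22.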
Algorithm 2. Periodic mean
Task: find the mean position of points according to the periodic boundary condition.
Inputs: data points $X$.
Output: mean position $m$ in the set $X$.
Parameters: period $T$.
Begin
Create two empty sets.
for all $x \in X$ do
  if … then
    Add $x$ to the first set.
  else
    Add $x$ to the second set.
  end if
end for
Count the mean position of points in the first set.
Count the number of elements in the first set.
Count the mean position of points in the second set.
Count the number of elements in the second set.
if … then
  …
else
  …
end if
End
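One possible reading of Algorithm 2 is sketched below in Python. The two conditions used here are assumptions made for this illustration, since the listing above does not show them: points are split at `period / 2`, and the lower group is shifted by one period when the two partial means lie more than half a period apart. The function name `periodic_mean` is likewise hypothetical.

```python
def periodic_mean(points, period):
    """Mean position of 1-D points under a periodic boundary condition.

    Sketch under stated assumptions: split at period/2, and wrap the lower
    group by one period when the partial means are more than period/2 apart.
    """
    if not points:
        raise ValueError("periodic_mean needs at least one point")
    half = period / 2
    wrapped = [x % period for x in points]
    low = [x for x in wrapped if x < half]
    high = [x for x in wrapped if x >= half]
    # If everything falls on one side, the ordinary mean is already correct.
    if not low or not high:
        group = low or high
        return sum(group) / len(group)
    mean_low = sum(low) / len(low)
    mean_high = sum(high) / len(high)
    total = len(low) + len(high)
    if mean_high - mean_low <= half:
        # The groups are close along the direct path: plain weighted mean.
        return (sum(low) + sum(high)) / total
    # Otherwise the shorter path crosses the boundary: shift the lower group
    # by one period, average, and fold the result back into [0, period).
    return ((sum(low) + period * len(low) + sum(high)) / total) % period
```

With `period = 24`, the points 23 and 1 average to 0 rather than 12, which is the behaviour a periodic mean is meant to provide.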
3. Modifying Out-of-the-Box Python Solution
Using Periodic Measure in the PyClustering Library
4. Comparing Different Clusterizations
5. Artificial Dataset
6. Clustering a Real Angular Dataset
7. Clustering Seasonal Data
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Clustering 1 Dataset | Clustering 1 Method | Clustering 2 Dataset | Clustering 2 Method | Rand | Adjusted Rand | Arabie | Hubert | Fowlkes | Jaccard |
|---|---|---|---|---|---|---|---|---|---|
| | original | | periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| | original | | original | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955 |
| | original | | periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| | original | | original | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559 |
| | original | | periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| | periodic | | original | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955 |
| | periodic | | periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| | periodic | | original | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559 |
| | periodic | | periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| | original | | periodic | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955 |
| | original | | original | 0.789 | 0.540 | 0.211 | 0.578 | 0.705 | 0.530 |
| | original | | periodic | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955 |
| | periodic | | original | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559 |
| | periodic | | periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| | original | | periodic | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559 |
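Three of the measures in the table above (Rand, Fowlkes–Mallows and Jaccard) are pair-counting indices: each looks at every pair of points and asks whether the two clusterings agree on putting the pair together. The sketch below uses the standard textbook definitions; the function names are hypothetical, and it is not claimed to be the code that produced the tabulated values.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_a, labels_b):
    """Count point pairs by whether each clustering puts them together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which the two clusterings agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(labels_a, labels_b):
    """Pairs joined in both, over pairs joined in at least one clustering."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / (both + only_a + only_b)


def fowlkes_mallows(labels_a, labels_b):
    """Geometric mean of pairwise precision and recall."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    return both / sqrt((both + only_a) * (both + only_b))
```

All three are invariant to how the clusters are numbered, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` count as identical partitions. The Jaccard and Fowlkes–Mallows sketches divide by zero when a clustering consists only of singletons; a production version would need to handle that case.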
| Dataset | Method | WCCS | Ratio |
|---|---|---|---|
| | original | 3,385,241 | 1.00 |
| | periodic | 3,385,241 | |
| | original | 3,385,241 | 0.92 |
| | periodic | 3,653,802 | |
| | original | 3,385,241 | 0.91 |
| | periodic | 3,695,809 | |
Algorithm | Centers | WCCS | Centers | WCCS |
---|---|---|---|---|
periodic | 2 | 1,851,331 | 3 | 662,869 |
original | 2 | 1,941,053 | 3 | 706,689 |
Day Data | ||||||
---|---|---|---|---|---|---|
Algorithm | Centers | WCCS | Centers | WCCS | Centers | WCCS |
periodic | 4 | 3,619,456 | 7 | 1,306,400 | 12 | 452,409 |
original | 4 | 4,137,528 | 7 | 1,314,643 | 12 | 475,518 |
Week Data | ||||||
Algorithm | Centers | WCCS | Centers | WCCS | Centers | WCCS |
periodic | 3 | 641,157 | 5 | 232,192 | 7 | 82,427 |
original | 3 | 644,370 | 5 | 232,805 | 7 | 84,376 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Miniak-Górecka, A.; Podlaski, K.; Gwizdałła, T. Using K-Means Clustering in Python with Periodic Boundary Conditions. Symmetry 2022, 14, 1237. https://doi.org/10.3390/sym14061237