Fast Component Density Clustering in Spatial Databases: A Novel Algorithm
Abstract
1. Introduction
State-of-the-Art Methods
2. Materials and Methods
- Component size (CS) is the radius of the component. A component is a circular region defined by a core point at its center and CS as its radius. The radius must suit the scale of distances between data points: large inter-point distances require a large radius value. However, a minimum value must be set, sufficient for each component to cover several points per process.
- Minimum density (MD) is the threshold density value used to classify each created component as part of a cluster or as noise. It is the minimum number of points that must fall inside a component. This parameter must suit the size of the dataset; large datasets require a large value.
- Core point (P_core) is the point selected as the center of the current component. This point must be an unlabeled point.
- Unlabeled points (PU) are points that have not yet been assigned a label value.
- Labeled points (PL) are points within the component that were already assigned a label value by a previous process.
- Component points (PC) are the data points that fall within the component.
- Component density (CD) is the number of data points within the component.
- Temporary label (TL) is an integer value assigned to points. Its initial value is 1, and it is incremented by 1 after each use.
- Label list (LL) stores sets of adjacent TLs. The stored values are used to compute the final clusters. (A brief code sketch illustrating these definitions follows this list.)
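To make the definitions concrete, the following is a minimal sketch, assuming NumPy and hypothetical toy 2-D data; the names `points`, `core`, `cs`, and `md` are illustrative and not from the paper.

```python
import numpy as np

# Hypothetical toy data: 100 points in the plane.
rng = np.random.default_rng(0)
points = rng.random((100, 2))

cs = 0.15   # component size (CS): the component's radius
md = 4      # minimum density (MD): density threshold

core = points[0]                               # a candidate core point (P_core)
dists = np.linalg.norm(points - core, axis=1)  # distances from the core
pc = np.flatnonzero(dists <= cs)               # component points (PC)
cd = len(pc)                                   # component density (CD)
is_dense = cd > md                             # dense components seed clusters; sparse ones are noise candidates
```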
2.1. Temporary Labeling Stage
Algorithm 1: Temporary labeling stage. Input: component size (CS), minimum density (MD), TL = 1. Output: temporarily labeled points, label list (LL).
```
FOR each point P in the data:
    IF P is PU:                      # P is still unlabeled
        P_core ← P                   # create a new component centered on P
        IF CD > MD:                  # the component is dense
            IF any PL in PC:         # the component overlaps labeled points
                PC ← min(PL)         # relabel all component points with the minimum label
                LL(TL) ← all PL      # store all adjacent labels in the label list
            ELSE:
                all PC ← TL          # label all component points with the current TL
                LL(TL) ← TL          # store the new label in the label list
                TL ← TL + 1
        ELSE:                        # the component is sparse
            P_core ← TL              # label only the core point
            LL(TL) ← TL              # store the new label in the label list
            TL ← TL + 1
```
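Below is a minimal Python sketch of this stage, under stated assumptions: a brute-force radius query stands in for whatever spatial indexing the full method may use, the function name `temporary_labeling` is illustrative, and the label-list bookkeeping (recording adjacency per existing label) is one straightforward reading of the pseudocode rather than the author's exact implementation.

```python
import numpy as np

def temporary_labeling(points: np.ndarray, cs: float, md: int):
    """Temporary labeling stage (sketch). Returns per-point temporary labels
    and the label list LL mapping each TL to the set of labels adjacent to it."""
    n = len(points)
    labels = np.zeros(n, dtype=int)        # 0 marks unlabeled points (PU)
    ll = {}                                # label list: TL -> set of adjacent TLs
    tl = 1                                 # temporary label counter
    for i in range(n):
        if labels[i] != 0:                 # only unlabeled points become core points
            continue
        # Component points PC: brute-force radius query around the core point.
        dists = np.linalg.norm(points - points[i], axis=1)
        pc = np.flatnonzero(dists <= cs)
        pl = labels[pc][labels[pc] != 0]   # labels already present inside PC
        if len(pc) > md:                   # component density CD exceeds MD
            if len(pl) > 0:
                m = int(pl.min())
                labels[pc] = m             # relabel the whole component with min(PL)
                for v in np.unique(pl):    # record that all PL labels are adjacent
                    ll.setdefault(int(v), {int(v)}).add(m)
            else:
                labels[pc] = tl            # fresh label for the whole component
                ll[tl] = {tl}
                tl += 1
        else:                              # sparse component: label the core point only
            labels[i] = tl
            ll[tl] = {tl}
            tl += 1
    return labels, ll
```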
2.2. Clustering Stage
Algorithm 2: Clustering stage. Input: label list (LL), temporarily labeled data points. Output: list of representative values, clustered data.
```
Sort all sets of adjacent label values in LL in increasing order
FOR i in 1 to TL:                        # union
    FOR v in LL[i]:
        IF v != i AND length(LL[i]) > 1:
            LL[i] ← LL[i] + LL[v]
            LL[v] ← v
FOR i in 1 to TL:                        # find
    Representative[i] ← min(LL[i])
FOR each TL in the data:                 # replace each TL with its representative
    P_cluster ← Representative[TL]       # the final cluster value for each point
FOR each noise point:
    IF distance(noise point, nearest cluster) < radius:
        noise point ← nearest cluster
```
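A minimal sketch of this stage follows, assuming the `labels` and `ll` outputs of the previous sketch. The union and find loops of the pseudocode are implemented here with a standard union-find (path halving, smaller label as root), which achieves the same merge-and-relabel effect; it is not the author's exact implementation, and the final noise-reassignment step is omitted.

```python
def clustering_stage(labels, ll):
    """Clustering stage (sketch): merge adjacent temporary labels with a
    standard union-find, then replace each point's TL with its group's
    minimum label (the representative)."""
    parent = {tl: tl for tl in ll}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)   # keep the smaller label as root

    for tl, adjacent in ll.items():             # union all adjacent labels
        for v in adjacent:
            union(tl, v)

    representative = {tl: find(tl) for tl in ll}
    final = labels.copy()
    for i, tl in enumerate(labels):
        if tl != 0:
            final[i] = representative[int(tl)]
    # The pseudocode's last step (reassigning each noise point to the nearest
    # cluster when it lies within the radius) is omitted here for brevity.
    return final
```

With the toy data from the earlier sketch, `clustering_stage(*temporary_labeling(points, cs=0.15, md=4))` yields one final cluster id per point; the paper additionally filters low-density groups as noise and may reassign them, which this sketch does not reproduce.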
3. Results
3.1. Dataset Selection
3.2. Estimation of Parameters
3.3. Evaluation Measurement Methods
For a set of elements S and two clusterings C and K of those elements, the following pair counts are defined (the resulting Rand index formula is given after this list):
- a, the number of pairs of elements in S that are in the same set in C and in the same set in K.
- b, the number of pairs of elements in S that are in different sets in C and in different sets in K.
- c, the number of pairs of elements in S that are in the same set in C and in different sets in K.
- d, the number of pairs of elements in S that are in different sets in C and in the same set in K.
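These four counts are the standard ingredients of the Rand index; the usual textbook formula (not specific to this paper) is:

```latex
\mathrm{RI} = \frac{a + b}{a + b + c + d}
```

The value ranges from 0 to 1, with 1 meaning the two clusterings agree on every pair of elements.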
3.4. Clustering Performance
4. Discussion
5. Conclusions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Method | Time Complexity
---|---
Affinity Propagation | O(N²I)
Agglomerative Hierarchical Clustering | O(N³)
K-means clustering | O(IN)
Mean-shift | O(IN²)
Spectral clustering | O(N³)
Ward hierarchical clustering | O(N³)
DBSCAN | O(N²) in the worst case; O(N log N) with a spatial index
OPTICS | O(N log N)
Gaussian mixtures | O(N³)
Proposed | O(N)
Method | Noisy Circles | Noisy Moons | Blobs | Aniso | Varied | Overlapped
---|---|---|---|---|---|---
Affinity Propagation | 0.53 | 0.55 | 0.76 | 0.72 | 0.73 | 0.52 |
Agglo. Hierarchical | 0.50 | 0.73 | 1.00 | 0.80 | 0.99 | 0.72 |
K-means | 0.50 | 0.62 | 1.00 | 0.82 | 0.92 | 0.73 |
Mean-shift | 0.50 | 0.71 | 1.00 | 0.76 | 0.93 | 0.74 |
Spectral | 0.50 | 0.65 | 1.00 | 0.89 | 0.97 | 0.69 |
DBSCAN | 1.00 | 1.00 | 0.98 | 0.99 | 0.96 | 0.62 |
OPTICS | 0.51 | 0.50 | 0.58 | 0.58 | 0.56 | 0.50 |
Gaussian mixtures | 0.50 | 0.75 | 1.00 | 1.00 | 0.99 | 0.82 |
Proposed | 1.00 | 1.00 | 1.00 | 1.00 | 0.97 | 0.69 |
Method | Noisy Circles | Noisy Moons | Blobs | Aniso | Varied | Overlapped
---|---|---|---|---|---|---
Affinity Propagation | 0.07 | 0.10 | 0.33 | 0.21 | 0.24 | 0.05 |
Agglo. Hierarchical | 0.01 | 0.47 | 1.00 | 0.57 | 0.97 | 0.46 |
K-means | 0.00 | 0.24 | 1.00 | 0.61 | 0.83 | 0.46 |
Mean-shift | 0.00 | 0.42 | 1.00 | 0.54 | 0.85 | 0.49 |
Spectral | 0.00 | 0.29 | 1.00 | 0.74 | 0.93 | 0.39 |
DBSCAN | 1.00 | 1.00 | 0.96 | 0.98 | 0.91 | 0.24 |
OPTICS | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.00 |
Gaussian mixtures | 0.00 | 0.50 | 1.00 | 1.00 | 0.97 | 0.64 |
Proposed | 1.00 | 1.00 | 1.00 | 1.00 | 0.94 | 0.39 |
Method | Noisy Circles | Noisy Moons | Blobs | Aniso | Varied | Overlapped
---|---|---|---|---|---|---
Affinity Propagation | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 | 0.69 |
Agglo. Hierarchical | 0.00 | 0.48 | 1.00 | 0.63 | 0.95 | 0.43 |
K-means | 0.00 | 0.18 | 1.00 | 0.63 | 0.81 | 0.43 |
Mean-shift | 0.01 | 0.36 | 1.00 | 0.52 | 0.83 | 0.45 |
Spectral | 0.00 | 0.22 | 1.00 | 0.74 | 0.91 | 0.41 |
DBSCAN | 1.00 | 1.00 | 0.98 | 0.99 | 0.94 | 0.46 |
OPTICS | 0.58 | 0.47 | 0.46 | 0.48 | 0.43 | 0.35 |
Gaussian mixtures | 0.00 | 0.40 | 1.00 | 1.00 | 0.94 | 0.55 |
Proposed | 1.00 | 1.00 | 1.00 | 1.00 | 0.91 | 0.41 |
Method | Noisy Circles | Noisy Moons | Blobs | Aniso | Varied | Overlapped
---|---|---|---|---|---|---
Affinity Propagation | 0.20 | 0.23 | 0.45 | 0.36 | 0.35 | 0.14 |
Agglo. Hierarchical | 0.00 | 0.51 | 1.00 | 0.68 | 0.95 | 0.45 |
K-means | 0.00 | 0.18 | 1.00 | 0.63 | 0.82 | 0.45 |
Mean-shift | 0.00 | 0.37 | 1.00 | 0.89 | 0.84 | 0.48 |
Spectral | 0.00 | 0.22 | 1.00 | 0.74 | 0.91 | 0.46 |
DBSCAN | 1.00 | 1.00 | 0.91 | 0.93 | 0.85 | 0.17 |
OPTICS | 0.12 | 0.12 | 0.19 | 0.19 | 0.18 | 0.09 |
Gaussian mixtures | 0.00 | 0.40 | 1.00 | 1.00 | 0.94 | 0.55 |
Proposed | 1.00 | 1.00 | 1.00 | 1.00 | 0.91 | 0.46 |
Method | Noisy Circles | Noisy Moons | Blobs | Aniso | Varied | Overlapped
---|---|---|---|---|---|---
Affinity Propagation | 0.34 | 0.37 | 0.62 | 0.53 | 0.52 | 0.23 |
Agglo. Hierarchical | 0.00 | 0.49 | 1.00 | 0.65 | 0.95 | 0.44 |
K-means | 0.00 | 0.18 | 1.00 | 0.63 | 0.81 | 0.44 |
Mean-shift | 0.00 | 0.36 | 1.00 | 0.65 | 0.83 | 0.47 |
Spectral | 0.00 | 0.22 | 1.00 | 0.74 | 0.91 | 0.44 |
DBSCAN | 1.00 | 1.00 | 0.94 | 0.96 | 0.89 | 0.25 |
OPTICS | 0.20 | 0.19 | 0.27 | 0.27 | 0.26 | 0.14 |
Gaussian mixtures | 0.00 | 0.40 | 1.00 | 1.00 | 0.94 | 0.55 |
Proposed | 1.00 | 1.00 | 1.00 | 1.00 | 0.91 | 0.43 |
Method | Noisy Circles | Noisy Moons | Blobs | Aniso | Varied | Overlapped
---|---|---|---|---|---|---
Affinity Propagation | 0.26 | 0.31 | 0.52 | 0.40 | 0.43 | 0.22 |
Agglo. Hierarchical | 0.51 | 0.75 | 1.00 | 0.73 | 0.98 | 0.74 |
K-means | 0.50 | 0.62 | 1.00 | 0.74 | 0.88 | 0.74 |
Mean-shift | 0.36 | 0.72 | 1.00 | 0.76 | 0.90 | 0.75 |
Spectral | 0.50 | 0.65 | 1.00 | 0.83 | 0.96 | 0.72 |
DBSCAN | 1.00 | 1.00 | 0.98 | 0.98 | 0.94 | 0.55 |
OPTICS | 0.31 | 0.38 | 0.32 | 0.31 | 0.34 | 0.39 |
Gaussian mixtures | 0.50 | 0.75 | 1.00 | 1.00 | 0.98 | 0.82 |
Proposed | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.71 |