High-Dimensional Data Analysis Using Parameter Free Algorithm Data Point Positioning Analysis
Abstract
1. Introduction
1. In this study, we propose a new non-parametric clustering algorithm, data point positioning analysis (DPPA), which calculates the 1-NN and Max-NN of each data point by analyzing the positions of the data in the dataset, without any initial manual parameter assignment.
2. We evaluate the proposed method on two well-known, publicly available datasets. In addition, the proposed method is less time-consuming because it removes the dependence of the analysis on manually selected parameters. Finally, four popular algorithms (DBSCAN, K-means, affinity propagation, and mean shift) are implemented to compare their performance against the proposed model.
2. Related Work
3. Overview of Clustering Algorithms
3.1. DBSCAN
Algorithm 1: DBSCAN clustering algorithm.
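As a point of reference, a typical DBSCAN invocation via scikit-learn. The eps and min_samples values below are illustrative choices, and they are exactly the manually selected parameters that the proposed parameter-free method is designed to avoid:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two well-separated blobs plus one far-away outlier.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
    [[20.0, 20.0]],
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

With a well-chosen eps, the two blobs are recovered and the outlier is flagged as noise; a poorly chosen eps merges or shatters them, which is the sensitivity DPPA sidesteps.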
3.2. Affinity Propagation
1. Initialization: the algorithm begins by initializing two matrices, the “similarity matrix” and the “responsibility matrix”. The similarity matrix contains the pairwise similarity scores s(i, k) between all data points. The responsibility matrix is a matrix of the same size as the similarity matrix and is initialized to 0.
2. Calculate the responsibility matrix: in this step, the algorithm iteratively updates the responsibility matrix. The responsibility matrix represents how well suited each data point is to be an “exemplar”, or representative, of a cluster. The update rule is as follows:
   r(i, k) ← s(i, k) − max_{k′ ≠ k} { a(i, k′) + s(i, k′) }
3. Calculate the availability matrix: in this step, the algorithm updates the availability matrix. The availability matrix represents how much “support” a data point receives from other data points for being an exemplar. The update rule is as follows:
   a(i, k) ← min( 0, r(k, k) + Σ_{i′ ∉ {i, k}} max(0, r(i′, k)) ) for i ≠ k, and a(k, k) ← Σ_{i′ ≠ k} max(0, r(i′, k))
4. Calculate cluster exemplars: the algorithm uses the responsibility and availability matrices to determine which data points are the best exemplars for each cluster. The exemplars are the data points with the highest sum of their responsibility and availability scores:
   exemplar(i) = argmax_k [ r(i, k) + a(i, k) ]
5. Assign data points to clusters: finally, Algorithm 2 assigns each data point to its nearest exemplar, which forms the final clusters.
Algorithm 2: Affinity propagation algorithm.
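The five steps above can be sketched directly in NumPy. The damping factor, iteration count, and the median-of-similarities preference are common defaults rather than values taken from the paper:

```python
import numpy as np

def affinity_propagation(X, damping=0.5, n_iter=200):
    """Minimal affinity propagation following the five steps above."""
    n = len(X)
    # Step 1: similarity matrix (negative squared Euclidean distance);
    # the self-similarity "preference" is set to the median similarity.
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S[np.diag_indices(n)] = np.median(S)
    R = np.zeros((n, n))  # responsibility matrix, initialized to 0
    A = np.zeros((n, n))  # availability matrix, initialized to 0
    for _ in range(n_iter):
        # Step 2: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = AS.argmax(1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(1)
        max_excl = np.repeat(first[:, None], n, axis=1)
        max_excl[np.arange(n), idx] = second  # exclude k itself from the max
        R = damping * R + (1 - damping) * (S - max_excl)
        # Step 3: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        Rp[np.diag_indices(n)] = R[np.diag_indices(n)]
        tmp = Rp.sum(0)[None, :] - Rp
        A_new = np.minimum(0, tmp)
        A_new[np.diag_indices(n)] = tmp[np.diag_indices(n)]
        A = damping * A + (1 - damping) * A_new
    # Step 4: exemplars have positive r(k,k) + a(k,k).
    exemplars = np.flatnonzero(np.diag(R + A) > 0)
    if exemplars.size == 0:  # fallback if no exemplar emerged
        exemplars = np.array([int(np.argmax(np.diag(R + A)))])
    # Step 5: assign every point to its most similar exemplar.
    return exemplars[S[:, exemplars].argmax(1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)), rng.normal(5.0, 0.3, (10, 2))])
labels = affinity_propagation(X)
```

On two well-separated blobs, the message passing settles on one exemplar per blob; the preference value controls how many exemplars emerge, which is why affinity propagation can over-segment real data (see Section 6.2.2).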
3.3. Mean Shift Algorithm
3.3.1. Mean Shift Procedures
3.3.2. Some Special Kernels and Their Shadows
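To make the mean shift procedure concrete, here is a minimal NumPy sketch using a Gaussian kernel. The bandwidth, iteration cap, and mode-merging threshold are illustrative choices, not settings from the paper:

```python
import numpy as np

def mean_shift(X, bandwidth, n_iter=100, tol=1e-6):
    """Each point climbs to a mode of the kernel density estimate by
    repeatedly moving to the kernel-weighted mean of the data."""
    modes = X.copy()
    for _ in range(n_iter):
        # Gaussian kernel weights between current mode estimates and data.
        d2 = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))
        new = (w[:, :, None] * X[None, :, :]).sum(1) / w.sum(1, keepdims=True)
        shift = np.abs(new - modes).max()
        modes = new
        if shift < tol:
            break
    # Merge modes that converged to (nearly) the same density peak.
    centers, labels = [], np.empty(len(X), dtype=int)
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) < bandwidth / 2:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.asarray(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])
labels, centers = mean_shift(X, bandwidth=1.0)
```

The number of clusters falls out of the bandwidth: too large a bandwidth merges everything into one mode, which is consistent with the single cluster mean shift finds in Section 6.2.3.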
3.4. K-Means
1. Choose the number of clusters K to identify and randomly initialize K cluster centers.
2. Assign each data point to its nearest cluster center. This can be done using the Euclidean distance or another distance measure.
3. Calculate the mean of the data points in each cluster to obtain the new cluster centers.
4. Repeat steps 2 and 3 until the cluster centers converge, i.e., until the assignments of the data points to clusters no longer change, or change only minimally.

- Step 2: assign each data point x_i to its nearest cluster center: c_i = argmin_k ||x_i − μ_k||².
- Step 3: calculate the mean of the data points in each cluster to obtain the new cluster centers: μ_k = (1/|S_k|) Σ_{x_i ∈ S_k} x_i, where S_k is the set of data points assigned to cluster k.
- Step 4: repeat steps 2 and 3 until convergence.
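The four steps above translate almost line for line into NumPy. This is a minimal sketch, not a production implementation (no restarts, no k-means++ seeding):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center (squared Euclidean).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Step 3: move each center to the mean of its assigned points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            X[labels == j].mean(0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0], [0.1], [10.0], [10.1]])
labels, centers = k_means(X, k=2)
```

Note that K must be supplied up front; this is the manual parameter that the proposed method removes.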
4. Data Point Positioning Analysis Algorithm
1.
2. For determine
3. Calculate the range of the radius, denoted here by R, between the minimum and maximum values of a scalar value.
4. For each point p_i:
   (a) Compute C(p_i), the count of neighboring data points that lie within the specified radius range R for a given data point p_i. The calculation sums a series of ones and zeros, depending on whether the distance between p_i and a particular neighboring data point falls inside or outside the specified range.
   (b) Construct a table (Table 1) that lists, for each data point p_i, its nearest neighboring data points, including the nearest neighbor (1-NN) and the maximum neighbor (Max-NN).
5. Arrange the data points in the neighbor-link table in ascending order of their corresponding C(p_i) values.
   (a) To form a cluster c, first place the data point p_i, and then add all the data points that are in the Max-NN linkage of p_i.
   (b) Add all the data points that are 1-NN (nearest neighbors) of p_i to the cluster c.
   (c) If the next data point p_j belongs to cluster c, then assign p_j to c, set p_i to p_j, and repeat the process, starting from step (a).
   (d) Continue the process until no data points are left.
Algorithm 3: Data point positioning analysis algorithm.
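As one possible reading of Algorithm 3, the sketch below counts in-radius neighbors, builds a 1-NN/Max-NN link table, and merges linked points with union–find. The radius argument, the interpretation of Max-NN as the farthest in-radius neighbor, and the union–find merging are assumptions made for illustration only, not the paper's exact definitions:

```python
import numpy as np

def dppa_sketch(X, radius):
    """Speculative sketch of the DPPA steps; see the caveats above."""
    n = len(X)
    # Pairwise Euclidean distances; a point is not its own neighbour.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    # Step 4a: neighbour count within the radius for every point.
    counts = (D <= radius).sum(1)
    # Step 4b: neighbour-link table -- the nearest neighbour (1-NN) and,
    # as an interpretation of Max-NN, the farthest in-radius neighbour.
    one_nn = D.argmin(1)
    in_radius = np.where(D <= radius, D, -np.inf)
    max_nn = in_radius.argmax(1)
    # Step 5: visit points in ascending order of neighbour count and link
    # each point with its 1-NN and Max-NN; union-find tracks the clusters.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i in np.argsort(counts):
        parent[find(i)] = find(one_nn[i])
        if in_radius[i, max_nn[i]] != -np.inf:  # has an in-radius neighbour
            parent[find(i)] = find(max_nn[i])
    roots = sorted({find(i) for i in range(n)})
    relabel = {r: c for c, r in enumerate(roots)}
    return np.array([relabel[find(i)] for i in range(n)])

X = np.array([[0.0], [0.01], [0.02], [0.03], [0.04],
              [5.0], [5.01], [5.02], [5.03], [5.04]])
labels = dppa_sketch(X, radius=0.5)
```

The key property the sketch preserves is that cluster membership is derived from the positions of the points themselves (their neighbor links), not from a user-chosen cluster count.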
5. Dataset
5.1. Dataset 1
5.2. Dataset 2
6. Results and Discussion
6.1. Proposed Data Point Positioning Analysis
6.2. Comparison with Four Popular Methods
6.2.1. DBSCAN
6.2.2. Affinity Propagation
6.2.3. Mean Shift
6.2.4. K-Means
6.2.5. Performed on Other Methods
7. Conclusions
- This study proposed a novel clustering approach, data point positioning analysis (DPPA), to improve the efficiency of clustering high-dimensional datasets.
- The method does not require the number of clusters to be pre-specified, whereas traditional clustering methods often require it to be determined beforehand. This makes parameter-free methods more flexible and adaptable to different datasets.
- The proposed parameter-free clustering algorithm handles noisy or outlier data points better because it uses density-based clustering techniques that do not depend on distance measures alone.
- The study compared the proposed method with four popular clustering algorithms and demonstrated that it achieves superior performance in identifying clusters.
Funding
Conflicts of Interest
References
Table 1. Neighbor-link table listing, for each data point, its nearest neighbor (1-NN) and maximum neighbor (Max-NN), where a, b, and c index neighboring data points.
| Attribute | Type | Kind | Description |
|---|---|---|---|
| Age | Range | Numeric | |
| Job | Set | Categorical | (“admin”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”) |
| Marital | Set | Categorical | Marital status (“married”, “divorced”, “single”; note: “divorced” means divorced or widowed) |
| Education | Set | Categorical | (“unknown”, “secondary”, “primary”, “tertiary”) |
| Default | Flag | Binary (Categorical) | Has defaulted on credit? (Yes/No) |
| Balance | Range | Numeric | Typical annual balance, in euros |
| Housing | Flag | Binary (Categorical) | Has a mortgage loan? (Yes/No) |
| Loan | Flag | Binary (Categorical) | Has an individual loan? (Yes/No) |
| Contact | Set | Categorical | Kind of contact communication (“unknown”, “telephone”, “cellular”) |
| Day | Range | Numeric | Last contact day of the month |
| Month | Set | Categorical | Last contact month of the year (“jan”, “feb”, “mar”, ⋯, “nov”, “dec”) |
| Duration | Range | Numeric | Length of the last contact, in seconds |
| Campaign | Range | Numeric | Number of contacts made for this customer during this campaign (includes the last contact) |
| Pdays | Range | Numeric | Number of days since the last contact with the client from a prior campaign (−1 means the client was not previously contacted) |
| Previous | Range | Numeric | Number of contacts performed for this customer prior to this campaign |
| Poutcome | Set | Categorical | Result of the preceding marketing effort (“unknown”, “other”, “failure”, “success”) |
| Output | Flag | Binary (Categorical) | Output variable (desired outcome): has the customer made a term deposit? (Yes/No) |

The attributes Contact, Day, Month, and Duration pertain to the most recent campaign interaction.
| Methods | Dataset 1: Cluster Number | Dataset 1: Silhouette Coefficient | Dataset 2: Cluster Number | Dataset 2: Silhouette Coefficient |
|---|---|---|---|---|
| Mean Shift | 1 | N/A | 1 | N/A |
| K-means | 8 | 0.226 | 50 | 0.832 |
| Affinity Propagation | 976 | 0.195 | 1525 | 0.0745 |
| DBSCAN | 8 | 0.288 | 50 | 0.832 |
| Fuzzy K-means | 8 | 0.158 | 50 | 0.832 |
| JBOS | 8 | 0.259 | 50 | 0.584 |
| K-means++ | 8 | 0.262 | 50 | 0.60 |
| ECA* | 8 | 0.001 | 50 | −0.12 |
| DPPA | 8 | 0.288 | 50 | 0.832 |
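The silhouette coefficient reported in the table above ranges from −1 to 1 and is defined per point as (b − a)/max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. A small sketch with scikit-learn on synthetic data (not the paper's datasets):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two well-separated blobs: a labelling that matches them scores near 1,
# a labelling that ignores the structure scores near 0.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
good = np.repeat([0, 1], 50)  # first 50 points -> 0, last 50 -> 1
bad = np.tile([0, 1], 50)     # alternating labels, ignoring the blobs
score_good = silhouette_score(X, good)
score_bad = silhouette_score(X, bad)
```

This is why affinity propagation's very large cluster counts coincide with low silhouette scores in the table, while DPPA and DBSCAN tie at the top.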
Share and Cite
Mustapha, S.M.F.D.S. High-Dimensional Data Analysis Using Parameter Free Algorithm Data Point Positioning Analysis. Appl. Sci. 2024, 14, 4231. https://doi.org/10.3390/app14104231