An Effective Fuzzy Clustering of Crime Reports Embedded by a Universal Sentence Encoder Model
Abstract
1. Introduction
1.1. Literature Survey
1.2. Motivation and Objective
1.3. Contribution
- After the dataset is collected, named entities are recognized to extract the noun phrases of the reports, which are then preprocessed by stopword removal and lemmatization. Each report is then converted to a vector by applying the transformer-based Universal Sentence Encoder model to the report's collection of processed noun phrases.
- An undirected graph is constructed in which each report vector is a vertex, and an edge exists between a pair of vertices if the cosine similarity score between them exceeds a predefined threshold.
- A novel graph-based overlapping clustering algorithm is devised from splitting and merging operations. In the splitting operation, a graph is split into subgraphs using the clustering coefficient and degree of the vertices; in the merging operation, two subgraphs are fused back into a single graph based on edge density.
- Fuzzy theory is applied to the overlapping clusters: fuzzification assigns membership values to the reports lying in the overlapping regions, and defuzzification labels those reports with multiple crime types. Thus, reports outside the overlapping regions of the clusters belong to a single crime type, and those inside overlapping regions belong to multiple crime types.
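The graph quantities named in the contributions above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors stand in for Universal Sentence Encoder embeddings, the threshold of 0.8 is an arbitrary illustrative value, the function names are hypothetical, and the edge density formula used here (edges present divided by edges possible) is the standard definition, assumed rather than taken from the paper. The splitting and merging rules themselves are not reproduced.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_similarity_graph(vectors, threshold):
    """Undirected graph as adjacency sets; vertices are report indices.

    An edge (i, j) exists when the cosine similarity exceeds the threshold.
    """
    adjacency = {i: set() for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        if cosine_similarity(vectors[i], vectors[j]) > threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def clustering_coefficient(adjacency, v):
    """Fraction of pairs of neighbours of v that are themselves adjacent."""
    neighbours = adjacency[v]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(sorted(neighbours), 2)
                if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


def edge_density(adjacency, vertices):
    """Edges inside the vertex subset divided by the maximum possible."""
    vertices = set(vertices)
    n = len(vertices)
    if n < 2:
        return 0.0
    edges = sum(len(adjacency[v] & vertices) for v in vertices) // 2
    return 2.0 * edges / (n * (n - 1))


# Toy stand-ins for report embeddings (assumption: real vectors would come
# from the Universal Sentence Encoder and have far more dimensions).
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
graph = build_similarity_graph(vectors, threshold=0.8)
degrees = {v: len(nbrs) for v, nbrs in graph.items()}
```

On these toy vectors, reports 0 to 2 form a triangle and reports 3 and 4 form a single edge, so the degree, clustering coefficient and edge density of the two groups differ visibly, which is the kind of signal the splitting and merging steps are described as using.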
1.4. Summary of the Paper
2. Preprocessing and Report Embedding
2.1. Preprocessing of Reports
2.2. Report Embedding
3. Graph Based Fuzzy Clustering
3.1. Splitting
3.2. Merging
Algorithm 1: Split a graph into subgraphs
Algorithm 2: Merge subgraphs into graphs
3.3. Fuzzy Theory and Report Labelling
Algorithm 3: Fuzzy Theory based Crime Report Labelling-FTCRL()
4. Experimental Results
4.1. Cluster Analysis
4.2. Performance Evaluation
4.2.1. Comparison Using Internal Cluster Indices
4.2.2. Comparison Using Overlapping Cluster Indices
5. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Dataset Name | Number of Reports | Number of Clusters | (Cluster Number, No. of Reports) |
|---|---|---|---|
| | 3405 | 24 | (C1,389), (C2,178), (C3,190), (C4,214), (C5,50), (C6,49), (C7,81), (C8,230), (C9,85), (C10,76), (C11,54), (C12,171), (C13,146), (C14,439), (C15,64), (C16,79), (C17,529), (C18,42), (C19,50), (C20,290), (C21,188), (C22,48), (C23,80), (C24,68) |
| | 3490 | 26 | (C1,392), (C2,196), (C3,68), (C4,59), (C5,143), (C6,214), (C7,77), (C8,138), (C9,158), (C10,168), (C11,111), (C12,97), (C13,263), (C14,78), (C15,121), (C16,204), (C17,96), (C18,170), (C19,95), (C20,145), (C21,212), (C22,144), (C23,297), (C24,146), (C25,110), (C26,163) |
| | 6226 | 32 | (C1,442), (C2,158), (C3,269), (C4,249), (C5,543), (C6,234), (C7,377), (C8,638), (C9,245), (C10,185), (C11,371), (C12,503), (C13,63), (C14,358), (C15,110), (C16,170), (C17,240), (C18,350), (C19,295), (C20,145), (C21,232), (C22,344), (C23,297), (C24,146), (C25,118), (C26,87), (C27,206), (C28,49), (C29,126), (C30,74), (C31,78), (C32,88) |
| | 1323 | 16 | (C1,124), (C2,96), (C3,65), (C4,54), (C5,113), (C6,96), (C7,77), (C8,138), (C9,95), (C10,276), (C11,49), (C12,103), (C13,53), (C14,78), (C15,110), (C16,98) |
| | 1144 | 15 | (C1,194), (C2,226), (C3,60), (C4,42), (C5,58), (C6,76), (C7,168), (C8,71), (C9,45), (C10,89), (C11,58), (C12,178), (C13,68), (C14,72), (C15,83) |
| | 31,515 | 33 | (C1,516), (C2,396), (C3,1612), (C4,871), (C5,768), (C6,1482), (C7,416), (C8,3480), (C9,2945), (C10,1752), (C11,2551), (C12,790), (C13,3379), (C14,2591), (C15,3374), (C16,2861), (C17,2897), (C18,390), (C19,1682), (C20,1889), (C21,2975), (C22,814), (C23,2552), (C24,3701), (C25,4021), (C26,2896), (C27,3215), (C28,4498), (C29,3002), (C30,3169), (C31,4296), (C32,1289), (C33,4158) |
| Dataset | Algorithm | SL | DN | DB | XB | CH | IN |
|---|---|---|---|---|---|---|---|
| | MOCD | 0.72 | 1.20 | 0.51 | 0.42 | 419 | 528 |
| | LPNI | 0.75 | 1.37 | 0.49 | 0.69 | 406 | 523 |
| | CSLMA | 0.76 | 1.54 | 0.52 | 0.64 | 474 | 584 |
| | GICDA | 0.70 | 1.01 | 0.53 | 0.48 | 458 | 590 |
| | CRCA | 0.80 | 0.98 | 0.51 | 0.39 | 466 | 540 |
| | Proposed | 0.81 | 1.94 | 0.42 | 0.34 | 474 | 591 |
| | MOCD | 0.73 | 0.92 | 0.52 | 0.59 | 402 | 410 |
| | LPNI | 0.69 | 1.17 | 0.50 | 0.54 | 407 | 397 |
| | CSLMA | 0.68 | 1.06 | 0.49 | 0.56 | 399 | 389 |
| | GICDA | 0.63 | 0.98 | 0.58 | 0.52 | 372 | 377 |
| | CRCA | 0.76 | 0.93 | 0.56 | 0.33 | 396 | 467 |
| | Proposed | 0.77 | 1.98 | 0.44 | 0.31 | 409 | 473 |
| | MOCD | 0.71 | 0.97 | 0.49 | 0.45 | 411 | 496 |
| | LPNI | 0.68 | 0.92 | 0.51 | 0.47 | 407 | 368 |
| | CSLMA | 0.69 | 0.88 | 0.48 | 0.41 | 398 | 407 |
| | GICDA | 0.68 | 0.81 | 0.63 | 0.48 | 396 | 412 |
| | CRCA | 0.72 | 0.98 | 0.70 | 0.37 | 436 | 491 |
| | Proposed | 0.72 | 1.16 | 0.42 | 0.36 | 443 | 507 |
| | MOCD | 0.69 | 0.91 | 0.59 | 0.41 | 392 | 589 |
| | LPNI | 0.66 | 0.84 | 0.60 | 0.42 | 387 | 596 |
| | CSLMA | 0.68 | 0.78 | 0.58 | 0.39 | 396 | 593 |
| | GICDA | 0.64 | 0.76 | 0.62 | 0.41 | 404 | 508 |
| | CRCA | 0.72 | 1.07 | 0.65 | 0.33 | 431 | 579 |
| | Proposed | 0.74 | 1.12 | 0.50 | 0.31 | 457 | 612 |
| | MOCD | 0.61 | 0.94 | 0.71 | 0.49 | 205 | 310 |
| | LPNI | 0.64 | 0.91 | 0.68 | 0.41 | 192 | 302 |
| | CSLMA | 0.62 | 0.92 | 0.71 | 0.48 | 184 | 279 |
| | GICDA | 0.55 | 0.82 | 0.65 | 0.54 | 146 | 304 |
| | CRCA | 0.68 | 1.10 | 0.70 | 0.37 | 263 | 593 |
| | Proposed | 0.69 | 1.06 | 0.63 | 0.38 | 315 | 586 |
| | MOCD | 0.77 | 1.13 | 0.49 | 0.45 | 372 | 553 |
| | LPNI | 0.72 | 0.97 | 0.50 | 0.47 | 363 | 594 |
| | CSLMA | 0.78 | 0.95 | 0.48 | 0.41 | 405 | 579 |
| | GICDA | 0.71 | 0.83 | 0.57 | 0.53 | 368 | 571 |
| | CRCA | 0.81 | 1.19 | 0.40 | 0.37 | 436 | 687 |
| | Proposed | 0.81 | 2.91 | 0.40 | 0.33 | 441 | 589 |
Internal Cluster Validation Indices

Methods | SL | DN | DB | XB | CH | IN |
---|---|---|---|---|---|---|
MOCD | 0.70 | 1.01 | 0.55 | 0.46 | 3.66 | 4.81 |
LPNI | 0.69 | 1.03 | 0.54 | 0.50 | 3.6 | 4.63 |
CSLMA | 0.70 | 1.02 | 0.54 | 0.48 | 3.76 | 5.21 |
GICDA | 0.65 | 0.86 | 0.59 | 0.49 | 3.57 | 4.61 |
CRCA | 0.74 | 1.04 | 0.58 | 0.36 | 4.03 | 5.59 |
Proposed | 0.61 | 1.69 | 0.47 | 0.36 | 4.05 | 5.03 |
| Dataset | Algorithm | PC | PE | DI | GD | KI |
|---|---|---|---|---|---|---|
| | OCLP | 0.73 | 0.31 | 0.71 | 0.52 | 8.98 |
| | SEOC | 0.70 | 0.28 | 0.73 | 0.50 | 8.86 |
| | FCMO | 0.71 | 0.32 | 0.70 | 0.48 | 8.49 |
| | GICDA | 0.63 | 0.33 | 0.52 | 0.45 | 9.14 |
| | CRCA | 0.79 | 0.29 | 0.78 | 0.51 | 8.94 |
| | Proposed | 0.79 | 0.25 | 0.79 | 0.56 | 8.31 |
| | OCLP | 0.80 | 0.35 | 0.73 | 0.68 | 9.38 |
| | SEOC | 0.78 | 0.37 | 0.74 | 0.67 | 9.15 |
| | FCMO | 0.76 | 0.37 | 0.78 | 0.69 | 10.08 |
| | GICDA | 0.71 | 0.41 | 0.73 | 0.56 | 10.04 |
| | CRCA | 0.81 | 0.33 | 0.80 | 0.62 | 8.71 |
| | Proposed | 0.84 | 0.27 | 0.79 | 0.65 | 9.26 |
| | OCLP | 0.77 | 0.34 | 0.71 | 0.55 | 9.14 |
| | SEOC | 0.73 | 0.31 | 0.74 | 0.58 | 9.02 |
| | FCMO | 0.77 | 0.35 | 0.72 | 0.58 | 9.10 |
| | GICDA | 0.71 | 0.37 | 0.68 | 0.52 | 9.15 |
| | CRCA | 0.80 | 0.28 | 0.81 | 0.57 | 8.72 |
| | Proposed | 0.82 | 0.26 | 0.81 | 0.61 | 8.58 |
| | OCLP | 0.74 | 0.23 | 0.51 | 0.60 | 9.12 |
| | SEOC | 0.68 | 0.37 | 0.68 | 0.61 | 9.16 |
| | FCMO | 0.54 | 0.42 | 0.67 | 0.58 | 9.38 |
| | GICDA | 0.80 | 0.26 | 0.51 | 0.50 | 9.44 |
| | CRCA | 0.82 | 0.26 | 0.80 | 0.61 | 8.41 |
| | Proposed | 0.80 | 0.26 | 0.81 | 0.64 | 8.38 |
| | OCLP | 0.76 | 0.25 | 0.81 | 0.68 | 8.25 |
| | SEOC | 0.70 | 0.29 | 0.84 | 0.65 | 8.28 |
| | FCMO | 0.71 | 0.31 | 0.73 | 0.65 | 8.21 |
| | GICDA | 0.73 | 0.38 | 0.78 | 0.69 | 9.52 |
| | CRCA | 0.81 | 0.17 | 0.86 | 0.70 | 7.41 |
| | Proposed | 0.82 | 0.22 | 0.88 | 0.72 | 8.46 |
| | OCLP | 0.81 | 0.25 | 0.78 | 0.79 | 7.78 |
| | SEOC | 0.81 | 0.28 | 0.75 | 0.83 | 8.04 |
| | FCMO | 0.83 | 0.34 | 0.79 | 0.77 | 8.14 |
| | GICDA | 0.79 | 0.36 | 0.83 | 0.62 | 8.39 |
| | CRCA | 0.85 | 0.13 | 0.91 | 0.84 | 7.24 |
| | Proposed | 0.86 | 0.13 | 0.85 | 0.89 | 7.19 |
Overlapping Cluster Validation Indices

Methods | PC | PE | DI | GD | KI |
---|---|---|---|---|---|
OCLP | 0.76 | 0.28 | 0.70 | 0.63 | 8.77 |
SEOC | 0.73 | 0.31 | 0.74 | 0.64 | 8.75 |
FCMO | 0.73 | 0.35 | 0.73 | 0.62 | 8.91 |
GICDA | 0.72 | 0.35 | 0.70 | 0.55 | 9.28 |
CRCA | 0.81 | 0.24 | 0.82 | 0.64 | 8.23 |
Proposed | 0.82 | 0.23 | 0.82 | 0.67 | 8.36 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pramanik, A.; Das, A.K.; Pelusi, D.; Nayak, J. An Effective Fuzzy Clustering of Crime Reports Embedded by a Universal Sentence Encoder Model. Mathematics 2023, 11, 611. https://doi.org/10.3390/math11030611