Article

An Effective Fuzzy Clustering of Crime Reports Embedded by a Universal Sentence Encoder Model

by Aparna Pramanik, Asit Kumar Das, Danilo Pelusi and Janmenjoy Nayak
1 Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah 711103, West Bengal, India
2 Department of Communication Sciences, University of Teramo, 64100 Teramo, Italy
3 Post Graduate Department of Computer Science, Maharaja Sriram Chandra Bhanja Deo (MSCB) University, Baripada 757003, Odisha, India
* Authors to whom correspondence should be addressed.
Mathematics 2023, 11(3), 611; https://doi.org/10.3390/math11030611
Submission received: 15 December 2022 / Revised: 12 January 2023 / Accepted: 18 January 2023 / Published: 26 January 2023
(This article belongs to the Special Issue Graph Theory and Applications, 2nd Edition)

Abstract

Crime report clustering is crucial for identifying and preventing criminal activities that frequently happen in society. In the proposed work, named entities in a report are recognized to extract the crime-related phrases, and subsequently the phrases are preprocessed by applying stopword removal and lemmatization operations. Next, the transformer module of the Universal Sentence Encoder model is applied to the extracted phrases of the report to get a sentence embedding for each associated sentence, the aggregation of which finally provides the vector representation of that report. An innovative and efficient graph-based clustering algorithm consisting of splitting and merging operations has been proposed to obtain the clusters of crime reports. The proposed clustering algorithm generates overlapping clusters, which indicates the existence of reports of multiple crime types. Fuzzy theory has been used to provide a score to each report expressing its membership in different clusters, and accordingly the reports are labelled with multiple categories. The efficiency of the proposed method has been assessed on different datasets and compared with other state-of-the-art approaches with the help of various performance metrics.

1. Introduction

Crime rates are frequently increasing in various places across the world, and technological advancements have made information about these crimes easily accessible on social media. This huge amount of information can be divided into different groups based on crime categories, making it convenient for police personnel and investigators to take appropriate actions to reduce criminal activities in society. Clustering algorithms play an important role for this purpose, as they group together crime reports of similar crime types. There are many clustering methods [1,2] that can be used to cluster both structured and unstructured datasets, but research on report clustering over a long period of time has shown that it is neither an easy task nor perfectly solved yet. Here, we have proposed a novel overlapping clustering algorithm using graph theory for partitioning crime reports of different categories. The clustering has been carried out considering concepts of graph theory such as the clustering coefficient, the degree of the nodes, and the edge density of the graph. After generating the clusters of crime reports, a fuzzy technique has been introduced to set scores for each report, which give the degrees of membership of the report to reside in different clusters; finally, the reports are labelled with multiple classes (i.e., crime types) based on their membership values. In the proposed work, data preprocessing techniques are important prior steps to represent the reports in a structured form, which not only helps for efficient clustering of the reports but also extracts effective information from the reports to facilitate the clustering process.
Initially, the crime reports are collected [3] and the named entities are recognized [4] to select only the crime-related phrases. Next, the stopwords are removed, and lemmatization [5] is performed on the extracted phrases to select only the meaningful root words of the crime-related phrases, which are finally used for report embedding. To achieve this, the Universal Sentence Encoder (USE) [6], one of the best-performing sentence embedding techniques, is applied. The key feature that inspires us to use it is its wide application in multi-task learning tasks like sentiment analysis, sentence similarity, clustering, and classification. The USE model is developed based on two encoders, namely the Transformer and the Deep Averaging Network (DAN). Both of these models are capable of taking a word or a sentence as input and generating embeddings for it. The models take the sentences as input, tokenize them, and convert each sentence to a 512-dimensional vector, the average of which provides a 512-dimensional vector for the report. The function of the transformer encoder is similar to the encoder module of the transformer architecture, and it uses the self-attention mechanism. The DAN computes unigram or bigram embeddings first and then averages them to get a single embedding, which is subsequently passed to a deep neural network to obtain a final sentence embedding of 512 dimensions. We have used the transformer model for sentence embedding for its simplicity and efficiency.

1.1. Literature Survey

Community detection, or partitioning of a graph into subgraphs, is crucial for identifying coherent groups or clusters in which the elements inside a cluster are tightly connected. In the literature, various partitioning algorithms have been presented to detect communities or partitions for different problems, and the structures of these partitions are mostly hierarchical clusters [7], overlapping clusters [8], and disjoint clusters [9]. A semi-supervised graph partitioning algorithm has been introduced in [10], which employs graph regularisation to blend prior information with the network topology. Girvan et al. [7] proposed a graph-based method to make the clusters in a hierarchical way. In this approach, they removed the edge with the highest betweenness to make the clusters, and at the last stage, every node has been placed in a separate cluster. A graph clustering algorithm has been proposed by Bianchi et al. in their paper [11], addressing the constraints of spectral clustering. They have applied a graph neural network model and embedded a min-cut pooling operation to make the clusters. In [12], K. Taha utilized the concepts of edge betweenness, relative importance score, and degree of association scores to find the disjoint clusters within a graph. But in real life, there exist many problems in which the clusters are overlapping in nature; therefore, many researchers have proposed different algorithms for generating overlapping clusters. Ghoshal et al. [13] have introduced an algorithm to detect disjoint and overlapping communities based on the mean path length accompanying the modularity index in a Genetic Algorithm. In [14,15,16], different approaches have been highlighted to detect overlapping communities from a network. The node influence has been identified in [14] by measuring the degree centrality of a node, and another factor, called the agglomeration coefficient, has also been considered for the task. In [16], the label propagation technique has been used for finding overlapping communities. Rezvani et al. [17] have detected overlapping clusters by proposing a novel community fitness metric, named the triangle-based fitness metric. Whang et al. [15] have proposed the neighborhood inflation technique to detect overlapping communities. Initially, they determined good seed nodes in a graph. Later, a PageRank clustering scheme has been applied to optimize the conductance community score. The important step of their method is neighborhood inflation, where seeds are modified to represent their entire vertex neighborhood, and the drawback of their method is that it produces much larger communities to cover the entire graph. Both overlapping and non-overlapping communities have been detected in [18] by introducing a vertex-based metric called GenPerm. In [19], an overlapping community detection algorithm, named the Scalable Spectral Clustering algorithm, is proposed, which is an extension of the notion of normalized cut and is able to find overlapping communities in a large network. In addition to these methods, there exist several fuzzy techniques to detect communities, which estimate the likelihood of each node belonging to each community. However, the majority of these algorithms require prior knowledge, such as the community size and the number of communities. Su et al. [20] and Yazdanparast et al. [21] have applied the fuzzy method for community detection by modularity maximization. The concept of self-membership has been introduced in paper [22].
Here, the method allows all the nodes to grow their own community, and the anchor nodes, those with a higher degree of self-membership, have the opportunity to grow the linked community. While incorrect or unnecessary anchors are eliminated, some new anchors may appear in subsequent iterations. In [23], the authors have proposed a multiobjective fuzzy clustering method in which they have optimized the cluster compactness and the level of fuzziness. The concepts of fuzzy F*-simply connected spaces and fuzzy F*-contractible spaces are presented by Madhuri et al. [24]. Later, they analysed some significant characteristics of fuzzy F*-homotopy and also proved that each fuzzy F*-loop based at any fuzzy point in a fuzzy F*-contractible space is equivalent to the constant fuzzy F*-loop. Dhanya et al. [25] proposed a fuzzy hypergraph-based model to predict crimes in various locations. The crime fuzzy hypergraph contains two layers: an outer level and an interior level. Both levels have been subjected to morphological procedures like dilation and erosion. The authors of paper [26] have also used the fuzzy clustering method for text categorization. It follows steps such as fuzzy transformation for dimensionality reduction, cluster membership assignment, cluster-to-category mapping, and finally obtaining the assigned category by applying a threshold. Meng et al. [27] have introduced a new measure called the network motif, which is a small connected subgraph that contains multiple nodes and edges and represents the information interactions among the nodes. In our paper, we have proposed an innovative Euclidean distance-based fuzzy clustering algorithm using graph splitting and subgraph merging operations for the clustering of crime reports.

1.2. Motivation and Objective

One of the major issues facing humanity is crime, which poses a great danger to every human on the globe. As criminal activities are increasing day by day, crime report analysis is very important for preventing them in society. The main purpose of this work is to select crime-related information from a wide variety of crime reports and share it with police officers so that they can take preventive measures against criminal activities. Analyzing crime reports manually is a very difficult task and quite impossible for a huge volume of complex data. Therefore, different types of crime report analysis techniques have been presented, such as classification of crimes, clustering of crimes, location detection, and many more. In practice, most of the generated crime reports are unlabeled, so unsupervised learning, such as the clustering approach, is more effective for crime report analysis. Clustering of crime reports aids in identifying connections and linkages between illegal activities. In crime report clustering, crime reports are placed in different groups based on their context, so when investigators want to investigate a particular crime type, they can focus on a particular cluster of reports, which reduces the time complexity of the investigation as well as helps to provide more effective information. Therefore, it is required to group the crime reports according to the crime types. However, it is possible for one piece of information about a crime to contain information about another type of crime. This creates an overlapping cluster dilemma, where one crime incident can fit into many crime types. For example, suppose in a crime incident it has been found that someone kidnapped a person and then killed him; this crime incident then falls into two categories. While many efforts have been made to locate overlapping groups of reports, relatively few have been successful in locating crime reports that contain information on several crime types. Additionally, there are numerous graph partitioning methods that yield an excessively large number of overlapping clusters and have significant computational costs. There are numerous edges connecting a node for a generic document to nodes for comparable documents, since some documents are very general and consequently similar to many other documents. To improve the partitioning quality, these kinds of edges must be eliminated. Therefore, a novel graph-based fuzzy clustering technique has been proposed to address the overlapping clustering problem effectively.

1.3. Contribution

As a contribution, we have applied an innovative data preprocessing method where only the noun phrases of each report have been bunched together. These newly formed reports have then been processed and clustered by our proposed graph-based clustering algorithm. The main contribution of our work is to produce overlapping clusters with fuzzy membership values for each report in the overlapping regions for the purpose of crime analysis. Initially, named entities of each report have been detected, and noun phrases are bunched together. Next, the extracted phrases of each report have been preprocessed by removing the stopwords from the sentences and selecting the root words using lemmatization. Then the preprocessed phrases of a report are embedded by applying the transformer architecture based Universal Sentence Encoder (USE) [6], and the report embedding is obtained by averaging all the phrase embeddings of the report. Subsequently, a graph has been constructed based on the ξ-ball graph construction method [28], where the vector representation of each report has been considered as a node and the cosine similarity between a pair of nodes is represented by an edge in the graph. The cosine similarity between each pair of nodes is measured, and if the similarity crosses a threshold, then an edge is placed between them. Later, the overlapping clusters have been discovered by following two steps, namely splitting a graph into subgraphs and merging subgraphs into a graph. The splitting operation partitions the graph by considering the clustering coefficient and degree centrality measures, whereas the merging operation fuses subgraphs based on the edge density measure of the graph to obtain the optimal set of clusters. Finally, an innovative fuzzy technique has been introduced for the reports that lie in multiple clusters to assign the degrees of membership, which helps to label the reports with multiple crime types.
The workflow diagram of the proposed work is shown in Figure 1, and the main contributions of the paper are summarized in the following steps.
  • After collecting the dataset, named entities are recognized to extract the noun phrases of the reports, which are subsequently preprocessed by following stopword removal and lemmatization operations. Then each report has been converted to a vector by applying a transformer architecture-based Universal Sentence Encoder model on the collection of extracted processed noun phrases of the report.
  • An undirected graph is constructed where each report vector is considered as a vertex, and an edge exists between a pair of vertices if the cosine similarity score between them crosses a predefined threshold.
  • A novel graph-based overlapping clustering algorithm has been deduced based on splitting and merging operations. In the splitting operation, a graph is split into subgraphs using the clustering coefficient and degree of the vertices, and in the merging operation, a graph is reformed by fusing two subgraphs based on edge density.
  • Fuzzy theory is applied on the overlapping clusters, where fuzzification is done to provide membership values to the reports lying in the overlapping regions, and defuzzification is done to label the reports with multiple crime types. Thus, reports outside the overlapping regions of the clusters are of a single crime type, and those in overlapping regions are of multiple crime types.

1.4. Summary of the Paper

The remaining sections of the paper are arranged as follows: Section 2 describes the preprocessing and report embedding process, and the proposed graph-based fuzzy overlapping clustering algorithm is described in Section 3. The experimental results and discussions are presented in Section 4. Finally, in Section 5, the conclusion and the future work have been discussed.

2. Preprocessing and Report Embedding

Here, the collected crime reports are preprocessed to remove the irrelevant words and extract only the root words of the reports. Also, each report is represented by a vector using a universal sentence encoder model.

2.1. Preprocessing of Reports

The unlabelled crime reports have been collected from an online platform; each report is described by its short description together with its headline, and all other information has been removed from the report. The words of each report have been tokenized and assigned part-of-speech tags. This operation is carried out by the Natural Language Toolkit (NLTK)’s [29] built-in sentence segmenter, word tokenizer, and part-of-speech tagger by default. The next step is to look for any named entities present in a sentence by bunching noun phrases [4]. This process has been depicted through an example in Figure 2. Here, PPR stands for Personal Pronoun, NN for Noun, and VBD for past-tense Verb. The example contains the named entities shown in the two larger square boxes on both sides of the tokenized word ‘killed’, which has been tagged as VBD. These noun phrases have been collected and bunched together for each report. Then the stopwords have been discarded, and a lemmatization operation has been performed to find the root words. Thus, after preprocessing, the sentence “Her husband killed their children” becomes “husband kill children”. This noun phrase has a pair of named entities, namely “husband” and “children”, which are related by “kill”, a crime-related word. Thus, our objective in this preprocessing step is to represent each report as a collection of preprocessed noun phrases, which are applied to the universal sentence encoder model for report embedding.
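To make this preprocessing pipeline concrete, the following is a minimal sketch using NLTK; the chunk grammar and the filtering rules are illustrative assumptions, not the exact configuration used in the paper.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Required NLTK resources (downloaded once):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords'); nltk.download('wordnet')

STOP = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()
# Illustrative grammar that bunches determiners, adjectives and nouns into noun phrases.
NP_PARSER = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')

def preprocess_report(text):
    """Return a report as a list of preprocessed noun-phrase strings."""
    phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = NP_PARSER.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            words = [LEMMATIZER.lemmatize(word.lower())
                     for word, _tag in subtree.leaves()
                     if word.isalpha() and word.lower() not in STOP]
            if words:
                phrases.append(' '.join(words))
    return phrases
```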

2.2. Report Embedding

We have used the transformer architecture based Universal Sentence Encoder model [6] for the purpose of embedding. The model takes as input a lowercased PTB-tokenized string and produces as output a 512-dimensional vector as the sentence embedding. This model is based on the Transformer architecture, which provides better accuracy on downstream tasks but imposes significantly higher computational complexity due to its complex architecture; its computation time scales dramatically with the length of the sentence. In our work, we have considered the preprocessed phrases as individual sentences, which are of very small length, and thus the encoder model is used efficiently. Also, we have used the publicly available pre-trained universal sentence encoder, which further reduces the time complexity of the proposed work. The transformer architecture’s encoding sub-graph is used by the module to carry out the sentence embeddings [30]. This sub-graph employs attention to compute context-aware word representations in a phrase that take into consideration the identity and order of every other word. The element-wise sum of the representations at each word position is then computed to turn the contextual word representations into a fixed-length sentence encoding vector. Then the average of the encoded vectors of all extracted phrases of a report has been calculated and used to represent the report in vector form, and it has been taken to accomplish the rest of the work.
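As an illustration, the sketch below embeds the preprocessed phrases of a report with a pre-trained transformer-based USE module and averages them into one report vector; the TF Hub module URL is an assumption (any pre-trained transformer variant of USE would serve).

```python
import numpy as np
import tensorflow_hub as hub

# Transformer variant of the Universal Sentence Encoder (assumed module URL).
USE = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

def embed_report(phrases):
    """Embed every preprocessed phrase (512-d each) and average them
    into a single 512-d vector representing the whole report."""
    vectors = USE(phrases).numpy()   # shape: (number_of_phrases, 512)
    return vectors.mean(axis=0)      # shape: (512,)

# Example:
# report_vector = embed_report(preprocess_report("Her husband killed their children."))
```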

3. Graph Based Fuzzy Clustering

The proposed clustering method takes the report embeddings as input and extracts the inherent groups of similar crime reports naturally, without prior knowledge about the number of groups or the size of each group. The relationships among the reports have been represented by a graph G = (V, E). The graph has been constructed with each report treated as a vertex, so a vertex of the graph is basically the vector representation of a particular report. An edge has been constructed between a pair of vertices if the cosine similarity between the two respective report embeddings crosses a predefined threshold. This concept of graph construction is used in paper [28] (named the ξ-ball construction), and it has been applied in our work to construct the graph. The created graph is undirected, and depending on the similarity values, it can even be disconnected. When the graph is connected, the proposed clustering algorithm based on splitting and merging operations is applied to it to produce overlapping subgraphs, each of which produces a cluster of reports. If the constructed graph is disconnected, then the proposed graph-based clustering algorithm is applied to every component individually. The proposed algorithm is developed based on the concepts of the clustering coefficient of a vertex (the fraction of possible triangles through that vertex), the degree of a vertex (the number of edges incident on the vertex), and the edge density of the graph (the number of edges divided by the maximal number of edges), and it follows two steps, namely the splitting and merging steps.
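A minimal sketch of this graph construction is given below; the similarity threshold value is illustrative, since the paper only states that a predefined threshold is used.

```python
import numpy as np
import networkx as nx

def build_report_graph(report_vectors, threshold=0.5):
    """xi-ball style construction: one vertex per report vector, an edge whenever
    the cosine similarity of two report embeddings reaches the threshold."""
    G = nx.Graph()
    G.add_nodes_from(range(len(report_vectors)))
    unit = [v / np.linalg.norm(v) for v in report_vectors]
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            if float(np.dot(unit[i], unit[j])) >= threshold:
                G.add_edge(i, j)
    return G
```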

3.1. Splitting

We split the graph based on the clustering coefficient and degree of the vertices. The clustering coefficient and degree have been calculated for all the vertices present in the graph, and the vertex with the highest clustering coefficient has been chosen for performing the splitting operation. If multiple vertices have the same clustering coefficient, we consider the vertex with the highest degree among them, and if a tie still exists between multiple such vertices, then a vertex is randomly selected from the tied set. Considering this vertex, the partition has been made on the graph to get the subgraphs. If vertex v of the graph G = (V, E) is the one selected for splitting the graph, then we create a set V1 of vertices that consists of v and all its neighbours in G. Next, we create a graph G1 = (V1, E1), where E1 is the subset of edges of E whose end vertices are in V1. Next, we remove the subgraph G1 from G to get G2 without neglecting any edge of G. That is, though a vertex is in G1, it may also appear in G2 so that all the edges of G which are not in G1 are kept in G2. Thus the splitting operation provides the pair of overlapping subgraphs G1 and G2. If G2 is a null graph, the process terminates; otherwise, the same splitting process is continued on graph G2. Thus the process provides a list of overlapping subgraphs. If all the vertices of a subgraph are covered by some of the other subgraphs, i.e., if each vertex of the subgraph is a vertex of some other subgraph in the list, then the subgraph is redundant and removed from the list of subgraphs. This operation has been illustrated in Figure 3. In Figure 3, we can see that vertex a has the highest clustering coefficient and degree, and so the graph G has been split into G1 and G2 considering this vertex. In G1, all vertices adjacent to a have been kept, and the edges between them have also been preserved. The remaining portion comes out as G2. In G2, the vertices v, d, e, and f have the highest clustering coefficients and degrees, so we randomly select any one vertex, say v, for splitting G2. Repeating this process, we obtain the subgraphs G21, G22, G23, and G24. But the vertices of G24 are covered by G1, G21, and G23, and so G24 is removed. Thus, after performing the splitting operation on the given graph G, we have the subgraphs G1, G21, G22, and G23, as shown by the green-colored subgraphs in Figure 3. The pseudocode of the proposed splitting algorithm is given in Algorithm 1.
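A compact sketch of this splitting step, assuming the networkx library mentioned in Section 4, is given below; tie-breaking and edge-case handling are simplified relative to Algorithm 1.

```python
import networkx as nx

def split(G):
    """Split G into a list of (possibly overlapping) subgraphs."""
    subgraphs, rest = [], G.copy()
    while rest.number_of_nodes() > 0:
        cc = nx.clustering(rest)
        # Highest clustering coefficient, ties broken by degree.
        v = max(rest.nodes, key=lambda n: (cc[n], rest.degree[n]))
        V1 = set(rest.neighbors(v)) | {v}
        G1 = rest.subgraph(V1).copy()
        subgraphs.append(G1)
        # Keep every remaining edge that is not inside G1, then drop
        # vertices left without incident edges (they stay in G1 only).
        rest.remove_edges_from(list(G1.edges()))
        rest.remove_nodes_from([n for n in list(rest.nodes) if rest.degree[n] == 0])
    # Remove redundant subgraphs whose vertices are all covered by the others.
    changed = True
    while changed:
        changed = False
        for i, S in enumerate(subgraphs):
            others = set()
            for j, T in enumerate(subgraphs):
                if j != i:
                    others.update(T.nodes)
            if others and set(S.nodes) <= others:
                subgraphs.pop(i)
                changed = True
                break
    return subgraphs
```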

3.2. Merging

After the splitting process, the merging operation is performed over the subgraphs based on their edge density. In the merging process, two subgraphs are fused if the edge density of the resultant subgraph is greater than or equal to the average of the edge densities of the two individual subgraphs. Here, we check the overlapping region of every pair of subgraphs in the list S. We start by merging the two subgraphs whose overlapping region contains the maximum number of vertices. Next, among the resultant subgraphs, the two subgraphs with the maximum number of common vertices are considered for merging, and so on. The process terminates if no more merging is possible. The merging step has been explained in Figure 4. After the splitting operation, the graph G has been partitioned into the list of subgraphs G1, G21, G22, and G23. Here, the subgraphs G21 and G23 have the maximum number of common vertices, which are c and y. The edge densities of G21 and G23 are 1.0 and 0.66, respectively, so their average edge density is 0.83. If we merge them, then the resulting graph becomes G213, and its edge density is 0.83, which is equal to the average edge density of G21 and G23. Thus we get the resultant subgraphs G1, G22, and G213. Though there are common vertices between G22 and G213 and between G1 and G213, based on the merging condition they fail to merge. So the final set of subgraphs is {G1, G22, G213}, and the clusters are {a, b, v, x}, {c, e, f}, and {c, d, v, y}. The pseudocode of the merging operation is described in Algorithm 2.
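A sketch of this merging step is given below, again assuming networkx; candidate pairs are considered in decreasing order of overlap, and the density test follows the rule described above.

```python
import itertools
import networkx as nx

def merge(subgraphs):
    """Fuse overlapping subgraphs while the edge-density condition holds."""
    subs = list(subgraphs)
    while True:
        # Candidate pairs, largest overlap first.
        pairs = sorted(
            ((i, j, len(set(subs[i].nodes) & set(subs[j].nodes)))
             for i, j in itertools.combinations(range(len(subs)), 2)),
            key=lambda p: p[2], reverse=True)
        pairs = [p for p in pairs if p[2] > 0]
        for i, j, _overlap in pairs:
            fused = nx.compose(subs[i], subs[j])  # union of vertices and edges
            avg_density = (nx.density(subs[i]) + nx.density(subs[j])) / 2
            if nx.density(fused) >= avg_density:
                subs = [S for k, S in enumerate(subs) if k not in (i, j)] + [fused]
                break
        else:
            return subs   # no admissible fusion left
```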
Algorithm 1: Split a Graph into subgraphs - SPLIT(G, S)
Algorithm 2: Merge subgraphs into graphs - MERGE(S)

3.3. Fuzzy Theory and Report Labelling

After applying the splitting and merging operations on G, we have found a set of overlapping subgraphs, which implies that some crime reports have been placed in more than one cluster. Therefore, a fuzzy technique has been applied to handle the overlapping problem. We have applied fuzzification by defining a Euclidean distance-based membership function. This membership function gives the membership value by which a report belongs to a cluster. It has been applied only to the reports that belong to more than one cluster. We have already embedded each report as a 512-dimensional vector. First, we compute the mean of all elements of a cluster and consider it as the representative of that cluster. Let a report r_i lie in t clusters C_1, C_2, ..., C_t. Then the membership value μ_ij by which report r_i lies in cluster C_j is defined by Equation (1), where d_ij is the Euclidean distance between report r_i and the representative of cluster C_j. Thus, report r_i has t membership values, μ_i1, μ_i2, ..., μ_it, obtained by the fuzzification technique.
μ_ij = 1 − d_ij / ∑_{j=1}^{t} d_ij        (1)
After assigning the membership values to the reports in the overlapping regions, we apply defuzzification. We consider a threshold δ for defuzzification, and if the membership value of a report r_i to reside in a cluster C_j is less than δ, then the report r_i is removed from C_j. After applying the defuzzification technique, a report may still be in a few clusters, depending on the δ value. Thus, after defuzzification, the report r_i may be in l clusters, where l < t. Next, for labelling the reports, we first label the clusters with different crime types. As each report is described by noun phrases, and the two named entities in a phrase are related by some crime word, we select a set of such words for each cluster. The highest-frequency word is selected from the set, and the cluster is labelled with this crime-related word. So, each report in the non-overlapping region of a cluster is labelled with the label of that cluster. But if a report r_i of the overlapping region has l membership values after defuzzification, then r_i is labelled with the labels of the corresponding l clusters. So, when we want to investigate the reports of some particular crime type, we simply extract the reports labelled with this crime type, which makes the investigation process simpler. The pseudocode of the crime report labelling technique is described in Algorithm 3.
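The following is a small sketch of the fuzzification and defuzzification steps; the membership computation follows Equation (1), while the threshold value delta used here is purely illustrative.

```python
import numpy as np

def memberships(report_vector, cluster_means):
    """Fuzzification: membership of one report in each of the t clusters
    it currently belongs to, following Equation (1)."""
    d = np.array([np.linalg.norm(report_vector - m) for m in cluster_means])
    return 1.0 - d / d.sum()

def defuzzify(membership_values, cluster_ids, delta=0.3):
    """Defuzzification: keep the report only in clusters whose membership
    value is at least the threshold delta."""
    return [c for c, mu in zip(cluster_ids, membership_values) if mu >= delta]
```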
Algorithm 3: Fuzzy Theory based Crime Report Labelling - FTCRL(S, R)

4. Experimental Results

The targeted task has been completed utilising a variety of Python 3.7 modules, including pytorch 1.12.0, numpy 1.12.0, matplotlib 2.2, and networkx 1.11. Initially, a news dataset of about 200,000 news items from various categories, published in the United States of America between the years 2012 and 2018, has been gathered from the website kaggle.com [3]. The efficiency of the proposed algorithm has been evaluated on five different categories of news datasets, formed considering Crime, Women, Food and drinks, Environment, and College, and named DS1, DS2, DS3, DS4, and DS5, respectively, for future reference in the paper. The crime report dataset used in paper [31] is also considered to evaluate our proposed model and is named DS6 in our paper. This dataset contains news of crime incidents that happened in different places in India, the USA, and the UAE between 2008 and 2016.

4.1. Cluster Analysis

After collecting the datasets, the proposed graph-based fuzzy clustering algorithm has been applied to all of them to make the clusters. The clusters found by applying our proposed algorithm are overlapping in nature. The description of each dataset, along with information about the clusters obtained by applying the proposed clustering algorithm, is given in Table 1.

4.2. Performance Evaluation

The performance of the proposed algorithm is compared with some existing clustering algorithms in this section. Some algorithms that produce overlapping clusters have been chosen, and some disjoint clustering algorithms have also been selected for the comparison. After applying the fuzzification technique to the overlapping clusters, we have kept each report in a single cluster based on its degree of membership to obtain the disjoint clusters. In the case of disjoint clustering algorithms, internal indices are considered, and in the case of overlapping clustering algorithms, overlapping indices are considered.

4.2.1. Comparison Using Internal Cluster Indices

The disjoint clustering algorithms that have been considered here for comparison are (i) Modularity Optimization-based Community Detection (MOCD) [32], (ii) Label Propagation Algorithm using Node Influence (LPNI) [33], (iii) Community identification by Smart local Moving Algorithm (CSLMA) [34], (iv) Gini Index-based Community Detection Algorithm (GICDA) [35], and (v) Crime Report Clustering algorithm (CRCA) [36] and the internal cluster validation indices that have been taken into consideration are Dunn’s index (DN) [37], Silhouette index (SL) [37], Davies-Bouldin index (DB) [37], Calinski-Harabasz index (CH) [37], Xie-Beni index (XB) [37], and I-index (IN) [37]. These are computed for all the datasets that have been mentioned earlier, and the results are listed in Table 2. The best results are marked by the boldface.
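As an aside, a sketch of how three of these internal indices can be computed for the disjoint variant of the clustering is given below; it assumes scikit-learn, which provides SL, DB, and CH out of the box, whereas DN, XB, and IN would require separate implementations.

```python
import numpy as np
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

def internal_indices(report_vectors, labels):
    """SL, DB and CH indices for a disjoint labelling of the report embeddings."""
    X = np.asarray(report_vectors)
    return {
        "SL": silhouette_score(X, labels),
        "DB": davies_bouldin_score(X, labels),
        "CH": calinski_harabasz_score(X, labels),
    }
```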
From Table 2, it can be seen that for dataset DS5 the proposed algorithm does not provide the best values of the XB and IN indices. It is a college dataset, where our proposed named-entity-based paraphrase vectorization possibly does not work well for discriminating the reports. It can also be seen for dataset DS6 that the best value of the IN index is provided by CRCA. But it is clear from the table that the proposed algorithm produces good index values for all the remaining internal indices for all the datasets.
The average values of the internal cluster validation indices of the various approaches, computed over all six datasets, are provided in Table 3 to show the overall performance. The table shows that the proposed algorithm offers the best result for all indices except the SL and IN indices, for which the CRCA algorithm does so. Figure 5 shows a graphical representation of the average performance for easier visualisation. As higher index values of SL, DN, CH, and IN and lower index values of XB and DB indicate a better outcome, the figure demonstrates that, with the exception of SL and IN, all internal indices receive better values from our proposed algorithm.

4.2.2. Comparison Using Overlapping Cluster Indices

The overlapping algorithms that have been chosen for comparison are (i) Overlapping Community detection by Label Propagation (OCLP) [38], (ii) Seed Expansion based Overlapping Community identification (SEOC) [39], (iii) Fuzzy Clustering by Multiobjective Optimization (FCMO) [23], (iv) Gini Index-based Community Detection Algorithm (GICDA) [35], and (v) Crime Report Clustering Algorithm (CRCA) [36], and the overlapping cluster validation metrics that have been measured for comparing the performance of our proposed algorithm with the mentioned overlapping clustering algorithms are (i) Partition Coefficient (PC) [40], (ii) Partition Entropy (PE) [40], (iii) Dave index (DI) [41], (iv) Graded Distance index (GD) [41], and (v) Kwon index (KI) [41]. In Table 4, these indices have been listed, and the best values are highlighted in boldface.
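For reference, the first two of these metrics can be computed directly from a fuzzy membership matrix; the sketch below uses their standard definitions and is an illustration, not the exact implementation used to produce Table 4.

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/n) * sum of squared memberships; U is an n x k membership matrix."""
    U = np.asarray(U, dtype=float)
    return float(np.sum(U ** 2) / U.shape[0])

def partition_entropy(U, eps=1e-12):
    """PE = -(1/n) * sum of u * ln(u); eps guards against log(0)."""
    U = np.asarray(U, dtype=float)
    return float(-np.sum(U * np.log(U + eps)) / U.shape[0])
```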
From Table 4, it can be seen that the proposed algorithm provides the best values for all indices except the DI and KI values for dataset DS2, the PC value for dataset DS4, the KI value for dataset DS5, and the DI value for dataset DS6. Hence, by analysing the index values of Table 2 and Table 4, it can be said that the proposed algorithm is able to provide better clusters than the other clustering algorithms, which shows the efficiency of the proposed method.
In Table 5, the average overlapping cluster validation indices of the various approaches are shown over all the datasets. Figure 6 provides a graphical depiction of this table. Since higher index values of PC, DI, and GD and lower index values of PE and KI indicate better clustering results, it shows that, with the exception of the KI index, the proposed algorithm produces better overlapping index values for all indices.

5. Conclusions

The proposed work makes clusters of crime reports according to their context. When the reports are grouped together and the groups are labelled properly, it becomes easier for the police or other law enforcement organisations to evaluate them and recognise the various sorts of offences. This makes it easier to put the required preventive measures in place to stop illegal activity. To locate the overlapping clusters of crime reports, a novel graph-based clustering algorithm with a fuzzy technique has been developed, and it also provides a degree of membership to the objects inside the clusters. Other types of datasets have also been employed with the suggested strategy, and it is clear from the experimental results that the method works just as well for other applications. As the proposed algorithm produces overlapping clusters, it is advantageous for applications where objects may belong to multiple classes.
An innovative cluster labelling technique is proposed to understand the nature of the clusters, where each cluster is labelled according to its category, which is a beneficial step for unlabelled datasets. However, the suggested work has two drawbacks: one is that the proposed clustering algorithm cannot identify the most suitable clusters for a report when the report resides in multiple clusters, and the other is that it cannot identify large outliers. In our future work, we will try to address these problems.

Author Contributions

The contribution of A.P. is designing the model, Programming, and Writing—Original draft preparation. A.K.D. has contributed to Data curation, Conceptualization, Supervision and Validation. The Investigation, Validation and Paper Editing tasks have been carried out by D.P. J.N. has contributed to Visualization, Validation and Reviewing process. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

  1. Saeed, M.Y.; Awais, M.; Talib, R.; Younas, M. Unstructured Text Documents Summarization With Multi-Stage Clustering. IEEE Access 2020, 8, 212838–212854.
  2. Li, L.; Yang, B.; Zhang, F. Clustering for Complex Structured Data Based on Higher-Order Logic. In Proceedings of the 2008 International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; Volume 4, pp. 390–393.
  3. Misra, R. News category dataset. ResearchGate 2018, 3, 11429.
  4. Das, P.; Das, A.K. Graph-based clustering of extracted paraphrases for labelling crime reports. Knowl.-Based Syst. 2019, 179, 55–76.
  5. Khyani, D.; B S, S. An Interpretation of Lemmatization and Stemming in Natural Language Processing. Shanghai Ligong Daxue Xuebao/J. Univ. Shanghai Sci. Technol. 2021, 22, 350–357.
  6. Cer, D.; Yang, Y.; Kong, S.y.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal sentence encoder. arXiv 2018, arXiv:1803.11175.
  7. Girvan, M.; Newman, M.E. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826.
  8. Baadel, S.; Thabtah, F.; Lu, J. Overlapping clustering: A review. In Proceedings of the 2016 SAI Computing Conference (SAI), London, UK, 13–15 July 2016; pp. 233–237.
  9. Hauff, B.M.; Deogun, J.S. Parameter tuning for disjoint clusters based on concept lattices with application to location learning. In Proceedings of the International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing; Springer: New York, NY, USA, 2007; pp. 232–239.
  10. Yang, L.; Cao, X.; Jin, D.; Wang, X.; Meng, D. A unified semi-supervised community detection framework using latent space graph regularization. IEEE Trans. Cybern. 2014, 45, 2585–2598.
  11. Bianchi, F.M.; Grattarola, D.; Alippi, C. Spectral clustering with graph neural networks for graph pooling. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 874–883.
  12. Taha, K. Disjoint community detection in networks based on the relative association of members. IEEE Trans. Comput. Soc. Syst. 2018, 5, 493–507.
  13. Ghoshal, A.K.; Das, N.; Das, S. Disjoint and overlapping community detection in small-world networks leveraging mean path length. IEEE Trans. Comput. Soc. Syst. 2021, 9, 406–418.
  14. Li, M.; Lu, S.; Zhang, L.; Zhang, Y.; Zhang, B. A community detection method for social network based on community embedding. IEEE Trans. Comput. Soc. Syst. 2021, 8, 308–318.
  15. Whang, J.J.; Gleich, D.F.; Dhillon, I.S. Overlapping community detection using neighborhood-inflated seed expansion. IEEE Trans. Knowl. Data Eng. 2016, 28, 1272–1284.
  16. Lu, M.; Zhang, Z.; Qu, Z.; Kang, Y. LPANNI: Overlapping community detection using label propagation in large-scale complex networks. IEEE Trans. Knowl. Data Eng. 2018, 31, 1736–1749.
  17. Rezvani, M.; Liang, W.; Liu, C.; Yu, J.X. Efficient detection of overlapping communities using asymmetric triangle cuts. IEEE Trans. Knowl. Data Eng. 2018, 30, 2093–2105.
  18. Chakraborty, T.; Kumar, S.; Ganguly, N.; Mukherjee, A.; Bhowmick, S. GenPerm: A unified method for detecting non-overlapping and overlapping communities. IEEE Trans. Knowl. Data Eng. 2016, 28, 2101–2114.
  19. Van Lierde, H.; Chow, T.W.; Chen, G. Scalable spectral clustering for overlapping community detection in large-scale networks. IEEE Trans. Knowl. Data Eng. 2019, 32, 754–767.
  20. Su, J.; Havens, T.C. Quadratic program-based modularity maximization for fuzzy community detection in social networks. IEEE Trans. Fuzzy Syst. 2014, 23, 1356–1371.
  21. Yazdanparast, S.; Havens, T.C.; Jamalabdollahi, M. Soft overlapping community detection in large-scale networks via fast fuzzy modularity maximization. IEEE Trans. Fuzzy Syst. 2020, 29, 1533–1543.
  22. Biswas, A.; Biswas, B. FuzAg: Fuzzy agglomerative community detection by exploring the notion of self-membership. IEEE Trans. Fuzzy Syst. 2018, 26, 2568–2577.
  23. Gupta, A.; Datta, S.; Das, S. Fuzzy clustering to identify clusters at different levels of fuzziness: An evolutionary multiobjective optimization approach. IEEE Trans. Cybern. 2019, 51, 2601–2611.
  24. Madhuri, V.; Bazighifan, O.; Ali, A.H.; El-Mesady, A. On Fuzzy-Simply Connected Spaces in Fuzzy-Homotopy. J. Funct. Spaces 2022, 2022, 9926963.
  25. PM, D.; PB, R.; Cletus, N.; Joy, P. Fuzzy Hypergraph Modeling, Analysis and Prediction of Crimes. Int. J. Comput. Digit. Syst. 2022, 11, 649–661.
  26. Lee, S.J.; Jiang, J.Y. Multilabel text categorization based on fuzzy relevance clustering. IEEE Trans. Fuzzy Syst. 2013, 22, 1457–1471.
  27. Meng, T.; Cai, L.; He, T.; Chen, L.; Deng, Z. Local higher-order community detection based on fuzzy membership functions. IEEE Access 2019, 7, 128510–128525.
  28. Liu, Z.; Barahona, M. Graph-based data clustering via multiscale community detection. Appl. Netw. Sci. 2020, 5, 3.
  29. Loper, E.; Bird, S. Nltk: The natural language toolkit. arXiv 2002, arXiv:cs/0205028.
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
  31. Das, A.K.; Das, P. Graph based ensemble classification for crime report prediction. Appl. Soft Comput. 2022, 125, 109215.
  32. Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008.
  33. Xing, Y.; Meng, F.; Zhou, Y.; Zhu, M.; Shi, M.; Sun, G. A node influence based label propagation algorithm for community detection in networks. Sci. World J. 2014, 2014, 627581.
  34. Waltman, L.; Van Eck, N.J. A smart local moving algorithm for large-scale modularity-based community detection. Eur. Phys. J. B 2013, 86, 471.
  35. Goswami, S.; Murthy, C.; Das, A.K. Sparsity measure of a network graph: Gini index. Inf. Sci. 2018, 462, 16–39.
  36. Das, A.; Nayak, J.; Naik, B.; Ghosh, U. Generation of overlapping clusters constructing suitable graph for crime report analysis. Future Gener. Comput. Syst. 2021, 118, 339–357.
  37. Liu, Y.; Li, Z.; Xiong, H.; Gao, X.; Wu, J. Understanding of internal clustering validation measures. In Proceedings of the 2010 IEEE International Conference on Data Mining, IEEE, Sydney, NSW, Australia, 13–17 December 2010; pp. 911–916.
  38. Dong, S. Improved label propagation algorithm for overlapping community detection. Computing 2020, 102, 2185–2198.
  39. McDaid, A.; Hurley, N. Detecting highly overlapping communities with model-based overlapping seed expansion. In Proceedings of the 2010 International Conference on Advances in Social Networks Analysis and Mining, IEEE, Odense, Denmark, 9–11 August 2010; pp. 112–119.
  40. Dave, R.N. Validating fuzzy partitions obtained through c-shells clustering. Pattern Recognit. Lett. 1996, 17, 613–623.
  41. Joopudi, S.; Rathi, S.S.; Narasimhan, S.; Rengaswamy, R. A new cluster validity index for fuzzy clustering. IFAC Proc. Vol. 2013, 46, 325–330.
Figure 1. Workflow diagram of the proposed methodology.
Figure 2. Bunching of noun phrases.
Figure 3. Splitting Operation.
Figure 4. Merging Operation.
Figure 5. Average Comparison of Internal indices.
Figure 6. Average Comparison of overlapping indices.
Table 1. Description of Datasets and Clustering of Reports.
Dataset Name | Number of Reports | Number of Clusters | (Cluster Number, No. of Reports)
DS1 | 3405 | 24 | (C1,389), (C2,178), (C3,190),
(C4,214), (C5,50), (C6,49),
(C7,81), (C8,230), (C9,85),
(C10,76), (C11,54), (C12,171),
(C13,146), (C14,439), (C15,64),
(C16,79), (C17,529), (C18,42),
(C19,50), (C20,290), (C21,188),
(C22,48), (C23,80), (C24,68)
DS2 | 3490 | 26 | (C1,392), (C2,196), (C3,68),
(C4,59), (C5,143), (C6,214),
(C7,77), (C8,138), (C9,158),
(C10,168), (C11,111), (C12,97),
(C13,263), (C14,78),(C15,121),
(C16,204), (C17,96), (C18,170),
(C19,95), (C20,145), (C21,212),
(C22,144), (C23,297), (C24,146),
(C25,110), (C26,163)
DS3 | 6226 | 32 | (C1,442), (C2,158), (C3,269),
(C4,249), (C5,543), (C6,234),
(C7,377), (C8,638), (C9,245),
(C10,185), (C11,371), (C12,503),
(C13,63), (C14,358), (C15,110),
(C16,170), (C17,240), (C18,350),
(C19,295), (C20,145), (C21,232),
(C22,344), (C23,297), (C24,146),
(C25,118), (C26,87), (C27,206 ),
(C28, 49), (C29, 126), (C30,74)
(C31, 78), (C32,88)
DS4 | 1323 | 16 | (C1,124), (C2,96), (C3,65),
(C4,54), (C5,113), (C6,96),
(C7,77), (C8,138), (C9,95),
(C10,276), (C11,49), (C12,103),
(C13,53), (C14,78), (C15,110),
(C16,98)
DS5 | 1144 | 15 | (C1,194), (C2,226), (C3,60),
(C4,42), (C5,58), (C6,76),
(C7,168), (C8,71), (C9,45),
(C10,89), (C11,58), (C12,178),
(C13,68), (C14,72), (C15,83)
DS6 | 31,515 | 33 | (C1,516), (C2,396), (C3,1612),
(C4,871), (C5,768), (C6,1482),
(C7,416), (C8,3480), (C9,2945),
(C10,1752), (C11,2551), (C12,790),
(C13,3379), (C14,2591), (C15,3374),
(C16,2861), (C17,2897), (C18,390),
(C19,1682), (C20,1889), (C21,2975),
(C22,814), (C23,2552), (C24,3701),
(C25,4021), (C26,2896), (C27,3215),
(C28,4498), (C29,3002), (C30,3169),
(C31,4296), (C32,1289), (C33,4158)
Table 2. Comparison of several clustering techniques using internal indices.
Dataset | Algorithm | SL | DN | DB | XB | CH | IN
DS1 | MOCD | 0.72 | 1.20 | 0.51 | 0.42 | 419 | 528
DS1 | LPNI | 0.75 | 1.37 | 0.49 | 0.69 | 406 | 523
DS1 | CSLMA | 0.76 | 1.54 | 0.52 | 0.64 | 474 | 584
DS1 | GICDA | 0.70 | 1.01 | 0.53 | 0.48 | 458 | 590
DS1 | CRCA | 0.80 | 0.98 | 0.51 | 0.39 | 466 | 540
DS1 | Proposed | 0.81 | 1.94 | 0.42 | 0.34 | 474 | 591
DS2 | MOCD | 0.73 | 0.92 | 0.52 | 0.59 | 402 | 410
DS2 | LPNI | 0.69 | 1.17 | 0.50 | 0.54 | 407 | 397
DS2 | CSLMA | 0.68 | 1.06 | 0.49 | 0.56 | 399 | 389
DS2 | GICDA | 0.63 | 0.98 | 0.58 | 0.52 | 372 | 377
DS2 | CRCA | 0.76 | 0.93 | 0.56 | 0.33 | 396 | 467
DS2 | Proposed | 0.77 | 1.98 | 0.44 | 0.31 | 409 | 473
DS3 | MOCD | 0.71 | 0.97 | 0.49 | 0.45 | 411 | 496
DS3 | LPNI | 0.68 | 0.92 | 0.51 | 0.47 | 407 | 368
DS3 | CSLMA | 0.69 | 0.88 | 0.48 | 0.41 | 398 | 407
DS3 | GICDA | 0.68 | 0.81 | 0.63 | 0.48 | 396 | 412
DS3 | CRCA | 0.72 | 0.98 | 0.70 | 0.37 | 436 | 491
DS3 | Proposed | 0.72 | 1.16 | 0.42 | 0.36 | 443 | 507
DS4 | MOCD | 0.69 | 0.91 | 0.59 | 0.41 | 392 | 589
DS4 | LPNI | 0.66 | 0.84 | 0.60 | 0.42 | 387 | 596
DS4 | CSLMA | 0.68 | 0.78 | 0.58 | 0.39 | 396 | 593
DS4 | GICDA | 0.64 | 0.76 | 0.62 | 0.41 | 404 | 508
DS4 | CRCA | 0.72 | 1.07 | 0.65 | 0.33 | 431 | 579
DS4 | Proposed | 0.74 | 1.12 | 0.50 | 0.31 | 457 | 612
DS5 | MOCD | 0.61 | 0.94 | 0.71 | 0.49 | 205 | 310
DS5 | LPNI | 0.64 | 0.91 | 0.68 | 0.41 | 192 | 302
DS5 | CSLMA | 0.62 | 0.92 | 0.71 | 0.48 | 184 | 279
DS5 | GICDA | 0.55 | 0.82 | 0.65 | 0.54 | 146 | 304
DS5 | CRCA | 0.68 | 1.10 | 0.70 | 0.37 | 263 | 593
DS5 | Proposed | 0.69 | 1.06 | 0.63 | 0.38 | 315 | 586
DS6 | MOCD | 0.77 | 1.13 | 0.49 | 0.45 | 372 | 553
DS6 | LPNI | 0.72 | 0.97 | 0.50 | 0.47 | 363 | 594
DS6 | CSLMA | 0.78 | 0.95 | 0.48 | 0.41 | 405 | 579
DS6 | GICDA | 0.71 | 0.83 | 0.57 | 0.53 | 368 | 571
DS6 | CRCA | 0.81 | 1.19 | 0.40 | 0.37 | 436 | 687
DS6 | Proposed | 0.81 | 2.91 | 0.40 | 0.33 | 441 | 589
Table 3. Average Internal indices of different algorithms.
Methods | SL | DN | DB | XB | CH | IN
MOCD | 0.70 | 1.01 | 0.55 | 0.46 | 3.66 | 4.81
LPNI | 0.69 | 1.03 | 0.54 | 0.50 | 3.6 | 4.63
CSLMA | 0.70 | 1.02 | 0.54 | 0.48 | 3.76 | 5.21
GICDA | 0.65 | 0.86 | 0.59 | 0.49 | 3.57 | 4.61
CRCA | 0.74 | 1.04 | 0.58 | 0.36 | 4.03 | 5.59
Proposed | 0.61 | 1.69 | 0.47 | 0.36 | 4.05 | 5.03
Table 4. Comparison of several clustering techniques using overlapping indices.
Dataset | Algorithm | PC | PE | DI | GD | KI
DS1 | OCLP | 0.73 | 0.31 | 0.71 | 0.52 | 8.98
DS1 | SEOC | 0.70 | 0.28 | 0.73 | 0.50 | 8.86
DS1 | FCMO | 0.71 | 0.32 | 0.70 | 0.48 | 8.49
DS1 | GICDA | 0.63 | 0.33 | 0.52 | 0.45 | 9.14
DS1 | CRCA | 0.79 | 0.29 | 0.78 | 0.51 | 8.94
DS1 | Proposed | 0.79 | 0.25 | 0.79 | 0.56 | 8.31
DS2 | OCLP | 0.80 | 0.35 | 0.73 | 0.68 | 9.38
DS2 | SEOC | 0.78 | 0.37 | 0.74 | 0.67 | 9.15
DS2 | FCMO | 0.76 | 0.37 | 0.78 | 0.69 | 10.08
DS2 | GICDA | 0.71 | 0.41 | 0.73 | 0.56 | 10.04
DS2 | CRCA | 0.81 | 0.33 | 0.80 | 0.62 | 8.71
DS2 | Proposed | 0.84 | 0.27 | 0.79 | 0.65 | 9.26
DS3 | OCLP | 0.77 | 0.34 | 0.71 | 0.55 | 9.14
DS3 | SEOC | 0.73 | 0.31 | 0.74 | 0.58 | 9.02
DS3 | FCMO | 0.77 | 0.35 | 0.72 | 0.58 | 9.10
DS3 | GICDA | 0.71 | 0.37 | 0.68 | 0.52 | 9.15
DS3 | CRCA | 0.80 | 0.28 | 0.81 | 0.57 | 8.72
DS3 | Proposed | 0.82 | 0.26 | 0.81 | 0.61 | 8.58
DS4 | OCLP | 0.74 | 0.23 | 0.51 | 0.60 | 9.12
DS4 | SEOC | 0.68 | 0.37 | 0.68 | 0.61 | 9.16
DS4 | FCMO | 0.54 | 0.42 | 0.67 | 0.58 | 9.38
DS4 | GICDA | 0.80 | 0.26 | 0.51 | 0.50 | 9.44
DS4 | CRCA | 0.82 | 0.26 | 0.80 | 0.61 | 8.41
DS4 | Proposed | 0.80 | 0.26 | 0.81 | 0.64 | 8.38
DS5 | OCLP | 0.76 | 0.25 | 0.81 | 0.68 | 8.25
DS5 | SEOC | 0.70 | 0.29 | 0.84 | 0.65 | 8.28
DS5 | FCMO | 0.71 | 0.31 | 0.73 | 0.65 | 8.21
DS5 | GICDA | 0.73 | 0.38 | 0.78 | 0.69 | 9.52
DS5 | CRCA | 0.81 | 0.17 | 0.86 | 0.70 | 7.41
DS5 | Proposed | 0.82 | 0.22 | 0.88 | 0.72 | 8.46
DS6 | OCLP | 0.81 | 0.25 | 0.78 | 0.79 | 7.78
DS6 | SEOC | 0.81 | 0.28 | 0.75 | 0.83 | 8.04
DS6 | FCMO | 0.83 | 0.34 | 0.79 | 0.77 | 8.14
DS6 | GICDA | 0.79 | 0.36 | 0.83 | 0.62 | 8.39
DS6 | CRCA | 0.85 | 0.13 | 0.91 | 0.84 | 7.24
DS6 | Proposed | 0.86 | 0.13 | 0.85 | 0.89 | 7.19
Table 5. Average overlapping indices of different algorithms.
Methods | PC | PE | DI | GD | KI
OCLP | 0.76 | 0.28 | 0.70 | 0.63 | 8.77
SEOC | 0.73 | 0.31 | 0.74 | 0.64 | 8.75
FCMO | 0.73 | 0.35 | 0.73 | 0.62 | 8.91
GICDA | 0.72 | 0.35 | 0.70 | 0.55 | 9.28
CRCA | 0.81 | 0.24 | 0.82 | 0.64 | 8.23
Proposed | 0.82 | 0.23 | 0.82 | 0.67 | 8.36