Article

Graph-Based Hotspot Detection of Socio-Economic Data Using Rough-Set

1
School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi 110067, India
2
Department of CSE, Koneru Lakshmaiah Education Foundation, Vaddeswaram 522502, India
3
Department of Computer Science, College of Engineering and Computer Science, Jazan University, Jazan 45142, Saudi Arabia
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(13), 2031; https://doi.org/10.3390/math12132031
Submission received: 5 June 2024 / Revised: 20 June 2024 / Accepted: 24 June 2024 / Published: 29 June 2024

Abstract
The term hotspot refers to a location or an area where the occurrence of a particular phenomenon, event, or activity is significantly higher than in the surrounding areas. Existing statistical methods do not work well on discrete data and can also identify false hotspots. This paper proposes a novel graph-based hotspot detection algorithm using a rough set (GBHSDRS). The algorithm works well with discrete spatial vector data and removes false hotspots by testing the statistical significance of the identified hotspots. Rough set theory is applied to the graph of the spatial polygon data, and the nodes are divided into lower, boundary, and negative regions. The candidate hotspots belong to the lower region of the set, and the boundary value analysis ensures that a hotspot is identified if one is present in the dataset. The p-value is used to find the statistical significance of the hotspots. The algorithm is tested on the socio-economic data of Uttar Pradesh (UP) from 1991 on medical facilities. The average gain in density and Hotspot Prediction Accuracy Index (HPAI) of the detected hotspots is 26.54% and 23.41%, respectively. An average reduction in runtime of 27.73% is achieved compared to all other methods on the socio-economic data.
MSC:
91B72; 60L90; 05C90; 68R10

1. Introduction

A hotspot is a spatial cluster pattern. A hotspot represents a region or geographical area where the occurrence of the phenomenon is prevalent [1,2]. Due to the greater spatial autocorrelation among events within a hotspot, it is possible to differentiate between the hotspot and a cluster. However, compared to a hotspot, a cluster has less spatial autocorrelation between events, or its objects may be independently and identically distributed (I.I.D.) [1]. Hotspot detection involves finding the location or region with the most active events. Hotspot detection has several application domains, including (1) crime [3,4]: analysing crime data to identify places with high levels of criminal activity. This would facilitate the efficient allocation of resources and enhance focused crime prevention efforts. (2) Health [5]: identifying the geographical regions with a high prevalence of illnesses such as COVID-19 [5], flu, and malaria [6,7] would facilitate the distribution of resources and enhance preventive measures. (3) Risk assessment: identifying the regions susceptible to natural calamities, such as seismic activities and floods [8,9], and fire [10,11], would aid in readiness and reaction efforts. (4) Traffic analysis [12] involves the identification of congestion hotspots in order to optimise traffic flow, strategise infrastructure projects, and enhance public transit systems, among other objectives.
The significance of geospatial data analysis has grown with the growing use of location-aware services [13]. It is used in a variety of fields, such as climate change analysis [14,15], weather forecasting [16,17], and global warming [18,19]. Geographic classification, clustering, outlier identification, co-location patterns, regression, and spatial hotspot detection are essential tasks in spatial data mining [20,21,22,23]. Spatial data analysis is more complex than the analysis of non-spatial data due to spatial autocorrelation [1,23,24].
The literature section describes the methods used for the detection of hotspots for point data and polygon data. The methods discussed for the point data do not have significance tests, so they can generate a false hotspot. While several strategies can produce significant hotspots, all of these approaches are computationally challenging because they all need the whole dataset to compute the hotspots. The methods discussed for the hotspot detection of the polygon data assume that the study region is stationary and works well with the continuous data. Some graph-based methods are also discussed, but those methods also use some clustering algorithms like DBSCAN for grouping similar objects. Again, setting parameters for the clustering algorithm is challenging, and the results depend on the parameter setting. Therefore, this paper aims to develop an algorithm that does not depend on the distribution of the underlying dataset, is cost-effective, and can find the hotspot where it is present (removes false positives) in the dataset.
The proposed GBHSDRS algorithm will determine the hotspot and eliminate false hotspots by using significance tests. Like other machine learning techniques, rough set-based methods need assessment criteria to measure the effectiveness of the proposed method. The proposed method is categorised as an unsupervised method as the patterns are identified without predetermined labels. The study outcomes benefit policymakers, healthcare administrators, and stakeholders in promoting fair and equal availability of healthcare services at the village level. This study can potentially optimise resource distribution by explicitly focusing on areas with high demand, minimising inequalities in healthcare, and enhancing health results among various population groups.
The GBHSDRS algorithm prunes the dataset by removing the polygons belonging to the negative region, thereby reducing the running time of the algorithm. Significant contributions of the research work are as follows:
  • This paper contributes by incorporating graph theory and rough sets for hotspot detection. This novel technique improves hotspot identification accuracy and efficiency by combining the best features of both approaches.
  • This paper strategically applies the BFS algorithm in the wrapper model to process the data from the boundary region, which improves the efficiency of the proposed algorithm.
  • This method can identify the hotspot if the hotspot exists and eliminates the detection of a false hotspot.
  • By developing scalable hotspot identification techniques, this paper makes it possible to apply the method effectively to polygon datasets for real-world applications where data volumes are significantly high.
The remainder of this paper is organised as follows. A literature survey related to the proposed methodology for hotspot detection is presented in Section 2. Preliminaries related to the algorithm are presented in Section 3. Graph-based hotspot detection using the rough set algorithm is discussed in detail in Section 4. The experimental evaluation, covering the dataset used, the evaluation metrics, and the results, is presented in Section 5. A discussion of the results is given in Section 6. Finally, Section 7 gives the conclusion and future research directions.

2. Literature Survey

Spatial vector data can be represented as points, lines, and polygons. Depending on the spatial vector data representation, the method for detecting the hotspot can vary. The literature survey related to the proposed study is given in Table 1 and Table 2. There are various methods for the detection of the hotspot. Clustering techniques like DBSCAN [25] and CLIQUE [26] are able to find the hotspot of the point data. They are inexpensive but will generate false hotspots because they lack a significance test. A rough set-based hotspot detection method [27] can be used to find the hotspot of the point data. This method can also generate a false hotspot as there is no significance test. The spatial scan statistics [28] algorithm can find the hotspot of the point data, but it is computationally very expensive and unable to find the hotspot in the presence of obstacles in the study region. Spatial scan statistics are used by the software SaTScan v10.2 [29], which is a well-known tool for the detection of hotspots. This software takes the point data and applies the spatial scan statistics method to identify the hotspots. It finds only circular hotspots and cannot find a hotspot if there is an obstacle in the study region. Runadi, T. (2017) [30] uses the ZDD-based enumeration method to find the hotspot of the sudden infant death syndrome in North Carolina. In this method, it is impossible to find the p-value for the significance test. The Cubic Grid Circle (CGC) method [31] can find the hotspot of the point data. The identified hotspot is circular in shape. This method cannot find hotspots of other shapes, nor can it be applied to a polygon dataset.
Tabarej, M. S. (2022) [5] finds the hotspots of confirmed, deceased and recovered cases of COVID-19 in an Indian district. The method used for the detection of the hotspot is Moran’s I. This method is also dependent on a spatial weight matrix, which assumes that the space is stationary. Kumar, S. (2021) [32] finds the hotspot of hydroponic farming using Getis Ord Gi* statistics. A high-resolution satellite is used to find hotspots. This method is also sensitive to outliers. Chaikaew, N. (2009) [33] finds the hotspot of diarrhoea instances in Chiang Mai, Thailand. The methods used for the analysis of the hotspot are Moran’s I and Geary’s C Index. This approach assumes a linear relationship among the variables and does not work well with non-linear relations among the variables. Rahman, M. T. (2020) [34] finds the hotspot of traffic crashes and land use in Dammam, Saudi Arabia, to improve crime prediction. The method used for the analysis of the hotspot is geographically weighted regression. This method is also dependent on the choice of the spatial neighbour.
A graph-based approach to detecting tourist movements [35] creates the tourist destination as the vertex, and movement from one location to the next is taken as the edge. Different tweets in the same locality are categorised as a vertex by the DBSCAN clustering algorithm. The major issue is in choosing two parameters, Eps (search radius) and MinPts (minimum number of points in the search radius). A more advanced method is needed to find the vertex, and this method is also sensitive to parameter settings. A clustering algorithm using the spatial information of the neighbourhood is defined by ref. [37]. The algorithm employs adjacent pixel values to form the cluster. This algorithm can find all the clusters in the dataset. This algorithm should be tested on feature-based similarity. The algorithm for hotspot detection of the mapper graph is given in ref. [36]. This algorithm builds the graph of the multidimensional data to see the loops, flares, and clusters. This algorithm has difficulty in selecting the clustering algorithm. The choice of parameters, such as the number of samples, noise level, clustering parameters, and Mapper settings, can significantly impact the results. A graph-based clustering algorithm for the social community transmission of COVID-19 is given in ref. [38]. This method constructs a graph by considering the vertex as the location of an activity, and transmission between the two locations serves as the edge between the two vertices. This algorithm divides the nodes into three colours based on their severity. The selection of the clustering algorithm and the threshold cause significant limitations.

3. Preliminaries

Definition 1. 
A geographical point is a pair $p = (x, y)$, consisting of latitude $x$ and longitude $y$. A geographical point data $p_\omega$ is a pair $p_\omega = (p, \omega)$, consisting of the point $p$ and the magnitude $\omega$ of occurrence of the event at the point $p$.
Example: an earthquake of magnitude 5.3 on the Richter scale is observed in Yonakuni, Japan. The geographical coordinates of the location are $(25.248, 123.555)$. Therefore, $p = (25.248, 123.555)$, and the point data $p_\omega = (25.248, 123.555, 5.3)$.
Definition 2. 
Let a line segment (edge) $e$ be defined by a pair of points and represented as $e = (p_i, p_j)$. Then a line $l$ is a finite sequence of connected line segments, represented as $l = (e_1, e_2, \ldots, e_k)$, i.e., $l = ((p_0, p_1), (p_1, p_2), \ldots, (p_{k-1}, p_k))$. Therefore, alternately, $l = (p_0, p_1, p_2, \ldots, p_{k-1}, p_k)$.
Definition 3. 
A polygon surrounds a geographical region or area. A polygon $A$ is defined by a sequence of edges $A = (e_1, e_2, \ldots, e_k)$. Alternatively, a polygon is represented by a sequence of points $A = (p_1, p_2, \ldots, p_k)$ where $p_1 = p_k$.
A polygon data $P$ is a polygon containing one or more point data $p_\omega$, represented by a pair $P = (A, \omega_P)$ where $\omega_P$ is the weight of the polygon. A random point taken from a polygon is considered the vertex $c$.
As shown in Figure 1, a polygon $A$ is composed of edges $(e_1, e_2, e_3, e_4, e_5, e_6)$ or of points $(p_0, p_1, p_2, p_3, p_4, p_5)$. A polygon with some data points is called polygon data. As shown in Figure 1, a polygon contains point data $(p_{\omega_0}, p_{\omega_1}, p_{\omega_2}, p_{\omega_3}, p_{\omega_4}, p_{\omega_5}, p_{\omega_6}, p_{\omega_7}, p_{\omega_8})$. Each point data $p_\omega$ can have several attributes like crime, disease cases, the occurrence of fire, etc. These attributes are denoted as $a_1, a_2, \ldots, a_n$. Values of these attributes are denoted as $u_{ij}$, where $i$ corresponds to the $i$th point and $j$ corresponds to the $j$th attribute, as shown in Table 3. The table contains the columns $P$ and $a_1, a_2, \ldots, a_n$. Column $P$ lists the data points in the polygon $P$, and the columns $a_1, a_2, \ldots, a_n$ contain the values of the attributes. The column sum gives the sum of attribute $a_j$ for a polygon. The sum of all the attributes gives the weight of the polygon $\omega_P$ as given in Equation (1).
$$\omega_P = \sum_{i=1}^{m} \sum_{j=1}^{n} u_{i,j} \tag{1}$$
Definition 4. 
A polygon datapoint is said to be a hotspot [27,39] $H_a$ if the weight $\omega_P$ of the polygon is greater than a pre-decided threshold $\delta$. Then, a polygon hotspot is defined as:
$$H_a = \{ P \mid \omega_P > \delta \}$$
where $\delta$ is a user-defined threshold.
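The weight in Equation (1) and the threshold test of Definition 4 can be sketched in a few lines. This is only an illustration: the function names, the toy attribute rows, and the threshold value are assumptions, not data from the study.

```python
def polygon_weight(attribute_rows):
    """Sum every attribute value u[i][j] over all points i and attributes j."""
    return sum(sum(row) for row in attribute_rows)


def polygon_hotspots(polygons, delta):
    """Return the names of polygons whose weight is strictly greater than delta."""
    return [name for name, rows in polygons.items()
            if polygon_weight(rows) > delta]


# Hypothetical polygons: each row is one point, each column one attribute.
polygons = {
    "P0": [[1, 0], [2, 1]],   # weight 4
    "P1": [[3, 4], [2, 2]],   # weight 11
    "P2": [[0, 1]],           # weight 1
}
hotspots = polygon_hotspots(polygons, delta=5)
```

With these toy values only P1 exceeds the threshold.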
Definition 5. 
Spatial neighbourhood: Let two polygons be represented as $A_i$ and $A_j$, and let $N$ be the spatial neighbourhood relation defined over the polygons. Then, for any two polygons $A_i$, $A_j$, the neighbourhood relation is defined as:
$$N(A_i, A_j) \ \text{if} \ (A_i \cap A_j) \neq \phi \tag{2}$$
Equation (2) states that if the intersection of two polygons is non-empty, then the polygons are said to be spatial neighbours.
The geographical neighbourhood uses a reflexive, symmetric, and non-transitive relation to formulate a spatial relationship in geographic data science and spatial statistics. Two polygons $A_i$ and $A_j$ are considered spatial neighbours if they share a common point or a common edge. The spatial neighbour of $A_i$ is the list of all polygons $A_j$ that are spatial neighbours of $A_i$.
For example, a polygon dataset consisting of the point is shown in Figure 2. The spatial neighbour of the polygons is shown in Table 4. The table contains three columns: polygon, spatial neighbour, and occurrences. The variable polygon includes the name of the polygon for which the spatial neighbour is being calculated. The variable spatial neighbour contains all the spatial neighbours of the polygon given in the column polygon. The column occurrence includes the weight of the polygon.
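A minimal sketch of the neighbourhood relation follows. It assumes a simplification that is not part of the definition: polygons are given as vertex lists from a planar partition, so two polygons that touch always share at least one vertex. A general intersection test would need a geometry library. The function name and the toy squares are hypothetical.

```python
def spatial_neighbours(polygons):
    """Map each polygon to the sorted list of polygons sharing a vertex with it."""
    vertex_sets = {name: set(points) for name, points in polygons.items()}
    return {
        a: sorted(b for b in vertex_sets
                  if b != a and vertex_sets[a] & vertex_sets[b])
        for a in vertex_sets
    }


# Hypothetical unit squares, each given by its corner points.
squares = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],  # shares an edge with A
    "C": [(1, 1), (2, 1), (2, 2), (1, 2)],  # shares a corner with A, an edge with B
    "D": [(3, 3), (4, 3), (4, 4), (3, 4)],  # touches nothing
}
neighbour_list = spatial_neighbours(squares)
```

The relation is reflexive by definition, but the list leaves out the polygon itself, as an adjacency list normally does.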
Definition 6. 
A neighbourhood graph is an ordered pair $G_N = (V, E)$, where $V$ is the set of vertices $V = \{c_1, c_2, \ldots, c_n\}$ and $E$ is the set of edges $E = \{e_1, e_2, \ldots, e_m\}$, where the edge $e_i = (c_i, c_j) \mid c_i \in A_i \ \text{and} \ c_j \in A_j$, $A_i$ is the polygon consisting of points $A_i = (p_1, p_2, \ldots, p_k)$ and $A_j$ is a polygon composed of points $A_j = (p_{k+1}, p_{k+2}, \ldots, p_{k+s})$.

3.1. Local Moran’s I

Local spatial autocorrelation (L.S.A.) identifies the association within the spatially distributed data, i.e., the degree to which the spatial data are related to their neighbours. Local spatial autocorrelation finds clusters and outliers within spatial data. A hotspot is a special cluster where the density of occurrence is high. The L.S.A. method uses a spatial autocorrelation index for hotspot identification. Various index methods have been proposed. One such method is Moran’s I [24], which is defined as:
$$I_i = \frac{n}{(n-1)\,\varsigma^2} \left( \omega_{P_i} - \overline{\omega_P} \right) \sum_{j=1}^{n} \eta_{ij} \left( \omega_{P_j} - \overline{\omega_P} \right)$$
where $\omega_{P_i}$ represents the weight of the polygon $P_i$, $\overline{\omega_P}$ represents the average weight over the study region $S$, $\eta_{ij}$ is the reciprocal value of the spatial weight between the polygon $i$ and the neighbouring polygon $j$, $n$ represents the total number of polygons in the study region, and $\varsigma^2$ represents the variance of the observed cases.
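The local Moran's I statistic can be sketched as follows. Two assumptions are made for illustration: the spatial weights are binary (1 for a listed neighbour, 0 otherwise), and the variance is the population variance. The weights and the chain-shaped adjacency are toy values.

```python
def local_morans_i(weights, neighbours):
    """Local Moran's I per polygon, with binary spatial weights (assumption)."""
    n = len(weights)
    mean = sum(weights) / n
    deviation = [w - mean for w in weights]
    variance = sum(d * d for d in deviation) / n  # population variance (assumption)
    scale = n / ((n - 1) * variance)
    return [
        scale * deviation[i] * sum(deviation[j] for j in neighbours[i])
        for i in range(n)
    ]


# Hypothetical data: four polygons in a chain 0 - 1 - 2 - 3.
weights = [10, 9, 1, 2]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
moran = local_morans_i(weights, neighbours)
```

Polygon 0 has a high value next to another high value, so its index is positive. Polygon 1 sits between a high and a low neighbour whose deviations cancel, so its index is zero.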

3.2. Getis-Ord $G_i$

Let the study region $S$ be divided into sub-regions $P_1, P_2, \ldots, P_n$ and each sub-region $P_i$ be called a polygon. The $G_i$ statistic, in its simplest form, measures the degree of association between a given polygon and the neighbouring polygons [40,41] and is defined as:
$$G_i = \frac{\sum_{j=1}^{n} w_{ij}\, \omega_{P_j}}{\sum_{j=1}^{n} \omega_{P_j}}, \quad j \neq i$$
where $w_{ij}$ is a spatial weight whose value depends on whether polygon $j$ is a spatial neighbour of polygon $i$:
$$w_{ij} = \begin{cases} 1, & \text{for all polygons that are nearest neighbours} \\ 0, & \text{outside of the nearest neighbour} \end{cases}$$
$\omega_{P_j}$ is the value at the polygon $P_j$, and $n$ is the total number of spatial units or polygons.
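As a small illustration, the ratio can be computed with binary weights. The function name and the data are hypothetical; the sums skip the polygon itself, following the condition j ≠ i.

```python
def getis_ord_g(weights, neighbours, i):
    """G_i: neighbour weight sum over the weight sum of all other polygons."""
    others = [j for j in range(len(weights)) if j != i]
    numerator = sum(weights[j] for j in others if j in neighbours[i])
    denominator = sum(weights[j] for j in others)
    return numerator / denominator


# Hypothetical data: four polygons in a chain 0 - 1 - 2 - 3.
weights = [10, 9, 1, 2]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
g0 = getis_ord_g(weights, neighbours, 0)   # 9 / (9 + 1 + 2)
g2 = getis_ord_g(weights, neighbours, 2)   # (9 + 2) / (10 + 9 + 2)
```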

3.3. Getis-Ord $G_i^*$

Let the study region $S$ be divided into $P_1, P_2, \ldots, P_n$ where each sub-region $P_i$ represents a polygon. Then, the $G_i^*$ statistic [40,42] is defined as:
$$G_i^* = \frac{\sum_{j=1}^{n} w_{ij}\, \omega_{P_j} - \overline{\omega_P} \sum_{j=1}^{n} w_{ij}}{\varsigma \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left( \sum_{j=1}^{n} w_{ij} \right)^2}{n-1}}}$$
where $\omega_{P_j}$ is the value at polygon $P_j$,
$$w_{ij} = \begin{cases} 1, & \text{for all polygons that are nearest neighbours} \\ 0, & \text{outside of the nearest neighbour} \end{cases}$$
$n$ is the total number of polygons in the study area, $\overline{\omega_P} = \frac{\sum_{j=1}^{n} \omega_{P_j}}{n}$ is the mean, and $\varsigma = \sqrt{\frac{\sum_{j=1}^{n} \omega_{P_j}^2}{n} - \overline{\omega_P}^{\,2}}$ is the standard deviation. The $G_i^*$ statistic is a z-score, so no further calculation is needed for the significance test. For positive values, the larger the z-score, the denser the clustering of high values, i.e., the hotspot.
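The z-score form of the statistic can be sketched as follows. The sketch assumes binary weights and counts the polygon itself in its own neighbourhood, which is the usual convention for the starred statistic and is an assumption here. The data are toy values.

```python
import math


def getis_ord_g_star(weights, neighbours, i):
    """G_i* z-score with binary weights; polygon i counts as its own neighbour."""
    n = len(weights)
    mean = sum(weights) / n
    spread = math.sqrt(sum(w * w for w in weights) / n - mean ** 2)
    window = set(neighbours[i]) | {i}
    w_sum = len(window)  # with binary weights, sum(w) equals sum(w**2)
    numerator = sum(weights[j] for j in window) - mean * w_sum
    denominator = spread * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Hypothetical data: four polygons in a chain 0 - 1 - 2 - 3.
weights = [10, 9, 1, 2]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
z_scores = [getis_ord_g_star(weights, neighbours, i) for i in range(4)]
```

The high-valued end of the chain gets a positive z-score and the low-valued end a negative one of the same size.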

3.4. DBSCAN on the Graph of the Polygon Vector Data

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm commonly used to find spatially dense clusters. DBSCAN works on a point dataset. To apply the DBSCAN to the polygon vector data, a graph is constructed on the polygon data by taking a random point from each polygon. The DBSCAN method finds the dense cluster by using two parameters: the epsilon ( ϵ ) neighbourhood radius and the minimum number of points required to form a dense region (minPts). The cluster identified is called the hotspot. All the noise points are considered non-hotspots.

3.5. Rough-Set Concepts

The rough set theory [43] is a mathematical model that deals with uncertainty, vagueness, and imprecision. Mostly, data presented in the real world are imperfect. The main aim of rough set theory is to understand, manipulate, and analyse imperfect data.
In rough set theory, data are represented by the information system $I = (U, Q, V, f)$ where $U$ is a finite set of objects, $Q$ is a set of attributes, $V$ is a set of domains corresponding to each attribute $a \in Q$, and $f$ is a function mapping a data point and an attribute to an attribute value. Let $U = \{P_1, P_2, \ldots, P_n\}$, and let $\{v_1, v_2, \ldots, v_t\}$ be the domain of attribute $a_t$. Then, $\forall P_i \in U$ and $a_j \in Q$, if $v_{ij}$ is the value of attribute $a_j$ for data point $P_i$, then
$$f(P_i, a_j) = v_{ij}$$
The mapping of the function $f$ is shown in Table 5. The table contains a column named $f$, which lists the polygons $P_1, P_2, \ldots, P_n$. These polygons are mapped to the attributes of the polygon $a_1, a_2, \ldots, a_n$. The value of the polygon $P_i$ for the attribute $a_j$ is given as $v_{ij}$.
Let $B \subseteq Q$; then, the indiscernibility relation $IND_B$ is defined as:
$$IND_B = \{ (P_i, P_j) \in U^2 \mid \forall a \in B,\ a(P_i) = a(P_j) \}$$
The equivalence class of $x$ under $IND_B$ is denoted as $[x]_B$. Now let $X \subseteq U$, where $X$ is defined as $X = \{ P_i \mid \forall i,\ \omega_{P_i} \geq \delta \}$, where $\delta$ is a pre-decided threshold. Then, the lower and upper approximations are defined as:
$$\underline{B}X = \{ x : [x]_B \subseteq X \}$$
$$\overline{B}X = \{ x : [x]_B \cap X \neq \phi \}$$
The boundary region of $X$ concerning attribute $B$ is defined as the set difference between the upper approximation and the lower approximation.
$$BN_B X = \overline{B}X - \underline{B}X$$
The negative region is defined as $U - \overline{B}X$, which contains the set of objects that can be definitely ruled out as members of the target set $X$.
For example, consider Table 4. The value of $\lambda$ is defined as:
$$\lambda = \begin{cases} 0, & \text{if } \omega_P \geq \delta_1, \\ 1, & \text{if } \delta_2 \leq \omega_P < \delta_1, \\ 2, & \text{otherwise} \end{cases} \tag{10}$$
where the thresholds $\delta_1$ and $\delta_2$ are user-defined and $\delta_2 < \delta_1$. The value of $\lambda$ is given in Table 6. This table contains three columns: polygon, weight ($\omega_P$), and $\lambda$. The column $P$ contains the name of the polygon, the column weight ($\omega_P$) contains the weight of the polygon, and the column $\lambda$ contains the value of $\lambda$, as described in Equation (10). The information system is defined as $I = (U, Q, V, f)$. The indiscernibility relation when $B = \lambda$ is given as $IND(B) = \{ \{P_3, P_6, P_7\}, \{P_1, P_2\}, \{P_0, P_4, P_5, P_8\} \}$. The two thresholds $\delta_1$ and $\delta_2$ are defined as the third and second quartile, respectively, and their values are taken as 7.5 and 5. Now if $X$ is defined as $X = \{ P_i \mid \forall i,\ \omega_{P_i} \geq \delta \}$, where $\delta = 0.625$, then $X = \{P_1, P_3, P_6, P_7\}$. Then, the lower approximation is $\underline{B}X = \{P_3, P_6, P_7\}$ and the upper approximation is $\overline{B}X = \{P_3, P_6, P_7\} \cup \{P_1, P_2\} = \{P_1, P_2, P_3, P_6, P_7\}$. Now the boundary region is the set difference between the upper and the lower approximation: $BN_B X = \{P_1, P_2, P_3, P_6, P_7\} - \{P_3, P_6, P_7\} = \{P_1, P_2\}$. The negative region is given as: $\{P_0, P_1, P_2, P_3, P_4, P_5, P_6, P_7, P_8\} - \{P_1, P_2, P_3, P_6, P_7\} = \{P_0, P_4, P_5, P_8\}$.
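The worked example can be checked with a short sketch that takes the equivalence classes and the target set X exactly as stated in the text and derives the four regions. The function name is hypothetical.

```python
def rough_regions(classes, target):
    """Return (lower, upper, boundary, negative) regions of target."""
    universe = set().union(*classes)
    # Lower approximation: classes lying entirely inside the target set.
    lower = set().union(*(c for c in classes if c <= target))
    # Upper approximation: classes that overlap the target set at all.
    upper = set().union(*(c for c in classes if c & target))
    return lower, upper, upper - lower, universe - upper


# Equivalence classes of IND(B) and target set X from the example.
classes = [
    {"P3", "P6", "P7"},
    {"P1", "P2"},
    {"P0", "P4", "P5", "P8"},
]
X = {"P1", "P3", "P6", "P7"}
lower, upper, boundary, negative = rough_regions(classes, X)
```

The class {P1, P2} overlaps X without being contained in it, which is why it falls in the boundary region.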

3.6. Neighbour Graph Construction

To construct a neighbour graph from the spatial polygon data, a random point $c_i$ from each polygon $A_i$ is taken and considered as the vertex $c$ of the graph. Let the function $g$ find the spatial neighbours; then $N_i = g(A_i)$, where $N_i$ is the set of spatial neighbours of the polygon $A_i$. There is an edge between two polygons if they are spatial neighbours. Let $h$ be a function that maps each vertex to the lower, boundary, or negative region; $h$ is defined as $h : U \to label$, where the domain of $h$ is the set of all polygons, and the range is the set of labels $label = \{\text{lower region}, \text{boundary region}, \text{negative region}\}$.
For example, consider Figure 2: the figure has a set of polygons $P_0, P_1, P_2, \ldots, P_8$ and the random points chosen from the polygons are $c_0, c_1, c_2, \ldots, c_8$, respectively. The polygon $P_0$ shares a boundary with $P_1$ and $P_3$ but only a point with $P_2$. Therefore, $P_1, P_2, P_3$ are the spatial neighbours of $P_0$, and there is an edge from $c_0$ to each of $c_1, c_2, c_3$. Similarly, the neighbours of each polygon are calculated. The neighbours of each polygon are shown in Table 4. This table shows the neighbours as an adjacency list along with the occurrence of an event in each polygon. The neighbour graph of the spatial polygon data is shown in Figure 3. This figure also shows the label of each node as the lower, boundary, or negative region, and these are shown in different colours.

3.7. Graph Traversal

Following the construction of the graph, as shown in Figure 3, the traversal of the graph is needed for the spatial analysis. The nodes are traversed in the breadth-first search manner. The children of each node belonging to the boundary region are chosen for the traversal. All the spatial neighbours of the node in the boundary region are calculated as described in Definition 5 and traversed.
For example, the nodes belonging to the boundary region are $c_1$ and $c_2$; the children of the nodes $c_1$ and $c_2$ are $c_0, c_2, c_3, c_4$, and $c_0, c_3, c_4, c_5, c_7$, respectively. A tree representing the spatial neighbours is shown in Figure 4.
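The graph construction and the breadth-first traversal can be sketched together. The region labels follow the rough-set example. The edge list is a hypothetical stand-in for Figure 3: it contains the adjacencies stated in the text plus assumed ones. Skipping negative-region nodes during traversal is one reading of the pruning step, not a confirmed detail.

```python
from collections import deque

# Hypothetical undirected edges between vertices c0..c8.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4),
         (2, 5), (2, 7), (3, 6), (4, 5), (5, 8), (6, 7), (7, 8)]
LABEL = {3: "lower", 6: "lower", 7: "lower", 1: "boundary", 2: "boundary",
         0: "negative", 4: "negative", 5: "negative", 8: "negative"}


def build_graph(edges):
    """Build a symmetric adjacency map from an edge list."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_from_boundary(graph, label):
    """Breadth-first visit order starting at boundary nodes, pruning negatives."""
    starts = sorted(node for node in graph if label[node] == "boundary")
    seen = set(starts)
    queue = deque(starts)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in sorted(graph[node]):
            if neighbour not in seen and label[neighbour] != "negative":
                seen.add(neighbour)
                queue.append(neighbour)
    return order


graph = build_graph(EDGES)
visit_order = bfs_from_boundary(graph, LABEL)
```

Building the adjacency map from an edge list keeps the relation symmetric by construction.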

4. Methodology

Hotspot identification has emerged as a critically important research area in recent years because of the COVID-19 outbreak. Hotspot detection refers to identifying the region or place where a particular phenomenon is most often and intensively observed. Depending on the data format, a hotspot may be identified for point, line, or polygon vector data. This research specifically examines the identification of hotspots in polygonal data. The focal areas of the point and line data are outside of the scope of this research article. This study uses a rough set as the preferred method for hotspot identification. This theory categorizes the dataset into three distinct regions: the lower, boundary, and negative regions. This process is sometimes referred to as three-way decision-making. The members in the lower region of the dataset surely belong to the target set, whereas the members in the boundary region may or may not belong to the target set and need more analysis. The members of the negative area are excluded from the target set as they surely do not belong to the target set and are not included in the analysis further. By using the rough set, the dataset is pruned and enhances the effectiveness of the algorithms. The primary rationale for using rough set theory for hotspot identification is its effectiveness. The dataset used in this article consists of polygon vector data that have been granulated at the village level. This dataset includes the locations where healthcare facilities are concentrated at the village level.

Graph-Based Hotspot Detection Using Rough Set Algorithm

This paper proposes a novel algorithm called the Graph-Based Hotspot Detection algorithm using a Rough Set (GBHSDRS). This method applies a rough set on the graph of the spatial polygon data to find the hotspot. First, it considers spatial polygon data composed of P = ( A , ω P ) . Thus, each polygon data point is represented by a polygon A and the number of events in the polygon ω P . Next, the spatial neighbour of each polygon datapoint is calculated for spatial processing, as described in Definition 5. Now λ is defined based on the value of the ω P , as described in Section 3.5. The values of ω P and λ are given in Table 6. After finding these values, a rough set is applied to the polygon data, as described in Section 3.5. By applying a rough set on the spatial polygon data, the study region is divided into the lower approximation and upper approximation, and consequently, the polygons are divided into the lower region, boundary region, and negative region. Following the application of a rough set, a neighbour graph is constructed. A random point from each polygon A is chosen as the vertex of the neighbour graph. An edge E exists between the two nodes if they are spatial neighbours. Table 4 gives a complete list of each spatial neighbour. In this way, a graph is constructed for the spatial polygon data. The nodes are labelled as the lower region, boundary region, or negative region based on the outcome of the rough set.
The nodes labelled as the lower approximation are considered the candidate hotspots and are represented by $CH_1$. Following this, each node belonging to the boundary region is selected and traversed, as described in Section 3.7. Suppose the combined frequency of occurrence of events in a node belonging to the boundary region and one of its neighbours is greater than a pre-decided threshold, $\delta_1$. Then, the vertex is moved to the lower approximation and is labelled as lower. These newly approximated nodes are considered candidate hotspots $CH_2$. After detecting the candidate hotspots $CH_1$ and $CH_2$, these two sets are merged to form the final set of candidate hotspots $CH$. Now, two hypotheses are defined: (1) the null hypothesis ($H_0$), which states that there is no significant hotspot in the data, and all the hotspots are generated due to random chance; (2) the alternate hypothesis ($H_1$), which states that there is a significant hotspot in the data. Then, the p-value of the test statistic is calculated, representing the probability of observing the test statistic under the null hypothesis. Now, a significance threshold $\alpha$ is chosen. Hotspots with a p-value less than $\alpha$ are considered to be statistically significant. The most common significance levels are 0.05, 0.01, and 0.001, but the choice depends on the context of the study and the consequences of making errors. For example, a significance level of $\alpha = 0.05$ is commonly used with a moderate risk, $\alpha = 0.01$ is used with low risk, and $\alpha = 0.001$ is very stringent and used in high-stakes research. Therefore, we have chosen the significance level $\alpha = 0.05$ so that it appropriately balances the risks based on the above considerations. The flow chart of the algorithm is shown in Figure 5 and the pseudo code for the proposed algorithm is given in Appendix A.
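The candidate-selection and filtering steps described above can be sketched as follows. This is not the pseudo code of Appendix A. The function names, weights, adjacency, and p-values are hypothetical, and the p-values are supplied as inputs because the text does not specify how the test statistic is computed.

```python
def candidate_hotspots(weights, neighbours, label, delta1):
    """Merge CH1 (lower region) with CH2 (promoted boundary nodes)."""
    ch1 = {p for p, region in label.items() if region == "lower"}
    # A boundary node is promoted when its weight combined with the weight
    # of at least one neighbour exceeds delta1.
    ch2 = {
        p for p, region in label.items()
        if region == "boundary"
        and any(weights[p] + weights[q] > delta1 for q in neighbours[p])
    }
    return ch1 | ch2


def significant_hotspots(candidates, p_values, alpha=0.05):
    """Keep only candidates whose p-value is below the significance level."""
    return {p for p in candidates if p_values[p] < alpha}


# Hypothetical inputs for illustration only.
weights = {"P0": 1, "P1": 6, "P2": 5, "P3": 8, "P4": 2}
neighbours = {"P1": ["P0", "P4"], "P2": ["P0"], "P3": ["P1"]}
label = {"P0": "negative", "P1": "boundary", "P2": "boundary",
         "P3": "lower", "P4": "negative"}
candidates = candidate_hotspots(weights, neighbours, label, delta1=7.5)
p_values = {"P1": 0.20, "P3": 0.01}   # placeholder p-values
final = significant_hotspots(candidates, p_values, alpha=0.05)
```

Here P1 is promoted because 6 + 2 exceeds 7.5, P2 is not, and only P3 survives the significance filter.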

5. Experimental Evaluation

5.1. Study Area and Socio-Economic Data

NASA’s socio-economics data and application center (SEDAC) publishes geospatial socio-economic data for India at the village level [44]. This dataset was created by digitising the survey of India’s village-level physical map from 1991. This dataset also utilizes the tabular data from 1991. The dataset for each state is provided as a shapefile. In this study, village-level data of Uttar Pradesh are used. Figure 6 shows the study area. The geospatial socio-economic dataset contains over 200 variables for over 600,000 towns and villages. When village and town statistics are not accessible, sub-district taluk data are used. This dataset is the first data available publicly at the village level.
Out of all the variables, we have selected the attribute related to the type of medical facility available at the village level. The overall health facility at the village level is calculated by adding the frequency of various health facilities and termed the “Medical Facility”. The identification of hotspots is based on this aggregated value. The type and the abbreviations for the health facilities are given in Table 7. The table contains three columns, namely attributes, abbreviations, and number. The attributes contain the type of health facility being considered, abbreviations contain the abbreviations used for the particular attributes, and the number includes the occurrences of the attribute for that specific village.
Figure 7 shows the distribution of the polygons (villages) in the study region S. From the figure, we can see the clustering of the phenomenon at some locations. This is due to the presence of spatial autocorrelation.
The density plot of the number of the medical facility is given in Figure 8. From the density plot of the dataset, we see that the maximum density lies around 5. The range of the medical facility is 0 to 165. Due to the variation in the density value, there will be a chance of the presence of the hotspot.

5.2. Evaluation Metrics

Various performance metrics are available in the literature to assess the performance of the hotspot detection method. J-Y Wuu, et al. [46] define prediction accuracy and memorising accuracy as performance metrics. M. B. Ulak et al. [47] define the crash prediction accuracy index as the performance metrics for evaluating the hotspot detection methods. In this paper, socio-economic data granulated at the village level are used to measure the performance of the algorithm. Three evaluation metrics are used, i.e., the density of the hotspot, the hotspot prediction accuracy index, and the running time of the algorithm.

5.2.1. Density of the Hotspot

The density of the hotspot is defined as the average event density of all the hotspots. Let the dataset D contain n polygons (villages) $P_1, P_2, \ldots, P_n$, let the villages contain $\omega_{P_1}, \omega_{P_2}, \ldots, \omega_{P_n}$ events, and let their areas be $\alpha_1, \alpha_2, \ldots, \alpha_n$. If k of the n villages are identified as hotspots, then
$$\text{Density of the hotspot} = \frac{\sum_{i=1}^{k} \frac{\omega_{P_i}}{\alpha_i}}{k}$$
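As an illustration, the density metric can be sketched in Python. This is a minimal sketch that reads the density of a village as its event count divided by its area, which is an interpretation of the definition above; the function name `hotspot_density` and the sample values are hypothetical and not part of the original study.

```python
from typing import Sequence


def hotspot_density(events: Sequence[float], areas: Sequence[float]) -> float:
    """Mean of the per-village event densities (events / area) over k hotspot villages.

    Assumption: density is interpreted as events per unit area.
    """
    if len(events) != len(areas):
        raise ValueError("events and areas must have the same length")
    if not events:
        raise ValueError("at least one hotspot village is required")
    return sum(w / a for w, a in zip(events, areas)) / len(events)


# Hypothetical hotspot villages: event counts and areas in arbitrary units.
events = [12, 30, 18]
areas = [2.0, 5.0, 3.0]
density = hotspot_density(events, areas)
```

Each hypothetical village here has six events per unit area, so the average density equals that common value.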

5.2.2. Hotspot Prediction Accuracy Index

The hotspot prediction accuracy index (HPAI) was developed by Chainey, Tompson, and Uhlig to evaluate crime hotspots [48]. It is calculated by dividing the hit rate (the share of all crimes that fall within the hotspot) by the area percentage (the share of the study region that the hotspot covers). The index is essential for assessing the viability of any hotspot prediction method, since it measures how effectively the method predicts hotspots. The related crash prediction accuracy index was originally intended to assess the prediction of road crashes. Here, the index is modified so that it can be applied to the hotspots of a polygon dataset. It is given as follows:
$$HPAI = \frac{\text{hit rate}}{\text{area percentage}} = \frac{\frac{n}{N} \times 100}{\frac{a}{A} \times 100}$$
where n is the number of events in the hotspot, N is the total number of events, a is the area of the hotspot, and A is the total area of the study region.
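The ratio can be computed directly from the four quantities defined above. The following Python sketch is illustrative only; the function name `hpai` and the sample numbers are hypothetical.

```python
def hpai(n_hot: float, n_total: float, a_hot: float, a_total: float) -> float:
    """Hotspot prediction accuracy index: hit rate divided by area percentage."""
    if n_total <= 0 or a_total <= 0 or a_hot <= 0:
        raise ValueError("totals and hotspot area must be positive")
    hit_rate = n_hot / n_total * 100          # percentage of events inside the hotspot
    area_percentage = a_hot / a_total * 100   # percentage of the study area covered
    return hit_rate / area_percentage


# Hypothetical case: 30% of the events fall in 10% of the study area.
example = hpai(n_hot=300, n_total=1000, a_hot=50.0, a_total=500.0)
```

A value above 1 means that the hotspot captures a larger share of the events than its share of the area; in the hypothetical case the index is about 3.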

5.2.3. Run Time

The running time of the GBHSDRS algorithm is defined as the time taken by the algorithm to complete the execution of the program. It depends on many factors, like the size of the data, the number of lines of code, the configuration of the hardware, etc. It is calculated as:
$$RT = CT - ST$$
where RT is the run time of the algorithm, ST is the start time, and CT is the completion time of the algorithm.
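A run time of this kind can be measured as sketched below. This is only a sketch: the helper names `timed` and `mean_runtime` are hypothetical, and the use of `time.perf_counter` is an assumption, since the timing mechanism used in the study is not stated.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, RT), with RT = CT - ST in seconds."""
    start_time = time.perf_counter()        # ST
    result = func(*args, **kwargs)
    completion_time = time.perf_counter()   # CT
    return result, completion_time - start_time


def mean_runtime(func, *args, runs=10, **kwargs):
    """Average RT over several runs to smooth out run-to-run variation."""
    times = [timed(func, *args, **kwargs)[1] for _ in range(runs)]
    return sum(times) / runs


# Hypothetical workload standing in for a hotspot detection routine.
result, rt = timed(sum, range(1000))
average_rt = mean_runtime(sum, range(1000), runs=10)
```

Averaging over repeated runs reduces the influence of transient system load on the reported figure.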

5.3. Results

This section presents the outcome of the GBHSDRS algorithm on the socio-economic data described above. First, the hotspots from all the methods are presented for a pictorial depiction, and then the evaluation parameters are presented and compared with the methods found in the literature.

5.3.1. Result of GBHSDRS

The dataset described in Section 5.1 contains 123,471 villages, i.e., polygons. The variable of the dataset containing the healthcare facility is used for hotspot detection. In step 1 of the algorithm, the spatial neighbours of each polygon are found and represented by $N_i$. Next, the value of $\lambda$ is calculated; an example is given in Table 4. Then, the rough set is applied to find the rough approximation of the polygons, and each polygon is labelled as belonging to the lower region, the boundary region, or the negative region. A graph is then constructed from the dataset by choosing random points from each subregion based on the spatial neighbours. Because a graph is a non-linear data structure, a breadth-first search traversal can be used to search the spatial neighbours, which reduces the algorithm’s running time compared to a linear data structure. The nodes of the graph are labelled based on the outcome of the rough set, after which the boundary value analysis is performed. Finally, statistical significance is assessed for each candidate hotspot. Only candidate hotspots with a p-value below the predefined threshold of 0.05 are considered hotspots.
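The breadth-first traversal of the neighbour graph mentioned above can be sketched as follows. This is a generic breadth-first search over an adjacency list, not the authors' implementation; the function name `bfs_order` and the five-polygon graph are hypothetical.

```python
from collections import deque


def bfs_order(adjacency, start):
    """Return the polygons reachable from start in breadth-first order."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in adjacency.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical neighbour graph: P5 shares no border with the other polygons.
adjacency = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1"],
    "P4": ["P2"],
    "P5": [],
}
order = bfs_order(adjacency, "P1")
```

Each polygon and each adjacency is visited at most once, so the traversal cost grows linearly with the number of nodes and edges in the graph.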
The hotspots of the medical facilities at the village level are shown in Figure 9. Each red spot on the map is a hotspot village represented by a polygon. Out of 123,471 villages (polygons), only 2335 are found to be significant hotspots. The hotspot villages are the villages that have better medical facilities. The hotspots are clustered in some regions because of the spatial autocorrelation in the dataset; these clusters are circled in Figure 9. The hotspot villages are distributed over the entire study region: they lie mostly in west, east, and south UP, with some smaller clusters in the north of UP.
The hotspots generated by the four methods from the literature are shown in Figure 10, Figure 11, Figure 12 and Figure 13. Figure 10 shows the hotspots generated by the Moran’s I method. All the identified hotspots are shown in red on the map. In Figure 10, the hotspots are scattered over the study region and clustered in some areas, such as west UP, east UP, and the south region, while the central region has few hotspot villages. Compared to the proposed algorithm, there are very few hotspots in the central region of the study area. The other areas show a clustering of hotspots similar to that of GBHSDRS.
The hotspot detected by the Getis Ord Gi algorithm is given in Figure 11. From this figure, it can be seen that the most detected hotspot is scattered over the study region. Most hotspots are clustered in west UP and east UP, with some clustering in the north and south regions. Few hotspots are seen in the middle region of the study area.
The hotspots detected by the Getis Ord $G_i^*$ algorithm are given in Figure 12. From this figure, it can be seen that the detected hotspots are scattered over the entire study region. Most hotspots are clustered in west UP and east UP, with some clustering in the north and south regions. Several hotspots are seen in the middle region of the study area.
The hotspots detected by DBSCAN on the graph of the spatial data are shown in Figure 13. From this figure, it can be seen that the detected hotspots are scattered over the entire study region. The detected hotspots are clustered into areas, but a clear separation is not seen. All the regions of the study area have some hotspots. After applying the DBSCAN to the graph of the spatial polygon data, we obtained different dense clusters. We have plotted all the clusters in red.
The density estimation plot of the hotspots identified by the algorithm is given in Figure 14. The density plot shows that the peak value lies around 15, and the maximum value of the medical facility is around 165. Compared with the full dataset, whose density peaks around 5 (Figure 8), the peak shifts markedly upward after hotspot detection.

5.3.2. Evaluation Metrics

The GBHSDRS algorithm finds the hotspots of the given socio-economic dataset. The performance of the proposed algorithm is compared with four methods from the literature: Local Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, and DBSCAN applied to the graph of the polygon vector data. The algorithms are compared on the three metrics described in Section 5.2: the density of the hotspot, the hotspot prediction accuracy index, and the run time. All the experiments are run ten times, and the average value of the results is presented. The results of all the methods are given in Table 8, and the density of the hotspots for each algorithm is shown in Figure 15. The density of the hotspots generated by DBSCAN is lower than that of all other methods. The densities of the hotspots generated by Getis Ord $G_i$ and Getis Ord $G_i^*$ are nearly the same and higher than that of Moran’s I, but lower than that of the GBHSDRS algorithm, which is the highest among all the algorithms. The analysis of the density values is given in Table 9. The percentage increases in the density of the hotspots generated by the GBHSDRS algorithm with respect to Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, and DBSCAN are 36.14%, 9.94%, 10.42%, and 49.67%, respectively.
The results for the HPAI metric are shown in Figure 16. HPAI is lowest for Moran’s I. Getis Ord $G_i$ and Getis Ord $G_i^*$ have nearly the same HPAI values, while the HPAI value for DBSCAN lies between those of Moran’s I and Getis Ord $G_i^*$. The HPAI value is the highest for the proposed GBHSDRS algorithm. The analysis of the HPAI values is given in Table 10. The HPAI of the GBHSDRS algorithm is higher than that of Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, and DBSCAN by 40.39%, 9.15%, 14.60%, and 29.50%, respectively.
The running times of the algorithms are shown in Figure 17. The running times of Getis Ord $G_i^*$ and DBSCAN are nearly the same. Moran’s I has the longest running time of all the methods, and Getis Ord $G_i$ takes longer than Getis Ord $G_i^*$. The running time of the GBHSDRS algorithm is the shortest among all the algorithms in the figure. The analysis of the running times is given in Table 11. The percentage decreases in the running time of the GBHSDRS algorithm compared to Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, and DBSCAN are 63.78%, 37.85%, 2.46%, and 6.83%, respectively.
In the proposed technique, only spatial adjacency as specified by Equation (2) is used. The effectiveness of the method with other definitions of spatial neighbourhood, such as distance-based neighbours, needs to be checked. The approach is specifically applicable to polygon data; its effectiveness on other vector data formats, such as point and line data, remains to be verified.

6. Discussion of the Results

The proposed GBHSDRS method for hotspot detection is tested on the socioeconomic data of Uttar Pradesh. The results of the algorithm are compared with four state-of-the-art methods, namely Moran’s I [5], Getis Ord $G_i$ [40], Getis Ord $G_i^*$ [32], and DBSCAN [35], applied to the graph of the polygon vector data. The output of all the methods is plotted as maps with the hotspots in red. Three evaluation parameters are used to measure the performance of the algorithm, namely the density of the hotspot, HPAI, and the running time. The density of the proposed algorithm is high because we chose high thresholds $\delta_1, \delta_2$ for the target set X. If the values of $\delta_1, \delta_2$ are low, the density will be low and, consequently, more hotspots will be generated.
The running time of the proposed algorithm is less than that of the other methods because the proposed algorithm uses a rough set that divides the space into three regions: the lower region, the boundary region, and the negative region. The lower region contains the polygons that surely belong to the target set, i.e., the hotspots. The boundary region contains the polygons that may or may not belong to the target set. The negative region contains the polygons that surely do not belong to the target set. Therefore, the polygons belonging to the negative region are removed from further processing, and the data are pruned before hotspot detection. As a result of this pruning, the algorithm’s running time is reduced. The HPAI of the proposed algorithm is also high because the HPAI value is the ratio of the hit rate to the area percentage. When the density of the detected hotspots is high, the number of events in the hotspots is high relative to their area. The algorithm detects high-density hotspots because of the high thresholds $\delta_1, \delta_2$. Therefore, the HPAI value is high.
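The pruning step described above can be sketched in Python. This is a minimal sketch under the assumption that each node already carries a rough-set label; the function name `prune_negative`, the label strings, and the four-polygon graph are hypothetical.

```python
def prune_negative(adjacency, labels):
    """Drop negative-region nodes, and the edges that lead to them, from the graph."""
    keep = {node for node, label in labels.items() if label != "negative"}
    return {
        node: [nb for nb in neighbours if nb in keep]
        for node, neighbours in adjacency.items()
        if node in keep
    }


# Hypothetical labelled neighbour graph.
adjacency = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1"],
    "P4": ["P2"],
}
labels = {"P1": "lower", "P2": "boundary", "P3": "negative", "P4": "negative"}
pruned = prune_negative(adjacency, labels)
```

Only the lower and boundary nodes remain in the pruned graph, so the later steps operate on a smaller input.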
The proposed method is able to find the hotspot of the polygon data. Some of the merits of the above method are that this method does not require any underlying data distribution, and it can find the hotspot for the polygon data with significance, so this method removes the false hotspots. Some disadvantages of the proposed method are the graph construction and the implementation of the rough set. The boundary value analysis of the boundary region is a major concern.

7. Conclusion and Future Research Directions

The proposed GBHSDRS algorithm is developed to find hotspots in polygon vector data. The main contributions of the algorithm are a shorter running time, denser hotspots, and a higher HPAI. The algorithm is able to find a hotspot if one exists in the data and to eliminate false hotspots. A neighbour relation is defined for the geospatial polygon data and used to construct the neighbour graph. The neighbour graph is a non-linear data structure that allows easy access to a node for analysis. The rough set approach is applied to deal with uncertainties and to partition the data into lower, upper, boundary, and negative regions. Further processing is needed only for the boundary region nodes, and the traversal is a breadth-first search; therefore, the complexity is reduced. The wrapper model of clipping the negative region makes the algorithm efficient through data reduction. Finally, the significance of the candidate hotspots is tested, and only the significant hotspots are selected. The algorithm is tested on the socio-economic dataset of UP, India. The experiments performed on village-level data of medical facilities in Uttar Pradesh in 1991 showed that, of a total of 123,471 villages, only 2335 were identified as hotspots.
Three evaluation metrics, namely density, HPAI, and running time, are used to compare the algorithm with four state-of-the-art methods. The density values computed by all the methods, i.e., Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, DBSCAN, and GBHSDRS, are 43.37, 61.16, 60.48, 34.18, and 67.91, respectively, and the density gains of GBHSDRS with respect to the state-of-the-art methods are 36.14%, 9.94%, 10.42%, and 49.67%, respectively. The average density gain over all the state-of-the-art methods is 26.54%. The HPAI values of all the methods, i.e., Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, DBSCAN, and GBHSDRS, are 1.17, 2.65, 2.49, 2.06, and 2.92, respectively. The HPAI gains of GBHSDRS with respect to the state-of-the-art methods, i.e., Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, and DBSCAN, are 40.39%, 9.15%, 14.60%, and 29.50%, respectively. The average HPAI gain over all the methods is 23.41%. The running times of the methods, i.e., Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, DBSCAN, and GBHSDRS, are 91.04, 53.05, 33.80, 35.39, and 32.97, respectively. The percentage decreases in the run time of the GBHSDRS algorithm with respect to the state-of-the-art methods, i.e., Moran’s I, Getis Ord $G_i$, Getis Ord $G_i^*$, and DBSCAN, are 63.78%, 37.85%, 2.46%, and 6.83%, respectively. The average percentage decrease over all the methods is 27.73%. Overall, the proposed method performs well compared to the state-of-the-art methods.
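The run-time percentages can be recomputed from the running times quoted above. The sketch below assumes that each decrease is measured relative to the baseline method, i.e., (baseline − proposed) / baseline; this formula is inferred from the reported numbers and is not stated explicitly in the text. The function name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage decrease of the proposed value relative to the baseline."""
    return (baseline - proposed) / baseline * 100


# Running times quoted in the text above.
run_times = {
    "Moran's I": 91.04,
    "Getis Ord Gi": 53.05,
    "Getis Ord Gi*": 33.80,
    "DBSCAN": 35.39,
}
proposed_run_time = 32.97  # GBHSDRS

reductions = {
    name: relative_reduction(rt, proposed_run_time) for name, rt in run_times.items()
}
average_reduction = sum(reductions.values()) / len(reductions)

for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
print(f"average: {average_reduction:.2f}%")
```

Under this assumption, the recomputed values agree with the reported run-time percentages and their average to within rounding.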
This work can be utilised by policymakers, healthcare administrators, and stakeholders to facilitate equitable access to healthcare at the village level. The research supports the efficient allocation of resources by targeting hotspots, reducing health disparities, improving health outcomes across diverse populations, and implementing preventive measures and awareness campaigns.
Future work in spatial analytics could apply more concepts from rough set theory. The technique can also be evaluated in other types of spaces, such as Euclidean space, and on different datasets, such as fire and epidemiological data. Other notions of neighbourhood need to be defined and tested with the algorithm. Furthermore, as the algorithm has been tested only on polygon vector data, it needs to be tested on other spatial vector data, such as point and line data.

Author Contributions

Conceptualisation, M.S.T. and S.M.; methodology, M.S.T. and S.M.; software, M.S.T.; validation, M.S.T.; formal analysis, S.M.; investigation, S.M.; resources, F.J.; data curation, M.S.T.; writing—original draft preparation, M.S.T. and S.M.; writing—review and editing, A.A.S. and M.S.; visualisation, M.S.T.; supervision, S.A. and F.J.; project administration, M.S., F.J. and S.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors gratefully acknowledge the funding of the Deanship of Graduate Studies and Scientific Research, Jazan University, Saudi Arabia through Project Number: GSSRD-24.

Data Availability Statement

NASA’s Socioeconomic Data and Applications Center (SEDAC) publishes geospatial socio-economic data for India at the village level [44]. The dataset was created by digitising the Survey of India’s village-level physical maps from 1991. The dataset can be found at https://sedac.ciesin.columbia.edu/data/set/india-india-village-level-geospatial-socio-econ-1991-2001 (accessed on 20 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Graph-Based Hotspot Detection Using Rough Set

The pseudocode for the GBHSDRS algorithm is given in Algorithm A1. The algorithm is divided into steps for easy understanding. In Step 1, the spatial neighbour of each polygon is computed. A function $g(A)$ is defined to find the spatial neighbour of each polygon $A$ (lines 1–3). In Step 2, the value of $\lambda$ is calculated for applying a rough set on the polygon data (lines 4–12). In Step 3, a rough set is used on the polygon data. A function $RS(P_i, \lambda)$ is defined, which applies the rough set. The output of this function is the label for each polygon corresponding to the region (lines 13–15). In Step 4, a graph is constructed for the dataset, and a function is defined that creates the graph and labels each node according to the rough set. The nodes labelled as a lower region are termed candidate hotspots $CH_1$ (lines 16–19). In Step 5, the boundary value analysis is performed. In this phase, the combined frequency of the boundary node and the neighbouring node is analysed. A function is defined that approximates some nodes from the boundary region to the lower region. These newly defined lower region nodes are called candidate hotspots $CH_2$ (lines 20–23). The sum of $CH_1$ and $CH_2$ is termed the candidate hotspot $CH$ (line 24). After this, in Step 6, the statistical significance of the candidate hotspot is found. Only the candidate hotspot with a p-value less than 0.05 is considered the hotspot $H$ (lines 25–33).
Algorithm A1 Graph-based hotspot detection using rough set
  • Data: Dataset $D = \{P_0, P_1, \ldots, P_n\}$
  • Input: Two thresholds $\delta_1, \delta_2$, and $significance\_threshold = 0.05$
  • Result: Hotspot $H$
  • Step 1: Spatial neighbour identification
   1: for $i \leftarrow 1$ to $len(D)$ do
   2:     $N_i = f(A_i)$
   3: end for
  • Step 2: Lambda $\lambda$ calculation
   4: for $i \leftarrow 1$ to $len(D)$ do
   5:     if $\omega_{P_i} > \delta_1$ then
   6:         $\lambda[i] = 0$
   7:     else if $\omega_{P_i} > \delta_2$ and $\omega_{P_i} < \delta_1$ then
   8:         $\lambda[i] = 1$
   9:     else
  10:         $\lambda[i] = 2$
  11:     end if
  12: end for
  • Step 3: Application of rough set
  13: for $i \leftarrow 1$ to $len(D)$ do
  14:     $label[i] = RS(P_i, \lambda)$
  15: end for
  • Step 4: Graph construction
  16: for $i \leftarrow 1$ to $len(D)$ do
  17:     $graph = GC(P_i, label[i])$
  18: end for
  19: $CH_1 = graph[label = lower]$
  • Step 5: Boundary value analysis
  20: for $i \leftarrow 1$ to $len(graph[label == boundary])$ do
  21:     $label\_new = BV(c(i))$
  22: end for
  23: $CH_2 = graph[label\_new = lower]$
  24: $CH = CH_1 + CH_2$
  • Step 6: Significance test
  25: for $i \leftarrow 1$ to $len(CH)$ do
  26:     $Z_i = \frac{\eta_{CH_i} - mean(\eta_{CH_i})}{std(\eta_{CH_i})}$
  27:     $p\_value[i] = p\_value\_calculation(Z_i)$
  28: end for
  29: for $i \leftarrow 1$ to $len(CH)$ do
  30:     if $p\_value[i] < significance\_threshold$ then
  31:         $H = H \cup \{CH_i\}$
  32:     end if
  33: end for
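Steps 2 and 6 of Algorithm A1 can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the function names and sample counts are hypothetical, the mean and standard deviation are taken over a reference set of counts supplied by the caller, and the p-value is assumed to be the one-sided upper-tail probability of a standard normal distribution, which the pseudocode does not specify.

```python
from statistics import NormalDist, mean, pstdev


def lambda_values(counts, delta1, delta2):
    """Step 2: 0 above delta1, 1 strictly between delta2 and delta1, otherwise 2."""
    lam = []
    for w in counts:
        if w > delta1:
            lam.append(0)
        elif delta2 < w < delta1:
            lam.append(1)
        else:
            # As in the pseudocode, a count equal to delta1 also falls here.
            lam.append(2)
    return lam


def significant_hotspots(candidate_counts, reference_counts, alpha=0.05):
    """Step 6: indices of candidates whose z-score has p-value below alpha.

    Assumption: one-sided upper-tail p-value under a standard normal distribution.
    """
    mu = mean(reference_counts)
    sigma = pstdev(reference_counts)
    if sigma == 0:
        return []  # no variation, so no candidate can stand out
    standard_normal = NormalDist()
    hotspots = []
    for index, eta in enumerate(candidate_counts):
        z = (eta - mu) / sigma
        p_value = 1.0 - standard_normal.cdf(z)
        if p_value < alpha:
            hotspots.append(index)
    return hotspots


# Hypothetical event counts for eight villages.
counts = [0, 3, 8, 20, 1, 2, 4, 60]
lam = lambda_values(counts, delta1=10, delta2=3)
hotspots = significant_hotspots([20, 60], counts)
```

In this hypothetical example, the two counts above the upper threshold become candidates, and only the larger one passes the significance test, which illustrates how the test removes false hotspots.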

References

  1. Shekhar, S.; Evans, M.R.; Kang, J.M.; Mohan, P. Identifying patterns in spatial information: A survey of methods. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2011, 1, 193–214.
  2. Lessler, J.; Azman, A.S.; McKay, H.S.; Moore, S.M. What is a hotspot anyway? Am. J. Trop. Med. Hyg. 2017, 96, 1270.
  3. Eftelioglu, E.; Shekhar, S.; Tang, X. Crime hotspot detection: A computational perspective. In Improving the Safety and Efficiency of Emergency Services: Emerging Tools and Technologies for First Responders; IGI Global: Hershey, PA, USA, 2020; pp. 209–238.
  4. Aarthi, S.; Samyuktha, M.; Sahana, M. Crime hotspot detection with clustering algorithm using data mining. In Proceedings of the 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 23–25 April 2019; pp. 401–405.
  5. Tabarej, M.S.; Minz, S. Spatio-temporal changes pattern in the hotspot’s footprint: A case study of confirmed, recovered and deceased cases of COVID-19 in India. Spat. Inf. Res. 2022, 30, 527–538.
  6. Stresman, G.; Bousema, T.; Cook, J. Malaria hotspots: Is there epidemiological evidence for fine-scale spatial targeting of interventions? Trends Parasitol. 2019, 35, 822–834.
  7. Nandana, G.; Mala, S.; Rawat, A. Hotspot detection of dengue fever outbreaks using dbscan algorithm. In Proceedings of the 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 10–11 January 2019; pp. 158–161.
  8. Kansal, M.L.; Osheen; Tyagi, A. Hotspot identification for urban flooding in a satellite town of National Capital Region of India. In Proceedings of the World Environmental and Water Resources Congress 2019: Emerging and Innovative Technologies and International Perspectives, Pittsburgh, PA, USA, 19–23 May 2019; American Society of Civil Engineers: Reston, VA, USA, 2019; pp. 12–24.
  9. Iyer, R.; Sen, P.; Layek, A.K. FFHIApp: An Application for Flash Flood Hotspots Identification Using Real-Time Images. In Proceedings of the Applications of Artificial Intelligence and Machine Learning: Select Proceedings of ICAAAIML 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 385–398.
  10. Zahran, E.S.M.M.; Shams, S.; Said, S.; Zahran, E.S.M.M.; Gadong, B.; Brunei-Muara, B.D. Validation of forest fire hotspot analysis in GIS using forest fire contributory factors. Syst. Rev. Pharm. 2020, 11, 249–255.
  11. Vadrevu, K.P.; Csiszar, I.; Ellicott, E.; Giglio, L.; Badarinath, K.; Vermote, E.; Justice, C. Hotspot analysis of vegetation fires and intensity in the Indian region. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 6, 224–238.
  12. Bíl, M.; Andrášik, R.; Sedoník, J. A detailed spatiotemporal analysis of traffic crash hotspots. Appl. Geogr. 2019, 107, 82–90.
  13. Kaasinen, E. User needs for location-aware mobile services. Pers. Ubiquitous Comput. 2003, 7, 70.
  14. Hagenlocher, M.; Lang, S.; Hölbling, D.; Tiede, D.; Kienberger, S. Modeling hotspots of climate change in the Sahel using object-based regionalization of multidimensional gridded datasets. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 7, 229–234.
  15. Cheval, S.; Dumitrescu, A.; Adamescu, M.; Cazacu, C. Identifying climate change hotspots relevant for ecosystems in Romania. Clim. Res. 2020, 80, 165–173.
  16. Srinivasulu, S.; Sakthivel, P. Extracting spatial semantics in association rules for weather forecasting image. In Proceedings of the Trendz in Information Sciences & Computing (TISC2010), Chennai, India, 17–19 December 2010; pp. 54–57.
  17. Ferstl, F.; Kanzler, M.; Rautenhaus, M.; Westermann, R. Time-hierarchical clustering and visualization of weather forecast ensembles. IEEE Trans. Vis. Comput. Graph. 2016, 23, 831–840.
  18. Romeiko, X.X.; Guo, Z.; Pang, Y. Comparison of support vector machine and gradient boosting regression tree for predicting spatially explicit life cycle global warming and eutrophication impacts: A case study in corn production. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3277–3284.
  19. Saitoh, T.S.; Wakashima, S. An efficient time-space numerical solver for global warming. In Proceedings of the Collection of Technical Papers. 35th Intersociety Energy Conversion Engineering Conference and Exhibit (IECEC)(Cat. No. 00CH37022), Las Vegas, NV, USA, 24–28 July 2000; Volume 2, pp. 1026–1031.
  20. Koperski, K.; Adhikary, J.; Han, J. Spatial data mining: Progress and challenges survey paper. In Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, QC, Canada, 4–6 June 1996; pp. 1–10.
  21. Shekhar, S.; Zhang, P.; Huang, Y.; Vatsavai, R.R. Trends in spatial data mining. In Data Mining: Next Generation Challenges and Future Directions; AAAI Press: Washington, DC, USA, 2003; pp. 357–380.
  22. Ester, M.; Kriegel, H.P.; Sander, J. Spatial data mining: A database approach. Citeseer 1997, 97, 47–66.
  23. Shekhar, S.; Zhang, P.; Huang, Y. Spatial data mining. In Data Mining and Knowledge Discovery Handbook; Springer: Berlin/Heidelberg, Germany, 2010; pp. 837–854.
  24. Anselin, L. Local indicators of spatial association—LISA. Geogr. Anal. 1995, 27, 93–115.
  25. Hermawati, R.; Sitanggang, I.S. Web-Based clustering application using Shiny framework and DBSCAN algorithm for hotspots data in peatland in Sumatra. Procedia Environ. Sci. 2016, 33, 317–323.
  26. Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA, 1–4 June 1998; pp. 94–105.
  27. Tabarej, M.S.; Minz, S. Rough-set based hotspot detection in spatial data. In Proceedings of the International Conference on Advances in Computing and Data Sciences, Ghaziabad, India, 12–13 April 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 356–368.
  28. Runadi, T.; Widyaningsih, Y. Application of hotspot detection using spatial scan statistic: Study of criminality in Indonesia. In Proceedings of the AIP Conference Proceedings, Jawa Barat, Indonesia, 27–28 September 2016; AIP Publishing: New York, NY, USA, 2017; Volume 1827.
  29. Block, R. Software review: Scanning for clusters in space and time: A tutorial review of SatScan. Soc. Sci. Comput. Rev. 2007, 25, 272–278.
  30. Ishioka, F.; Kawahara, J.; Mizuta, M.; Minato, S.i.; Kurihara, K. Evaluation of hotspot cluster detection using spatial scan statistic based on exact counting. Jpn. J. Stat. Data Sci. 2019, 2, 241–262.
  31. Eftelioglu, E.; Tang, X.; Shekhar, S. Geographically robust hotspot detection: A summary of results. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015; pp. 1447–1456.
  32. Kumar, S.; Parida, B.R. Hydroponic farming hotspot analysis using the Getis–Ord Gi* statistic and high-resolution satellite data of Majuli Island, India. Remote Sens. Lett. 2021, 12, 408–418.
  33. Chaikaew, N.; Tripathi, N.K.; Souris, M. Exploring spatial patterns and hotspots of diarrhea in Chiang Mai, Thailand. Int. J. Health Geogr. 2009, 8, 1–10.
  34. Rahman, M.T.; Jamal, A.; Al-Ahmadi, H.M. Examining hotspots of traffic collisions and their spatial relationships with land use: A GIS-based geographically weighted regression approach for Dammam, Saudi Arabia. ISPRS Int. J. Geo-Inf. 2020, 9, 540.
  35. Hu, F.; Li, Z.; Yang, C.; Jiang, Y. A graph-based approach to detecting tourist movement patterns using social media data. Cartogr. Geogr. Inf. Sci. 2019, 46, 368–382.
  36. Loughrey, C.F.; Orr, N.; Jurek-Loughrey, A.; Dłotko, P. Hotspot identification for Mapper graphs. arXiv 2020, arXiv:2012.01868.
  37. Raj, A.; Minz, S. Spatial clustering using neighborhood for multispectral images. J. Appl. Remote Sens. 2020, 14, 038503.
  38. Behera, V.N.J.; Ranjan, A.; Reza, M. Graph based Clustering Algorithm for Social Community Transmission Prediction of COVID-19. arXiv 2020, arXiv:2011.00414.
  39. Tabarej, M.S.; Minz, S. Change Footprint Pattern Analysis of Crime Hotspot of Indian Districts. In Proceedings of the International Conference on Advanced Machine Learning Technologies and Applications, Jaipur, India, 13–15 February 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 325–335.
  40. Ord, J.K.; Getis, A. Local spatial autocorrelation statistics: Distributional issues and an application. Geogr. Anal. 1995, 27, 286–306.
  41. Getis, A.; Ord, J.K. The analysis of spatial association by use of distance statistics. Geogr. Anal. 1992, 24, 189–206.
  42. Songchitruksa, P.; Zeng, X. Getis–Ord spatial statistics to identify hot spots by using incident management data. Transp. Res. Rec. 2010, 2165, 42–51.
  43. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
  44. Meiyappan, P.; Roy, P.; Soliman, A.; Li, T.; Mondal, P.; Wang, S.; Jain, A. India Village-Level Geospatial Socio-Economic Data Set: 1991, 2001; NASA Socioeconomic Data and Applications Center (SEDAC): Palisades, NY, USA, 2018; Volume 10, p. H4CN71ZJ.
  45. Tabarej, M.S.; Minz, S. Rough-graph-based hotspot detection of polygon vector data. Multimed. Tools Appl. 2024, 83, 16683–16710.
  46. Wuu, J.Y.; Pikus, F.G.; Marek-Sadowska, M. Metrics for characterizing machine learning-based hotspot detection methods. In Proceedings of the 2011 12th International Symposium on Quality Electronic Design, Santa Clara, CA, USA, 14–16 March 2011; pp. 1–6.
  47. Ulak, M.B.; Ozguven, E.E.; Vanli, O.A.; Dulebenets, M.A.; Spainhour, L. Multivariate random parameter Tobit modeling of crashes involving aging drivers, passengers, bicyclists, and pedestrians: Spatiotemporal variations. Accid. Anal. Prev. 2018, 121, 1–13.
  48. Chainey, S.; Tompson, L.; Uhlig, S. The utility of hotspot mapping for predicting spatial patterns of crime. Secur. J. 2008, 21, 4–28.
Figure 1. Polygon data containing the point data.
Figure 2. Study region consisting of polygon data.
Figure 3. Graph of the spatial polygon data.
Figure 4. Spatial neighbour of boundary polygons.
Figure 5. Flow chart of the hotspot detection algorithm.
Figure 6. Map of the study area [45].
Figure 7. Joint plot for the village locations.
Figure 8. Density plot of the dataset for the medical services.
Figure 9. Hotspot of medical facilities at the village level.
Figure 10. Hotspot of medical facilities in 1991 at the village level using Moran’s I.
Figure 11. Hotspot of medical facilities in 1991 at the village level using Getis Ord Gi.
Figure 12. Hotspot of medical facilities in 1991 at the village level using Getis Ord Gi*.
Figure 13. Hotspot of medical facilities in 1991 at the village level using DBSCAN.
Figure 14. Density estimation plot of the hotspots.
Figure 15. Density of the hotspots.
Figure 16. Hotspot prediction accuracy index of the algorithms.
Figure 17. Running time of the algorithms.
Table 1. Literature survey related to the proposed study.

| Author (Year) | Proposed Work | Dataset Used | Remarks |
|---|---|---|---|
| Spatial Scan Statistics (2019) [28] | Uses spatial scan statistics to find the hotspots of crime cases in Indonesia. | The dataset was released by Badan Pusat Statistik (BPS) and was utilised in this research to find the hotspots. | Finds the hotspots of the 22 crime cases. Can generate a false hotspot. Computationally expensive. |
| Runadi, T. (2017) [30] | ZDD-based enumeration method to find the hotspot. | Applies the methodology to investigate sudden infant death syndrome in North Carolina. | Monte Carlo simulation is used to find the p-value for statistical significance. It is impossible to find the p-value using all the likelihood ratio (LR) statistics computed from each conceivable window acquired via ZDD. |
| Eftelioglu, E. (2015) [31] | Finds the hotspot of the point data by using a cubic grid circle algorithm. | A crime dataset is used to evaluate the performance of the algorithm. | Can find the hotspot of the point data set. It can only find the hotspot of the polygon shape. This method cannot be applied to a polygon dataset. |
| Tabarej, M. S. (2022) [5] | Finds the hotspots of confirmed, deceased, and recovered cases of COVID-19 in Indian districts. The method used for the detection of the hotspot is Moran’s I. | The dataset used is the COVID-19 data granulated at the district level. | This method can find the hotspot of polygon data. It assumes that the space is stationary, but in a real scenario the space is non-stationary. Dependent on the spatial weight matrix. |
| Kumar, S. (2021) [32] | Finds the hotspot of hydroponic farming using Getis Ord Gi* statistics. | High-resolution satellite data of Majuli Island is used for the detection of the hotspots. | Finds the hotspot of the high-resolution images. This method is also sensitive to outliers. |
| Chaikaew, N. (2009) [33] | Finds the hotspot of diarrhoea in Chiang Mai, Thailand. The methods used for the analysis of the hotspot are Moran’s I and Geary’s C Index. | The dataset for the detection of the hotspot is taken from the Bureau of Epidemiology, Ministry of Public Health, Thailand. | Finds the hotspots of diarrhoea. The study is helpful in finding high-density areas of diarrhoea and reducing it by taking proper action. This method assumes a linear relationship among the variables and does not work well with a non-linear relation among the variables. |
Table 2. Literature survey related to the proposed study.

| Author (Year) | Proposed Work | Dataset Used | Remarks |
|---|---|---|---|
| Rahman, M. T. (2020) [34] | Finds the hotspot of traffic crashes and land use in Dammam, Saudi Arabia to improve the crime prediction. The method used for the analysis of the hotspot is geographically weighted regression. | The crash dataset for the period 2009 to 2016 was collected from the Dammam traffic department. Boundaries of the residential neighbourhoods, population, and land use were collected from the Dammam Municipality’s office. | Finds the hotspot of road accidents. Useful in predicting crashes. Sensitive to the choice of the spatial neighbour. |
| Hu, F. (2019) [35] | Investigates the movement of tourists based on their tweets about a place. The method is used for finding a location. DBSCAN clustering is used to find the clusters, i.e., the hotspots, of the geotagged tweets. A graph is then constructed to find the movements. | The dataset used in the study is the Twitter data for tourist movements. | Finds the movements of tourists using hotspot analysis. DBSCAN clustering is used for dense cluster identification. This method is sensitive to parameter settings. |
| Loughrey, C. F. (2020) [36] | The Mapper graph algorithm is used in this study on multidimensional data to show the loops, flares, and clusters. | The TCGA dataset is used to construct the Mapper graphs. | This algorithm has difficulty in selecting the clustering algorithm. The choice of parameters, such as the number of samples, noise level, clustering parameters, and Mapper settings, can significantly impact the results. |
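Table 3. Weight of a polygon.

| $P$ | $a_1$ | $a_2$ | ... | $a_n$ |
|---|---|---|---|---|
| $p_1$ | $u_{11}$ | $u_{12}$ | ... | $u_{1n}$ |
| $p_2$ | $u_{21}$ | $u_{22}$ | ... | $u_{2n}$ |
| ... | ... | ... | ... | ... |
| $p_m$ | $u_{m1}$ | $u_{m2}$ | ... | $u_{mn}$ |
| $\omega_P$ | $\sum_{i=1}^{m} u_{i1}$ | $\sum_{i=1}^{m} u_{i2}$ | ... | $\sum_{i=1}^{m} u_{in}$ |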
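The last row of Table 3 defines the polygon weight as a column-wise sum over the rows $p_1, \dots, p_m$ (which Figure 1 suggests are the points contained in polygon $P$). A minimal Python sketch of that sum follows; the function name `polygon_weight` and the numeric values are hypothetical and are not taken from the paper.

```python
def polygon_weight(rows):
    """Sum each attribute column a_1..a_n over the rows p_1..p_m of one polygon."""
    if not rows:
        return []
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("every row must have the same number of attributes")
    # zip(*rows) transposes the m x n matrix so each tuple is one attribute column.
    return [sum(col) for col in zip(*rows)]


# Hypothetical values u_ij for m = 3 rows and n = 2 attributes.
u = [
    [1, 0],
    [2, 3],
    [0, 4],
]
omega_P = polygon_weight(u)
```

Each entry of `omega_P` corresponds to one cell of the $\omega_P$ row in Table 3.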
Table 4. Spatial neighbours.

| Polygon | Spatial Neighbourhood (N) | Occurrence ($\omega_P$) |
|---|---|---|
| $P_0$ | [$P_1$, $P_2$, $P_3$] | 2 |
| $P_1$ | [$P_0$, $P_2$, $P_3$, $P_4$] | 7 |
| $P_2$ | [$P_0$, $P_1$, $P_3$, $P_4$, $P_5$, $P_7$, $P_8$] | 5 |
| $P_3$ | [$P_0$, $P_1$, $P_2$, $P_7$, $P_8$] | 9 |
| $P_4$ | [$P_1$, $P_2$, $P_5$] | 2 |
| $P_5$ | [$P_2$, $P_4$, $P_6$, $P_7$] | 4 |
| $P_6$ | [$P_5$, $P_7$, $P_8$] | 8 |
| $P_7$ | [$P_2$, $P_3$, $P_5$, $P_6$, $P_8$] | 10 |
| $P_8$ | [$P_3$, $P_6$, $P_7$] | 1 |
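The neighbour lists in Table 4 can be read as the adjacency lists of the graph in Figure 3. The sketch below, in Python, stores them as a dictionary, collapses them into undirected edges, and checks whether every listing is mutual. The helper names are hypothetical and the code is only an illustration of the data structure, not the authors' implementation.

```python
# Neighbour lists and occurrence counts transcribed from Table 4 (key i stands for P_i).
neighbours = {
    0: [1, 2, 3],
    1: [0, 2, 3, 4],
    2: [0, 1, 3, 4, 5, 7, 8],
    3: [0, 1, 2, 7, 8],
    4: [1, 2, 5],
    5: [2, 4, 6, 7],
    6: [5, 7, 8],
    7: [2, 3, 5, 6, 8],
    8: [3, 6, 7],
}
weight = {0: 2, 1: 7, 2: 5, 3: 9, 4: 2, 5: 4, 6: 8, 7: 10, 8: 1}


def undirected_edges(adj):
    """Collapse the directed neighbour listings into a set of undirected edges."""
    return {frozenset((u, v)) for u, nbrs in adj.items() for v in nbrs}


def one_sided_pairs(adj):
    """Return (u, v) pairs where u lists v as a neighbour but v does not list u."""
    return sorted((u, v) for u, nbrs in adj.items() for v in nbrs if u not in adj[v])


edges = undirected_edges(neighbours)
degree = {p: len(nbrs) for p, nbrs in neighbours.items()}
asymmetric = one_sided_pairs(neighbours)
```

On the table as printed, `asymmetric` contains the single pair `(2, 8)`: $P_2$ lists $P_8$, but $P_8$ does not list $P_2$. Whether that reflects the geometry in Figure 4 or a transcription slip cannot be decided from the table alone.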
Table 5. Function mapping.

| $f$ | $a_1$ | $a_2$ | ... | $a_j$ | ... |
|---|---|---|---|---|---|
| $P_1$ | | | | | |
| $P_2$ | | | | | |
| ... | | | | | |
| $P_i$ | | | ... | $v_{ij}$ | ... |
| ... | | | | | |
Table 6. Polygon weight and the λ value.

| Polygon | Weight ($\omega_P$) | λ |
|---|---|---|
| $P_0$ | 2 | 2 |
| $P_1$ | 7 | 1 |
| $P_2$ | 5 | 1 |
| $P_3$ | 9 | 0 |
| $P_4$ | 2 | 2 |
| $P_5$ | 4 | 2 |
| $P_6$ | 8 | 0 |
| $P_7$ | 10 | 0 |
| $P_8$ | 1 | 2 |
Table 7. Health facilities and their abbreviations.

| Attributes | Abbreviations | Number |
|---|---|---|
| Number of Health Centre | H_CNTR | 1 |
| Number of Maternity and Child Welfare Centre | MCW_CNTR | 0 |
| Number of Maternity Home | MH_CNTR | 1 |
| Number of Child Welfare Centre | CWC_CNTR | 2 |
| Number of Health Centre | HC_CNTR | 0 |
| Number of Primary Health Centre | PHC_CNTR | 0 |
| Number of Primary Health Sub Centre | PHS_CNTR | 3 |
| Number of Dispensary | DISP_CNTR | 1 |
| Number of T.B. Clinic | TB_CNTR | 0 |
| Number of Nursing Home | NH_CNTR | 0 |
| Number of Other medical facilities | OTH_CNTR | 0 |
| Medical Facility | - | 8 |
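The "Medical Facility" row of Table 7 appears to be the sum of the individual facility counts for one record. The short Python sketch below reproduces that total from the transcribed counts; the variable names are hypothetical, and reading the last row as a plain sum is an assumption inferred from the numbers.

```python
# Per-attribute counts transcribed from Table 7 (abbreviation -> number).
facility_counts = {
    "H_CNTR": 1,
    "MCW_CNTR": 0,
    "MH_CNTR": 1,
    "CWC_CNTR": 2,
    "HC_CNTR": 0,
    "PHC_CNTR": 0,
    "PHS_CNTR": 3,
    "DISP_CNTR": 1,
    "TB_CNTR": 0,
    "NH_CNTR": 0,
    "OTH_CNTR": 0,
}

# Aggregate attribute: total number of medical facilities for this record.
medical_facility = sum(facility_counts.values())
```

The result matches the value 8 shown in the table's last row.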
Table 8. Evaluation parameters of the proposed method compared with the literature methods.

| Method | Area of the Hotspot | Density | HPAI | Run Time |
|---|---|---|---|---|
| Moran’s I | 1,869,247.83 | 43.37 | 1.74 | 91.04 |
| Getis Ord Gi | 2,911,097.00 | 61.16 | 2.65 | 53.05 |
| Getis Ord Gi* | 2,894,084.06 | 60.84 | 2.49 | 33.80 |
| DBSCAN | 2,994,807.37 | 34.18 | 2.06 | 35.39 |
| GBHSDRS | 926,913.85 | 67.91 | 2.92 | 32.97 |
Table 9. Analysis of the density value.

| Method | Density | Increase (in %) |
|---|---|---|
| Moran’s I | 43.37 | 36.14 |
| Getis Ord Gi | 61.16 | 9.94 |
| Getis Ord Gi* | 60.84 | 10.42 |
| DBSCAN | 34.18 | 49.67 |
| GBHSDRS | 67.91 | |
| Average Density Gain | | 26.54 |
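The percentages in Table 9 are consistent with a gain measured relative to the GBHSDRS density, i.e. $100 \times (d_{\text{GBHSDRS}} - d_{\text{method}}) / d_{\text{GBHSDRS}}$. This normalisation is inferred from the numbers rather than stated here, so the Python sketch below should be read as a check under that assumption; the function name `relative_gain` is hypothetical.

```python
# Density values transcribed from Table 9.
density = {
    "Moran's I": 43.37,
    "Getis Ord Gi": 61.16,
    "Getis Ord Gi*": 60.84,
    "DBSCAN": 34.18,
}
proposed_density = 67.91  # GBHSDRS


def relative_gain(proposed_value, baseline_value):
    """Gain of the proposed method over a baseline, as a percentage of the proposed value."""
    return 100.0 * (proposed_value - baseline_value) / proposed_value


gains = {name: relative_gain(proposed_density, value) for name, value in density.items()}
average_gain = sum(gains.values()) / len(gains)
```

Recomputing from the rounded table entries reproduces the reported per-method gains to within about 0.01 percentage points and the reported average of 26.54%.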
Table 10. Analysis of the HPAI.

| Method | Hotspot Prediction Accuracy Index | Increase (in %) |
|---|---|---|
| Moran’s I | 1.74 | 40.39 |
| Getis Ord Gi | 2.65 | 9.15 |
| Getis Ord Gi* | 2.49 | 14.60 |
| DBSCAN | 2.06 | 29.50 |
| GBHSDRS | 2.92 | |
| Average HPAI Gain | | 23.41 |
Table 11. Analysis of the running time.

| Method | Run Time | Reduction (in %) |
|---|---|---|
| Moran’s I | 91.04 | 63.78 |
| Getis Ord Gi | 53.05 | 37.85 |
| Getis Ord Gi* | 33.80 | 2.46 |
| DBSCAN | 35.39 | 6.83 |
| GBHSDRS | 32.97 | |
| Average Running Time Gain | | 27.73 |
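For the running time, the percentages in Table 11 match a reduction measured relative to each baseline's own run time, i.e. $100 \times (t_{\text{method}} - t_{\text{GBHSDRS}}) / t_{\text{method}}$. As with the density gain, this formula is inferred from the reported numbers and is an assumption; the function name `relative_reduction` in the Python sketch below is hypothetical.

```python
# Run times transcribed from Table 11 (units as reported in the paper).
run_time = {
    "Moran's I": 91.04,
    "Getis Ord Gi": 53.05,
    "Getis Ord Gi*": 33.80,
    "DBSCAN": 35.39,
}
proposed_time = 32.97  # GBHSDRS


def relative_reduction(baseline_value, proposed_value):
    """Run-time saving of the proposed method, as a percentage of the baseline's run time."""
    return 100.0 * (baseline_value - proposed_value) / baseline_value


reductions = {
    name: relative_reduction(value, proposed_time) for name, value in run_time.items()
}
average_reduction = sum(reductions.values()) / len(reductions)
```

Note the different denominator from the density calculation: here each saving is expressed against the slower baseline, which keeps every percentage below 100. The recomputed average agrees with the reported 27.73%.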
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tabarej, M.S.; Minz, S.; Shaikh, A.A.; Shuaib, M.; Jeribi, F.; Alam, S. Graph-Based Hotspot Detection of Socio-Economic Data Using Rough-Set. Mathematics 2024, 12, 2031. https://doi.org/10.3390/math12132031