1. Introduction
A hotspot is a spatial cluster pattern: a region or geographical area where occurrences of a phenomenon are prevalent [1,2]. A hotspot can be distinguished from an ordinary cluster by its spatial autocorrelation: events within a hotspot are more strongly spatially autocorrelated, whereas the events in a cluster are less autocorrelated or may even be independently and identically distributed (i.i.d.) [1]. Hotspot detection involves finding the location or region with the most active events. It has several application domains, including (1) Crime [3,4]: analysing crime data to identify places with high levels of criminal activity, which facilitates the efficient allocation of resources and focused crime prevention efforts. (2) Health [5]: identifying the geographical regions with a high prevalence of illnesses such as COVID-19 [5], flu, and malaria [6,7], which facilitates the distribution of resources and preventive measures. (3) Risk assessment: identifying the regions susceptible to natural calamities, such as seismic activity and floods [8,9] and fire [10,11], which aids readiness and response efforts. (4) Traffic analysis [12]: identifying congestion hotspots in order to optimise traffic flow, plan infrastructure projects, and enhance public transit systems.
The significance of geospatial data analysis has grown with the increasing use of location-aware services [13]. It is used in a variety of fields, such as climate change analysis [14,15], weather forecasting [16,17], and global warming [18,19]. Geographic classification, clustering, outlier identification, co-location pattern mining, regression, and spatial hotspot detection are essential tasks in spatial data mining [20,21,22,23]. Spatial data analysis is more complex than the analysis of non-spatial data because of spatial autocorrelation [1,23,24].
The literature section describes the methods used to detect hotspots in point data and polygon data. Most of the methods discussed for point data lack a significance test, so they can generate false hotspots. The strategies that do produce significant hotspots are computationally expensive because they need the whole dataset to compute the hotspots. The methods discussed for polygon data assume that the study region is stationary and work well with continuous data. Some graph-based methods are also discussed, but they rely on clustering algorithms such as DBSCAN to group similar objects; setting the parameters of the clustering algorithm is challenging, and the results depend on the parameter setting. Therefore, this paper aims to develop an algorithm that does not depend on the distribution of the underlying dataset, is cost-effective, and finds a hotspot only where one is present in the dataset (that is, it removes false positives).
The proposed GBHSDRS algorithm will determine the hotspot and eliminate false hotspots by using significance tests. Like other machine learning techniques, rough set-based methods need assessment criteria to measure the effectiveness of the proposed method. The proposed method is categorised as an unsupervised method as the patterns are identified without predetermined labels. The study outcomes benefit policymakers, healthcare administrators, and stakeholders in promoting fair and equal availability of healthcare services at the village level. This study can potentially optimise resource distribution by explicitly focusing on areas with high demand, minimising inequalities in healthcare, and enhancing health results among various population groups.
The GBHSDRS algorithm prunes the dataset by removing the polygons belonging to the negative region, thereby reducing the running time of the algorithm. The significant contributions of this research work are as follows:
This paper incorporates graph theory and rough sets for hotspot detection. Combining the strengths of both approaches improves the accuracy and efficiency of hotspot identification.
This paper applies the BFS algorithm in the wrapper model to process the data starting from the boundary region, which improves the efficiency of the algorithm.
The method identifies a hotspot if one exists and eliminates the detection of false hotspots.
By developing a scalable hotspot identification technique, this paper makes it possible to apply the method effectively to polygon datasets in real-world applications where data volumes are high.
The remainder of this paper is organised as follows. A literature survey related to the proposed methodology for hotspot detection is presented in Section 2. Preliminaries related to the algorithm are presented in Section 3. Graph-based hotspot detection using the rough set algorithm is discussed in detail in Section 4. The experimental evaluation, covering the dataset used, the evaluation metrics, and the results, is presented in Section 5. A discussion of the results is given in Section 6. Finally, Section 7 presents the conclusion and future research directions.
2. Literature Survey
Spatial vector data can be represented as points, lines, and polygons, and the method for detecting a hotspot varies with the representation. The literature survey related to the proposed study is summarised in Table 1 and Table 2. Clustering techniques such as DBSCAN [25] and CLIQUE [26] can find hotspots in point data. They are inexpensive but can generate false hotspots because they lack a significance test. A rough set-based hotspot detection method [27] can also find hotspots in point data, but it too can generate false hotspots because it has no significance test. The spatial scan statistics algorithm [28] can find hotspots in point data, but it is computationally very expensive and cannot find a hotspot when there are obstacles in the study region. Spatial scan statistics are implemented in the well-known software SaTScan v10.2 [29], which takes point data as input. This software finds only circular hotspots and cannot find a hotspot if there is an obstacle in the study region. Runadi, T. (2017) [30] uses a ZDD-based enumeration method to find hotspots of sudden infant death syndrome in North Carolina. With this method, it is impossible to find the p-value for the significance test. The Cubic Grid Circle (CGC) method [31] can find hotspots in point data. The identified hotspots are circular; the method cannot find hotspots of other shapes, nor can it be applied to polygon datasets.
Tabarej, M. S. (2022) [5] finds the hotspots of confirmed, deceased, and recovered cases of COVID-19 in an Indian district using Moran’s I. This method depends on a spatial weight matrix, which assumes that the space is stationary. Kumar, S. (2021) [32] finds hotspots of hydroponic farming using the Getis Ord Gi* statistic applied to high-resolution satellite data. This method is sensitive to outliers. Chaikaew, N. (2009) [33] finds hotspots of diarrhoea instances in Chiang Mai, Thailand, using Moran’s I and Geary’s C index. These methods assume a linear relationship among the variables and do not work well when the relationship is non-linear. Rahman, M. T. (2020) [34] finds hotspots of traffic crashes in relation to land use in Dammam, Saudi Arabia, to improve crash prediction. The method used for the analysis is geographically weighted regression, which depends on the choice of the spatial neighbours.
A graph-based approach to detecting tourist movements [
35] creates the tourist destination as the vertex, and movement from one location to the next is taken as the edge. Different tweets in the same locality are categorised as a vertex by the DBSCAN clustering algorithm. The major issue is in choosing two parameters, Eps (search radius) and MinPts (minimum number of points in the search radius). A more advanced method is needed to find the vertex, and this method is also sensitive to parameter settings. A clustering algorithm using the spatial information of the neighbourhood is defined by ref. [
37]. The algorithm employs adjacent pixel values to form the cluster. This algorithm can find all the clusters in the dataset. This algorithm should be tested on feature-based similarity. The algorithm for hotspot detection of the mapper graph is given in ref. [
36]. This algorithm builds the graph of the multidimensional data to see the loops, flares, and clusters. This algorithm has difficulty in selecting the clustering algorithm. The choice of parameters, such as the number of samples, noise level, clustering parameters, and Mapper settings, can significantly impact the results. A graph-based clustering algorithm for the social community transmission of COVID-19 is given in ref. [
38]. This method constructs a graph by considering the vertex as the location of an activity, and transmission between the two locations serves as the edge between the two vertices. This algorithm divides the nodes into three colours based on their severity. The selection of the clustering algorithm and the threshold cause significant limitations.
3. Preliminaries
Definition 1. A geographical point is a pair , consisting of latitude x and longitude y. A geographical point data is a pair , consisting of point p and the magnitude of occurrence of the event at the point p.
Example: an earthquake of magnitude 5.3 on the Richter scale is observed in Yonakuni, Japan. The geographical coordinates of the location are . Therefore, , and the point data .
Definition 2. Let a line segment (edge) e be defined by a pair of points and represented as . Then a line l, a finite sequence of connected line segments, represented as . i.e., . Therefore, alternately, .
Definition 3. A polygon surrounds a geographical region or area. A polygon A is defined by a sequence of edges . Alternatively, a polygon is represented by a sequence of points where .
A polygon data P is a polygon containing one or more point data , represented by a pair where is the weight of the polygon. A random point taken from a polygon is considered the vertex c.
As shown in Figure 1, a polygon $A$ is composed of a sequence of edges or, equivalently, of a sequence of points. A polygon that contains point data is called polygon data. Each point data in the polygon can have several attributes, such as crime, disease cases, or the occurrence of fire. Let $v_{ij}$ denote the value of the $j$th attribute at the $i$th point, as shown in Table 3. The table contains a column $P$, which lists the data points in the polygon $P$, and one column per attribute, which holds the value of that attribute at each point. The column sum gives the total of each attribute over the polygon, and the sum over all the attributes gives the weight $w(P)$ of the polygon, as given in Equation (1):

$$ w(P) = \sum_{i} \sum_{j} v_{ij} \qquad (1) $$
Definition 4. A polygon data $P_i$ is said to be a hotspot [27,39] if the weight of the polygon is greater than a pre-decided threshold δ. A polygon hotspot is then defined as:

$$ H = \{ P_i : w(P_i) > \delta \} $$

where $w(P_i)$ is the weight of the polygon $P_i$ and δ is a user-defined threshold.

Definition 5. Spatial neighbourhood: Let two polygons be represented as $P_i$ and $P_j$, and let $N$ be the spatial neighbourhood relation defined over the polygons. Then, for any two polygons $P_i$ and $P_j$, the neighbourhood relation is defined as:

$$ N(P_i, P_j) = \begin{cases} 1, & P_i \cap P_j \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \qquad (2) $$

Equation (2) states that if the intersection of the two polygons is non-empty, then the polygons are said to be spatial neighbours.

The geographical neighbourhood is a reflexive, symmetric, and non-transitive relation that formulates spatial relationships in geographic data science and spatial statistics. Two polygons $P_i$ and $P_j$ are considered spatial neighbours if they share a common point or a common edge. The spatial neighbourhood of $P_i$ is the list of all polygons $P_j$ that are spatial neighbours of $P_i$.
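Definitions 4 and 5 can be illustrated with a minimal Python sketch. It assumes a clean tessellation in which two polygons that touch always share at least one vertex, so a shared vertex stands in for the intersection test; for convenience it also leaves a polygon out of its own neighbour list, although the relation is reflexive. The polygon names, coordinates, weights, and threshold are hypothetical and are not taken from Figure 2 or Table 4.

```python
from itertools import combinations


def spatial_neighbours(polygons):
    """Map each polygon name to the sorted list of polygons it touches.

    `polygons` maps a name to a list of (x, y) vertices. Assumption: touching
    polygons share at least one vertex, so a shared vertex is used as the
    non-empty-intersection test.
    """
    vertex_sets = {name: set(vertices) for name, vertices in polygons.items()}
    neighbours = {name: set() for name in polygons}
    for a, b in combinations(polygons, 2):
        if vertex_sets[a] & vertex_sets[b]:  # non-empty intersection
            neighbours[a].add(b)
            neighbours[b].add(a)  # the relation is symmetric
    return {name: sorted(found) for name, found in neighbours.items()}


def polygon_hotspots(weights, delta):
    """Return the polygons whose weight exceeds the threshold delta."""
    return {name for name, weight in weights.items() if weight > delta}


# Hypothetical layout: A and B share an edge, B and C share only a corner.
toy_polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(2, 1), (3, 1), (3, 2), (2, 2)],
}
toy_weights = {"A": 9, "B": 4, "C": 7}

print(spatial_neighbours(toy_polygons))
print(polygon_hotspots(toy_weights, delta=6))
```

In this toy layout A and C are not neighbours even though both touch B, which is the non-transitivity noted above.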
For example, a polygon dataset consisting of the point is shown in
Figure 2. The spatial neighbour of the polygons is shown in
Table 4. The table contains three columns: polygon, spatial neighbour, and occurrences. The variable polygon includes the name of the polygon for which the spatial neighbour is being calculated. The variable spatial neighbour contains all the spatial neighbours of the polygon given in the column polygon. The column occurrence includes the weight of the polygon.
Definition 6. A neighbourhood graph is an ordered pair $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges; an edge $(c_i, c_j) \in E$ exists when the polygons $P_i$ and $P_j$, represented by the vertices $c_i$ and $c_j$, are spatial neighbours.
3.1. Local Moran’s I
Local spatial autocorrelation (LSA) measures the association within spatially distributed data, i.e., the degree to which the spatial data are related to their neighbours. It finds clusters and outliers within spatial data. A hotspot is a special cluster where the density of occurrence is high. LSA methods use a spatial autocorrelation index for hotspot identification. Various indices have been proposed. One of them is the local Moran’s I [24], which is defined as:

$$ I_i = \frac{x_i - \bar{x}}{\sigma^2} \sum_{j=1,\, j \neq i}^{n} w_{ij}\,(x_j - \bar{x}) $$

where $x_i$ represents the weight of the polygon $P_i$, $\bar{x}$ represents the average weight of the study region $S$, $w_{ij}$ is the reciprocal value of the spatial weight between the polygon $i$ and the neighbouring polygon $j$, $n$ represents the total number of polygons in the study region, and $\sigma^2$ represents the variance of the observed cases.
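As an illustration of how the local Moran’s I could be computed, the following Python sketch uses binary adjacency weights (1 for a spatial neighbour, 0 otherwise) and the population variance. Both choices are assumptions made for the example; the function name and the toy data are hypothetical.

```python
def local_morans_i(values, neighbours):
    """Local Moran's I per polygon, assuming binary adjacency weights."""
    n = len(values)
    mean = sum(values.values()) / n
    # Population variance of the observed values (assumed normalisation).
    variance = sum((v - mean) ** 2 for v in values.values()) / n
    if variance == 0:
        raise ValueError("all values are equal; the statistic is undefined")
    result = {}
    for name, value in values.items():
        # Spatial lag: summed deviations of the neighbouring polygons.
        lag = sum(values[other] - mean for other in neighbours[name])
        result[name] = (value - mean) / variance * lag
    return result


# Hypothetical chain of four polygons: high values at one end, low at the other.
toy_values = {"A": 10, "B": 8, "C": 1, "D": 1}
toy_neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = local_morans_i(toy_values, toy_neighbours)
print(scores)
```

A positive score marks a polygon whose value deviates from the mean in the same direction as its neighbours (high-high or low-low), so the sign alone does not separate hotspots from coldspots; the deviation of the polygon itself must also be checked.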
3.2. Getis Ord $G_i$
Let the study region $S$ be divided into sub-regions $P_1, P_2, \ldots, P_n$, and let each sub-region $P_i$ be called a polygon. The $G_i$ statistic, in its simplest form, measures the degree of association between a given polygon and the neighbouring polygons [40,41] and is defined as:

$$ G_i = \frac{\sum_{j=1,\, j \neq i}^{n} w_{ij}\, x_j}{\sum_{j=1,\, j \neq i}^{n} x_j} $$

where $w_{ij}$ is a spatial weight whose value depends on whether the polygon $P_j$ is a spatial neighbour of the polygon $P_i$, $x_j$ is the value at the polygon $P_j$, and $n$ is the total number of spatial units or polygons.
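3.3. Getis-Ord $G_i^*$
Let the study region $S$ be divided into $P_1, P_2, \ldots, P_n$, where each sub-region $P_i$ represents a polygon. Then, the $G_i^*$ statistic [40,42] is defined as:

$$ G_i^* = \frac{\sum_{j=1}^{n} w_{ij}\, x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{\sigma \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left( \sum_{j=1}^{n} w_{ij} \right)^2}{n - 1}}} $$

where $x_j$ is the value at the polygon $P_j$, $w_{ij}$ is the spatial weight between the polygons $P_i$ and $P_j$, $n$ is the total number of polygons in the study area, $\bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j$ is the mean, and $\sigma^2 = \frac{1}{n} \sum_{j=1}^{n} x_j^2 - \bar{X}^2$ is the variance. The $G_i^*$ statistic is itself a z-score, so no further calculation is needed for the significance test. For positive values, the larger the z-score, the denser the clustering of high values, i.e., the hotspot.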
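The $G_i^*$ z-score can be sketched in Python as follows, again assuming binary adjacency weights, with each polygon included in its own neighbourhood as the starred statistic requires. The function name and the toy data are hypothetical.

```python
import math


def getis_ord_gi_star(values, neighbours):
    """Getis-Ord Gi* z-score per polygon, assuming binary adjacency weights."""
    n = len(values)
    mean = sum(values.values()) / n
    variance = sum(v * v for v in values.values()) / n - mean ** 2
    if variance <= 0:
        raise ValueError("all values are equal; the statistic is undefined")
    std = math.sqrt(variance)
    scores = {}
    for name in values:
        window = [name, *neighbours[name]]  # Gi* includes polygon i itself
        weight_sum = len(window)  # with binary weights, sum(w) == sum(w**2)
        numerator = sum(values[j] for j in window) - mean * weight_sum
        spread = (n * weight_sum - weight_sum ** 2) / (n - 1)
        # If the window covers every polygon, numerator and spread are both 0.
        scores[name] = numerator / (std * math.sqrt(spread)) if spread > 0 else 0.0
    return scores


# Hypothetical chain of four polygons: high values at one end, low at the other.
toy_values = {"A": 10, "B": 8, "C": 1, "D": 1}
toy_neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
z_scores = getis_ord_gi_star(toy_values, toy_neighbours)
print(z_scores)
```

Unlike the local Moran’s I, the sign of the score is informative here: the high-valued end of the chain receives a positive z-score and the low-valued end a negative one.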
3.4. DBSCAN on the Graph of the Polygon Vector Data
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm commonly used to find spatially dense clusters. DBSCAN works on a point dataset. To apply DBSCAN to the polygon vector data, a graph is constructed on the polygon data by taking a random point from each polygon. The DBSCAN method finds dense clusters by using two parameters: the neighbourhood radius epsilon ($\varepsilon$) and the minimum number of points required to form a dense region (minPts). The clusters identified are called hotspots, and all the noise points are considered non-hotspots.
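To make the roles of the two parameters concrete, here is a compact, dependency-free Python sketch of DBSCAN on representative points. It is a simplified illustration (a brute-force neighbour search, with the point itself counted towards minPts), not the implementation used in the experiments; the sample points are hypothetical.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = region(j)
            if len(reach) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reach)
    return labels


# Hypothetical representative points: a dense square and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
toy_labels = dbscan(toy_points, eps=1.5, min_pts=3)
print(toy_labels)
```

The four nearby points form one dense cluster (a hotspot in the sense above), while the isolated point is labelled as noise (a non-hotspot).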
3.5. Rough-Set Concepts
The rough set theory [43] is a mathematical model that deals with uncertainty, vagueness, and imprecision. Most real-world data are imperfect, and the main aim of rough set theory is to understand, manipulate, and analyse such data.
In rough set theory, data are represented by the information system $I = (U, Q, V, f)$, where $U$ is a finite set of objects, $Q$ is a set of attributes, $V$ is the set of domains corresponding to the attributes, and $f$ is a function mapping a data point and an attribute to an attribute value. Let $q \in Q$, and let $V_q$ be the domain of attribute $q$. Then, $V = \bigcup_{q \in Q} V_q$ and $f : U \times Q \to V$; if $v$ is the value of attribute $q$ for data point $x \in U$, then $f(x, q) = v \in V_q$.
The mapping of the function $f$ is shown in Table 5. The rows of the table are the polygons $P_i$, the columns are the attributes $q_j$, and the entry for the polygon $P_i$ and the attribute $q_j$ is $f(P_i, q_j)$.
Let $B \subseteq Q$; then, the indiscernibility relation $IND(B)$ is defined as:

$$ IND(B) = \{ (x, y) \in U \times U : f(x, q) = f(y, q) \ \ \forall q \in B \} $$

The indiscernibility class of an object $x$ is denoted as $[x]_B$. Now let $X \subseteq U$, where $X$ is defined as $X = \{ x \in U : w(x) > \delta \}$, where $w(x)$ is the weight of $x$ and δ is a pre-decided threshold. Then, the lower and upper approximations are defined as:

$$ \underline{B}X = \{ x \in U : [x]_B \subseteq X \}, \qquad \overline{B}X = \{ x \in U : [x]_B \cap X \neq \emptyset \} $$

The boundary region of $X$ with respect to the attribute set $B$ is defined as the set difference between the upper approximation and the lower approximation: $BN_B(X) = \overline{B}X - \underline{B}X$.
The negative region is defined as $NEG_B(X) = U - \overline{B}X$, which contains the set of objects that can be definitely ruled out as members of the target set $X$.
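The approximations above translate directly into set operations. The following Python sketch groups objects into indiscernibility classes and derives the lower, upper, boundary, and negative regions of a target set; the attribute name `level`, the polygon names, and the target set are hypothetical and are not the values of Table 6.

```python
from collections import defaultdict


def rough_regions(table, attributes, target):
    """Split objects into lower, upper, boundary and negative regions.

    `table` maps each object to a dict of attribute values; objects with equal
    values on `attributes` fall into the same indiscernibility class.
    """
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:
            lower |= block  # class lies entirely inside the target set
        if block & target:
            upper |= block  # class overlaps the target set
    return {
        "lower": lower,
        "upper": upper,
        "boundary": upper - lower,
        "negative": set(table) - upper,
    }


# Hypothetical information system with one discretised attribute.
toy_table = {
    "P1": {"level": "high"},
    "P2": {"level": "high"},
    "P3": {"level": "medium"},
    "P4": {"level": "medium"},
    "P5": {"level": "low"},
}
toy_target = {"P1", "P2", "P3"}
regions = rough_regions(toy_table, ["level"], toy_target)
print(regions)
```

Here P3 and P4 are indiscernible, but only P3 is in the target set, so both land in the boundary region, while P5 can be ruled out and falls in the negative region.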
For example, consider
Table 4. The value of
is defined as:
where the thresholds
and
are user-defined and
, The value of
is given in
Table 6. This table contains three columns: polygon, weight
, and the
. The column name
P contains the name of the polygon, weight
contains the weight of the polygon, and column
contains the value of the
, as described in Equation (
10). The information system is defined as
. The indiscernibility relation when
is given as
. The two thresholds
and
, are defined as the third and second quartile, respectively, and their values are taken as 7.5 and 5. Now if
X is defined as
, where
then
. Then, the lower approximation is
and the upper approximation is
. Now the boundary region is the set difference between the upper and the lower approximation:
. The negative region is given as:
.
3.6. Neighbour Graph Construction
To construct a neighbour graph from the spatial polygon data, a random point from each polygon is taken as the vertex $c$ of the graph. Let the function $g$ find the spatial neighbours, so that $g(P_i) = N(P_i)$, where $N(P_i)$ is the set of spatial neighbours of the polygon $P_i$. There is an edge between two polygons if they are spatial neighbours. Let $h$ be a function that maps each vertex to the lower, upper, boundary, or negative region; then $h$ is defined as $h : P \to L$, where the domain $P$ of $h$ is the set of all polygons and the range is the set of labels $L = \{\text{lower}, \text{upper}, \text{boundary}, \text{negative}\}$.
For example, consider
Figure 2: the figure has a set of polygons
and the random point chosen from each polygon is given as
, respectively. The polygon
shares a boundary with
and
but a point with
. Therefore,
are the spatial neighbours of the
. Therefore, there is an edge between
to
. Similarly, the neighbour of each polygon is calculated. The neighbour of each polygon is shown in
Table 4. This table shows the neighbours of each polygon as an adjacency list, along with the occurrence of events in each polygon. The neighbour graph of the spatial polygon data is shown in
Figure 3. This figure also shows the label of the node as the lower, boundary and negative regions, and these are shown in different colours.
3.7. Graph Traversal
Following the construction of the graph, as shown in
Figure 3, the traversal of the graph is needed for the spatial analysis. The nodes are traversed in the breadth-first search manner. The children of each node belonging to the boundary region are chosen for the traversal. All the spatial neighbours of the node in the boundary region are calculated as described in Definition 5 and traversed.
For
example, the nodes belonging to the boundary region are
, the children of the node
and
are
, and
, respectively. A tree representing the spatial neighbour is shown in
Figure 4.
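The traversal can be sketched in Python with an adjacency-list graph. The sketch shows a plain breadth-first search from one start node and a helper that lists the children (spatial neighbours) of every boundary node; the graph, the labels, and the function names are hypothetical and do not reproduce Figure 3 or Figure 4.

```python
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from `start` in breadth-first order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                visited.append(child)
                queue.append(child)
    return visited


def boundary_children(graph, labels):
    """For each boundary node, list its children (spatial neighbours)."""
    return {
        node: list(graph[node])
        for node, label in labels.items()
        if label == "boundary"
    }


# Hypothetical neighbour graph and rough-set labels.
toy_graph = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P2", "P3"],
}
toy_labels = {"P1": "lower", "P2": "boundary", "P3": "negative", "P4": "boundary"}

print(bfs_order(toy_graph, "P2"))
print(boundary_children(toy_graph, toy_labels))
```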
4. Methodology
Hotspot identification has emerged as a critically important research area in recent years because of the COVID-19 outbreak. Hotspot detection refers to identifying the region or place where a particular phenomenon is observed most often and most intensively. Depending on the data format, a hotspot may be identified for point, line, or polygon vector data. This research specifically examines the identification of hotspots in polygon data; point and line data are outside the scope of this article. This study uses rough set theory as the method for hotspot identification. The theory categorizes the dataset into three distinct regions: the lower, boundary, and negative regions. This process is sometimes referred to as three-way decision-making. The members of the lower region surely belong to the target set, whereas the members of the boundary region may or may not belong to the target set and need further analysis. The members of the negative region surely do not belong to the target set and are excluded from further analysis. In this way, the rough set prunes the dataset, which improves the efficiency of the algorithm; this efficiency is the primary rationale for using rough set theory for hotspot identification. The dataset used in this article consists of polygon vector data granulated at the village level and includes the healthcare facilities available at the village level.
Graph-Based Hotspot Detection Using Rough Set Algorithm
This paper proposes a novel algorithm called the Graph-Based Hotspot Detection algorithm using a Rough Set (GBHSDRS). The method applies a rough set to the graph of the spatial polygon data to find the hotspots. First, it considers spatial polygon data in which each polygon data point is represented by a polygon $A$ and the number of events in the polygon, i.e., its weight. Next, the spatial neighbours of each polygon data point are calculated for spatial processing, as described in Definition 5. An attribute is then defined from the weight of each polygon, as described in Section 3.5; the weights and the resulting attribute values are given in Table 6. After finding these values, a rough set is applied to the polygon data, as described in Section 3.5. This divides the study region into the lower approximation and the upper approximation, and consequently the polygons into the lower region, boundary region, and negative region. Following the application of the rough set, a neighbour graph is constructed. A random point from each polygon $A$ is chosen as a vertex of the neighbour graph, and an edge $E$ exists between two nodes if they are spatial neighbours. Table 4 gives a complete list of the spatial neighbours. In this way, a graph is constructed for the spatial polygon data. The nodes are labelled as the lower region, boundary region, or negative region based on the outcome of the rough set.
The nodes labelled as the lower approximation are considered candidate hotspots and are represented by $C_1$. Following this, the nodes belonging to the boundary region are selected and traversed, as described in Section 3.7. If the combined frequency of occurrence of events in a boundary node and one of its neighbours is greater than a pre-decided threshold, the vertex is moved to the lower approximation and labelled as lower. These newly approximated nodes form a second set of candidate hotspots, $C_2$. The two sets $C_1$ and $C_2$ are then merged to form the final set of candidate hotspots, $C = C_1 \cup C_2$. Now, two hypotheses are defined: (1) the null hypothesis ($H_0$), which states that there is no significant hotspot in the data and that all the hotspots arise by random chance; (2) the alternate hypothesis ($H_1$), which states that there is a significant hotspot in the data. Then, the p-value of the test statistic is calculated; it represents the probability of observing the test statistic under the null hypothesis. Next, a significance threshold $\alpha$ is chosen: hotspots with a p-value less than $\alpha$ are considered statistically significant. The most common significance levels are 0.05, 0.01, and 0.001, but the choice depends on the context of the study and the consequences of making errors. For example, a significance level of 0.05 is commonly used when the risk is moderate, 0.01 when the risk is low, and 0.001, which is very stringent, in high-stakes research. Therefore, we have chosen the significance level $\alpha = 0.05$ so that it appropriately balances the risks based on the above considerations. The flow chart of the algorithm is shown in Figure 5, and the pseudocode for the proposed algorithm is given in Appendix A.
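The boundary-promotion step described above can be sketched in Python as follows. This is only an illustration of that one step, not the pseudocode of Appendix A: it assumes that neighbours in the negative region are ignored because they have been pruned, and it leaves out the significance test, whose test statistic is not specified in this section. The graph, labels, weights, and threshold are hypothetical.

```python
def candidate_hotspots(graph, labels, weights, threshold):
    """Merge lower-region nodes with promoted boundary nodes.

    A boundary node is promoted when its weight plus the weight of at least
    one non-negative neighbour exceeds `threshold` (assumed reading of the
    promotion rule; negative-region neighbours are treated as pruned).
    """
    lower = {node for node, label in labels.items() if label == "lower"}
    promoted = set()
    for node, label in labels.items():
        if label != "boundary":
            continue
        for neighbour in graph[node]:
            if labels[neighbour] == "negative":
                continue  # pruned polygons take no part in the analysis
            if weights[node] + weights[neighbour] > threshold:
                promoted.add(node)
                break
    return lower | promoted


# Hypothetical neighbour graph, rough-set labels and polygon weights.
toy_graph = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P2", "P3"],
}
toy_labels = {"P1": "lower", "P2": "boundary", "P3": "negative", "P4": "boundary"}
toy_weights = {"P1": 9, "P2": 6, "P3": 1, "P4": 5}

candidates = candidate_hotspots(toy_graph, toy_labels, toy_weights, threshold=12)
print(sorted(candidates))
```

In this toy graph P2 is promoted because its combined weight with P1 is 15, whereas P4 is not, since its only non-negative neighbour gives a combined weight of 11.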
5. Experimental Evaluation
5.1. Study Area and Socio-Economic Data
NASA’s Socioeconomic Data and Applications Center (SEDAC) publishes geospatial socio-economic data for India at the village level [44]. This dataset was created by digitising the Survey of India’s village-level physical maps from 1991, and it also uses tabular data from 1991. The dataset for each state is provided as a shapefile. In this study, village-level data of Uttar Pradesh are used. Figure 6 shows the study area. The geospatial socio-economic dataset contains over 200 variables for over 600,000 towns and villages. When village and town statistics are not accessible, sub-district (taluk) data are used. This dataset is the first publicly available data at the village level.
Out of all the variables, we have selected the attribute related to the type of medical facility available at the village level. The overall health facility at the village level is calculated by adding the frequency of various health facilities and termed the “Medical Facility”. The identification of hotspots is based on this aggregated value. The type and the abbreviations for the health facilities are given in
Table 7. The table contains three columns, namely attributes, abbreviations, and number. The attributes contain the type of health facility being considered, abbreviations contain the abbreviations used for the particular attributes, and the number includes the occurrences of the attribute for that specific village.
Figure 7 shows the distribution of the polygons (villages) in the study region
S. From the figure, we can see the clustering of the phenomenon at some locations. This is due to the presence of spatial autocorrelation.
The density plot of the number of the medical facility is given in
Figure 8. From the density plot of the dataset, we see that the maximum density lies around 5. The range of the medical facility is 0 to 165. Due to the variation in the density value, there will be a chance of the presence of the hotspot.
5.2. Evaluation Metrics
Various performance metrics are available in the literature to assess the performance of the hotspot detection method. J-Y Wuu, et al. [
46] define prediction accuracy and memorising accuracy as performance metrics. M. B. Ulak et al. [
47] define the crash prediction accuracy index as the performance metrics for evaluating the hotspot detection methods. In this paper, socio-economic data granulated at the village level are used to measure the performance of the algorithm. Three evaluation metrics are used, i.e., the density of the hotspot, the hotspot prediction accuracy index, and the running time of the algorithm.
5.2.1. Density of the Hotspot
The density of the hotspot is defined as the average event density of all the hotspots. Let the dataset $D$ contain $n$ polygons (villages) $P_1, P_2, \ldots, P_n$, let each village $P_i$ contain $e_i$ events, and let the area of each hotspot be $a_i$. If $k$ of the $n$ villages are identified as hotspots, then

$$ \text{Density} = \frac{1}{k} \sum_{i=1}^{k} \frac{e_i}{a_i} $$
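A small Python sketch of this metric follows. It assumes that the average is taken over the per-hotspot densities (events divided by area for each hotspot village) rather than as total events over total area; the village names, event counts, and areas are hypothetical.

```python
def hotspot_density(events, areas, hotspots):
    """Average of events/area over the polygons identified as hotspots."""
    if not hotspots:
        raise ValueError("no hotspots were identified")
    return sum(events[h] / areas[h] for h in hotspots) / len(hotspots)


# Hypothetical villages: event counts and areas in arbitrary units.
toy_events = {"V1": 10, "V2": 6, "V3": 1}
toy_areas = {"V1": 2.0, "V2": 3.0, "V3": 4.0}

density = hotspot_density(toy_events, toy_areas, ["V1", "V2"])
print(density)
```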
5.2.2. Hotspot Prediction Accuracy Index
The hotspot prediction accuracy index (HPAI) was developed by Chainey, Thompson, and others to evaluate crime hotspots [48]. It is the ratio of the share of crime incidents that fall in the hotspot to the share of the study region’s area that the hotspot occupies. The index evaluates how effectively a method predicts hotspots and is therefore essential for judging the viability of any hotspot prediction method. The original intent of the related crash prediction accuracy index was to measure the accuracy of predicting road crashes. The index is modified here so that it can be applied to the hotspots of a polygon dataset. It is given as follows:

$$ HPAI = \frac{n / N}{a / A} $$

where $n$ is the number of events in the hotspot, $N$ is the total number of events, $a$ is the area of the hotspot, and $A$ is the total area of the study region.
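The index is a ratio of two shares and can be computed in a few lines of Python. The function name and the numbers below are hypothetical and serve only to show the arithmetic.

```python
def hpai(events_in_hotspot, total_events, hotspot_area, total_area):
    """Share of events captured by the hotspot divided by its share of area."""
    if total_events <= 0 or hotspot_area <= 0 or total_area <= 0:
        raise ValueError("totals and areas must be positive")
    return (events_in_hotspot / total_events) / (hotspot_area / total_area)


# Hypothetical case: 40% of the events fall in 10% of the study area.
index = hpai(events_in_hotspot=40, total_events=100, hotspot_area=10, total_area=100)
print(index)
```

A value above 1 means that the hotspot captures a larger share of the events than its share of the area, so higher values indicate a more sharply focused hotspot.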
5.2.3. Run Time
The running time of the GBHSDRS algorithm is defined as the time taken by the algorithm to complete the execution of the program. It depends on many factors, such as the size of the data, the number of lines of code, and the configuration of the hardware. It is calculated as:

$$ RT = CT - ST $$

where RT is the run time of the algorithm, ST is the start time, and CT is the completion time of the algorithm.
5.3. Results
This section presents the outcome of the GBHSDRS algorithm on the socio-economic data described above. First, the hotspots from all the methods are presented for a pictorial depiction, and then the evaluation parameters are presented and compared with the methods found in the literature.
5.3.1. Result of GBHSDRS
The dataset described in
Section 5.1 contains 123,471 villages, i.e., polygons. The variable of the dataset containing the healthcare facility is used for hotspot detection. After completing step 1 of the algorithm, each polygon’s spatial neighbour is found and represented by
. Next, the value of
is calculated. An example is given in
Table 4. Then, the rough set is applied to find the rough approximation of the polygons, and each polygon is labelled as belonging to the lower region, boundary region, or negative region. Next, a graph is constructed from the dataset by choosing a random point from each subregion and connecting spatial neighbours. A graph is a non-linear data structure, and a breadth-first search traversal is used to search the spatial neighbours, which reduces the algorithm’s running time compared with a linear data structure. The nodes of the graph are labelled based on the outcome of the rough set. After this, the boundary value analysis is performed. Finally, statistical significance is assessed for the candidate hotspots. Only a candidate hotspot that satisfies the predefined threshold,
p-value < 0.05, is considered a hotspot.
The hotspot of the medical facility provided at the village level is given in
Figure 9. Each red spot on the map shows a hotspot village represented by a polygon. Out of 123,471 villages (polygons), only 2335 villages are found to be significant hotspots, as shown in the figure. The hotspot villages are the villages that have better medical facilities. We can also see that hotspots are also clustered in some regions. This is due to the spatial autocorrelation in the dataset. The clustering of the hotspot is circled and shown in
Figure 9. The hotspot villages are distributed over the entire study region. From the plot, we can see that the hotspots are mostly in the west UP, east UP, and south UP, and some smaller clustering of the hotspots are also found in the north region of the UP.
The hotspots generated by the four literature methods are given in
Figure 10,
Figure 11,
Figure 12 and
Figure 13.
Figure 10 shows the hotspot generated by the Moran’s I method. All the identified hotspots are shown in red on the map. In the plot in
Figure 10, we can see that the hotspots are scattered over the study region and clustered in some areas, such as west UP, east UP, and the south region, while the central region has fewer hotspot villages. Compared with the proposed algorithm, Moran’s I detects very few hotspots in the central region of the study area. Other areas show a clustering of hotspots similar to that of GBHSDRS.
The hotspot detected by the Getis Ord Gi algorithm is given in
Figure 11. From this figure, it can be seen that the most detected hotspot is scattered over the study region. Most hotspots are clustered in west UP and east UP, with some clustering in the north and south regions. Few hotspots are seen in the middle region of the study area.
The hotspot detected by the Getis Ord $G_i^*$ algorithm is given in
Figure 12. From this figure, it can be seen that the detected hotspot is scattered over the entire study region. Most hotspots are clustered in west UP and east UP, with some clustering in the north and south regions. Several hotspots are seen in the middle region of the study area.
The hotspots detected by DBSCAN on the graph of the spatial data are shown in
Figure 13. From this figure, it can be seen that the detected hotspots are scattered over the entire study region. The detected hotspots are clustered into areas, but a clear separation is not seen. All the regions of the study area have some hotspots. After applying the DBSCAN to the graph of the spatial polygon data, we obtained different dense clusters. We have plotted all the clusters in red.
The density estimation plot of the hotspot identified by the particular algorithm is given in
Figure 14. The density plot shows that the peak value lies around 15, and the maximum value of the medical facility is around 165. The peak density increases significantly from before to after hotspot detection.
5.3.2. Evaluation Metrics
The GBHSDRS algorithm finds the hotspots of the given socio-economic dataset. The performance of the proposed algorithm is compared with four methods from the literature: Local Moran’s I, Getis Ord Gi, Getis Ord Gi*, and DBSCAN applied to the graph of the polygon vector data. These algorithms are compared based on the three parameters described in
Section 5.2. These metrics are the density of the hotspot, the hotspot prediction accuracy index (HPAI), and the running time of the algorithms. The density results for the algorithms are given in
Figure 15. All the experiments are run ten times, and the average value of the results is presented. The results of all the methods compared with the proposed method are given in
Table 8. The density of the hotspots generated by the DBSCAN method is lower than that of all other methods. In contrast, the densities of the hotspots generated by the Getis Ord Gi and Getis Ord Gi* methods are nearly the same as each other and higher than that of Moran’s I, but lower than that of the GBHSDRS algorithm. The density of the hotspots generated by the GBHSDRS algorithm is the highest among all the algorithms. The analysis of the density value is given in
Table 9. The percentage increases in the density of the hotspots generated by the GBHSDRS algorithm with respect to Moran’s I, Getis Ord Gi, Getis Ord Gi*, and DBSCAN are 36.14%, 9.94%, 10.42%, and 49.67%, respectively.
The result of the evaluation metrics of HPAI is shown in
Figure 16. The results show that HPAI is lowest for Moran’s I. Getis Ord Gi and Getis Ord Gi* have nearly the same HPAI values, while the HPAI value for DBSCAN lies between those of Moran’s I and Getis Ord Gi*. The HPAI value is the highest for the proposed GBHSDRS algorithm. The analysis of the HPAI value is given in
Table 10. The analysis shows that the HPAI of the GBHSDRS algorithm is higher than that of Moran’s I, Getis Ord Gi, Getis Ord Gi*, and DBSCAN by 40.39%, 9.15%, 14.60%, and 29.50%, respectively.
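The paper describes HPAI as the ratio of the hit rate to the area percentage. A minimal sketch of that ratio is given below. The function name `hpai`, its parameters, and the toy numbers are hypothetical, and the sketch assumes that the hit rate is the share of all events falling inside the detected hotspots and that the area percentage is the share of the study area they cover; the definition in Section 5.2 takes precedence if it differs.

```python
def hpai(events_in_hotspots, total_events, hotspot_area, total_area):
    """Ratio of hit rate to area percentage (assumed reading of HPAI)."""
    if total_events <= 0 or total_area <= 0 or hotspot_area <= 0:
        raise ValueError("totals and hotspot area must be positive")
    hit_rate = events_in_hotspots / total_events   # share of events captured
    area_share = hotspot_area / total_area         # share of area flagged
    return hit_rate / area_share


# Toy example: half of all events fall in a quarter of the study area.
print(hpai(events_in_hotspots=50, total_events=100,
           hotspot_area=25, total_area=100))
```

Under this reading, a value above 1 means the hotspots capture a larger share of events than the share of area they occupy.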
The results of the algorithms for the evaluation metric running time are given in
Figure 17. The running times of Getis Ord Gi* and DBSCAN are nearly the same. Moran’s I has the longest running time of all the methods, and Getis Ord Gi has a longer running time than Getis Ord Gi*. The running time of the GBHSDRS algorithm is the shortest among all the algorithms shown in the figure. The analysis of the running time of the algorithms is given in
Table 11. The percentage decrease in the running time of the GBHSDRS algorithm compared to Moran’s I, Getis Ord Gi, Getis Ord Gi*, and DBSCAN is 63.78%, 37.85%, 2.46%, and 6.83%, respectively.
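The percentage decreases above can be checked with a short calculation. The sketch below uses the average running times reported in this paper for the five methods (91.04, 53.05, 33.80, 35.39, and 32.97) and assumes that the decrease is measured relative to each baseline, i.e., (baseline − GBHSDRS) / baseline. This formula is inferred from the published numbers rather than stated in the text, and the names `baseline_times` and `percent_decrease` are hypothetical.

```python
# Average running times as reported in the paper (same unit for all methods).
baseline_times = {
    "Moran's I": 91.04,
    "Getis Ord Gi": 53.05,
    "Getis Ord Gi*": 33.80,
    "DBSCAN": 35.39,
}
gbhsdrs_time = 32.97


def percent_decrease(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


decreases = {
    name: percent_decrease(t, gbhsdrs_time) for name, t in baseline_times.items()
}
for name, value in decreases.items():
    print(f"{name}: {value:.2f}%")
```

Under this assumption, the computed values agree with the reported 63.78%, 37.85%, 2.46%, and 6.83% to within about 0.01 percentage points; the small differences are presumably due to rounding of the reported running times.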
In the proposed technique, we have used only spatial adjacency, as specified by Equation (2). The effectiveness of the proposed method with other spatial neighbourhood definitions, such as distance-based neighbours, needs to be checked. This approach is specifically applicable to polygon data; its effectiveness on other vector data formats, such as point and line data, remains to be verified.
6. Discussion of the Results
The proposed GBHSDRS method for hotspot detection is tested on the socioeconomic data of Uttar Pradesh. The results of the algorithm are compared with the four state-of-the-art methods, namely Moran’s I [
5], Getis Ord Gi [
40], Getis Ord Gi* [
32], and the DBSCAN [
35], applied to the graph of the polygon vector data. The output of all the methods is plotted. The maps are plotted with hotspots in red. Three evaluation parameters are used to measure the performance of the algorithm, namely the density of the hotspot, HPAI, and the running time of the algorithms. The density of the proposed algorithm is high because we chose a high threshold
for the target set X. If the value of this threshold is low, then the density will be low, and consequently, more hotspots will be generated.
The running time of the proposed algorithm is less than that of the other methods because the proposed algorithm uses a rough set that divides the space into three regions: the lower region, the boundary region, and the negative region. The lower region contains the polygons that surely belong to the target set, i.e., the hotspots. The boundary region is the set of polygons that may or may not belong to the target set. The negative region is the set of polygons that surely do not belong to the target set. Therefore, the polygons belonging to the negative region are removed from further processing, and the data are pruned before hotspot detection. As a result of this pruning, the algorithm’s running time is reduced. The HPAI of the proposed algorithm is also high because the HPAI value is the ratio of the hit rate to the area percentage. When the density of the detected hotspots is high, the number of activities within the hotspot area is high. This algorithm detected high-density hotspots due to the high threshold. Therefore, the HPAI value is high.
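The three-region partition and the pruning step can be illustrated with classical (Pawlak) rough set approximations. The sketch below is an assumption-laden illustration rather than the GBHSDRS implementation: the equivalence classes `classes`, the attribute values, the value 5 used as a threshold, and the function name `rough_regions` are all hypothetical.

```python
def rough_regions(classes, target):
    """Split polygons into lower, boundary and negative regions.

    `classes` is an iterable of equivalence classes (sets of polygon ids);
    `target` is the target set X. A class entirely inside X goes to the
    lower region, a class partly inside X to the boundary region, and a
    class disjoint from X to the negative region.
    """
    lower, boundary, negative = set(), set(), set()
    for members in classes:
        members = set(members)
        if members <= target:
            lower |= members
        elif members & target:
            boundary |= members
        else:
            negative |= members
    return lower, boundary, negative


# Hypothetical attribute values (e.g. medical facilities per village).
values = {"p1": 9, "p2": 8, "p3": 7, "p4": 2, "p5": 1, "p6": 0}
threshold = 5  # assumed threshold defining the target set X
target = {p for p, v in values.items() if v >= threshold}

# Hypothetical equivalence classes, e.g. polygons grouped by subregion.
classes = [{"p1", "p2"}, {"p3", "p4"}, {"p5", "p6"}]

lower, boundary, negative = rough_regions(classes, target)
candidates = lower | boundary  # the negative region is pruned
print(sorted(lower), sorted(boundary), sorted(negative))
```

In this toy case, only four of the six polygons remain after pruning, which illustrates why discarding the negative region shortens later processing.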
The proposed method is able to find the hotspot of the polygon data. Some of the merits of the above method are that this method does not require any underlying data distribution, and it can find the hotspot for the polygon data with significance, so this method removes the false hotspots. Some disadvantages of the proposed method are the graph construction and the implementation of the rough set. The boundary value analysis of the boundary region is a major concern.
7. Conclusion and Future Research Directions
The proposed GBHSDRS algorithm is developed to find hotspots in polygon vector data. The main contributions of the algorithm are a lower running time, denser hotspots, and a high HPAI. The algorithm is able to find a hotspot if one exists in the data and also eliminates false hotspots. A neighbour relation is defined for the geospatial polygon data and is used to construct the neighbour graph. The neighbour graph is a non-linear data structure that allows easy access to a node for analysis. The rough set approach is applied to deal with uncertainties and partition the data into lower, upper, boundary, and negative regions. Processing is needed only for the boundary-region nodes, and the traversal uses breadth-first search. Therefore, the complexity is reduced. The wrapper model of clipping the negative region makes the algorithm efficient through data reduction. Finally, the significance of each candidate hotspot was tested, and only the significant hotspots were selected. The algorithm is tested on the socio-economic dataset of UP, India. The experiments performed on village-level data of medical facilities in Uttar Pradesh in 1991 showed that, of a total of 123,471 villages, only 2335 villages were identified as hotspots.
Three evaluation metrics, namely density, HPAI, and running time, are used to compare the performance of the algorithm with four state-of-the-art methods. The density values computed by all the methods, i.e., Moran’s I, Getis Ord Gi, Getis Ord Gi*, DBSCAN, and GBHSDRS, are 43.37, 61.16, 60.48, 34.18, and 67.91, respectively, and the density gains of GBHSDRS with respect to the state-of-the-art methods are 36.14%, 9.94%, 10.42%, and 49.67%, respectively. The average density gain over all the state-of-the-art methods is 26.54%. The HPAI values of all the methods, i.e., Moran’s I, Getis Ord Gi, Getis Ord Gi*, DBSCAN, and GBHSDRS, are 1.17, 2.65, 2.49, 2.06, and 2.92, respectively. The HPAI gains of GBHSDRS with respect to the state-of-the-art methods, i.e., Moran’s I, Getis Ord Gi, Getis Ord Gi*, and DBSCAN, are 40.39%, 9.15%, 14.60%, and 29.50%, respectively. The average HPAI gain over all the methods is 23.41%. The running times of the methods, i.e., Moran’s I, Getis Ord Gi, Getis Ord Gi*, DBSCAN, and GBHSDRS, are 91.04, 53.05, 33.80, 35.39, and 32.97, respectively. The percentage decreases in the running time of the GBHSDRS algorithm with respect to the state-of-the-art methods, i.e., Moran’s I, Getis Ord Gi, Getis Ord Gi*, and DBSCAN, are 63.78%, 37.85%, 2.46%, and 6.83%, respectively. The average percentage decrease over all the methods is 27.73%. Overall, the proposed method outperforms the state-of-the-art methods on all three metrics.
This work can be utilised by policymakers, healthcare administrators, and stakeholders to facilitate equitable access to healthcare at the village level. This research is beneficial in efficiently allocating resources by targeting hotspots, reducing health disparities, improving health outcomes across diverse populations, and implementing preventive measures and awareness campaigns.
More rough set concepts could be applied to spatial analytics in the future. The technique can also be evaluated in other types of spaces, such as Euclidean space, and on different datasets, such as fire and epidemiological data. Other notions of the neighbourhood need to be defined and checked with the algorithm. Furthermore, as the algorithm has been tested only on polygon vector data, it needs to be tested on other spatial vector data, such as point and line data.