## **1. Introduction**

**Reference Context.** Cities worldwide are experiencing significant evolution due to numerous factors, e.g., new forms of communication, new ways of transportation, and fast urbanization. The pervasive and large-scale diffusion of sensing networks, image-scanning devices, and GPS devices is enabling the collection of huge volumes of geo-referenced urban data every day. As more and more data become available, data scientists can analyze such an abundance of urban spatial data to discover predictive and descriptive data-driven models, which can assist city managers in dealing with the major problems that cities face, e.g., human mobility, traffic flows, air pollution, crime forecasts, and virus diffusion [1–9]. In particular, detecting city hotspots is emerging as a frequent task when analyzing urban data. In fact, given the availability of geo-referenced data, it is useful to detect areas where urban events (e.g., crimes, traffic spikes, viral infections, and pollution peaks) occur with a higher density than in other regions of the dataset. Additionally, hotspot detection can serve as a useful organizational technique for building thorough knowledge of an urban area, and the borders and shapes of hotspots can enable high-level spatial knowledge summaries, which are valuable for policymakers, scientists, and planners [5,10,11]. For instance, environmental scientists are interested in partitioning a city into uniform regions based on environmental characteristics and pollution density [3,12]. Similarly, during viral emergencies, as recently happened with the COVID-19 pandemic, virologists and epidemiologists are interested in detecting city hotspots in which viruses are spreading with higher densities than in other areas of the same city [6,7,13]. Moreover, city administrators can be interested in determining uniform regions of a city with respect to the functions they serve for citizens or visitors. Likewise, police authorities are interested in detecting crime hotspots (i.e., areas with a high crime density) to better ensure public safety across the city [4,5]. Regarding data analysis, the search for intra-hotspot and inter-hotspot models is a hot topic for scientists. For instance, intra-hotspot models can reveal the changes in density within a hotspot over time, and inter-hotspot models can study how the appearance of a given hotspot can affect the generation of other hotspots in a different area [14].

**Citation:** Cesario, E.; Lindia, P.; Vinci, A. Detecting Multi-Density Urban Hotspots in a Smart City: Approaches, Challenges and Applications. *Big Data Cogn. Comput.* **2023**, *7*, 29. https://doi.org/10.3390/bdcc7010029

Academic Editor: Carson K. Leung

Received: 28 December 2022; Revised: 22 January 2023; Accepted: 3 February 2023; Published: 8 February 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Motivations.** In metropolitan cities, the density of events, traffic, or population can differ widely between areas, making urban regions highly dissimilar in terms of density. This is evident in Figure 1, which shows how population densities differ both between cities and within the same city. Specifically, Figure 1a plots the population density of the 200 densest square-kilometer grid cells in six representative cities [15], while Figure 1b shows the coefficient of variation of the population density for several countries. Focusing on the first chart (https://garrettdashnelson.github.io/square-density/, accessed on 18 December 2022), we can observe that densities vary widely both within the same city and between cities. For instance, New York City represents a classic case of multi-density regions: there are several high-density areas (Manhattan) and many other low-density areas (Queens). Chicago shows a similarly top-heavy density pyramid, where the high-density areas (Loop and Near North Side) stand out from the rest of the region [15]. Other cities, such as Boston, San Francisco, and Los Angeles, show similar multi-density distributions, with a high variation of densities among different city regions. As a second observation, densities also vary widely between cities. For example, the lowest-density areas of New York City shown in the chart are even denser than the densest parts of Dallas or Boston, and even the densest areas of Chicago and Los Angeles barely crack into the bottom half of New York City's top 200. Figure 1b, in turn, shows the average, minimum, and maximum values (and the names of the corresponding cities) of the coefficient of variation of the population density for several countries [16].
The coefficient of variation displayed in Figure 1b is defined as the relative standard deviation of urban population density, i.e., *CV* = *SD*/*PD*, where, for a given city, *SD* is the standard deviation of the population density within the city and *PD* is the average population density of the same area. Thus, the coefficient of variation is a unit-free measure of how much population density varies within a city: the higher the coefficient, the higher the dispersion of population density. The chart confirms a very high variability of densities within the same country and between countries. For example, in Mexico, the coefficient of variation ranges from 1.05 in Mexico City to 14.03 in Navajoa, showing an extremely high dispersion in population density. A similar observation can be made for Korea, the U.S., Canada, and the other listed countries. This aspect must be taken into consideration to properly infer the real hotspots when analyzing urban data. It is worth noting that, in our experience, given an urban area and a set of events (related to, for example, crimes, COVID infections, and mobility), high-density variations can be observed in the collected data.
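
To make the definition *CV* = *SD*/*PD* concrete, the following minimal Python sketch computes the coefficient of variation from a list of per-cell population densities. The function name, the use of the population (rather than sample) standard deviation, and the density values are assumptions made for illustration only; they are not taken from the data behind Figure 1b.

```python
from statistics import mean, pstdev


def coefficient_of_variation(densities):
    """Return CV = SD / PD for a list of per-cell population densities."""
    pd_mean = mean(densities)   # PD: average population density
    sd = pstdev(densities)      # SD: population standard deviation (assumed)
    if pd_mean == 0:
        raise ValueError("average density is zero; CV is undefined")
    return sd / pd_mean


# Hypothetical densities (inhabitants per km^2) for two made-up cities.
uniform_city = [1000, 1100, 900, 1000]
multi_density_city = [100, 200, 5000, 20000]

cv_uniform = coefficient_of_variation(uniform_city)
cv_multi = coefficient_of_variation(multi_density_city)

print(f"uniform city CV:       {cv_uniform:.3f}")
print(f"multi-density city CV: {cv_multi:.3f}")
```

In this toy example the nearly uniform city yields a CV well below 1, while the city mixing very sparse and very dense cells yields a CV above 1, which is the kind of dispersion the chart reports.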

**Figure 1.** Intra-city and inter-city population densities in metropolitan urban areas. (**a**) Population densities of the densest 200 cells for a given set of cities. Each cell has a 1 km<sup>2</sup> area [17]. (**b**) Coefficient of variation of population density across urban areas and countries (2014) [16]. For each country, the gray dot is the average computed on the coefficient of variation of each city of the country. The figure also displays, for each country, the minimum and the maximum coefficients of variation, and the cities where they occur.

Clustering is the most appropriate technique for discovering urban hotspots, and clustering algorithms can be split into two groups. The first group includes algorithms that, due to the adoption of global parameters, define a single minimum threshold value to distinguish between dense and non-dense areas. Setting a proper threshold becomes especially difficult when clusters in different regions of the feature space have considerably different densities, or when clusters with different density levels are nested. In such cases, a single density threshold might not yield a proper partitioning. In fact, if the chosen threshold is too high, these algorithms can discover several small, non-significant clusters that do not actually represent dense regions; conversely, if the chosen threshold is too low, they can discover a few large regions that are no longer actually dense. As a result, applying such algorithms to a multi-density dataset, such as urban data, may not achieve good results. The second group includes algorithms that rely on multiple minimum threshold values. Such algorithms generally detect multiple pattern distributions of different densities, aiming to distinguish between several density regions, which may or may not be nested and are generally of non-convex shape. They automatically estimate the number of threshold values needed to identify the different density regions, without any prior knowledge about the data. Such algorithms usually produce better data partitioning than single-threshold algorithms, but their drawback is a very high computational cost.
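
The limitation of a single global threshold can be illustrated with a toy example. The sketch below is a minimal, pure-Python DBSCAN-style procedure for one-dimensional points, written only for illustration; it is not the implementation evaluated in this paper. The dataset, the function name `dbscan_1d`, and the parameter values are all hypothetical. The data contain two dense groups separated by a small gap and one sparse group.

```python
def dbscan_1d(points, eps, min_pts):
    """Minimal DBSCAN-style clustering of 1-D points.

    Returns one label per point; -1 marks noise.
    """
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:    # only core points expand the cluster
                queue.extend(nb)
    return labels


# Hypothetical multi-density data: two dense groups (spacing 0.1, gap 0.6)
# and one sparse group (spacing 1.0).
dense_a = [i / 10 for i in range(10)]          # 0.0 .. 0.9
dense_b = [(15 + i) / 10 for i in range(10)]   # 1.5 .. 2.4
sparse_c = [float(10 + i) for i in range(10)]  # 10.0 .. 19.0
points = dense_a + dense_b + sparse_c

strict = dbscan_1d(points, eps=0.15, min_pts=3)  # demanding density threshold
loose = dbscan_1d(points, eps=1.25, min_pts=3)   # permissive density threshold

print("strict:", strict)
print("loose: ", loose)
```

With the demanding setting, the two dense groups are separated but every sparse point is labeled as noise; with the permissive setting, the sparse group is found but the two dense groups merge into one cluster. In this toy dataset the gap between the dense groups (0.6) is smaller than the spacing inside the sparse group (1.0), so a radius large enough to link the sparse points also bridges the dense groups.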

**Contributions and plan of the paper.** Given this context, this paper presents a study on hotspot detection in urban environments. As its main contribution, the study compares the most important approaches proposed in the literature for clustering urban data and analyzes their results on two synthetic datasets and a real-world one, with two different goals in mind. The experimental evaluation on synthetic state-of-the-art multi-density datasets is performed to evaluate the clustering quality and the ability of the algorithms to retrieve proper hotspots. To do that, we exploit two synthetic datasets in which each point has a target cluster label, so that the algorithms can be evaluated qualitatively and quantitatively by taking advantage of such ground truth information. The experimental evaluation on real-world data is performed on crime data from the Chicago Police Department, inherently characterized by points distributed with very different densities in the city area. Such a concrete scenario is exploited to show the practical usefulness of density-based clustering algorithms in discovering multi-density urban hotspots in real urban cases.
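
As an illustration of how ground truth labels allow a quantitative evaluation, the sketch below computes the Rand index, i.e., the fraction of point pairs on which a clustering and the ground truth agree. The choice of this metric here is an assumption made for illustration, as are the function name and the toy labels; noise labels, if present, are simply treated as one more label value.

```python
from itertools import combinations


def rand_index(true_labels, pred_labels):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees if both labelings put the two points in the same
    cluster, or both put them in different clusters.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(true_labels)), 2))
    if not pairs:
        raise ValueError("at least two points are required")
    agree = 0
    for i, j in pairs:
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)


# Hypothetical ground truth with two hotspots of three points each.
truth = [0, 0, 0, 1, 1, 1]
perfect = [5, 5, 5, 9, 9, 9]   # same partition, different label names
merged = [0, 0, 0, 0, 0, 0]    # both hotspots merged into one cluster

score_perfect = rand_index(truth, perfect)
score_merged = rand_index(truth, merged)

print(score_perfect, score_merged)
```

The score depends only on the induced partition, not on the label names, so a clustering that recovers both hotspots scores 1.0 regardless of how its clusters are numbered, while merging the two hotspots lowers the score.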

The remainder of the paper is structured as follows. Section 2 briefly describes the most important density-based approaches in the spatial clustering literature and the most representative projects in that field of research. Section 3 presents a selection of the main density-based clustering algorithms exploited in the literature to analyze urban data, summarizing how they work. Section 4 provides the comparative experimental evaluation of the different approaches on state-of-the-art datasets. Section 5 shows the results of the algorithms in a real-world scenario. Finally, Section 6 concludes the paper and outlines future research.

## **2. Related Works**

The analysis of urban data and the detection of urban hotspots from geo-referenced data are very challenging tasks. For this purpose, several approaches have been proposed in the literature, tackling the problem by adopting clustering approaches. In some cases, the discovery of urban hotspots represents one step of a more complex workflow, based on a common inspiring idea of several approaches that first detect geographic hotspots and then extract predictive models of intra-hotspots and/or inter-hotspots. In this section, we briefly review the most representative research work in the area.

The DSPM (density-based sequential pattern mining) approach, aimed at the discovery of mobility patterns from GPS data, is proposed in [2]. The method consists of (i) discovering dense urban regions of interest (those more densely passed through) and (ii) extracting mobility patterns among those regions. As a case study, the approach is applied to a real-life GPS dataset tracing the movement of taxis in the urban area of Beijing. Additionally, the authors describe a comprehensive validation methodology for assessing the accuracy and quality of detected dense regions and trajectory patterns. The approach relies on the DBSCAN algorithm for detecting dense regions and could be improved by adopting multi-density clustering analysis, which would also detect lower-density but homogeneous regions.

An approach to predict ozone concentrations at given target observation stations, based on spatial clustering and multilayer perceptron models, is proposed in [18]. In particular, the approach exploits k-means clustering to detect similar stations and then trains them together to obtain a base model for spatial transfer learning. The final models are used to predict the ozone concentration for three-day-ahead prediction horizons. The experimental evaluation, performed using historical data of stations in Germany, has shown higher forecasting accuracy of ozone exceedances with respect to traditional chemical transport models and popular machine learning approaches. Since the work groups sensor stations that are spread over a large area, it could benefit from exploiting multi-density clustering algorithms instead of k-means. Additionally, in a recent paper [19], the application of artificial intelligence (AI) and machine learning (ML) to build air pollution models, aimed at forecasting pollutant concentrations and health risks, is analyzed. The paper describes how air pollution data can be supplied to AI-ML models to discover the correlation between exposure to pollution and public health risks, giving a survey of applications and challenges of such a research field. In particular, it is pointed out that explainability is one of the paramount requirements in choosing AI-ML models for analyzing pollution data.

In [20], an approach is proposed to predict high-resolution electricity consumption trends at finely resolved spatial and temporal scales. The approach is composed of two steps. First, apartment-level historical electricity consumption data are collected and clustered. Second, the clusters are aggregated based on the consumption profiles of consumers. The clustering analysis is performed by the k-means algorithm, while forecasting models are discovered by two deep learning techniques: long short-term memory (LSTM) and gated recurrent unit (GRU). The experimental evaluation was performed on electricity consumption data collected from residential buildings situated in an urban area of South Korea. In particular, a comparative analysis with state-of-the-art machine learning models and deep learning variants showed good performance in terms of building- and floor-level prediction accuracy. The clustering of the consumption profiles of the consumers does not take into account features related to the location of apartments, buildings, and floors. Multi-density hotspot detection could benefit the analysis, as it could group together buildings in the same city area, possibly constructed in the same years and having similar characteristics.

In [21], the authors designed a workflow composed of five steps, i.e., data preprocessing, feature extraction, machine learning training, performance evaluation, and explainable artificial intelligence, to analyze the effects of changes in land cover, such as deforestation or urbanization, on the local climate. In particular, machine learning models have been trained to learn the relation between land cover changes and temperature changes. Then, explainable artificial intelligence has been further exploited to interpret and analyze the impact of different land cover changes on temperature. Additionally, the experimental results have shown that random forest outperformed other machine learning methods (e.g., linear regression) proposed in the literature for discovering the relation of land cover–temperature changes.

A methodology for discovering behavior rules, correlations, and mobility patterns of visitors attending large-scale public events by analyzing social media posts is proposed in [22]. In particular, the authors describe a multi-step approach based on the detection of hotspots of interest (bounded areas) where the public events are held, collection of the geo-tagged items related to the events, gathering of trajectories of users publishing posts concerning such events, and discovery of touristic mobility patterns. The methodology is tested through two case studies: a mobility pattern analysis on Instagram users who visited EXPO 2015, and behavior modeling of geo-tagged tweets posted by users attending the 2014 FIFA World Cup, showing reasonable predictive accuracy.

A system for geo-localized crime data analysis, named CrimeTracer, is proposed in [23]. The approach is based on a probabilistic framework to discover spatial clusters in urban areas, and it is applied to crime event forecasting. In particular, the algorithm partitions the area of interest into activity spaces, which represent hotspots frequented by known offenders to carry out their criminal activities. On the basis of such knowledge, spatial crime predictions are performed on each activity space. Another approach for spatial data clustering is proposed in [24], which classifies locations as crime hotspots or non-hotspots by exploiting one-class support vector machines (SVM). Similarly, in [25] an approach based on recurrent neural network models is designed to analyze spatial information and classify grid cells as hotspot or non-hotspot.

An approach aimed at detecting crime hotspots in cities and forecasting crime trends in each hotspot is described in [5]. The approach leverages auto-regressive forecasting models and spatial cluster analysis to build a specific crime predictor for each hotspot detected during the spatial clustering analysis. The predictors can estimate crime trends in terms of the number of expected future crime events. The approach is assessed on real-world data, consisting of crime events collected in New York City and Chicago, and is shown to be effective in terms of forecasting accuracy over different time horizons. The works on crime analysis reviewed above [5,23–25] are not capable of considering automatically detected hotspots characterized by different densities.

A predictive approach based on spatial analysis and regressive models is proposed in [13], aiming at discovering spatio-temporal predictive epidemic patterns from infection and mobility data. The algorithm is composed of several steps, starting from the detection of epidemic hotspots (urban areas where infection events occur more densely with respect to others) and mobility hotspots (urban regions more densely visited by mobility traces), to the discovery of epidemic patterns among epidemic hotspots. The approach finally processes each epidemic hotspot and analyzes the infection data of the epidemic hotspots involved in mobility patterns, then it extracts hotspot-specific epidemic forecasting models. The approach has been validated on real-world data regarding mobility and COVID-19 infections in Chicago. The paper focuses only on high-density hotspots in the given analysis and exploits the DBSCAN algorithm for detecting epidemic hotspots. This work can also benefit from the exploitation of other multi-density-based clustering algorithms.

## **3. Algorithms to Detect Urban Hotspots**

This section briefly describes four density-based clustering methods (CHD, DBSCAN, HDBSCAN, and OPTICS-Xi) that we selected from the literature as the most widely used and interesting approaches for analyzing urban data.
