1. Introduction
Segregation, in the context of society and public spaces, refers to the separation of individuals or groups based on attributes such as race, ethnicity, gender, religion, or socioeconomic status, or on activity signals such as points of interest, shopping recommendations, and Google reviews [1,2,3,4,5,6,7]. This practice has a long history and has been a source of significant social and political contention. Segregation can manifest in various ways, including residential segregation, educational segregation, and segregation in the use of public facilities such as schools, hotels, and hospitals.
Segregation plays an important role in studying various patterns in GIS data like surveys, POI reviews, and census datasets. Geographic segregations are categorized into four types [8,9]:
Legal segregation;
Social segregation;
Gated communities;
Voluntary segregation.
This work presents a study of social segregation in Dublin based on urban facilities and activity-based segregation. The study aims to identify zones or points in the city that attract the interest of people with Indian names. Social or spatial segregation refers to mapping the patterns in society onto a map (of a city, a country, etc.) based on attributes such as activity, nationality, income, race, gender, and ethnicity. Such studies help governing agencies and city planners to plan new services based on observed patterns and behaviors. Understanding segregation patterns in the city can help policymakers develop effective policies to address residential segregation and its impact on urban life. By identifying areas of concentrated poverty or disparities in access to resources and services, policymakers can develop targeted interventions to promote more inclusive and equitable cities. Similar studies in Singapore, China, and many other countries have examined the growth, patterns, and segregation changes of small communities and immigrants, as well as the growth and utilization of social services like schools, transport, and tourism [1,2,3,4,5].
This study aims to find and study the clusters of people with Indian-origin names in Dublin by applying the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to the Google POI (Point of Interest) dataset for Dublin. The study uses the usernames of the reviewers in Google POI to infer possible nationality and gender with the machine learning-based NamSor API. This allows us to identify possible segregation based on the nationality and gender produced by NamSor.
The study is divided into four sections. Section 2 is an overview of the existing literature and of the use of POIs to study various forms of segregation. In Section 3, the proposed model is explained, including the data description and data cleaning phase. Section 4 presents the study and its results. In Section 5, the outcome of the study and future work are discussed.
3. Methodology
In this study, we analyse Google POI data to examine the probable activity-based separation of Google POI reviewers whose names, according to the NamSor tool [11], are most likely of Indian origin. The project tries to pinpoint the regional activity patterns of Dublin reviewers with probable Indian-origin names, based on their POI reviews. Based on the behaviors discovered in the Google POI dataset, the analysis also reveals the group size in each geographic area. The study extends earlier work [1], which infers the possible gender and nationality of a person from their name using the NamSor app, with a reported accuracy of 92% in predicting origin and gender by name. NamSor identifies the possible origin or nationality of a person using a pre-trained model and is subject to some error. This work therefore only claims to identify possible clusters of POIs where persons with names of Indian origin are active in reviewing on Google Places. These two additional elements, nationality and gender, open a new window to analyze the segregation in POI use in an area. Since Dublin has a large number of migrants arriving in search of jobs and education, this provides an opportunity to study the behavior of citizens and tourists based on POI data.
An updated version of the Google POI dataset was used which includes more features as compared to the original Google POI dataset. The dataset combines reviews, ratings, names, and POI details for Dublin. Urban facilities, including hospitals, supermarkets, pharmacies, tourist attractions, and many more, are considered POIs. A total of 54,856 POIs and 110,713 reviews are included in the dataset. The reviews do not include any personal information. The data used are in the public domain. This information is a comprehensive collection of preferences regarding preferred locations, likes, and dislikes. Data were collected from 15 January 2021 through 27 February 2021. The data are further enriched in the next step (see Table 1). The dataset includes 2218 reviews from users whose username is classified as having an Indian origin using the NamSor API.
Figure 1 shows the flow diagram of the proposed model and the various steps included in the proposed model.
The dataset is a collection of reviews of POIs which are highly rated and the most visited places in Dublin. Some of the places are tourist spots, churches, schools and town halls, shopping complexes, grocery stores, other shopping places, pubs, restaurants, hotels, and government offices.
Figure 2a showcases the various POIs in Dublin where people with various nationalities visit and have submitted reviews. The work aims to identify the clusters based on user activity in Dublin.
Figure 2b showcases the filtered POIs with reviewers whose username is classified as having an Indian origin using the NamSor API.
3.1. Preprocessing
In this step, the Google POI data harvested for POIs in Dublin are further preprocessed with the NamSor API [11] to infer the likely gender and nationality associated with each username. The API is used to tag each row with the inferred nationality and gender of the reviewer. This allows us to further analyze the segregation in the city based on gender and nationality.
3.2. Data Cleaning
In the data cleaning phase, the data with no information, i.e., empty cells and incomplete data like invalid names, empty ratings, empty comments, and any invalid data in the POI, are removed. The data cleaning phase removed 25,910 rows from the data, which have no nationality or gender attached to them. Of these, 19,739 empty values were from “POI comment” and 725 had an invalid “Author Name”. This phase improves the quality of the dataset. After data cleaning, the data are visualized to identify the contribution of various nationalities in the dataset. The final data include 28,937 rows with valid nationalities.
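The filtering rule described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the field names (`author_name`, `rating`, `comment`, `nationality`, `gender`) and the toy rows are hypothetical and are not taken from the actual dataset schema.

```python
def is_blank(value):
    """Return True for None or whitespace-only values."""
    return value is None or str(value).strip() == ""


def clean_reviews(rows):
    """Split rows into (kept, dropped); a row is dropped if any required field is blank."""
    # Hypothetical field names, chosen for illustration only.
    required = ("author_name", "rating", "comment", "nationality", "gender")
    kept, dropped = [], []
    for row in rows:
        if any(is_blank(row.get(field)) for field in required):
            dropped.append(row)
        else:
            kept.append(row)
    return kept, dropped


# Made-up example rows, not real review data.
rows = [
    {"author_name": "A. Example", "rating": 5, "comment": "Great place",
     "nationality": "IN", "gender": "female"},
    {"author_name": "B. Example", "rating": 4, "comment": "",
     "nationality": "IE", "gender": "male"},          # empty comment
    {"author_name": "", "rating": 3, "comment": "Fine",
     "nationality": None, "gender": None},            # invalid name, no tags
    {"author_name": "C. Example", "rating": 2, "comment": "Crowded",
     "nationality": "PL", "gender": "male"},
]

kept, dropped = clean_reviews(rows)
print(f"kept {len(kept)} rows, dropped {len(dropped)} rows")
```

Returning the dropped rows alongside the kept ones makes it easy to report how many rows each rule removed, as the counts in this section do.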
Figure 3 showcases the contribution of various nationalities in the final data after deleting the invalid data and the data from users with Irish usernames (according to NamSor). Reviews attributed to 171 countries are identified in the data; Figure 3 shows the top 15 nationalities of usernames in the dataset.
3.3. Data Processing and Classification Model
In this section, a clustering model based on HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [12] is proposed. The model is applied to Google POI data for Dublin, Ireland. The clustering is performed on the geographical location of the POIs and aims to identify the segregation of reviewers whose username is classified as having an Indian origin using the NamSor API, i.e., activity-based segregation. HDBSCAN is an extended version of the DBSCAN and OPTICS models. The existing DBSCAN model cannot identify clusters of variable density in the dataset, whereas HDBSCAN can. This permits us to identify multiple classes and clusters with different densities in the dataset.
The proposed model groups points that lie within the Epsilon (EPS) distance [12] of one another, and a group is accepted as a cluster only if it contains a minimum number of points. The points that cannot be added to any cluster are referred to as outliers and are assigned the cluster ID “−1”. The model therefore takes two parameters into consideration: the minimum cluster membership (min_cluster_size) and the neighborhood distance (EPS).
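To make the role of the two parameters and of the “−1” label concrete, the following is a minimal sketch of classic DBSCAN in plain Python. It is not HDBSCAN and not the implementation used in this study: HDBSCAN additionally builds a hierarchy over density levels, which is omitted here. The coordinates are made-up, roughly Dublin-like values, and the function names are hypothetical.

```python
import math

NOISE = -1  # label used for outliers


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Made-up (latitude, longitude) pairs: two tight groups and one isolated point.
points = [
    (53.3498, -6.2603), (53.3501, -6.2610), (53.3495, -6.2598), (53.3503, -6.2600),
    (53.3067, -6.2210), (53.3070, -6.2215), (53.3064, -6.2205),
    (53.4000, -6.4000),
]
labels = dbscan(points, eps=0.009, min_pts=3)
print(labels)  # → [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point has too few neighbours within EPS to seed or join a cluster, so it keeps the label −1.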
4. Result and Analysis
This section presents the results of applying the methodology to the prepared dataset of Google POIs for Dublin. The overall dataset included 171 nationalities, and this work aims to identify the activity-based segregation in Dublin of people whose username is classified as being of Indian origin. The dataset has 625 reviews where NamSor has indicated the username to be of Indian origin.
Figure 2 shows the spread of these POIs over the Dublin map.
HDBSCAN is used to cluster the POIs and find the pattern over the Dublin map. The model takes the minimum number of members in a cluster and Epsilon (EPS), the maximum distance between points, which are set to 10 and 0.009, respectively. The distance between POIs is the Euclidean distance computed on their latitude and longitude. A cluster is formed if at least the minimum number of points lie within the EPS distance of one another; otherwise, the points are considered outliers and are given a negative cluster ID. The advantage of the proposed model over DBSCAN is that it can identify clusters with variable density, which is not possible with DBSCAN. HDBSCAN connects points in the data that are close to each other, and each connected group forms a cluster of similar or nearby locations. Each cluster therefore highlights a zone of similar interest. This helps to find the zones in the city that attract most of the reviewers whose usernames are potentially Indian, and it also identifies outliers, i.e., points that do not form a cluster.
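Because the Euclidean distance is computed directly on latitude and longitude, it helps to see what an EPS of 0.009 corresponds to on the ground. The sketch below assumes EPS is expressed in degrees (the text does not state the unit), uses a spherical Earth with a mean radius of 6371 km, and takes 53.35° N as an approximate latitude for Dublin; the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def degree_offset_to_km(delta_deg, latitude_deg):
    """Approximate ground length of a delta_deg offset, north-south and east-west."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    north_south = delta_deg * km_per_degree
    # Meridians converge towards the poles, so a degree of longitude shrinks by cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_km, ew_km = degree_offset_to_km(0.009, 53.35)
print(f"0.009 deg is about {ns_km:.2f} km north-south and {ew_km:.2f} km east-west")
```

Under these assumptions, the same EPS covers roughly 1 km north-south but only about 0.6 km east-west, so neighbourhoods defined on raw coordinates are somewhat elongated along the north-south axis.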
Figure 4 shows the clusters in Dublin for reviewer names with Indian origin (according to NamSor), where cluster ID “−1” represents outliers. Here the proposed model identifies 16 unique clusters using the HDBSCAN clustering model, where the minimum number of points in the cluster is 7. This showcases all the clusters formed using the POI dataset, including outliers and clusters with no unique significance.
Figure 5a shows the clusters after removing the outliers with cluster ID “−1” and the clusters near the City Center area of Dublin, which is a shopping zone visited by every nationality. These City Center clusters are not considered because they do not indicate unique segregation. HDBSCAN allows the identification of clusters with variable density, i.e., segregation in Dublin that includes clusters with both high and low density. On the other hand, DBSCAN with the same parameters can identify only major clusters with high correlation, leaving behind small clusters with high significance, as shown in Figure 5b. DBSCAN shows seven clusters, which are the same as the major clusters of HDBSCAN. The next step is to identify the size of the clusters in the proposed HDBSCAN-based segregation.
Figure 6 shows the count of POIs in each cluster, which is also showcased in Table 2, with a minimum cluster size of seven.
Further in the study, gender-based segregation is examined to study the activity behavior in Dublin of male and female reviewers whose names NamSor indicated as originating in India. Figure 7a shows the segregation of males with five unique clusters. On the other hand, only one cluster is identified for females using HDBSCAN. The gender-based study showcases the role and contribution of males and females in segregation. This also shows the locations in which males are more interested than females.
As shown in Table 3, the HDBSCAN model was tested with multiple parameter values. With a high MinPts value of 10, clusters of large size are found. On the other hand, with a MinPts value of 7, more clusters of variable density can be identified. This allows us to identify more sets of clusters with more features.
Table 4 shows the analysis of the HDBSCAN model for clustering with different configurations. The analysis highlights the average cluster size, the number of outliers, and the size of the smallest and largest clusters. In Table 4, case 1 and case 3 are important as the number of outliers has decreased, and the count of clusters has increased.
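The summary statistics reported for each configuration (number of clusters, number of outliers, average, smallest, and largest cluster size) can be derived from the cluster labels alone. The following is a minimal sketch, assuming labels are integers with −1 marking outliers; the function name and the toy labels are hypothetical and do not reproduce the values in Table 4.

```python
from collections import Counter


def summarise_clusters(labels, noise_label=-1):
    """Summarise a labelling: cluster count, outlier count, and cluster-size statistics."""
    sizes = Counter(label for label in labels if label != noise_label)
    n_clusters = len(sizes)
    return {
        "n_clusters": n_clusters,
        "n_outliers": sum(1 for label in labels if label == noise_label),
        "avg_size": sum(sizes.values()) / n_clusters if n_clusters else 0.0,
        "min_size": min(sizes.values()) if sizes else 0,
        "max_size": max(sizes.values()) if sizes else 0,
    }


# Made-up labels: cluster 0 has 3 points, cluster 1 has 5 points, 2 outliers.
labels = [0, 0, 0, 1, 1, 1, 1, 1, -1, -1]
summary = summarise_clusters(labels)
print(summary)
```

Running the same function on the labels produced by each parameter configuration gives directly comparable rows for a table of this kind.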
Table 5 shows the count of clusters and the size of each cluster using DBSCAN. This shows that DBSCAN cannot identify clusters with variable density as compared to HDBSCAN.
Table 6 shows the analysis of DBSCAN clustering where the change in MinPts does not affect the number of outliers drastically.
Table 7 shows the features of the clusters identified by HDBSCAN. The results show that most of the clusters are based on reviews of shopping areas and restaurants, in areas where a few of the reviewers live because of the availability of accommodation and public services like schools and transport facilities. In this set of clusters, two clusters are identified as educational institutes (University College Dublin and TU Dublin). Overall, it can be concluded that the clusters are formed near services like restaurants, transport, and shopping centers. However, some of the clusters are found near educational institutes, which showcases the interest of people with Indian usernames, possibly students.
The proposed HDBSCAN model is well suited for finding segregation in POIs since there exist clusters of variable density rather than fixed. Moreover, HDBSCAN can point to more features and clusters in the data with new information about the segregation in Dublin.
5. Conclusions
This work presents a study of HDBSCAN-based activity segregation in Dublin using the Google POI dataset of user reviews. To demonstrate the approach, HDBSCAN is used to identify the areas in Dublin City that attract reviewers with Indian-origin names (according to the NamSor API). The work shows that the Google POI data, with location and review comments, carry knowledge about the users and their pattern of visits in the form of geospatial information. This work identifies the segregation of reviewers in Dublin based on their activity behavior in Google POI. The results compare the performance of the HDBSCAN and DBSCAN clustering models to find the most suitable model for segregation in the POI dataset, and show clusters of locations in Dublin with variable density. The proposed HDBSCAN model finds a higher number of clusters in Dublin, with variable cluster density. The results show 12 unique locations of interest to reviewers in a test case, with an average cluster size of 25 and a minimum of seven user reviews per cluster. For DBSCAN, the number of unique clusters is seven, with an average cluster size of 43. The work also studies gender segregation in Dublin, finding five unique clusters dominated by males and one unique cluster of females. However, the work depends on the accuracy of the NamSor API in identifying possible nationality and gender, and future work may be needed to further understand the accuracy of NamSor in the Irish context. In the future, the work will examine the underlying drivers that shape these clusters, which can be used to validate the approach. Linked to this, we will also identify the change in segregation over time. The work can also be extended to study the formation of new clusters and the growth of existing clusters using the HDBSCAN model in new locations.