*3.2. G-Clustering*

Since thousands of *POI* categories exist in certain regions, it is impractical to map each category to its own class. Therefore, we propose G-clustering to allocate categories into classes such that the categories within a class share similar characteristics. G-clustering is inspired by the Gini coefficient [14], an index proposed by Corrado Gini to judge the fairness of an annual income distribution based on the Lorenz curve. To apply the concept of the Lorenz curve in our work, we modify its definition, as illustrated in Figure 4. The Gini coefficient is equal to the ratio of area A to area (A + B); since the combined area of A and B is 0.5, it is also equal to 2A.

**Figure 4.** Lorenz Curve.
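As a rough illustration of the relation G = A / (A + B) = 2A, the sketch below computes the Gini coefficient of a vector of counts from its discrete Lorenz curve. The function name `gini`, the trapezoidal approximation of area B, and the convention of returning 0 for an empty or all-zero input are assumptions made for this example; they are not taken from Algorithm 1.

```python
import numpy as np


def gini(counts):
    """Gini coefficient of a non-negative 1-D sequence via its Lorenz curve.

    Hypothetical helper for illustration only.
    """
    x = np.sort(np.asarray(counts, dtype=float))  # ascending order
    total = x.sum()
    if x.size == 0 or total == 0:
        return 0.0  # assumed convention for empty or all-zero input
    # Lorenz curve: cumulative share of the total, starting at the origin.
    lorenz = np.concatenate(([0.0], np.cumsum(x) / total))
    # Area B under the Lorenz curve (trapezoidal rule on a uniform grid).
    n = x.size
    area_b = (lorenz[:-1] + lorenz[1:]).sum() / (2.0 * n)
    # A + B = 0.5, so G = A / (A + B) = 2A = 1 - 2B.
    return 1.0 - 2.0 * area_b
```

With this discrete approximation, a perfectly even vector such as `[1, 1, 1, 1]` yields 0, while a vector concentrated in a single entry approaches 1 as the number of entries grows.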

We apply this index to evaluate how each category is distributed across several regions, which are obtained by clustering *POI*s by geographical location in the given problem space, and we then group categories with similar distributions into the same cluster. The pseudocode for the G-clustering algorithm is given in Algorithm 1.


The proposed G-clustering is composed of three parts: initialization (line 1), construction of the heat matrix (lines 2 to 7), and clustering of categories (lines 8 to 13). First, *POI* is clustered into *D*<sub>1</sub> clusters, where *D*<sub>1</sub> is adjustable and set to 20 in our evaluation. Next, a heat matrix *HD* = (*h*<sub>*i*,*j*</sub>) is constructed, with *h*<sub>*i*,*j*</sub> representing the number of *POI*s of the *i*th category in the *j*th cluster according to *P*. Meanwhile, *Cd* refers to the corresponding cluster result of the *j*th *POI* in line 1. We assign each category to a group according to its value point; the result *CCD* is returned once all categories have been processed and assigned to a cluster. Each item in *CCD* is a set of categories at the same level; we set *l* = 6 in the following evaluation.
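The heat-matrix construction can be sketched as a simple counting step. The example below is a minimal sketch under stated assumptions: the function `build_heat_matrix` and its parameter names are hypothetical, categories and clusters are assumed to be encoded as integer indices, and negative cluster labels (as some density-based clusterers use for noise points) are assumed to be skipped. None of these details are specified in Algorithm 1.

```python
import numpy as np


def build_heat_matrix(poi_categories, poi_clusters, n_categories, n_clusters):
    """Count POIs per (category, cluster) pair.

    poi_categories[k] is the category index of the k-th POI and
    poi_clusters[k] is the spatial cluster it was assigned to.
    All names here are hypothetical, for illustration only.
    """
    heat = np.zeros((n_categories, n_clusters), dtype=int)
    for cat, clu in zip(poi_categories, poi_clusters):
        if clu < 0:
            continue  # assumption: negative labels mark unclustered POIs
        heat[cat, clu] += 1  # h[i, j]: POIs of category i in cluster j
    return heat


# Toy usage: five POIs, three categories, two spatial clusters.
heat = build_heat_matrix([0, 0, 1, 2, 1], [0, 1, 1, 0, -1], 3, 2)
```

Each row of the resulting matrix is the distribution of one category over the spatial clusters, which is the input the value-point function operates on.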

Lines 14 to 17 define a function that calculates the category value point. In this function, the Gini coefficient is applied to measure the distribution of each category across the clusters. The more even the distribution is, the closer this index gets to 0 (closer to the blue line in Figure 4); otherwise, it gets closer to 1 (closer to the red line in Figure 4). The function is designed to determine whether a category is indicative or not: the more even the distribution, the higher the value point, and the less indicative the category is. In addition, we apply K-means to reallocate *POI* into *D*<sub>2</sub> clusters in line 1, where *D*<sub>2</sub> is set to 20 in our evaluation, and all other steps of G-clustering are left unchanged. The two types of clustering results are listed in Table 2.
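The value-point function and the grouping into levels described above can be sketched as follows. The exact formula of the value point is not reproduced in this section, so this example assumes, purely for illustration, that the value point is 1 minus the Gini coefficient of a category's row in the heat matrix (so that a more even distribution gives a higher value point), and that categories are grouped into *l* equal-width levels. The names `gini`, `value_points`, and `group_by_level` are hypothetical.

```python
import numpy as np


def gini(counts):
    """Gini coefficient via the standard sorted-rank closed form."""
    x = np.sort(np.asarray(counts, dtype=float))
    n, total = x.size, x.sum()
    if n == 0 or total == 0:
        return 0.0  # assumed convention for empty or all-zero rows
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ x / (n * total))


def value_points(heat):
    """Assumed value point: 1 - Gini of each category's row."""
    return np.array([1.0 - gini(row) for row in heat])


def group_by_level(points, n_levels=6):
    """Assumed grouping: equal-width bins over [0, 1], one list per level."""
    levels = np.minimum((points * n_levels).astype(int), n_levels - 1)
    groups = [[] for _ in range(n_levels)]
    for category, level in enumerate(levels):
        groups[level].append(category)
    return groups


# Toy heat matrix: an even, a fully concentrated, and a skewed category.
toy_heat = np.array([[5, 5, 5, 5], [0, 0, 0, 20], [1, 2, 3, 4]])
toy_points = value_points(toy_heat)
toy_groups = group_by_level(toy_points, n_levels=6)
```

Under these assumptions, the evenly distributed category lands in the highest level (least indicative) and the concentrated category in a low level (most indicative); a different value-point formula or binning rule would change the boundaries but not this ordering.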

**Table 2.** Results of G-clustering.


We list several representative categories in each cluster to explain the effectiveness of the G-clustering algorithm. In the left column of Table 2 (DBSCAN), categories that are more evenly distributed in the area, such as Fitness Venues and Retail Banks, are clustered in the same class since these types of *POI*s have no obvious regional characteristics. In other words, there is no excessive demand for these categories in specific districts. In contrast, the number of Night Markets and Art Galleries is noticeably larger in certain areas, so these may be regarded as indicative categories in the prediction. A similar trend can also be found in the right column of Table 2 (K-means). The small difference between the DBSCAN and K-means clustering results in some categories being clustered in different hierarchies. For example, Art Galleries and Junior High Schools are in different clusters, which might be due to their different local characteristics. The clustering results are then used as important categorical features for the bike stations.
