5.2. Filter Analysis of Stay Point
Due to positioning errors, the actual coordinates of a stay point may not be completely unchanged; they may vary within a small range of positioning inaccuracy. This paper defines a stay point threshold to reduce the impact of stay point positioning errors on clustering during the movement of the vehicle; the position of a stay point is allowed to shift within a small range.
In order to analyze the impact of the stay threshold on the experimental data sets of this paper, we analyzed its influence in the stay point filtering pre-processing of four data sets with different data volumes. According to the stay point events defined in Table 1, the stay time in this section is set as follows: the DS1 stay time is 15 min, and the stay time of the remaining data sets is 30 min.
The experiment compares the number of samples retained when the thresholds on the original data are 0, 0.0001, 0.001, and 0.01, respectively. The results are shown in Table 3.
The purpose of stay point filtering is to delete the stay points considered as noise points. Therefore, Table 3 lists the number of retained points in each data set under different thresholds; the fewer the retained points, the more stay points were filtered out.
Table 3 shows that all four data sets contain a large number of re-sampled data points caused by taxi stays. Even with a stay point threshold of 0, each data set contains data points that do not move at all. As the stay point threshold increases through 0.0001, 0.001, and 0.01, more stay points are filtered, so fewer points are retained. The differences between thresholds of 0, 0.0001, and 0.001 are not significant, but at 0.01 data retention is greatly affected, especially in DS3 and DS4: in DS3, the retained data are reduced from 2575 to 1475, and in DS4 from 10,194 to 3377. If the threshold is too large, positions recorded during normal driving are recognized as stops and too many points are removed, so such a large threshold is unsuitable. Therefore, the threshold in the stay point filtering process needs to be determined according to the actual situation.
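The effect of the stay point threshold described above can be sketched as follows; the record format and the per-coordinate comparison against the previously kept point are illustrative assumptions, not the authors' exact implementation.

```python
def filter_stay_points(points, threshold=0.0001):
    """Drop re-sampled points whose position change from the previously
    kept point is within the stay point threshold (in degrees)."""
    if not points:
        return []
    kept = [points[0]]
    for lon, lat in points[1:]:
        prev_lon, prev_lat = kept[-1]
        # A point is treated as a stay point (noise) when both coordinate
        # differences fall within the threshold.
        if abs(lon - prev_lon) <= threshold and abs(lat - prev_lat) <= threshold:
            continue  # stay point: discard
        kept.append((lon, lat))
    return kept

# A threshold of 0 removes only points that do not move at all; a large
# threshold (e.g. 0.01) would also remove normal driving positions.
track = [(116.40, 39.90), (116.40, 39.90), (116.4003, 39.9001), (116.41, 39.91)]
print(filter_stay_points(track, threshold=0))
```

Raising the threshold from 0 to 0.001 in this toy track also removes the third point, mirroring the retention drop reported in Table 3.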
According to the classification in Table 1, we analyzed the five stay point events of waiting for traffic lights, getting on and off, traffic jams, business suspension, and business breaks on the four data sets, respectively. When judging the latitude and longitude, a stay is identified when the difference in both latitude and longitude is 0. The experimental results are shown in Table 4.
Table 4 shows that in the four data sets, taxi stays of less than 30 min account for the majority, which indicates that the stay events are mainly caused by waiting for traffic lights, passengers getting on and off, and traffic jams. Fewer stay points are filtered out by business suspension and business breaks. Table 4 also shows that the definition of stay point events in this paper considers real scenes; this pre-processing method for stay point filtering is therefore more realistic and accurate.
5.4. Visual Analysis of Dense Grid Clustering
In this group of experiments, the stay point threshold was set to 0 in the pre-processing stage; the DS1 traffic jam residence time Δt was set to within 15 min, and the DS2, DS3, and DS4 traffic jam residence times were set to within 30 min. In order to show the influence of different grid side lengths on the clustering results in the grid mapping stage, the DS1 and DS4 grid side lengths were set to 0.01, and the DS2 and DS3 grid side lengths were set to 0.05. The grid cell density threshold was set to 10; that is, when the number of sampling points in a grid cell reached 10, it was determined to be a dense grid cell.
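The grid mapping and dense-cell determination steps above can be sketched as follows; the cell indexing by floor division of coordinates by the side length is an illustrative assumption.

```python
from collections import defaultdict
import math

def map_to_grid(points, side):
    """Map each sampling point to a grid cell indexed by
    (floor(lon/side), floor(lat/side)) and count points per cell."""
    counts = defaultdict(int)
    for lon, lat in points:
        counts[(math.floor(lon / side), math.floor(lat / side))] += 1
    return counts

def dense_cells(counts, density_threshold=10):
    """A grid cell is dense when it holds at least
    `density_threshold` sampling points."""
    return {cell for cell, n in counts.items() if n >= density_threshold}

# 12 points fall in one cell and 3 in another: with a density
# threshold of 10, only the first cell is dense.
pts = [(116.401 + i * 1e-4, 39.901) for i in range(12)] + [(116.52, 39.95)] * 3
counts = map_to_grid(pts, side=0.01)
print(dense_cells(counts, 10))
```

A smaller side length (0.01 vs. 0.05) produces more, finer cells, which is why the DS1 and DS4 grids appear denser than those of DS2 and DS3.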
The visualization results of dense grid clustering are shown in Figure 6, which is a combined diagram of the dense grid cells and the original location point distribution. The figure shows the clusters composed of dense grid cells in the foreground and the distribution of the data sampling points as the background. The sampling points inside dense grid cells are the data points of the clusters, and the sampling points that fall in no dense grid cell are sparse noise points.
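Clusters of dense grid cells like those shown in the figure can be formed by merging adjacent dense cells; the 8-neighborhood adjacency used below is an assumption for illustration, as the adjacency rule is not restated in this section.

```python
def cluster_dense_cells(dense):
    """Group dense grid cells into clusters by flood fill: two dense
    cells belong to the same cluster when they are adjacent
    (8-neighborhood)."""
    dense = set(dense)
    clusters = []
    while dense:
        seed = dense.pop()
        stack, cluster = [seed], {seed}
        while stack:
            x, y = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (x + dx, y + dy)
                    if nb in dense:
                        dense.remove(nb)
                        cluster.add(nb)
                        stack.append(nb)
        clusters.append(cluster)
    return clusters

# Two touching cells form one cluster; the far cell forms its own.
print(cluster_dense_cells({(0, 0), (0, 1), (5, 5)}))
```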
The light gray points in Figure 6 are the data sampling points, i.e., the distribution of the position points in the different data sets after stay point filtering. The dark points in blue, red, and yellow represent the clustered grid cells. Since the side length of the grid cell is given, a grid cell can be uniquely determined by any of its corner points; therefore, to simplify the display, each grid cell is represented by its lower-left corner point.
Figure 6 shows that the proposed clustering can effectively identify sparse noise points. For example, in the DS2 result of Figure 6b there are a large number of sparse points at 115.5–116.1 degrees east longitude and 39.7–40.05 degrees north latitude that are not included in any cluster. Moreover, the experimental results for the other data sets show that similar sparse points have no effect on the clustering results, for example, the sampling points around the two clusters and at their joint in the DS1 clustering results in Figure 6a.
The four sets of experimental results in Figure 6a–d show that the method in this paper can accurately identify clusters in high-density areas of sampling points, which represent high-density areas of the taxi distribution, and it therefore performs well in extracting urban hotspots.
Figure 6 also shows the influence of grid cell side length on grid mapping. In Figure 6a,d, the grid cells of DS1 and DS4 are denser, while in Figure 6b,c, the grid cells of DS2 and DS3 are relatively sparse. This is because the grid side length of DS1 and DS4 is set to 0.01 while that of DS2 and DS3 is set to 0.05; different side lengths lead to different grid densities. Since all four data sets were collected from Beijing taxis, the spatial range of the data is not large.
5.5. Analysis of Comparative Experiment Results
In this paper, the clustering method based on stay point and grid density (CMSPGD) was compared with Hybrid Feature-based DBSCAN (HF_DBSCAN) [30] and the Effective Parameter Selection Process for DBSCAN (PS_DBSCAN) [34].
- (1)
HF_DBSCAN
HF_DBSCAN is an algorithm based on improved DBSCAN proposed by Luo et al. in 2017 [30]. DBSCAN is a classic density-based algorithm used to find high-density areas in space, and different derivatives of the algorithm have been proposed to find urban hotspot areas. In the original DBSCAN algorithm, the density of the current point is measured by counting the number of points within a certain distance of it. The HF_DBSCAN algorithm instead uses a Gaussian function as the density of a point, calculated as in Formula (8), where p_i and p_j represent points, d(p_i, p_j) represents the Euclidean distance between p_i and p_j, and σ represents the standard deviation. The standard deviation in this experiment is 0.3.
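A minimal sketch of a Gaussian kernel density of this kind follows; since Formula (8) is not reproduced here, the exact summation form and the absence of normalization are assumptions for illustration.

```python
import math

def gaussian_density(p, points, sigma=0.3):
    """Density of point p as a sum of Gaussian kernels over all sample
    points, decaying with the squared Euclidean distance (sketch of an
    HF_DBSCAN-style density)."""
    px, py = p
    total = 0.0
    for qx, qy in points:
        d2 = (px - qx) ** 2 + (py - qy) ** 2
        total += math.exp(-d2 / (2 * sigma ** 2))
    return total

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
# A point near the first two samples gets a much higher density
# than one near the isolated sample.
print(gaussian_density((0.05, 0.0), pts, 0.3) > gaussian_density((5.0, 5.0), pts, 0.3))
```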
- (2)
PS_DBSCAN
PS_DBSCAN is an improved algorithm proposed by Huang et al. in ACM Trans in 2019 [34]. The original DBSCAN algorithm provides no strict criterion for selecting its two parameters, the radius length and the density threshold, which can result in inaccurate clustering. The authors improved the method for determining these two parameters with the following steps. First, a relatively large radius length is fixed and then gradually reduced. For each radius length, the number of clusters is compared against the density threshold, and the density threshold at which the number of clusters first decreases as the threshold rises is taken as the appropriate density threshold for that radius length. The density threshold of the last such change is taken as the final value. Finally, the number of clusters under this density threshold is compared across radius lengths, and the radius length yielding the larger number of clusters is chosen as the appropriate value.
In this paper, DS4 is first tested according to the parameter selection method of the PS_DBSCAN algorithm to find the appropriate radius length and density threshold. First, we set a larger radius length of 0.025, and then reduced it to 0.01 and 0.005 in sequence. The comparison of the density threshold and the number of clusters under these three radius lengths is shown in Table 5, Table 6 and Table 7 below.
First, in the three sets of data there are three combinations at which the number of clusters decreases as the density threshold increases: radius length 0.005 with density threshold 60, radius length 0.01 with density threshold 110, and radius length 0.025 with density threshold 150. Among these, the density threshold of 150 is the largest and is the value at which the last change occurs, so it is used as the suitable density threshold parameter. Then, under a density threshold of 150, the number of clusters is three at a radius length of 0.005, three at a radius length of 0.01, and seven at a radius length of 0.025. Therefore, for DS4, the appropriate radius length of the DBSCAN algorithm based on parameter selection is 0.025, and the density threshold is 150.
Similarly, the appropriate radius lengths for DS1, DS2, and DS3 are 0.005, 0.025, and 0.025, respectively, and the density thresholds are 10, 30, and 50, respectively.
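The selection procedure above can be sketched as follows; `run_dbscan` is a hypothetical stand-in for any DBSCAN implementation that returns a cluster count, and the toy lookup table merely mimics the DS4 pattern reported above.

```python
def select_parameters(run_dbscan, radii, thresholds):
    """Sketch of the PS_DBSCAN parameter selection: for each radius,
    find the density threshold at which the cluster count first
    decreases as the threshold rises; take the largest such threshold,
    then pick the radius that yields the most clusters under it."""
    per_radius = {}
    for r in sorted(radii, reverse=True):
        counts = [run_dbscan(r, t) for t in thresholds]
        for t, prev, cur in zip(thresholds[1:], counts, counts[1:]):
            if cur < prev:  # cluster count just decreased
                per_radius[r] = t
                break
    density = max(per_radius.values())
    radius = max(radii, key=lambda r: run_dbscan(r, density))
    return radius, density

# Toy cluster counts shaped like the DS4 results described above.
table = {
    (0.005, 50): 4, (0.005, 100): 3, (0.005, 150): 3,
    (0.01, 50): 5, (0.01, 100): 4, (0.01, 150): 3,
    (0.025, 50): 8, (0.025, 100): 8, (0.025, 150): 7,
}
run = lambda r, t: table[(r, t)]
print(select_parameters(run, [0.005, 0.01, 0.025], [50, 100, 150]))  # → (0.025, 150)
```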
- (3)
Contrast analysis of clustering accuracy
In this paper, the experimental clustering results of HF_DBSCAN and PS_DBSCAN on the four data sets are shown in Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, Table 16, Table 17, Table 18 and Table 19.
The attribute No in the tables indicates the number of the cluster; m indicates the number of data points in the cluster (the larger m, the more points participate in the cluster and the fewer noise points are discarded); Longitude and Latitude are the cluster center coordinates of the cluster, that is, the point whose total distance to all points in the cluster is minimal; LoadLength represents the clustering distance of the cluster, i.e., the sum of the distances from all points to the cluster center, calculated as in Formula (9), where p represents a cluster element and center represents the clustering center of the cluster, namely the Longitude and Latitude coordinates; Avg represents the average aggregation distance of each point, calculated as in Formula (10).
Avg reflects the average density of points in the cluster: the greater the Avg, the denser the points in the cluster. The more points a cluster contains and the denser they are, the better the clustering effect of that cluster.
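A sketch of these cluster quality metrics follows. Since Formulas (9) and (10) are not reproduced here, two assumptions are made for illustration: the center is taken as the arithmetic mean of the points (the paper defines it as the distance-minimizing point), and Avg is computed literally as the average aggregation distance LoadLength/m.

```python
import math

def cluster_metrics(points):
    """LoadLength: sum of Euclidean distances from every point to the
    cluster center; Avg: LoadLength divided by the number of points m.
    The center here is the arithmetic mean, an assumption -- the paper
    defines it as the point minimizing the total distance."""
    m = len(points)
    cx = sum(p[0] for p in points) / m
    cy = sum(p[1] for p in points) / m
    load_length = sum(math.hypot(x - cx, y - cy) for x, y in points)
    return m, (cx, cy), load_length, load_length / m

m, center, load, avg = cluster_metrics([(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)])
print(m, center, round(load, 3), round(avg, 3))
```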
Table 8, Table 9 and Table 10 show that, on the DS1 data set, the proposed algorithm and the PS_DBSCAN algorithm have smaller m values than the HF_DBSCAN algorithm, indicating that on small-scale data sets these two algorithms lose more cluster sampling points. Secondly, the LoadLength and Avg values in this group of tables show that the intra-cluster distances produced by the HF_DBSCAN algorithm are large, indicating that its clustering quality is not as good as that of the algorithm in this paper and the PS_DBSCAN algorithm.
Table 11, Table 12, Table 13, Table 14, Table 15, Table 16, Table 17, Table 18 and Table 19 show that the m values of the three algorithms on the DS2, DS3, and DS4 data sets are not much different, indicating that the three algorithms retain basically the same number of sampling points in their clustering results. However, the Avg values of the comparison algorithms are higher, which shows that the clustering accuracy of the algorithm in this paper is worse in terms of intra-cluster distance. This is because the grid mapping process of the clustering in this paper introduces a certain loss of accuracy.
Table 8, Table 9, Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, Table 16, Table 17, Table 18 and Table 19 show that although the HF_DBSCAN algorithm produces more clusters, most of them contain sparse points and few elements, while the clustering algorithm in this paper and the PS_DBSCAN algorithm produce more uniform clusters. For example, Table 14, Table 15 and Table 16 show that the clustering method in this paper and the PS_DBSCAN algorithm generate two to three clusters while the HF_DBSCAN algorithm generates nine clusters; however, according to the value of m, clusters 2, 3, 4, 5, 6, 7, and 9 contain only one data point each. Such data could be removed as noise or merged into other clusters. Similar results are obtained on the other data sets. The clusters formed by the clustering algorithm in this paper and the PS_DBSCAN algorithm are more reasonable, balanced, and stable. However, the PS_DBSCAN algorithm achieves its uniform clustering results by optimizing the parameters, and it is relatively complicated to implement. Therefore, in forming reasonable clusters, the algorithm in this paper is simpler and more efficient to implement than the PS_DBSCAN algorithm.
In summary, compared to the PS_DBSCAN algorithm in terms of clustering effect, the algorithm in this paper is simpler and discards fewer noise points. Compared to the HF_DBSCAN algorithm, the clusters formed by this method are more uniform and reasonable.
- (4)
Comparative analysis of running time
The experiment also compares and analyzes the execution time of the algorithm in this paper against the HF_DBSCAN and PS_DBSCAN algorithms. The experimental results for the four data sets are shown in Table 20.
Table 20 shows that the running time of the clustering algorithm in this paper when processing a data set of the same size is much lower than that of the comparison algorithms. As the number of data objects increases, the running time of the comparison algorithms increases sharply, while the running time of the grid- and density-based clustering algorithm in this paper increases much more slowly, giving it an advantage when dealing with large data sets. This is because our algorithm divides the space into a grid, so the processed objects are not individual data points but grid cells, whereas the improved DBSCAN algorithms operate directly on data objects; our clustering algorithm is therefore more efficient than the HF_DBSCAN and PS_DBSCAN algorithms.
In this experiment, clustering is performed on grid cells. The number of grid cells after space division and the number of non-dense cells also affect efficiency, and further judgment of non-dense grid cells is required; nevertheless, the overall efficiency remains significantly better than that of the comparison algorithms.