1. Introduction
Vehicle trajectory data have been widely utilized in traffic flow detection, route selection, and traffic violation enforcement, which are vital to transportation management in urban areas [
1,
2]. The analysis of vehicle trajectories is important for transport industries, such as taxi driving. Trajectory data can assist taxi companies in detecting their drivers’ behaviors and understanding their operation patterns, which can be further used to enhance driving safety and transport efficiency [
3,
4,
5]. For instance, it is possible to evaluate the route planning of taxi drivers with trajectories and optimize their operational routines. Meanwhile, the trajectories can be used to identify traffic incidents encountered by taxi drivers and warn taxi managers if necessary [
6,
7]. For example, Kan et al. [
8] analyzed the conditions of urban traffic congestion using taxi GPS trajectory data and proposed a method for congestion assessment. This approach could help transportation authorities to formulate more effective policies and strategies. In addition to congestion detection, the use of taxi trajectories can reveal the speed characteristics of roadway segments with different control types (e.g., one-way control and road closure) [
9,
10,
11]. The quality of GPS trajectories must therefore be ensured for these applications to work reliably.
Although vehicle trajectories are readily available, abnormal records are common in most datasets. The trajectory points recorded by GPS receivers may be positioned away from the actual locations because of the road environment [
12], satellite positioning errors [
13], instability of GPS signal transmission [
14], and algorithm differences [
15]. These deviations lead to inaccurate detection of actual driving lanes and directions and the disappearance of the trajectories from the roadways [
16]. To address these issues, studies typically adopt map-matching algorithms to correct the coordinates of the recorded GPS points and match them to the corresponding roads along the driver’s route.
Different methods and algorithms are used in map matching for the GPS trajectories, which exhibit notable differences in terms of model structure, data inputs, and logical rules [
17]. Currently, traditional map-matching algorithms encompass geometric map-matching, topological map-matching, probabilistic statistical map-matching, and advanced map-matching algorithms [
18,
19]. Furthermore, various new technologies have recently been introduced to the field of map matching, including low-frequency trajectory data match [
20,
21]; high-frequency trajectory data match [
22]; methods that consider memorized multiple matching candidates [
21]; the AMM algorithm for online map matching [
23]; deep learning-based models [
24,
25], including Recurrent Neural Networks (RNNs) [
25] or Convolutional Neural Networks (CNNs) [
25]; the Python Toolbox (PyTrack) [
26]; and the Valhalla solution based on an open-source routing engine [
27]. These methods have been successfully used to match the GPS points from various trajectories. However, they often have specific requirements for the trajectory data (e.g., data volume, data frequency, and error distribution) or the scenarios (e.g., online or offline matching). Some methods aim to enhance the matching accuracy by incorporating additional data features or adjusting algorithm parameters. Nonetheless, these approaches often suffer from high memory usage and large time costs during the map-matching process, making it challenging to match large GPS data on complex road networks.
The HMM-based map-matching method effectively addresses many of the drawbacks associated with the aforementioned approaches [
28]. Due to the simplicity and Markovian property of the HMM algorithm, it offers significant improvements in terms of computational efficiency, storage efficiency, and broad applicability. It is particularly well-suited to handling errors on complex road networks from large-scale datasets [
27,
28]. These advantages make it a valuable tool in practical trajectory matching. However, HMM-based map matching may still yield incorrect matches when trajectories are complicated. For instance, past studies have found that curved driving [
15], high-speed movement [
29], and road network topology [
30] can contribute to wrong matches. These map-matching errors could mislead the identification and prediction of vehicle status and behaviors, consequently impairing the ability of systems to monitor the trajectories of running vehicles. Hence, the map-matching errors ought to be specifically identified and addressed.
Currently, although some research has focused on the issue of map-matching errors, there is relatively limited research into the mechanism of map-matching errors, as well as analysis of the spatiotemporal distribution and road characteristics of these errors. For example, Dey et al. [
17] proposed a method to automatically identify and detect map-matching errors in the absence of ground truth. Chao et al. [
31] found that the density of roads and road segments with curves significantly impacted the quality of map matching. Furthermore, Luo et al. [
32] found that intersections and large indoor areas often resulted in significant indoor positioning errors. Different measures have been developed to detect and deal with these map-matching errors. The choice of detection approach depends heavily on the availability of data sources and the purpose of matching. The most common is the rule-based approach, which detects a wrong match by examining the relationship between trajectories and the roadway network. For example, the method can detect errors that occur when trajectories fall outside the roads or when part of a trajectory vanishes from the roads (i.e., deviation) [
33]. Another type of method is developed through machine learning, among which supervised learning is commonly used to detect wrong-matching trajectories using training data, such as the use of Support Vector Machines (SVM) [
34] and Random Forests (RF) [
35]. Fusion-based methods, which generally combine several modules to improve judgment accuracy, have also been constructed to detect map-matching errors [
36]. Despite these approaches, wrongly matched trajectories still cannot be fully detected or effectively corrected. One critical reason is that the mechanism of map-matching errors has not been thoroughly uncovered.
As such, investigating how map-matching errors occur could improve matching quality at locations where errors are prone to appear. To this end, spatiotemporal analysis can be conducted to reveal the mechanism of map-matching errors and explore their characteristics. For instance, Santi et al. [
37] conducted a spatial–temporal analysis to identify the patterns of taxi travel based on New York City taxi trajectories. Livio et al. [
38] combined traffic accidents and GPS trajectory data to identify accident black spots in space and time scales. Likewise, spatiotemporal analysis could be used to identify the distribution of map-matching errors and reveal the location or scenarios that are associated with the occurrence of the errors.
In summary, this study attempts to enrich the knowledge of where, when, and why various types of map-matching errors occur by means of spatial–temporal and factor analysis. The study matches trajectories using the HMM algorithm, identifies different types of map-matching errors, and investigates their spatial–temporal distributions and contributing factors. The main contribution of the study is that we identify the map-matching errors generated via the HMM algorithm, analyze the spatiotemporal distribution patterns of these errors, and explore the relationship between map-matching errors and road environment factors. The conclusions could help analysts understand the occurrence of map-matching errors when applying the HMM and improve the accuracy of the HMM algorithm.
The rest of the study is organized as follows. The next section describes the study area, GPS trajectory data, and road features.
Section 3 provides a detailed introduction of the map-matching method.
Section 4 presents the study’s results, and Section 5 discusses them. Finally,
Section 6 summarizes the study’s conclusions.
2. Data Description and Pre-Processing
Taxi GPS trajectory data were obtained from Chengdu Municipal Traffic Management Bureau in China. Taxi GPS trajectories were collected in the period 1–14 September 2020 inside the First Ring Road area (
Figure 1). The taxi data include three parts: taxi GPS trajectory data, a GIS map of the road network, and a dataset of road features. The taxi GPS trajectories contain approximately 1.4 billion GPS data points in the collection period. To keep the data volume manageable, 14 million trajectories generated by 500 randomly selected taxis were matched using the HMM algorithm, and the resulting map-matching errors were identified. The raw trajectory data recorded by the vehicular GPS recorder contained 11 variables: license plate number, plate color, alarm status, vehicle status, latitude, longitude, direction, speed, satellite time, creation time, and creator (
Table 1).
The road feature dataset was manually collected from Baidu Street View, Version 2020. In addition to road features, time and environmental factors were considered in the study.
Table 2 describes all of the key factors of the errors. In addition, the features of the roads immediately adjacent to the road or intersection at which a map-matching error occurred along the driver’s route were considered as factors. These adjacent roads were named the previous road and the latter road, respectively (
Figure 2).
Two pre-processing steps preceded map-matching error detection. The first step cleaned the raw GPS data by removing missing and duplicate records (e.g., duplicated GPS points with the same latitude–longitude and timestamp) and filtering out abnormal drift points (e.g., discontinuous jumps in longitude and latitude along a trajectory). The second step adjusted the coordinate system and restricted the study area: the GPS latitude and longitude coordinates were converted from World Geodetic System 1984 to the Xian_1980_3_Degree_GK_CM_105E projected coordinate system for subsequent map-matching operations, and the research area within the First Ring Road of Chengdu was selected from the downloaded road network map by removing all roads and intersections beyond this area in the digital map.
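The cleaning rules described above can be sketched in plain Python. This is an illustrative sketch only, under assumptions that are not stated in the study: the record schema (`lat`, `lon`, `t`), the function names, and the 120 km/h implied-speed threshold used to flag drift points are all hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def clean_trajectory(points, max_speed_kmh=120.0):
    """Drop incomplete, duplicate, and drifting GPS records of one taxi.

    points: list of dicts with keys 'lat', 'lon', 't' (seconds); hypothetical schema.
    max_speed_kmh: assumed threshold; a jump implying a higher speed is
    treated as an abnormal drift point.
    """
    # 1) Remove records with missing fields.
    complete = [p for p in points
                if all(p.get(k) is not None for k in ("lat", "lon", "t"))]

    # 2) Remove duplicates sharing latitude, longitude, and timestamp.
    seen, unique = set(), []
    for p in sorted(complete, key=lambda q: q["t"]):
        key = (p["lat"], p["lon"], p["t"])
        if key not in seen:
            seen.add(key)
            unique.append(p)

    # 3) Filter drift points: compare each point with the last accepted one.
    cleaned = []
    for p in unique:
        if cleaned:
            prev = cleaned[-1]
            dt = p["t"] - prev["t"]
            if dt <= 0:
                continue  # same timestamp, different position: keep the first
            dist = haversine_m(prev["lat"], prev["lon"], p["lat"], p["lon"])
            if dist / dt * 3.6 > max_speed_kmh:
                continue  # discontinuous jump along the trajectory
        cleaned.append(p)
    return cleaned

# Toy trajectory: one missing value, one duplicate, one drift point.
sample = [
    {"lat": 30.6600, "lon": 104.0600, "t": 0},
    {"lat": 30.6600, "lon": 104.0600, "t": 0},   # duplicate
    {"lat": None, "lon": 104.0600, "t": 5},      # missing latitude
    {"lat": 30.6601, "lon": 104.0601, "t": 10},
    {"lat": 30.7000, "lon": 104.0600, "t": 20},  # several kilometres in 10 s
    {"lat": 30.6602, "lon": 104.0602, "t": 30},
]
cleaned_sample = clean_trajectory(sample)
```

One limitation of this simple rule is that it trusts the first record of each trajectory; if that record is itself a drift point, later valid points would be rejected, so a production filter would need a more robust starting check.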
Data processing and analysis proceeded in three sequential stages: raw data collection, data pre-processing, and map matching (
Table 3). The data output from the previous process served as the input for the subsequent process.
This study primarily utilized an Inspur NF5280M6 server equipped with an Intel(R) Xeon(R) CPU E5-@ 2.10 GHz processor, 32 GB of RAM, and a 4-terabyte hard disk. The platform ran Windows Server 2019, and the software used included ArcGIS, Python, and SPSS. ArcGIS was used for kernel density analysis; Python was used for data pre-processing and map matching, incorporating modules such as Pandas, ArcpyUtil, and HmmUtil; and SPSS was used to analyze the spatiotemporal factors that contributed to map-matching errors with the multinomial logistic regression model.
The experimental results indicated that the CPU time consumption in the data pre-processing stage was approximately 0.02112 s per iteration, memory usage was around 6.21 GB, and disk utilization was 1%. Similarly, in the map-matching error detection and analysis stage, the CPU time consumption was roughly 0.02592 s per iteration, memory usage was about 3.07 GB, and disk utilization stayed constant at 1%.
4. Results
4.1. Outputs of Map Matching
All of the selected taxi trajectories are matched to their nearest road lanes using the HMM map-matching algorithm.
Figure 5a,b show the sample outputs of the match on a single segment and a road network, respectively. The red dots represent the raw GPS points from a trajectory, and the green dots represent the matched points. It is observed that the HMM algorithm performs well in matching trajectories on a single segment. All of the trajectory points can be properly matched to the corresponding road lane if the vehicle is moving in a fixed direction. However, map matching on a large-scale road network becomes much more complicated. As shown in
Figure 5b, a majority of trajectory points can be correctly matched to the corresponding road lane, but a few points are clearly mismatched when the vehicle is driving on a road segment or passing through an intersection, leading to WREs and WJEs, respectively. In addition, the number of matched trajectory points does not equal the number of original trajectory points, which indicates the existence of OREs or OJEs.
Figure 6 illustrates the trajectories before and after map matching, which shows that most of the original trajectories are correctly matched to their corresponding roadways. The success rate of matching reaches 89% for the whole study area, demonstrating a relatively effective matching performance. For the remaining 11% of trajectories with map-matching errors, we calculate the proportion of the four types of map-matching errors. In
Table 4, the total number of OREs, WREs, OJEs, and WJEs in the study area is 175,512, of which 113,349 are WREs, accounting for 64.6% of all map-matching errors, followed by OREs (14.1%), OJEs (10.8%), and WJEs (10.6%). The process of identifying these errors took a total of 4550 s (approximately 76 min).
The total time complexity of the HMM map-matching algorithm is O(N × M²), and the overall space complexity of the HMM map-matching algorithm is O(M × N). N represents the number of observed points (trajectory points), and M represents the number of road segments (states) in the road network.
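These complexities are what one would expect if the HMM is decoded with the Viterbi algorithm, which is the usual choice for HMM map matching; that the study uses exactly this decoder is an assumption here. The sketch below is a generic Viterbi decoder, not the study's implementation. The callables `log_emission` and `log_transition` and the toy probabilities are hypothetical placeholders for the emission and transition models.

```python
import math

def viterbi(n_obs, states, log_emission, log_transition):
    """Most likely sequence of hidden states (road segments) for n_obs GPS points.

    log_emission(t, s): log-probability of observation t given state s.
    log_transition(t, r, s): log-probability of moving from r (time t-1) to s (time t).
    """
    if n_obs == 0 or not states:
        return []
    # score[t][s] is the best log-probability of any path ending in s at time t;
    # back[t][s] is the predecessor on that path. Both tables hold N x M
    # entries, which gives the O(M x N) space cost.
    score = [{s: log_emission(0, s) for s in states}]
    back = [{}]
    for t in range(1, n_obs):              # N observations
        score_t, back_t = {}, {}
        for s in states:                   # M current states
            best_r, best_val = None, -math.inf
            for r in states:               # M previous states: O(N x M^2) overall
                val = score[t - 1][r] + log_transition(t, r, s)
                if best_r is None or val > best_val:
                    best_r, best_val = r, val
            score_t[s] = best_val + log_emission(t, s)
            back_t[s] = best_r
        score.append(score_t)
        back.append(back_t)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(n_obs - 1, 0, -1):
        last = back[t][last]
        path.append(last)
    path.reverse()
    return path

# Toy example with two road segments; each observation lies close to one of them.
favoured = ["A", "A", "B"]

def toy_emission(t, s):
    return math.log(0.95 if s == favoured[t] else 0.05)

def toy_transition(t, r, s):
    return math.log(0.7 if r == s else 0.3)

toy_path = viterbi(len(favoured), ["A", "B"], toy_emission, toy_transition)
```

In practice, implementations often restrict the candidate states at each time step to segments near the GPS point, which reduces the effective M well below the size of the whole network.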
4.2. Temporal and Spatial Distribution of Trajectory Errors
Based on the KDE analysis,
Figure 7 illustrates the spatial–temporal distributions of four map-matching errors in different analytical units.
The distributions of OREs (
Figure 7(a1–e1)) do not change noticeably across times of day or between weekends and weekdays. Specifically, OREs tend to cluster at intersections located in the central, eastern, southwestern, and southeastern regions of the road network, indicating that the density of OREs is not uniform across the study area. As for the temporal characteristics, daytime and weekdays are associated with denser OREs in these regions, while OREs observed in peak hours, at night, and at the weekend are much sparser. It is also found that the distributions of OJEs (
Figure 7(a3–e3)) are almost identical in different time scales, which are clustered in the central, northern, eastern, and southwestern areas of the road network. Similarly, the densities of OJEs in different time scales are analogous to those of OREs.
The clusters of WREs (
Figure 7(a2–e2)) are spatially sparser across the road network. WREs tend to cluster in the central, southeastern, and southwestern parts of the road network. In terms of time, WREs in peak hours and on weekends are denser than those in the daytime, at night, and on weekdays. In contrast to WREs, the WJE clusters (
Figure 7(a4–e4)) vary visibly across the road network at different time scales. More specifically, WJEs are densely clustered in the central and southwestern regions of the road network in the daytime, at peak hours, and on weekdays, while these errors become sparser at night and on weekends. Meanwhile, WJEs are prone to cluster at intersections located in the southeastern part at night, on weekdays, and on weekends, but this cluster is not observed in the daytime or at peak hours. Furthermore, the densities of WJEs in the daytime, at peak hours, and on weekdays surpass those at night and on weekends.
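The idea behind the density surfaces discussed in this subsection can be illustrated with a minimal kernel density estimate. The study produced its surfaces in ArcGIS; the sketch below is not that tool's implementation. It assumes a Gaussian kernel and a user-chosen bandwidth, and the function names are hypothetical. The surface is scaled so that it integrates to the number of events, that is, it expresses errors per unit area.

```python
import math

def kernel_density(points, x, y, bandwidth):
    """Gaussian kernel density (events per unit area) at location (x, y).

    points: list of (x, y) error locations in projected coordinates (e.g., metres).
    bandwidth: smoothing parameter h in the same units (an assumed value).
    """
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total

def density_grid(points, xs, ys, bandwidth):
    """Evaluate the density on a grid; rows follow ys, columns follow xs."""
    return [[kernel_density(points, x, y, bandwidth) for x in xs] for y in ys]
```

Running the same function on subsets of errors, for example only those recorded at night or only those recorded on weekends, yields one surface per time scale, which is the kind of comparison the panels of Figure 7 present.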
4.3. Contributing Factors of Map-Matching Errors
Variance inflation factor (VIF) is used to diagnose multicollinearity among all predictor variables prior to modeling [
43]. All factors have a VIF of less than 5, indicating that the model estimates are not materially affected by multicollinearity. A multinomial logistic model is then adopted to explore the factors contributing to the map-matching errors, with ORE as the reference category.
Table 5 presents the estimated results. Relative to ORE, there are 26, 23, and 21 factors significantly associated with WRE, OJE, and WJE, respectively.
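The VIF screening described above can be reproduced with a short calculation. The sketch below relies on the standard identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix, which is equivalent to 1 / (1 − R²) from regressing predictor j on the others. The function names and the toy data are hypothetical; the study itself does not say which software computed the VIFs.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long, non-constant predictor columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfect multicollinearity: correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def vif(columns):
    """Variance inflation factor of each predictor column."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

# Toy predictors with correlation 0.8, so each VIF is 1 / (1 - 0.64).
toy_vif = vif([[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]])
passes_threshold = all(v < 5 for v in toy_vif)  # threshold of 5 as used in the text
```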
In terms of time factors, compared to daytime, the probabilities are 8.8% lower, 14.3% lower, and 34.2% higher at night for WRE, OJE, and WJE, respectively. During peak hours, the probability of WRE is 4.0% lower, and the probability of OJE is 5.8% lower. On weekends, the probability of WRE is 5.2% lower than on workdays, while the probability of OJE is 4.8% lower than on workdays.
For intersection types, the probability of WRE is significantly higher if the map-matching error occurs at a location close to a flyover (Odds Ratio = 225.2%), a crossroad (Odds Ratio = 367.8%), an X-junction (Odds Ratio = 108.7%), or a T-junction (Odds Ratio = 325.1%), as opposed to the errors located outside the vicinity of the intersections. The conclusion is also applicable to WJE (except near the flyover) and OJE (except near the X-junction).
Relative to the previous road section, the probability of WRE is significantly higher if the previous road has bicycle dividers (Odds Ratio = 10.1%), median dividers (Odds Ratio = 53.5%), roadside parking (Odds Ratio = 27.9%), or is one-way controlled (Odds Ratio = 27.8%). The same holds for OJE (except for bicycle dividers and median dividers) and WJE (except for bicycle dividers, median dividers, and roadside parking). With regard to the speed limit of the previous road, speed limits of <30 km/h and 30–50 km/h reduce the probability of WRE by 64.2% and 10.3%, respectively, compared to roads with a speed limit of ≥60 km/h, while increasing the probability of OJE by 53.9% and 50.3%, respectively, and reducing the probability of WJE by 30.2% and 32.2%, respectively. Furthermore, an increase of 1 unit per km in resident density raises the probability of WRE by 4.6%, and a 1-unit-per-km increase in public service density raises the probability of WRE by 1.9% but decreases the probability of OJE by 3.3%.
Factors related to the latter road show that the probability of WRE is significantly higher if the road has bicycle dividers (Odds Ratio = 63.6%), median dividers (Odds Ratio = 15.6%), or roadside parking (Odds Ratio = 12.5%). Similar findings are observed for OJE and WJE, except that roadside parking decreases the probability of OJE and WJE. As for speed limit, roads with a speed limit of <30 km/h are associated with significantly reduced probabilities of WRE (65.1%) and OJE (23.1%), but a higher probability of WJE (134.2%), compared to the speed limit of ≥60 km/h. The speed limit of 30–50 km/h is also linked to a 15.5% reduction in the probability of WRE and a 19.1% reduction in the probability of OJE. Moreover, an increase of 1 unit per km in resident density and public service density is associated with higher probabilities of WRE, OJE, and WJE.
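To make the percentages in this subsection easier to interpret, the sketch below shows how a multinomial logistic model with ORE as the reference category turns coefficients into category probabilities, and how a coefficient β maps to a percentage change in the odds relative to ORE, (exp(β) − 1) × 100. Reading the reported percentages in this way is an assumption, since the text does not state the conversion. The function names are hypothetical, and the coefficient is back-calculated from the 8.8% figure purely for illustration; it is not taken from Table 5.

```python
import math

def category_probabilities(coefs, x):
    """Category probabilities in a multinomial logit with ORE as reference.

    coefs: {category: (intercept, [beta_1, ...])} for the non-reference
    categories (WRE, OJE, WJE); the reference category has linear score 0.
    x: list of predictor values.
    """
    scores = {"ORE": 0.0}
    for cat, (b0, betas) in coefs.items():
        scores[cat] = b0 + sum(b * v for b, v in zip(betas, x))
    denom = sum(math.exp(s) for s in scores.values())
    return {cat: math.exp(s) / denom for cat, s in scores.items()}

def percent_change_in_odds(beta):
    """Percentage change in the odds (relative to ORE) per unit increase in x."""
    return (math.exp(beta) - 1.0) * 100.0

# Illustrative coefficient only: chosen so that the odds of WRE versus ORE
# are 8.8% lower at night (x = 1) than in the daytime (x = 0).
beta_night_wre = math.log(0.912)
toy_coefs = {"WRE": (1.0, [beta_night_wre]), "OJE": (0.2, [0.0]), "WJE": (0.1, [0.0])}
p_day = category_probabilities(toy_coefs, [0.0])
p_night = category_probabilities(toy_coefs, [1.0])
odds_ratio = (p_night["WRE"] / p_night["ORE"]) / (p_day["WRE"] / p_day["ORE"])
```

Note that such a percentage describes a change in the odds of one error type relative to ORE, which is not the same as an equal change in the probability of that error type.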
5. Discussion
It is found that time factors have a significant effect on determining the error types. For instance, WREs and OJEs are more likely to be observed in the daytime and on weekdays. This result is mainly due to the fact that taxis are usually clustered in the urban center during the daytime and on weekdays to seek passengers [
44]. Moreover, the city center has a high-density road network, which may be associated with increasing the occurrence of map-matching errors. Specifically, the trajectories of running taxis within this area are more likely to be incorrectly matched to another road (WRE). Also, the taxis have more complex trajectories when they enter the intersections due to the increased traffic volume during the period [
45], resulting in more matching losses (OJE). However, WJEs are more likely to be observed at night, possibly because taxis tend to wait or cruise around the city’s intersections, hospitals, or transit stations, where taxi demand is more intensive [
46]. In this case, the trajectories could be incorrectly matched to nearby road segments that are adjacent to intersections, leading to WJEs.
The occurrence of map-matching errors varies across the types and sizes of intersections. Specifically, WREs and OJEs are more likely to occur on flyovers, which is not the case for WJEs. This could be explained by the fact that trajectories are likely to be matched to the ground roads under the flyovers (WRE) or lost on ramps (OJE). However, the trajectories are not likely to be matched to another access (WJE) because of the large size of the flyover. We also found that WRE and WJE are more frequently observed at X- and T-junctions due to the complex movements of vehicles and the difficulty of positioning at these junctions [
47,
48]. The higher probabilities of WRE, OJE, and WJE at medium- and large-sized intersections can thus be attributed to complex trajectories. By comparison, the reason for map-matching errors at small-sized intersections may be different. Specifically, the heights and densities of buildings and trees around intersections can affect GPS signal transmission quality [
49,
50]. As such, the GPS trajectories are more difficult to position within smaller intersections since they are more susceptible to being obstructed by adjacent buildings and trees [
51], consequently causing the failure of GPS match (OJE).
As for the characteristics of both previous and latter roads, bicycle dividers, median dividers, and one-way control could increase the probability of ORE and WRE. A possible interpretation is that dividers and one-way control allow vehicles to travel faster, so fewer reliable trajectory points are recorded on the road segment than during low-speed movement. The dividers on the latter road could also increase the probability of OJE and WJE because vehicles can easily accelerate or change lanes after they pass the junction. We notice that roadside parking on both previous and latter roads could increase the probability of WREs, which implies that roadside parking could obstruct the driver’s view and, consequently, prompt them to change lanes or adjust their driving speed. However, the likelihood of junction errors (i.e., OJE and WJE) decreases since vehicles have to slow down if the latter road has roadside parking. For land use, commercial areas lower the probability of WRE and OJE due to traffic delay. In contrast, WREs are prone to occur where previous and latter roads have more public facilities and residents, because the dense access points near the junctions make the trajectories more likely to be mismatched. This is also the case for OJE and WJE if the latter road has dense public facilities and residents.
The speed limit of previous and latter road segments is found to significantly influence the map-matching errors. WREs tend to occur on road segments where previous and latter roads have a higher speed limit, which could be explained by the assumption that higher vehicle speeds can lead to deteriorated GPS signal quality [
52]. Additionally, vehicles are likely to lose their trajectories (OJE) if they switch from a lower speed limit road to a road with a higher speed limit through a junction. This result occurs because the trajectories on the roads with lower speed limits tend to be more stable, but the vehicle could lose the trajectory signals if it suddenly accelerates and enters a high-speed road. Conversely, the modeling results demonstrate that WJE generally takes place when vehicles move from a road with a high speed limit to a road with a low speed limit. This result may be explained by the fact that vehicles have to slow down in advance before they enter a low speed limit scenario after they enter the junction. Hence, the vehicles could generate a large number of trajectories within the junction, which could be mismatched to other nearby roads rather than being lost.
6. Conclusions
The study identifies four kinds of trajectory map-matching errors (i.e., ORE, WRE, OJE, and WJE) produced by the HMM algorithm using taxi trajectories in Chengdu. The study employs temporal kernel density analysis and a multinomial logistic model to examine the spatial–temporal patterns of the map-matching errors and the contributing factors associated with different error likelihoods. Several key findings are offered below:
The spatial patterns of ORE, WRE, OJE, and WJE vary markedly across the time scales (e.g., time of day and weekday/weekends), signifying that the map-matching errors are not consistently located in the study area.
Compared to ORE, the probability of WRE and OJE is higher on weekdays, while the probability of WJE is higher on weekends. It is noted that OREs and WJEs are more likely to occur during peak hours and at night.
WREs, OJEs, and WJEs are more likely to be observed at intersections, especially on a flyover, an X-junction, and a T-junction.
WREs tend to occur on the road where previous and latter roads simultaneously have bicycle dividers, median dividers, one-way control, and roadside parking, while these factors have mixed impacts on OJEs and WJEs. Also, higher resident and public service density on the latter road could increase the probability of WRE, OJE, and WJE.
WREs are likely to occur on roads with high speed limits. OJEs tend to occur when vehicles switch from a road with a low speed limit to a road with a high speed limit, while the occurrence of WJE has the opposite trend.
There are several limitations to the study. Firstly, we only identify four types of map-matching errors from the taxi trajectories, which may not cover all of the error types in practice. Secondly, we only focus on the effects of time, intersection characteristics, and road features on map-matching errors; the influence of traffic conditions and drivers’ responses remains unknown. Thirdly, the primary objective of this paper is to utilize offline GPS data to investigate the spatiotemporal patterns of map-matching errors and examine the effect of road environments on these errors. Therefore, an online map-matching system is not considered in the current study.
These three limitations could be overcome if more accurate trajectory data became available. In future research, we aim to develop advanced algorithms to identify more types of map-matching errors, such as Breakage Error, Ambiguous Match, and Ghost Trajectory Error [
17,
27]. Detecting more types of errors would give researchers a more comprehensive understanding of the errors that may occur when applying a map-matching algorithm. Additionally, incorporating real-time traffic data and driver responses would allow us to analyze the influence of traffic conditions and driver-related factors on the distribution of map-matching errors, extending the present focus on time, intersection features, and road characteristics to traffic situations and driver decision-making. We aim to address these challenges and provide more reliable conclusions in a future study.