1. Introduction
The global pandemic of coronavirus disease 2019 (COVID-19), a global public health emergency, has brought major challenges to the entire world. The outbreak has changed the thinking and lifestyle of residents, and maintaining social distance and avoiding contagion have become a shared concern of residents in public places. It has brought about an unprecedented reduction in public transport demand over the past three years, shifting residents’ perceptions of public transport from positive to negative [1]. According to the survey, residents’ propensity to use different transportation modes has changed significantly since the COVID-19 outbreak. Public transport has been the most affected, with a remarkable decrease in the number of users, while the use of private cars has increased. However, for people who do not have a private car and do not want to use public transportation, shared mobility is a good option in many cities. Online car-hailing is a kind of shared mobility that is favored for its convenient and reliable service. As the best alternative to private car travel and an important supplement to public transportation, it has gradually become one of the important travel modes [2]. At the same time, making full use of shared travel resources can alleviate traffic pressure. The online car-hailing platform collects passenger orders and matches them with service vehicles in a new way, creating conditions for the provision of large-scale transportation services [3]. It helps to build an intelligent, green, efficient, and safe integrated transportation system and to promote the sustainable development of urban transportation [4,5].
According to the national online car-hailing regulatory information interaction platform [6], 263 online car-hailing platform companies are licensed in China, and 4.053 million online car-hailing driving licenses have been issued. Online car-hailing nevertheless faces several problems. Residents cannot always access car-hailing services quickly, and drivers cruise empty for long periods without orders, which wastes energy and increases traffic congestion. Therefore, it is of great significance to accurately predict online car-hailing travel demand in time and space. This can help residents to better understand demand by region and time, enabling better decisions and improved travel efficiency. Reasonable scheduling of vehicles and the timely fulfilment of travel needs can reduce the waste of road resources and ease traffic congestion, energy waste, and pollution. The data used in this article come from the Gaia data open plan of Didi Chuxing, in which personal information is anonymized. Online car-hailing platforms have almost uniform operation characteristics. Customers manually input the starting and ending points, which are recorded as longitude and latitude. Orders include information such as an order number and times of boarding and alighting, which provides a basis for the spatiotemporal prediction of online car-hailing travel.
Online car-hailing travel data are spatiotemporal, with obvious periodicity, which makes the prediction of travel demand feasible. Early travel research used historical averages for prediction and focused on time series prediction, with typical models such as the autoregressive integrated moving average (ARIMA) model and its variants [7,8]. Such traditional methods use linear representations to find the internal characteristics of traffic flow, which does not suit online car-hailing data, a more complex nonlinear spatiotemporal sequence. Machine learning and deep learning models, such as neural networks, perceptrons, and support vector machines, have been used in traffic prediction [9], and are more accurate than traditional models [10]. Deep learning has formed the basis of many excellent network models in the field of transportation. Convolutional neural networks (CNNs) [11] and recurrent neural networks (RNNs) [12] are the basic structures for traffic prediction research. However, CNNs have a weak ability to capture long-term temporal dependence, while RNNs require intensive computation. Previous studies also treated online car-hailing data as one-dimensional, which limits the ability to capture spatiotemporal features.
Several models have been proposed to capture the spatiotemporal characteristics of online car-hailing trips. Convolutional long short-term memory (ConvLSTM) [13] was originally used for precipitation prediction, and is widely used in transportation because of its spatiotemporal prediction ability. To predict urban short-term traffic flow, Chen et al. [14] designed a prediction model based on ConvLSTM. Zheng et al. [15] introduced an attention module to a ConvLSTM model that can extract spatial and short-term temporal features to improve prediction accuracy. However, ConvLSTM cannot be computed in parallel across time steps, so training is slow, and the network has difficulty capturing long-term dependencies in sequences.
As discussed above, previous studies have treated online car-hailing demand as one-dimensional data, so their ability to capture its spatiotemporal characteristics is weak. Although ConvLSTM can fill this gap, its lack of parallel computation makes training slow, and it has difficulty capturing long-term dependencies in sequences. To overcome these shortcomings, this paper converts the online car-hailing order data into video frame data containing spatial information and time series, and proposes a predictive Spatiotemporal Transformer (SPTformer) model based on the Transformer architecture, so as to better predict the demand for online car-hailing travel on temporal and spatial scales. From this perspective, this study also converts the demand forecasting problem of online car-hailing into an image extrapolation problem. In recent years, deep learning has raised the technical level in many fields with its good performance, especially in computer vision [16]. The Transformer model was originally developed by Vaswani et al. for natural language processing [17], and has since been widely used in computer vision [18,19,20,21,22]. SPTformer utilizes the encoding part of the Transformer network. To account for the temporal correlation between sequences, we inject information about the relative or absolute position of the image sequence into the input data by adding positional encoding, and introduce a 3D convolutional layer (Conv3D) and scaled dot-product attention in the self-attention calculation to capture the short-term and long-term dependencies between sequences. Like the Transformer model, our model can run in parallel, maintaining good performance while speeding up training. As far as we know, this is the first time that online car-hailing order data have been converted into video frame data and that the Transformer network has been used to predict the travel demand of online car-hailing in time and space. The contributions of this paper are as follows:
- (1) Based on the Transformer architecture, a spatiotemporal prediction SPTformer model is proposed, and the experimental results prove that our model is competitive for the spatiotemporal prediction of residents’ online car-hailing travel.
- (2) After the online car-hailing order data are processed into video frame data, they still contain the spatiotemporal information of the online car-hailing trip data, so the model can better predict the online car-hailing demand on the spatiotemporal scale.
The remainder of this paper is organized as follows: in Section 2, we briefly review the existing solutions and models for solving traffic prediction problems. In Section 3, the structure of the proposed model is described, and each part of the model is explained in detail. At the same time, the special processing method and data structure of online car-hailing data are introduced. In Section 4, extensive experiments are described using Haikou online car-hailing order data, and the existing spatiotemporal prediction methods in the transportation field are used as a comparison to test the effect of the proposed model. The prediction results are discussed at the end. In Section 5, we summarize this research and discuss further work.
2. Related Work
Deep learning has found application in various fields, and much recent transportation research builds on it. This paper addresses the spatiotemporal prediction of online car-hailing travel demand.
The CNN and RNN are two basic models in traffic prediction research. To better account for the spatiotemporal correlation of ride-hailing travel demand, Zhang et al. proposed an end-to-end multi-task learning temporal convolutional neural network (MTL-TCNN), which predicts short-term passenger demand at a multi-regional level based on Didi Chuxing’s ride-hailing data in Chengdu, China and taxi data in New York City [23]. To predict short-term traffic flow, Zhang et al. [
24] designed a model based on a CNN, with higher accuracy than the traditional model. Chen et al. [
25] proposed PCNN, which is based on deep CNNs and can model periodic traffic data and predict short-term traffic congestion. Mou et al. [
26] proposed a temporal information augmented LSTM (T-LSTM) model to predict the traffic flow of a single road segment, which can capture the intrinsic correlation between traffic flow and temporal information, thereby improving prediction accuracy. The prediction of traffic flow during peak hours is of great significance to alleviate traffic pressure. Yu et al. [
27] designed a traffic flow prediction model based on long short-term memory (LSTM) to predict traffic flow in urban peak hours. Tang et al. [
28] proposed ST-LSTM, which extracts spatiotemporal features from data and combines them as input. Gu et al. [
29] combined an LSTM neural network and gated recurrent unit (GRU) in a two-layer deep learning model that outperformed a single network model. However, ordinary CNNs weakly capture long-term temporal dependencies, and RNNs do not capture spatial dependencies well, which motivates their combination.
A CNN can capture the spatial basis of traffic flow, while an RNN can mine short-term changes and periodicity. Wu et al. [
30] combined these in a deep learning framework, CLTFP, for the spatiotemporal prediction of traffic flow. Similarly, Zhen et al. [
31] used a CNN to extract traffic spatial features, and an RNN to predict traffic flow changes. Liu et al. [
32] designed a ConvLSTM model based on a CNN, which can extract the spatiotemporal features of traffic flow, and has an end-to-end deep learning architecture. To consider the temporal and spatial characteristics of traffic flow and extract the temporal and spatial correlation and variation law of traffic flow data, Li et al. [
33] combined a CNN and bidirectional LSTM (BiLSTM) in Conv-BiLSTM. To extract the spatiotemporal correlation of data from historical traffic data, He et al. [
34] designed a spatiotemporal CNN (STCNN) based on a model of convolutional LSTM cells. Wang et al. [
35] designed a traffic demand prediction model based on deep spatiotemporal ConvLSTM, which was experimentally shown to outperform traditional models in both accuracy and speed. Huang et al. [
36] designed a ConvLSTM-Inception network (CL-IncNet) to make spatiotemporal predictions of traffic flow data. Li et al. [
37] constructed a ConvLSTM network to predict taxi demand, which was shown to more accurately process spatial information. Chen et al. [
38] proposed a BT-ConvLSTM model to introduce temporal information to a ConvLSTM network, and it was experimentally shown to improve traffic flow prediction accuracy. Di et al. [
39] proposed CPM-ConvLSTM, a spatiotemporal model to make short-term predictions of the congestion levels of road segments. To reduce resource requirements, Huang et al. [
40] built a sparse convolutional recurrent network utilizing sparse gates in ConvLSTM and ConvGRU. Ranawaka et al. [
41] used a ConvLSTM model with Google traffic data to predict traffic flow in the next 20, 30, and 60 min. Although the combination of CNN and RNN can capture the spatiotemporal features of traffic data, they are computationally expensive and slow to train, due to the sequential nature of the recurrent structure.
This study uses a Transformer network to construct a prediction model. Compared with RNN-based methods, Transformer can effectively capture long-term dependencies, can be operated in parallel, has good performance and a fast training speed, and can capture the correlation of each part of an image through self-attention well. In order to consider the different spatial relationships between variables, Grigsby et al. [
42] proposed a method called Spacetimeformer, which achieved good results in the field of spatiotemporal prediction. Xu et al. [
43] proposed a new paradigm of spatiotemporal transformer networks (STTNs), which exploits dynamic directional spatial dependencies and long-term temporal dependencies to improve the accuracy of long-term traffic flow prediction, and their model performs well in long-term traffic flow prediction. Song et al. [
44] proposed TSTNet, a sequence-to-sequence (Seq2Seq) spatiotemporal traffic prediction model based on the Transformer architecture that can be used for urban traffic spatiotemporal flow prediction. Zhang et al. [
45] used the Transformer network to propose a novel architecture called a time-fusion transformer (TFT), which predicts short-term highway speeds and has been experimentally shown to have high accuracy. Cai et al. [
46] referred to Google’s Transformer machine translation framework to design a network called a Traffic Transformer network that captures the continuity and periodicity of traffic flow time series and models spatial correlations. Girdhar et al. [
47] designed an anticipatory video transformer (AVT) based on a Transformer network to predict actions, with an attention module and an end-to-end model architecture. In order to make accurate predictions of autonomous driving trajectories, Zhang et al. [
48] designed a Gatformer model based on transformer architecture, which can make more accurate predictions while shortening the forecasting time. Wu et al. [
49] proposed an object-centric video transformer (OCVT) to predict video frames, decomposing a scene into tokens suitable to generate video transformers. Farazi et al. [
50] designed an end-to-end learnable model, the frequency domain transformer network (FDTN), which can estimate and use signal transforms in the frequency domain. Wang et al. [
51] designed a concise and efficient temporal Transformer network with progressive prediction, aggregating observed features, and a lightweight architecture to progressively predict features. Liu et al. [
52] proposed a ConvTransformer network for video frame sequence learning and synthesis. Shi et al. [
53] proposed a Transformer-based video interpolation framework that uses self-attention to compute long-term dependencies. Zheng et al. [
54] designed a pure Transformer-based network to predict the next step for a 3D human pose in a video. Tai et al. [
55] designed higher-order self-attention, and proposed a higher-order recursive layer design, HORST. Farazi et al. [
56] introduced a transformer model that enables local predictions with selectable sparsity.
The Transformer network has achieved great success in computer vision, and provides a theoretical basis for our research, since traffic data are spatiotemporal. To better predict online car-hailing demand, we convert order data into video data with spatiotemporal characteristics according to a fixed time period. To predict video data, we propose a model called the Spatiotemporal Convolution Transformer (SPTformer) based on the architecture of a Transformer. Experiments show that the model is suitable for the research of spatiotemporal prediction, and it performs well.
4. Experimental Setup and Analysis
4.1. Study Area and Data
4.1.1. Overview of Study Area
This study takes Haikou City, the capital of Hainan Province and the central city of the Beibu Gulf urban agglomeration, as an example. It is located between 19°31′ and 20°04′ north latitude and between 110°07′ and 110°42′ east longitude. It is the political, economic, technological, and cultural center of Hainan Province, and is its largest transportation hub. It is the fulcrum city of China’s “One Belt, One Road” strategy [58]. However, Haikou has some transportation problems, especially in public transport development. Measured against the evaluation indicators for public transport in China, Haikou has few public transport lines, and the number of buses per 10,000 people is lower than the national standard. Buses are mainly concentrated on trunk roads and their lines are unevenly distributed. In addition, the interval between buses is long, so few citizens choose to take the bus. As an important supplement to public transportation, online car-hailing is very popular among residents [59]. In 2012, Didi Chuxing, Shenzhou, Yidao, and other online car-hailing companies began to operate in Haikou City. By the end of 2016, the number of online car-hailing vehicles in Haikou had reached 10,000, including 6000 legal car-hailing drivers [60].
Figure 2 shows an overview of the study area.
4.1.2. Ride-Hailing Data
The online car-hailing order data used in this study come from the travel dataset published by Didi Chuxing’s Gaia data open plan [61]. This study selected the daily order data of Haikou from 1 May to 31 October 2017, including order ID, order time, order type, traffic type, number of passengers, estimated road distance between departure and destination, arrival time, estimated price, duration, primary business line, and longitude and latitude of the destination and starting point. Personal information was anonymized, which did not affect the research.
4.2. Data Preprocessing
Map data: The online car-hailing data cover the downtown area of Haikou. Therefore, the main study areas of this paper are Jinmao Street, Zhongshan Street, Jinyu Street, Guoxing Street, Heping South Street, Haifu Street, Haixiu Street, Xiuying Street, Haixiu Town, Datong Street, Chengxi Town, Haiken Street, Binhai Street, Bailong Street, Lantian Street, Boai Street, Binjiang Street, Fucheng Street, Fengxiang Street, Haidian Street, Renmin Road Street, Xinbu Town, and Baisha Street, covering 19.89°–20.12° north latitude and 110.10°–110.63° east longitude.
Research area division: To facilitate modeling and avoid the complexity of map matching in large-scale network demand forecasting, the area should be divided into small research units. We divided the study area into a grid for faster analysis. Since the point data are aggregated, the results are affected by the size and method of grid division, so the grid size should be determined according to actual needs. A small-scale grid describes demand at a finer spatial granularity, but the travel demand in each grid is low, the network complexity is high, and actual operation is difficult. A large-scale grid reduces the computational complexity, but its description of demand is less accurate. In this article, the study area is initially divided into 60 × 60 grids, i.e., 0.09 km², and the travel demand of each grid is calculated for each time slice.
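The grid assignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the study extent is read as decimal degrees (19.89–20.12° N, 110.10–110.63° E), a uniform 60 × 60 grid over that rectangle, and a hypothetical helper name `cell_index`.

```python
# Assumed bounding box in decimal degrees (an illustrative reading of the study extent).
LAT_MIN, LAT_MAX = 19.89, 20.12
LON_MIN, LON_MAX = 110.10, 110.63
GRID_ROWS, GRID_COLS = 60, 60


def cell_index(lat, lon):
    """Return (row, col) of the grid cell containing the point, or None if it is outside."""
    if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
        return None
    # Fractional position inside the box, scaled to the number of rows and columns.
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * GRID_ROWS)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * GRID_COLS)
    # Points exactly on the upper boundary are assigned to the last cell.
    return min(row, GRID_ROWS - 1), min(col, GRID_COLS - 1)
```

Counting how many orders map to each `(row, col)` pair within one time slice would then give the demand value of that cell.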
Online car-hailing data: The order data have missing and abnormal values, such as a missing order ID or a null estimated distance. Fields that are not used in this study are deleted from the historical order data, including city ID, city area code, secondary district and county, driver sub product line, estimated road distance between departure and destination, estimated price, duration, and primary business line. IDs are randomly generated for orders with missing IDs. Orders whose origin and destination latitude and longitude are outside the scope of the study area are deleted. In the case of duplication, only the first occurrence of an order is retained. The original data include 14,160,162 orders, of which 11,255,140 are retained after data cleaning.
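A minimal sketch of these cleaning rules is given below. The field names (`order_id`, `start_lat`, `start_lon`, `end_lat`, `end_lon`) and the function names are hypothetical, duplicates are assumed to be identified by order ID, and an order is assumed to be dropped when either endpoint lies outside the study rectangle; the source does not specify these details.

```python
import uuid


def in_area(lat, lon, lat_range, lon_range):
    """True if the point lies inside the rectangular study area (bounds inclusive)."""
    return lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]


def clean_orders(orders, lat_range, lon_range):
    """Fill missing IDs, keep first occurrences, and drop orders outside the area."""
    seen_ids = set()
    kept = []
    for order in orders:
        record = dict(order)  # copy so the caller's data is not modified
        if not record.get("order_id"):
            # Randomly generated replacement for a missing order ID.
            record["order_id"] = uuid.uuid4().hex
        if record["order_id"] in seen_ids:
            continue  # duplicate: only the first occurrence is retained
        seen_ids.add(record["order_id"])
        start_ok = in_area(record["start_lat"], record["start_lon"], lat_range, lon_range)
        end_ok = in_area(record["end_lat"], record["end_lon"], lat_range, lon_range)
        if start_ok and end_ok:
            kept.append(record)
    return kept
```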
4.3. Experimental Data Construction
Online car-hailing order data must be converted into video frame data to construct a spatiotemporal matrix, i.e., into grayscale images, one per time period. Each pixel represents one grid of the study area, and its gray value represents the travel volume of that grid. Video frames are combined to construct a video frame dataset, which is fed into the model for spatiotemporal prediction. The method is as follows. The six months of data are arranged in chronological order, and the time scale of the historical variables is divided. To judge the influence of the time division on the experimental results and select the best one, we divide the data into time slices of 10, 15, 20, and 30 min. It is then necessary to calculate the number of online car-hailing trips in each grid in each time slice. This study uses the latitude and longitude of each order to determine which grid the order falls in, and records the travel demand of that grid. The video frames of all time slices are arranged in time order, giving 26,496 frames for 10 min, 17,664 for 15 min, 13,248 for 20 min, and 8832 for 30 min. To control the variables, all models use the data of the first two hours to predict the data of the next hour. For each three-hour sample, the first two hours are the model input, and the last hour is used as the label of the spatiotemporal prediction to calculate the loss and accuracy. Image data for slices of 10, 15, 20, and 30 min are therefore processed into video samples with 18, 12, 9, and 6 frames, respectively. The first 1400 frames are used as experimental data, with 80% for training, 10% for validation, and 10% for testing.
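The frame arithmetic in this subsection can be checked with a short sketch. The function names are hypothetical, and the code assumes that the period from 1 May to 31 October 2017 is covered without gaps and that the 8:1:1 split is taken in order.

```python
from datetime import date


def frames_in_period(start, end, slice_minutes):
    """Number of time slices between two dates (both inclusive)."""
    days = (end - start).days + 1
    return days * (24 * 60 // slice_minutes)


def split_sample(frames, slice_minutes, input_hours=2, label_hours=1):
    """Split one three-hour sample into input frames and label frames."""
    per_hour = 60 // slice_minutes
    n_input = input_hours * per_hour
    n_label = label_hours * per_hour
    if len(frames) != n_input + n_label:
        raise ValueError("sample length does not match the window definition")
    return frames[:n_input], frames[n_input:]


def split_counts(total, ratios=(8, 1, 1)):
    """Sizes of the training, validation, and test sets for a given ratio."""
    unit = total // sum(ratios)
    train, val = ratios[0] * unit, ratios[1] * unit
    return train, val, total - train - val
```

Under these assumptions, `frames_in_period(date(2017, 5, 1), date(2017, 10, 31), 10)` covers 184 days of 144 slices each, a 10 min sample splits into 12 input frames and 6 label frames, and `split_counts(1400)` gives 1120, 140, and 140 items.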
Figure 3 shows the converted data.
4.4. Evaluation Indicators
This study evaluates the quality of model prediction by mean absolute error (MAE) and root mean square error (RMSE),

$$\mathrm{MAE} = \frac{1}{nm}\sum_{k=1}^{n}\sum_{i=1}^{m}\left|\hat{y}_{k,i} - y_{k,i}\right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{nm}\sum_{k=1}^{n}\sum_{i=1}^{m}\left(\hat{y}_{k,i} - y_{k,i}\right)^{2}},$$

where $n$ is the number of predicted frames, $k$ denotes the frame, $m$ is the number of video frame grids, $i$ denotes the research area, $\hat{y}_{k,i}$ is the predicted value, and $y_{k,i}$ is the real value. MAE can intuitively reflect the average difference between the predicted and actual values. A lower MAE indicates a more accurate prediction. RMSE also reflects the difference between the predicted and real data, but it magnifies larger errors, so it is more sensitive to large deviations. A smaller RMSE indicates a better prediction result. When calculating MAE and RMSE, $y$ and $\hat{y}$ are gray values. The prediction accuracy can be obtained by calculating each grid of each frame.
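As an illustration, the two indicators can be computed over every grid of every frame as sketched below. The sketch assumes both errors are averaged over all frames and grids, and represents each frame as a nested list of gray values; `mae_rmse` is a hypothetical helper name.

```python
import math


def mae_rmse(predicted, actual):
    """Return (MAE, RMSE) over all grids of all frames.

    Both arguments are lists of frames; each frame is a list of rows of gray values.
    """
    abs_sum = 0.0
    sq_sum = 0.0
    count = 0
    for pred_frame, true_frame in zip(predicted, actual):
        for pred_row, true_row in zip(pred_frame, true_frame):
            for pred_value, true_value in zip(pred_row, true_row):
                diff = pred_value - true_value
                abs_sum += abs(diff)
                sq_sum += diff * diff
                count += 1
    return abs_sum / count, math.sqrt(sq_sum / count)
```

For one 2 × 2 frame with errors 0, 2, 0, and −4, the MAE is 1.5 while the RMSE is √5 ≈ 2.24, which shows how the single large error weighs more heavily on the RMSE.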
4.5. Experimental Analysis
4.5.1. SPTformer
The SPTformer model, as shown in Figure 1, includes a spatiotemporal embedding layer, which embeds the data and adds positional encoding; an Encoder for computing self-attention; and an output layer, which consists of a Conv3D layer that performs the final prediction. The Encoder consists of two Encoder layers in series.
This study uses the video frame datasets divided by different time slices to compare the impact of time division methods on the prediction results. We take the first 1400 data points as experimental data; construct training, validation, and test sets in an 8:1:1 ratio; and select the optimal model parameters after experiments. We set the model dimension to 16, the number of attention heads to 4, the size of the convolution kernel of the spatial embedding convolutional layer to (3, 3, 3), the number of convolution kernels to 16, and use a ReLU activation function. In the Encoder block, the convolutional layers that compute Q, K, and V use the same settings. The 3D-Feedforward has two Conv3D layers, with 32 and 16 convolution kernels in the first and second layer, respectively, of size (3, 3, 3), using ReLU activation. The output layer is a Conv3D layer with one convolution kernel of size (1, 1, 1). This study uses MSE as the training loss, optimized with RMSprop with a learning rate of 0.001 and a decay rate of 0.9, and uses temporal backpropagation feedback to adjust the model. During training, once the number of training rounds reaches a certain level, the loss and accuracy change slowly and the model reaches the optimum, so we set the number of training rounds to 50.
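For reference, the scaled dot-product attention that the Encoder builds on can be sketched in its standard form, softmax(QKᵀ/√d_k)V. This is only the generic operation on flat vectors: in SPTformer, Q, K, and V are produced by Conv3D layers on frame tensors, which is not reproduced here, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Apply softmax(Q K^T / sqrt(d_k)) V, with each input given as a list of vectors."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

Because every query attends to all keys at once, the outputs for all positions can be computed independently of each other, which is what allows this step to run in parallel rather than step by step.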
4.5.2. Compared Models
This study adopts a convolutional LSTM network as our baseline model, and utilizes ConvGRU and Self-Attention ConvLSTM (SaConvLSTM) models for comparison.
The ConvLSTM neural network was first used to solve the problem of precipitation nowcasting. This structure can establish temporal relationships between two-dimensional plane data and extract spatial relationships like a CNN [62]. Its principle is similar to that of an LSTM network. It also has forget, input, and output gates, but the difference is the addition of convolution between the input and each gate. ConvLSTM has been widely used in the research of spatiotemporal prediction. The formula is as follows:

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right),\\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right),\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right),\\
H_t &= o_t \circ \tanh\left(C_t\right),
\end{aligned}
$$

where $X_t$, $H_t$, $C_t$, $i_t$, $f_t$, and $o_t$ are all converted from two-dimensional to three-dimensional tensors. Two dimensions represent the rows and columns of the network in which the grid is located, and the other dimension represents the number of features in each grid; $i_t$, $f_t$, and $o_t$ represent the input, forget, and output gates. $X_t$ represents the input of the network at time $t$, $H_t$ represents the output at time $t$, and $C_t$ represents the cell state at time $t$; $W$ and $b$ represent the weights and biases for each gate, respectively, and $\sigma$ is the sigmoid function. However, $W$ acts like a convolutional kernel, and “$*$” represents convolutional operations; “$\circ$” represents the Hadamard product as in LSTM.
Since LSTM is slow to train, the GRU makes slight modifications to it to increase speed. Inspired by this, the LSTM in ConvLSTM was replaced by a GRU, and the ConvGRU model was proposed. Like ConvLSTM, ConvGRU changes the operation between the input and each gate to convolution, so it can perform spatiotemporal prediction; unlike ConvLSTM, it uses GRU rather than LSTM gating. Yu et al. [63] found that ConvGRU is faster and has better spatiotemporal prediction results.
Lin et al. [64] observed that ConvLSTM relies on convolutional layers to capture spatial dependencies, which are local and inefficient, and proposed SaConvLSTM, which introduces self-attention to extract spatial features with global and local dependencies and to capture long-term dependencies in the spatial and temporal domains. Their experimental results show that the method achieves better prediction results, with fewer parameters and higher efficiency.
4.5.3. Results
This paper compared the effect of the proposed model to that of other models on the same dataset, using the data of the first two hours to predict the data of the next hour. We used datasets constructed with different time divisions, taking the first 1400 data points of each as the experimental data, and constructing training, validation, and test sets. The results are shown in Table 1. Our model has the highest prediction accuracy on all constructed datasets, and its accuracy increases as the time division becomes finer: its RMSE and MAE gradually decrease with finer time periods and are lowest when the dataset is constructed with 10 min time periods. To observe the model training process, we visualized the changes in the accuracy of each model as it was trained.
Figure 4 shows the changes in the MAE and RMSE.
It can be seen from
Figure 4 that the proposed model has the best fitting degree, the loss rate and accuracy curves are smoothest, and the accuracy has a rising trend. It can be seen from the accuracy change that the training speed of the proposed model is best in the first 20 rounds of training, and then the accuracy changes slowly and gradually flattens. At 50 training rounds, the fitting degree of the model is best, and the accuracy reaches the maximum. Compared to the other models, the MAE and RMSE curves of the proposed model are always at the bottom, and its training effect is best. The accuracy curve of the reference model is serrated, and the data fit poorly. The reference model has the fastest training speed in the first 10 rounds, and then it slows down, reaching the optimal value at the 30th round. To more intuitively see the performance of the model, we visualize the prediction results in
Figure 5.
This study selected the same dataset and used the trained model to predict it.
Figure 5 shows the visualization of the prediction results of different models. Each frame of data has an image data structure, whose gray value is the travel demand. Comparing the real map to the predictions, it can be found that the prediction map of the proposed model is similar to the real map, it has the smallest difference in distribution shape and intensity, and it can best express details. Although the other models predict the general distribution characteristics of the data, there is a big gap in the details, and the predicted value of travel demand differs greatly from the actual value.
After carefully observing the prediction results of each model, we found that ConvLSTM and ConvGRU produce similar forecasts, with roughly the same predicted travel intensity. However, ConvLSTM is slightly better at predicting spatial distribution. Compared with these two models, SaConvLSTM performs poorly: it is worse than ConvLSTM and ConvGRU in both spatial distribution and travel intensity prediction. Overall, although the three comparison models predicted the spatial distribution of the central part of the study area, they predicted its margins poorly. The original map shows that travel intensity at the edges of the study area is very low or that there are no travel data at all, yet the comparison models overpredict demand there. In contrast, the predictions of the model proposed in this paper are better: its prediction for the edges of the study area is basically the same as the original data, and its prediction of travel intensity is closer to the original data. Thus, our model is more competitive.
Analyzing the prediction results of the model proposed in this paper, we found that it predicted best for the city center, while its predictions for the Xinbu Street, Haixiu Street, Xiuying Street, Haixiu Town, Chengxi Town, Fengxiang Street, and Binjiang Street areas were poor. The model predicts better in the city center because the intensity of online car-hailing trips there is higher and residents’ daily trips are more strongly periodic. In the other areas, the prediction results are poor because daily trips are few and online car-hailing is used irregularly. Subsequent studies can analyze these areas separately to increase forecast accuracy.
4.6. Discussion
This study used the same dataset to experiment with different models. The experimental results show that our model fits the online car-hailing travel demand data best. The spatial distribution of its prediction results is closer to the original data, and it describes details better. At the same time, our model predicts the demand for car-hailing most accurately. To capture the spatial relationship between sequences, the comparison models change the operation between the input and each gate to convolution, but the CNN receptive field is usually small, which is not conducive to capturing global features. Unlike a CNN, the Transformer can attend to all positions of the input and their relations at the same time, thus capturing long-range dependencies.
Using the same data and training batch, the average training times of SPTformer, ConvLSTM, and ConvGRU are 16 s, 33 s, and 20 s, respectively, and that of SaConvLSTM is 44 s, so our model trains fastest. ConvGRU is faster than ConvLSTM, but its prediction accuracy is lower. In addition, our model consumes fewer GPU resources. The comparison models use LSTM and GRU structures to capture the temporal relationships between sequences, and both structures evolved from the RNN. Since it was proposed, the RNN has been widely used for time series problems. Broadly speaking, an RNN is a loop that reuses the result of the previous iteration. In theory, it should be able to retain information seen many time steps earlier, but in practice it can hardly learn such long-term dependencies. The LSTM network was subsequently proposed as an RNN variant that learns long-term dependencies better. However, like the RNN, it must process a sequence step by step, which leaves no room for parallelization across time steps to speed up training. The GRU works on the same principle as the LSTM, with some simplifications and less computation, but in terms of prediction results its representation ability is not as good as that of the LSTM. SPTformer is a deep learning model built on an attention mechanism. Attention mechanisms in neural networks strengthen the relevant parts of the input and suppress the irrelevant parts, learning through training which parts are important. Compared with a recurrent neural network, SPTformer does not need to process sequential data step by step, so more of the computation can run in parallel during training and the training time is reduced. In sum, our model is more competitive.
Observing the prediction results on datasets constructed with different time divisions, we found that the finer the time division, the better the prediction. A finer time division gives the model more information, and the prediction becomes more accurate. Therefore, when forecasting travel demand, the data should be divided into finer time intervals.
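The contrast between step-by-step recurrence and attention can be illustrated with a toy example. This is a minimal sketch under strong simplifying assumptions and is not the SPTformer implementation: features are scalars, the recurrent weights w and u are arbitrary constants, and the attention uses the inputs directly as queries, keys, and values with no learned projections.

```python
import math


def rnn_states(xs, w=0.5, u=0.9):
    """Toy scalar recurrence: each state needs the previous state,
    so the time steps must be computed one after another."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states


def attention_row(i, xs):
    """Scaled dot-product self-attention output for position i.
    With scalar features d_k = 1, so the scaling factor is 1."""
    scores = [xs[i] * xj for xj in xs]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(wt * xj for wt, xj in zip(weights, xs)) / total


def self_attention(xs):
    """Every row depends only on the inputs, never on another row's
    output, so the rows could be computed in any order or in parallel."""
    return [attention_row(i, xs) for i in range(len(xs))]


xs = [0.2, -0.1, 0.4, 0.3]  # made-up demand values
forward = self_attention(xs)
backward = [attention_row(i, xs) for i in reversed(range(len(xs)))]
print(forward == backward[::-1])  # True: evaluation order does not matter
```

In the recurrence, the state at step t cannot be computed before the state at step t - 1, whereas each attention row is a function of the inputs alone. That independence is what allows the attention computation to be batched across time steps on a GPU.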
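The time division discussed above corresponds to the bin width used when orders are aggregated into a sequence of frames. The following sketch shows one possible way to build such frames at two different time divisions. The function name build_frames, the (time, longitude, latitude) record layout, the bounding box, the 2 x 2 grid, and the sample orders are all hypothetical and chosen only for illustration; they are not the preprocessing used in this paper.

```python
from datetime import datetime


def build_frames(orders, start, num_steps, step_minutes, bbox, rows, cols):
    """Count orders per (time step, grid row, grid column).

    orders: iterable of (pickup_time, longitude, latitude)
    bbox:   (min_lon, min_lat, max_lon, max_lat) of the study area
    Row 0 is the southernmost row and column 0 the westernmost column.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    frames = [[[0] * cols for _ in range(rows)] for _ in range(num_steps)]
    for t, lon, lat in orders:
        step = int((t - start).total_seconds() // (step_minutes * 60))
        if not 0 <= step < num_steps:
            continue  # outside the time window
        if not (min_lon <= lon < max_lon and min_lat <= lat < max_lat):
            continue  # outside the study area
        col = min(int((lon - min_lon) / (max_lon - min_lon) * cols), cols - 1)
        row = min(int((lat - min_lat) / (max_lat - min_lat) * rows), rows - 1)
        frames[step][row][col] += 1
    return frames


# Made-up orders and study area, for illustration only.
start = datetime(2017, 5, 1, 8, 0)
bbox = (110.30, 20.00, 110.38, 20.08)
orders = [
    (datetime(2017, 5, 1, 8, 5), 110.31, 20.02),
    (datetime(2017, 5, 1, 8, 20), 110.31, 20.02),
    (datetime(2017, 5, 1, 8, 40), 110.35, 20.06),
    (datetime(2017, 5, 1, 9, 10), 110.31, 20.02),
]

hourly = build_frames(orders, start, 2, 60, bbox, 2, 2)
half_hourly = build_frames(orders, start, 4, 30, bbox, 2, 2)
print(hourly[0])       # [[2, 0], [0, 1]]
print(half_hourly[0])  # [[2, 0], [0, 0]]
print(half_hourly[1])  # [[0, 0], [0, 1]]
```

In this toy example, the hourly frame merges the 8:05, 8:20, and 8:40 orders into one frame, while the half-hourly division separates the 8:40 order into its own frame, so the finer division preserves more of the temporal pattern at the cost of a longer and sparser sequence.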
5. Conclusions
To make more accurate spatiotemporal predictions of online car-hailing travel demand, this paper proposes a new spatiotemporal prediction model based on the Transformer architecture. We used positional encoding, an attention mechanism, and a 3D convolutional network to effectively capture the spatiotemporal relationships in the data. Owing to the parallel mechanism of the Transformer network, our model trains quickly. This study processes the car-hailing order data into a video frame sequence, and the processed data are more in line with the spatiotemporal characteristics of online car-hailing travel data. At the same time, the travel intensity of online car-hailing can be read directly from the predicted results. Compared with an overall travel forecast for Haikou, our experiment can obtain travel demand for small areas within the study area. This research used real 2017 online car-hailing order data from Haikou City to test the performance of the proposed model, and the experiment demonstrated the effectiveness of the proposed method. In practice, the method can be used to predict online car-hailing travel demand for the next hour. Passengers can better understand how online car-hailing travel demand changes across regions and times, make more reasonable travel decisions, and improve their travel efficiency. Online car-hailing drivers can accurately find hot spots of travel demand, reduce their empty driving rate, and increase their income. For urban management personnel, the method helps dispatch vehicles reasonably, meet travel needs in a timely manner, improve the level of urban traffic management, and reduce urban road traffic pressure. This work provides a reference for research on shared travel and promotes the development of shared mobility.
This paper considered only the impact of historical travel on future travel. Although the model achieved good results, the predictions still differ from the original data in some details. In real life, many factors affect residents’ travel, such as weather, points of interest, holidays, and differences in travel intensity across time periods within the same day. Therefore, future work can consider more influencing factors comprehensively. In addition, outbreaks of large-scale infectious diseases, such as COVID-19, have a great impact on residents’ thinking and travel modes. Follow-up research can therefore analyze and predict the travel characteristics of residents using online car-hailing during COVID-19. Finally, in this paper Haikou City was divided only at a single fixed grid scale, yet prediction results differ with the division method. Subsequent research can divide the study area at different scales and compare the results to find the best one.