Each training sample contains many vehicles, and the state of each vehicle differs. When trained with a VAE, the assumed posterior distribution is too simple and the objective function includes a KL divergence term, which is relatively rigid and cannot learn the features of each vehicle well. DCGAN, on the other hand, alleviates the convergence difficulties of the original GAN while inheriting the powerful feature-extraction ability of CNNs, so it can extract the features of each vehicle. Moreover, compared with a VAE, the input of DCGAN is a random vector, so its output is more random and richer. Therefore, this study adopts DCGAN as the basic framework for further improvement.
2.1. Improved DCGAN Model
The overall structure of DCGAN is shown in Figure 1. In order to enable the network to learn the distribution $p_{data}$ of the real samples $x$, we define the random input of the generator as $z \sim p_z(z)$ and map it to the data space according to the generator mapping $G(z)$; we also define a discriminator $D$, whose output $D(x)$ is a scalar representing the probability that $x$ comes from the real dataset rather than from the generated data space $G(z)$. In adversarial training, we train $D$ to maximize the probability of correctly classifying real samples and generated samples, i.e., to maximize $\log D(x) + \log(1 - D(G(z)))$, and we complete the training of $G$ by minimizing $\log(1 - D(G(z)))$; the two networks thus play a minimax game.
DCGAN reflects this theory in the actual training steps by training the two networks alternately. In the first step, the discriminator is trained with mixed samples drawn from the generator and from the training set, completing one training step and improving its ability to distinguish real data from fake data. In the next step, the generator and discriminator are combined into one network in which the discriminator $D$ trained in the previous step is frozen, and the generator $G$ is trained in the current step by minimizing $\log(1 - D(G(z)))$; training then moves to the next cycle. In the end, the two networks compete against each other over the training cycles and improve their abilities until they reach a Nash equilibrium [22].
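A minimal PyTorch sketch of this alternating scheme is given below. It is illustrative rather than the authors' implementation: the generator `G`, discriminator `D`, their optimizers, and the noise dimension `z_dim` are assumed to exist, and the losses are written in the equivalent binary cross-entropy form that is common in practice.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real_batch, z_dim=100):
    n = real_batch.size(0)
    real_lbl = torch.ones(n, 1)
    fake_lbl = torch.zeros(n, 1)

    # Step 1: train D on a mix of real and generated samples (G is frozen).
    z = torch.randn(n, z_dim)
    fake = G(z).detach()
    loss_D = (F.binary_cross_entropy(D(real_batch), real_lbl)
              + F.binary_cross_entropy(D(fake), fake_lbl))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Step 2: freeze D and train G so that D(G(z)) is pushed toward 1.
    z = torch.randn(n, z_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), real_lbl)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```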
The core of DCGAN is the adversarial training of the generator and discriminator, which makes the generated data conform to the distribution of the real data. In order to adapt the basic DCGAN to the scene data tensor, we modified it in this paper. First, the input and output dimensions of the tensor were modified so that the network could output traffic scene data meeting our requirements. Second, because the training set data have a certain time dependence and the longitudinal position of a vehicle increases monotonically with time, this paper combines the Gated Recurrent Unit (GRU) with DCGAN, changing the discriminator of DCGAN to a model built on a GRU network.
2.1.1. Generator Model
The vehicle's natural traffic flow data record the running state of each vehicle on a long, straight, structured road. Because the number of traffic participants is very large, a large amount of data must be generated. Therefore, it is necessary to enlarge the output size of DCGAN to produce the required volume of traffic flow data.
The generator structure of the native DCGAN is shown in Figure 1; it includes a four-layer deconvolutional network that raises the one-dimensional input noise to a 64 × 64 × 3 output to generate sample images. This output size is too small for our purposes. In this paper, an additional deconvolutional layer is added on top of the original network so that the generator output reaches a size of 100 × 100, meeting the requirements for the number of traffic participants and the total number of frames. Meanwhile, since the traffic flow data to be generated in this study record the transformation of vehicle coordinates in each frame, the generated data structure only needs to contain the x and y values corresponding to each ID and frame. Therefore, the final output size is changed to 100 × 100 × 2. The modified generator network is shown in Figure 2.
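The sketch below shows one way such a five-layer deconvolutional generator could be written in PyTorch. The kernel, stride, and channel choices are assumptions made so that the spatial size works out to 100 × 100; the authoritative configuration is the one in Figure 2.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Five-layer deconvolutional generator producing a 2 x 100 x 100 map
    (channels = x/y coordinate, rows = vehicle ID, columns = frame)."""
    def __init__(self, z_dim=100):
        super().__init__()
        def block(c_in, c_out, k, s, p):
            return nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, k, s, p, bias=False),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True))
        self.net = nn.Sequential(
            block(z_dim, 512, 6, 1, 0),          # 1x1   -> 6x6
            block(512, 256, 4, 2, 1),            # 6x6   -> 12x12
            block(256, 128, 3, 2, 0),            # 12x12 -> 25x25
            block(128, 64, 4, 2, 1),             # 25x25 -> 50x50
            nn.ConvTranspose2d(64, 2, 4, 2, 1),  # 50x50 -> 100x100
            nn.Sigmoid())  # match the [0, 1] min-max normalized tensors

    def forward(self, z):  # z: (batch, z_dim)
        return self.net(z.view(z.size(0), -1, 1, 1))
```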
2.1.2. Discriminator Model
The parameters of vehicle trajectories not only have the characteristics of spatial distribution but also have a strong correlation with time, and the DCGAN network structure is not designed to process sequential data [23]. The GRU is a variant of the recurrent neural network (RNN) [24,25]. Like Long Short-Term Memory (LSTM), it is used to solve the vanishing-gradient problem that arises when training RNNs, while being simpler and more efficient than LSTM [26,27,28]. We therefore added a GRU network to the discriminator to improve its ability to discriminate continuous data.
The trajectory coordinate data of traffic participant vehicles in natural traffic flow are highly correlated with time; in particular, the x-coordinate of a traffic participant increases monotonically on the time scale. The generated data should therefore exhibit the same temporal correlation, and the discriminator of DCGAN must be improved so that it can identify feature changes across a sequence. We thus introduced a gated recurrent network into the discriminator D. The GRU has two gates, the update gate and the reset gate, which determine how much past information is used and how much is forgotten.
Its network structure is shown in
Figure 3, and the formula for the update gate is shown below:

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$

where $\sigma$ represents the sigmoid activation function, $W_z$ and $U_z$ represent the weight matrices, $x_t$ is the input vector, and $h_{t-1}$ indicates the hidden state of the previous moment. The calculation formula for the reset gate is:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$

Similar to the update gate, $\sigma$ represents the sigmoid activation function, $W_r$ and $U_r$ represent the weight matrices, $x_t$ is the input vector, and $h_{t-1}$ indicates the hidden state of the previous moment. The candidate memory content is related to the past information retained by the reset gate, and its calculation formula is as follows:

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$$
where $\tilde{h}_t$ represents the new memory content, $\odot$ represents the Hadamard product, and the input $x_t$ and the previous time step's hidden state $h_{t-1}$ first undergo a linear transformation, i.e., left multiplication by the matrices $W$ and $U$, respectively.
The retained memory content is controlled by the update gate and the reset gate, and its calculation formula is as follows:

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$

where $z_t$ is the activation result of the update gate, which also controls the inflow of information in the form of gating; the Hadamard product of $z_t$ and $h_{t-1}$ represents the information from the previous time step that is retained in the final memory, and this information, plus the information from the current candidate memory that is retained in the final memory, equals the output of the gated recurrent unit.
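A direct transcription of these four equations into NumPy might look as follows (bias terms are omitted for brevity; the gating convention matches the formula above, where $z_t$ keeps the old state):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: x_t is the input vector, h_prev the previous hidden
    state, and Wz, Uz, Wr, Ur, W, U the weight matrices."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(W @ x_t + U @ (r_t * h_prev))    # candidate memory
    return z_t * h_prev + (1.0 - z_t) * h_cand        # final hidden state
```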
The improved discriminator based on the GRU network retains a similar basic structure, with GRU units placed after the convolutional layers in place of the original fully connected layer. The final discriminator structure is shown in Figure 4.
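A possible PyTorch rendering of this conv-then-GRU discriminator is sketched below. The layer counts, channel widths, and the choice to treat the downsampled frame axis as the GRU's sequence dimension are assumptions for illustration; Figure 4 defines the actual structure.

```python
import torch
import torch.nn as nn

class GRUDiscriminator(nn.Module):
    """Convolutional feature extractor followed by a GRU head; the input is
    (batch, 2, 100, 100) with rows = vehicle IDs and columns = frames."""
    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, 4, 2, 1),            # 100x100 -> 50x50
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(32, 64, 4, 2, 1),           # 50x50 -> 25x25
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.2, inplace=True))
        self.gru = nn.GRU(input_size=64 * 25, hidden_size=hidden,
                          batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):
        f = self.conv(x)                          # (batch, 64, 25, 25)
        # Use the (downsampled) frame axis as the sequence dimension.
        f = f.permute(0, 3, 1, 2).flatten(2)      # (batch, 25, 64 * 25)
        _, h_n = self.gru(f)                      # final hidden state
        return self.head(h_n[-1])                 # probability of "real"
```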
2.1.3. Loss Function
As shown in Figure 1, in the adversarial training of the model, the generator and discriminator are trained separately at each step and have their own loss functions. The confrontation between the two is a minimax game between the generator loss and the discriminator loss.
For the discriminator, the calculation formula is

$$L_D = -\mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] - \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The discriminator loss function is divided into two parts because the data in the training process come both from the real dataset and from the fake data produced by the generator. The $x \sim p_{data}(x)$ term represents the real sample distribution; this part measures the discriminator's ability to recognize real sample data, and the closer $D(x)$ is to 1, the stronger the discriminator is on real samples. The $z \sim p_z(z)$ term represents the distribution of fake samples generated by the generator; this part measures the discriminator's ability to recognize generated fake data, and the closer $D(G(z))$ is to 0, the stronger its ability to identify generated fake samples. Therefore, combining the meanings of the two parts, training requires maximizing $\mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$, from which we obtain $L_D$.
For the generator, the calculation formula is

$$L_G = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The purpose of the generator is to trick the discriminator into recognizing its data as real. Therefore, the closer $D(G(z))$ is to 1, the better the quality of the data generated by the generator and the smaller the corresponding loss, from which we obtain $L_G$.
Therefore, the loss function of the network in this paper can be expressed as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
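Written out directly (rather than in the cross-entropy form used in the training sketch above), the two losses could be computed as follows, assuming `D` outputs probabilities in (0, 1); a small epsilon guards the logarithms:

```python
import torch

def d_loss(D, G, real, z):
    """L_D = -E[log D(x)] - E[log(1 - D(G(z)))]."""
    eps = 1e-8
    return -(torch.log(D(real) + eps).mean()
             + torch.log(1.0 - D(G(z).detach()) + eps).mean())

def g_loss(D, G, z):
    """L_G = E[log(1 - D(G(z)))]; smallest when D(G(z)) -> 1."""
    eps = 1e-8
    return torch.log(1.0 - D(G(z)) + eps).mean()
```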
2.2. Data Processing
The overall data-processing process of this study is shown in
Figure 5. The dataset-processing stage covers the production of the training set, including scene extraction from the original dataset, data slice segmentation, and design of the neural network input tensor. In the analysis stage, the fully trained generator is used to generate fake data. Finally, the generation quality is evaluated against samples from the original dataset.
This study processed the road vehicle running tracks recorded in the highD dataset and carried out the follow-up research on them. The highD dataset used a bird's-eye view to record the track data of about 110,000 vehicles at six separate locations on flat, straight highways. Location No. 1 has the largest number of records, and its speed limit of 120 km/h is the same as that of China's highways. Therefore, this study selects the recorded data of location No. 1 to build the real sample set. The location is configured with six lanes in both directions, and the total recorded road length is 420 m.
2.2.1. Scene Processing Extraction
In the raw dataset, many vehicles traveling in the opposite direction are also recorded, and their motion patterns have characteristics similar to those of the forward vehicles. In this study, the records of oncoming vehicles are converted by a relative coordinate transformation according to their displacement increments, thereby expanding the records of vehicles driving in the positive direction.
Considering that there may be congestion within the recorded time, which greatly affects the motion of vehicles, this paper divides the dataset into 240 s time intervals and calculates the average speed of the samples in each interval.
The formula for calculating the average speed is as follows:

$$\bar{v} = \frac{1}{N}\sum_{i=1}^{N}\frac{L}{t_i}$$

where $N$ represents the number of vehicles in the time interval, $L$ represents the total length of the road, and $t_i$ represents the time for vehicle $i$ to pass through the road section. In actual processing, the total frame count is divided by the frame rate to obtain the total time.
The motion pattern of vehicles in the same traffic flow scenario can be compared to a uniform flow of water; in order to make all vehicle samples input to the neural network as uniform in length as possible, this paper classifies the scenarios by average vehicle speed. Scenes in the dataset are classified according to the criteria of the Chinese road traffic congestion evaluation method (GA/T115-2020) in Table 1.
Traffic congestion is evaluated at intervals of 180 s to 300 s, which here is set to 240 s. The data processed above are split into units of 240 s, remainders shorter than 240 s are split into units of 180 s, and the average travel speed is calculated for each split. The congestion level is then determined from the average travel speed corresponding to the 120 km/h speed limit in the table. The number of scenes at each congestion level is shown in
Figure 6.
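The sketch below illustrates this slicing-and-labeling step. The per-vehicle transit times are assumed to be available as frame counts, and the speed thresholds are placeholders: the real cut-offs for a 120 km/h road must be taken from Table 1 (GA/T115-2020).

```python
import numpy as np

# Hypothetical thresholds (km/h) standing in for Table 1; illustrative only.
LEVELS = [(70.0, "unblocked"), (50.0, "slightly congested"),
          (30.0, "congested"), (0.0, "seriously congested")]

def mean_travel_speed(transit_frames, road_len_m=420.0, fps=25.0):
    """Average travel speed of one 240 s slice: mean over vehicles of L / t_i,
    where t_i is recovered as frame count / frame rate."""
    t = np.asarray(transit_frames, dtype=float) / fps   # seconds per vehicle
    return (road_len_m / t).mean() * 3.6                # m/s -> km/h

def congestion_level(v_kmh):
    for threshold, label in LEVELS:
        if v_kmh >= threshold:
            return label
```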
2.2.2. Sample Segmentation and Filling
The average time for a vehicle to pass through the recorded area is 14.3 s. Therefore, we segment all records with a 20 s time window to ensure that the vehicle trajectories are relatively complete in each scene segment. The classified scene data were sliced at a sliding interval of one second, and a total of 64,240 scene segments were extracted from the above samples.
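A sliding-window slicer of this kind could look as follows; the `frame` column name follows the highD track files, while everything else is an illustrative assumption:

```python
import pandas as pd

def slice_scenes(df: pd.DataFrame, window_s=20.0, stride_s=1.0, fps=25):
    """Cut one recording into overlapping 20 s scene segments,
    advancing the window by 1 s at a time."""
    win, step = int(window_s * fps), int(stride_s * fps)
    segments = []
    for start in range(0, int(df["frame"].max()) - win + 1, step):
        seg = df[(df["frame"] >= start) & (df["frame"] < start + win)]
        segments.append(seg)
    return segments
```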
However, within the time window, not every vehicle's running track spans from the starting point of the road to its ending point, so in a 20 s scene some vehicles may have missing positions. In this paper, the x-coordinate is extrapolated by a linear function, whereas the y-coordinate is kept constant along the path, as shown in
Figure 7.
It is assumed that the vehicle enters the monitoring area at time $t_a$ and drives out at time $t_b$, and the blue area is the filled value, where

$$x(t) = x(t_a) + \frac{x(t_b) - x(t_a)}{t_b - t_a}\,(t - t_a), \qquad y(t) = \begin{cases} y(t_a), & t < t_a \\ y(t_b), & t > t_b \end{cases}$$
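Under that reading of the filling rule, a minimal NumPy implementation might be (the observed arrays are assumed to cover frames $t_a$ through $t_b$ inclusive):

```python
import numpy as np

def fill_track(x_obs, y_obs, t_a, t_b, window=500):
    """Pad a partial track to the full window: x is extrapolated by the
    linear function through the entry/exit points, y is held constant
    outside the observed interval [t_a, t_b]."""
    t = np.arange(window)
    slope = (x_obs[-1] - x_obs[0]) / (t_b - t_a)    # constant x-velocity
    x = x_obs[0] + slope * (t - t_a)                # linear in t everywhere
    y = np.where(t < t_a, y_obs[0], y_obs[-1]).astype(float)
    inside = (t >= t_a) & (t <= t_b)
    x[inside], y[inside] = x_obs, y_obs             # keep observed samples
    return x, y
```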
2.2.3. Training Set Production
The input size of a neural network is fixed during training and must match the training set data. In this paper, the x- and y-coordinates of the vehicle running tracks are divided into two channels, the coordinates of each vehicle at each time are arranged in a matrix, and the values are normalized and filled into the matrix.
The data tensor input to the network is shown in Figure 8, where each channel contains a matrix with 100 rows and 100 columns: the row index represents the vehicle ID, the column index represents the frame position in the dataset, and the corresponding cell holds the value, for the x or y channel of the vehicle with that ID in that frame, normalized by the following formula:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x^{*}$ is the value that fills the tensor, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the parameter in the training set.
Considering that the original dataset recorded vehicle operation at a frame rate of 25 fps, the displacement difference between adjacent frames is very small. In this study, the coordinate data within each original scene is sampled every 5 frames, compressing the total number of data frames to 100 and yielding the 100 columns of the input tensor.
According to statistics, the maximum number of vehicles appearing in a scene in the highD dataset is about 80. The maximum number of vehicle IDs that can be accommodated in the training set designed in this paper is 100, which is sufficient for all scenes. For incomplete records, a fill value of 0 is assigned to the coordinate data.
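Packing one scene into the input tensor could then be done as follows. The `tracks` format (a list of per-vehicle `(x, y)` arrays, already padded to the full window) and the fill value 0 follow the description above; the rest is an illustrative sketch:

```python
import numpy as np

def build_tensor(tracks, x_range, y_range, n_ids=100, n_frames=100):
    """Pack one scene into a 2 x 100 x 100 tensor: channel 0 = x, channel
    1 = y; row = vehicle ID, column = frame. Absent vehicles stay at 0."""
    sample = np.zeros((2, n_ids, n_frames), dtype=np.float32)
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    for vid, (xs, ys) in enumerate(tracks[:n_ids]):
        xs, ys = xs[::5][:n_frames], ys[::5][:n_frames]  # 500 -> 100 frames
        sample[0, vid, :len(xs)] = (xs - x_min) / (x_max - x_min)
        sample[1, vid, :len(ys)] = (ys - y_min) / (y_max - y_min)
    return sample
```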