1. Introduction
Fully end-to-end driving systems convert raw sensory inputs directly into actionable driving commands via a single neural network model [1,2,3,4]. Imitation learning, especially behavioral cloning, has been the dominant paradigm for training such models [4,5,6,7] over the past ten years. The camera feed is often the only input [8,9,10,11,12,13]. As output, the model attempts to predict human-like steering angle, throttle, and brake values, i.e., “what the human expert would do in the given situation” [8,13]. These actionable “end” commands can be passed into the drive-by-wire system of the vehicle. Alternatively, end-to-mid approaches output a desired trajectory [10], a cost map [14,15], or other not directly actionable output representations [16], which need to be further processed by subsequent modules. In the following, camera-based fully end-to-end driving is employed, although some of the problems raised and solutions proposed generalize to a wider variety of approaches.
Over recent years, different neural network architectures have been employed for mapping inputs to outputs. Convolutional neural networks (CNNs) can efficiently process single-frame inputs [8,17]. To deal with multiple input timesteps, the images can be stacked and processed by a CNN [14,18] or can be encoded by a CNN and then analyzed in sequence by a recurrent part of the network [10,19,20]. Recently, visual transformers have emerged as a powerful tool for merging multiple input sources [7,21,22]. Using auxiliary tasks adds learning signals while also improving interpretability [6,14]. In the following, only CNN-based approaches are utilized due to their relatively low computing and data needs.
As a major limitation of behavioral cloning, the learned task of transforming randomly mini-batched images to actionable commands is not identical to the actual task of driving [6,23,24]. The task of predicting on a static dataset (open-loop testing, off-policy testing) and the task of controlling a vehicle in real time (closed-loop testing, on-policy testing) differ in the following aspects:
The type of mistakes that matter differs. Driving is a sequential decision-making process. Small temporally correlated errors can make the vehicle drift off the safe trajectory little by little [25]. Behavioral cloning with usual loss functions and random minibatch sampling penalizes such consistent but weak biases minimally. Such errors look harmless in off-policy testing. Furthermore, the long-term negative effects of actions are not evaluated in off-policy testing [24].
Incoming data differ. During deployment, the learned policy defines the distribution of situations the vehicle ends up in. This distribution may significantly differ from the training and test set situations resulting from expert driving. Behavioral cloning models trained on human driving may fail to generalize to the distribution caused by self-driving. This well-known effect is often called the distribution shift [1,24,26].
Delays differ. Delays play no role in the predictive behavioral cloning task optimized when creating a model. The loss does not increase if the computation is slow. When deployed in the real world, delays exist between the moment an observation is captured by the camera and the moment the actuators receive the computed response to this observation. This has been discussed for model predictive control [27,28] but not for end-to-end approaches. Due to delays, optimal decisions may become outdated by the time they are applied.
Decision frequency differs. Prediction frequency does not influence the loss in the behavioral cloning task the models are optimized to perform; the loss value does not increase if predictions are made infrequently. However, during driving, infrequent decisions may overshoot their intended effect, resulting in an oscillation around the actual desired path. Furthermore, at low frequencies, the model can simply be too late to react to situations arising between decision ticks.
In our previous work [29], we noticed that the driving speed differed between the data collection drives (including data for off-policy evaluation) and deployment (on-policy testing). In particular, we deployed steering-only models on a real vehicle at 0.5 to 0.8 times the speed the human driver used in the given GPS location. Similarly, ref. [11] used speeds of only up to 10 km/h when deploying models. Indeed, a lower speed is safer for the safety driver and is, in many ways, a smart approach. However, the present work aims to demonstrate that deploying at unusually low speeds can cause an out-of-distribution problem for the steering models.
Intuitively, the on-policy sequential task of driving fast is a different task from driving slow, due to the physics of the world. At higher speeds, the vehicle slips more in turns, and the forces counteracting actuation are higher, increasing actuation delays. Beyond physics, computational delays result in decisions being increasingly late and sparse in terms of space as speed increases. These are intuitive and known effects that one can attempt to reduce by minimizing delays and optimizing hardware. Importantly, this work reveals that the behavioral cloning task itself also becomes different at different speeds. During data collection, the expert driver (human or another model) must account for the chosen speed profile, causing the steering commands, i.e., the labels of the task, to differ. In particular, faster driving necessitates preemptive actions to counter inertia, delays in transmitting commands, actuator delays, and computational delays (within the brain, if the data were collected by a human), while slow driving does not. To illustrate, while the camera image at a given location (e.g., 2 m before a turn) remains very similar at different speeds, the labels (the output a model should predict) differ. In the following, this is referred to as the task shift: a shift in the predictive function from inputs to outputs that the model attempts to learn; in particular, a shift in the correct outputs for a given input. The severity of the task shift between speed levels increases linearly with delays and with the speed difference, as explained in Figure 1.
Human drivers also turn preemptively, so this “task difference” is captured in the recorded data. Even at zero computational delay during deployment, models trained on fast data will be less adequate for driving slowly (and vice versa). There exists a difference in the image–label pairs defining the behavioral cloning task, originating in the physics of the world and in the computational delays and decision frequencies of the data collector (e.g., human reaction time and decision frequency). As a consequence, testing an end-to-end autonomous driving system (ADS) under a novel speed profile will always pose a generalization challenge for the model.
Beyond steering models learned by behavioral cloning, the relevance of the problem of speed-induced task shift depends on the model type and on how the self-driving task is formulated. Models relying on multiple frames or receiving speed as an input will face input out-of-distribution challenges if deployed at a novel speed. Models controlling speed jointly with steering can learn to output adequate speed–steering pairs for a given input if (1) they are exposed to such pairs during training, and (2) the output speed is not artificially clipped in post-processing (for safety reasons, to ensure slow driving). Clipping results in a slow speed being matched with a steering value intended for a higher velocity. Producing a full joint probability distribution or energy landscape over speed and steering values would allow sampling pairs in a desired speed range without the need to clip the speed. In many end-to-mid approaches, the imitation learning task is very similar at different speeds (e.g., predicting a trajectory or cost map) because counteracting the physics is left to the subsequent control module. However, at deployment, the predicted mid-level outputs will become outdated by the time they are produced, and some correction for the change in ego-location is needed.
In general, driving at different speeds always constitutes a different task for the ADS, if not due to computational delays, then due to slip and actuator delays (higher centrifugal force acting against the actuation). The presence of computational delays amplifies this difference during deployment, e.g., on-policy testing. Making decisions late in terms of time does not necessarily lead to crashes (e.g., at speed 0), but being late in terms of space (location on the road) does. Delay × speed, i.e., the spatial belatedness, is an important characteristic of the deployment task the vehicle is performing. Driving fast in the presence of minimal delay is surprisingly similar to driving slow in the presence of significant computational delay, as in both cases, decisions need to be made early in terms of location and hence in terms of camera inputs. Consequently, there are two widely applied ways to counteract spatial belatedness—driving slowly and minimizing delays.
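As a worked illustration with numbers of the order encountered later in this work (illustrative assumed values, not measurements), spatial belatedness can be quantified as follows:

```latex
% Spatial belatedness: the distance travelled between capturing an
% observation and applying the corresponding command.
b = v \cdot d
% Illustrative numbers: at v = 1.1 m/s with a total delay of d = 100 ms,
% b = 0.11 m, i.e., the command is applied 11 cm past the location it was
% computed for; at v = 0.7 m/s, the same delay yields only 7 cm.
```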
Here, a third option is proposed: conditioning behavioral cloning models to match the observation at time T to the label recorded at time T + Δ, where Δ is the expected computation time during deployment. This way, the produced command is relevant at the moment it actually gets applied. Technically, the training set labels must be matched with an earlier frame, so we name this approach label-shifting.
To safely demonstrate the effects of speed, computational delays, and the proposed label-shifting countermeasure, this work utilized 1:10 scale Donkey Car [30,31] mini-cars equipped with a Raspberry Pi 4b and a frontal camera. The models were trained fully end-to-end with behavioral cloning and controlled only the steering of the vehicle. The training and deployment procedure is very similar to what was performed on the real-sized car in our previous work [29]. We believe the lessons learned are transferable to the domain of real-sized cars.
The main contributions of the present work are given as follows:
- 1.
We demonstrate that the performance of well-functioning driving pipelines may fall apart if deployed at a speed the system was not exposed to during training. The underlying reasons and the implications for on-policy testing are explained. To our knowledge, the effect of deployment speed has not previously been discussed in the end-to-end literature.
- 2.
We illustrate, via real-world testing, how the performance of good driving models suffers due to computational delays. The presented results demonstrate that label-shifting the training data makes it easy to alleviate the problem, reducing the effect of delays. Incorporating delay anticipation into end-to-end models has not been attempted before.
2. Materials and Methods
In this work, Donkey Car S1 1:10 scale mini-cars were employed to study the effect of speed and the effect of computational delays. The practical part of the work was conducted independently as two master's theses (the two theses can be found at https://comserv.cs.ut.ee/ati_thesis/datasheet.php?id=74970&language=en and https://comserv.cs.ut.ee/ati_thesis/datasheet.php?id=75358&language=en, both accessed on 1 December 2023). In this section, first, the overall setup and hypotheses of the two sub-studies of this work are defined. Thereafter, descriptions are provided for the hardware used, for the methods of data collection and organization into datasets, for the model architectures used, and for the performance evaluation methods.
2.1. Experimental Design
2.1.1. Study on the Effect of Speed
To demonstrate the effect of speed on model performance, behavioral cloning models [32] predicting the steering angle were trained on data collected at a certain speed (low or high), and their generalization ability to the other, novel speed was evaluated. The low speed was chosen as the lowest speed achievable with the vehicle (the torque is insufficient at lower throttle values). The high speed was chosen as the maximal comfortable speed for the human data collector. The resulting lap times are perceivably and statistically clearly different. Single-frame models considering only the latest frame and multi-frame models considering the past 3 frames were trained; their architectures are given below. We hypothesized that a novel speed causes a performance drop in the ability to predict the labels in the off-policy behavioral cloning task, as well as in the on-policy task of driving on the track.
Furthermore, we hypothesized that multi-frame inputs become out-of-distribution (OOD) if the data originate from a novel speed. The input images are increasingly dissimilar from each other as speed grows, posing a generalization challenge for models that have developed visio-temporal features assuming a specific speed. In particular, we hypothesized that the out-of-distribution effect of novel-speed inputs results in measurably different activation patterns inside the networks, as shown for OOD inputs in other domains [33,34].
2.1.2. Study on Counteracting the Effect of Delays via Label-Shifting
In the second study, the effect of computational delays during deployment was quantified and counteracted. The dataset was collected at high speeds to amplify the effect of delays and render the results more evident. This collected dataset was transformed into a variety of training sets by shifting the steering values (labels) by one or multiple positions back or forward in time, matching labels with previous or subsequent frames. This way, models that predict optimal commands for the past, present, or future were created. These models were deployed on the track in the presence of different amounts of computational delays, and their driving performance was measured.
In particular, the existing computational delay was increased artificially by inserting a waiting time (25 ms, 50 ms, 75 ms, or 100 ms), via the time.sleep() function, after the neural network computation finished and before the command was sent out. The longer compute delay imitates using a larger network, a weaker compute environment, or more processes sharing the compute resources. We hypothesized that, by default, a model's ability to drive deteriorates quickly as delays increase. Additionally, we hypothesized that models predicting future commands can perform better in the presence of increasing delays, as they implicitly take their presence into consideration.
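As a minimal sketch of where the artificial delay is injected in a generic control loop (the camera, model, and actuator objects are hypothetical stand-ins, not the actual Donkey Car part interface):

```python
import time

ARTIFICIAL_DELAY_S = 0.050  # one of 0.025, 0.050, 0.075, 0.100


def control_tick(camera, model, actuator):
    # Observation captured at time T.
    frame = camera.read()
    # Neural network inference contributes the inherent computational delay.
    steering = model.predict(frame)
    # Extra waiting time injected after inference and before sending out
    # the command, as in the experiments described above.
    time.sleep(ARTIFICIAL_DELAY_S)
    # The command computed for time T is applied only now.
    actuator.set_steering(steering)
```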
2.2. Hardware Setup and Data Collection Procedure
This work was performed using Donkey Car open-source software version 4.3.22 [30] deployed on the 1:10 scale Donkey Car S1 platform (https://www.robocarstore.com/products/donkey-car-starter-kit, accessed on 15 January 2022), equipped with an MM1 control board, a Raspberry Pi 4b, a Raspberry Pi wide-angle camera (160°), and two servomotors (steering, throttle). The turning diameter was 140 cm, and the maximum speed was several meters per second. The vehicle, referred to as the mini-car from here on, was deployed on a track 60–80 cm wide and 17 m long (Figure 2). Data collection and deployment were always performed under the same light conditions: either a cloudy afternoon or an evening with artificial lighting.
During data collection, the steering was controlled by the researcher using a Logitech F710 gamepad or by a competent self-driving model running on board the mini-car (explained below). The messaging delay from the gamepad to the Raspberry Pi is not perceivable. However, actuator delays become perceivable when driving fast, likely due to inertia and the friction of the wheels with the ground. Actuator delays and the reaction time of the driver (approximately 250 ms for humans [35]) result in different driving styles when driving fast and slow.
The throttle can be fixed at a constant value by the human operator. The speed resulting from a constant throttle value depends on the battery level and on the servomotors heating up; the mini-car cannot measure or maintain a constant speed. In this study, driving speed was therefore defined via the actually achieved lap times, not via the throttle value. During data collection and on-policy testing, the operator steadily increased the throttle value to maintain stable lap times as the car heated up and the batteries drained.
In all conducted driving experiments, all computation happened on board, in the Raspberry Pi 4b device. This holds for both on-policy testing and data collection performed with the help of a competent “teacher” model. The duration of different computations, including neural network computing time, was measured by a built-in function of the Donkey Car software v 4.3.22. Model training took place on CPUs in laptops or in Google Colab with GPU access and took up to a few hours per model.
Quality of Driving Data
During data collection, the authors found it difficult to give intermediate-valued commands with the Logitech F710 gamepad's analog stick, especially at higher speeds, where actions are rushed. The need for steeper turning to counter inertia was exacerbated by the operator's limited reaction speed and movement precision. This risked creating an artificial difference between slow and fast driving data, caused by the operator's ability and not inherent to the tasks.
To bypass this issue, a teacher agent capable of driving at various speeds was utilized in the part of this study comparing the tasks of driving fast and driving slow. The teacher model used the same architecture and training setup as the single-frame models defined below but had been exposed to a variety of speeds during training. The same teacher agent collected training datasets at two different (slow and fast, defined below) speeds.
2.3. Data Preparation
The cleaned (infractions removed, speed in the designated range) datasets were prepared to demonstrate the two effects.
Study of speed. In the slow-speed dataset (19,250 frames), the 17 m lap was, on average, completed in 24.25 ± 1.9 s, i.e., at a speed of 0.7 m/s. The fast dataset (20,304 frames) corresponded to an average lap time of 14.85 ± 0.8 s, i.e., a speed of 1.1 m/s. Both sets were collected by the teacher agent driving in the evening with artificial lighting. From these two sets of recordings, single- and multi-frame datasets were created. In the latter, each data point consisted of three frames matched with the steering command recorded simultaneously with the third frame.
Five-fold cross-validation was performed by dividing the data into 5 blocks along the time dimension. In off-policy evaluations, the average validation results across the five folds are reported. For multi-frame models, the data were split into several periods along the time axis, and a continuous 1/5 of each period was assigned to each of the 5 folds. For both model types, new models were trained on the entirety of the given-speed dataset for on-policy evaluations to make maximal use of the data and achieve the best possible performance.
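A minimal sketch of the multi-frame split described above, under the assumption that the recording is indexed by timestep (the function name and the number of periods are illustrative):

```python
import numpy as np


def temporal_fold_assignment(n_samples, n_folds=5, n_periods=10):
    """Split the recording into periods along the time axis and assign a
    contiguous 1/n_folds slice of each period to each fold, keeping
    multi-frame sequences temporally contiguous within a fold."""
    fold_of = np.empty(n_samples, dtype=int)
    period_edges = np.linspace(0, n_samples, n_periods + 1, dtype=int)
    for p in range(n_periods):
        start, stop = period_edges[p], period_edges[p + 1]
        slice_edges = np.linspace(start, stop, n_folds + 1, dtype=int)
        for f in range(n_folds):
            fold_of[slice_edges[f]:slice_edges[f + 1]] = f
    return fold_of


# Example: validation indices of fold 0 for the 20,304-frame fast dataset.
folds = temporal_fold_assignment(20304)
val_idx_fold_0 = np.where(folds == 0)[0]
```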
Study of counteracting the effect of delays by label-shifting. All data in this study were recorded by a very proficient human driver at an average lap time of 8.32 ± 0.41 s, i.e., approximately 2.0 m/s. Data were collected in the afternoon with no direct sunlight or shadows on the track. Datasets matching camera frames with commands recorded up to 100 ms before and up to 200 ms after the frame capture were created. In total, there are seven datasets, with the labels shifted by −100 ms, −50 ms, 0 ms, 50 ms, 100 ms, 150 ms, and 200 ms (due to recording at 20 Hz, shifting by one position corresponds to 50 ms). Each dataset was divided into training and validation sets with a random 80/20 percent split (46,620 and 11,655 frames, respectively). The validation set was used only for early stopping.
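To illustrate how the seven training sets can be derived from a single recording, the following sketch assumes the recording is loaded as a pandas DataFrame with one row per frame and a 'steering' column (a hypothetical schema; the actual Donkey Car tub format differs):

```python
import pandas as pd

RECORD_HZ = 20  # one position in the recording corresponds to 50 ms


def shift_labels(records: pd.DataFrame, shift_ms: int) -> pd.DataFrame:
    """Match each frame with the steering label recorded shift_ms later
    (negative values match labels recorded earlier); frames whose shifted
    label falls outside the recording are dropped."""
    positions = round(shift_ms * RECORD_HZ / 1000)  # e.g., 150 ms -> 3
    shifted = records.copy()
    shifted["steering"] = shifted["steering"].shift(-positions)
    return shifted.dropna(subset=["steering"])


# `records` is assumed to hold the cleaned high-speed recording.
datasets = {ms: shift_labels(records, ms)
            for ms in (-100, -50, 0, 50, 100, 150, 200)}
```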
2.4. Architectures and Training
In this work, relatively simple types of neural networks were chosen. Firstly, the computation must run in real time on a Raspberry Pi 4b device, restricting us to a low image resolution and limited network depth. Advanced approaches like visual transformers have not been validated to perform under these restrictions, while the chosen architectures have been used by the Donkey Car community and are known to perform sufficiently well on similar hardware setups. Secondly, more complicated network types usually require more training data to converge. Practically, the simplest model types could perform the task and were sufficient for this study.
For investigating the effect of speed on model performance, four types of models were trained:
- 1.
Single-frame CNN architecture, trained on fast data.
- 2.
Single-frame CNN architecture, trained on slow data.
- 3.
Multi-frame CNN architecture, trained on fast data.
- 4.
Multi-frame CNN architecture, trained on slow data.
The study on label-shifting to correct the effect of delays used only the single-frame architecture. The architectures used are summarized in Table 1 and Table 2.
The default training options in the Donkey Car repository were used: the mean squared error (MSE) loss function and the Adam optimizer with weight decay, both with default parameters. Early stopping was invoked if no improvement on the validation set was achieved in 5 consecutive epochs, with the maximum epoch count fixed at 100.
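A minimal Keras sketch of this training configuration (build_model() and the data arrays are hypothetical placeholders; the exact optimizer class depends on the TensorFlow version shipped with Donkey Car):

```python
import tensorflow as tf

# build_model() stands in for the single- or multi-frame CNN from
# Tables 1 and 2.
model = build_model()
model.compile(optimizer=tf.keras.optimizers.AdamW(),  # Adam with weight decay
              loss="mse")                             # MSE loss

# Stop if the validation loss shows no improvement for 5 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

model.fit(train_x, train_y,
          validation_data=(val_x, val_y),
          epochs=100,             # maximum epoch count
          callbacks=[early_stop])
```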
2.5. Evaluation Metrics in the Study of Speed
The off-policy metric used in the study of speed was the validation set mean absolute error (MAE) as averaged over the 5 folds of cross-validation. On-policy behavior was observed when deploying the models on the vehicle at a fixed low or high speed (the same two speed ranges as in the training data). The main on-policy metric was the number of infractions, i.e., collisions with walls, during 10 laps.
Measuring the Out-of-Distribution Effect
The following analysis intends to demonstrate that, for multi-frame models, novel-speed validation data cause activation patterns in the network to become more distinct from the patterns generated by the training data than same-speed validation data do. Prior works have shown that out-of-distribution inputs cause detectably different activation patterns (i.e., embeddings) in the hidden layers of a network [33,34].
To this end, for every multi-frame model in the 5-fold cross-validation, for both speeds, the following steps were performed:
- 1.
Using training data, the final embedding layer neuron activations in three possible locations on the computational graph are computed: (a) after the matrix multiplication, (b) after batch normalization (BN), and (c) after both BN and ReLU activation. For each possible extraction location, the analysis is run separately. These activation vectors are called the reference activations.
- 2.
Similarly, neural activations on the validation set data points of the same speed dataset are computed. The resulting activations are referred to as same-speed activations.
- 3.
Every validation sample is described by a measure of distance to the reference set, defined as the average distance to the 5 nearest reference set activation vectors. Euclidean and cosine distances are employed as the proximity measures, and a separate analysis is performed for each (ref. [33] proposed using the Mahalanobis distance, but in our experience these simpler metrics perform competitively across different datasets).
- 4.
The activation patterns for the entirety of the other-speed dataset are computed. These activation vectors are called the novel speed activations. The distances of these activation patterns to the reference set according to the same metric are computed.
- 5.
Approximately, the further the activation patterns are from the training patterns, the further out-of-distribution the data point is judged to be for the given model [33]. By setting a threshold on this distance, one can attempt to separate the same-speed and novel-speed activations. The assumption was that novel-speed activations are more distinct and mostly fall above the set distance threshold. The AUROC of such a classifier is computed and presented as the main separability metric (see the sketch after this list).
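A sketch of the distance and separability computation in steps 3–5, using scikit-learn (the activation matrices are assumed to hold one row vector per sample, extracted at one of the three locations; variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors


def knn_distance(reference, queries, k=5, metric="euclidean"):
    """Average distance of each query activation vector to its k nearest
    neighbours among the reference (training set) activations."""
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(reference)
    dists, _ = nn.kneighbors(queries)
    return dists.mean(axis=1)


# ref_act: training activations; same_act / novel_act: validation
# activations from the same-speed and novel-speed datasets.
d_same = knn_distance(ref_act, same_act)
d_novel = knn_distance(ref_act, novel_act)

# AUROC of a threshold classifier separating novel- from same-speed data.
labels = np.r_[np.zeros(len(d_same)), np.ones(len(d_novel))]
auroc = roc_auc_score(labels, np.r_[d_same, d_novel])
```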
In the Results, the average Euclidean and cosine distances to the reference set for the same-speed and novel-speed validation data are reported, averaged over the 3 possible extraction points. Averaging over the extraction locations is performed because there is no a priori knowledge of which extraction location to choose. Additionally, for all possible combinations of model, metric, and extraction location, the AUROC metric is computed, quantifying whether activation patterns emerging in response to novel-speed data can be separated from those resulting from same-speed data.
We acknowledge that, despite conscious efforts to guarantee similar conditions, the lighting might have changed slightly between the fast and slow data collection. Also, limited motion blur can be seen in the fast-speed data. These two sources can cause an additional input distribution shift between the collected slow- and fast-driving data, beyond the increased frame-to-frame difference. To eliminate these other sources of input change, synthetically sped-up validation data were generated by skipping frames in the validation recordings. This analysis was performed only for slow-speed data; methods to generate artificially slow data based on fast recordings are more complicated and do not guarantee perfectly in-distribution individual frames. For every slow-speed validation set image triplet, i.e., frames from timesteps (t, t + 1, t + 2), a matching sped-up validation triplet from timesteps (t, t + 2, t + 4) was constructed. The network activations and the distances to the reference set were computed (as explained above) for the triplets in these two validation sets. The resulting distance values in the two sets are paired and not independent. Hence, instead of measuring AUROC, a one-sided Wilcoxon signed-rank test was applied to compare the two lists of distances. The procedure of generating artificially fast data and comparing the resulting distances was performed for every model (i.e., every validation set) in the 5-fold cross-validation, for all distance metrics and activation vector extraction points.
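A sketch of the frame-skipping construction and the paired comparison (scipy's wilcoxon implements the one-sided signed-rank test for paired samples; frames and the distance lists are assumed to come from the computations described above):

```python
import numpy as np
from scipy.stats import wilcoxon


def make_triplets(frames, step=1):
    """Build (t, t + step, t + 2*step) triplets; step=1 reproduces the
    original-speed triplets, step=2 the synthetically sped-up ones."""
    n = len(frames) - 2 * step
    return [np.stack([frames[t], frames[t + step], frames[t + 2 * step]])
            for t in range(n)]


# Pair triplets by their starting timestep t; step=2 loses the last two.
orig_triplets = make_triplets(frames, step=1)[:len(frames) - 4]
fast_triplets = make_triplets(frames, step=2)

# d_orig, d_fast: per-triplet distances to the reference activations,
# computed as sketched earlier (paired by starting frame).
stat, p = wilcoxon(d_fast, d_orig, alternative="greater")
```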
2.6. Evaluation Metrics in the Study of Delays
As discussed earlier, driving slowly can compensate for higher compute delays, as the root cause of failure is spatially late decisions. It was also concluded that driving at different speeds is a different task, as is driving in the presence of different delays. Consequently, when studying the deterioration of performance as delay grows, the other factor, the speed, must be fixed.
As the first evaluation, it is determined which delay and label-shifting combinations allow the vehicle to drive at the training set speed, i.e., perform the original task in terms of speed. Here, driving at the designated speed is defined as achieving lap times no longer than the training set mean lap time plus 2 standard deviations (x̄ + 2s).
As delays and speed combine multiplicatively to cause spatial belatedness and failure, a model capable of driving faster must have counteracted the effect of delays more effectively. The highest safe driving speed is therefore also a revealing metric. The highest possible speed is determined by deploying the model–delay pair and gradually increasing the speed until the vehicle starts to crash regularly. The speed is then readjusted to be slightly slower, and the vehicle attempts around 25 laps at this highest speed. The average lap time over these laps is reported for each model–delay combination.
2.7. Code and Data Availability