1. Introduction
Research and development in machine learning, and more precisely in deep learning, has led to many discoveries and practical applications in different domains. One domain where machine learning has a huge impact is the automotive industry and the development of fully autonomous vehicles. Machine learning solutions are used in several autonomous vehicle subsystems, such as perception, sensor fusion, simultaneous localization and mapping, and path planning. In parallel with work on the full autonomy of commercial vehicles, the development of various light automotive platforms is a current trend; for example, delivery vehicles and various robots and robot-cars are used in warehouses. The main idea of our work was to develop an autonomous driving solution for a light automotive platform that has limited hardware resources, processor power, and memory size. With those hardware restrictions in mind, we aim to design a light deep neural network (DNN), an end-to-end neural network able to perform the task of autonomous driving on a representative track, while the trained model used for inference can be deployed on a low-performance hardware platform.
An autonomous driving system can generally be divided into four blocks: sensors, the perception subsystem, the planning subsystem, and vehicle control, Figure 1. The vehicle senses the world using many different sensors mounted on it. The information from the sensors is processed in a perception block, whose components combine sensor data into meaningful information. The planning subsystem uses the output of the perception block for behavior planning and for both short- and long-range path planning. The control module ensures that the vehicle follows the path provided by the planning subsystem and sends control commands to the vehicle.
The end-to-end deep neural network we designed for autonomous driving uses camera images, i.e., raw pixels, as input and produces steering angle predictions as output to control the vehicle, Figure 2. End-to-end learning refers to training a neural network from input to output without human interaction or involvement in intermediate processing steps. The purpose of end-to-end learning is that the system automatically learns internal representations of the necessary processing steps, such as the detection of useful road characteristics, based only on the input signal.
Nowadays, machine learning applications are increasingly deployed to embedded devices, mobile phones, and Internet of Things (IoT) solutions. Deployment of machine learning solutions to embedded hardware platforms drives developments in two directions: the development of novel hardware platforms able to process the data needed for machine learning inference, and the development of novel light machine learning architectures and model implementations suitable for low-performance hardware.
The known solutions for end-to-end learning for autonomous driving [1,2,3,4,5] are developed mostly for real vehicles, where the machine learning model used for inference is deployed on a high-performance computer, usually located in the trunk of the vehicle, or they use very deep neural networks that are computationally expensive (e.g., the ResNet50 architecture used in [3]). In contrast, our idea was to develop a significantly smaller solution, a light deep neural network with performance during autonomous driving similar to the known solutions, but with a smaller computational cost that enables deployment on an embedded platform. This lighter solution is intended for robot-cars, embedded automotive platforms able to carry goods or perform similar tasks along relatively known trajectories.
In this paper, we present a novel deep neural network developed for end-to-end learning for autonomous driving, called J-Net, which is designed for embedded automotive platforms. In addition, for the purpose of an objective evaluation of J-Net, we discuss the results of our re-implementations of PilotNet [1,2] and AlexNet [6]. AlexNet was originally developed for object classification, but after our modifications the new AlexNet-like model is suitable for autonomous driving. First, the novel deep neural network architecture J-Net is developed, and the model is trained and used for inference during autonomous driving. Second, we compare the three network architectures, J-Net, PilotNet, and AlexNet, and discuss their computational complexity; this is done in order to have an objective assessment of our network architecture. Next, the implemented models are trained with the same dataset that we collected. Data collection and inference are done in the self-driving car simulator designed in the Unity environment by Udacity, Inc. [7]. Finally, the trained models are used for real-time autonomous driving in the simulator environment. The results of autonomous driving using each of the three deep neural network models are presented and compared. The results are given as video recordings of autonomous driving on a representative track in the simulator environment, in addition to a qualitative and quantitative evaluation of autonomous driving parameters during inference.
The results show that, while the complexity and size of the novel network are smaller than those of the other models, J-Net maintains the performance, achieving similar efficiency in autonomous driving. Aiming for implementation on an embedded automotive platform amplifies the importance of a computationally light DNN solution for autonomous driving, since embedded systems may suffer from hardware limitations and onboard computers that are not capable of running state-of-the-art deep learning models.
In the next section, we present the related work. The autonomous driving system used for data collection and autonomous driving is described in Section 3. The dataset and data collection strategy are given in Section 4. The proposed approach, architecture, and implementation details of J-Net are given in Section 5. The implementation of the models, the comparison of network architectures and parameters, and the training strategy for all three solutions are presented in Section 6. Results and discussion of the implementation of all three neural networks and inference during autonomous driving are given in Section 7. The conclusion is given in the last section.
2. Related Work
Deep learning is a machine learning paradigm, part of a broader family of machine learning methods based on learning data representations [8,9,10,11]. Representations in one layer of a deep neural network are expressed in terms of other, simpler representations from the previous layers of the network. A core ingredient of deep neural networks is the convolutional network [12]. Convolutional neural networks (CNNs) are a specialized kind of neural network for processing data that has a known grid-like topology. CNNs combine three architectural ideas: local receptive fields, shared weights, and spatial or temporal sub-sampling, which leads to some degree of shift, scale, and distortion invariance. Convolutional neural networks are designed to process data that comes in multiple arrays (e.g., color images, language, audio spectrograms, and video) and benefit from the properties of such signals: local connections, shared weights, pooling, and the use of many layers. For that reason, CNNs are most commonly applied to analyzing visual imagery.
Deep learning for computer vision is used in various commercial and industrial systems and products, such as automotive, security and surveillance, augmented reality, smart home applications, retail automation, healthcare, and the game industry; Figure 3. Convolutional neural networks were some of the first deep models to perform well and some of the first networks to solve important commercial applications. One of the first practical applications of convolutional neural networks was in the 1990s by a research group at AT&T, Inc., which used a CNN for optical character recognition and reading checks [13]. Later, several optical character recognition and handwriting recognition solutions were developed based on convolutional neural networks [14], while more recent applications of CNNs to computer vision are numerous [15,16,17,18].
A significant contribution to the development of convolutional neural networks and deep learning architectures has been made by the ImageNet Large Scale Visual Recognition Challenge [19]. Over several years, the architectures that won this competition have represented the state of the art in neural networks and deep learning, becoming building blocks and inspiration for new solutions. Some of the best-known architectures are AlexNet, designed by the SuperVision group from the University of Toronto [6]; the VGG-16 model designed by the VGG (Visual Geometry Group) from the University of Oxford [20]; GoogLeNet, designed by researchers from Google, Inc. [21], which introduced inception modules; the Residual Neural Network (ResNet), designed by researchers from Microsoft Research [22], with a depth of up to 152 layers; and ReNet, designed by researchers from Politecnico di Milano and the University of Montreal [23]. Some of the recent breakthroughs in deep learning are automated machine learning [24], training deep networks with synthetic data [25], video-to-video synthesis [26], playing the game of Go [27,28], and end-to-end learning [29,30,31].
The first successful attempts at developing autonomous vehicles started in the 1950s. The first fully autonomous vehicles were developed in 1984 [32,33] and in 1987 [34]. A significant breakthrough in the field of autonomous vehicles was made during the Defense Advanced Research Projects Agency (DARPA) challenges, the Grand Challenge events in 2004 and 2005 [35,36] and the Urban Challenge in 2007 [37], where it was demonstrated that machines can independently perform the complex human task of driving.
Although prototypes of autonomous vehicles are currently being tested on regular streets, some of the challenges of autonomous driving are not yet completely solved. Current challenges in autonomous vehicle development are sensor fusion [38,39,40,41], higher-level planning decisions [42,43,44,45,46], end-to-end learning for autonomous driving [1,2,3,4,5,47,48,49], reinforcement learning for autonomous driving [5,50,51,52,53], and human-machine interaction [54,55]. A systematic comparison of deep learning architectures used for autonomous vehicles is given in [56], while a short overview of sensors and sensor fusion in autonomous vehicles is presented in [57].
The aim of our work was to achieve end-to-end learning using only camera images as input. Although many sensors are used in autonomous vehicles, such as lidar, radar, sonar, a global positioning system, an inertial measurement unit, and wheel odometry, the camera is an indispensable sensor in autonomous driving, enabling an autonomous vehicle to visualize its surroundings. Cameras are very effective at texture interpretation and classification, widely available, and more affordable than other sensors used for detection, such as radar or lidar. The limitation of the camera is the computational power needed for processing its data.
In this paper, we present a novel network as a simplified solution for end-to-end learning for autonomous driving. The only input to our autonomous driving system is the camera image, the raw pixels; the output is the steering angle prediction. The aim is to achieve a computationally light model that can be deployed on an embedded automotive platform for real-time inference. Developing an inexpensive machine learning solution in terms of computational power and memory resources is not easy to achieve [58,59,60,61,62]. Techniques that enable efficient processing of deep neural networks to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of deep neural networks in artificial intelligence systems [62].
The goal of machine learning algorithms is to solve the problem they address with the highest possible accuracy. This often leads to very complex and deep neural networks that are computationally demanding [60,61], especially in deep learning for computer-vision-based applications. For example, some well-known models that use a large number of layers are VGGNet (16 to 19 layers) [20], GoogLeNet (a 22-layer inception architecture) [21], ResNet (152 layers) [22], and similar network architectures based on them. The progress in convolutional neural network development and applications, and the experimentation with ever more complex architectures, are a consequence of two factors: the availability of large amounts of data and improved computational efficiency.
However, application on real-time embedded platforms with limited computational power and memory space calls for a different approach [62]. Since the execution of a DNN depends heavily on the weights, also called model parameters, a deep neural network architecture suitable for embedded platforms is a smaller model that needs to communicate less data [58]. This is challenging to achieve, especially when the application is computer vision with a high-quality image as input. Reducing the depth and the number of parameters of a neural network often leads to accuracy degradation. Hence, when developing deep neural networks for computer vision on embedded platforms, we look for a solution that offers both acceptable accuracy and the possibility of inference on hardware platforms with limited capabilities. Therefore, our goal is to develop a deep neural network architecture that achieves acceptable accuracy, namely successful autonomous driving on a representative track, but operates in real time within the power and energy limitations of its target embedded automotive platform.
In order to achieve this, the approach we took was not to reduce the parameters of one of the well-known neural networks that can be used for autonomous driving, but to start from scratch, developing a network architecture layer by layer until we found a satisfactory solution. The general workflow for finding an appropriate model size is to start with relatively few layers and parameters, and to increase the size of the layers or add new layers until returns diminish with regard to the validation loss.
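As an illustration only, this workflow can be sketched as a simple loop that grows a candidate network until the validation loss stops improving. The snippet below is our hedged sketch in Keras-style Python; build_model(), x_train/y_train, and x_val/y_val are hypothetical placeholders, not names from this work.

```python
# Sketch of the "grow the network until validation returns diminish" workflow.
# build_model() and the data arrays are hypothetical placeholders; this is an
# illustration of the described procedure, not the authors' training script.
def find_model_size(build_model, x_train, y_train, x_val, y_val, max_layers=6):
    best_loss, best_model = float("inf"), None
    for depth in range(1, max_layers + 1):
        model = build_model(num_conv_layers=depth)       # candidate architecture
        model.compile(optimizer="adam", loss="mse")      # steering-angle regression
        history = model.fit(x_train, y_train,
                            validation_data=(x_val, y_val),
                            epochs=5, verbose=0)
        val_loss = min(history.history["val_loss"])
        if val_loss >= best_loss:                        # no improvement: returns diminished
            break
        best_loss, best_model = val_loss, model
    return best_model
```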
In summary, the contributions of the approach proposed in this paper are:
We modified the convolutional neural network for image classification, AlexNet, and used the new AlexNet-like model for end-to-end learning for autonomous driving.
We proposed a novel deep neural network J-Net that, despite having a light architecture in comparison with AlexNet and PilotNet, is able to successfully perform autonomous driving. This recommends the J-Net model for deployment on low-performing embedded automotive platforms.
We have demonstrated the suitability of the newly proposed network, J-Net, for autonomous driving on embedded automotive platforms through a performance evaluation of autonomous driving in a simulated environment, where J-Net shows the best performance in terms of latency and frame rate among the three implemented solutions.
5. Proposed Approach
The leading idea during the design process was to achieve end-to-end autonomous driving using a light model (the computationally least expensive model), while simultaneously achieving the best possible performance, in our case autonomous driving on a representative path. The computationally least expensive model is usually the model with the fewest parameters, which affects its memory footprint and the amount of computation. Therefore, the type and size of the layers, the kernel sizes, and the number of feature maps influence the computational performance. The performance of autonomous driving was examined in a self-driving car simulator.
The final architecture of the J-Net model was the result of experimenting with the building blocks of deep neural networks: different numbers of layers, kernel sizes of the convolutional layers, numbers of feature maps, placement of pooling layers, and, at the end, the size and number of the fully-connected layers. A block diagram of the initial experimental shallow CNN is presented in Figure 11.
The first step in the design of the novel solution was to use an extremely shallow CNN; we performed a 2D convolution operation on the raw data of the input image:

$$ S(i, j) = (I * w)(i, j) = \sum_{m}\sum_{n} I(i + m, j + n)\, w(m, n), $$

where the three-channel input image I had dimensions 320 × 160 and the two-dimensional kernel w had size 2 × 2. The kernel was used to extract patches of the image in the convolution operation. The output of the convolution operation was S(i, j), the two-dimensional feature map tensor. The weights w were shared across patches for a given layer in a CNN, so that a particular representation is detected regardless of where in the image it is located. The width, height, and depth of the convolutional output layer are calculated as:

$$ W_{out} = \frac{W - F + 2P}{S} + 1, \qquad H_{out} = \frac{H - F + 2P}{S} + 1, \qquad D_{out} = K, $$

where W and H are the width and height of the input layer, F is the filter (kernel) size, S is the stride, P is the padding, and K is the number of filters. In our first experiment, the input image was 320 × 160, F = 2, the stride was S = 1, and we did not use padding, P = 0. This led to output layer sizes W_out = 319 and H_out = 159. The output depth was equal to the number of filters, D_out = 16 in this case. The output volume after the convolution operation is:

$$ V = W_{out} \times H_{out} \times D_{out}, $$

which was 811,536. After the convolution, the ReLU activation function was applied:

$$ f(x) = \max(0, x). $$

The ReLU activation function is the most commonly used activation in hidden layers of DNNs. The main advantage of ReLU over other activation functions, such as sigmoid or tanh, is that with ReLU there is no vanishing gradient problem [59].
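For concreteness, the following small Python helper (a sketch we add for illustration, not code from the original implementation) reproduces the arithmetic above: the output width and height from (W − F + 2P)/S + 1, the output volume, and the standard trainable-parameter count (F · F · D_in + 1) · K for a convolutional layer.

```python
def conv_output_size(w, h, f, s=1, p=0):
    """Width and height of a convolutional output layer."""
    return (w - f + 2 * p) // s + 1, (h - f + 2 * p) // s + 1

def conv_params(f, d_in, k):
    """Trainable parameters of a conv layer: weights plus one bias per filter."""
    return (f * f * d_in + 1) * k

# First shallow-CNN experiment: 320x160x3 input, 2x2 kernel, 16 filters.
w_out, h_out = conv_output_size(320, 160, f=2)   # -> (319, 159)
volume = w_out * h_out * 16                      # -> 811,536 output nodes

# First J-Net convolution: 320x65x3 input after cropping, 5x5 kernel, 16 filters.
print(conv_params(f=5, d_in=3, k=16))            # -> 1216 trainable parameters
```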
This convolution was followed by a flatten layer, where the two-dimensional feature maps were converted into a one-dimensional vector. The flatten layer did not introduce new trainable parameters, since it only rearranges the nodes from the previous layer into a single dimension. Finally, once we had the spatial features of the image as a result of the convolution, we applied a fully-connected layer in which all nodes from the flatten layer were combined into a single output that directly predicted the steering angle value. Since there is a single output node, this is a regression network.
The model was trained with the previously collected dataset, using the mean squared error (MSE) loss function and the Adam optimizer, as described in Section 6.2. The results during real-time inference were, as expected, not good; the car was not able to stay on the road. However, qualitative performance evaluation showed that the model still learned some useful features. As can be seen from the video of autonomous driving in the simulator environment [63], the model learned to follow a line, so the vehicle drove along the edge of the lake, following the line between the ground and the water. This small experiment showed that the chosen direction was appropriate, but the model needed to extract more features in order to drive autonomously with success.
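A minimal Keras-style training setup reflecting this description might look as follows. This is a hedged sketch under the assumption that the model is built with Keras and that the collected frames and steering angles are available as arrays; the names images, angles, and the hyperparameter values are ours, not taken from the paper.

```python
from tensorflow.keras.optimizers import Adam

def train(model, images, angles, epochs=5):
    """Train a steering-angle regression model on collected camera frames.

    `images` holds the camera frames and `angles` the recorded steering angles;
    a validation split monitors the training/validation loss ratio used to
    decide the number of epochs and detect overfitting.
    """
    model.compile(optimizer=Adam(), loss="mse")
    return model.fit(images, angles,
                     validation_split=0.2,
                     shuffle=True,
                     epochs=epochs,
                     batch_size=32)
```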
As can be seen from the first experimental results, even though the network we used was very shallow, with only three layers, the number of nodes, weights, and trainable parameters was quite high. The reason for this is that we used the entire input image for convolution, which produced a high number of nodes in the output layer, and we did not use any operation for reducing the number of parameters, such as pooling. The insights from this experiment were valuable for the next phase of development of the final solution for our light deep neural network for autonomous driving:
Do not use the entire input image. Parts of the image, such as the sky or the lower part of the image, are not relevant for autonomous driving. What is relevant is the road, the curves, and the boundaries of the road, such as a red stripe, shoulder, dust, or bridge.
Do perform data normalization in order to have the same range of values for each of the inputs to the model; this can guarantee stable convergence of weights and biases. (A sketch of this normalization together with the image cropping from the previous item is given after this list.)
Use the pooling operation in order to lower the number of network nodes in the next layer, and, consequently, the number of trainable parameters. A pooling layer is generally used to decrease the size of the output and to prevent overfitting.
Use more convolutional layers, since convolutions are responsible for feature extraction. The first experiment showed that a convolutional layer can extract some features needed for autonomous driving (e.g., one feature is a line to be followed by the autonomous vehicle), but one convolutional layer was not good enough for our aim; more feature maps are needed.
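A minimal Keras-style preprocessing front end reflecting the first two insights could look like the following sketch. The exact split of cropped rows between the top and bottom of the frame is our assumption; what matters is that 95 rows are removed so that a 320 × 160 × 3 camera frame becomes the 320 × 65 × 3 input mentioned below.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Lambda, Cropping2D

# Preprocessing front end: normalize pixel values and crop away image regions
# (sky, bottom of the frame) that are irrelevant for steering prediction.
# The 70/25 row split is an assumption; it removes 95 rows in total, turning a
# 160x320x3 frame (Keras height-first ordering) into a 65x320x3 input.
preprocess = Sequential([
    Lambda(lambda x: x / 255.0 - 0.5, input_shape=(160, 320, 3)),  # zero-centered pixels
    Cropping2D(cropping=((70, 25), (0, 0))),                       # (top, bottom), (left, right)
])
```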
The J-Net Architecture
Introducing more hidden layers to a deep neural network improves parameter efficiency: one can typically get more performance with fewer parameters by going deeper rather than wider. In addition, deep neural networks applied to images are very efficient, since images tend to have a hierarchical structure that deep models naturally capture. The lower layers of a deep neural network capture simple features like lines or edges, further layers extract more complicated features like geometric shapes, and the last layers extract objects. Since the aim of our work was to drive a vehicle on a representative track, the features that needed to be extracted were not objects, but rather simple features and geometric shapes. For that reason, for our final model we chose three convolutional layers followed by one flatten layer and two fully connected layers, as discussed in more detail below.
For better feature extraction, we chose three convolutional layers, with 16, 32, and 64 feature maps, respectively; Figure 12. The input to the first convolution had a size of 320 × 65 × 3, after normalization and cropping of the raw image, and we applied a kernel of size 5 × 5 with 16 feature maps. Counting the weights and biases as described above, the total number of trainable parameters of the first convolutional layer was (5 · 5 · 3 + 1) · 16 = 1216.
Since we wanted to achieve a light solution, and convolution is an expensive operation that adds a significant number of network nodes and weights assigned to those nodes, downsampling was needed. One solution to this problem would be striding during convolution, shifting the filter by a few pixels each time to reduce the feature map size. However, this way of downsampling an image may cause the loss of some important features, since it removes a lot of information. The second way of downsampling an image is the pooling operation: instead of skipping convolution positions with a large stride, we keep a small stride, take all the convolution responses in a neighborhood, and combine them. In order to reduce the size of the deep neural network layers, we applied the max pooling operation after each convolutional layer. In a max pooling layer, every point of the feature map is compared with a small neighborhood around that point, and the maximum of all the responses in that neighborhood is taken:

$$ y_{i,j} = \max_{(m, n) \in \mathcal{R}_{i,j}} x_{m,n}, $$

where x_{m,n} is the value of one input point and \(\mathcal{R}_{i,j}\) is the pooling region around position (i, j).
The first advantage of using max pooling is that the operation is parameter free, it does not add new parameters, which lowers the risk of increased overfitting. Second, max pooling often yields a more accurate model. On the other hand, since the convolutions below it run at a lower stride, the model becomes more expensive to compute. Additionally, introducing a pooling layer adds more hyperparameters to tune, such as the pooling region size and the pooling stride. The pooling layer operates independently on every depth slice of the input and resizes it spatially. In our model we chose MaxPooling with size 2 × 2, which downsamples every depth slice of the input by a factor of 2 along both width and height, discarding 75% of the activations; the depth dimension remains unchanged. In this way, we reduced the number of trainable network parameters, while the feature maps were not degraded much.
After applying a MaxPooling layer after the first convolution, the input size of the second convolutional layer was 158 × 30 × 16; the second kernel size was 5 × 5 with 32 feature maps, which gives (5 · 5 · 16 + 1) · 32 = 12,832 trainable parameters for the second convolution. After this convolutional layer, the same MaxPooling as after the first convolution was applied. The third kernel size was 3 × 3 with 64 feature maps, which led to (3 · 3 · 32 + 1) · 64 = 18,496 trainable parameters for the third convolution.
The final solution for J-Net had three convolutional layers, each followed by the ReLU activation function described in the previous subsection and by a MaxPooling layer of size 2 × 2. Applying the convolution and MaxPooling operations led to 32,544 trainable parameters in the convolutional part of the network. Since we were developing a regression network with a single output, those nodes had to be connected to one final node. After the last convolutional layer, a flatten layer was added to obtain a one-dimensional vector of activations; the flatten layer did not add new parameters, but only rearranged the existing ones into one dimension. Finally, the last layers of the developed DNN were two fully connected layers, the first with ten output nodes and the second with just one output node for the steering angle prediction. The final number of trainable parameters for this architecture, for an input size of 320 × 160 × 3 pixels, was 150,965. The network architecture is presented in Figure 12.
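For reference, a Keras-style sketch of this architecture is given below. It is our reconstruction from the layer sizes listed above, assuming 'valid' convolutions with stride 1, ReLU activations, 2 × 2 max pooling after each convolution, and the cropping/normalization front end sketched earlier; it is not the authors' released code. With these assumptions, model.summary() reports 150,965 trainable parameters, matching the total quoted in the text.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Lambda, Cropping2D, Conv2D,
                                     MaxPooling2D, Flatten, Dense)

# Reconstruction of the J-Net architecture from the layer sizes in the text.
jnet = Sequential([
    Lambda(lambda x: x / 255.0 - 0.5, input_shape=(160, 320, 3)),  # normalization
    Cropping2D(cropping=((70, 25), (0, 0))),                       # 320x65x3 input to conv1
    Conv2D(16, (5, 5), activation="relu"),                         # 1,216 parameters
    MaxPooling2D((2, 2)),
    Conv2D(32, (5, 5), activation="relu"),                         # 12,832 parameters
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),                         # 18,496 parameters
    MaxPooling2D((2, 2)),
    Flatten(),                                                     # 11,840 activations
    Dense(10),                                                     # 118,410 parameters
    Dense(1),                                                      # steering angle output
])
jnet.summary()
```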
7. Results and Discussion
The proposed deep neural network, J-Net, was compared with AlexNet and PilotNet, which we re-implemented in order to conduct an objective performance evaluation of the novel design. The models of all three network architectures were implemented and trained with the same dataset, and the trained models were used for inference in the simulator for autonomous driving. The results were compared in terms of performance, i.e., successful driving on the representative track, and in terms of network complexity, the number of trainable parameters, and the size of the trained model.
7.1. Computational Complexity
The execution of deep neural network models depends heavily on a set of static constants, the weights, also called model parameters. The network architecture itself, i.e., the connections between nodes, directly determines the computational cost of the network. One of the main differences between convolutional neural networks and conventional artificial neural networks is that the neurons are not fully connected between layers. Thus, the particular organization of the deep neural network and the precise characterization of the computations in the filter elements determine the network complexity.
In order to have a quantified measure of the computational demand of each trained network, i.e., the complexity of the network, the numbers of trainable parameters were compared. The depth of the network, the number of layers, and the types of layers (convolutional, pooling, and dense) uniquely determine the number of network parameters; this applies to a designed model that is not yet trained. In Table 2, the layers and the numbers of trainable parameters of the implemented neural networks are presented. As can be seen from the table, AlexNet had 42,452,305 trainable parameters, PilotNet had 328,219, while J-Net had only 150,965. The network introduced in this paper, J-Net, had about half the trainable parameters of PilotNet, and about 280 times fewer than AlexNet. In addition, during training we calculated the number of floating-point operations for each model, which was, as expected, proportional to the number of parameters: 42.45 million multiplications and the same number of additions for AlexNet, 347.82 thousand multiplications and additions for PilotNet, and about 150.84 thousand multiplications and additions for J-Net; Table 2.
Additionally, we compared the sizes of the trained models: the AlexNet model had a memory size of 509.5 MB, PilotNet 4.2 MB, and J-Net only 1.8 MB; Table 3. All models were trained with the same dataset, loss function, and optimizer. The number of epochs used for training each model differed due to differences in model overfitting, which is indicated by the ratio of the training and validation loss obtained during training. While the network architecture itself, i.e., the types and sizes of the network's layers and the connections between layers, directly influences the computational cost, the size of the trained model matters for inference due to the memory restrictions of embedded hardware platforms.
The new network we propose in this paper, J-Net, had about 150 thousand trainable parameters, which is about half of our implementation of PilotNet, with about 330 thousand trainable parameters, and about 280 times fewer than the re-implementation of AlexNet, with over 42 million trainable parameters, which means that we succeeded in delivering the least computationally demanding solution. The trained J-Net model was less than half the size of the PilotNet model and about 280 times smaller than the AlexNet model. The smaller size of the network and the number of trainable parameters lead to improved real-time performance in terms of latency reduction, and reduce the requirements on the interfacing hardware in terms of computational power, cost, and space.
Based on these results, we can say that our proposed network has a shallower architecture than the other solutions we compared it with and a smaller number of trainable parameters, and, consequently, is the smallest trained model. This recommends the designed network for deployment on embedded automotive platforms.
7.2. Performance Evaluation during Inference—Autonomous Driving
The verification of successful autonomous driving was done in the simulator on the representative track. During autonomous driving mode, the signal from the central camera mounted on the vehicle was continuously acquired and sent as input to the trained machine learning model, which produced the steering angle control. Autonomous driving using all three models was recorded and is given in the videos in [66,67,68] for AlexNet, PilotNet, and J-Net, respectively. As can be seen from the videos, J-Net fulfilled the requirement for autonomous driving on the predefined path: the vehicle remained on the road for the full duration of the ride. The measure of performance was a successful drive on the representative track, i.e., that the vehicle does not leave the track during the ride, which implies that the better-performing solution is the one where the vehicle stays in the middle of the track for the full duration of the ride. The performance of the J-Net model was satisfactory. The qualitative performance evaluation of autonomous driving using the implemented networks is given in Table 4.
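The real-time inference loop can be summarized by the following simplified sketch. It is a hypothetical illustration of how a trained Keras model could be wired to the simulator or vehicle; get_camera_frame() and send_steering() are placeholder names for the acquisition and control interface, and the model file name is also illustrative.

```python
import numpy as np
from tensorflow.keras.models import load_model

# Simplified real-time inference loop: read a camera frame, predict the
# steering angle with the trained model, and send the command to the vehicle.
model = load_model("j_net.h5")  # illustrative file name

def drive(get_camera_frame, send_steering, throttle=0.2):
    while True:
        frame = get_camera_frame()                     # HxWx3 RGB image from the central camera
        batch = np.expand_dims(frame, axis=0)          # add batch dimension
        steering_angle = float(model.predict(batch, verbose=0)[0, 0])
        send_steering(steering_angle, throttle)        # constant speed, steering from the network
```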
One of the metrics for performance evaluation was observation of the vehicle's behavior in curves, Table 4, second row. Among the three solutions, AlexNet performed best during autonomous driving. Using AlexNet, the vehicle was in the middle of the road most of the time, while during autonomous driving using PilotNet and J-Net the vehicle was almost always in the middle of the road, but in some curves it came close to the edge. However, all three implementations managed to keep the vehicle on the road at all times, and it did not go off the path.
In addition to observing autonomous driving on the representative track, the steering angle predictions used for autonomous driving were evaluated. As can be seen in Figure 16, the steering angle predictions of all three models were relatively similar. The graphical presentation of the steering angle predictions used for real-time inference is given for one full lap of autonomous driving on the representative track; positive and negative steering angle values represent left and right rotation, respectively. Since the representative track used for driving during inference was the same, and since the speed of the vehicle was fixed for simplicity, Figure 16 shows the steering angle predictions at similar frames. The steering angle predictions of the J-Net and PilotNet models were similar; however, J-Net had slightly higher values in both directions, left and right. The AlexNet model produced mostly smooth steering predictions for the majority of the ride; however, at some points it had extreme values, for example, at about frame 2500 there is a spike to the left, while the other two models did not have such a sharp turn for that part of the road.
As another measure of autonomous driving performance, we measured the impact of each neural network on the driven trajectory. The relative deviation from the center of the trajectory over one full lap of autonomous driving is presented in Figure 17. The driving track characteristics can be grouped into four main categories: a mostly straight part of the road bounded by shoulders, curves marked with a red and white stripe, a bridge, and parts of the road without any markings but bounded by dust. As can be seen from Figure 17, driving using all three models showed similar patterns. The models drive the car mostly without oscillations on the straight parts of the road. However, in the curves the deviation from the center of the trajectory was the largest (e.g., after the bridge, the part of the diagram in Figure 17 marked as (c), there are three sharp curves, Figure 17d,b, of which the third curve was the most challenging). Additionally, Figure 17 shows that all three networks had deviations in this part, where AlexNet had the largest deviation and J-Net performed better than the other models. On the other hand, J-Net had more oscillations over the full lap, while AlexNet had the best performance, staying closest to the center of the trajectory for most of the ride.
The statistical analysis of autonomous driving is also presented through histograms; this analysis is significant for long-term tests. In order to examine the oscillations, the histogram of the relative deviations from the center of the trajectory over one full lap of autonomous driving is presented in Figure 18. The histogram for J-Net is presented in Figure 18a, which shows that J-Net had the smallest deviations in the curves, while its oscillations around the center of the trajectory were the largest. For PilotNet, Figure 18b, the oscillations were medium in comparison with the other two networks, but this model had significant oscillations in curves, both left and right, as can be seen from Figure 18b, where for both directions sporadic occurrences had a relative deviation from the center of the trajectory of almost 100%. The histogram of the relative deviation from the center of the trajectory shows that autonomous driving using AlexNet, Figure 18c, was the most stable, with the smallest oscillations around the center of the trajectory. On the other hand, there was a sporadic occurrence of high deviation from the center of the trajectory in one curve direction. However, this deviation was within the allowed limits, i.e., the car did not leave the road, which was the criterion we defined for successful autonomous driving.
Finally, all models performed well, successfully finishing the lap of autonomous driving with no significant deviation from the center of the trajectory. Differences between autonomous driving using different models were notable, but not large.
Based on the computational complexity analysis, it was expected that J-Net would have the lowest latency and the highest frame rate among the three evaluated solutions. The quantitative performance evaluation verified this claim, as can be seen from Table 5. This evaluation was done on the PC platform described in Section 6.2, which is the high-performance platform used for the simulator environment. J-Net was able to successfully finish the task of autonomous driving on the representative track. As the track is a closed loop, we measured the number of successful consecutive laps out of ten laps in total. All three models were able to drive successfully during the measured time. For the latency measurement, we calculated the time between two consecutive predictions; since this number varies during a full lap of autonomous driving, the mean value is reported. The frames per second were calculated by counting the number of predictions per acquired frames in one second.
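The latency and frame-rate figures can be computed from per-prediction timestamps as in the following sketch; this is our illustration of the described measurement, with an assumed list of timestamps, not the original measurement script.

```python
import numpy as np

def latency_and_fps(timestamps):
    """Mean latency (s) between consecutive predictions and predictions per second.

    `timestamps` is a list of times (in seconds) at which the model produced
    consecutive steering-angle predictions during one lap.
    """
    t = np.asarray(timestamps, dtype=float)
    gaps = np.diff(t)              # time between consecutive predictions
    mean_latency = gaps.mean()     # reported latency (mean over the lap)
    fps = 1.0 / mean_latency       # corresponding frame rate
    return mean_latency, fps
```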
Observing the frame rate measurements, real-time inference using the J-Net model was 30% faster than using the AlexNet model on the high-performance platform used for the simulator environment. However, if a scalar processor were used for inference, much larger differences would be expected (e.g., using J-Net instead of AlexNet would be up to 280 times faster, in proportion to the number of parameters). In the experiment with the simulator environment, the inference platform was a high-capacity computer with a GPU that provided data parallelization, so these results hold for the particular case where a GPU is used. Here, since the neural network architectures differed more in their width than in their depth, the majority of operations could be done in parallel, and the difference in frame rate was a consequence of the sequential part of the algorithm execution, which is proportional to the network depth.
A faithful demonstration of the J-Net performance advantage is therefore platform dependent. At the other extreme, when the operations are executed only on scalar processors, the execution rates are expected to differ much more, that is, in proportion to the network capacity, i.e., the number of parameters. Real implementations of the J-Net model are intended for embedded platforms, in which the degree of parallelization will be set so that the frame-rate performance requirements are met.
8. Conclusions
The development of high-performance computers able to perform training and inference for machine learning models has led to great advances and novel approaches to known problems. However, industrial applications often require machine learning solutions that can be deployed on computationally inexpensive embedded platforms with small memory, low cost, and small size. Deploying machine learning models on a low-performance hardware platform implies the use of models that are inexpensive in terms of computational power and memory resources, which can be achieved by careful design of the neural network architecture. In parallel with advances in hardware, in the development of novel processor units targeted at machine learning and, more precisely, deep learning applications, there is a trend toward designing light network architectures that can meet strict hardware requirements.
The deep neural network presented in this paper is one possible solution for end-to-end learning for autonomous driving. The aim of our work was to achieve successful autonomous driving using a light deep neural network suitable for inference and deployment on an embedded automotive platform. With this in mind, we designed and implemented J-Net, a deep convolutional neural network able to successfully perform the task of autonomous driving on a representative track, with the computational cost of this network being the smallest among the other known solutions explored in this paper. The main contribution of the proposed work is a novel solution that is computationally efficient due to its relatively light architecture. The complexity of an algorithm is determined by the number of operations in one iteration, and our deep neural network has shown similar qualitative results with far fewer operations in comparison with the other neural networks explored in this paper.
A possible limitation of J-Net could be insufficient generalization for more complex use-case scenarios. In addition, our model is trained using raw camera images and the steering angle measurement for each frame, while the speed of the vehicle is taken as constant for simplicity. This causes a limitation during autonomous driving regarding the speed, since a constant speed is implied. However, it would be possible to train J-Net to also predict the speed of the vehicle: an approach similar to predicting the steering angle can be used, which may lead to simultaneous predictions of steering angle and speed based on the input camera image in real time.
Future work will include the deployment of the presented network on an embedded automotive platform with limited hardware resources, low processor power, and low memory size. Possible final use cases for the presented end-to-end learning network are robot-cars in warehouses and delivery vehicles. The use of a light DNN solution, like the one presented in this paper, enables deployment on embedded automotive platforms with low-power hardware, low cost, and small size, which is important for practical industrial applications.