1. Introduction
Autonomous driving technologies have, over the last five years, not only drawn attention from both researchers and the public but have become true driver assistants in many scenarios, including turnpikes and city expressways. For example, Tesla's Autopilot system utilizes eight external cameras, 12 ultrasonic sensors, and a powerful onboard computer to provide perception capability beyond the human level. Tesla announced that in the first quarter of 2021, the accident rate of its Autopilot system was one accident for every 4.68 million miles driven. In contrast, the Insurance Institute for Highway Safety (IIHS) reported that the fatality rate per 100 million miles traveled in 2019 was 1.11. The advantage of autonomous driving systems over human drivers is statistically obvious, but the application of these systems is limited by legal and ethical practice. Constrained by regulators, the autonomous driving system is better known as an advanced driving assistance system (ADAS) [1,2,3] and is employed in regular conditions such as turnpikes, in which the composition of the environment is relatively simple and the behavior of other road users is highly predictable. Lane localization plays a significant role in autonomous driving systems. Beyond the cruise control function, the route planning and driving decisions of a vehicle depend, to a large extent, on the result of lane localization. This problem seems to be solved by the High-Definition map (HD map), with which the vehicle position is matched against lane information to achieve accurate localization. However, in most countries the HD map is not permitted as a database for personal vehicle use. Therefore, the lane localization problem needs other solutions. Currently, with the development of deep learning, many companies and organizations employ finely designed DNNs to simulate human perception ability. The high-level semantic features extracted from the vision information (in most cases, videos or images captured by the cameras mounted on the vehicle) help the onboard computer determine the lane index. Despite the remarkable feature extraction ability of DNNs, lane localization remains a difficult problem for the following reasons:
Weak generalization ability in the real world. Recently, many finely designed DNN architectures have worked effectively on public datasets, even achieving state-of-the-art performance, and some of them perform perfectly in simulation systems [4]. However, the real world is far more complex than public datasets and simulation systems. Given that the autonomous driving suite affects not only drivers but also pedestrians, the generalization ability of the architecture must be strong and robust. In the real world, many situations such as strong light, darkness, rain, and moisture seriously affect the feature extraction ability of the algorithm. A less robust architecture may extract the wrong features and then output the wrong lane index, which is likely to result in traffic accidents and endanger human life.
The limitation of onboard chips and computers. Although some backbones in recent years have worked effectively and efficiently on large-scale image datasets, such as ImageNet and COCO, and even in real scenarios, their performance mostly relies on the floating-point computing capability of NVIDIA GPUs (GTX series, RTX series, and T4). The autonomous driving suite, being both equipment and a commodity, needs to strike a balance between computational cost and effectiveness. In the first quarter of 2022, the most widely used chip for autonomous driving systems was the Snapdragon 8155. This Snapdragon processor includes eight Kryo CPUs, a Qualcomm Adreno 640 GPU, and a high-performance Hexagon DSP; it has remarkable floating-point computing capability for an embedded chip but is still inferior to its PC-level counterparts. Furthermore, although the transformer framework shows considerable potential in vision tasks, it is not well supported by the Snapdragon chips: the computing acceleration toolkit cannot be applied to the transformer framework. Hence, the selected backbones have to be based on earlier frameworks rather than the transformer.
The expensive cost of labeled data. In a real-world mass-production scenario, the cost of labeled data is a huge problem. To train a robust network, a large amount of data with specific labels is fed into the network; in fact, an effective network in autonomous driving takes hundreds of thousands or even millions of images as input. A specific task usually needs correspondingly labeled images; in other words, it is unlikely that a single labeled image meets the needs of all tasks. For example, front-vehicle detection and road edge detection are priorities while driving, and the labeled data for these functions is on the order of millions of images. In contrast, to address the lane localization problem, only tens of thousands to a hundred thousand labeled images are available. Therefore, wisely reusing the output of other tasks is a helpful way to address the lack of data.
To address the problems listed above, we propose a robust and effective framework named LLNet. Our main contributions are as follows:
We view the lane localization problem as a classification task. To this end, we propose a light, effective, and robust lane localization network (LLNet), whose generalization ability is considerably superior to that of the baseline DNNs. Aided by the dual classifier heads, LLNet fully utilizes both the images extracted from the vehicle-mounted cameras and the total lane number loaded from the SD map.
To increase the generalization capability of LLNet, we used the lane marks and road edges outputted from the perception model of the autonomous driving system to improve the performance of the classification network. To improve training efficiency and to allow the classifier to better utilize the perception information, we generated an image from the perceptual data (as shown in Section 3.1). This image and the corresponding visual image form an image pair and are fed to the classification network simultaneously in a loosely coupled manner to generate classification results. We assigned a small confidence weight to the perceptual information due to the instability of the output of the perception module.
To validate the generalization ability and effectiveness of LLNet, we carried out a comprehensive experiment on real-world data (Beijing and Guangdong Province, China) collected and labeled by a professional data company. Due to commercial confidentiality requirements, only part of the test results are provided in this paper. The experimental results show that the proposed LLNet outperforms the baseline classifiers in precision, recall, and generalization ability. We also conducted comparison experiments between different versions of LLNet to verify the rationality of the design of each part of LLNet. The experimental results show that the current LLNet design achieves an optimal balance between performance and computational consumption.
The lane localization algorithm proposed in this paper will be deployed on a mass-production vehicle with an autonomous driving suite in the second quarter of 2022, with an expected fleet size of 300,000 vehicles. The algorithm may be improved in the final production version and will be updated over time.
2. Related Work
Originally, the lane localization task was viewed as a simple subtask of the precise localization of vehicles. In the driving process, the accurate position of the vehicle is supposed to be calculated with high-precision Inertial Navigation Systems (INS) [5,6], the global navigation satellite system (GNSS) [7,8,9], and the Inertial Measurement Unit (IMU) [10,11,12]. Christopher Rose, Jordan Britt et al. [13] fused the measurements from both camera and lidar, calculating the lateral distance through two lane localization systems. The estimated results were utilized by the navigation filter to match a well-established waypoint-based map, and the output could be provided to a GPS/INS system. Qingquan Li, Long Chen et al. [14] made use of light, lidar, and vision data to build a fused real-time lane detection system. The fusion approach was constructed at the feature level rather than the pixel level. Based on an optimal selection strategy for route prediction, the lane detection algorithm executed selectively, taking the classification of potentially drivable routes as a reference. Farhad Aghili, Chun-Yi Su et al. [15] combined a noise-adaptive Kalman filter and the ICP registration algorithm to construct a six-degree-of-freedom relative navigation system fed by laser scanner and IMU data. In addition, to improve the robustness of the system, the ICP alignment error was estimated and the ICP point matching was checked by fault detection logic. This navigation system was robust in environments where a closed-loop check could be performed; however, in real-world autonomous driving applications, such loop routes rarely exist. Aoki Takanose, Yuki Kitsukawa et al. [16] exploited the averaging effect by utilizing time-series data. The estimation result was further improved by a specific module that mitigated the GNSS multipath effect. E Soltanaghaei, A Prabhakara et al. [17] took advantage of millimeter-wave radar data to overcome the free-space path loss problem, relying on signals returned from a tag at mmWave bands through Van Atta arrays.
However, due to multipath effects [18,19,20], which result from reflection and occlusion by buildings and other obstacles, the localization results were not stable enough to meet the requirements of autonomous driving. The information received by the sensor is also often limited by the operating range and environmental conditions [21,22]. To address the problem of instability, Google [23] modeled the driving environment as a probabilistic map, in which the remittance values of each block were represented by their own Gaussian distribution. The subsequent Bayesian inference was able to reduce the uncertainty of the consistent angular reflectivity. Moreover, offline SLAM was employed to align multiple passes of the same environment. With the above steps, the driving environment was reconstructed with high confidence. Furthermore, lane localization algorithms based on vision information have received much attention, not only due to the low cost of cameras but also because they mimic human drivers. Versaci M, Calcagno S et al. [24] proposed a fuzzy pre-processing method based on geometric techniques with reduced computational load. K Zhao, Kang J et al. [25] presented a method combining Mask R-CNN with building boundary regularization to localize all building polygons in given satellite images. R Labayrade, J Douret et al. [26] designed three lane localization algorithms, each based on a different resolution of input data. The results of the two low-level algorithms were fused to generate a reliability indicator; in parallel, the third algorithm, based on the high-level data, outputted a model through Kalman filtering. The prediction of the search areas in the next image was given by the combination of the three algorithms. However, this method usually generated confusing results when the number of lanes was greater than three. Yan Jiang, Feng Gao et al. [27] designed a vision system based on an estimate-and-detect scheme to output the precise lane index. A perspective transformation was designed to output the estimation by scanning the rows in the top-view image. M Danielczuk, M Matl et al. [28] presented Synthetic Depth (SD) Mask R-CNN to segment unknown objects in depth images, with the potential to enhance robot skills in grasping and object tracking.
With the development of deep learning, the high-level semantic features extracted by DNNs have gradually replaced low-level geometric features such as corners, edges, and simple textures. Noha Radwan, Abhinav Valada et al. proposed VLocNet++ [29] to analyze the inter-task relationship among the semantic features learned by a DNN, the regression of the 6-DoF global pose, and odometry. VLocNet++ embeds geometric and semantic knowledge of the world into the architecture in order to avoid the limitation of scene-specific encoding constraints. A self-supervised warping step was added, by which intermediate network representations for sophisticated semantics were warped based on the relative motion. C Toft, E Stenborg et al. [30] proposed a scoring system to evaluate the similarity between the semantic features of the query image and the scene. The consistency score was positively related to similarity and was passed into a standard localization pipeline to output the localization result and then update the parameters of the system. Moreover, labeled 3D point maps, combined with semantically segmented images, can be used in the lane localization task. Erik Stenborg, Carl Toft et al. [30] trained a DNN as an image segmenter; the outputted map was then transformed into a localization result by a particle filter. Johannes L. Schönberger et al. [31] proposed a descriptor learning strategy by leveraging a generative model and studying semantic scene completion as an auxiliary task. In this work, high-level 3D geometric and semantic information was encoded to handle difficult conditions, e.g., long-term localization and missing observations. Based on enhanced map data constructed from OSM (OpenStreetMap), Philipp, Frank et al. [32] measured the relative position of the lane segment by matching camera information. The initial pose guess was generated from GNSS data; odometry was then updated and pose correction was performed using detected landmarks. Der-Hau Lee, Jinn-Liang Liu et al. [33] treated the lane localization problem as a segmentation task. They presented a DNN modified from the baseline segmentation network U-Net and generated a simulation model by combining the PP algorithm with the CNN. The model was used to produce a quantitative evaluation of the fusion network. Raja Muthalagu, Anudeepsekhar Bolimera et al. [34] designed a joint framework to fuse the results of a lane mask proposal network and a lane key-point determination network. The aim of the framework was to output key-point predictions that described the left and right lane markings of the vehicle's lane. Making full use of semantic cues such as traffic lights and lane markings, Wei-Kang Tseng, Angela P. Schoellig et al. [35] calibrated the offset between the live GPS and semantic map frames. The GPS information and the camera images used for semantic cue detection were combined by an iterated extended Kalman filter. Tong Qin, Yuxin Zheng et al. [36] proposed a framework that consists of mapping, on-cloud maintenance, and localization modules. They merged several localization results of surrounding vehicles and outputted the optimal estimate of the self-localization. Furthermore, due to the limitations of onboard chips, lightweight DNNs and accelerated inference have also received much attention [37,38,39,40,41,42,43].
3. Method
Given a vision image extracted from the vehicle-mounted camera video and the road perception data outputted from the perception model, the purpose of our proposed model is to wisely utilize the information from these two sources. To this end, a fusion model based on ResNet18 is designed to address the lane localization task as a classification problem. To better leverage the vision and perception information, a perception image generation step is performed first. Subsequently, the vision image and the perception image are sent into the fusion model simultaneously. A new loss function is designed to balance the contributions of the vision information and the perception information.
3.1. The Perception Image Generation Method
In an autonomous driving system, a straightforward idea is that a single highly complicated model, which collects all information derived from the various sensors equipped on the car, can be effective enough to output the features needed to control the vehicle. However, due to the importance of safety in autonomous driving, a complicated fusion model for all tasks is not seen as a reasonable approach because of the potential dangers of a highly coupled mechanism. Once the model outputs low-confidence features, all downstream tasks, such as threat object detection and road edge detection, will output unstable or even completely wrong results, which may cause serious problems for the autonomous driving system. The unstable results outputted from the perception model are shown in
Figure 1. Therefore, a loosely coupled design is required because the extra redundancy brings sufficient safety to the system.
Considering that the lane mark and road edge data outputted from the perception model consist of several lines that label the lane marks and road edges, rather than an image, how to jointly leverage the perception information and the vision image information is a question that should be investigated. There are two possible approaches: high coupling fusion and low coupling fusion. The high coupling fusion is shown in
Figure 1: the fused image can be directly sent into the classification model. Theoretically, this high coupling fusion generates highly accurate results. However, this kind of fusion method reduces both precision and recall in practice (a 15% decline in precision and a 22% decline in recall in our experiments). We suppose the reason for this decline is that a DNN for a classification task pays more attention to rectangular objects than to linear ones. In the lane localization task, it is the lane marks and road edges, rather than the surrounding objects, that assist in determining the lane index. Hence, a new fused image with a low coupling design is proposed to overcome this dilemma. The fusion method is shown in
Figure 2.
The reason for drawing lines that represent the lane marks and road edges is that the advantage of the DNN is its ability to learn high-level features, which may be extremely complex but consist of patterns. The surrounding environment is replaced by randomly generated, meaningless pixels. This step forces the DNN to pay attention to the meaningful lines and to output the classification result based on these lines.
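To make the generation step concrete, the following minimal Python sketch renders the perception output (lane-mark and road-edge polylines) onto a noise background. The function name, colors, and line widths are illustrative assumptions rather than the exact production implementation.

import cv2
import numpy as np

def render_perception_image(lane_lines, road_edges, size=(224, 224)):
    """Render perception polylines onto a random-noise background.

    lane_lines / road_edges: lists of polylines, each an (N, 2) array of
    pixel coordinates already projected into the image plane.
    The background is filled with random pixels so that the classifier
    cannot rely on anything except the drawn lines.
    """
    h, w = size
    # Random, meaningless background pixels.
    canvas = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)
    # Draw lane marks and road edges with distinct colors and widths.
    for line in lane_lines:
        cv2.polylines(canvas, [np.asarray(line, dtype=np.int32)],
                      isClosed=False, color=(255, 255, 255), thickness=3)
    for edge in road_edges:
        cv2.polylines(canvas, [np.asarray(edge, dtype=np.int32)],
                      isClosed=False, color=(0, 255, 0), thickness=5)
    return canvas

The resulting image is paired with the corresponding vision image and the two are fed to the network simultaneously, as described in Section 3.2.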
3.2. The Architecture of the Fusion Model
In the computer vision field, many finely designed classifier networks are used to extract the high-level features of an image to address various tasks. However, due to the safety-critical nature of the autonomous driving task, the classifier used in our model must be stable and robust, with high generalization ability. After testing many classifier networks as baselines, we found that only the Transformer series and the ResNet series met the requirements of autonomous driving. Due to the limitations of the Snapdragon processor equipped in our vehicles, we chose ResNet as our baseline model. As the lane marks and road edges, rather than the surrounding objects, are essential to the classification process, ResNet18 is superior to ResNet50 and ResNet101, in which lane marks that are only tens of pixels wide cannot be mapped into the final feature map because deeper networks have more convolution and downsampling steps. Moreover, the FLOPs of ResNet18 are considerably lower than those of the other two ResNets, so it requires less inference time, which is crucial for the autonomous driving task.
The architecture of our proposed model is shown in
Figure 3 and
Figure 4. LLNet consists of two modules: a vision classification module and a perception classification module. The vision image $I_v$ and its perception counterpart $I_p$ are sent into the corresponding classification modules, and the heads output the classification probability distributions. There are some differences in forward propagation between the training process and the inference process. In the training process, the output of each head in the vision classification module corresponds to its own label, and the losses are combined in a weighted sum (as shown in
Section 3.3). In the inference process, only a single head in the vision classification module outputs a result, which is merged with the result from the perception module, and the final prediction is made based on the merged value.
The backbone of LLNet is a modified version of ResNet18, whose last two layers (the average pooling layer and the 1000-d fully connected layer) are replaced by an adaptive average pooling layer and a fully connected layer whose dimension depends on the total lane number (5 in our experiment, because lane indexes 1–4 cover 99.1% of all conditions; a lane index of 0 represents the inability to classify). Each head in the vision classification module of LLNet is a SoftMax output that corresponds to a different label: one head corresponds to a label that counts the lane index from left to right, whereas the other head corresponds to a label that counts the lane index in the opposite direction. The forward propagation in the training phase is shown in Equations (1)–(3):
$p_1 = f_1(I_v, I_p; \theta_1)$ (1)

$p_2 = f_2(I_v, I_p; \theta_2)$ (2)

$p_3 = f_3(I_v, I_p; \theta_3)$ (3)

where $p_i$ denotes the probability distribution outputted from the corresponding head $i$ (the SoftMax layer), $f_i$ denotes the mapping function that takes the images $I_v$ and $I_p$ as input and gives the probability distribution as output, and $\theta_i$ denotes the parameters of the mapping function.
The forward propagation in the inference phase is shown in Equation (4):

$\hat{y} = \arg\max\left(\alpha\, p_1 + \beta\, p_3\right)$ (4)

where $\alpha$ denotes the output weight of the vision classification module (set to 0.7 by default) and $\beta$ denotes the output weight of the perception classification module (set to 0.3 by default).
The reason for using two labels counted from opposite directions is that, through experiments, we found that a single-head module based on ResNet18 pays more attention to the left part of the image if only the left-to-right label is taken as input. In other words, the right side of the image, including the road edge and even the lane mark on the right side of the vehicle, is ignored by the DNN. Furthermore, the two-head labeling imposes more constraints and forces the network to converge with higher precision compared with a single label. In the inference process, only a single head in the vision classification module is used to output the probability distribution. The reason for using this single output, rather than leveraging both heads' outputs, is that when the two heads output different results, we cannot determine which result has the higher confidence. For example, if the vision classification heads predict lane 1 (left to right) and lane 2 (right to left) while the total lane number is 4, these results cannot be reconciled because the same backbone with the same parameters is used in the forward propagation process.
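A minimal PyTorch sketch of the dual-branch architecture described above is given below. It assumes one modified ResNet18 trunk per branch, two SoftMax heads on the vision branch and one on the perception branch, and the default 0.7/0.3 fusion weights; the class and attribute names are hypothetical and not the production implementation.

import torch
import torch.nn as nn
import torchvision

class LLNetSketch(nn.Module):
    """Sketch of the dual-head vision branch plus perception branch."""

    def __init__(self, num_classes=5, w_vision=0.7, w_perception=0.3):
        super().__init__()
        self.w_vision, self.w_perception = w_vision, w_perception
        # Vision branch: ResNet18 trunk with its average pooling and fc removed,
        # replaced by adaptive average pooling (as described above).
        vision = torchvision.models.resnet18(weights=None)
        self.vision_trunk = nn.Sequential(*list(vision.children())[:-2],
                                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Two heads on the vision branch: left-to-right and right-to-left labels.
        self.head_l2r = nn.Linear(512, num_classes)
        self.head_r2l = nn.Linear(512, num_classes)
        # Perception branch: an independent ResNet18 trunk with a single head.
        perc = torchvision.models.resnet18(weights=None)
        self.perc_trunk = nn.Sequential(*list(perc.children())[:-2],
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head_perc = nn.Linear(512, num_classes)

    def forward(self, img_vision, img_perception):
        fv = self.vision_trunk(img_vision)
        fp = self.perc_trunk(img_perception)
        p1 = torch.softmax(self.head_l2r(fv), dim=1)   # head 1 (left to right)
        p2 = torch.softmax(self.head_r2l(fv), dim=1)   # head 2 (right to left)
        p3 = torch.softmax(self.head_perc(fp), dim=1)  # head 3 (perception)
        if self.training:
            return p1, p2, p3
        # Inference: weighted fusion of head 1 and the perception head, Equation (4).
        fused = self.w_vision * p1 + self.w_perception * p3
        return fused.argmax(dim=1)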
3.3. The Fusion Method and the Loss Function
As discussed above, the losses in the training phase and the validation (inference) phase are different. Since we view the lane localization problem as a classification task, the cross-entropy loss is used as the base loss function in the proposed LLNet. The cross-entropy loss function is shown in Equation (5):

$L_{CE}(p, y) = -\sum_{i} y_i \log p_i$ (5)

where $p$ denotes the predicted probability distribution, $y$ denotes the one-hot ground-truth label, and $i$ indexes the $i$-th class.
The loss function in the training phase is shown in Equations (6)–(9):

$L_1 = L_{CE}(p_1, y_1)$ (6)

$L_2 = L_{CE}(p_2, y_2)$ (7)

$L_3 = L_{CE}(p_3, y_3)$ (8)

$L_{train} = w_1 L_1 + w_2 L_2 + w_3 L_3$ (9)

where $y_k$ denotes the label corresponding to head $k$, and $w_1$, $w_2$, and $w_3$ denote the weights of the losses from heads 1, 2, and 3, respectively. We set them to 0.35, 0.35, and 0.3 by default.
The loss function in the validation (inference) phase is different from that of the training phase and is shown in Equation (10):

$L_{val} = 0.7\, L_1 + 0.3\, L_3$ (10)
As mentioned in
Section 3.2, the forward propagation in the validation or inference phase is based on the single output of the vision classification module. Because the outputs of the two vision heads carry equal confidence, the weight of the loss of head 1 increases from 0.35 to 0.7. The reason we set the weight of the loss from the perception image remarkably lower than that from the vision image is that the perception data are derived from the perception DNN rather than from the original sensor data. Although the perception model is trained on millions of images because of its importance, the data outputted from the perception model are not as reliable as the data from the video camera. The vision images extracted from the video camera provide original information with only a few mistakes under some extreme conditions; such distorted visual information, caused by camera overexposure or motion blur, can be easily identified by PSNR and other indicators. Conversely, errors, drifts, and unstable values in the output of the perception model are difficult to determine or evaluate. That is, it is difficult to define an index that can reasonably assess the confidence level of the road edges and lane lines outputted from the perception model. Therefore, only a relatively low weight can be given to the classification results of the perception model in LLNet.
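As a concrete illustration, the following sketch implements the weighted cross-entropy combination described above (0.35/0.35/0.3 in training, 0.7/0.3 in validation). The function names, and the assumption that the perception head is supervised with the left-to-right label, are illustrative rather than taken from the production code.

import torch
import torch.nn.functional as F

def llnet_training_loss(p1, p2, p3, y_l2r, y_r2l,
                        w1=0.35, w2=0.35, w3=0.30):
    """Weighted cross-entropy over the three heads (Equations (6)-(9)).

    p1, p2 : softmax outputs of the two vision heads (left-to-right / right-to-left labels)
    p3     : softmax output of the perception head
    y_l2r  : lane-index labels counted from the left
    y_r2l  : lane-index labels counted from the right
    """
    eps = 1e-8  # numerical stability, since the heads already apply softmax
    loss1 = F.nll_loss(torch.log(p1 + eps), y_l2r)
    loss2 = F.nll_loss(torch.log(p2 + eps), y_r2l)
    loss3 = F.nll_loss(torch.log(p3 + eps), y_l2r)
    return w1 * loss1 + w2 * loss2 + w3 * loss3

def llnet_validation_loss(p1, p3, y_l2r, w1=0.7, w3=0.3):
    """Validation-phase loss (Equation (10)): only head 1 and the perception head."""
    eps = 1e-8
    return w1 * F.nll_loss(torch.log(p1 + eps), y_l2r) + \
           w3 * F.nll_loss(torch.log(p3 + eps), y_l2r)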
5. Discussion
Lane localization has traditionally been treated as a pure positioning problem. One common idea to address this problem is the joint optimization of multiple sensors such as GPS and IMU. Another idea is to use manually designed features to extract the lane marks, followed by rule-based localization. However, with the development of deep learning, DNNs are able not only to learn high-level semantic features that are incomprehensible to humans but also to build end-to-end solutions. In addition, the generalization capability of DNN-based lane localization algorithms can be continuously improved as data accumulate. Therefore, we creatively define the lane localization problem as a classification problem in the deep learning field. To our knowledge, there has never been a lane localization algorithm based entirely on deep learning before LLNet. However, directly applying deep learning classification baselines does not meet the requirements of autonomous driving, because the lane localization classification task differs from typical visual classification tasks. The reasons are as follows:
First, most common data augmentation methods cannot be used, because the images seen during inference do not exhibit the same changes as those introduced by the augmentation, as shown in
Figure 8. However, augmentation methods such as rotation within a small angle (
Figure 9) can be used in the lane localization task, because slight rotation of the camera does happen while the autonomous driving suite is continuously working. The key point of data augmentation is that the augmented condition must occur in real-world applications.
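As an illustration of this principle, a conservative augmentation pipeline might look like the following torchvision sketch, limited to perturbations that can physically occur on a mounted camera; the specific angle and jitter values are illustrative assumptions.

import torchvision.transforms as T

# Only perturbations that can actually occur while the camera is mounted on a
# moving vehicle: small rotations (camera roll) and mild lighting variation.
# Horizontal flips and large crops are excluded because they would change the
# lane index or simulate viewpoints that never appear at inference time.
train_transform = T.Compose([
    T.RandomRotation(degrees=3),                  # slight camera roll
    T.ColorJitter(brightness=0.2, contrast=0.2),  # lighting variation
    T.ToTensor(),
])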
Second, the core features that lane localization relies on are lane marks and road edges. The high-level semantic features extracted by the classification baselines are helpful but not the most important for this task, which is a significant difference from the traditional classification task. As can be seen from
Figure 10 and
Figure 11, ResNet outperforms the other baselines (AlexNet and VGG). However, unlike in traditional classification and object detection tasks, ResNet18 is superior to ResNet50. This is because the lane marks, which are important for the lane localization task, are not mapped to the final feature map due to the downsampling and convolution operations in the deeper network.
The visual images sampled by the camera possess remarkable stability and robustness, but as shown in
Table 9, the performance of a classification network relying on visual information alone cannot meet the precision requirement of autonomous driving (90% precision). Therefore, we need additional information to improve the performance of the lane localization algorithm. In an autonomous driving system, the perceptual information is the output of the perception model. Due to the importance of the perception module, the perception model is trained on millions of labeled images, so its outputs usually carry high-level semantic features. The road edges and lane lines from the perception model can therefore be used by the lane localization module. As also shown in
Table 9, the precision metric can meet the requirement once LLNet utilizes the perception and vision information with a single classifier head.
However, it should be noted that adding perceptual information significantly reduces the recall of the model, because the output of the perceptual model is unstable. Moreover, adding perceptual information places more constraints on the classification process, and a large number of scenes that could be judged from the visual image alone are classified as undecidable under the limitations of the perceptual information. In general, adding perceptual information improves precision but reduces recall at the same time.
In order to utilize both visual and perceptual information, a smarter fusion model is necessary. In addition, due to the requirement for real-time lane localization, loosely coupled models are preferable to tightly coupled ones. Therefore, how to loosely couple the visual and perceptual information is a question that should be addressed. The LLNet proposed in this paper classifies the visual and perceptual information separately and then produces a weighted output. We give a relatively low weight to the perceptual information because it is not as stable and robust as the vision information. Furthermore, to address the recall decline caused by adding the perception information, we designed a dual-head architecture.
As shown in
Table 10, the precision and recall of the dual-head classification results are significantly higher than those of the single-head version. With a single label, lanes are counted only from left to right, so the network ignores the information on the right side of the image; the dual-head architecture of LLNet forces the network to extract classification features from both directions equally.
To our knowledge, LLNet is the first lane localization network based entirely on deep learning, and it achieves satisfactory metrics on the first and second lanes, as required by autonomous driving systems. However, our study still has some shortcomings.
The precision and recall metrics for the third and fourth lanes cannot meet the requirements, which is caused by insufficient data for these lanes. However, in the real world, three or four lanes in one direction is not a common scenario. Moreover, it is difficult to collect data exclusively for the third and fourth lanes because labeled data is very expensive. Furthermore, the lane localization task differs from the traditional classification task, so common data augmentation methods cannot be used. Therefore, how to achieve large-scale pre-labeling or suitable data augmentation is a critical problem that needs to be solved.
The vehicle driving process is a continuous process. Intuitively, the temporal sequence of visual and perceptual information is highly useful but currently not well utilized. This information could theoretically further improve the accuracy of lane localization in the future.
6. Conclusions
In this paper, we propose a fusion deep learning network named LLNet to solve the lane localization problem in an end-to-end manner. Innovatively, we define the lane localization task as a classification problem. To make full use of the lane mark and road edge information outputted from the perception model, as well as the vision information from the camera, we designed a loosely coupled fusion network to combine the visual and perceptual classification results. In LLNet, the visual and perceptual images are passed through the classification model separately, the probability distributions are outputted, and their weighted sum is calculated. To maintain stability and robustness and to improve generalization performance, we use ResNet18 as the backbone. In addition, to overcome the problem that the classification model focuses more on the left side of the image and ignores the right side, we designed a multi-head mechanism in the vision module to force the network to focus equally on classification features from the left and right sides. In comparison experiments on a large-scale dataset of 11 roads with a total mileage of over 400 km in Beijing and Guangdong Province, LLNet significantly outperforms the current baseline classification networks in all metrics. Moreover, we compared different versions of LLNet to justify the effectiveness of each part of the architecture.
Currently, our method already meets the requirements on the precision and recall metrics in the first- and second-lane scenarios. However, due to the insufficient number of samples in the training set, the metrics for the third and fourth lanes are relatively low. Semi-supervised learning methods may be effective for this task, and we leave this as future work.