LaneFormer: Real-Time Lane Exaction and Detection via Transformer

Yang, Yinyi; Peng, Haiyong; Li, Chuanchang; Zhang, Weiwei; Yang, Kelu

doi:10.3390/app12199722

Open AccessArticle

LaneFormer: Real-Time Lane Exaction and Detection via Transformer

by

Yinyi Yang

¹

,

Haiyong Peng

^1,*

,

Chuanchang Li

¹,

Weiwei Zhang

¹ and

Kelu Yang

²

¹

School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China

²

Research and Development Department, CRRC Nanjing Puzhen Co., Ltd., Nanjing 210031, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(19), 9722; https://doi.org/10.3390/app12199722

Submission received: 31 August 2022 / Revised: 22 September 2022 / Accepted: 23 September 2022 / Published: 27 September 2022

(This article belongs to the Section Transportation and Future Mobility)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

In intelligent driving, lane line detection is a basic but challenging task, especially in complex road conditions. The current detection algorithms based on convolutional neural networks perform well for simple scenes with plenty of light, and the lane lines are clean and unobstructed. Still, they do not perform well for complex scenes such as damaged, blocked, and lack-of-light scenes. In this article, we have exceeded the above restrictions and propose an attractive network: LaneFormer; We use an end-to-end network for up and down sampling three times each, then fuse them in their respective channels to extract the slender lane line structure. At the same time, a correction module is designed to adjust the dimensions of the extracted features using MLP, judging whether the feature is completely extracted through the loss function. Finally, we send the feature into the transformer network, detect the lane line points through the attention mechanism, and design a road and camera model to fit the identified lane line feature points. Our proposed method has been validated in the TuSimple benchmark test, showing the most advanced accuracy with the lightest model and fastest speed.

Keywords:

lane line detection; attention mechanism; transformer network; road structure

1. Introduction

A key technology in the autonomous driving system is cameras for lane line detection. However, it is found that lane detection faces the challenge of complex scenes. Above all, the lane line is a slender structure with a few appearance clues. It is difficult for the current model to extract all the features of the lane line accurately. Secondly, under extreme weather conditions (rain, dim light, and snow), lane line wear, and obstruction, it is easy to fail to detect the shape and characteristics of lane lines.

Lane line detection algorithms are roughly divided into two categories: traditional algorithms and algorithms based on neural networks. Shen Y et al. [1] proposed a lane detection and recognition method based on the dynamic region of interest (ROI) selection and firefly algorithm. Determine the height and width of the ROI based on the vanishing point and lane lines. H. Jung et al. [2] proposed a lane line detection scheme that adapts to low computing power systems. Traditional methods [3,4] use Hough transform, canny edge detection, and Kalman filter to extract lane line features and then use B-spline to fit the lane line. Niu et al. [5] used Hough transform to extract lane lines and DBSCAN (density-based application spatial clustering) for clustering, but that model cannot fit well for the lane scene with large curvature. P. Smuda et al. [6] use a particle filter to fuse information from the digital map system and propose a new image-based road detection feature, but this method suffers from error accumulation. Yue Wang et al. [7] transformed the problem of lane detection into the problem of determining the control point set by maximum likelihood estimation. W Yue et al. [8] used the cubic B-spline curve to fit the center line; their method assumed that the two sides of the lane line are parallel, but their model did not perform well in unstructured road scenarios. Zheng F et al. and K. Zhao et al. [9,10] use the Catmull–Rom spline in combination with the extended Kalman filter tracking to realize lane line detection; their model can identify different numbers of lane lines. The above methods require manual adjustment of parameters for different scenarios, which are inefficient and have poor results in complex scenarios.

The current widely used methods are neural networks to detect lane lines. J. Dai et al. [11] proposed a model composed of three networks: distinguished instances, estimated masks, and classified object networks. Gopalan et al. [12] used pixel hierarchical structure features to simulate contextual information and used particle filters to track lane markings. This has a good detection effect on worn and occluded lane lines. Qian, Y. et al. [13] improved the model’s overall performance by joint training of lane lines and drivable areas and achieved good improvement. Ref. [14] used a Self Attention Refinement (SAD) design to make the network layer-by-layer attention, refining from top to bottom to learn features and to identify lane lines by encoding rich context; this algorithm combines global features and local features well. Xinlong Wang et al. [15] decouples the mask branch into a mask core and feature branch. This strategy can teach the convolution kernel and the feature, respectively. Lee et al. [16] proposed a unified end-to-end trainable multi-task network that jointly handled vanishing point-guided lane detection to some extent addressing lanes in rainy and low-light conditions. K. He et al. [17] were inspired by adding a branch to predict the object mask in parallel with the existing bounding box recognition branch. Zhang et al. [18] proposed a corrugated lane line detection network, which learns lane features through gradient maps. Fausto Milletari et al. [19] propose a 3D image segmentation method based on a fully convolutional neural network. Davy Neven et al. [20] segmented the lanes by instance and fitted the lane lines through the images after perspective transformation. M. Bertozzi et al. [21] built a detection model based on a stereo vision to detect lane positions in a structured environment. Pan et al. [22] proposed the Spatial CNN (SCNN) network with layer-by-layer convolution in feature maps, enabling messages to pass between pixels across rows and columns. Haris M et al. [23] proposed a model to learn to decode the lane structure and iteratively draw any number of lanes without the computational and time complexity of a recurrent neural network. The corrugated lane line detection network proposed by Zhang et al. [24] used fast connections and gradient maps to effectively learn lane line features, which could solve challenging scenarios such as occluded lane lines. Qin et al. [25] regarded lane line detection as a line-based selection problem of global features; selecting feature points based on rows greatly reduces the computational cost. Ren et al. [26] have introduced a region proposal network (RPN), which shares full-image convolutional features with the detection network. Nicolai Wojke et al. [27] integrated appearance information to improve SORT’s (Simple Online and Realtime Tracking) performance. This model can track objects through longer occlusion time, thereby effectively reducing the number of identity switching. Linjie Yang et al. [28] used a single forward pass to adapt the segmentation model to the appearance of a specific object, which greatly reduced the computational complexity. Zhang et al. [29] established a multi-task learning framework to segment the lane area, detect the lane boundary simultaneously, and consider two geometric constraints. Developing convolutional neural network-based methods is more mature, and it is not easy to achieve a large improvement. Philion J et al. [30] use a joint fully convolutional network and unsupervised training to detect lane lines.

In the past few years, a Transformer network based on the Attention mechanism has rapidly developed in lane line detection [31,32,33]. Liu, R., Yuan et al. [31] used an attention mechanism jointly with CNN to extract features to predict key points of lane lines with an improvement in accuracy. Chen, L. et al. [32] used an attention mechanism to map the extracted lane line points to the 3D space for fitting, which improves the accuracy but has an unacceptable computational overhead. Qiu, Q et al. [33] improve the accuracy by adding prior knowledge to the attention mechanism, which is effective but has too many hyperparameters, making the already hard-to-train model even harder to converge.

In this work, we proposed a network based on an attention mechanism called LaneFormer; the main contributions of this work are as follows:

(1): We adopt multiple down-sampling and up-sampling for fusion in multiple stages to improve the ability to extract information on slender lane lines.
(2): We use the correction function to visualize the ability of the network to extract features; when the correction function does not continue to decrease, it proves that the model has extracted enough features.
(3): We use the attention module to detect lane lines and then the lane line model to fit the detected feature points. Finally, the fitted lanes are projected into the original image.

This paper is organized as follows. Section 2 elaborates on our lane line detection algorithm. Section 3 describes the experimental results, compares them with the current mainstream algorithms, and conducts ablation experiments. Section 4 concludes our work.

2. Methods

2.1. Overall Architecture

The architecture shown in Figure 1 consists of a backbone for feature extraction. We use three down-samplings, each down-sampling using a shared convolutional neural network

δ

for further feature processing, and each layer up-sampling, stitched with the features of the previous layer, and finally output the features

F

. Then, we adopt the feature correction module to check that our network has extracted enough features. We design the MLP to get F

^{'}

; meanwhile, we use the same MLP work on the F

^{'}

to calculate

F^{″}

; we measure the “difference” that we named loss

L_{c o r r}

between

F^{'}

and

F^{″}

; if the loss does not continue to reduce, it is considered that sufficient features have been extracted. The features

F^{'}

will be sent to the attention mechanism. The features are converted to sequence as a part of a value, we will design a position-coding sequence mixed with feature as key, and the query is composed of camera and road parameters. Finally, we use the lane fitting model to fit the points of predicted lane lines.

2.2. Backbone

The backbone is based on the ResNet network. We use

32

parallel groups, and each group is convolutional three times; the first convolution is

4 \times 1 \times 1 \times 256

, the second is

4 \times 3 \times 3 \times 32

, and the third is

256 \times 1 \times 1 \times 4

, and the output is

256

channels. After each

3 \times 3

convolution, we replace the activation function with the Mish function to prevent the problem of gradient disappearance or gradient explosion that occurs when the data is too large or too small. We design a down-sample and up-sample mechanism to enhance further the model’s ability to express obstructed and discontinuous lane lines. We divide the feature obtained from the backbone into three parallel channels for further processing. The first channel uses the shared extractor

δ

to extract features for the current feature. The second channel first down-samples and then uses the shared extractor

δ

for feature extraction; the third channel is to down-sample the current feature again, uses the same CNN for feature extraction, and then performs up-sampling, concatenating with the feature of the second channel. The obtained feature is up-sampled again and spliced with the first channel, and the final feature is used as the final output feature

F

of the backbone.

2.3. Feature Correction

Inspired by [34], we designed the feature correction network as shown in Figure 2.

F

is the output value of the previous module; we use the shared extractor

δ

to extract features further, adopt a small convolution kernel for fine-tuning, and get the feature

F^{'}

as the value vector in the attention mechanism. In order to ensure that the features are fully obtained as much as possible, use a

M L P

to act on

F^{'},

and then measure the “difference” between

F

and

F^{″}

; when the “difference” no longer changes (increase or reduce), it indicates that the

F^{'}

feature is fully extracted. We use the

L 1

loss function as the objective function to measure the “difference” between

F

and

F^{″}

, the formula is:

L_{c o r r} = ||F - F^{″}||

(1)

2.4. Transformer Encoder

In order to capture the contextual relationship, we use the position code

P

, the value of

F^{'}

is

φ

, and the position code and value of

F^{'}

are concatenated as the value V of the Transformer module:

V = (P, φ)

, in Figure 3a, we denote

P

,

F^{'}

by

S_{p}

,

S_{φ}

, respectively.

The structure of the encoding module is shown in Figure 3a. Each encoder module contains head-attention layers and an Add&Norm layer; after each attention mechanism is calculated, the features will be concat and normalized. The concat operation ensures that the subsequent dimensions are the same and that the next step can be performed. The normalization process can significantly improve target detection and speed up the operation. Next, the feature sequence will be sent to the feed-forward layer to change the input tensor’s and send it to the decoder after normalization in the next layer.

We use the scoring function

α

to measure the relationship between the query and the key. In order to improve the efficiency of the calculation, we use the scaled dot-product attention scoring function. The query and key sequence were modified to the same length

c

, assuming the elements of query and key is an independent and identically distributed random variables; the encoder self-attention model performs scaled dot-product attention by Equation:

α (q, k) = (\frac{q^{T} k}{\sqrt{c}})

(2)

where

q \in R^{w \times h}, k \in R^{h \times w}

represent the sequences of the query and key, respectively,

α

denotes the scoring function.

q^{T}

represents the transpose matrix of

q

. c represents the length of the query and key, and the formula is normalized by dividing by

\sqrt{c}

.

During training, we use mini-batches to increase the speed of operations; the scaled dot-product attention is shown in Equation:

A = s o f t m a x (α (q, k)) V \in R^{w \times 2 h}

(3)

where

V \in R^{w \times 2 h}

represents the value sequences through a linear transformation on each input row.

A

stands for the attention map and measures non-local interactions to capture slender structures in global context. Considering that the lane line is a slender structure and this model system only detects lane lines, multi-head attention is used in the encoding and decoding blocks for feature processing, which can reduce model parameters and computational overhead and promote the lightweight of the model.

2.5. Transformer Decoder

The transformer decoder also consists of multiple identical layers. The residual connection and layer normalization are used in the layers, as shown in Figure 3a.

S_{q}

is the matrix of

N \times C

, which is used to learn the characteristics of lane lines during training.

S_{p}

and the initialized

S_{q}

are sent to the mask attention mechanism layer to detect the current feature. After normalization, the attention formula is calculated. The result of the processing is the same as the coding block and is finally output through the fully connected layer. In the decoder attention, the query, key, and value all come from the output of the previous decoder layer. The key and value derived from the absolute position can only consider all positions before that position. This masked attention preserves the auto-regressive attribute, ensuring that the prediction only depends on the generated output features.

2.6. FFNS

In addition to attention sub-layers, each layer in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. The network contains two linear transformations that map the input

S_{d}

to a high-dimensional space

(2, 64) \times (64, 1024) = (2, 1024)

, then through a nonlinear function, and finally to the original dimension. The

s o f t m a x

function will output the label (lane or background),

S_{d}

will be converted into a

N \times 4

matrix, and 4 represents the four prediction parameters of the curve fitting. Then, the average value is taken in each dimension. The fully connected layer formula is as follows:

F F N (x) = m a x (0, x w_{1} + b_{1}) w_{2} + b_{2}

(4)

where

w

denotes the weight matrix of the fully connected layer and

b

represents the bias matrix.

2.7. Lane Detection Model

2.7.1. Lane Line Fitting

Inspired by the geometric topology in [35], we designed the new road structure. We fit the detected points into a smooth curve; the prior model of the lane shape is defined as a polynomial. We use the least squares method to perform multiple cubic curve-fitting on the lane line detection points. A single-lane line on flat ground is:

X = {k Z}^{3} + m Z^{2} + n Z + b

(5)

where

(X, Z)

represents a coordinate point on a flat road.

k, m, n

are the constant matrix and

b

is the compensation value. When the optical axis of the camera is parallel to the ground plane, the coordinate points on the ground plane should be expressed as:

u = \frac{k^{'}}{v^{2}} + \frac{m^{'}}{v} + n^{'} + b^{'} \times v

(6)

where

k^{'}, m^{'}, n^{'}, b^{'}

are the camera internal and external parameters,

(u, v)

is the coordinate point on the pixel level of the image.

If the angle between the optical axis of the camera and the ground plane is

θ

, then the actual coordinates on the image should be:

u^{'} = \frac{k^{'} \times {c o s}^{2} θ}{{(v^{'} - f s i n θ)}^{2}} + \frac{m^{'} c o s θ}{(v^{'} - f s i n θ)} + n^{'} + \frac{b^{'} \times v^{'}}{c o s θ} - b^{'} \times f t a n θ

(7)

where

f

is the focal length of the camera and

(u^{'}, v^{'})

is the coordinate of the corresponding geometrically transformed pixel point.

First, we extract the current video frame and the images of the previous three frames, and perform least squares third-order polynomial fitting on the coordinate points extracted by the model:

X_{t} = k {(\begin{matrix} z_{11} & \dots & z_{1 j} \\ ⋱ \\ z_{i 1} & \dots & z_{i j} \end{matrix})}^{3} + m {(\begin{matrix} z_{11} & \dots & z_{1 j} \\ ⋱ \\ z_{i 1} & \dots & z_{i j} \end{matrix})}^{2} + n (\begin{matrix} z_{11} & \dots & z_{1 j} \\ ⋱ \\ z_{i 1} & \dots & z_{i j} \end{matrix}) + (\begin{matrix} b_{1} \\ ⋮ \\ b_{i} \end{matrix})

(8)

We will keep some parameters and the coefficient matrix used in the calculation. After that, all the lane lines in all the frames (frame t and the previous three frames) are combined, and then the results are averaged; final result is the average of the fitting results of the current frame and the previous three frames.

2.7.2. Fitting Loss

The loss of lane line detection is divided into two categories: the classification loss of the lane line type (lane line and background) and the regression loss of the lane line position referenced by the anchor. The first loss function uses the Cross-Entropy Loss function [36], and the second regression loss is Smooth

L 1

Loss. During training, we use distance to measure whether it is positive or negative, and the remainder

N p & n

is used for the multi-task loss defined as:

\begin{matrix} L ({α_{i}, β_{i}}_{i = 0}^{N_{p & n} - 1}) = λ \sum_{i} L_{c} (α_{i}, α_{i}^{'}) + \sum_{i} L_{r} (β_{i}, β_{i}^{'}) \\ = λ \sum_{i} (y_{i} l o g (S (f_{θ} (x_{i}))) + \sum_{i} (\{\begin{array}{l} 0.5 x^{2}, if |x| < 1 \\ |x| - 0.5, otherwise \end{array}) \end{matrix}

(9)

The functions

L_{c}

and

L_{r}

are Cross Entropy loss and Smooth

L 1

loss, respectively,

α_{i}

and

α_{i}^{'}

are the classification output values and target values of the

i - th

point,

β_{i}

and

β_{i}^{'}

are the output values and target values of the regression output of the

i - th

point. The regression loss is measured by the distance and common coordinate of the estimated value and the true value. If the anchor is considered to be a negative number, its

L_{r}

is equal to 0. Factor

λ

is used to balance the loss components, hyperparameter

λ = 2.5

.

3. Results

3.1. Datasets

We use the TuSimple dataset to test our method. The TuSimple dataset contains

6408

annotated pictures, which are high-definition pictures (

720 \times 1280

) decomposed from the video recorded by the front-view camera. It includes day and night photos of different road conditions and different weather on American highways. The data set is divided into

2704

test sets,

3521

training sets, and

345

validation sets. We use CULane to validate our method to evaluate the adaptive capability to new scenes. CULane is a large-scale and challenging data set for academic research on lane detection. It was collected by cameras installed on six vehicles driven in Beijing, which collected more than

55

h of video and extracted

133, 235

frames. We divide the data set into 88,880 training sets,

9675

validation sets, and

34

,

680

test sets. The test set is divided into normal categories and

8

challenging categories.

3.2. Evaluation Indicators

In order to show the performance of the model and comparison of it with the previous method, we follow TuSimple’s accuracy detection index. In order to judge whether a lane marking is successfully detected, we view the lane markings as lines with widths equal to 25 pixels and calculate the intersection-over-union

(I O U)

between the ground truth and the prediction. Predictions whose

I O U

are larger than a certain threshold are viewed as true positives

(T P)

. The prediction accuracy is computed as:

a c c u r a c y = \frac{\sum_{c l i p} C_{c l i p}}{\sum_{c l i p} S_{c l i p}}

(10)

Here,

C_{c l i p}

is the number of correct points in the last frame of the video clip, and

S_{c l i p}

is the number of ground truth points in the last frame of the clip. The predicted point is correct if the difference between the width of the ground truth and prediction is smaller than a threshold. On the CULane data set, we use

F 1 - m e a s u r e

as the evaluation indicator.

F 1 - m e a s u r e = 2 * \frac{p r e c i s i o n * r e c a l l}{p r e c i s i o n + r e c a l l}

(11)

p r e c i s i o n = \frac{T P}{T P + F P}

(12)

r e c a l l = \frac{T P}{T P + F N}

(13)

Here,

T P

is the number of positive examples,

F P

is the number of positive examples that were incorrectly classified,

F N

is the number of cases that were incorrectly classified as negative examples,

and T N

is the number of correctly classified as negative examples.

3.3. Experimental Parameters

For the Tusimple data set, we set the input resolution to

360 \times 640

, the learning rate to

5 e - 4

, the batch size as

20

, the number of prediction curves

N

set as

6

, and the number of training iterations as

400 k

. We scale the raw data, rotation, image channel increase or decrease, and cropping for data augmentation. In addition, for the CULane data setting, the learning rate is set to

e - 5

, and the other parameters remain unchanged. Except for the ablation experiment, the hyperparameter settings of all experiments are the same. All results are tested on the dual 2080Ti graphics card platform.

In order to illustrate the performance of our model, we compare it with the excellent models for lane line detection: VPGNet [16], Lanenet [20], Ultra-Fast Lanenet [25], SAD [14], Fast-Draw [30], SCNN [22], LSTM [31]. We test on the Tusimple data, comparing the dimensions of the frame rate, MACS, Para, PP, ACC, FP, FN, and place the comparison results in Table 1.

Table 1 shows the performance comparison between our method and the current excellent lane line detection method. The evaluation index is based on the Tusimple data set. Compared with the LSTM network of the same transformer structure, our method has a slightly lower speed but a large improvement in accuracy. At the same time, the parameter page is lower than LSTM. Compared with CNN-based lane line detection frameworks (VPGNet, Lanenet, SAD, FastDraw, SCNN), our speed is between 5–50 frames faster than theirs, and our accuracy is equal to or even slightly higher than the above methods.

3.4. Comparison to State-of-the-Art Methods

Figure 4 is the visualization result of our lane line detection. We compared the LSTM model based on the transformer network and the Lanenet model using the convolutional neural network. It can be seen that our model can fit farther lane lines in the picture (a). For the second picture (b) of multi-lane curves in the scene, our method fits more accurately and there is no drift at the far end. Our method fits more accurately in the third picture (c) of the large curvature curve scene. The ResNet32 network can capture enough lane line features, and the attention mechanism supplements the lane line’s slender structure and context information, so our model performs well in bends and scenes with strong light.

Figure 5 shows the visual images of our model in different scenes on the Tusimple datasets and the CULane datasets. (a,b) is the scene on the CULane dataset. It can be seen that curves of different curvatures and vehicle occlusion conditions can be well simulated. Meanwhile, (c,d) is the scene on the Tusimple datasets; dashed lines and curves of different colors can also be well identified.

3.5. Ablation Experiment

3.5.1. Position Encoding

The attention mechanism abandons sequential operations because of parallel computing. To use the order information of the sequence, we inject absolute or relative position information by adding position encoding to the input representation. The location code can be obtained by learning or being directly fixed. For the light and computational speed of the model, we use a fixed position encoding based on sine and cosine functions. Position coding formulas such as Equation (14).

P_{i, 2 j} = s i n (\frac{i}{1000^{2 j / d}}) {, P}_{i, 2 j + 1} = c o s (\frac{i}{1000^{2 j / d}})

(14)

where

i, j

represent the rows and columns of pixels in the image coordinate system, and

d

is the dimension of the position embedding matrix, which is encoded after normalization.

For comparison, we use the models with and without position coding to conduct experiments; results are shown in Table 2a, without position

A P

(Average Precise) of the model can reach 32.4%, and AP with position can reach 35.5%. The performance of the second experiment (with position coding) is about 3 points higher than that of the first experiment (without position coding), which verifies the necessity of model input position coding. The reason is that position information establishes a relationship between input and output in the process of sequence supervision.

3.5.2. Backbone Selection

The backbone used in our model is a modified ResNeXt50 network. We use different backbones for comparison; Table 2b shows the comparison in model performance, ResNet50 and inceptionv3 as backbone, respectively. With other parameters kept constant, the

A P

value of ResNeXt50 is the highest,

35.3 %

,

5.3 %

percentage points higher than the second value. It means that the deeper network of the rest network has a stronger ability to extract feature points. At the same time, the unique residual structure of the network can also better fit the data.

3.5.3. Transformer Encoder Module

We perform model performance verification of different numbers of coding modules on the Tusimple datasets. The heat map is shown in Figure 6. The depth of the color of the heat map is the confidence level of the feature, and the position in the heat map and the position in the original map are mapping relationships. The confidence and position of the extracted feature points corresponding to different decoding layers are displayed through the heatmap; the number of encoding modules is 2, 3, 4, and 5. It can be seen that when the number of encoding modules is 4, the performance of feature classification and regression is the best; therefore, we fix the number of coding modules to 4.

3.5.4. Transformer Decoder Module

Our encoder number is set to

4

. We verify the model‘s performance by changing the number of layers in the decoder module. It can be seen from Table 3 that the output layer of each layer has the best performance, However, after the number of floors exceeds four, the performance is reduced, and we analyze that it is the performance degradation caused by the over-fitting of the model.

3.5.5. Lane Shape Module

We use several different curves to fit the characteristic points of the lane. The results are listed in Table 4. We find that the cubic curve fitting based on least squares has the best effect on occlusion and bending. For the straight line, our lane shape model works well, too. This result agrees with the previous work’s conclusion that detecting the lane line’s direction generally uses cubic curve fitting. This experiment also demonstrated that the cubic curve fitting could get a better result. The quadratic curve fitting cannot fit all the points, which the quartic curve fitting will count in the noise points. Both of them will reduce the accuracy.

4. Conclusions

In this paper, we have proposed an end-to-end lane line detection network: a LaneFormer network based on an attention mechanism that can directly visualize lane lines. We use the modified ResNet32 network as a backbone to extract shallow-level features and adopt multiple down-sampling and up-sampling for fusion in stages to improve the ability to extract information on slender lane lines. When the extracted features are not significantly increased, the ability to use the feature correction module to visualize the extraction of features sends the features to the attention module. Use the attention module to enhance the contextual information of the extracted features, and then use the lane line model to fit the detected feature points. Our network fully captures the contextual information during the training process and fits the detected road lane points well. Our method achieves state-of-the-art detection performance in terms of parameter numbers and running time; besides, our model is more stable and reliable. In the next task, we will explore the encoding method of location information to find a better encoding matrix that can improve the models’ accuracy or speed. At the same time, we will study the reasoning ability of the model so that it can be based on a small part of the curve in the urbanization road to infer the line of the lane that is occluded.

Author Contributions

Y.Y.: Writing—original draft, Conceptualization, Methodology, Validation, Visualization; H.P.: Writing—review & editing, Methodology; C.L.: Validation, Project administration; W.Z.: Visualization, Resources; K.Y.: Data curation, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 51805312; Shanghai Sailing Program, grant number 18YF1409400; Science and Technology Commission of Shanghai Municipality; Training and funding Program of Shanghai College young teachers, grant number ZZGCD15102.

Data Availability Statement

The data that support the findings of this study are openly available in [https://github.com/TuSimple/tusimple-benchmark, accessed on 29 May 2022], reference number [37].

Acknowledgments

The authors wish to express their appreciation to the reviewers for their helpful suggestions, which greatly improved the presentation of this paper.

Conflicts of Interest

The authors declare that they have no conflict of interest to report regarding the present study.

References

Shen, Y.; Bi, Y.; Yang, Z.; Liu, D.; Liu, K.; Du, Y. Lane line detection and recognition based on dynamic ROI and modified firefly algorithm. Int. J. Intell. Robot. Appl. 2021, 5, 143–155. [Google Scholar]
Jung, H.; Min, J.; Kim, J. An efficient lane detection algorithm for lane departure detection. In Proceedings of the 2013 IEEE Intelligent Vehicles Symposium (IV), Gold Coast City, Australia, 23 June 2013; IEEE: Piscataway, NJ, USA, 2013. [Google Scholar]
Loose, H.; Franke, U.; Stiller, C. Kalman particle filter for lane recognition on rural roads. In Proceedings of the 2009 IEEE Intelligent Vehicles Symposium, Xi’an, China, 3–5 June 2009; IEEE: Piscataway, NJ, USA, 2009. [Google Scholar]
Marzougui, M.; Alasiry, A.; Kortli, Y.; Baili, J. A lane tracking method based on progressive probabilistic hough transform. IEEE Access 2020, 8, 84893–84905. [Google Scholar]
Niu, J.; Lu, J.; Xu, M.; Lv, P.; Zhao, X. Robust lane detection using two-stage feature extraction with curve fitting. Pattern Recognit. 2016, 59, 225–233. [Google Scholar]
Smuda, P.; Schweiger, R.; Neumann, H.; Ritter, W. Multiple cue data fusion with particle filters for road course detection in vision systems. In Proceedings of the 2006 IEEE Intelligent Vehicles Symposium, Meguro-Ku, Japan, 13–15 June 2006; IEEE: Piscataway, NJ, USA, 2006. [Google Scholar]
Wang, Y.; Shen, D.; Teoh, E.K. Lane detection using spline model. Pattern Recognit. Lett. 2000, 21, 677–689. [Google Scholar]
Wang, Y.; Teoh, E.K.; Sben, D. Lane detection and tracking using b-snake, image and vision computer. Image Vis. Comput. 2004, 22, 269–280. [Google Scholar]
Zheng, F.; Luo, S.; Song, K.; Yan, C.; Wang, M. Improved lane line detection algorithm based on Hough transform. Pattern Recognit. Image Anal. 2018, 28, 254–260. [Google Scholar]
Zhao, K.; Meuter, M.; Nunn, C.; Müller, D.; Müller-Schneiders, S.; Pauli, J. A novel multi-lane detection and tracking system. In Proceedings of the 2012 IEEE Intelligent Vehicles Symposium, Madrid, Spain, 3–7 June 2012; IEEE: Piscataway, NJ, USA, 2012. [Google Scholar]
Dai, J.; He, K.; Sun, J. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Gopalan, R.; Hong, T.; Shneier, M.; Chellappa, R. A learning approach towards detection and tracking of lane markings. IEEE Trans. Intell. Transp. Syst. 2012, 13, 1088–1098. [Google Scholar]
Qian, Y.; Dolan, J.M.; Yang, M. Dlt-net: Joint detection of drivable areas, lane lines, and traffic objects. IEEE Trans. Intell. Transp. Syst. 2019, 21, 4670–4679. [Google Scholar]
Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning lightweight lane detection cnns by self attention distillation. In Proceedings of the IEEE/CVF International Conference Computer Vision, Seoul, Korea, 27–28 October 2019. [Google Scholar]
Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. Solov2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732. [Google Scholar]
Lee, S.; Kim, J.; Shin Yoon, J.; Shin, S.; Bailo, O.; Kim, N.; Lee, T.H.; Seok Hong, H.; Han, S.H.; So Kweon, I. Vpgnet: Vanishing point guided network for lane and road marking detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
Lim, K.H.; Seng, K.P.; Ang, L.-M.; Chin, S.W. Lane detection and kalman-based linear-parabolic lane tracking. In Proceedings of the 2009 International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 26–27 August 2009; IEEE: Piscataway, NJ, USA, 2009; Volume 2. [Google Scholar]
Zhang, Y.; Lu, Z.; Ma, D.; Xue, J.-H.; Liao, Q. Ripple-gan: Lane line detection with ripple laneline detection network and wasserstein gan. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1532–1542. [Google Scholar]
Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volu-metric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
Neven, D.; De Brabandere, B.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In Proceedings of the 2018 IEEE intelligent Vehicles Symposium (IV), Suzhou, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
Bertozzi, M.; Broggi, A. Real-time lane and obstacle detection on the gold system. In Proceedings of the Conference on In-telligent Vehicles, Tokyo, Japan, 19–20 September 1996. [Google Scholar]
Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as deep: Spatial cnn for traffic scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Haris, M.; Hou, J.; Wang, X. Multi-scale spatial convolution algorithm for lane line detection and lane offset estimation in complex road conditions. Signal Process. Image Commun. 2021, 99, 116413. [Google Scholar]
Zhang, Z. Towards Real-Time Object Detection on Edge with Deep Neural Networks. Ph.D. Thesis, University of Missouri-Columbia, Columbia, MO, USA, 2018. [Google Scholar]
Qin, Z.; Wang, H.; Li, X. Ultrafast structure-aware deep lane detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
Yang, L.; Wang, Y.; Xiong, X.; Yang, J.; Katsaggelos, A.K. Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
Zhang, J.; Xu, Y.; Ni, B.; Duan, Z. Geometric constrained joint lane segmentation and lane boundary detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
Philion, J. Fastdraw: Addressing the long tail of lane detection by adapting a sequential prediction network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11582–11591. [Google Scholar]
Liu, R.; Yuan, Z.; Liu, T.; Xiong, Z. End-to-end lane shape prediction with transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3694–3702. [Google Scholar]
Chen, L.; Sima, C.; Li, Y.; Zheng, Z.; Xu, J.; Geng, X.; Li, H.; He, C.; Shi, J.; Yu, Q.; et al. PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark. arXiv 2022, arXiv:2203.11089. [Google Scholar]
Qiu, Q.; Gao, H.; Hua, W.; Huang, G.; He, X. PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer. arXiv 2022, arXiv:2209.06994. [Google Scholar]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
López, A.; Serrat, J.; Canero, C.; Lumbreras, F.; Graf, T. Robust lane markings detection and road geometry computation. Int. J. Automot. Technol. 2010, 11, 395–407. [Google Scholar]
Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
Tusimple/Tusimple-Benchmark: Download Datasets and Ground Truths. Available online: https://github.com/tusimple/tusimple-benchmark (accessed on 29 May 2022).

Figure 1. The overall architecture.

δ

is a CNN network for further feature extract,

\otimes

is feature concatation.

F

is the finally feature,

F^{'}

is the result of conv processing

F

, and

F^{″}

is the result of conv processing

F^{'}

.

K, Q, V, P

represent the key, query, value and position encoding of the transformer network, respectively.

Figure 1. The overall architecture.

δ

is a CNN network for further feature extract,

\otimes

is feature concatation.

F

is the finally feature,

F^{'}

is the result of conv processing

F

, and

F^{″}

is the result of conv processing

F^{'}

.

K, Q, V, P

represent the key, query, value and position encoding of the transformer network, respectively.

Figure 2. Shared extractor is

δ

, further extraction and fine-tuning of the features of

F

,

F^{'}

is the refined feature. The

M L P

consists of two layers, which adjust the dimension of the feature map of

F

so that it can correspond to the dimension of

F

.

Figure 2. Shared extractor is

δ

, further extraction and fine-tuning of the features of

F

,

F^{'}

is the refined feature. The

M L P

consists of two layers, which adjust the dimension of the feature map of

F

so that it can correspond to the dimension of

F

.

Figure 3. Transformer encoder and transformer decoder module.

S_{φ}, S_{p},

and

S_{q}

represent the sequence of feature, position, and query. The

\otimes

represent the concat of the sequence. (a) Encoder module (b) Decoder module.

Figure 3. Transformer encoder and transformer decoder module.

S_{φ}, S_{p},

and

S_{q}

represent the sequence of feature, position, and query. The

\otimes

represent the concat of the sequence. (a) Encoder module (b) Decoder module.

Figure 4. The pictures on the CULane dataset to verify the performance of our model. From top to bottom, the model visualization effects of LSTM, Lanenet, and our method are shown. (a–c) is the model inference results of the tunnel exit, curve, and curve with large curvature, respectively.

Figure 5. Our method visualizes images in different scenarios on the CULane datasets (a,b) and Tusimple datasets (c,d).

Figure 6. Heat map of encoding modules.

N

is the encoding modules number

(N = 2, 3, 4, 5)

. The encoder modules can capture contextual feature information and slender structures of lane lines.

Figure 6. Heat map of encoding modules.

N

is the encoding modules number

(N = 2, 3, 4, 5)

. The encoder modules can capture contextual feature information and slender structures of lane lines.

Table 1. Comparisons of accuracy (%) on TuSimple testing Set. The number of multiply-accumulate (MAC) operations is given in G. The number of parameters (Para) is given in M (million). The PP means the requirement of post-processing.

Method	FPS	MACS	Para	PP	Acc	FP	FN
VPGNet [16]	45	-	-	√	98.25	0.0048	0.0250
Lanenet [20]	52.6	-	20.68	√	96.45	0.0617	0.0244
UltraFast [25]	75	-	0.95	√	95.58	0.0602	0.0205
SAD [14]	70	-	0.95	√	96.60	0.0601	0.0213
FastDraw [30]	90	-	-	-	96.88	0.0742	0.497
SCNN [22]	7	-	20.45	-	97.36	0.0642	0.0133
LSTM [31]	420	0.574	0.77	-	96.18	0.291	0.338
Ours	310	0.425	0.60	-	97.31	0.0290	0.0332

Table 2. Model metrics comparison in position encoding and backbone contrast. AP is average precision (%), AR means average recall (%).

(a) Position encoding. Position coding can bring about 3% performance improvement.
	AP	AP20	AP30	AR1	AR10
With position	35.5	53.1	35.1	33.8	38.5
Without position	32.4	50.4	27.2	25.4	32.9
(b) Backbone contrast. The overall performance of ResNet32 as a backbone model is the best, 3% higher than the second place.
	AP	AP20	AP30	AR1	AR10
ResNet32	35.5	53.1	35.1	33.8	38.5
ResNetX50	32.4	50.4	27.2	25.4	32.9
InceptionV3	29.8	48.5	32.5	26.4	31.5

Table 3. Quantitative evaluation of different transformer decoder module number on TuSimple validation set (%). The horizontal coordinate is transformer decoder module number, the vertical coordinate is transformer encoder number, and the evaluation metric is mAP (mean Average Precision).

Layer	1	2	3	4	5
2	92.25	93.22	-	-	-
4	92.05	93.11	94.52	95.15	-
6	91.05	93.54	93.52	94.85	93.85

Table 4. Quantitative evaluation of different lane shape models on TuSimple validation set (%).

Least Squares Method	Acc	FP	FN
quadratic curve	90.22	0.1259	0.0895
cubic curve	93.22	0.0954	0.0715
quartic curve	91.58	0.1061	0.0845

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, Y.; Peng, H.; Li, C.; Zhang, W.; Yang, K. LaneFormer: Real-Time Lane Exaction and Detection via Transformer. Appl. Sci. 2022, 12, 9722. https://doi.org/10.3390/app12199722

AMA Style

Yang Y, Peng H, Li C, Zhang W, Yang K. LaneFormer: Real-Time Lane Exaction and Detection via Transformer. Applied Sciences. 2022; 12(19):9722. https://doi.org/10.3390/app12199722

Chicago/Turabian Style

Yang, Yinyi, Haiyong Peng, Chuanchang Li, Weiwei Zhang, and Kelu Yang. 2022. "LaneFormer: Real-Time Lane Exaction and Detection via Transformer" Applied Sciences 12, no. 19: 9722. https://doi.org/10.3390/app12199722

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LaneFormer: Real-Time Lane Exaction and Detection via Transformer

Abstract

1. Introduction

2. Methods

2.1. Overall Architecture

2.2. Backbone

2.3. Feature Correction

2.4. Transformer Encoder

2.5. Transformer Decoder

2.6. FFNS

2.7. Lane Detection Model

2.7.1. Lane Line Fitting

2.7.2. Fitting Loss

3. Results

3.1. Datasets

3.2. Evaluation Indicators

3.3. Experimental Parameters

3.4. Comparison to State-of-the-Art Methods

3.5. Ablation Experiment

3.5.1. Position Encoding

3.5.2. Backbone Selection

3.5.3. Transformer Encoder Module

3.5.4. Transformer Decoder Module

3.5.5. Lane Shape Module

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI