1. Introduction
Lane detection aims to locate and identify the lane markings on a road surface from images or videos captured by cameras mounted on the vehicle. Lane detection can provide useful information for navigation, lane keeping, lane changing, collision avoidance, and traffic analysis [1]. However, due to the ever-changing road environment, the lane detection task is very complex [1,2]. For example, when lane lines are blocked by other vehicles on crowded roads, when night falls, or when the road surface carries no lane markings, detecting the lane becomes very difficult. Arrows painted on the road can also reduce detection accuracy because they may be incorrectly detected as lane lines. Glare and shadow [3], or pedestrians crossing at an intersection where no lane line is marked on the ground, further increase the difficulty of lane detection. In addition, multiple conditions can affect detection accuracy at the same time, such as dazzle on a road without lane lines at night.
Figure 1 illustrates five of these practical scenarios that affect the accuracy of lane detection.
The core task of lane detection is to depict lane lines as accurately as possible in the image.
Figure 2 shows the depiction of lane lines in an image: the blue, red, azure, and green lines in the figure are lane lines drawn from the ground truth, and lane lines drawn from the predicted results look similar.
The methods of lane detection can be roughly divided into two categories: traditional methods and deep learning methods [4]. Traditional methods mainly use image processing techniques, such as edge detection, color thresholding, perspective transformation, and the Hough transform, to extract and fit the features of lane lines. This type of method is fast, but it has poor robustness and is not suitable for complex scenes and lighting conditions [5]. Deep learning methods mainly use convolutional neural networks to learn semantic information about lane lines, such as segmentation, classification, and regression. In general, deep learning methods attain higher accuracy than traditional methods in various scenarios. Therefore, current research on lane detection mainly adopts artificial-neural-network-based models [6].
After carefully analyzing the abovementioned scenarios that affect the accuracy of lane detection, we found that the lane lines in these scenes are very context-dependent [6,7]. For example, when a vehicle finds no lane lines on the ground over a short distance but lane lines were visible earlier, the content of the context can be used to enhance the features of the lane lines, improving detection accuracy. Likewise, in the dazzle scene, the strong light from an oncoming car usually passes quickly, and the lane lines before and after this moment are not disturbed by the strong light and remain relatively clear. Similarly, if the front and rear lane lines, that is, the context content, can be used to enhance the lane line features, the accuracy of lane detection in the shadow scenario will also be improved. Based on such observation and analysis, we propose a nonlocal-based neural network model with the goal of achieving higher accuracy for lane detection. Specifically, we embedded the nonlocal [7] module into the FPN [8] (Feature Pyramid Network) of the CLRNet [6] model.
The FPN (Feature Pyramid Network) is a powerful deep learning architecture used for various computer vision tasks, including object detection and semantic segmentation. It builds on a standard convolutional neural network (CNN) to generate multi-scale feature maps, which are then used for further analysis. The FPN consists of several convolutional layers with different filter sizes and strides, where each layer produces a set of feature maps corresponding to a specific scale or level of the input image. These multi-scale feature maps improve performance and accuracy in object detection, semantic segmentation, and other computer vision applications. In the CLRNet model, the FPN is used to enhance the effectiveness of image feature extraction.

CLRNet (Cross-Layer Refinement Network for Lane Detection), introduced in 2022 by Tu Zheng et al., is a deep learning architecture for lane detection in autonomous driving systems that is designed to run in real time using a single camera sensor. The model processes the input image into a set of feature maps that are useful for detecting lanes. It consists of several modules that work together: the first module generates features from the input image using a CNN; these features are then processed by a series of learned response functions that adjust the response at each position based on its location in the image; and the final module generates the lane boundaries from the learned responses. Because the output may still contain false positives or false negatives, CLRNet uses a cross-layer refinement mechanism that refines the output of each module based on feedback from other layers, which improves the accuracy and robustness of lane detection. Accurate real-time lane detection of this kind is critical for safe navigation in autonomous driving, and the CLRNet model can also serve research and development purposes, such as improving the performance of existing lane detection algorithms.
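To make the multi-scale idea concrete, the following is a minimal sketch of an FPN top-down pathway with lateral connections in PyTorch. All names, channel counts, and layer choices here are our own illustrative assumptions, not CLRNet's actual implementation:

```python
# Minimal FPN top-down pathway sketch (illustrative names, channel counts, and
# layer choices; not the CLRNet implementation). c2, c3, c4 are backbone
# features from fine (high resolution) to coarse (low resolution).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(128, 256, 512), out_channels: int = 256):
        super().__init__()
        # 1x1 lateral convolutions project every backbone stage to one width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, c2, c3, c4):
        # Top-down: upsample the coarser map and add the lateral projection of
        # the next finer backbone stage, mixing semantics with detail.
        p4 = self.lateral[2](c4)
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2)
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2)
        return self.smooth[0](p2), self.smooth[1](p3), self.smooth[2](p4)
```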
The main contributions of this research are threefold:
We proposed a novel model, namely, NLNet, which aims to incorporate more long-range dependency information. To the best of our knowledge, we are the first to apply a nonlocal module to the latest state-of-the-art model in the field of lane detection.
Based on the analysis of the size of the feature map of each layer of the FPN of CLRNet, we creatively applied the nonlocal module directly to the L2 layer of the FPN to enhance the information about long-range dependence.
Our proposed NLNet model achieved state-of-the-art performance on the CULane dataset in terms of accuracy.
2. Related Work
Traditional methods for lane detection usually rely on hand-crafted features and they often require the manual tuning of parameters and thresholds for different scenarios. Deep learning methods, on the other hand, can automatically learn high-level features from data without requiring much domain knowledge or human intervention.
Early methods that used deep learning for lane detection were convolutional neural network (CNN)-based models [9,10], which extract features from road images and then apply a sliding window approach to detect lane candidates. These methods achieved good results in simple scenarios but struggled with complex scenes and curved lanes.
To address the limitations of CNN-based methods, some researchers have adopted semantic segmentation techniques for lane detection. Semantic segmentation is a task that assigns a label to each pixel in an image, indicating the category of the object or region that the pixel belongs to. For lane detection, semantic segmentation can be used to classify each pixel as either lane or non-lane. For example, Pan et al. [9] proposed a spatial convolutional neural network (SCNN) that uses spatial message passing along horizontal and vertical directions to capture long-range dependencies and enhance feature representation. The SCNN was trained on two large-scale datasets, CULane and TuSimple [11], which contain diverse road scenes and challenging conditions; it outperformed previous methods on both datasets and achieved real-time performance. Compared with SCNN, the Recurrent Feature-Shift Aggregator [12] (RESA) uses a different message-passing mechanism. The goal of RESA is that each pixel in the final output feature map can contain the information of other pixels, which encodes spatial context features. The specific approach is to slice the feature map along the H and W dimensions to obtain several slices. For slices along H, information can come from two directions, namely, the slices above and below, and different step sizes are combined to aggregate information across slices. For slices along W, information can likewise come from the slices to the left and to the right. The aggregation method involves simple addition: the source slice is first processed with a 1D convolution, a nonlinear activation is applied, and the result is added to the corresponding target slice. The number of convolution kernels and the number of channels are equal to the number of channels in the feature map. When slicing in the H direction, the one-dimensional convolution slides in the W direction, and when slicing in the W direction, it slides in the H direction, as sketched below. The final results on CULane show that RESA achieves better detection accuracy than SCNN.
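As a rough sketch of this shift-and-add aggregation (our simplification in PyTorch, showing a single downward step along H; the actual RESA iterates over several strides and all four directions, reusing shared convolutions):

```python
# Sketch of one RESA-style aggregation step along H (simplified illustration).
import torch
import torch.nn as nn
import torch.nn.functional as F

def resa_step_down(x: torch.Tensor, conv1d: nn.Conv1d, stride: int) -> torch.Tensor:
    # x: B x C x H x W. Circularly shift rows down by `stride`, so every
    # target row slice receives a message from the slice `stride` above it.
    shifted = torch.roll(x, shifts=stride, dims=2)
    b, c, h, w = shifted.shape
    # Flatten rows so the shared 1D convolution slides along W in each slice.
    rows = shifted.permute(0, 2, 1, 3).reshape(b * h, c, w)
    msg = F.relu(conv1d(rows)).reshape(b, h, c, w).permute(0, 2, 1, 3)
    # Aggregation is simple addition of the activated message to the target.
    return x + msg

# The shared convolution keeps the channel count, matching the text above:
# conv1d = nn.Conv1d(C, C, kernel_size=9, padding=4)
```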
Going beyond segmentation-based pipelines, Jonah Philion et al. [13] introduced a novel fully convolutional model of lane detection that learns to decode lane structures instead of delegating structure inference to post-processing. A row-based lane detection task was adopted in the field of lane lines for the first time in End-to-End Lane Marker Detection [14] (E2E), where E2E aimed to distinguish the specific position of each lane in each row and achieved state-of-the-art results at that time (2020). Ultra-Fast Lane Detection [15] (UFLD) defines lane detection as finding a set of positions of lane lines in an image, that is, location selection and classification along row directions. The detection speed of this model is fast. In addition, the model is not a fully convolutional segmentation network but a general classifier based on fully connected layers, and the features it uses are global. In this way, the receptive field problem is directly solved: when detecting the position of a lane line in a row, the receptive field is the full image. Therefore, no complex information transmission mechanism is needed to achieve good results. Experimental results on TuSimple and CULane show that the method can achieve performance close to or better than state-of-the-art methods at ultra-fast speed.
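A minimal sketch of this row-based formulation (made-up sizes and names, not UFLD's exact configuration) might look as follows: for each row anchor and each lane, the model classifies which gridded column cell, plus one extra "no lane" cell, contains the lane point.

```python
# Row-based lane detection head sketch (illustrative sizes only).
import torch
import torch.nn as nn

num_rows, num_cols, num_lanes = 18, 200, 4

# A fully connected head over a global feature: the receptive field is the
# whole image, so no extra message passing is needed.
head = nn.Linear(1024, num_rows * (num_cols + 1) * num_lanes)  # +1 = "no lane"

feat = torch.randn(1, 1024)  # flattened global feature from the backbone
logits = head(feat).view(1, num_rows, num_cols + 1, num_lanes)
# Per row and lane, argmax over the column axis selects the lane position.
cells = logits.argmax(dim=2)  # shape: 1 x num_rows x num_lanes
```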
The PINet [16] model combines key point detection and point cloud instance segmentation to implement a new lane detection algorithm that can be applied to any scene and detect any number of lane lines. Combined with a post-processing algorithm, its false detection rate is very low and it is highly robust. Owing to the keypoint detection method used, the model is smaller and has less computational overhead than segmentation networks.
An attention mechanism is also used in some models, such as LaneATT [17], which can more easily use information from other lanes by combining local and global features. The backbone of the model can be any generic CNN, such as ResNet, that takes the input image and generates feature maps. Subsequently, each anchor is projected onto the feature map, and this projection is used to aggregate features that are combined with another set of features created in the attention module. Finally, using the resulting feature set, two layers, one for classification and the other for regression, make the final prediction.
LaneAF [18] improves the pixel-by-pixel binary classification method for lane detection. Although some clustering or instance segmentation methods can distinguish different lanes, they all place a limit on the maximum number of lanes that can be detected. Abualsaud Hala et al. proposed the LaneAF algorithm, which uses affinity fields combined with binary classification for lane detection and instance segmentation. This method performs well and can handle changes in the number of lane lines.
The SGNet [19] model introduces pixel-level perception, lane-level relation, and image-level attention constraints; applying these three constraints to the anchors yields more accurate predictions. Zhan Qu et al. [20] argued that pixel-level output is redundant and at the same time introduces a lot of noise. Their proposed FOLOLane model uses two branches: one outputs a heatmap to determine whether a pixel is a key point, and the other outputs offsets to accurately compensate for the position of the key point. The output network then completes the local-to-global curve association through an association algorithm to form multiple complete curves.
The CondLaneNet [21] model is inspired by the instance segmentation algorithm CondInst, whose core idea is to replace schemes relying on boxes, ROI crops, and feature alignment with the learning of instance-sensitive convolution kernel parameters, as sketched below. CondLaneNet contains a proposal head and a conditional shape head. The proposal head predicts lane line instances and the dynamic convolution kernel parameters at the instance level, while the conditional shape head predicts the shape information of each lane line instance. CondLaneNet can solve the problems of lane line instance segmentation and crossing-line detection, and it has good real-time performance.
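As a heavily simplified illustration of the dynamic kernel idea borrowed from CondInst (a single predicted 1 × 1 convolution; the real conditional shape head predicts parameters for several stacked dynamic layers):

```python
# Simplified sketch of CondInst-style dynamic (conditional) convolution.
import torch
import torch.nn.functional as F

def dynamic_conv_mask(feat: torch.Tensor, kernel_params: torch.Tensor) -> torch.Tensor:
    # feat: 1 x C x H x W shared feature map; kernel_params: C weights
    # predicted per lane instance by the proposal head.
    weight = kernel_params.view(1, -1, 1, 1)  # out_ch=1, in_ch=C, 1x1 kernel
    # Convolving with instance-specific weights yields that instance's mask.
    return torch.sigmoid(F.conv2d(feat, weight))  # 1 x 1 x H x W
```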
In addition, transfer learning is a viable option for lane detection, and Ke Zhao et al. made significant contributions to this field through multiple papers. For instance, they proposed a federated multi-source domain adaptation method that combines transfer learning and federated learning for machinery fault diagnosis with data privacy [22]. Additionally, they developed a sophisticated transfer framework that employs an indirect latent alignment process to construct a common feature space using a Gaussian prior distribution instead of directly aligning the source and target distributions [23].
Moreover, some scholars have conducted in-depth research specifically for challenging scenarios. Xinxin Zhou et al. improved the performance of human detection in crowded scenes by equipping the FPN with multi-scale feature fusion technology and attention mechanisms. They designed a feature pyramid structure with a refined hierarchical split block, referred to as Scale-FPN, which can better handle the problem of scale variation across object instances. Furthermore, they proposed an attention-based lateral connection (ALC) module with spatial and channel attention mechanisms to replace the lateral connection in the FPN, enhancing the representational ability of feature maps through rich spatial and semantic information. This enables detectors to focus on the important features of occlusion patterns [24].
Other scholars have also conducted research in related areas that has inspired the topic of lane detection. For example, driving fatigue seriously threatens traffic safety, and Fuwang Wang et al. proposed the multifractal detrended fluctuation analysis (MF-DFA) method to detect driver fatigue caused by driving for a long time [25]. Jiawei Xu et al. designed an architecture that analyzes segmentation windows of three-second data to capture unique driving characteristics and then differentiates drivers on that basis; the proposed model includes a fully convolutional network (FCN) and a squeeze-and-excitation (SE) block [26]. Lastly, Bo Jin et al. highlighted that the time–frequency information in the Hilbert spectrum can be utilized to extract the instantaneous characteristic frequency based on the marginal spectrum features to detect the objective [27].
3. The Proposed Model: NLNet
Considering the basic idea of improving the accuracy of lane detection by enhancing contextual relationships and long-range dependence, we chose the nonlocal module to implement this idea within an existing high-performing lane detection model. The nonlocal module improves the performance and generalization of existing models by enhancing their ability to capture global context and long-range dependencies. This study applied this idea to the CLRNet model.
3.1. Motivation
In fact, the CLRNet model also embodies ideas of global relations and long-range dependencies, such as its use of CNNs and its proposed ROIGather module, but we do not believe that CNNs and ROIGather are sufficient for enhancing lane line features with global information and long-range dependencies [6]. CNNs struggle to capture global relationships and long-range dependencies for several reasons [7,8,22]. First, CNNs rely on local receptive fields, which means that each convolutional filter only operates on a small region of the input. This limits the ability of CNNs to aggregate information from distant regions and to model complex interactions between them. Second, CNNs use pooling operations to reduce the spatial resolution of the feature maps and to introduce some degree of translation invariance. However, pooling also discards some spatial information and may cause some loss of semantic information. Third, CNNs use a fixed number of convolutional layers with fixed filter sizes and strides, which imposes a limit on the receptive field of each layer, i.e., the region of the input that influences the output of that layer. The theoretical receptive field grows only linearly with the number of layers and with the filter size, and the effective receptive field has been shown to grow even more slowly in practice. This growth may not be sufficient to cover the entire input or to capture long-range dependencies. Although the CLRNet model uses the FPN to compensate for the drawback that only local information can be obtained by using a CNN alone, we do not believe that all the long-range dependence information can be fully obtained in this way [28,29,30].
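To make this concrete, consider a small worked example: a stack of ten 3 × 3 convolutions with stride 1 has a theoretical receptive field of only $1 + 10 \times (3 - 1) = 21$ pixels on a side, far smaller than a typical road image, which illustrates why purely convolutional stacks need additional mechanisms to relate distant lane pixels.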
ROIGather is another technique used in the CLRNet model to extract long-range dependency information; its architecture diagram is shown in Figure 3. As Figure 3 shows, CLRNet obtains the dependence between lane line pixels and their surrounding pixels by further convolving the ROI feature map, and obtains the dependence between the lane line and the whole image through an attention-like operation with the original feature map of the whole image. However, as shown in the red dashed box in Figure 3, the feature map is eventually flattened into a one-dimensional vector, thus losing the location information, which affects the final detection accuracy.
In summary, CLRNet does not maximize the use of the contribution of other global regions (e.g., pixels that are far away) to the current region. In fact, in a road image, the features of each lane line can be enhanced through similarity: two pixels, no matter how far apart, can reinforce each other's features when they are highly similar. The problem therefore becomes obtaining the relationship weight of every pixel in the image with respect to the current pixel, in other words, capturing long-range dependencies.
Figure 4 illustrates the similarity between a certain lane line and all other lane lines in the figure.
3.2. The Method for Obtaining Long-Range Dependencies
Building on CLRNet, the focus of our research was to improve detection accuracy by adding a simple and effective long-range dependency module. The nonlocal module is a neural network component that can capture long-range dependencies in data. It is inspired by the concept of nonlocal means, a denoising technique that uses the similarity of patches in an image to reduce noise. The nonlocal module has several advantages over conventional convolutional or recurrent layers. First, it can capture long-range dependencies without increasing the receptive field or the number of parameters. Second, it can adaptively adjust the weights based on the input data rather than using fixed kernels or weights. Third, it can handle variable-length inputs and outputs, such as sequences or graphs.
The basic idea of the nonlocal module is to compute a weighted sum of the features at all positions in the input, where the weights are determined by a pairwise function that measures the similarity or affinity between two positions. The output of the nonlocal module is then added to the original input as a residual connection, which helps to preserve the local information.
The nonlocal module can be formulated as follows:

$$ \mathbf{y}_i = \frac{1}{C(\mathbf{x})} \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j), $$

where $\mathbf{y}_i$ is the output feature at position $i$, $\mathbf{x}_j$ is the input feature at position $j$, $f(\mathbf{x}_i, \mathbf{x}_j)$ is the pairwise function that computes the weight for position $j$ given position $i$, $g(\mathbf{x}_j)$ is a function that transforms the input feature at position $j$, and $C(\mathbf{x})$ is a normalization factor. The summation is over all possible positions $j$ in the input.

The pairwise function $f$ can be implemented in different ways, such as using a dot product, Gaussian, embedded Gaussian, or concatenation. The function $g$ can be a linear projection or a more complex transformation. The nonlocal module can also be generalized to multi-head attention, where multiple output features are computed with different $g$ functions and then concatenated [7].
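For concreteness, in the embedded Gaussian instantiation from [7], which matches the $\theta$ and $\phi$ mappings used in the steps below, the pairwise function and the normalization factor take the form

$$ f(\mathbf{x}_i, \mathbf{x}_j) = e^{\theta(\mathbf{x}_i)^{\top} \phi(\mathbf{x}_j)}, \qquad C(\mathbf{x}) = \sum_{\forall j} f(\mathbf{x}_i, \mathbf{x}_j), $$

so the normalized weights $f/C$ along $j$ reduce to a softmax over the dot products $\theta(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j)$.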
There are many ways to implement a nonlocal block; Figure 5 shows the concrete implementation of the nonlocal block used in this study. We followed the specific steps below, as in the original study.
First, linear mappings are applied to the input feature map (essentially, 1 × 1 × 1 convolutions are used to compress the number of channels) to obtain the θ, φ, and g features.
Through a reshape operation, all dimensions of the above three features except the channel dimension are merged, and then the matrix dot product of θ and φ is computed to obtain something like a covariance matrix. This step is very important: it computes the autocorrelation within the features, that is, the relationship between each position and all other positions.
Then a softmax operation is performed on the autocorrelation features to obtain weights between 0 and 1, which are the self-attention coefficients we need. Softmax is an activation function that normalizes a numeric vector into a probability distribution whose entries add up to 1. The formula for softmax is as follows, where $\mathbf{z}$ is a vector and $z_i$ and $z_j$ are elements of it:

$$ \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}. $$
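As a quick numerical check, for $\mathbf{z} = (1, 2, 3)$ we get $e^{\mathbf{z}} \approx (2.72, 7.39, 20.09)$ with sum $\approx 30.19$, so $\mathrm{softmax}(\mathbf{z}) \approx (0.09, 0.24, 0.67)$, which indeed sums to 1 and assigns the largest weight to the largest element.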
Finally, the attention coefficients are multiplied with the feature matrix g, the number of channels is expanded back (again via 1 × 1 × 1 convolution), and a residual operation with the original input feature map X produces the output of the nonlocal block.
The steps above show that nonlocal directly integrates global information rather than simply obtaining a portion of global information by stacking multiple convolutional layers. This brings richer semantic information to the layers that follow.
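To make the steps above concrete, the following is a minimal PyTorch sketch of an embedded Gaussian nonlocal block for 2D feature maps (so the 1 × 1 × 1 convolutions of the video setting become 1 × 1 convolutions; the channel compression ratio of 1/2 is an illustrative assumption, not the exact NLNet setting):

```python
# Minimal sketch of an embedded Gaussian nonlocal block for 2D feature maps
# (assumptions: PyTorch, NCHW tensors; sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock2D(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.inter = in_channels // 2  # channel compression (step 1 above)
        # 1x1 convolutions implement the linear mappings theta, phi, and g.
        self.theta = nn.Conv2d(in_channels, self.inter, kernel_size=1)
        self.phi = nn.Conv2d(in_channels, self.inter, kernel_size=1)
        self.g = nn.Conv2d(in_channels, self.inter, kernel_size=1)
        # 1x1 convolution that expands back to the original channel count.
        self.out = nn.Conv2d(self.inter, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        # Merge all dimensions except the channels (step 2 above).
        theta = self.theta(x).view(b, self.inter, n).permute(0, 2, 1)  # B x N x C'
        phi = self.phi(x).view(b, self.inter, n)                       # B x C' x N
        g = self.g(x).view(b, self.inter, n).permute(0, 2, 1)          # B x N x C'
        # Pairwise affinities between all positions ("covariance-like" matrix),
        # normalized by softmax into attention coefficients (step 3 above).
        attn = F.softmax(torch.bmm(theta, phi), dim=-1)                # B x N x N
        # Weighted sum of g features, reshape back to a map, channel expansion,
        # and residual connection with the input (step 4 above).
        y = torch.bmm(attn, g).permute(0, 2, 1).reshape(b, self.inter, h, w)
        return x + self.out(y)
```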
3.3. NLNet
CLRNet is a cross-layer refinement network for lane detection that was proposed in a paper accepted at CVPR 2022. This network aims to fully utilize both high-level and low-level features in lane detection, and it is a state-of-the-art method that achieves impressive results on the CULane dataset. Therefore, we applied the nonlocal idea to CLRNet, and thus, a new model that we call NLNet was proposed.
Figure 6 shows the entire structure of the model. The road image in the lower-left corner of the figure is downsampled through P2, P3, and P4, so the CNN kernel window covers a larger observation area and captures more high-level semantic information. From L0 to L2, upsampling is used, and Figure 6 also shows horizontal connections between P2 and L2 and between P3 and L1. These horizontal connections blend the high-level semantic information captured during downsampling with the low-level detail information recovered during upsampling, ensuring that the model extracts more information. In addition, the head contains the ROIGather module, which uses a cross-layer refinement mechanism to refine the output of the previous module, allowing the model to extract still more information. Our contribution was to add a nonlocal module (see the yellow dashed box in the figure), a neural network component that captures long-range dependencies in data. After all these feature extraction processes, the feature map is trained under the guidance of the loss function.
The main idea of CLRNet is to first detect lanes with high-level semantic features, then perform refinement based on low-level features. In this way, it can exploit more contextual information to detect lanes while leveraging the local detailed lane features to improve localization accuracy.
Similar to CLRNet, the network architecture of NLNet consists of two main components: a feature pyramid network (FPN) and a cross-layer refinement module (CLRM). The FPN is used to extract multi-scale features from the input RGB image, and the CLRM is used to refine the lane detection results at each feature stage. The CLRM takes the current stage’s regression results as input and outputs refined regression results that are fed into the next stage. The CLRM also uses cross-layer connections to fuse features from different stages and enhance the feature representation. The FPN is composed of three stages: P2, P3, and P4. Each stage has a different spatial resolution and semantic level. The FPN adopts DLA34 as the backbone network and uses lateral connections to combine low-level and high-level features. The output of each stage is a feature map with 256 channels.
The overall workflow of NLNet is as follows: First, the input image is fed into the FPN to generate three feature maps: P2, P3, and P4. Then, each feature map is processed by the CLRM in a top-down manner, going from P4 to P2, producing stages that we named L0, L1, and L2. The CLRM is applied to each feature map separately and consists of two sub-modules: the cross-layer connection (CLC) and the refinement head (RH). The CLC is a convolutional layer that takes the previous stage's refined result as input and outputs a feature map with the same resolution as the current stage. The RH is a convolutional layer that takes the current stage's feature map and the CLC's output as input and outputs a refined result for the current stage. The fusion module is a convolutional layer that takes all the refined results as input and outputs a fused result with the same resolution as P2. The output module is a convolutional layer that takes the fused result as input and outputs a binary segmentation map for lane detection. The contribution of NLNet is to enhance the long-range dependence of L2 with the nonlocal idea so that more global information can be obtained. The section in the red dotted box in Figure 6 shows the enhancement of the long-range dependencies on L2 using the nonlocal module.
In principle, the nonlocal module can be applied not only to L2 but also to L0 and L1; we could use nonlocal modules for L0, L1, and L2 at the same time, or for only one or two of them. Analyzing the workflow of NLNet, it is obvious that the spatial size gradually increases from L0 to L2. In the smaller feature maps, the lane lines shrink accordingly, and lane features that are only a few pixels in size benefit very little from long-range dependence operations. Therefore, considering the tradeoff between the amount of calculation and the effect, the nonlocal operation is only performed on L2.
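In code, this placement can be sketched as follows (our illustration reusing the NonLocalBlock2D sketch from Section 3.2, not the released NLNet implementation):

```python
# Sketch of where the nonlocal block sits in NLNet (illustrative only).
import torch.nn as nn

class NLNetNeck(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.nl_l2 = NonLocalBlock2D(channels)  # block defined in Section 3.2

    def forward(self, l0, l1, l2):
        # L0 and L1 pass through unchanged: their lane features span only a
        # few pixels and gain little from long-range operations. Only the
        # largest map, L2, is enhanced, balancing cost and effect.
        return l0, l1, self.nl_l2(l2)
```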
We retained the loss functions of the CLRNet model, which means that the line IoU (LIoU) loss is also used in our proposed model. The loss function designed in reference [6] was directly adopted: for each point $x_i^p$ in the predicted lane, we first extend it with a radius $e$ into a line segment. Then the IoU can be calculated between the extended line segment and its ground truth, which is written as follows:

$$ \mathrm{IoU} = \frac{d_i^{\mathcal{O}}}{d_i^{\mathcal{U}}} = \frac{\min(x_i^p + e,\, x_i^g + e) - \max(x_i^p - e,\, x_i^g - e)}{\max(x_i^p + e,\, x_i^g + e) - \min(x_i^p - e,\, x_i^g - e)}, $$

where $x_i^p - e$ and $x_i^p + e$ are the extended points of $x_i^p$, while $x_i^g - e$ and $x_i^g + e$ are the corresponding ground truth points. Note that $d_i^{\mathcal{O}}$ can be negative, which makes it feasible to optimize in the case of non-overlapping line segments. Then a lane line can be considered as the combination of infinite line points, and the LIoU in discrete form can be written as follows:

$$ \mathrm{LIoU} = \frac{\sum_{i=1}^{N} d_i^{\mathcal{O}}}{\sum_{i=1}^{N} d_i^{\mathcal{U}}}. $$

Then, the LIoU loss is defined as

$$ \mathcal{L}_{\mathrm{LIoU}} = 1 - \mathrm{LIoU}, $$

where $-1 \le \mathrm{LIoU} \le 1$; when two lines overlay perfectly, $\mathrm{LIoU} = 1$, and $\mathrm{LIoU}$ converges to $-1$ when the two lines are far away.
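A minimal sketch of this loss (our illustration; the tensor layout and the radius value are assumptions, not the exact CLRNet/NLNet setting) could look as follows in PyTorch:

```python
# Sketch of the line IoU (LIoU) loss: pred and target hold x-coordinates of
# one lane sampled at the same N rows (shape: N); radius e is illustrative.
import torch

def liou_loss(pred: torch.Tensor, target: torch.Tensor, e: float = 7.5) -> torch.Tensor:
    # Extend every point into a horizontal segment of radius e.
    pred_lo, pred_hi = pred - e, pred + e
    tgt_lo, tgt_hi = target - e, target + e
    # Per-row overlap d^O (may be negative for non-overlapping segments)
    # and union d^U.
    overlap = torch.min(pred_hi, tgt_hi) - torch.max(pred_lo, tgt_lo)
    union = torch.max(pred_hi, tgt_hi) - torch.min(pred_lo, tgt_lo)
    # Discrete LIoU over all sampled rows lies in [-1, 1].
    liou = overlap.sum() / union.sum()
    return 1.0 - liou
```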
6. Conclusions
Long-range dependence information refers to relationships between two or more pixels in an image that extend beyond their immediate neighborhood; it measures how dependent a pixel is on other pixels, near or far, in terms of color, texture, and spatial arrangement. Exploiting the long-range dependence information of an image can significantly enhance the accuracy and efficiency of vision tasks such as lane detection. In this study, we first observed that CLRNet does not fully utilize the contribution of other global regions (e.g., faraway pixels) to the current region. We therefore proposed our NLNet model by adding the nonlocal module. Based on the analysis of the feature map size of each layer of the FPN in CLRNet, we creatively applied the nonlocal module directly to the L2 layer of the FPN to enhance the information about long-range dependence. Finally, test experiments demonstrated the effectiveness of long-range dependence information in improving lane line detection performance, and our proposed NLNet model achieved state-of-the-art accuracy on the CULane dataset. As for future research directions, we believe they should include further enhancement of long-range dependence, optimization of the loss function, acceleration of the inference speed, and effective improvement of accuracy in difficult scenarios.