1. Introduction
The fields of autonomous driving (AD) and safety driving assistance systems (SDAS) have attracted significant research attention in recent years, and as deep learning has matured, progress in autonomous driving has accelerated rapidly. Deep learning algorithms often outperform classical algorithms in road image processing because variations in lighting conditions, shadows, road breaks, occlusions, and camera settings frequently degrade classical methods. Many deep-learning-based works on road image processing can be found in the literature, and the two most important challenges are invariably the localization and classification of targets encountered while driving and the segmentation of the road taken by the driver.
In general, the goal of object detection is to locate the studied object in each image using a rectangular prediction frame and output the class and confidence level of the object. The evolution of object detection algorithms is divided into two phases: one is the traditional feature-based solution, and the other is the deep learning algorithm. Before 2013, the mainstream detection algorithms were traditional feature-optimized detection methods, which usually consisted of three parts, the first being the selection of the detection window, the second being the design of the features, and the third being the design of the classifier. The sliding window is used to traverse the whole image and extract the features from the window, and then the classifier is used to detect them. The common methods used in the feature extraction stage are Haar, histogram of oriented gradient (HOG), local binary pattern (LBP), aggregated channel features (ACF), and other operators, and the common classifiers are SVM, boosting, random forest, and so on. With the Adaboost-based face detection method [
1], the object detection algorithm reached the pinnacle of the traditional framework of manually designed features plus shallow classifiers. However, the traditional sliding-window technique needs to handle thousands of windows and performs poorly without optimization strategies. Moreover, manually designed features cannot express the characteristics of the object in sufficient detail, resulting in a lower recognition rate. Consequently, after 2013, academia and industry gradually shifted to convolutional neural networks (CNNs) for object detection.
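As a concrete illustration of this classical pipeline, the sketch below slides a fixed window over an image and scores each crop; the window size, stride, and the toy brightness classifier are hypothetical stand-ins for a real hand-crafted descriptor (Haar/HOG/LBP/ACF) and a trained SVM or boosted classifier.

```python
import numpy as np

def sliding_window_detect(image, window=(64, 64), stride=32, classifier=None):
    """Exhaustively slide a fixed-size window over the image and score each
    crop with a classifier -- the classical pre-2013 detection pipeline."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - window[1] + 1, stride):
        for x in range(0, w - window[0] + 1, stride):
            crop = image[y:y + window[1], x:x + window[0]]
            # Stand-in for a hand-crafted descriptor (Haar/HOG/LBP/ACF):
            feature = np.array([crop.mean(), crop.std()])
            score = classifier(feature)
            if score > 0.5:
                detections.append((x, y, window[0], window[1], score))
    return detections

# Toy classifier: flags bright windows (a real system would use an SVM).
bright = lambda f: 1.0 if f[0] > 128 else 0.0
img = np.zeros((128, 128), dtype=np.uint8)
img[64:, 64:] = 255  # bright "object" in the bottom-right quadrant
boxes = sliding_window_detect(img, classifier=bright)  # one hit at (64, 64)
```

Even this tiny example evaluates nine windows on a 128 × 128 image; at realistic resolutions and multiple scales the window count explodes into the thousands, which is exactly the performance problem noted above.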
On the other hand, classical image processing and computer vision approaches typically divide lane line recognition into four distinct steps. First, an orthogonal coordinate system relating the vehicle body to the road surface is established and the lane detection zone is calculated, which removes a considerable amount of background noise and extraneous pixel information. The second step is lane line feature enhancement: image enhancement techniques such as smoothing and sharpening are used to improve image quality, for example, image-enhancement algorithms for night, fog, or shadow photos [
2,
3,
4]. The extraction of image features is the third step in the procedure. Image-based algorithms for lane line detection rely primarily on picture characteristics such as lane line shapes, pixel gradients, and color cues to identify lane lines [
5,
6,
7]. In the end, some straight lines or curves are used to match the lane lines [
8,
9,
10]. To employ these classic methods, the filtering operators must be tuned and the algorithm’s parameters manually adjusted according to the features of the targeted street scene. In addition, lane line recognition fails when the driving environment undergoes major changes, and the procedure becomes increasingly complex as the demands on recognition precision grow. Therefore, classical image processing and computer vision techniques are being phased out in favor of semantic segmentation methods, which have only recently begun to be researched.
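The final line-fitting step is classically done with a Hough transform. The minimal sketch below (plain numpy, no OpenCV) votes each edge pixel into (rho, theta) bins and returns the strongest line; a real system would first produce the edge mask via the ROI and feature-extraction steps described above.

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180, top_k=1):
    """Minimal Hough transform: each edge pixel votes for every line
    rho = x*cos(theta) + y*sin(theta) passing through it; peaks in the
    accumulator correspond to dominant lines (e.g., lane markings)."""
    h, w = edge_mask.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    acc = np.zeros((2 * diag, n_theta), dtype=np.int64)
    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    peaks = np.argsort(acc.ravel())[::-1][:top_k]
    rho_idx, theta_idx = np.unravel_index(peaks, acc.shape)
    return [(int(r) - diag, float(np.rad2deg(thetas[t])))
            for r, t in zip(rho_idx, theta_idx)]

# Synthetic "lane marking": a vertical edge at x = 20.
mask = np.zeros((50, 50), dtype=bool)
mask[:, 20] = True
lines = hough_lines(mask)  # strongest line: rho = 20, theta = 0 degrees
```

The manually chosen accumulator resolution and peak threshold are precisely the kind of scene-dependent parameters that make these classical pipelines brittle.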
2. Related Work
Most early deep-learning-based object detection used the sliding-window approach for window extraction, which is essentially exhaustive; R-CNN [
11] follows this line. Later, region proposal algorithms such as selective search were proposed: instead of scanning the image with a sliding window, a set of candidate windows is “extracted”, and their number can be limited to a few thousand or even a few hundred, provided that an acceptable recall is obtained for the objects to be detected [
12].
The SPP layer [
13] addresses the fixed-size input constraint of CNNs with fully connected layers: it first divides the whole feature map into 4 equal parts and extracts features of the same dimension from each part, then divides it into 16 equal parts, and so on. The extracted feature dimensions are consistent regardless of the image size, so they can be fed uniformly into the fully connected layer. Although R-CNN and SPP made great advances in detection, the duplicate computation they entail is problematic, and Fast R-CNN emerged to solve this problem.
Fast R-CNN uses a simplified SPP layer called the region of interest (RoI) pooling layer. Fast R-CNN also uses SVD to decompose the parameter matrix of the fully connected layer, compressing it into two much smaller fully connected layers [
14]. Faster R-CNN uses region proposal networks (RPN) to compute candidate frames directly, which takes a picture of arbitrary size as input and outputs a batch of rectangular regions, each corresponding to a target score and location information [
15]. Image object detectors are usually divided into two types: two-stage detectors, which are characterized by high detection accuracy, and one-stage detectors, which treat object detection as a regression problem. R-CNN, Fast R-CNN, and Faster R-CNN are two-stage algorithms, while YOLO [
16] and SSD [
17] are one-stage algorithms.
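Fast R-CNN’s RoI pooling can be sketched in the same spirit: each proposal, whatever its size, is max-pooled onto a fixed grid. The box coordinates and the 2 × 2 output size below are illustrative (Fast R-CNN typically uses 7 × 7):

```python
import numpy as np

def roi_pool(feature_map, box, output_size=2):
    """RoI pooling as in Fast R-CNN: crop region `box` = (x1, y1, x2, y2)
    from a C x H x W feature map and max-pool it onto a fixed
    output_size x output_size grid."""
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    out = np.empty((c, output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            out[:, i, j] = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out

fmap = np.arange(64, dtype=float).reshape(1, 8, 8)  # toy 1-channel feature map
pooled = roi_pool(fmap, box=(0, 0, 4, 6))           # 4 x 6 proposal -> 2 x 2
```

Because every proposal is reduced to the same grid, all proposals can share one downstream fully connected head.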
According to prior research, it seems that even though the two-stage detector has a high identification rate, it does not perform particularly well when employed in real-time applications [
18]. In road environment recognition, the ability to detect obstacles in real-time is crucial, not only to achieve high recognition rates but also to have fast processing speed. The YOLO family of algorithms [
16,
19,
20,
21,
22] and the single shot multibox detector (SSD) algorithm [
17] are among the one-stage algorithms; these algorithms directly regress the class confidence and coordinate values of the object, making them very fast and well suited for real-time detection.
Deeper neural networks are frequently required to achieve higher accuracy in lane semantic segmentation, and in some recent approaches segmentation networks have become so complex that the resulting models demand large GPU resources and run slowly.
For example, the SCNN method [
23] extracts lane lines by convolving sequentially along a given direction rather than directly between layers, which is effective for slender lane lines, but it is slow at only 7.5 FPS. For curved lanes, CurveLanes-NAS [
24] addresses curved lane line detection by capturing the global coherence and local curvature features of long lane lines from a network search perspective. Although it achieves state-of-the-art results, it is computationally very time-consuming.
To address the speed issue, some recent research focuses on improving segmentation speed, such as SwiftNet for road driving image segmentation. SwiftNet [
25] uses a lightweight general framework with lateral connections and a shared-parameter resolution pyramid to increase the receptive field of the model. Its segmentation speed is high, but its accuracy is limited.
Yu et al. proposed BiSeNet [
26], a bilateral segmentation network in which one module handles spatial information and the other contextual information, together with a new module for fusing their features. It achieves 68.4% IoU on 2048 × 1024 high-resolution images at 105 FPS on an NVIDIA Titan XP. The structure of BiSeNet is relatively simple and its segmentation speed is fast, but its accuracy is still slightly lacking. Currently, the difficulties in semantic segmentation stem from the loss of spatial information and overly small receptive fields, which degrade segmentation accuracy. Even though these lightweight models focus on improving segmentation speed, they produce poor segmentation results and fail to achieve a good trade-off.
To improve object detection and lane semantic segmentation in road environments, this paper proposes a single-stage object detection algorithm and a multi-feature fusion semantic segmentation algorithm that share the same lightweight backbone network and incorporate a modified attention mechanism. The main contributions are as follows.
A modified attention mechanism module that combines spatial and channel attention is proposed. It improves the channel attention module in CBAM [
27] by replacing the fully connected layers of the original module with 1D convolution, which not only avoids dimensionality reduction and effectively captures cross-channel interactions but also greatly reduces the number of parameters. In addition, a residual connection is added around the whole attention mechanism to alleviate the gradient dispersion problem caused by the sigmoid function. The module enhances feature representation by focusing on important features and suppressing unimportant ones, thus improving network accuracy.
The object detection algorithm replaces the backbone network in YOLOv4 [
21] with the lightweight network MobileNet [
28] and modifies some convolutions in the YOLOv4 feature fusion network to reduce the number of parameters in the network. This makes the whole network lighter while ensuring that accuracy is not compromised.
Based on an encoder-decoder end-to-end architecture, MobileNet is used as the backbone feature extraction network for lane detection. The extracted sets of features are decoded through a series of upsampling operations and connected to the corresponding feature layers to finally obtain per-pixel probability results for each category. The model has few parameters, converges quickly, and achieves high accuracy.
Experimental results show that our proposed detection system is effective and maintains high detection quality with a smaller detection model on the PASCAL VOC and freeway driving datasets. Thus, the computational cost of our method is much lower than state-of-the-art methods.
3. Proposed Method
In this paper, we propose a lightweight deep learning system for object detection and lane detection as shown in
Figure 1. First, we propose the lightweight residual convolutional attention network, which makes the network attend to the detailed features it needs and suppresses interference from useless information; it is applied in both the object detection and lane semantic segmentation networks to improve their performance. Second, we propose an object detection network that replaces the YOLOv4 backbone with the lightweight MobileNet and substitutes depthwise separable convolutions for the normal convolutions in the feature fusion network, which, combined with the attention mechanism, makes the network more efficient while greatly reducing the number of parameters. Third, a lane semantic segmentation network is proposed, based on the lightweight MobileNet backbone, using the extracted feature layers and the expansive path of U-Net [
29] as the decoder, which can increase the local receptive field and collect multi-scale information without reducing the dimensionality. Additionally, the feature representation can be further enhanced by inserting the attention mechanism into the feature fusion process. Such an approach effectively utilizes the dataset and improves the segmentation accuracy of the network.
3.1. Lightweight Residual Convolutional Attention Network
The central idea of the attention mechanism is to make the network focus on the features that matter most. When convolutional neural networks process images, we cannot manually specify what deserves attention, so it becomes extremely important that the network attends to important objects adaptively. The attention mechanism is one way to achieve this. The lightweight residual convolutional attention network (LRCA-Net) proposed in this work is an improved attention mechanism designed to increase accuracy.
Early on, attention mechanisms were analyzed through brain imaging, using the winner-takes-all [
30] mechanism to study how to model attention. In deep learning, it is now more important to build neural networks with attention mechanisms, because they can focus on more detailed information about the target and suppress the interference from other useless information. In convolutional neural networks, visual attention is usually divided into two forms: channel attention and spatial attention. The CBAM [
27] consists of a serial connection between the channel attention module and the spatial attention module. It can calculate the attention map of the feature map from both channel and spatial dimensions, and then multiply the attention map with the input feature map to perform adaptive learning of features.
The LRCA-Net proposed in this paper is an improved module based on CBAM. The overall structure is shown in
Figure 2. First, the residual idea is added to the network architecture of the CBAM module: the original feature F and the feature F″ output by CBAM are directly summed and fused. Second, the fully connected layers of the channel attention module are replaced with 1D convolution.
The input feature map F with shape H × W × C passes through the channel attention module Ac to obtain the attention weight Ac(F) with shape 1 × 1 × C. Then, Ac(F) and F are multiplied to obtain the feature F′, as shown in Equation (1). F′ then passes through the spatial attention module As to obtain the attention weight As(F′) with shape H × W × 1. Then, As(F′) and F′ are multiplied to obtain the feature F″, whose shape is H × W × C, as shown in Equation (2):

F′ = Ac(F) ⊗ F,(1)

F″ = As(F′) ⊗ F′,(2)

where ⊗ denotes element-wise multiplication. After this processing the shape of the feature does not change, so the attention mechanism can be inserted after any feature layer without modifying the network. Finally, the output feature F‴ is obtained by summing F and F″ following the residual idea, as shown in Equation (3):

F‴ = F + F″,(3)

where F is the input feature map, F′ is the channel-refined feature, F″ is the spatial-refined feature, F‴ is the output feature, Ac is the channel attention module, and As is the spatial attention module.
In convolution, the feature output has multiple channels, and some channel features have a greater impact on the final target, so attention should be focused on those channels; a common practice is global pooling. Two types of pooling, global max pooling and global average pooling, are used in the original CBAM. As shown in the original channel attention module in
Figure 3, the input feature maps are reshaped into (1, 1, C) after max pooling and average pooling, respectively. To capture nonlinear cross-channel interactions, the original channel attention module uses two fully connected layers with a nonlinearity. The two pooled results of shape (1, 1, C) are summed after passing through the fully connected layers, and then the channel attention weight
Ac of shape (1, 1, C) is obtained by the sigmoid function.
According to the experiments of Wang et al. [
31], two fully connected layers have side effects on channel attention prediction, and capturing the dependencies between all channels is inefficient and unnecessary. Therefore, in this study, we improved the channel attention module of CBAM by exploiting the local cross-channel interaction strategy of the ECA-Net module: the fully connected layers are replaced by a 1D convolution whose kernel size is selected adaptively. As shown in the modified channel attention module in
Figure 3, the two pooling results, reshaped into (1, 1, C), each pass through the 1D convolution, and the summed result is passed through a sigmoid to obtain the weights
Ac. Such a modified module effectively captures cross-channel interactions, thus improving the overall attention mechanism.
The spatial attention module is not modified in this study. For the incoming feature layer, the maximum and average values are taken over the channels, each shaped (H, W, 1). The two results are then concatenated into (H, W, 2), and the number of channels is adjusted using a convolution with 2 input channels and 1 output channel. A sigmoid is then applied, at which point we obtain the spatial attention weight As.
Moreover, as shown in
Figure 2, we modified the original CBAM by adding residual blocks to the entire attention mechanism to solve the gradient dispersion problem caused by the sigmoid function since both the channel and spatial attention modules use the sigmoid function to generate the weights.
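A minimal numpy sketch of the LRCA-Net forward pass, following Equations (1)–(3), is given below. The flat 1D kernel `w_ch` and the two scalar weights `w_sp` are hypothetical stand-ins for the learned 1D convolution and the learned 7 × 7 spatial convolution, respectively; only the data flow is faithful to the description above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lrca_net(F, w_ch, w_sp):
    """Sketch of the LRCA-Net forward pass:
    F' = Ac(F) * F, F'' = As(F') * F', F''' = F + F''."""
    # Channel attention Ac: global max + average pooling, shared 1D conv, sigmoid.
    mx = F.max(axis=(1, 2))
    av = F.mean(axis=(1, 2))
    ac = sigmoid(np.convolve(mx, w_ch, mode="same")
                 + np.convolve(av, w_ch, mode="same"))    # shape (C,)
    F1 = ac[:, None, None] * F                            # Eq. (1)
    # Spatial attention As: channel-wise max/mean maps combined and squashed.
    smax, smean = F1.max(axis=0), F1.mean(axis=0)
    as_map = sigmoid(w_sp[0] * smax + w_sp[1] * smean)    # shape (H, W)
    F2 = as_map[None, :, :] * F1                          # Eq. (2)
    return F + F2                                         # Eq. (3), residual sum

F = np.random.rand(16, 8, 8)
out = lrca_net(F, w_ch=np.ones(3) / 3, w_sp=(0.5, 0.5))  # same shape as F
```

Because the residual sum adds the attended feature back onto F, the output keeps the input’s shape, which is what lets the module be dropped in after any feature layer.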
The structural comparison of CBAM and LRCA-Net is given in
Table 1. Assuming an input feature shape of (26, 26, 512), analysis of the CBAM structure shows that the parameters are concentrated mainly in the two fully connected layers of the channel attention module, and replacing them with 1D convolution makes the overall number of parameters plummet.
3.2. Object Detection Network
YOLOv3 made several enhancements over its predecessors YOLOv1 and YOLOv2, improving category prediction, multi-scale prediction, and bounding box prediction, and supporting multi-label classification. YOLOv4 enhances the backbone network into CSPDarknet53, which contains 29 convolutional layers, a 725 × 725 receptive field, and 27.6 million parameters.
As shown in
Figure 4, the YOLOv4 architecture consists of CSPDarknet53 + SPP + PANet + YOLO Head. After resizing the original picture to 416 × 416 resolution as input, the algorithm employs up-sampling and feature fusion operations to divide the original image into S × S grids, where S is 13, 26, or 52 depending on the scale of the feature map, so that predictions are made on feature maps of several scales. Using three anchor boxes, each grid cell estimates object bounding boxes. Finally, the YOLO Head outputs the bounding box position and size (x, y, h, w), as well as the object’s category with a confidence score.
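The head’s output shapes follow directly from this description. With three anchors per cell and, say, the 20 PASCAL VOC classes (an illustrative choice), each scale predicts an S × S × 3 × (4 + 1 + 20) tensor:

```python
num_classes = 20                 # e.g., PASCAL VOC; dataset-dependent
anchors_per_cell = 3
attrs = 4 + 1 + num_classes      # (x, y, h, w) + objectness + class scores

# One prediction tensor per scale; S depends on the feature-map stride.
shapes = {S: (S, S, anchors_per_cell, attrs) for S in (13, 26, 52)}
total_boxes = sum(S * S * anchors_per_cell for S in (13, 26, 52))
```

For a 416 × 416 input this yields 10,647 candidate boxes per image before confidence filtering and non-maximum suppression.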
Advanced network structures can achieve good accuracy with a smaller number of parameters. This is because having too many parameters in a network structure can eventually lead to slow training. The convergence speed can be greatly accelerated by reducing the number of parameters. It has been a challenge to make the network less computationally intensive while ensuring accuracy.
Reference [
32] found that most neural networks are over-parameterized: redundant weights contribute little to overall accuracy, and in many networks it is even possible to remove 80–90% of the weights with little loss in accuracy. So, after choosing YOLOv4 as the object detection architecture, we make the network lighter with fewer parameters, which makes it more efficient.
First, we will use MobileNet to replace YOLOv4’s backbone network. MobileNet is a lightweight network designed for mobile terminals or embedded devices, which has been developed into v1 [
28], v2 [
33], and v3 [
34] versions. The MobileNet model is built around depthwise separable convolutions, which are a type of factorized convolution that factorizes a standard convolution into a depthwise convolution and a 1 × 1 convolution known as a pointwise convolution. A standard convolution filters and combines inputs into a new set of outputs in a single step, as illustrated in
Figure 5a.
Figure 5b depicts the depthwise separable convolution that divides this into two layers, one for filtering and one for combining. This factorization has the effect of reducing computation and model size significantly.
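The saving from this factorization is easy to quantify. For a 3 × 3 convolution mapping 256 channels to 256 channels (an illustrative layer; biases and batch norm omitted):

```python
def conv_params(k, c_in, c_out):
    """Standard k x k convolution: every output channel filters all inputs."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Depthwise (one k x k filter per input channel) + 1x1 pointwise."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 256, 256)           # 589,824 parameters
dws = dw_separable_params(3, 256, 256)   #  67,840 parameters, ~8.7x fewer
```

For 3 × 3 kernels the reduction approaches 9× as the channel count grows, which is why the factorization shrinks both computation and model size so effectively.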
The iterative concatenation of 3 × 3 depthwise separable convolutions can be used to form the backbone feature extraction network of MobileNetv1 as shown in
Figure 6 to replace the backbone network of the original YOLOv4. We take the last three effective feature layers of MobileNetv1 for subsequent enhanced feature extraction.
As shown in
Figure 7a, MobileNetv2 is an upgraded version of MobileNetv1 whose key feature is the inverted residual block (Inverted Resblock); the whole MobileNetv2 is composed of such blocks. The left side is the backbone part, which first uses 1 × 1 convolution to expand the channels, then 3 × 3 depthwise separable convolution for feature extraction, and then 1 × 1 convolution to reduce the channels again. The right side is the residual edge, where the input and output are directly connected. As can be seen in
Figure 7b, MobileNetv3 combines ideas from three models: MobileNetv1’s depthwise separable convolutions, MobileNetv2’s inverted residual with linear bottleneck, and an attention mechanism introduced in the bneck structure, which works by adjusting the weights of each channel. The hard-swish activation function is also introduced to reduce the number of operations and improve performance. The backbone network of YOLOv4 is replaced by MobileNetv2 and MobileNetv3 in the same way as in
Figure 6, so that we have three object detection networks, MobileNetv1-YOLOv4, MobileNetv2-YOLOv4, and MobileNetv3-YOLOv4. The total number of parameters of each network is calculated and compared with the total parameters of the original YOLOv4 network to obtain the results shown in
Figure 8.
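The inverted residual structure can likewise be summarized as a parameter count: expand with a 1 × 1 convolution, filter with a k × k depthwise convolution, then project back down through the linear bottleneck. The expansion factor t = 6 and the 32-channel setting below are illustrative assumptions:

```python
def inverted_residual_params(c_in, c_out, t=6, k=3):
    """Parameters of one MobileNetv2 inverted residual block
    (biases and batch-norm parameters omitted)."""
    hidden = c_in * t
    expand = c_in * hidden        # 1x1 conv up to the wide hidden layer
    depthwise = k * k * hidden    # one k x k filter per hidden channel
    project = hidden * c_out      # 1x1 conv back down (linear bottleneck)
    return expand + depthwise + project

p = inverted_residual_params(32, 32)  # 6144 + 1728 + 6144 = 14,016
```

Note how the depthwise stage stays cheap even though it operates on the widened hidden layer, which is the point of the inverted design.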
After replacing the backbone network, the number of parameters in each network is significantly reduced compared to the original YOLOv4, but it remains large at about 40 million. Therefore, this work replaces the standard convolutions in the SPP and PANet with depthwise separable convolutions, as shown in
Figure 9.
This replacement can significantly reduce the number of weights of the network.
As shown in
Figure 10, the number of parameters of the fully improved network is only one-sixth that of the original YOLOv4. Finally, to further improve detection accuracy, as shown in
Figure 11, our proposed attention mechanism is added after the feature fusion network layer, before the results are output.
3.3. Lane Detection Network
A semantic segmentation network is a model based on deep convolutional networks. Its most basic task is to classify the pixels in an image and aggregate pixels of the same class so as to distinguish the different target objects in the image [
35,
36,
37]. The U-Net model structure can increase the local receptive field and collect multi-scale information without reducing the dimensionality. Deep learning networks usually require large datasets for training; the U-Net approach uses datasets more efficiently and enables accurate segmentation even with few training images, which is why U-Net is often used in medical image processing. Continuing this idea, this work proposes three lightweight road semantic segmentation models using the MobileNet series as the backbone network, as shown in
Figure 12, and the models are divided into four parts.
The first part is the encoding process, with MobileNetv1 as the backbone network, i.e., the feature extraction process. It consists of many depthwise separable convolutions strung together iteratively, which, as noted above, significantly reduces the number of network parameters. The backbone yields five feature layers, whose shapes are (208, 208, 64), (104, 104, 128), (52, 52, 256), (26, 26, 512), and (13, 13, 1024). Since we only need to extract the single safe lane the driver is driving in, with few classes, the first feature layer (208, 208, 64) and the last layer (13, 13, 1024) are discarded, and the middle three effective feature layers are used for feature fusion.
The second part is feature layer fusion, which enhances the diversity of the extracted features. The backbone outputs the feature layer (26, 26, 512), which, after the ZCB module (ZeroPadding + Conv2D + BN), is combined with the attention mechanism module proposed in this study, upsampled, and then stacked with the (52, 52, 256) feature layer output by the backbone; this process is repeated to finally obtain a valid feature layer that fuses all features.
The third part is the attention mechanism. Adding the LRCA-Net module to the overlay process of the feature fusion layer enables the network to focus its attention on the effective features and ensures the normal convergence during training.
The fourth part is prediction, where the final effective feature layer obtained in the second part is used to classify each feature point via softmax, which is equivalent to classifying each pixel. The resulting semantic segmentation model combines the advantages of a lightweight backbone with few parameters and an encoder-decoder structure with skip connections.
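The prediction step amounts to a per-pixel softmax over the class axis of the final feature layer. A minimal sketch with a hypothetical two-class (background/lane) logit map:

```python
import numpy as np

def pixel_softmax(logits):
    """Softmax over the last (class) axis of an (H, W, num_classes)
    logit map -- i.e., an independent classification of every pixel."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.zeros((4, 4, 2))
logits[:, :2, 1] = 5.0               # left half strongly "lane"
probs = pixel_softmax(logits)        # per-pixel class probabilities
mask = probs.argmax(axis=-1)         # segmentation mask: 1 = lane
```

Taking the argmax over the class axis turns the probability map into the final segmentation mask.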
5. Conclusions
In this paper, we propose a novel lightweight detection system, applied in a safe driving assistance system, that combines two detection routes, object detection and lane detection, with an improved attention mechanism on top of the same backbone network. First, to improve detection accuracy, an attention mechanism module is used to capture cross-channel interaction information efficiently while greatly reducing the number of parameters. Second, the YOLOv4 backbone network is replaced by the lightweight MobileNet, and the ordinary convolutions in the feature fusion network are replaced with depthwise separable convolutional layers, which, combined with the attention mechanism, makes the network more efficient. Third, using the feature layers extracted by the backbone network and U-Net’s expansive path as a decoder increases the local receptive field and collects multi-scale information without reducing the dimensionality. Additionally, the features are further enhanced by inserting the attention mechanism into the feature fusion process. Such an approach effectively utilizes the dataset and improves the segmentation accuracy of the network.
The proposed algorithm was evaluated on the PASCAL VOC object detection dataset and a highway driving dataset, reaching 93.2% mAP and 93.3% mIoU, respectively, a high performance compared with other methods.