Article

A Novel Network Framework on Simultaneous Road Segmentation and Vehicle Detection for UAV Aerial Traffic Images

1 Project Construction Management Company of Jiangxi Transportation Investment Group Co., Ltd., Nanchang 330108, China
2 School of Computer and Control Engineering, Yantai University, Yantai 264005, China
3 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Sensors 2024, 24(11), 3606; https://doi.org/10.3390/s24113606
Submission received: 22 April 2024 / Revised: 27 May 2024 / Accepted: 29 May 2024 / Published: 3 June 2024
(This article belongs to the Section Vehicular Sensing)

Abstract

Unmanned Aerial Vehicle (UAV) aerial sensors are an important means of collecting ground image data. Road segmentation and vehicle detection of drivable areas in UAV aerial images can be applied to road monitoring, traffic flow detection, traffic management, etc., and can be integrated with intelligent transportation systems to support the related work of transportation departments. Existing algorithms realize only a single task, whereas intelligent transportation requires multiple tasks to be processed simultaneously, so they cannot meet complex practical needs. Moreover, UAV aerial images are characterized by variable road scenes, a large number of small targets, and dense vehicles, which make these tasks difficult to complete. In response to these issues, we propose to implement road segmentation and on-road vehicle detection in the same framework for UAV aerial images, and we conduct experiments on a self-constructed dataset based on the DroneVehicle dataset. For road segmentation, we propose a new algorithm, C-DeepLabV3+. The new algorithm introduces the coordinate attention (CA) module, which obtains more accurate segmentation target location information and makes the segmentation target edges more continuous. The improved algorithm also introduces a cascade feature fusion module to prevent the loss of detail information in road segmentation and to obtain better segmentation performance. For vehicle detection, we propose an improved algorithm, S-YOLOv5, by adding the parameter-free lightweight attention module SimAM. Finally, the proposed road segmentation–vehicle detection framework unites the C-DeepLabV3+ and S-YOLOv5 algorithms to implement the serial tasks. The experimental results show that on the constructed ViDroneVehicle dataset, the C-DeepLabV3+ algorithm has an mPA value of 98.75% and an mIoU value of 97.53%, which can better segment the road area and solve the problem of occlusion. The S-YOLOv5 algorithm achieves an mAP value of 97.40%, exceeding YOLOv5’s 96.95%, and effectively reduces the vehicle omission and false detection rates. The results of both algorithms are superior to multiple state-of-the-art methods. The overall framework proposed in this paper has superior performance and is capable of realizing high-quality and high-precision road segmentation and vehicle detection from UAV aerial images.

1. Introduction

In recent years, UAVs have developed rapidly and found a wide range of applications. With strong environmental adaptability, flexible scheduling, and low operating costs, UAVs are widely used in intelligent transport systems, fire rescue, and environmental monitoring [1]. At the same time, there is an increasingly urgent demand for the intelligent analysis and processing of aerial images captured by UAVs. Using UAVs to extract road information and perform vehicle detection has become a hot research topic [2].
In terms of road segmentation, road information includes road type, shape, location, sign lines, etc. Real-time and efficient road information plays an important role in intelligent transport, urban planning, and emergency response. UAVs can fly quickly and efficiently, covering large road areas and adapting to complex inspection environments, and they provide high-frequency data updates that enable the real-time monitoring of road scenes. UAV aerial imagery is an important source of information for military reconnaissance, emergency rescue, geological exploration, and other fields [3]. Although road segmentation from aerial images has received considerable attention, the task remains challenging due to factors such as complex shooting angles and irregular road segments. Traditional methods for extracting roads from UAV aerial images are costly and time-consuming and have low segmentation accuracy. Mnih et al. [4] were the first to apply deep learning to aerial road segmentation tasks. Since then, researchers have gradually proposed various deep learning-based methods to extract road regions from aerial images. These methods show higher efficiency compared to traditional algorithms. However, the road segmentation task still suffers from problems such as unclear road boundary delineation, inaccurate delineation results, and insensitivity to occlusion.
Deep learning-based target detection methods are widely used in the fields of traffic target detection, fire detection [5], and plant disease identification [6]. In terms of vehicle detection, detecting vehicle targets on the road from drone aerial images can promote the construction of urban intelligent traffic [7,8]. Intelligent transportation is a current hot topic in the field of urban transport and the future direction of development in the transport field. Vehicle detection based on UAV aerial images can provide basic data for the construction of intelligent transport, such as traffic flow and congestion, thus providing more detailed and comprehensive data support for the realization of intelligent transport. Despite the rapid development of deep learning-based target detection methods, UAV aerial images are limited by the shooting angle and shooting distance. As a result, vehicle detection faces difficulties such as a large number of small targets, complex background conditions, inconspicuous features, and occlusion. These unfavorable factors lead to unsatisfactory results from representative algorithms such as Faster R-CNN [9] and YOLOv5 for vehicle detection in aerial images.
To address the above challenges, this paper proposes the C-DeepLabV3+ algorithm for road segmentation, based on DeepLabV3+, and the S-YOLOv5 algorithm for vehicle detection, based on YOLOv5, and the two improved algorithms are united to form a road segmentation–vehicle detection framework. The two new algorithms achieve better results through a series of improvements and are more applicable to UAV aerial traffic images.
The contributions of this paper are as follows:
  • A new road segmentation algorithm, C-DeepLabV3+, is proposed, which introduces a Coordinate Attention module to obtain more accurate segmented target location information and employs a cascaded feature fusion module in the decoder. The algorithm significantly improves the performance of road segmentation in UAV aerial traffic images.
  • A new target detection algorithm, S-YOLOv5, is proposed, which is based on the YOLOv5 algorithm with the addition of a parameter-free attention module known as SimAM. This algorithm enhances the ability to capture and detect the contextual features of vehicle targets, which effectively improves the performance of vehicle detection.
  • We propose a road segmentation–vehicle detection algorithm under the same framework for accomplishing the serial tasks of road segmentation and vehicle detection on the road under UAV aerial images. Specifically, the C-DeepLabV3+ algorithm is utilized to accurately extract the road region in the aerial images, and then the vehicle targets on the road regions are accurately detected based on the extracted images using S-YOLOv5 so that the two tasks are serially combined together with high quality.
The rest of this paper is organized as follows. Section 2 introduces related works about road segmentation, road segmentation from UAV aerial images, object detection, and object detection from UAV aerial images. Section 3 describes the proposed method in detail. The experimental results are presented in Section 4. Finally, conclusions are given in Section 5.

2. Related Work

2.1. Road Segmentation Methods

Road segmentation is achieved through image segmentation techniques and is an important research direction in the field of computer vision. Currently, there are two types of methods for road segmentation in images: traditional methods and deep learning-based methods. The difference between the two lies in the levels of road image features used: the former makes use of the medium and shallow features of the image, while the latter makes use of the medium and deep features of the image in order to obtain higher accuracy.
The traditional methods can be divided into three categories: feature-level [10,11,12,13,14,15], object-level [16,17,18], and knowledge-level [19,20,21,22,23,24] methods. Feature-level methods mainly use the geometric and radial features of road images. Unsalan et al. [13] extracted the initial edges of roads and then used the binary balloon algorithm and graph theory to extract roads. Object-level methods take the whole road image as a processing unit, obtain segmented regions using image segmentation techniques, and then classify road and nonroad regions. Knowledge-level methods identify and extract roads by building road-related knowledge models. For example, Liu et al. [24] used the information of selected sample road sections to construct a geometric knowledge base of rural roads and, based on this knowledge base, extracted complete rural roads from remote sensing images.
Deep learning has greatly advanced the field of computer vision [25], and many deep learning methods have also been applied to road segmentation. The Convolutional Neural Network (CNN) can provide better road segmentation results than traditional methods by embedding rich high-level and multiscale information. For example, Li et al. [26] proposed combining deep learning and line integral convolution for road network segmentation from high-resolution images. Tao et al. [27] proposed Seg-Road, a road segmentation network with a connectivity structure based on the transformer and CNN.
In recent years, there has been a gradual increase in the use of end-to-end [28,29,30,31] neural network models to extract road regions. These methods can extract more semantic information by performing feature fusion and feature refinement in semantic segmentation models, thus improving model accuracy. For example, Zhou et al. [31] constructed the semantic segmentation network D-LinkNet based on LinkNet [32], increasing the receptive field to improve road segmentation accuracy. In addition, the generative adversarial network (GAN) [33,34,35,36] has also been introduced into road segmentation research. Zhang et al. [34] proposed a remote sensing image road segmentation method based on an improved generative adversarial network, a conditional GAN with L2 loss that requires only a small number of training samples and a small amount of computation.

2.2. Road Segmentation Methods Based on UAV Aerial Images

The changing angles and complex backgrounds of UAV aerial images make road segmentation on them more challenging. Starting with the work of Mnih, deep learning-based methods have been used to detect road regions in aerial images. Saito et al. [37] used a CNN to directly segment roads and buildings in aerial images. To improve the performance of the CNN, Alshehhi et al. [38] replaced the fully connected layer in the CNN with global pooling and combined the shallow features of roads in neighboring regions with the deeper features of the convolutional neural network. Panboonyuen et al. [39] proposed an improved DCNN with landscape metrics to further reduce the number of misclassified road objects and finally optimized the road segmentation results using a Conditional Random Field (CRF). To improve the recall of segmentation, Im et al. [40] proposed improving the performance of CNN-based road segmentation algorithms using road label thickening and thinning.
Compared to the CNN, Fully Convolutional Networks (FCNs) are better suited for road segmentation because their extracted features contain more high-level abstract semantic information. Zhong et al. [41] applied the FCN to the task of extracting buildings and road regions from high spatial resolution images. To obtain better road classification results, Fu et al. [42] designed a multiscale network based on the FCN and used the CRF to refine the output maps. Based on U-Net [43], Zhang et al. [28] proposed ResU-Net for road extraction, replacing the basic unit of U-Net with the residual unit to facilitate training and alleviate the degradation problem. Chen et al. [44] used multiple lightweight U-Nets, with the output of each U-Net serving as the input of the next. Meanwhile, they designed a multiobjective optimization structure to jointly train all the U-Nets, which solved the problem of the poor training effect of multiple independent models.

2.3. Object Detection Methods

Object detection is a hot issue in the current field of computer vision. Its main task is to locate targets in an image and classify them. Object detection algorithms fall into two categories: traditional algorithms and deep learning-based algorithms.
Traditional object detection algorithms require hand-designed features and can be divided into three steps: selecting regions, extracting features, and classifying them. Dalal et al. [45] first proposed the HOG feature descriptor, which greatly improved on scale-invariant [46] and shape context-based [47] methods. The Deformable Part Model (DPM) [48] represents the pinnacle of target detection based on manually extracted features. The development of deep learning has achieved a successful shift from manually designed features to feature extraction using convolutional neural networks. CNNs with different depth hierarchies allow richer multilayered feature representations, resulting in significant improvements in object detection performance. Commonly used deep learning-based detection algorithms include two-stage and single-stage algorithms. Two-stage algorithms generate candidate regions first and then perform classification and regression on each candidate region, whereas single-stage algorithms perform classification and regression directly in one step. Girshick et al. [49] combined region proposals with the CNN and proposed R-CNN, a representative two-stage object detection algorithm that performed much better than traditional object detection algorithms. To address the shortcomings of the selective search algorithm, the Faster R-CNN algorithm utilizes the Region Proposal Network (RPN) to extract candidate boxes, which improves the inference speed of the network. Two-stage algorithms achieve higher accuracy than single-stage algorithms but are slower. The YOLO series of algorithms [50,51,52,53,54] use regression to apply a single CNN to the whole image, dividing the image into grids of the same size to directly predict the category and bounding box for each grid cell. The SSD [55] algorithm combines the regression concept of YOLO with the anchor box mechanism of Faster R-CNN, which effectively handles objects of different sizes.

2.4. Object Detection Methods Based on UAV Aerial Images

Deep learning-based methods for object detection with UAVs have been widely used. For the problem of detecting congested regions in aerial images, Yang et al. [56] proposed the ClusDet detection network, which consists of four subnetworks: congested region extraction, iterative merging, scale estimation, and detection. The experimental results of this network on the VisDrone 2019 dataset [57] showed that ClusDet has higher detection accuracy than the baseline detection network, especially for small targets. To better run target detection algorithms on UAV embedded devices, Zhang et al. [58] proposed the SlimYOLOv3 algorithm, which greatly reduces the number of model parameters by pruning the channels of the convolutional layers of YOLOv3 and improves detection accuracy and speed to some extent. Liu et al. [59] proposed an anchor-free detector, EdgeYOLO, for real-time detection on edge computing platforms. The algorithm uses data augmentation to suppress overfitting during training and a hybrid random loss function to improve the detection accuracy of small targets. Zhu et al. [60] proposed TPH-YOLOv5 based on YOLOv5 by adding a prediction head to detect objects at different scales and replacing the original prediction heads with Transformer Prediction Heads (TPHs) to explore the potential of prediction with a self-attention mechanism. They also integrated a CBAM [61] to find the attention regions in scenes with dense targets, achieving good performance in target detection in UAV scenes.

3. Methods

3.1. Overview of the Road Segmentation–Vehicle Detection Framework

The road segmentation and vehicle detection tasks are performed serially and sequentially, and together they constitute the proposed road segmentation–vehicle detection framework. Therefore, improving the performance of the road segmentation and vehicle detection algorithms is crucial to ensuring the performance of the segmentation–detection framework.
The implementation diagram of the road segmentation–vehicle detection framework consisting of the C-DeepLabV3+ and S-YOLOv5 algorithms is shown in Figure 1. Firstly, the C-DeepLabV3+ algorithm is used to segment the road region and nonroad region of the UAV aerial image, and only the road region is extracted from the original image to generate the corresponding road postsegmentation image. Then, the S-YOLOv5 algorithm is used to perform vehicle target detection on the road-extracted image. Since the detection image contains only the road region, only the vehicles on the road are detected, the types of vehicles detected are categorized into cars and trucks, and the corresponding detection image is generated. Eventually, the visualized image is obtained after the completion of the two tasks of road segmentation and on-road vehicle detection.
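The serial pipeline can be summarized as the following minimal Python sketch. The wrapper functions `seg_model` and `det_model` around the trained C-DeepLabV3+ and S-YOLOv5 models, as well as the exact input/output conventions, are assumptions for illustration only and are not part of the authors' released code.

```python
def run_framework(image, seg_model, det_model):
    """Serial road segmentation -> on-road vehicle detection pipeline.

    `seg_model` and `det_model` are assumed wrappers around the trained
    C-DeepLabV3+ and S-YOLOv5 models; `image` is an H x W x 3 array.
    """
    # Step 1: segment road vs. non-road pixels (binary mask of shape H x W).
    road_mask = seg_model(image)

    # Step 2: keep only the road region; non-road pixels are zeroed out.
    road_only = image * road_mask[..., None]

    # Step 3: detect vehicles (cars / trucks) on the road-extracted image.
    detections = det_model(road_only)  # e.g. [(x1, y1, x2, y2, class, score), ...]
    return road_mask, detections
```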

3.2. Road Segmentation Based on UAV Aerial Images

Road segmentation in UAV aerial images suffers from inaccurate results due to problems such as variable road regions and occlusion. Existing methods find it difficult to achieve the ideal segmentation accuracy, especially in complex scenes: road edges are not smooth, and nonroad and road regions cannot be accurately distinguished. In order to improve the precision and accuracy of road segmentation, a new C-DeepLabV3+ algorithm is proposed. The C-DeepLabV3+ structure is shown in Figure 2. This road segmentation algorithm can handle both ordinary roads and intersections.
On the one hand, this algorithm introduces the Coordinate Attention (CA) module, which pays more attention to the key characteristics of the road. In this way, more accurate segmentation target position information can be obtained so that the segmentation target edges are more continuous, and the accuracy of segmentation is effectively improved. On the other hand, a cascade feature fusion module is added to the decoder to fuse multiscale features at different levels. It improves the ability of the network to represent semantic information in segmentation, reduces the loss of road information, and compensates for the shortcomings of traditional algorithms in extracting roads from aerial images. The loss function used by the C-DeepLabV3+ algorithm is shown in Equation (1).
$$\mathrm{DiceLoss} = 1 - \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i} \quad (1)$$
where $y_i$ and $\hat{y}_i$ denote the labeled and predicted values of pixel i, respectively, and N is the total number of pixels, which equals the number of pixels in a single image multiplied by the batch size.
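A minimal PyTorch sketch of Eq. (1) is given below; the small smoothing term `eps` is our addition to avoid division by zero and is not part of Eq. (1).

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice loss of Eq. (1): `pred` holds predicted road probabilities in
    [0, 1], `target` holds binary labels, both of shape (batch, H, W)."""
    pred = pred.reshape(-1)
    target = target.reshape(-1)
    intersection = (pred * target).sum()
    return 1.0 - 2.0 * intersection / (pred.sum() + target.sum() + eps)
```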

3.2.1. Coordinate Attention

The basic idea of the attention mechanism is to extract key information from the input features based on different weights. In DeepLabV3+, the small, deep semantic feature map extracted by the backbone network leads to highly abstract features, insufficient sensitivity to important features, and a lack of sufficient spatial information. Our proposed C-DeepLabV3+ embeds feature location information into the channel attention by introducing the CA module. The improved algorithm not only captures channel information but also senses orientation and location information. It improves the sensitivity to important semantic information, enhances the attention to the segmented target, and helps the network model segment the target object more accurately. The implementation process of the CA module is shown in Figure 3.
Firstly, pooling operations are performed on the input feature map along the width and height directions to extract global information while retaining the spatial structure of the feature map. When a feature map X of dimension $C \times H \times W$ is input to the CA module, the module uses two sets of pooling kernels, (H, 1) and (1, W), to average-pool along the horizontal and vertical directions of each channel, respectively, to obtain two feature maps. The outputs at height h and width w on the c-th channel are shown in Equations (2) and (3), respectively.
$$Z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} X_c(h, i) \quad (2)$$
$$Z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} X_c(j, w) \quad (3)$$
In order to fully take advantage of the positional information and interchannel relationships captured in the previous stage, a transform operation is performed on the feature maps $Z^h$ and $Z^w$ using a shared 1 × 1 convolution. This allows the region of interest for each channel to be accurately captured, thus yielding an intermediate feature map f of size $C/r \times 1 \times (H + W)$, as shown in Equation (4):
$$f = \delta\left(F_1\left(\left[Z^h, Z^w\right]\right)\right) \quad (4)$$
where δ is a nonlinear activation function, r is the downsampling ratio, and f is an intermediate feature map that encodes spatial information in the horizontal and vertical directions.
Then, f is sliced along the spatial dimension into two tensors $f^h$ and $f^w$ of sizes $C/r \times H \times 1$ and $C/r \times 1 \times W$, respectively. $f^h$ and $f^w$ are then transformed to the same channel size as the input X using two 1 × 1 convolutions, $F_h$ and $F_w$. The transformation is shown in Equations (5) and (6).
$$g^h = \sigma\left(F_h\left(f^h\right)\right) \quad (5)$$
$$g^w = \sigma\left(F_w\left(f^w\right)\right) \quad (6)$$
where σ is the sigmoid activation function.
Finally, the output feature maps $g^h$ and $g^w$ are expanded and used as the final weights of the CA module, as shown in Equation (7):
$$Y_c(i, j) = X_c(i, j) \times g_c^h(i) \times g_c^w(j) \quad (7)$$
From the implementation process described above, it can be seen that the CA module encodes the weights of the input feature map along two dimensions, height and width. Therefore, introducing the CA module into the segmentation network can effectively improve the ability of the model to convey high-dimensional semantic information.
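The following is a minimal PyTorch sketch of the CA module following Eqs. (2)–(7). The reduction ratio, the use of ReLU for the nonlinearity δ, and the layer names are assumptions and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of the CA module (Eqs. (2)-(7))."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared F1 in Eq. (4)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                        # delta in Eq. (4)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h in Eq. (5)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w in Eq. (6)

    def forward(self, x):
        b, c, h, w = x.shape
        # Eqs. (2)-(3): directional average pooling along width and height.
        z_h = x.mean(dim=3, keepdim=True)                        # (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (B, C, W, 1)
        # Eq. (4): concatenate and transform with the shared 1x1 convolution.
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        # Eqs. (5)-(6): per-direction attention weights.
        g_h = torch.sigmoid(self.conv_h(f_h))                        # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))    # (B, C, 1, W)
        # Eq. (7): reweight the input feature map in both directions.
        return x * g_h * g_w
```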

3.2.2. Cascade Feature Fusion

The multiscale feature layers are one of the important characteristics of the backbone feature extraction network in DeepLabV3+. Shallow features contain local information that provides pixel-level localization accuracy, while deep features are global features that provide accurate contextual semantic information. Therefore, the efficient fusion of shallow and deep features can improve the feature representation ability of the network. In the DeepLabV3+ model, the decoding layer only splices and fuses the shallow and deep features at 1/4 of the original image size, which easily loses the detailed information of the segmented object and leads to insufficient segmentation accuracy. In order to obtain better segmentation performance, C-DeepLabV3+ employs a cascade feature fusion module to fuse feature maps of different scales. The structure of the cascade feature fusion module is shown in Figure 4.
The feature maps $F_1$ and $F_2$, at 1/8 and 1/16 of the original image size, respectively, are first used as inputs to the cascade feature fusion module. Then, $F_2$ is upsampled by a factor of 2 using bilinear interpolation and passed through a 3 × 3 dilated convolution with a dilation rate of 2 so that its size and receptive field match those of $F_1$. At the same time, a 1 × 1 convolution is applied to $F_1$ to match the number of feature channels of $F_2$. Next, the processed $F_1$ and $F_2$ are normalized using Batch Normalization layers. Finally, the two features are summed and passed through a ReLU layer to obtain the fused feature. The fused feature makes full use of the detail information of the input features and achieves the superposition of feature maps at different scales. It avoids the loss of detail information in the road segmentation process and can effectively improve the precision and accuracy of road segmentation.
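A minimal PyTorch sketch of the cascade feature fusion described above is given below; the channel arguments and module name are assumptions for illustration and may differ from the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class CascadeFeatureFusion(nn.Module):
    """Sketch of the cascade feature fusion module in Section 3.2.2."""
    def __init__(self, low_channels, high_channels, out_channels):
        super().__init__()
        # 3x3 dilated convolution (dilation 2) applied to the upsampled F2.
        self.conv_low = nn.Conv2d(low_channels, out_channels, kernel_size=3,
                                  padding=2, dilation=2, bias=False)
        self.bn_low = nn.BatchNorm2d(out_channels)
        # 1x1 convolution on F1 to match the number of feature channels.
        self.conv_high = nn.Conv2d(high_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn_high = nn.BatchNorm2d(out_channels)

    def forward(self, f1, f2):
        # f1: 1/8-scale feature map, f2: 1/16-scale feature map.
        f2_up = F.interpolate(f2, scale_factor=2, mode="bilinear",
                              align_corners=False)
        f2_up = self.bn_low(self.conv_low(f2_up))
        f1_proj = self.bn_high(self.conv_high(f1))
        # Element-wise sum followed by ReLU gives the fused feature.
        return F.relu(f2_up + f1_proj)
```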

3.3. Vehicle Detection Based on UAV Aerial Images

UAV aerial images often contain a large number of small targets with high target density, which makes it difficult for the YOLOv5 algorithm to cope with these scenarios. This requires the detection algorithm to pay more attention to key regions in the image to improve its feature extraction ability. Existing studies show that the attention mechanism is an effective means of improving an algorithm's feature-capturing ability: by reducing the focus on noncritical information, the network can learn more useful information. In this paper, our proposed S-YOLOv5 algorithm adds a SimAM attention module to the neck network of YOLOv5 to improve the detection accuracy of vehicle targets in complex environments from the UAV perspective. The neck network structure of S-YOLOv5 is shown in Figure 5.
Existing attention modules typically focus only on the channel domain or the spatial domain. These methods produce one-dimensional or two-dimensional weights that treat the neurons at each channel or spatial location equally, which may limit their ability to learn more discriminative cues. Complete three-dimensional weights are superior to traditional one- and two-dimensional attention, and features can be better refined using them. Inspired by the attention mechanisms of the human brain, the SimAM module is able to infer three-dimensional weights directly from the current neuron. It takes into account both the spatial and channel dimensions, allowing the network to learn more discriminative neurons, and it assigns higher weights to important neurons without increasing the number of network parameters. The module schematic is shown in Figure 6.
The SimAM module assigns higher weights to more important neurons by generating spatial inhibition of neighboring neurons, and it introduces an energy function to measure the importance of each neuron. After a series of simplifications, the expression for the minimal energy function is shown in Equation (8).
$$e_t^* = \frac{4\left(\hat{\sigma}^2 + \lambda\right)}{\left(t - \hat{\mu}\right)^2 + 2\hat{\sigma}^2 + 2\lambda} \quad (8)$$
where t is the target neuron, λ is a hyperparameter, and $\hat{\mu}$ and $\hat{\sigma}^2$ denote the mean and variance on each channel, respectively, which are calculated as shown in Equations (9) and (10).
$$\hat{\mu} = \frac{1}{M}\sum_{i=1}^{M} x_i \quad (9)$$
$$\hat{\sigma}^2 = \frac{1}{M}\sum_{i=1}^{M}\left(x_i - \hat{\mu}\right)^2 \quad (10)$$
where $x_i$ denotes the neighboring neurons of t on the same channel, and M is the number of neurons on that channel.
The lower the energy, the greater the difference between neuron t and its surrounding neurons, and the higher its importance. Therefore, the importance of a neuron can be obtained from $1/e_t^*$. The features are then enhanced according to Equation (11) to obtain the refined feature tensor $\tilde{X}$.
$$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X \quad (11)$$
where X is the input feature tensor, E groups all $e_t^*$ across the channel and spatial dimensions, and ⊙ is the Hadamard product. The sigmoid function is used to limit excessive values of E without affecting the relative importance of each neuron.
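A minimal PyTorch sketch of the parameter-free SimAM module following Eqs. (8)–(11) is shown below; the value of the hyperparameter `e_lambda` is an assumption.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention (Eqs. (8)-(11))."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda  # lambda in Eq. (8)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1
        # Squared deviation (t - mu)^2 per channel over the spatial positions.
        d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)
        # Channel-wise variance estimate (sigma^2 in Eqs. (9)-(10)).
        v = d.sum(dim=[2, 3], keepdim=True) / n
        # 1 / e_t* expands to (t - mu)^2 / (4 (sigma^2 + lambda)) + 1/2,
        # which serves as the importance of each neuron.
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        # Eq. (11): sigmoid-scaled reweighting of the input features.
        return x * torch.sigmoid(e_inv)
```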
The bounding box loss function of the S-YOLOv5 algorithm is based on CIoU, as shown in Equation (12).
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha v \quad (12)$$
where $\rho^2(b, b^{gt})$ denotes the squared Euclidean distance between the centroids of the predicted box b and the ground-truth box $b^{gt}$, and c denotes the diagonal length of the smallest closure region that contains both the prediction box and the true box. The term v measures the consistency of the aspect ratios of the two boxes, and α is a positive trade-off parameter.
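The following is a minimal PyTorch sketch of the CIoU term in Eq. (12) for boxes in (x1, y1, x2, y2) format, using the standard definitions of α and v; the small constant `eps` is our addition for numerical stability, and the regression loss minimized during training is 1 − CIoU.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """1 - CIoU for batches of boxes in (x1, y1, x2, y2) format."""
    # Intersection-over-Union.
    ix1, iy1 = torch.max(pred[..., 0], target[..., 0]), torch.max(pred[..., 1], target[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], target[..., 2]), torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance rho^2 and enclosing-box diagonal c^2.
    dx = (pred[..., 0] + pred[..., 2]) / 2 - (target[..., 0] + target[..., 2]) / 2
    dy = (pred[..., 1] + pred[..., 3]) / 2 - (target[..., 1] + target[..., 3]) / 2
    rho2 = dx ** 2 + dy ** 2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha.
    wp, hp = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    wt, ht = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    ciou = iou - rho2 / c2 - alpha * v
    return 1 - ciou
```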
Compared to the YOLOv5 algorithm, the S-YOLOv5 algorithm can effectively capture feature-map information in both global and local dimensions by embedding the SimAM module, thus better detecting and focusing on important locations and features. It strengthens the focus on target objects and enhances the detection of small targets. At the same time, it greatly alleviates the problem of high missed detection and false detection rates for vehicles in dense scenes.

4. Experiment Results

4.1. Dataset

In this paper, we selected DroneVehicle [65], which is a dataset of UAV aerial vehicles released by Tianjin University. This dataset is a collection of aerial images of vehicle classes acquired from a high altitude viewpoint on a complex ground background, with a resolution of 840 × 712 for each image. Since not all the images in the dataset satisfy the experimental requirements, a total of 800 images containing clear road regions and vehicle targets on the road were selected from the dataset to form the experimental dataset. The experimental dataset was divided into training and test sets in the ratio of 7:3, where the training set had 560 images, and the test set had 240 images. Some sample images from the DroneVehicle dataset are shown in Figure 7.
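The 7:3 split of the 800 selected images (560 training / 240 test) can be reproduced with a sketch such as the one below; the random seed and function name are assumptions, as the authors do not specify how the split was drawn.

```python
import random

def split_dataset(image_paths, train_ratio=0.7, seed=0):
    """Shuffle the 800 selected images and split them 7:3 (560 / 240)."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    n_train = int(len(paths) * train_ratio)
    return paths[:n_train], paths[n_train:]
```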
Since the road segmentation and vehicle detection experiments have different requirements for the dataset, two different types of labeling were required. For the road segmentation experiment, both the training of the road segmentation model and the comparison of the test results require road-labeled images, so labeling is a necessary step. A semantic segmentation annotation tool was used to annotate the road regions in the images and set the corresponding road labels, and these images were then converted to the VOC format for the training and testing of the road segmentation experiments. For the vehicle detection experiments, the vehicles and their categories were labeled in the images using a target detection annotation tool. Since the main vehicle categories in the 800 images were cars and trucks, the annotations were divided into these two target categories and finally converted to the YOLO format.

4.2. Experimental Setting

The experiments were run on a Linux server with Ubuntu 18.04 (developed by Canonical, London, UK), a 12-vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz, and an RTX 3080 GPU with 10 GB of memory, using the PyTorch deep learning framework (PyTorch version 1.7.0, CUDA version 11.0):
  • In the road segmentation experiment, we trained and tested the C-DeepLabV3+ algorithm directly on the dataset. During training, freeze training was used to speed up training, with the learning rate set to $5 \times 10^{-4}$ for the freeze phase and $5 \times 10^{-5}$ for the unfreeze phase (a minimal sketch of this schedule is given after this list). In addition, the batch size was set to four, and 200 epochs were iterated.
  • In the on-road vehicle detection experiment, S-YOLOv5 was also trained and tested on the dataset. During training, the initial learning rate was set to $1 \times 10^{-3}$, the batch size was set to four, and 100 epochs were iterated.
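The freeze/unfreeze schedule mentioned above can be sketched as follows; the attribute name `model.backbone` and the choice of the Adam optimizer are assumptions, since the paper does not specify them.

```python
import torch

def configure_stage(model, freeze=True):
    """Freeze the backbone at lr 5e-4, or unfreeze it and continue at lr 5e-5."""
    for p in model.backbone.parameters():
        p.requires_grad = not freeze
    lr = 5e-4 if freeze else 5e-5
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)

# Example usage (200 epochs in total, batch size 4, as described above):
# optimizer = configure_stage(model, freeze=True)   # freeze phase
# ... train the freeze-phase epochs ...
# optimizer = configure_stage(model, freeze=False)  # unfreeze phase
```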

4.3. C-DeepLabV3+ Experimental Analysis

The semantic segmentation evaluation metrics used in the comparative experimental analysis of road segmentation are detailed below.
PA (Pixel Accuracy): the ratio of the number of pixels predicted correctly for a category to the total number of pixels. It reflects the accuracy of the segmentation result for a particular category: the higher the metric, the better the algorithm. The formula is given in Equation (13), where k denotes the number of label classes, $p_{ii}$ denotes the number of pixels correctly classified as class i, and $p_{ij}$ denotes the number of pixels belonging to class i but incorrectly predicted as class j.
$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \quad (13)$$
mPA (mean Pixel Accuracy): the PA of each category is first calculated as the ratio of the number of correctly classified pixels of that category to the number of all pixels labeled with that category, and the PA values are then averaged over all categories, as shown in Equation (14). It better reflects the model's ability to recognize the various categories. Here, k + 1 is the number of categories, including the background.
$$\mathrm{mPA} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}} \quad (14)$$
mIoU (mean Intersection over Union): the ratio of the intersection to the union of the prediction results and the ground-truth labels, i.e., the IoU, is first calculated for each category, and the IoU values of all categories are then averaged, as shown in Equation (15). Its value ranges over [0, 1]; it is an important indicator of the accuracy of object segmentation on the dataset, and a larger mIoU indicates a more accurate segmentation result. $p_{ji}$ denotes the number of pixels belonging to class j but incorrectly predicted as class i.
$$\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (15)$$
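A compact sketch of Eqs. (13)–(15) computed from a class confusion matrix is given below; the function name and the confusion-matrix convention are assumptions for illustration.

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, mPA and mIoU (Eqs. (13)-(15)) from a (k+1) x (k+1) confusion
    matrix `conf`, where conf[i, j] counts pixels of class i predicted as
    class j and the background is included as one of the classes."""
    conf = conf.astype(float)
    tp = np.diag(conf)
    pa = tp.sum() / conf.sum()
    mpa = (tp / conf.sum(axis=1)).mean()             # mean of p_ii / sum_j p_ij
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    miou = (tp / union).mean()
    return pa, mpa, miou
```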
The C-DeepLabV3+ algorithm was trained on the above platform using the UAV aerial photography dataset, and the model was evaluated using the PA, mPA, and mIoU. The results of the road segmentation algorithm before and after improvement are shown in Table 1. The experimental results show that the proposed C-DeepLabV3+ algorithm improved the PA from 98.01% to 98.42%, an improvement of 0.41%, with improvements of 0.2% in the mPA and 0.33% in the mIoU. This indicates that the segmentation accuracy of the improved algorithm increased to some extent and that it performed better on the test set.
The comparison of road segmentation results of UAV aerial images in some typical scenes is shown in Figure 8, Figure 9 and Figure 10. Among them, Figure 8 shows a comparison of the effectiveness of the road boundary segmentation. Figure 9 shows a comparison of the accuracy of the segmentation of road regions and nonroad regions, and Figure 10 shows a comparison of the effectiveness of the segmentation of shaded regions of the road.
As can be seen from Figure 8, the DeepLabV3+ algorithm produces a fuzzier segmentation of the road boundary, and the boundary segmentation is uneven. Our proposed C-DeepLabV3+ algorithm produces a smoother and clearer segmentation profile at the road boundary, and less road information is lost there. Therefore, the C-DeepLabV3+ algorithm performs better at the road boundary.
It can be seen from Figure 9 that the DeepLabV3+ algorithm had poor recognition results, with deviations in the segmentation of road and nonroad regions and low road recognition accuracy. In contrast, the C-DeepLabV3+ algorithm was able to better discriminate road features, and its recognition of road and nonroad regions was significantly more accurate. Meanwhile, the segmentation fineness and continuity were improved, and better road segmentation results were obtained. Therefore, the C-DeepLabV3+ algorithm has higher road segmentation accuracy and better performance.
As can be seen in Figure 10, the other algorithms incorrectly recognized the shadow regions formed on the road by tree occlusion as nonroad regions. However, the C-DeepLabV3+ algorithm achieved more accurate detection, which solves the problem of occlusion and shadow effects in road segmentation. Also, the C-DeepLabV3+ algorithm performs best on special road types such as intersections. Therefore, the C-DeepLabV3+ algorithm can effectively solve the problem of recognizing shadows on road regions.
In summary, the road segmentation algorithm C-DeepLabV3+ proposed in this paper has been effectively improved in the face of poor road edge segmentation, the inaccurate identification of road and nonroad regions, and the wrong identification of shadow regions on the road. It is more suitable for the extraction of UAV aerial images and has achieved better results and performance.

4.4. S-YOLOv5 Experimental Analysis

The evaluation indicators of the comparison test for vehicle detection are as follows.
Precision is the ratio of the number of correctly predicted positive instances to the number of all instances predicted to be positive: the higher the indicator, the higher the accuracy of the algorithm. The formula is given in Equation (16), where TP denotes the number of true positives, i.e., instances where the predicted result is positive and the true result is also positive, and FP denotes the number of false positives, i.e., instances where the predicted result is positive but the true result is negative.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (16)$$
Recall is the ratio of the number of correctly predicted positive instances to the total number of actual positive instances: the higher the indicator, the better the algorithm, as shown in Equation (17), where FN denotes the number of false negatives, i.e., instances where the predicted result is negative but the true result is positive.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (17)$$
The Average Precision (AP) considers both Precision and Recall, with Recall as the horizontal axis and Precision as the vertical axis to form the PR curve. The area covered by the curve is the AP, which measures the quality of the model in a single category. The formula is as follows:
$$\mathrm{AP} = \int_0^1 P(R)\,\mathrm{d}R \quad (18)$$
The mean Average Precision (mAP) is the average of the Average Precision over all categories and can be used to compare the performance of different algorithms side by side. Generally, the IoU threshold is set to 0.5, that is, a prediction box with an IoU greater than 0.5 is counted as valid, and the metric is denoted mAP@0.5, as shown in Equation (19).
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \quad (19)$$
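A short sketch of Eqs. (18) and (19) is shown below. It uses all-point interpolation of the precision–recall curve, a common convention that may differ from the exact procedure the authors used to compute mAP@0.5.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (Eq. (18)) via all-point
    interpolation; inputs are arrays sorted by increasing recall."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing before integrating.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP over the N classes (Eq. (19)), e.g. cars and trucks here."""
    return float(np.mean(ap_per_class))
```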
From Table 2, it can be seen that our S-YOLOv5 algorithm improved in all evaluation metrics, by 1.0% in Precision, 1.15% in Recall, and 0.45% in mAP. The experimental results show that the proposed S-YOLOv5 algorithm has better performance.
A comparison of the vehicle detection results on UAV aerial images in several different scenarios is illustrated in Figure 11, Figure 12 and Figure 13.
The comparative results of the detection of small target vehicles under UAV aerial images are shown in Figure 11.
As can be seen from Figure 11, our method recognized more small cars than the others, which shows that the S-YOLOv5 algorithm is able to recognize more vehicles and is more effective for the detection of small targets in UAV aerial images. The S-YOLOv5 algorithm reduced the missed detection rate for small targets and can relatively well solve the problem of detecting small targets in UAV aerial images.
Figure 12 shows the vehicle detection results of different methods for congestion scenarios in UAV aerial images.
As can be seen in Figure 12, the SSD and YOLOv5 algorithms missed vehicle targets, while the S-YOLOv5 algorithm was able to correctly identify vehicle targets and nonvehicle targets. In addition, the Faster R-CNN and YOLOv5 algorithms incorrectly recognized road signs as vehicle targets, while the S-YOLOv5 algorithm achieved the correct categorization of detected vehicle targets. As a result, S-YOLOv5 can effectively reduce the rate of misdetection and improve the accuracy of vehicle detection.
The comparative results of vehicle detection under complex background conditions in the UAV aerial images are shown in Figure 13.
As can be seen in Figure 13, the SSD and YOLOv5 algorithms failed to detect some cars with distinctive features, and vehicle types were not accurately distinguished. Faster R-CNN performed even worse, not only failing to detect valid vehicle targets but also misclassifying targets that are clearly not cars. In contrast, the S-YOLOv5 algorithm was able to detect these car targets correctly. In addition, SSD and YOLOv5 did not detect the truck, whereas S-YOLOv5 detected it successfully, which indicates that S-YOLOv5 has better detection performance for both cars and trucks and effectively reduces the missed detection rate.
In summary, our proposed S-YOLOv5 algorithm significantly improved recognition accuracy compared to the other algorithms and detects small targets better. At the same time, the missed detection and false detection rates were also significantly reduced, making it more suitable for vehicle detection from the UAV perspective.

4.5. Segmentation–Detection Framework Analysis

Based on the segmentation–detection framework, the road segmentation algorithm and the vehicle detection algorithm were evaluated jointly, and the pipeline of the two tasks was realized. The final results of road segmentation and vehicle detection under different viewpoints and scenes are shown in Figure 14 and Figure 15.
As can be seen from Figure 14, the C-DeepLabV3+ algorithm was able to extract the road regions accurately, and the segmentation of the road boundaries is clear in the ordinary scene. S-YOLOv5 detected the vehicle targets on the road with high accuracy.
As can be seen from Figure 15, C-DeepLabV3+ could accurately recognize road regions and nonroad regions in dense scenes, and the segmentation of the road boundaries is clear. The algorithm effectively solves the occlusion problem, which provides a good foundation for subsequent traffic target detection on the road. For dense vehicle targets and many small targets, S-YOLOv5 achieved high detection accuracy, correctly detecting all the vehicle targets on the road and finally obtaining accurate experimental results.
In summary, our proposed road segmentation algorithm C-DeepLabV3+ can accurately extract road regions under different viewpoints, and our proposed vehicle detection algorithm S-YOLOv5 can accurately detect vehicle targets under various scenarios. The experimental results are satisfactory, and the serial pipeline of road segmentation and vehicle detection on the road can be realized with high quality.

5. Conclusions

In this paper, we proposed a novel and efficient network framework for simultaneous road segmentation and vehicle detection in UAV aerial traffic images, which can facilitate the construction of smart transportation. The proposed C-DeepLabV3+ road segmentation algorithm introduces the CA module to enhance the focus on road information and adds a cascade feature fusion module to prevent the loss of road detail information. The proposed S-YOLOv5 vehicle detection algorithm introduces the parameter-free attention module SimAM to enhance the detection of small vehicle targets and improve detection accuracy. Our proposed network framework combines the two improved algorithms to achieve higher road segmentation accuracy and more accurate vehicle detection, and it demonstrates superior performance and results on the constructed ViDroneVehicle dataset. In the future, the framework can be made lightweight for deployment on edge devices for real-time segmentation and detection.

Author Contributions

Conceptualization, Y.S. and M.X.; methodology, C.Y. and W.M.; software, C.Y.; investigation, M.X. and Y.S.; formal analysis, C.Y. and W.M.; writing—original draft preparation, Y.S. and C.Y.; writing—review and editing, C.Y. and M.X.; supervision, M.X. and Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shandong Province (ZR2022QF037).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets are available on Github at https://github.com/VisDrone/DroneVehicle; accessed on 20 August 2022.

Acknowledgments

We would like to thank the providers of the datasets and the reference authors, as well as the manuscript reviewers.

Conflicts of Interest

Authors Min Xiao and Wei Min were employed by the company Project Construction Management Company of Jiangxi Transportation Investment Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Goncalves, J.A.; Henriques, R. UAV photogrammetry for topographic monitoring of coastal areas. ISPRS J. Photogramm. Remote Sens. 2015, 104, 101–111. [Google Scholar] [CrossRef]
  2. Siam, M.; ElHelw, M. Robust autonomous visual detection and tracking of moving targets in UAV imagery. In Proceedings of the 2012 IEEE 11th International Conference on Signal Processing, Beijing, China, 21–25 October 2012; Volume 2, pp. 1060–1066. [Google Scholar]
  3. Wijesingha, J.S.J.; Kumara, R.; Kajanthan, P.; Koswatte, R.; Bandara, K. Automatic road feature extraction from high resolution satellite images using LVQ neural networks. Asian J. Geoinform. 2013, 13, 30–36. [Google Scholar]
  4. Mnih, V.; Hinton, G.E. Learning to detect roads in high-resolution aerial images. In Proceedings of the Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Proceedings, Part VI 11. Springer: Berlin, Germany, 2010; pp. 210–223. [Google Scholar]
  5. Yar, H.; Khan, Z.A.; Hussain, T.; Baik, S.W. A modified vision transformer architecture with scratch learning capabilities for effective fire detection. Expert Syst. Appl. 2024, 252, 123935. [Google Scholar] [CrossRef]
  6. Parez, S.; Dilshad, N.; Alanazi, T.M.; Lee, J.W. Towards Sustainable Agricultural Systems: A Lightweight Deep Learning Model for Plant Disease Detection. Comput. Syst. Sci. Eng. 2023, 47, 515–536. [Google Scholar] [CrossRef]
  7. Yang, Y.; Liu, F.; Wang, P.; Luo, P.; Liu, X. Vehicle detection methods from an unmanned aerial vehicle platform. In Proceedings of the 2012 IEEE International Conference on Vehicular Electronics and Safety (ICVES 2012), Istanbul, Turkey, 24–27 July 2012; pp. 411–415. [Google Scholar]
  8. Reilly, V.; Solmaz, B.; Shah, M. Shadow casting out of plane (SCOOP) candidates for human and vehicle detection in aerial imagery. Int. J. Comput. Vis. 2013, 101, 350–366. [Google Scholar] [CrossRef]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28; Neural Information Processing Systems Foundation, Inc.: South Lake Tahoe, NV, USA, 2015. [Google Scholar]
  10. Zhang, J.; Lin, X.; Liu, Z.; Shen, J. Semi-automatic road tracking by template matching and distance transformation in urban areas. Int. J. Remote Sens. 2011, 32, 8331–8347. [Google Scholar] [CrossRef]
  11. Coulibaly, I.; Spiric, N.; Sghaier, M.O.; Manzo-Vargas, W.; Lepage, R.; St-Jacques, M. Road extraction from high resolution remote sensing image using multiresolution in case of major disaster. In Proceedings of the 2014 IEEE Geoscience and Remote Sensing Symposium, Quebec City, QC, Canada, 13–18 July 2014; pp. 2712–2715. [Google Scholar]
  12. Gaetano, R.; Zerubia, J.; Scarpa, G.; Poggi, G. Morphological road segmentation in urban areas from high resolution satellite images. In Proceedings of the 2011 IEEE 17th International Conference on Digital Signal Processing (DSP), Corfu, Greece, 6–8 July 2011; pp. 1–8. [Google Scholar]
  13. Unsalan, C.; Sirmacek, B. Road network detection using probabilistic and graph theoretical methods. IEEE Trans. Geosci. Remote Sens. 2012, 50, 4441–4453. [Google Scholar] [CrossRef]
  14. Anil, P.; Natarajan, S. A novel approach using active contour model for semi-automatic road extraction from high resolution satellite imagery. In Proceedings of the 2010 IEEE Second International Conference on Machine Learning and Computing, Bangalore, India, 12–13 February 2010; pp. 263–266. [Google Scholar]
  15. Chaudhuri, D.; Kushwaha, N.K.; Samal, A. Semi-automated road detection from high resolution satellite images by directional morphological enhancement and segmentation techniques. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 1538–1544. [Google Scholar] [CrossRef]
  16. Shen, Z.; Luo, J.; Gao, L. Road extraction from high-resolution remotely sensed panchromatic image in different research scales. In Proceedings of the 2010 IEEE International Geoscience and Remote Sensing Symposium, Honolulu, HI, USA, 25–30 July 2010; pp. 453–456. [Google Scholar]
  17. Huang, X.; Zhang, L. Road centreline extraction from high-resolution imagery based on multiscale structural features and support vector machines. Int. J. Remote Sens. 2009, 30, 1977–1987. [Google Scholar] [CrossRef]
  18. Yi, W.; Chen, Y.; Tang, H.; Deng, L. Experimental research on urban road extraction from high-resolution RS images using probabilistic topic models. In Proceedings of the 2010 IEEE International Geoscience and Remote Sensing Symposium, Honolulu, HI, USA, 25–30 July 2010; pp. 445–448. [Google Scholar]
  19. Chen, H.; Yin, L.; Ma, L. Research on road information extraction from high resolution imagery based on global precedence. In Proceedings of the 2014 IEEE Third International Workshop on Earth Observation and Remote Sensing Applications (EORSA), Changsha, China, 11–14 June 2014; pp. 151–155. [Google Scholar]
  20. Poullis, C. Tensor-Cuts: A simultaneous multi-type feature extractor and classifier and its application to road extraction from satellite images. ISPRS J. Photogramm. Remote Sens. 2014, 95, 93–108. [Google Scholar] [CrossRef]
  21. Sghaier, M.O.; Lepage, R. Road extraction from very high resolution remote sensing optical images based on texture analysis and beamlet transform. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 9, 1946–1958. [Google Scholar] [CrossRef]
  22. Sun, Z.; Fang, H.; Deng, M.; Chen, A.; Yue, P.; Di, L. Regular shape similarity index: A novel index for accurate extraction of regular objects from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 3737–3748. [Google Scholar] [CrossRef]
  23. Hui, Z.; Hu, Y.; Jin, S.; Yevenyo, Y.Z. Road centerline extraction from airborne LiDAR point cloud based on hierarchical fusion and optimization. ISPRS J. Photogramm. Remote Sens. 2016, 118, 22–36. [Google Scholar] [CrossRef]
  24. Liu, J.; Qin, Q.; Li, J.; Li, Y. Rural road extraction from high-resolution remote sensing images based on geometric feature inference. ISPRS Int. J. Geo-Inf. 2017, 6, 314. [Google Scholar] [CrossRef]
  25. Liu, Z.; Yang, D.; Wang, Y.; Lu, M.; Li, R. EGNN: Graph structure learning based on evolutionary computation helps more in graph neural networks. Appl. Soft Comput. 2023, 135, 110040. [Google Scholar] [CrossRef]
  26. Bavu, E.; Ramamonjy, A.; Pujol, H.; Garcia, A. TimeScaleNet: A multiresolution approach for raw audio recognition using learnable biquadratic IIR filters and residual networks of depthwise-separable one-dimensional atrous convolutions. IEEE J. Sel. Top. Signal Process. 2019, 13, 220–235. [Google Scholar] [CrossRef]
  27. Tao, J.; Chen, Z.; Sun, Z.; Guo, H.; Leng, B.; Yu, Z.; Wang, Y.; He, Z.; Lei, X.; Yang, J. Seg-Road: A Segmentation Network for Road Extraction Based on Transformer and CNN with Connectivity Structures. Remote Sens. 2023, 15, 1602. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  29. Xu, Y.; Xie, Z.; Feng, Y.; Chen, Z. Road extraction from high-resolution remote sensing imagery using deep learning. Remote Sens. 2018, 10, 1461. [Google Scholar] [CrossRef]
  30. Buslaev, A.; Seferbekov, S.; Iglovikov, V.; Shvets, A. Fully convolutional network for automatic road extraction from satellite imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 207–210. [Google Scholar]
  31. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 182–186. [Google Scholar]
  32. Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE visual communications and image processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar]
  33. Li, Y.; Peng, B.; He, L.; Fan, K.; Tong, L. Road segmentation of unmanned aerial vehicle remote sensing images using adversarial network with multiscale context aggregation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2279–2287. [Google Scholar] [CrossRef]
  34. Zhang, X.; Han, X.; Li, C.; Tang, X.; Zhou, H.; Jiao, L. Aerial image road extraction based on an improved generative adversarial network. Remote Sens. 2019, 11, 930. [Google Scholar] [CrossRef]
  35. Abdollahi, A.; Pradhan, B.; Sharma, G.; Maulud, K.N.A.; Alamri, A. Improving road semantic segmentation using generative adversarial network. IEEE Access 2021, 9, 64381–64392. [Google Scholar] [CrossRef]
  36. Chen, H.; Peng, S.; Du, C.; Li, J.; Wu, S. SW-GAN: Road extraction from remote sensing imagery using semi-weakly supervised adversarial learning. Remote Sens. 2022, 14, 4145. [Google Scholar] [CrossRef]
  37. Saito, S.; Aoki, Y. Building and road detection from large aerial imagery. In Proceedings of the Image Processing: Machine Vision Applications VIII, San Francisco, CA, USA, 10–11 February 2015; SPIE: Bellingham, WA, USA, 2015; Volume 9405, pp. 153–164. [Google Scholar]
  38. Alshehhi, R.; Marpu, P.R.; Woon, W.L.; Dalla Mura, M. Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2017, 130, 139–149. [Google Scholar] [CrossRef]
  39. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Road segmentation of remotely-sensed images using deep convolutional neural networks with landscape metrics and conditional random fields. Remote Sens. 2017, 9, 680. [Google Scholar] [CrossRef]
  40. Im, H.; Yang, H. Improvement of CNN-Based Road Extraction from Satellite Images via Morphological Image Processing. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2559–2562. [Google Scholar]
  41. Zhong, Z.; Li, J.; Cui, W.; Jiang, H. Fully convolutional networks for building and road extraction: Preliminary results. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 1591–1594. [Google Scholar]
  42. Fu, G.; Liu, C.; Zhou, R.; Sun, T.; Zhang, Q. Classification for high resolution remote sensing imagery using a fully convolutional network. Remote Sens. 2017, 9, 498. [Google Scholar] [CrossRef]
  43. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin, Germany, 2015; pp. 234–241. [Google Scholar]
  44. Chen, Z.; Wang, C.; Li, J.; Fan, W.; Du, J.; Zhong, B. Adaboost-like End-to-End multiple lightweight U-nets for road extraction from optical remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2021, 100, 102341. [Google Scholar] [CrossRef]
  45. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
  46. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  47. Belongie, S.; Malik, J.; Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 509–522. [Google Scholar] [CrossRef]
  48. Felzenszwalb, P.; McAllester, D.; Ramanan, D. A discriminatively trained, multiscale, deformable part model. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 24–26 June 2008; pp. 1–8. [Google Scholar]
49. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  50. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  51. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
52. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
53. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  54. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
55. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin, Germany, 2016; pp. 21–37. [Google Scholar]
56. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
57. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
58. Zhang, P.; Zhong, Y.; Li, X. SlimYOLOv3: Narrower, faster and better for real-time UAV applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  59. Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An Edge-Real-Time Object Detector. arXiv 2023, arXiv:2302.07483. [Google Scholar]
60. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
61. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  62. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  63. Huang, C.; Yang, J.; Xie, H. Remote sensing image segmentation algorithm based on improved DeeplabV3+. Electron. Meas. Technol. 2022, 45, 148–155. [Google Scholar]
64. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
  65. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  66. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
Figure 1. Road segmentation–vehicle detection framework implementation diagram.
Figure 2. Architecture of C-DeepLabV3+.
Figure 3. Coordinate attention module schematic [62].
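For readers who want a concrete picture of the coordinate attention operation shown in Figure 3, a minimal PyTorch-style sketch in the spirit of [62] is given below; the reduction ratio and layer names are illustrative assumptions, not the exact configuration used in C-DeepLabV3+.

# Minimal sketch of a coordinate attention block in the spirit of Hou et al. [62].
# The reduction ratio and module names are assumptions for illustration only.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.size()
        x_h = self.pool_h(x)                      # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)  # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # attention along width
        return x * a_h * a_w  # direction-aware reweighting of the input feature map

Because the two attention maps are factorized along height and width, the block retains positional information that plain channel attention discards, which is what helps keep segmented road edges continuous.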
Figure 4. Cascade feature fusion structure diagram [63].
Figure 5. Neck network structure diagram of S-YOLOv5.
Figure 6. SimAM module schematic [64].
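Since SimAM is parameter-free, the module in Figure 6 reduces to a few tensor operations. The sketch below follows the energy formulation of Yang et al. [64]; the lambda value is the default suggested in that paper, and its exact placement in the S-YOLOv5 neck is omitted here.

# Minimal sketch of the parameter-free SimAM attention of Yang et al. [64].
# e_lambda is the regularizer from the paper; its value here is an assumption.
import torch
import torch.nn as nn

class SimAM(nn.Module):
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        b, c, h, w = x.size()
        n = h * w - 1
        # squared deviation of each activation from its channel mean
        d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)
        # channel-wise variance estimate
        v = d.sum(dim=[2, 3], keepdim=True) / n
        # inverse energy: more distinctive neurons receive larger weights
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)

Adding no learnable parameters is what keeps the S-YOLOv5 modification lightweight.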
Figure 7. (a–d) Sample images from the DroneVehicle dataset.
Figure 8. Comparison of road boundary segmentation results on different methods. (a) Original image; (b) PSPNet; (c) U-Net; (d) DeepLabV3+; (e) C-DeepLabV3+.
Figure 9. Comparison of segmentation accuracy results for road and nonroad regions using different methods. (a,f,k) Original images; (b,g,l) PSPNet; (c,h,m) U-Net; (d,i,n) DeepLabV3+; (e,j,o) C-DeepLabV3+.
Figure 10. Comparison of segmentation results for road shadow regions on different methods. (a) Original image; (b) PSPNet; (c) U-Net; (d) DeepLabV3+; (e) C-DeepLabV3+.
Figure 11. Comparison results of detection of small target vehicles using different methods. (a) Original image; (b) Faster R-CNN; (c) SSD; (d) YOLOv5; (e) S-YOLOv5.
Figure 12. Comparison of vehicle detection results for congestion scenarios using different methods. (a,f) Original image; (b,g) Faster R-CNN; (c,h) SSD; (d,i) YOLOv5; (e,j) S-YOLOv5.
Figure 13. Comparison of vehicle detection results under complex background conditions. (a,f) Original image; (b,g) Faster R-CNN; (c,h) SSD; (d,i) YOLOv5; (e,j) S-YOLOv5.
Figure 14. Segmentation–detection results in common scenarios of UAV aerial photography in the same framework. (a,c) Original images. (b,d) Result images.
Figure 15. Segmentation–detection results in intense scenarios of UAV aerial photography in the same framework. (a,c) Original images. (b,d) Result images.
Table 1. Comparison of road segmentation metrics on different methods.
Methods         PA        mPA       mIoU
PSPNet [66]     96.20%    95.61%    90.98%
U-Net [43]      96.35%    98.54%    97.09%
DeepLabV3+      98.01%    98.55%    97.20%
C-DeepLabV3+    98.42%    98.75%    97.53%
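The metrics in Table 1 are standard segmentation measures computed from a class confusion matrix: PA is the overall pixel accuracy, mPA averages the per-class accuracies, and mIoU averages the per-class intersection over union. The NumPy sketch below (with a purely illustrative two-class confusion matrix) shows the arithmetic.

# Illustrative computation of PA, mPA, and mIoU from a K x K confusion matrix,
# where cm[i, j] counts pixels of true class i predicted as class j.
import numpy as np

def segmentation_metrics(cm):
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    pa = tp.sum() / cm.sum()                                      # overall pixel accuracy
    mpa = (tp / cm.sum(axis=1)).mean()                            # mean per-class accuracy
    miou = (tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)).mean()   # mean intersection over union
    return pa, mpa, miou

# Toy example (non-road, road); the counts are made up for illustration.
cm = np.array([[9500, 120],
               [  80, 300]])
print(segmentation_metrics(cm))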
Table 2. Comparison of vehicle detection metrics using different methods.
Methods             Precision (Car/Truck/All)      Recall (Car/Truck/All)         AP (Car/Truck)      mAP@0.5
Faster R-CNN [9]    53.48% / 61.89% / 57.79%       93.14% / 89.80% / 91.47%       75.70% / 86.44%     81.07%
SSD [55]            92.79% / 89.51% / 91.15%       90.15% / 56.86% / 73.51%       95.82% / 80.61%     88.21%
YOLOv5              96.80% / 88.30% / 92.50%       97.30% / 95.20% / 96.30%       97.80% / 96.10%     96.95%
S-YOLOv5            97.10% / 90.00% / 93.50%       97.70% / 97.20% / 97.45%       98.10% / 96.70%     97.40%
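The mAP@0.5 column in Table 2 is the arithmetic mean of the two per-class APs at an IoU threshold of 0.5; for example, the S-YOLOv5 row gives (98.10% + 96.70%) / 2 = 97.40%. The hypothetical helper below illustrates that averaging together with the IoU test used to match detections to ground-truth boxes (box format and function names are assumptions; the precision-recall integration behind AP itself is omitted).

# Hypothetical sketch of the averaging behind the mAP@0.5 column in Table 2.
# Boxes are assumed to be (x1, y1, x2, y2) in pixels.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0  # a detection matches when iou >= 0.5

def mean_ap(per_class_ap):
    # mAP@0.5 = arithmetic mean of the per-class APs at IoU >= 0.5
    return sum(per_class_ap.values()) / len(per_class_ap)

# Reproducing the S-YOLOv5 row of Table 2:
print(round(mean_ap({"car": 98.10, "truck": 96.70}), 2))  # -> 97.4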
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
