1. Introduction
Due to the rapid advancement of unmanned aerial vehicle (UAV) technology, researchers across scientific and industrial disciplines have paid a great deal of attention to UAVs. Capturing and analyzing aerial images are important ways for UAVs to perceive their surroundings. This is significant in multiple fields, such as forest fire detection [1], vehicle detection [2,3], road construction [4], land cover [5,6,7], the oil and gas inspection industry [8,9], and traffic data analyses [10,11].
In UAV flight missions, recognizing emergency landing zones is crucial for ensuring UAV safety, especially when facing unexpected events such as GPS signal loss. To improve the accuracy of landing zone identification, researchers have recently adopted semantic segmentation methods. Semantic segmentation differs from traditional methods of UAV emergency landing zone identification: it assigns a meaningful label to each pixel in the image, enabling precise identification of landing zones. This approach is particularly effective in densely populated urban scenes, where it can identify protected targets, such as people and vehicles, and reduce false alarms. Moreover, this method extracts semantic information about the different objects in the landing zone, facilitating a better understanding of the structure and environment of emergency landing zones. Consequently, semantic segmentation has become an important tool for accurately identifying landing zones.
The rapid evolution of deep learning has significantly improved the effectiveness of semantic segmentation. With the growth of numerous industrial domains, including autonomous driving [12,13], medical image diagnosis [14], and remote sensing [15], researchers no longer focus solely on improving accuracy; they also pay more attention to the speed of segmentation. An efficient semantic segmentation model must have high segmentation accuracy, real-time capability, and a lightweight network framework in order to maintain the balance between accuracy and speed after the model has been deployed to an embedded device.
To meet these requirements, many researchers have started to study real-time semantic segmentation methods. Recently, some CNN-based methods have achieved low latency and high efficiency. For example, BiSeNet [16] proposes a dual-branch network topology that includes a small spatial pathway (SP) for preserving spatial information and generating high-resolution features, as well as a context pathway (CP) with a fast downsampling strategy for obtaining an adequate receptive field. On top of these two pathways, a novel feature fusion module (FFM) is designed to effectively recombine the features. Building on the BiSeNet architecture, STDC-Seg [17] proposes a backbone network based on the short-term dense concatenate (STDC) module to extract deep features with scalable receptive fields and multi-scale information. A Detail Guidance Module was also designed to encode spatial information in low-level features without an extra time-consuming branch, increasing the model's inference speed.
Although the aforementioned models have demonstrated promising results in some circumstances, they may not perform well on real-world UAV aerial images because of their high resolution, complex scenarios, and large number of small objects [18]. Moreover, the limited computational resources of the onboard devices carried by UAVs make it challenging to achieve real-time, high-quality semantic segmentation. Therefore, improving segmentation quality while achieving real-time performance has become a pressing issue in UAV semantic segmentation tasks.
In this paper, to identify suitable landing zones for UAVs, we mainly focus on improving the segmentation quality of various target objects in UAV remote sensing scenes while ensuring real-time segmentation. We designed the STDC-CT network based on the STDC1 backbone structure of STDC-Seg [17] due to the demonstrated efficiency of the STDC backbone network. Inspired by the multi-resolution attention mechanism [19], we propose the small object attention extractor (SOAE) module to focus on useful features in Stage 3–Stage 5 of the backbone network, especially the features of small objects. Attention weights of different levels are learned through a small network, and features from different layers are fused to enrich the information of small objects during feature extraction. We retain the Detail Guidance Module of STDC-Seg, but small object features are likely to be filtered out as noise during model training. To solve the problem of insufficient noise resistance when extracting detail edges, we use the Laplacian of Gaussian (LoG) method [20,21,22] to replace the original Laplacian method. Additionally, we apply the parallel aggregation pyramid pooling module (PAPPM) [23] to increase the network's ability to gather multi-scale contextual information and to accelerate model inference. Finally, we design a detail and context feature fusion module (DCFFM) to fuse small object features, contextual information, and detail features: the SOAE branch guides the feature fusion of the context and detail branches, allowing the model to effectively capture the detail information and global contextual information of small objects, improving the accuracy of recognizing small objects and the overall performance of STDC-CT.
This section provides a detailed introduction to the significance and research background of UAV emergency landing zone recognition. In Section 2, we present an overview of the related work on UAV emergency landing zone recognition, including the development history of traditional recognition methods, traditional semantic segmentation methods, and deep learning-based semantic segmentation methods. Section 3 provides a comprehensive description of the construction process of our proposed dataset, UAV-City. In Section 4, we explain the details of our proposed semantic segmentation method, STDC-CT. The effectiveness of our method is evaluated in Section 5 through extensive experiments. The results show that our proposed method improves the performance of UAV emergency landing zone recognition compared with state-of-the-art works.
The following are the key contributions of this paper:
We create a new dataset for UAV aerial semantic segmentation and achieve recognition of emergency landing zones, protected targets, and buildings during high-altitude UAV flight missions.
A lightweight semantic segmentation network named STDC-CT is proposed for UAV emergency landing zone recognition. The proposed model consists of the STDC backbone, the SOAE module, the PAPPM, the DCFFM, and the Detail Guidance Module. The network achieves a balance between segmentation speed and accuracy and improves the segmentation accuracy on small objects.
Extensive experiments have been carried out to evaluate the efficiency of our method. The results of our experiments indicate that our model can reach cutting-edge performance on the UAV-City, Cityscapes, and UAVid datasets. In addition, we deploy the trained model onto a UAV equipped with a Jetson TX2 embedded device. It shows that the model works well for real-world UAV applications.
3. UAV-City Dataset
In our study, we collected aerial images captured by UAVs in Hangzhou, China, to construct our dataset. In the design of the entire process, we considered the practicality and effectiveness of research on the semantic segmentation of UAV aerial images. A total of 600 images with a resolution of were densely annotated for semantic segmentation tasks.
To enable the UAV to autonomously make timely landing decisions when receiving emergency landing instructions or encountering urgent situations, the UAV needs to recognize potential landing zones on the ground, as well as forced landing zones, protected targets, buildings, and other objects during high-altitude flights. Potential landing zones are defined as horizontal rooftops, horizontal grounds, and horizontal grasslands, while forced landing zones are defined as forests and rivers (lakes). Protected targets are defined as humans and vehicles. If no suitable potential landing zones are available nearby, forced landing zones can be used for landing. During the landing process, UAVs need to continuously recognize pedestrians and vehicles on the ground to ensure the safety of lives and properties to the greatest extent possible.
3.1. Image Acquisition Strategy
During the operation of UAVs, strict compliance with safety regulations for drone flights is ensured.
The maximum flight altitude of the drone is set at 140 m, with horizontal flight maintained at a stable altitude of around 120 m.
During image acquisition, the onboard camera is oriented vertically downward and captures continuous top-down images of the ground at intervals of 0.1 s.
Multiple flights are conducted to capture images from different flight paths, introducing variance into the dataset to mitigate the risk of overfitting during model training.
Data collection is conducted under favorable weather conditions with sufficient daylight.
3.2. Image Processing and Annotation
3.2.1. Image Filtering
As shown in Figure 1, the collected images consist of consecutive frames. To prevent overfitting and poor generalization during subsequent model training, multiple flights are conducted, and the images collected from each flight path are carefully selected. The selected images are annotated for semantic segmentation tasks, with a total of 600 images being annotated.
3.2.2. Image Annotation
Our dataset is specifically designed for semantic segmentation tasks, but fully annotating all objects in urban aerial images is highly challenging. To achieve the recognition of potential landing zones, forced landing zones, protected targets, buildings, and other objects on the ground, we annotate the dataset with 11 categories, namely: horizontal roof, horizontal ground, horizontal lawn, river, plant, tree, car, human, building, road, and obstacle. The definitions of each category are as follows:
Horizontal roof: flat rooftop areas of buildings.
Horizontal ground: flat ground areas other than roadways used for vehicular traffic.
Horizontal lawn: flat lawns.
River: identifiable water bodies, including rivers and lakes.
Plant: low vegetation, such as grass and shrubs.
Tree: tall trees with canopies and trunks.
Car: all vehicles on roads and parking lots, including cars, buses, trucks, tractors, etc.
Human: all visible pedestrians on the ground.
Building: residential buildings, garages, security booths, office buildings, and other structures under construction.
Road: roads and bridges where vehicles are legally allowed to travel.
Obstacle: steel frames, transmission line poles, and roads under construction.
The selected images are annotated with pixel-wise labeling by using the LabelMe annotation tool.
Figure 2 shows an example of the annotation results. The label image for each original image is generated from the exported JSON file, as shown in Figure 3.
3.3. Statistical Analysis
Our dataset contains 600 images of different scenes, which were annotated pixel by pixel using the LabelMe tool. Each image has a resolution of
. The dataset is divided into training, validation, and test sets according to an 8:1:1 ratio.
Figure 4 shows the pixel counts of each class (including the background) in the UAV-City dataset. It clearly shows the imbalanced distribution of pixel counts among the different classes. Most of the pixels come from the tree, road, and building classes. Classes such as car, horizontal roof, horizontal ground, horizontal lawn, river, obstacle, and plant contribute fewer pixels, accounting for less than 5% of the total pixels each. The human class occupies only 0.103% of the pixels, which is due to the relatively low number of pixels per instance in the UAV-City dataset. Among them, the car and human classes are small objects, which have fewer total pixels and smaller sizes. Therefore, segmenting small objects poses a great challenge in semantic segmentation tasks.
4. Proposed Method
STDC-Seg [17] proposes a new STDC backbone network based on the BiSeNet [16] model, as shown in Figure 5: Figure 5a,b illustrate the layout of the STDC module, and Figure 5c presents the overall STDC network architecture. Additionally, a new Detail Guidance Module is designed to replace the original Spatial Path branch of BiSeNet, which retains low-level detail features while reducing network computational complexity, resulting in excellent real-time performance for semantic segmentation. However, this network still has some shortcomings in small object segmentation. Specifically, because the FFM gradually reduces the dimensionality of the feature maps, it may lose detail information that is useful in small object segmentation tasks. Furthermore, the Detail Guidance Module in the STDC-Seg network mainly focuses on low-level feature layers, so detail information in high-level feature layers may be ignored, affecting the accuracy of small object segmentation.
Therefore, in this section, we propose the STDC-CT network, as shown in Figure 6, which aims to improve small object segmentation accuracy while maintaining excellent segmentation speed.
4.1. Small Object Attention Extractor for STDC
We designed a small object attention extractor (SOAE) module for the STDC backbone network to extract more valuable features for small object recognition. We utilized convolutional layers at different resolutions (from Stage 3 to Stage 5) to extract diverse features from the input data. Each convolutional layer operates at a different resolution to capture features at different scales. Then, we used an attention mechanism to automatically select the most informative features and combine them for a more comprehensive representation.
In particular, we take the feature maps $F_3$, $F_4$, and $F_5$ generated by Stages 3 to 5 as the templates, as illustrated in Figure 7. The template layer $F_i$ is first passed through a $1 \times 1$ convolutional layer that reduces its dimension to 1. The maximum value of the feature map is then obtained through global max pooling, i.e., the most noteworthy feature pixels are captured; they are mapped to vectors for cosine similarity computation through a fully connected layer:

$$S_{ij} = \frac{\sum_{k=1}^{n} t_{k} f_{k}}{\sqrt{\sum_{k=1}^{n} t_{k}^{2}} \, \sqrt{\sum_{k=1}^{n} f_{k}^{2}}}$$

where $S_{ij}$ denotes the cosine similarity between the template layer $F_i$ and $F_j$, and $t_k$ and $f_k$ denote the components of the vectors corresponding to $F_i$ and $F_j$, respectively. For example, when $F_3$ serves as the template, the cosine similarity value $S_{34}$ between $F_3$ and $F_4$ and the value $S_{35}$ between $F_3$ and $F_5$ indicate the attention levels of $F_4$ and $F_5$ toward $F_3$, respectively, while $S_{33} = 1$. Similarly, when $F_4$ serves as the template, the cosine similarity value $S_{43}$ between $F_4$ and $F_3$ and the value $S_{45}$ between $F_4$ and $F_5$ indicate the attention levels of $F_3$ and $F_5$ toward $F_4$, respectively, while $S_{44} = 1$. To evaluate the attention weights $w_{ij}$, we input $S_{ij}$ into the softmax layer to obtain normalized weight values:

$$w_{ij} = \frac{\exp(S_{ij})}{\sum_{m=3}^{5} \exp(S_{im})}$$

After obtaining the attention weights, the sizes of the feature maps $F_4$ and $F_5$ must be made consistent with that of $F_3$; thus, we upsample $F_4$ and $F_5$ by 2× and 4×, respectively, while the numbers of channels of $F_3$, $F_4$, and $F_5$ are 256, 512, and 1024, respectively. Then we add the weighted feature maps of the three levels element by element and adjust the number of channels with a 1 × 1 convolution to obtain the attention feature:

$$A_{i} = \mathrm{Conv}_{1 \times 1}\left(\sum_{j=3}^{5} w_{ij} \cdot F_{j}\right)$$

where $A_i$ denotes the attention feature obtained through the network with $F_i$ as the template layer.
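To make the computation above concrete, the following is a minimal PyTorch sketch of an SOAE-style module. It is an illustration under stated assumptions rather than the authors' released implementation: the class and parameter names, the embedding dimension, and the output channel width are our own choices, while the stage channel counts (256/512/1024), the 1 × 1 reduction, global max pooling, cosine-similarity softmax weighting, and 2×/4× upsampling follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SOAE(nn.Module):
    """Sketch of a small object attention extractor over Stage 3-5 features."""

    def __init__(self, channels=(256, 512, 1024), embed_dim=128, out_channels=256):
        super().__init__()
        # 1x1 convs that reduce each stage to a single channel before pooling
        self.reduce = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1) for c in channels])
        # fully connected layers mapping the pooled maxima to vectors for cosine similarity
        self.fc = nn.ModuleList([nn.Linear(1, embed_dim) for _ in channels])
        # 1x1 convs aligning channel counts so weighted maps can be summed element-wise
        self.align = nn.ModuleList([nn.Conv2d(c, out_channels, kernel_size=1) for c in channels])
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, f3, f4, f5, template_idx=0):
        feats = [f3, f4, f5]
        # per-stage descriptor: 1x1 conv -> global max pooling -> fully connected layer
        vecs = []
        for feat, red, fc in zip(feats, self.reduce, self.fc):
            v = F.adaptive_max_pool2d(red(feat), 1).flatten(1)   # (B, 1)
            vecs.append(fc(v))                                   # (B, embed_dim)

        # cosine similarity of every stage against the template stage, then softmax
        sims = torch.stack(
            [F.cosine_similarity(vecs[template_idx], v, dim=1) for v in vecs], dim=1
        )                                                        # (B, 3)
        weights = F.softmax(sims, dim=1)

        # upsample F4/F5 to the F3 resolution, align channels, weight, sum, and fuse
        target_size = feats[0].shape[-2:]
        fused = 0
        for i, (feat, align) in enumerate(zip(feats, self.align)):
            feat = F.interpolate(feat, size=target_size, mode="bilinear", align_corners=False)
            fused = fused + weights[:, i].view(-1, 1, 1, 1) * align(feat)
        return self.fuse(fused)
```

A typical call would be `attention_feature = SOAE()(f3, f4, f5)`, with `f3`, `f4`, and `f5` taken from Stages 3 to 5 of the backbone.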
4.2. Laplacian of Gaussian for Detail Guidance Module
The Laplacian convolution method was employed in the STDC-Seg network for extracting fine details of image edges. The Laplacian operator is a classic edge detection operator and is widely used to detect significant changes in pixel intensity. However, aerial images captured by drones often contain numerous small objects, such as vehicles and pedestrians, whose edges typically exhibit smaller intensity variations. Moreover, since noise can appear anywhere in an image and contains high-frequency components, the Laplacian operator tends to respond strongly to it. When the operator is applied to pixels surrounding small objects, the noise can produce a strong response, so the edges of small objects are overshadowed by the noise signal. Consequently, in edge detection of aerial images captured by drones, the edges of small objects may be mistakenly treated as noise and removed.
As shown in Figure 8, to address these issues while preserving the excellent edge extraction capability of the Laplacian operator, we apply the Laplacian of Gaussian (LoG) convolution method [21] in the Detail Guidance Module. Unlike the Laplacian convolution method, the LoG convolution method first applies Gaussian filtering to the image, which blurs high-frequency details and reduces the intensity of noise, separating the edges of small objects from the noise and thereby reducing its influence. Gaussian filtering also reduces the gradient variations along image edges. Both edge gradient variations and high-frequency noise respond strongly to the Laplacian operator; therefore, by reducing the variations in edge gradients, Gaussian filtering decreases the interference of noise in edge detection, making the edges of small objects easier to detect. The image is first smoothed by a Gaussian filter and then convolved with the Laplacian operator, which preserves the outstanding edge extraction capability of the Detail Guidance Module while suppressing the impact of noise, improving the accuracy of edge detail extraction. To reduce computational overhead, we leverage the associativity of the convolution operation: the Gaussian function is combined with the Laplacian operator to form a single convolution kernel, so only one convolution operation is needed on the image. The process of combining the Gaussian and Laplacian methods is outlined below:
The Gaussian function is:

$$G(x, y) = \frac{1}{2 \pi \sigma^{2}} \exp\left(-\frac{x^{2} + y^{2}}{2 \sigma^{2}}\right)$$

Performing the Laplacian transformation on the two-dimensional Gaussian function gives:

$$\mathrm{LoG}(x, y) = \nabla^{2} G(x, y) = \frac{x^{2} + y^{2} - 2 \sigma^{2}}{2 \pi \sigma^{6}} \exp\left(-\frac{x^{2} + y^{2}}{2 \sigma^{2}}\right)$$

where $x$ and $y$ denote the image pixel coordinates, $\sigma$ denotes the standard deviation of the Gaussian function, and $\mathrm{LoG}(x, y)$ denotes the LoG operator.
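As an illustration of how the combined kernel allows a single convolution pass, the sketch below builds a discrete LoG kernel from the formula above and applies it to a grayscale image tensor. The kernel size and $\sigma$ are illustrative assumptions, not values reported in this paper, and the function names are our own.

```python
import math

import torch
import torch.nn.functional as F


def log_kernel(kernel_size=5, sigma=1.0):
    """Build a discrete Laplacian-of-Gaussian kernel from LoG(x, y)."""
    half = kernel_size // 2
    y, x = torch.meshgrid(
        torch.arange(-half, half + 1, dtype=torch.float32),
        torch.arange(-half, half + 1, dtype=torch.float32),
        indexing="ij",
    )
    r2 = x ** 2 + y ** 2
    # (x^2 + y^2 - 2*sigma^2) / (2*pi*sigma^6) * exp(-(x^2 + y^2) / (2*sigma^2))
    kernel = (r2 - 2 * sigma ** 2) / (2 * math.pi * sigma ** 6) * torch.exp(-r2 / (2 * sigma ** 2))
    kernel = kernel - kernel.mean()  # zero-sum kernel: flat regions give no response
    return kernel.view(1, 1, kernel_size, kernel_size)


def extract_detail_map(gray_image, kernel_size=5, sigma=1.0):
    """Apply the combined LoG kernel in one convolution over a (B, 1, H, W) tensor."""
    kernel = log_kernel(kernel_size, sigma).to(gray_image.device)
    return F.conv2d(gray_image, kernel, padding=kernel_size // 2)


# Usage sketch: detail = extract_detail_map(torch.rand(1, 1, 512, 512))
```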
4.3. PAPPM for Capturing Contextual Information
To enhance the neural network's ability to capture multi-scale contextual information while maintaining high efficiency, researchers proposed a technique named the parallel aggregation pyramid pooling module (PAPPM) [23]. The PAPPM consists of four parallel pyramid pooling branches, each using a different pooling kernel size to extract pooled features from the feature map at a different scale. These features are then concatenated to effectively capture multi-scale contextual information. Meanwhile, to accelerate inference, the module adopts parallel computation, reducing computational overhead while maintaining accuracy. The PAPPM is integrated into the STDC network, as shown in Figure 9.
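For orientation, the following is a simplified PyTorch sketch of a PAPPM-style module: parallel average-pooling branches with different kernel sizes plus a global branch, whose outputs are upsampled, concatenated, and compressed. The kernel/stride settings and channel widths here are illustrative assumptions and may differ from the original PAPPM design in [23].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PAPPMSketch(nn.Module):
    """Simplified parallel pyramid pooling: multi-scale branches run in parallel."""

    def __init__(self, in_channels, branch_channels, out_channels):
        super().__init__()
        pool_cfgs = [(5, 2), (9, 4), (17, 8)]  # (kernel, stride) per pyramid level
        self.scale0 = nn.Sequential(
            nn.Conv2d(in_channels, branch_channels, 1, bias=False),
            nn.BatchNorm2d(branch_channels), nn.ReLU(inplace=True))
        self.pools = nn.ModuleList([
            nn.Sequential(
                nn.AvgPool2d(kernel_size=k, stride=s, padding=k // 2),
                nn.Conv2d(in_channels, branch_channels, 1, bias=False),
                nn.BatchNorm2d(branch_channels), nn.ReLU(inplace=True))
            for k, s in pool_cfgs
        ])
        # global context branch (no BatchNorm on the 1x1 spatial output)
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, branch_channels, 1), nn.ReLU(inplace=True))
        self.compress = nn.Sequential(
            nn.Conv2d(branch_channels * 5, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, x):
        size = x.shape[-2:]
        feats = [self.scale0(x)]
        # each pyramid branch is upsampled back to the input resolution
        for pool in self.pools:
            feats.append(F.interpolate(pool(x), size=size, mode="bilinear", align_corners=False))
        feats.append(F.interpolate(self.global_pool(x), size=size, mode="bilinear", align_corners=False))
        # concatenate all branches and compress back to the desired channel width
        return self.compress(torch.cat(feats, dim=1))
```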
4.4. The Detail and Context Feature Fusion Module
In the STDC-CT network, the context branch contains rich semantic information that can provide more accurate semantic representations. However, it loses a significant amount of spatial and geometric information due to the continuous downsampling process. Therefore, to balance the detail information and contextual information in the STDC-CT network, we utilize the small object attention features extracted by the previously designed SOAE module to guide the fusion of edge and detail features with contextual information. This is because one of the original intentions of designing the STDC-CT network is to improve the accuracy of small object segmentation in drone imagery; we therefore place high confidence in the small object attention branch and use it as a guiding variable to balance the fusion of the detail branch and the context branch. Specifically, the attention feature is first passed through a sigmoid layer for normalization to obtain the weight value $\alpha$, where $\alpha$ denotes the weight parameter for the detail features and $1 - \alpha$ denotes the weight parameter for the contextual information. Next, the detail features and contextual information are multiplied by their corresponding weight values and then added together to perform feature fusion. Finally, the features are processed through a CONV-BN-ReLU layer to obtain the fused output. The detailed structure of the method is illustrated in Figure 10.
The small object attention branch, the context branch, and the detail aggregation module branch are denoted as $a$, $c$, and $d$, respectively. Their outputs are represented as vectors of corresponding pixels with the symbols $x_a$, $x_c$, and $x_d$, respectively. The representations of $\alpha$ and the fused output $x_{out}$ can be written as follows:

$$\alpha = \mathrm{Sigmoid}(x_a)$$

$$x_{out} = \mathrm{CBR}\left(\alpha \cdot x_{d} + (1 - \alpha) \cdot x_{c}\right)$$

where $\mathrm{CBR}$ consists of a convolution layer, a batch normalization layer, and a ReLU activation layer. When $\alpha > 0.5$, the model's training places higher confidence in the detail features, whereas when $\alpha < 0.5$, it places higher confidence in the contextual information.
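A minimal sketch of this fusion rule is given below, assuming the three branch outputs have already been projected to the same shape; the module name and channel argument are illustrative, not part of the paper's released code.

```python
import torch
import torch.nn as nn


class DCFFMSketch(nn.Module):
    """Detail-context fusion guided by the small object attention branch."""

    def __init__(self, channels):
        super().__init__()
        # CBR: convolution -> batch normalization -> ReLU
        self.cbr = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x_a, x_c, x_d):
        # attention branch -> per-pixel weight alpha in (0, 1)
        alpha = torch.sigmoid(x_a)
        # alpha weights the detail branch, (1 - alpha) weights the context branch
        fused = alpha * x_d + (1.0 - alpha) * x_c
        return self.cbr(fused)


# Usage sketch: out = DCFFMSketch(256)(x_a, x_c, x_d), all inputs of shape (B, 256, H, W)
```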