1. Introduction
UAVs have been widely used in diverse missions. Because UAVs are highly flexible and well concealed, they play an important role in the military field. The use of UAVs at sea to detect and attack naval vessels poses a threat to the safety of navigation. Therefore, effective detection and tracking of UAV targets is of great significance. In foggy weather, the visual features of UAVs in video are weakened, which reduces tracking accuracy.
Presently, video-based multiobject tracking mainly uses the following methods: a labeled multi-Bernoulli multiobject tracking algorithm based on detection optimization [1], a multiobject tracking algorithm that predicts position by combining the deep neural network SiamCNN with contextual information [2], and a pedestrian tracking algorithm that combines the YOLO detection algorithm with a Kalman filter [3]. These methods focus on problems such as tracking loss caused by occlusion, target ID matching errors during tracking, and missed targets. In detection and tracking, there is a negative correlation between tracking accuracy and algorithm processing time: achieving high-precision tracking often increases runtime, which makes it difficult to meet real-time requirements.
Recently, scholars have carried out research on image defogging. Salazar-Colores et al. (2020) proposed a novel methodology based on depth approximations through the dark channel prior (DCP), local Shannon entropy, and a fast guided filter, which reduces artifacts and improves image recovery in sky regions with low computation time [4]. Liu et al. (2019) combined dark channel defogging with a KCF tracking algorithm for multitarget tracking under fog conditions [5]. However, this method requires manually labeling multiple target boxes in the first frame; once a tracking target is lost, the follow-up tracking task cannot be completed. Defogging is mainly achieved by image enhancement [6,7,8,9] or by using a physical model [10,11,12,13,14]. Image-enhancement defogging does not consider the influence of fog on the imaging process but achieves a defogging effect by adjusting image contrast. In contrast, physical-model defogging takes the foggy image formation process into account, with dark channel defogging being a typical algorithm.
In order to achieve real-time tracking of multiple UAV targets, we selected the YOLOv5 algorithm, which has excellent speed in the current target detection field, for the UAV target detection task. For matching and tracking, considering that UAV motion is characterized by abrupt changes in direction and speed, conventional tracking methods such as Kalman filtering and the SORT algorithm are prone to tracking and matching errors. Therefore, we chose the Deepsort algorithm to associate tracks for the detected UAV targets. Deepsort combines the motion state and appearance features of a target for matching and association during tracking, and it handles targets with abrupt changes in direction and speed well.
We implemented and tested a “detection-based tracking” algorithm for multiple UAVs in foggy weather by combining an improved dark channel algorithm with improved YOLOv5 and Deepsort [15,16,17]. Through these improvements, we reduced the complexity of the defogging algorithm from O(n²) to O(n) and simplified the structure of YOLOv5, thus reducing the time spent on defogging and detection. Compared with target detection and tracking without fog interference, the introduction of defogging makes the algorithm spend more time on single-frame image processing. For this reason, images are compressed without distortion to further ensure real-time tracking. The specific process is given in Figure 1.
2. Image Defogging Algorithm Based on Improved Dark Channel
The dark channel defogging algorithm uses the following model for image defogging [18]:

$$ I(x) = J(x)\,t(x) + A\,(1 - t(x)) \qquad (1) $$

where x stands for the pixel spatial coordinate of the image, I(x) represents the captured foggy image, J(x) represents the restored fogless image, A is the global atmospheric light value, and t(x) is the transmissivity, which can be estimated from Equation (1) when the dark channel of the fogless image, $\min_{c \in \{R,G,B\}} J^{c}(x)$, tends to zero according to the dark channel prior theory [19]. Moreover, the value of A is taken as the maximum grayscale of the pixels in the original image among the top 0.1% of pixels by luminance in the dark channel [20]. The process flow of the dark channel defogging algorithm is presented in Figure 2.
2.1. Determination of Transmissivity by Mean Filtering
In the dark channel defogging algorithm, a minimum filter is often used. However, after minimum filtering, the restored defogged image shows obvious white edges around the UAV target. This phenomenon distorts the edge features of the UAV target itself and hinders automatic UAV target recognition. To solve this problem, we used mean filtering on the dark channel image to estimate the transmissivity. Additionally, the defogging coefficient is correlated with the foggy image so that the degree of defogging adjusts adaptively. The detailed calculation process is as follows.
In Equation (1), I(x), J(x), t(x), and A are all positive, and the dark channel of the fogless image, $\min_{c \in \{R,G,B\}} J^{c}(x)$, tends to zero. Taking the minimum among the R, G, and B channels on both sides of Equation (1) gives

$$ \min_{c \in \{R,G,B\}} I^{c}(x) = t(x)\min_{c \in \{R,G,B\}} J^{c}(x) + A\,(1 - t(x)) \approx A\,(1 - t(x)) \qquad (2) $$

and rearranging Equation (1) yields the restoration formula

$$ J(x) = \frac{I(x) - A}{t(x)} + A \qquad (3) $$

We assumed a constant atmospheric light value A and denoted by $M(x) = \min_{c \in \{R,G,B\}} I^{c}(x)$ the minimum intensity among the R, G, and B channels at each pixel. Since $0 \le t(x) \le 1$, Equation (2) implies

$$ M(x) \approx A\,(1 - t(x)) \le A \qquad (4) $$
Here, mean filtering is carried out on the right side of Equation (4), and the transmissivity t(x) is calculated from the result. The mean-filtered result reflects the general trend of t(x), but there is a certain difference between it and the true value; therefore, a compensation coefficient δ is applied to the filtering result:

$$ t(x) = 1 - \delta\,\frac{\mathrm{average}_{sa}(M(x))}{A} \qquad (5) $$

Moreover, to simplify the representation, $\mathrm{average}_{sa}(M(x))$ is substituted with $M_{ave}(x)$, where average represents the mean filter processing and sa represents the window size of the filter. The approximate evaluation of the transmissivity is then expressed as follows:

$$ t(x) = 1 - \delta\,\frac{M_{ave}(x)}{A} \qquad (6) $$
δ can regulate the darkness of the image restored after defogging: the larger the value of δ, the lower the transmissivity t(x) and the darker the restored image. To enable δ to dynamically adjust the brightness after defogging according to the foggy image, δ is associated with the pixels of the original image as follows:

$$ \delta = \rho\, m_{av} \qquad (7) $$

where $m_{av}$ is the mean value of all elements in M(x), that is, the mean of the minimum RGB-channel pixel values over all pixel coordinates x of the original foggy image, and ρ is an adjustable coefficient. If the value of δ is too large, the transmissivity becomes too low and the image after defogging turns dark. For this reason, the maximum threshold of δ is set to 0.9. Thus, we have

$$ \delta = \min(\rho\, m_{av},\; 0.9) \qquad (8) $$

Equations (3), (6), and (8) are combined to obtain the following:

$$ J(x) = \frac{I(x) - A}{1 - \min(\rho\, m_{av},\, 0.9)\,\dfrac{M_{ave}(x)}{A}} + A \qquad (9) $$
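To make the estimation concrete, below is a minimal NumPy/OpenCV sketch of Equations (6)–(9). The function names, the defaults ρ = 1.3 and sa = 15, and the lower bound placed on t(x) are illustrative assumptions of ours, not values prescribed by this section.

```python
import cv2
import numpy as np

def estimate_transmission(img, A, rho=1.3, sa=15):
    """Estimate transmissivity per Equations (6)-(8).
    img: HxWx3 float32 foggy image in [0, 1]; A: scalar atmospheric light.
    rho and sa are tunable; the values here are illustrative assumptions."""
    M = img.min(axis=2)                  # per-pixel minimum over R, G, B channels
    M_ave = cv2.blur(M, (sa, sa))        # mean filter with window size sa
    delta = min(rho * M.mean(), 0.9)     # Equation (8): coefficient capped at 0.9
    t = 1.0 - delta * M_ave / A          # Equation (6)
    return np.clip(t, 0.1, 1.0)          # lower bound is our safeguard, see below

def defog(img, A, rho=1.3, sa=15):
    """Recover the fogless image via the restoration formula, Equation (9)."""
    t = estimate_transmission(img, A, rho, sa)
    J = (img - A) / t[..., None] + A     # broadcast t over the color channels
    return np.clip(J, 0.0, 1.0)
```

Clamping t(x) away from zero is a common safeguard in dark channel implementations, since Equation (9) divides by the transmissivity.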
2.2. Estimation of Global Atmospheric Light Value
During dark channel defogging, the positions of the top 0.1% brightest pixels in the dark channel must be determined. This operation requires comparing every pixel with the other pixels to arrange them in order. Assuming that the number of pixels in the image is n, the complexity of this operation reaches O(n²), and the amount of computation increases significantly with image size.
In this study, we instead estimated the atmospheric light value directly from the combination of the maximum pixel of the mean-filtered dark channel and the maximum of the per-pixel RGB minima of the original image. The algorithm then only needs to compare each pixel value with the current maximum, which greatly reduces computational complexity. The luminance of the restored image is slightly lower overall, but the complexity of the defogging algorithm is reduced from O(n²) to O(n), shortening the processing time of the algorithm.
Mean filtering is performed on inequality (4) to obtain

$$ M_{ave}(x) \le A \qquad (10) $$

From Equation (10), we obtain $\max_{x} M_{ave}(x) \le A$, and from inequality (4), $\max_{x} M(x) \le A$; hence, A may be expressed as

$$ A = \frac{1}{2}\left(\max_{x} M_{ave}(x) + \max_{x} M(x)\right) \qquad (11) $$

where each maximum is obtained in a single traversal of the image.
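Under the same assumptions as the previous sketch, the O(n) estimate of Equation (11) reduces to two linear scans:

```python
import cv2
import numpy as np

def estimate_atmospheric_light(img, sa=15):
    """Estimate A per Equation (11); img: HxWx3 float32 foggy image in [0, 1].
    The window size sa is again an illustrative choice."""
    M = img.min(axis=2)              # per-pixel minimum over the RGB channels
    M_ave = cv2.blur(M, (sa, sa))    # mean-filtered dark channel
    # Each max is a single linear scan, so the estimate is O(n) overall.
    return 0.5 * (M_ave.max() + M.max())
```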
The implementation steps of the algorithm are given in Table 1.
3. Tracking of Multiple UAVs with Improved YOLOv5 and Deepsort
After defogging the video frames of UAVs under fog conditions, an algorithm combining improved YOLOv5 and Deepsort was employed to track multiple UAVs.
The network structure of YOLOv5 is divided into input, backbone, neck, and prediction [21], as given in Figure 3.
3.1. Optimization and Improvement of YOLOv5 Network
While maintaining highly precise detection of objects of interest, the YOLOv5 network structure shortens detection time. Before target detection and tracking, however, the defogging algorithm is introduced to defog each foggy frame, which increases the processing time. Although the defogging time is reduced by the improved dark channel algorithm, the overall pipeline still cannot meet the real-time requirement. The YOLOv5 network structure was therefore optimized and improved to further shorten detection time.
3.1.1. Removal of Focus Layer
The focus module in YOLOv5 slices an image and expands the three RGB channels of the original image into 12 channels; a subsequent convolution generates an initial downsampled feature map that retains the valid information of the original image, and the module was designed to improve processing speed through reduced calculation and fewer parameters. In practice, however, the frequent slice operations in the focus module increase the amount of calculation and the number of parameters. In downsampling a 640 × 640 × 3 image to a 320 × 320 × 12 feature map, the calculation and parameters of the focus operation become four times those of an ordinary convolution operation. As revealed in experiments, a convolution can replace the focus module and perform satisfactorily without side effects. Hence, the focus layer was removed to further improve speed.
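For reference, the following PyTorch sketch contrasts the focus slice operation with the strided convolution that replaces it; the 6 × 6 stride-2 kernel mirrors the substitution adopted in later YOLOv5 releases, and the 32 output channels are an illustrative choice.

```python
import torch
import torch.nn as nn

def focus_slice(x):
    """Focus slicing: (B, 3, 640, 640) -> (B, 12, 320, 320).
    Every second pixel goes into a separate channel group."""
    return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                      x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

# Focus = slice + convolution; the replacement is a single strided convolution.
focus_conv = nn.Conv2d(12, 32, kernel_size=3, stride=1, padding=1)
plain_conv = nn.Conv2d(3, 32, kernel_size=6, stride=2, padding=2)  # replaces Focus

x = torch.randn(1, 3, 640, 640)
assert focus_conv(focus_slice(x)).shape == plain_conv(x).shape  # both (1, 32, 320, 320)
```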
3.1.2. Backbone Optimization Based on ShuffleNet V2
The backbone of YOLOv5 adopts the C3 module to extract object features, which is simpler and faster than the BottleneckCSP [22] used in previous versions. However, the C3 module uses multiple separable convolutions, so it occupies a large amount of memory when there are many channels and it is invoked frequently, which slows inference on the device to some extent. As a lightweight network model, ShuffleNet V2 [23] contains two block structures, as shown in Figure 4. Structure (a) has a channel split, so an input feature map with c channels is split into two branches of c1 and c2 channels. Branch c2 is concatenated with c1 after a three-layer convolution. Through this design, input and output channels are kept consistent to speed up model inference. Subsequently, a channel shuffle is performed to reduce the branches of the network structure, improve the parallelism of the network, and shorten processing time. Structure (a) is therefore mainly used to deepen the network. Structure (b) has the same right branch as Structure (a); its left branch convolves the input features and is then concatenated and channel-shuffled with the right branch. This structure cancels the channel split so as to allow the expansion of the module's output channels; it was therefore mainly employed to downsample and compress the feature layer.
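A condensed PyTorch sketch of Structure (a) follows; the layer widths and the channel_shuffle helper follow the general ShuffleNet V2 design, and the exact configuration is illustrative rather than the one used in this paper.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across the two branches."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleBlockA(nn.Module):
    """Structure (a): channel split, stride 1, input channels == output channels.
    c must be even. Structure (b) omits the split and downsamples both branches."""
    def __init__(self, c):
        super().__init__()
        half = c // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(),
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),  # depthwise
            nn.BatchNorm2d(half),
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)  # channel split into two branches
        return channel_shuffle(torch.cat([x1, self.branch(x2)], dim=1))
```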
In the neck layer, the improved network maintained the same feature pyramid network (FPN) and path aggregation network (PAN) structure as YOLOv5. However, the PAN structure was configured with as many output channels as input channels, and the “cat” operation was adjusted to “add”, which further optimized memory access and utilization, as shown in Figure 5. The channels of the original YOLOv5 were pruned to redesign the network structure.
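The difference between the two fusion operations can be seen in a short sketch with illustrative tensor shapes: “add” requires matching channel counts but keeps the fused map, and every convolution consuming it, half as wide as “cat” would.

```python
import torch

lateral = torch.randn(1, 128, 40, 40)   # feature from the backbone branch
top_down = torch.randn(1, 128, 40, 40)  # upsampled feature from the deeper level

fused_cat = torch.cat([lateral, top_down], dim=1)  # (1, 256, 40, 40): doubles channels
fused_add = lateral + top_down                     # (1, 128, 40, 40): channels unchanged
```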
The optimization and improvement of YOLOv5 were thus achieved by deleting the focus module, replacing the backbone with ShuffleNet modules, and optimizing the network in the neck layer. The improved detection network is less complex and processes frames faster. Its structure is presented in Figure 6.
3.2. Tracking of Multiple UAVs Based on Deepsort
The Deepsort algorithm takes the output of the YOLOv5 detector as its input, selects the object detection boxes, and computes object associations for matching. The tracking process flow [24] is presented in Figure 7.
Multiobject tracking by Deepsort contains two operations, i.e., estimation of object state, and matching and association of objects. In the first operation, Deepsort records the motion state of each object in the eight-dimensional state space $(u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h})$, where $(u, v)$ is the bounding-box center, $\gamma$ is the aspect ratio, $h$ is the height, and the dotted terms are their velocities. On this basis, a Kalman filter is employed to predict the motion state of each object, and the motion state predicted from the current frame is matched and associated with the object detections output for the next frame. In the second operation, Deepsort associates the motion features of objects by virtue of the Mahalanobis distance. To overcome the matching errors that a single Mahalanobis distance causes under abrupt changes of object speed and jitter of the shooting equipment, appearance feature matching is introduced as compensation. These two measures are combined through linear weighting into the final measure. The tracker can therefore offer reliable association and matching with short-term prediction and compensate for ID switches during object occlusion and camera shake.
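A schematic NumPy sketch of this weighted association is given below. The weight λ, the gating threshold (the 95% chi-square quantile for a four-dimensional measurement), and the function names illustrate the Deepsort scheme in general rather than settings reported in this paper.

```python
import numpy as np

def association_cost(maha, cosine, lam=0.3):
    """Combine motion and appearance distances by linear weighting.
    maha:   (tracks x detections) Mahalanobis distances from the Kalman state.
    cosine: (tracks x detections) cosine distances between appearance embeddings.
    lam:    weight on the motion term (illustrative value)."""
    return lam * maha + (1.0 - lam) * cosine

def gate(cost, maha, threshold=9.4877):
    """Exclude pairs whose Mahalanobis distance exceeds the chi-square
    threshold for a 4-D measurement (9.4877 at the 95% level)."""
    cost = cost.copy()
    cost[maha > threshold] = 1e5   # effectively forbid these assignments
    return cost
```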
4. Image Compression
While tracking multiple UAVs in a foggy video, the introduced defogging algorithm extends the processing time of each frame and lowers the frame rate, making real-time processing difficult to achieve. In order to improve the processing speed of a single frame without reducing accuracy [25,26], we used a method based on bilinear interpolation [27] to compress the image before defogging. The details are as follows.
Assuming that the size of the source image is a × b and the size of the compressed image is m × n, the horizontal and vertical compression ratios of the image are a/m and b/n. The pixel values of the compressed image can then be calculated from these ratios and the pixel points of the source image. If the coordinates of a pixel in the compressed image are (i, j), the corresponding point in the original image is (i × a/m, j × b/n). Because i × a/m and j × b/n are generally not integers, the four pixels adjacent to this point are interpolated to obtain its pixel value. The schematic diagram of this operation is presented in Figure 8.
In Figure 8, P represents the point in the source image corresponding to a pixel of the compressed image; Q11, Q12, Q21, and Q22 are the four pixels adjacent to point P in the original image, with $P = (x, y)$, $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$, and $Q_{22} = (x_2, y_2)$. The Q11 and Q21 points are interpolated in the horizontal direction to obtain the pixel value of the R1 point, and the Q12 and Q22 points are interpolated in the horizontal direction to obtain the pixel value of the R2 point. The specific calculation is as follows:

$$ f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \quad R_1 = (x, y_1) \qquad (12) $$

$$ f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}), \quad R_2 = (x, y_2) \qquad (13) $$

The R1 and R2 points are then interpolated in the vertical direction to obtain the pixel value of the P point:

$$ f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2) \qquad (14) $$
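A minimal, unoptimized Python sketch of this compression (Equations (12)–(14)) follows; the integer back-projection and edge clamping are our implementation choices.

```python
import numpy as np

def bilinear_resize(src, m, n):
    """Compress an a x b image to m x n with bilinear interpolation.
    Reference sketch of Equations (12)-(14), written for clarity, not speed."""
    a, b = src.shape[:2]
    out = np.empty((m, n) + src.shape[2:], dtype=src.dtype)
    for i in range(m):
        for j in range(n):
            x, y = i * a / m, j * b / n               # back-project into the source
            x1, y1 = int(x), int(y)                   # nearest lower pixel
            x2, y2 = min(x1 + 1, a - 1), min(y1 + 1, b - 1)  # clamp at the border
            dx, dy = x - x1, y - y1
            r1 = (1 - dx) * src[x1, y1] + dx * src[x2, y1]   # horizontal pass, R1
            r2 = (1 - dx) * src[x1, y2] + dx * src[x2, y2]   # horizontal pass, R2
            out[i, j] = (1 - dy) * r1 + dy * r2              # vertical pass, P
    return out
```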