1. Introduction
An Unmanned Aerial Vehicle (UAV) is a low-to-medium-altitude, lightweight, small, low-cost, and relatively slow aircraft that can be controlled either remotely by a human operator or autonomously via an onboard computer. UAVs have a simple mechanical structure; they can fly outdoors or indoors at various speeds, hover over a point of interest, maneuver close to obstacles, and carry out difficult tasks without endangering the operator's health and safety [
1]. Due to the aforementioned advantages, UAVs have been employed in a wide range of military and civil applications such as surveillance [
2], aerial photography [
3], infrastructure inspection [
4], express delivery [
5], pesticides spraying [
6], disaster monitoring and rescue [
7], and many others [
8]. Many of these applications have employed individual UAVs, but in recent years there has been rapid growth in applications using multiple UAVs and UAV swarms. Consequently, much of the current research effort in this context is directed toward developing unmanned aerial systems for UAV cooperation, multi-UAV autonomous navigation, and UAV pursuit-evasion problems. This paper focuses on the multi-UAV pursuit problem.
In the UAV pursuit problem, the vital task is to detect and track a target or leader UAV using a tracker or follower UAV; the strategies in this context can therefore be either competitive or collaborative between UAVs. For detection and tracking purposes, UAVs must sense their environment. Optical sensors are the most suitable for this because (1) they are cheap and light, (2) they provide rich information about the surrounding scene, and (3) they have low power consumption. To develop vision-based UAVs for pursuit, a real-time detection algorithm must be designed to localize and recognize one or multiple UAVs appearing in video streams captured by the tracker/follower UAV. However, this is a very challenging problem because of several issues: (1) large variations in the target or leader UAV's visual appearance caused by many factors such as illumination changes, weather conditions, distance between UAVs, UAV orientation, occlusion, and specular light (see
Figure 1a–f); (2) existence of shadows (see
Figure 1g); (3) low contrast between UAV and background (see
Figure 1h); (4) large variations in scene background; and (5) the limited computing power and memory of the onboard embedded devices of UAV platforms. In this paper, to address these challenges, a new shallow but wide deep learning-based method for multi-UAV detection and tracking is proposed that achieves high detection and tracking accuracy while having low computational complexity.
In the field of computer vision, many object-detection algorithms have been developed in the last few decades [
9]. Conventional object-detection methods (e.g., [
10,
11]) have been used in many applications; they are mainly based on sliding-window search or region proposals together with handcrafted features. Generally, these methods provide low detection accuracy in complex scenes, as handcrafted features cannot express the characteristics of objects precisely, and they are computationally expensive. Consequently, these approaches are not applicable to real-time computer vision applications. To cope with the limitations of the conventional approaches, many object-detection algorithms based on deep Convolutional Neural Networks (CNNs) have recently been developed and employed [
12,
13]. Generally, deep learning-based object-detection algorithms can be divided into two classes: two-stage and one-stage object detectors. In the former approach, first, a region proposal network is used to estimate candidate object bounding boxes. Then in the second stage, the network extracts features from each candidate box and performs classification and bounding-box regression. In this manner, several methods such as R-CNN [
14], SPP-Net [
15], Fast R-CNN [
16], Faster R-CNN [
17], R-FCN [
18], and FPN [
19] have been proposed. Even though these methods have demonstrated high accuracy, they usually suffer from high computational cost. The latter object detector approach uses a single deep neural network with regression strategy to directly classify and detect objects. Please note that in this approach, the process of region proposal is avoided. Some of the one-step detectors are OverFeat [
20], AttentioNet [
21], G-CNN [
22], Single Shot Detector (SSD) [
23], You Only Look Once (YOLO) [
24], YOLOv2 [
25], YOLOv3 [
26], DSSD [
27], DSOD [
28], RetinaNet [
29], and RefineDet [
30]. These methods provide both time efficiency and reasonable accuracy, which are the ultimate goals of real-time object detection. Among others, the YOLO series of algorithms has shown a good track record in solving the UAV pursuit problem and has achieved a higher mean average precision (mAP) than other real-time deep learning-based systems. For instance, A. Bonnet et al. [
31], S. Arola et al. [
32], and M. A. Akhloufi et al. [
33] propose a deep learning framework to detect and track a follower or leader UAV using an optical camera mounted on another UAV. To achieve this, a search area proposal based on particle filters [
34] is combined with YOLOv2 CNN model [
25] to improve the performance of the YOLO algorithm for tracking a UAV located far from the tracker or follower UAV. In these works, the bounding boxes estimated by YOLOv2 are given to a particle filter algorithm to predict the target positions in subsequent frames. H. Saribas et al. [
35] propose a real-time hybrid detection and tracking method to pursue a UAV in video frames. In this manner, the Kernelized Correlation Filter (KCF) [
36] is integrated with YOLOv3 [
26] and YOLOv3-tiny detection models. More specifically, this framework uses the shallow version of YOLOv3 (YOLOv3-tiny) to detect a UAV in the first frame or in frames where the tracker fails, and it employs KCF to maintain tracking in the intermediate frames. Although these state-of-the-art methods show the applicability of YOLO-series CNN models to real-time UAV-UAV detection and tracking, they are tested on simple scenarios because (1) the YOLO and YOLOv2 CNN backbones are designed for large objects and (2) the YOLOv3-tiny CNN is a relatively shallow and small network that cannot extract UAV image features with high precision. Therefore, to meet the need for high UAV detection accuracy, it is vital to develop a new framework that makes a better trade-off between speed and accuracy for the multi-UAV pursuit application. The differences between the proposed method and the existing UAV detection and tracking frameworks are shown in
Table 1.
This paper presents a novel real-time deep learning-based framework for embedded computer systems for multi-UAV detection and tracking in airborne videos captured in complex and dynamic outdoor environments. The proposed UAV detection framework is inspired by the YOLOv3-tiny algorithm and modifies its CNN model to enhance its feature extraction capability while preserving its fast detection speed. Generally, in the context of UAV pursuit, the YOLOv3-tiny [
26] detection method provides low detection accuracy, for three main reasons. First, it uses the shallow and simple Darknet19 network architecture as its backbone, which causes insufficient feature extraction, especially at the deep convolutional layers where small UAVs may not provide enough information for the network to learn from. Second, it builds its deep CNN model by simply stacking convolutional and max-pooling layers, which can result in overfitting and degradation of the neural network. Third, YOLOv3-tiny's internal CNN layers use a fixed filter size (3 × 3) to detect UAVs. However, UAVs and the objects in their surrounding environments appear with large variations in size and aspect ratio, so it is important to design a new network architecture with different filter sizes to extract features with high precision. To solve these issues, we develop a new CNN architecture, called FastUAV-NET, with the following characteristics:
It is an embedded-friendly CNN network architecture, as it has low compute and memory demands.
It is a sparsely connected CNN architecture, inspired by the Inception model [
37], which maintains the computational cost while increasing the depth and width of the CNN model.
It is a tiny, wide network architecture that uses five Inception blocks, where each block combines 1 × 1, 3 × 3, and 5 × 5 convolutional layers and a max-pooling layer, to extract features from scene objects with various sizes, aspect ratios, and orientations (an illustrative sketch of such a block is given below).
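To make the block structure concrete, the following is a minimal PyTorch sketch of an Inception-style block of the kind described above; the branch channel widths and the exact ordering of batch normalization and activation are illustrative assumptions, not the paper's exact configuration (which is defined in Section 3).

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Illustrative Inception-style block: parallel 1x1, 3x3, and 5x5 convolutions
    plus a max-pooling branch, concatenated along the channel dimension.
    Channel widths are placeholders, not the paper's exact values."""
    def __init__(self, in_ch, c1, c3, c5, cp):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, c1, kernel_size=1),
            nn.BatchNorm2d(c1), nn.LeakyReLU(0.1))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3, kernel_size=3, padding=1),
            nn.BatchNorm2d(c3), nn.LeakyReLU(0.1))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5, kernel_size=5, padding=2),
            nn.BatchNorm2d(c5), nn.LeakyReLU(0.1))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, cp, kernel_size=1),
            nn.BatchNorm2d(cp), nn.LeakyReLU(0.1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Example: a 64-channel input map produces a 32+64+16+16 = 128-channel output map.
# y = InceptionBlock(64, 32, 64, 16, 16)(torch.randn(1, 64, 104, 104))  # -> (1, 128, 104, 104)
```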
Furthermore, the proposed CNN architecture adopts a feature pyramid network to detect UAVs at two different scales. Even with the proposed tiny, wide CNN architecture, the detection frame rate is still low on embedded computer systems mounted on UAVs. Therefore, the proposed FastUAV-NET is used only to initialize bounding boxes for UAVs in the scene, and then the sKCF tracking algorithm [
38], which can run at a high frame rate, is used to track the bounding boxes. The proposed UAV detection algorithm is applied to every 6th frame, and the detected UAVs are tracked in the intermediate frames via sKCF [
38]. The experiments are performed on a variety of test videos, and the results indicate that the proposed method is robust to issues caused by fast-moving UAVs, changes in UAV scale and aspect ratio, illumination variation, camera viewpoint change, specular light, and shadow. Moreover, the results show that the proposed framework has the lowest detection error rate compared to the state-of-the-art methods.
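The detector-plus-tracker pipeline described above can be summarized with the following Python sketch. The detect_uavs function stands in for FastUAV-NET inference, and OpenCV's built-in KCF tracker is used here only as a stand-in for the sKCF tracker of [38] (available as cv2.TrackerKCF_create or cv2.legacy.TrackerKCF_create depending on the OpenCV build); the 6-frame detection interval follows the text.

```python
import cv2

DETECT_INTERVAL = 6  # run the CNN detector on every 6th frame, per the text

def run_pursuit(video_path, detect_uavs):
    """detect_uavs(frame) -> list of (x, y, w, h) boxes; placeholder for FastUAV-NET."""
    cap = cv2.VideoCapture(video_path)
    trackers, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % DETECT_INTERVAL == 0:
            # Detection frame: (re-)initialize one tracker per detected UAV.
            boxes = detect_uavs(frame)
            trackers = []
            for box in boxes:
                tracker = cv2.TrackerKCF_create()  # stand-in for sKCF [38]
                tracker.init(frame, box)
                trackers.append(tracker)
        else:
            # Intermediate frame: propagate the boxes with the trackers only.
            boxes = []
            for tracker in trackers:
                ok_t, box = tracker.update(frame)
                if ok_t:
                    boxes.append(box)
        # ... use `boxes` for pursuit control or visualization here ...
        frame_idx += 1
    cap.release()
```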
Contributions:
The main contributions of the paper are as follows:
proposing a new framework composed of a detector based on a novel deep learning architecture, named FastUAV-NET, and a tracker based on scalable kernel correlation filter;
fast and accurate localization of multiple UAVs in airborne video frames;
developing a framework that can run online on a GPU embedded platform mounted on a UAV; and
generating the largest UAV-UAV dataset which consists of 25,000 video frames with large variations in both UAVs’ backgrounds and foregrounds.
This paper is organized as follows.
Section 2 briefly describes the YOLOv3-tiny algorithm.
Section 3 describes the proposed framework for multi-UAVs pursuit.
Section 4 provides the experimental results of the proposed method and compares them with state-of-the-art methods. The paper is concluded in
Section 5.
2. Brief Review of YOLOv3-Tiny
Currently, one-step object-detection algorithms (e.g., [
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30]) have been widely used in many object-detection problems. Among others, YOLOv3 [
26] is one of the most popular object detectors, since a single convolutional neural network simultaneously predicts multiple bounding boxes and class probabilities with high accuracy and fast inference speed. To extract features and detect objects, YOLOv3 [
26] stacks two fully convolutional underlying architectures with 3 × 3 and 1 × 1 convolution kernels. More specifically, Darknet53 [
26], which is inspired by ResNet [
39], with 53 convolutional layers is used as a backbone to extract features, and then 53 more layers are stacked onto it to achieve object detection. However, using 106 convolutional layers causes heavy computational cost and a large run-time memory footprint, so the model is not able to achieve real-time speed on embedded systems. In other words, the Darknet53 network suffers from a problem of diminishing feature reuse, which makes it slow. To take advantage of this widely used object-detection method in real-time applications, a lightweight version, called YOLOv3-tiny, has been developed. YOLOv3-tiny uses the Darknet19 network architecture, which includes 13 layers for feature extraction and 11 more layers for object detection. The structure of YOLOv3-tiny is shown in
Table 2. Generally, this method achieves up to 220 Frames Per Second (FPS), whereas YOLOv3 achieves up to 35 FPS on a computer with a Titan X GPU. To achieve this high computational performance, YOLOv3-tiny first simplifies the Darknet53 architecture by reducing the size of the backbone model and then uses a two-scale prediction strategy to detect objects on two-scale feature maps.
2.1. Detection Procedure in YOLOv3-Tiny
YOLOv3-tiny is an end-to-end object-detection method. In this method, the input image is split into S × S grid cells (e.g., 13 × 13 and 26 × 26 for a 416 × 416 input), which are used in a CNN architecture to predict 3 bounding boxes and c class probabilities for each grid cell. In this manner, the network predicts bounding-box center coordinates (x, y) relative to the bounds of the grid cell as well as width and height (w, h) relative to the whole image. Moreover, a confidence score (C) for each bounding box is estimated as the product of the probability that the bounding box contains the object of interest (Pr(Object)) and the Intersection over Union (IoU) of the predicted box and the ground-truth box, i.e., C = Pr(Object) × IoU.
Please note that IoU represents a value between 0 and 1. To calculate the IoU, first the overlapping area between the predicted bounding box and the ground truth is computed to obtain the intersection, and then the intersection is divided by the union, which is the total area covered by both the predicted and ground-truth boxes. Ideally, when the IoU value is close to 1, the predicted bounding box is close to the ground truth. Moreover, in YOLOv3-tiny, in parallel with the bounding-box prediction, each grid cell also predicts c conditional class probabilities, Pr(Class_i | Object). The class-specific confidence score, which reflects how likely an object of a given class is to be present in an individual box, is then estimated as the product of the individual box confidence and the conditional class probability, i.e., Pr(Class_i | Object) × Pr(Object) × IoU = Pr(Class_i) × IoU.
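For completeness, the standard YOLOv3 parameterization of the predicted box quantities can be sketched as follows; the raw outputs t_x, t_y, t_w, t_h and the normalized anchor priors are notational assumptions (this section does not spell out the decoding step), but the cell-offset and anchor-scaling logic is the usual YOLOv3/YOLOv3-tiny formulation.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, grid_size):
    """Standard YOLOv3-style decoding (sketch): cell offsets (cx, cy) and anchor
    priors turn raw network outputs into a box center/size normalized to the image."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cx) / grid_size   # center x, relative to image width
    by = (sigmoid(ty) + cy) / grid_size   # center y, relative to image height
    bw = anchor_w * math.exp(tw)          # width from (normalized) anchor prior
    bh = anchor_h * math.exp(th)          # height from (normalized) anchor prior
    return bx, by, bw, bh

# Example: a prediction in cell (5, 7) of a 13 x 13 grid with a hypothetical anchor.
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=5, cy=7, anchor_w=0.15, anchor_h=0.1, grid_size=13))
```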
In contrast to the earlier YOLO versions, YOLOv3 and YOLOv3-tiny predict the objectness score for each bounding box using a multilabel classification approach based on logistic regression rather than softmax, to better model the data. For each grid cell, the model returns a tensor of size [number of anchors × (bounding-box offsets + objectness prediction + c)], where the number of anchors, bounding-box offsets, and objectness prediction are set to 3, 4, and 1, respectively, and c is the number of classes. In the final step of YOLOv3-tiny, the fine-tuned bounding box(es) is generated using class-specific confidence-score thresholding and non-maximum suppression. Finally, to train YOLOv3-tiny, a sum-of-squared-error loss is used for the bounding boxes and a binary cross-entropy loss is used for the objectness score and class probabilities. More specifically, the YOLOv3-tiny loss function L is broken into four main parts: (1) error in bounding-box centers; (2) error in bounding-box dimensions; (3) loss related to the confidence score; and (4) object classification loss. In this loss, λ_coord is a weight used to increase the emphasis on boxes containing objects, whereas λ_noobj is used to lower the emphasis on boxes with no objects; S² is the number of grid cells, B is the number of bounding boxes predicted by each cell, c is the number of classes, C refers to the confidence score, and p refers to the class prediction. Moreover, 1_i^obj indicates whether an object is present in cell i, and 1_ij^obj denotes that the j-th bounding-box predictor in cell i is responsible for that prediction.
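For reference, a hedged reconstruction of this four-part loss, written in the classical YOLO form that matches the symbols defined above, is given below; the actual YOLOv3-tiny implementation replaces the squared-error confidence and class terms with binary cross-entropy, as noted in the text, so this should be read as a sketch rather than the exact equation used by the framework.

```latex
% Sketch of the four-part YOLO-style loss (classical squared-error form).
L =
  \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
      \big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\big]
+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}
      \big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\big]
+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\,(C_i-\hat{C}_i)^2
+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\,(C_i-\hat{C}_i)^2
+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
```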
2.2. YOLOv3-Tiny Network Architecture
The Darknet19 structure of the YOLOv3-tiny network consists of 24 layers, where layers 0 to 12 are convolutional and pooling layers for extracting features of the target objects from the input images. In the backbone architecture, each convolutional layer with filter size 3 × 3 and stride 1 is followed by batch normalization, a Leaky ReLU activation function, and a pooling layer with kernel size 2. In this algorithm, a max-pooling strategy is used to achieve dimensionality reduction. After feature extraction, YOLOv3-tiny detects objects at two different scales using a feature pyramid network. To achieve this, the first detection is made by the 16th layer. Over the first 15 layers, the input image is down-sampled to generate a 13 × 13 feature map (for a 416 × 416 input). Then, the detection is made using a 1 × 1 detection kernel, resulting in a 13 × 13 × [3 × (4 + 1 + c)] feature map. Next, the feature map from layer 13 is passed through a 1 × 1 convolutional layer before being up-sampled by a factor of 2 to 26 × 26. To improve the performance of the feature pyramid, the up-sampled feature map is concatenated with the feature map from layer 8, which is the corresponding-resolution feature map generated by Darknet19. The combined feature maps are then passed through a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer to fuse the features from the earlier layer. Next, the second detection is made by the 23rd layer, yielding a 26 × 26 × [3 × (4 + 1 + c)] detection feature map. To detect objects, the positions of the objects are estimated using this semantic information, and a bounding box is assigned to every possible target based on 6 anchors at two scales. Finally, non-maximum suppression is used to remove redundant bounding boxes and determine the optimal final bounding box.
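As a rough, runnable illustration of these feature-map sizes, the following PyTorch sketch traces the two-scale head for a 416 × 416 input and a single UAV class; the channel widths are those of the common tiny configuration and are assumptions here, since this section does not list them explicitly.

```python
import torch
import torch.nn as nn

# Shape walk-through of the two-scale YOLOv3-tiny head for a 416 x 416 input and a
# single class (UAV): each scale predicts 3 anchors x (4 offsets + 1 objectness +
# 1 class) = 18 channels. Channel widths are illustrative assumptions.
num_out = 3 * (4 + 1 + 1)

def conv(ci, co, k):
    return nn.Sequential(nn.Conv2d(ci, co, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(co), nn.LeakyReLU(0.1))

feat_13 = torch.randn(1, 1024, 13, 13)  # deepest backbone feature map (after layer 13)
feat_26 = torch.randn(1, 256, 26, 26)   # earlier backbone feature map (layer 8)

x = conv(1024, 256, 1)(feat_13)                                   # 1 x 256 x 13 x 13
det_13 = nn.Conv2d(512, num_out, 1)(conv(256, 512, 3)(x))         # 1 x 18 x 13 x 13

up = nn.Upsample(scale_factor=2)(conv(256, 128, 1)(x))            # 1 x 128 x 26 x 26
merged = torch.cat([up, feat_26], dim=1)                          # 1 x 384 x 26 x 26
det_26 = nn.Conv2d(256, num_out, 1)(conv(384, 256, 3)(merged))    # 1 x 18 x 26 x 26

print(det_13.shape, det_26.shape)  # first and second detection scales
```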
2.3. Limitations of YOLOv3-Tiny in UAV Detection
YOLOv3-tiny can achieve real-time detection on a GPU. However, as shown in
Table 2, Darknet-19 has a very limited number of convolutional layers (depth), resulting in limited feature extraction capability. This leads to low detection accuracy when the scene complexity is high. Consequently, in the UAV pursuit application, where both the scene complexity and the demand for detection precision are high, the YOLOv3-tiny detection algorithm cannot be used without improving its Darknet-19 network architecture.
4. Results and Discussion
In this section, the performance of the proposed FastUAV-NET multi-UAV detection architecture with and without sKCF tracking [
38] is quantitatively and qualitatively evaluated on outdoor videos captured under a wide range of real-scene variability. The proposed method is compared with state-of-the-art detection methods.
4.1. Dataset
The dataset is generated from 11 different videos, which are used for training and testing. The video frames contain one or two UAVs and are captured in RGB color space under (1) a large variety of backgrounds, (2) various weather conditions, (3) different illumination changes, (4) different viewpoints, scales, and aspect ratios (see
Figure 3), and (5) the existence of shadows. In this dataset, the size of the UAVs varies over a wide range of pixel dimensions, and all videos share the same frame resolution. To generate the training dataset, 6 of the videos are used, and the remaining 5 videos are used to construct the testing dataset. In this manner, the training and testing datasets consist of 15,000 and 10,000 video frames, respectively. Moreover, they are manually annotated with bounding boxes. The basic characteristics of the constructed dataset are summarized in
Table 3.
4.2. Quantitative Measurement Metrics
To validate the detection results, three quantitative measures are used: recall, precision, and Average Precision (AP). Recall refers to the ratio of true positives (TP) to the number of true positives plus the number of false negatives (FN), i.e., Recall = TP / (TP + FN). Precision refers to the ratio of true positives (TP) to the number of true positives plus the number of false positives (FP), i.e., Precision = TP / (TP + FP). To determine whether a bounding box predicted by a model is a true positive, false positive, or false negative, the overlap between the UAV detection result (B_p) and the ground-truth bounding box (B_gt) is considered, i.e., IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt). When this score is greater than the chosen threshold, the detection is considered a true positive; otherwise, it is considered a false positive. Moreover, when the model fails to detect a UAV, this is counted as a false negative. The third metric, average precision, is the area under the precision-recall curve.
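These definitions translate directly into code; the following is a minimal sketch, with boxes given in (x1, y1, x2, y2) form and arbitrary example values.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts at a fixed IoU threshold."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Example: a predicted box that overlaps a ground-truth box reasonably well.
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))   # ~0.47
print(precision_recall(tp=80, fp=20, fn=15))      # (0.8, ~0.84)
```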
4.3. UAV Detection Methods
To understand and analyze the performance of the detection methods, the proposed model is compared with YOLOv3 [
26] and YOLOv3-tiny models. Moreover, to reduce the computational cost, all the UAV detection methods are applied to every 6th frame, and the detected UAVs are then tracked in the intermediate frames via the sKCF algorithm [
38]. In this manner, the UAV tracking performance is also examined using the different UAV detection models.
In the proposed and the compared UAV detection methods, anchor boxes are used to predict bounding boxes. Please note that the YOLOv3 algorithm uses nine anchors, whereas the proposed method and YOLOv3-tiny employ six anchors. In YOLOv3, the nine anchors are selected using k-means clustering; in the proposed method and YOLOv3-tiny, six anchor width-height pairs are used. In all the methods, the batch size and subdivision are set to 64 and 16, respectively, with the same learning rate, and the same detection threshold score is used for all the methods. In the proposed and YOLOv3-tiny methods, the number of training epochs is set to 20,000, whereas in YOLOv3 it is set to 40,000. The convolutional neural network frameworks are implemented in CUDA C++ on the Windows platform. They are trained and tested on a computer with a single NVIDIA GTX 1050 Ti GPU, an Intel i7-7700HQ CPU, and 16 GB of RAM. Moreover, the algorithms are also tested on a Jetson TX2 module embedded on a UAV.
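The k-means anchor selection mentioned above can be sketched as follows. Plain Euclidean k-means over annotated box widths and heights is shown for brevity, and the box sizes are synthetic stand-in data; Darknet-style implementations typically cluster with a 1 − IoU distance instead.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_anchors(wh, k):
    """Cluster ground-truth (width, height) pairs into k anchor shapes, sorted by area."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]

# wh: N x 2 array of annotated UAV box sizes in pixels (synthetic stand-in data here).
rng = np.random.default_rng(0)
wh = np.abs(rng.normal(loc=60, scale=25, size=(500, 2))) + 10
print(estimate_anchors(wh, k=6))   # six anchors, as used by YOLOv3-tiny and the proposed method
print(estimate_anchors(wh, k=9))   # nine anchors, as used by YOLOv3
```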
4.4. Performance Comparison of UAV Detection Methods
In this section, the performance of the proposed FastUAV-NET multi-UAV detection model is compared with the state-of-the-art detection methods YOLOv3 and YOLOv3-tiny. The proposed FastUAV-NET and the compared detection methods are trained on the 15,000 collected training frames and tested on the 10,000 video frames of the testing dataset. To evaluate the accuracy of the bounding boxes obtained with the proposed FastUAV-NET detection method as well as YOLOv3 and YOLOv3-tiny, quantitative measures including precision, recall, IoU, and AP are used. Based on the recall and precision rates of the test results obtained with the different methods, the precision-recall curves of UAV detection are plotted and shown in
Figure 4.
Figure 4 shows that FastUAV-NET performs favorably against YOLOv3 and YOLOv3-tiny. Moreover, the IoU and AP of these 3 models on the generated test dataset are shown in rows 2 to 4 of
Table 4. The proposed FastUAV-NET provides 0.7576 AP and 0.6175 average IoU, YOLOv3 has 0.4579 AP and 0.5225 average IoU, and YOLOv3-tiny gives 0.4210 AP and 0.4683 average IoU. The results in
Figure 4 and
Table 4 show that the proposed multi-UAV detection method provides the highest AP and IoU scores. The IoU results show that the proposed FastUAV-NET achieves a higher overlap between the predicted bounding boxes and the ground truth. On the other hand, the YOLOv3-tiny detection method provides the lowest accuracy. The main reasons that the proposed method is more effective than the other two methods are two-fold. First, the proposed model uses different kernel sizes for the convolution operations, which yields multi-level feature extraction. Please note that extracting features using multiple filters with different kernel sizes improves the performance of the network, as the network can learn UAVs from small details, mid-sized features, or almost whole images. Second, the proposed network is specifically designed for UAV detection in wild airborne scenes, which is not the case for YOLOv3 and YOLOv3-tiny, whose network architectures are designed to detect various objects in natural scenes. Consequently, FastUAV-NET is more robust to large variations in both the UAVs' backgrounds and foregrounds. Moreover, the results show that widening the CNN is a much more effective way of improving UAV detection performance than increasing the depth of the CNN.
The task of UAV pursuit is considered a very challenging problem due to the existence of shadows, image instability, platform motion, sudden changes in the size of the captured UAVs, the presence of many interfering objects in the environment, and variations in camera orientation and illumination. To visualize this, the three detection methods are applied to 12 different frames of the 5 test videos under different scenarios, and the results are depicted in
Figure 5. The results show that the proposed method is robust to the existence of shadows and to sudden changes in illumination, weather conditions, background, scale, aspect ratio, and viewpoint.
Figure 5 illustrates that the proposed detection method effectively detects all UAVs. Furthermore, it can be seen that the proposed detection method drastically improves the accuracy of the predicted bounding boxes in terms of scale and aspect ratio. However, in the multi-UAV scenario (see Video 2), the results show that the proposed method can fail to detect multiple UAVs when the UAVs lie very close to each other. Moreover,
Figure 5 also demonstrates that the YOLOv3 and YOLOv3-tiny networks have high missed-detection rates under most of the scenarios. Consequently, the overall experimental results show that the widened network architecture can even outperform the thin, 106-layer-deep YOLOv3 network.
The experimental results obtained for the computational time are tabulated in
Table 4. The computational times on the desktop computer show that all methods can be used for real-time applications. On the other hand, the computational results on the embedded Nvidia Jetson TX2 show that only FastUAV-NET and YOLOv3-tiny can be considered for embedded applications, as YOLOv3 runs at only 7 FPS, which is very slow.
Table 4 illustrates that the proposed model is approximately two times faster than YOLOv3. After training, the proposed network runs at about 13 FPS on the Nvidia Jetson TX2. However, the YOLOv3-tiny network is faster than the proposed network, as it runs at 25 FPS. The main reason YOLOv3-tiny is faster is simply that it has only 13 convolutional layers, which decreases the computational time, at the cost of reduced accuracy and solution quality. Even though the proposed CNN architecture is slower than YOLOv3-tiny, it provides significantly higher detection accuracy.
4.5. Performance Comparison of UAV Detection Methods with Tracking
Here, the intention is to test the performance of detection algorithms with the sKCF tracking method [
38] on the 5 test videos. To this end, the detection algorithms are applied to every 6th frame, and the predicted bounding box(es) are then tracked in the intermediate frames using the sKCF algorithm. To conduct the experiments, the bounding boxes estimated by the sKCF algorithm are used to calculate the IoU and AP scores, and the results are tabulated in the last 3 rows of
Table 4. The results show that the highest and lowest accuracies belong to the sKCF tracking method initialized with FastUAV-NET and YOLOv3-tiny, respectively.
Table 4 shows that the proposed method with the sKCF tracker provides 0.7916 AP and 0.6583 average IoU, YOLOv3 with the sKCF tracker gives 0.4788 AP and 0.5470 average IoU, and YOLOv3-tiny with the sKCF tracking method provides 0.4568 AP and 0.5046 average IoU. More specifically, the results indicate that the proposed detection method with the sKCF tracker outperforms YOLOv3 and YOLOv3-tiny (each combined with the sKCF tracker) by approximately 65% and 73% in AP, respectively. Furthermore,
Table 4 shows that integrating the UAV detection methods with the sKCF tracking algorithm reduces the computational cost and thus speeds up the multi-UAV detection and tracking methods. From the results, it is clear that only the proposed framework and YOLOv3-tiny can be considered to run in real time on the embedded Nvidia Jetson TX2. Finally, we present a visual comparison of the proposed method and YOLOv3-tiny on the test video dataset; the qualitative detection results are shown in
Figure 6. The tracking results for Video 1 illustrate that initializing the sKCF tracker with the proposed network successfully tracks the UAV in all the intermediate frames, whereas initializing the sKCF with YOLOv3-tiny fails to detect the UAV at intermediate frame 2750. Video 2 demonstrates the multi-UAV scenario, and the results show that the proposed framework tracks the UAVs in most of the frames, except where the UAVs are very close to each other. On the other hand, the sKCF tracker with YOLOv3-tiny provides very low tracking performance; for instance, the proposed method can detect both UAVs at video frames 640 and 850, whereas the sKCF tracker with YOLOv3-tiny could only track one of the UAVs. Video 4 belongs to the complex-background scenario, and the results show that our method can track the UAV in all the video frames, whereas the compared framework fails to track the UAV at video frames 380 and 410. Consequently, the qualitative results verify that the proposed method can initialize the sKCF tracker with higher accuracy than YOLOv3-tiny.
4.6. Discussion
The ultimate goals of multi-UAV detection and tracking are two-fold. First, UAVs must be localized with high precision. Second, the processing speed should also be high, so that the localization system can run online on a GPU embedded system mounted on a UAV. To achieve these goals, the current state-of-the-art methods try to integrate the most popular real-time object-detection methods, such as YOLOv3-tiny and YOLOv2, with fast tracking algorithms such as KCF and particle filters. The YOLOv3-tiny and YOLOv2 methods use the Darknet-19 backbone structure to extract features. However, this paper shows that detection methods based on the Darknet-19 architecture are fast but do not perform well on the constructed UAV dataset, which comprises RGB video frames with large variations in both the UAVs' backgrounds and foregrounds. This is because Darknet-19 is a shallow and simple CNN that cannot extract features and approximate the UAV model with high accuracy. Moreover, the experiments show that when focusing on the UAV pursuit problem, a customized CNN architecture must be used to improve detection performance. Accordingly, we propose to widen the backbone structure of the YOLOv3-tiny detection method to balance localization performance, by learning more robust features, against speed, without massively increasing the depth of the convolutional layers. The results show that the proposed method can detect UAVs with high precision even under harsh environmental conditions. However, the experimental results also indicate that the proposed method can fail to detect multiple UAVs when the UAVs fly very close to each other. To further improve the computational time of the FastUAV-NET detection algorithm, the scalable Kernel Correlation Filter (sKCF) tracker, which is an efficient and fast tracking algorithm, is used. More specifically, the FastUAV-NET detection algorithm is applied to every 6th frame, and the detected UAV(s) are then tracked in the intermediate frames using the sKCF tracker. The experimental results show that FastUAV-NET combined with the sKCF tracker outperforms the sKCF tracker initialized with YOLOv3-tiny by approximately 73%.