1. Introduction
Unmanned Aerial Vehicles (UAVs), or drones, are now widely used in civilian and military applications, such as surveillance, rescue, surveying and delivery, owing to their “LSS” characteristics (low altitude, slow speed and small size). However, these same characteristics also make UAVs hard to detect, so they may pose a serious threat to military and public security, especially to aircraft during landing or take-off. For example, Frankfurt Airport was temporarily closed in March 2019 because of two drones hovering nearby, causing approximately 60 flight cancellations [1]. Hence, an accurate, long-range and wide-area UAV detection method is urgently needed, now and in the future.
Recent approaches for detecting UAVs in images are mostly based on computer vision (CV) methods [2,3,4,5], which can be roughly classified into three categories: appearance-based methods, methods based on motion information across frames, and hybrid methods. Appearance-based methods rely on specially designed neural network (NN) frameworks, such as Faster R-CNN [6], You-Only-Look-Once (YOLO)v3 [7], the Single Shot MultiBox Detector (SSD) [8] and Cascade R-CNN [5]. They have proven powerful under complex lighting or backgrounds for some tasks. However, they require the targets to be relatively large and clearly visible [9,10], which is often not the case in real-world drone detection scenes. Motion-based methods mainly rely on optical flow [11,12,13,14,15,16] or motion modeling of the foreground [17,18,19]. These methods are more robust when the targets are tiny or blurry in images, but they are more often employed for region proposal or for distinguishing moving objects from static backgrounds rather than for recognition. Hybrid methods, combining both appearance and motion information, may add extra structures or constraints, such as motion restrictions [8,20] or inter-frame coherence [21,22], to basic detection architectures (typically appearance-based neural network backbones). In short, previous approaches have exploited different characteristics of drones and significantly improved detection performance. However, their common shortcoming is that they only consider drone detection against a relatively clean background with one or two targets and few distractors. Moreover, little work has exploited motion features for object recognition, let alone for drones.
Compared with conventional object detection problems, drone detection poses the following unique challenges. First, the object to be detected may appear in any area of the frame and move in any direction. Second, the background is often complex and changes quickly in urban scenes. Third, UAV-like distractors, such as birds, kites or pedestrians, are commonly present. Fourth, the target usually occupies fewer than 200 pixels in the captured image and its appearance varies widely, which results in severe performance degradation of CNN-based methods.
To address these problems, the present work introduces a novel motion-based method named the multi-scale space kinematic detection method (MUSAK). It relies on recovering object motion behavior and is inspired by the observation that different objects exhibit different motion patterns. MUSAK detects drones against uncontrolled backgrounds by exploiting multiscale kinematic parameters extracted from input videos. The kinematic parameters here consist of the translation parameters (translation velocity and acceleration) and the rotation parameters (angular velocity and angular acceleration).
The structure of the MUSAK method is shown in Figure 1. The pipeline starts with the extraction and tracking of ROIs (Regions of Interest). Then, based on the number and quality of the keypoints in each ROI, the process enters one of three compatible scale spaces to further extract the kinematic parameters. Afterwards, the extracted time-series motion parameters are fed to the corresponding GRU classifiers, which are trained separately on a customized UAV database comprising several public datasets and our self-built Multiscale UAV Dataset (MUD), to output the motion recognition results. Unlike previous detection methods, MUSAK extracts motion patterns by exploiting kinematic parameters at different scales, which greatly enlarges the interclass differences.
The experiments suggest that MUSAK achieves state-of-the-art detection accuracy for UAVs compared with existing methods. We also carry out adaptivity and significance analyses for MUSAK.
The main contributions can be summarized as follows:
- For the first time, this work proposes a motion-based method that uses combinations of rigid-body kinematic parameters for detecting objects.
- The proposed MUSAK introduces three scale spaces to describe the different states of the kinematic parameters of objects. In particular, it uses the pixel count of the object as a relative depth to construct a pseudo 3D space.
- A new drone dataset, MUD, is established. It comprises several public databases and newly added data captured in real-world scenes with motion-related labels.
The remainder of this paper is organized as follows. We give an overview of related work in Section 2. In Section 3, we describe each part of the proposed MUSAK method. The datasets and experimental results are presented in Section 4. Further discussion is given in Section 5, and Section 6 concludes the paper.
3. Multi-Scale Space Kinematic Method for Drone Detection
This section describes the proposed MUSAK method. Figure 1 gives the detection framework. The key idea of our method is to employ three groups of kinematic parameters, defined in 2D space, pseudo 3D space and 3D space, respectively, to handle detection scenarios with different image qualities.
Specifically, for each time-series input of ROIs, the first step of the MUSAK method is to employ the ground plane alignment method to calibrate the orientation. Then, according to the number and quality of keypoints in each ROI, MUSAK introduces three individual scale spaces to describe the target: the detail scale, the edge scale and the block scale, corresponding to the 3D space, the pseudo 3D space and the 2D space mentioned above.
At the detail scale, the number of robust keypoints is sufficient to produce a high-precision 3D structure of the target, from which the depth map and all the 3D kinematic parameters can be obtained. At the edge scale, the robust keypoints are not sufficient to obtain a reliable depth structure, so we estimate a relative depth from two neighboring frames, described by a set of relative 3D kinematic parameters, namely the pseudo 3D parameters, following the intuition of “big when near”. At the block scale, so few robust keypoints can be found that the object depth cannot be calculated; hence, the object motion is treated as movement on the pixel plane, and only the 2D kinematic parameters can be acquired. Loosely speaking, the three-scale division reflects the effect of observing the same target at different distances through the same optical system.
The following subsections discuss each step of our method in more detail. Table 1 lists the notations used in this paper.
3.1. ROI Extraction and Tracking
The ROI extraction is the first step. Owing to its high robustness and adaptivity in handling different backgrounds, the Visual Background Extractor (ViBe+) [35] is employed for retrieving ROIs. Its main steps are briefly as follows:
- 1.
Background model initialization.
For each pixel, a sample set is created, which consists of the pixel and its 20 neighboring pixels. In the first frame, each pixel's sample set is randomly selected from the adjacent 24 pixels.
- 2.
Background/foreground detection.
In the ViBe+ method, each pixel is compared with its sample set to determine whether it belongs to the background: the pixel is classified as background if the cardinality of the intersection between a sphere of radius 30 centered on its value and the sample set is above the threshold of 2.
- 3.
Background model updating.
The updating process follows the conservative policy, the lifespan policy and spatial consistency. Each background pixel updates its model with a probability of 1/rate, where rate denotes the updating factor. As the background varies rapidly in our scenes, the rate here is set to 8 (a background pixel has one chance in eight, i.e., a probability of 0.125, of being selected to update its model).
During implementation, the following modifications suggested in [35] are also adopted: a different distance function and threshold criterion, a separation between the updating and segmentation masks, and the detection of blinking pixels. The result of a typical motion ROI is illustrated in Figure 2.
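To make the per-pixel decision and the conservative update concrete, the following is a minimal Python/NumPy sketch of a ViBe-style background model using the values quoted above (20 samples per pixel, a sphere of radius 30, a cardinality threshold of 2 and an updating factor of 8). It omits the ViBe+ refinements listed above, and the function and variable names are illustrative rather than taken from [35].

```python
import numpy as np

# Minimal sketch of the per-pixel ViBe-style test described above (not the full ViBe+).
# Values follow the text: 20 samples per pixel, radius 30, cardinality threshold 2,
# updating factor (rate) 8. Names and structure are illustrative assumptions.
N_SAMPLES, RADIUS, MIN_MATCHES, RATE = 20, 30, 2, 8
rng = np.random.default_rng(0)

def init_model(first_frame):
    """Fill each pixel's sample set with randomly jittered copies of the first frame."""
    h, w = first_frame.shape
    model = np.empty((N_SAMPLES, h, w), dtype=np.float32)
    for k in range(N_SAMPLES):
        dy, dx = rng.integers(-2, 3, size=2)          # neighbor within a 5x5 window
        model[k] = np.roll(first_frame, (int(dy), int(dx)), axis=(0, 1))
    return model

def segment_and_update(frame, model):
    """Return a foreground mask; conservatively update background pixels with prob. 1/RATE."""
    dist = np.abs(model - frame[None].astype(np.float32))   # distance to each stored sample
    matches = (dist < RADIUS).sum(axis=0)                   # samples inside the sphere
    foreground = matches < MIN_MATCHES                      # too few matches -> foreground
    # Conservative update: each background pixel has a 1/RATE chance to refresh one sample.
    update = (~foreground) & (rng.random(frame.shape) < 1.0 / RATE)
    slot = int(rng.integers(0, N_SAMPLES))
    model[slot][update] = frame[update]
    return foreground.astype(np.uint8)
```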
As Figure 2 shows, Mov_i indicates the i-th extracted motion ROI, and Rec(t, Mov_i) represents the bounding box containing Mov_i at time t. For each newly extracted motion ROI, the Spatially Regularized Discriminative Correlation Filter (SRDCF) [36] is adopted for inter-frame tracking, owing to its ability to address the tracking task efficiently and robustly.
In addition, the method aligns the ground plane direction using the approach of [37] to calibrate the direction of the subsequently extracted kinematic parameters at the start of acquisition.
3.2. The Keypoints Criterion
The keypoints criterion physically divides the whole detection range into three scale spaces, the detail scale, the edge scale and the block scale, based on the image quality or definition.
Examples of each scale space are shown in Figure 3. When the object is near, its details are clear, and plenty of robust feature points can be utilized. As it moves further away, the details of the object vanish first, and only the edges and large parts remain visible. When the distance becomes large enough, all robust feature points aggregate into a blurred pixel block. This physical transition thus yields the three corresponding scales: the detail, edge and block scales. At the detail scale, the object possesses four or more robust feature points for interframe alignment. At the edge scale, the keypoints lose robustness and degrade to edge or corner points, which partially retain the structural completeness. At the block scale, as Figure 3c shows, the object becomes a moving or still pixel block on the background.
As stated above, measuring the keypoints quality is the first and most important step in our method. The quality of keypoints, denoted by Q_key, ranges from 0 to 1, with a lower value indicating higher keypoints quality. The measure reflects the number and quality of keypoints: N_rob denotes the number of robust feature points obtained from the invariant feature descriptor (SURF), N_edge is the number of corner and edge points, and α is a relation coefficient describing how many robust feature points are gained as the corner and edge points increase for an object. The value of α depends on the type of object and the detection conditions; for the objects relevant to our detection scenarios (drones, cars, pedestrians and birds), α is set to 25. The keypoints criterion then assigns each ROI to a scale space by comparing Q_key with two threshold values, Q_edge and Q_block, for the edge and block scales. Based on our detection task, the values are set to Q_edge = 1/500 and Q_block = 1/50.
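One plausible reading of this criterion, given that a lower Q_key means a higher keypoints quality, is the simple three-way threshold test sketched below. The exact formula for Q_key is not reproduced above, so the sketch takes its value as given, and the routine name is illustrative.

```python
Q_EDGE, Q_BLOCK = 1 / 500, 1 / 50   # thresholds quoted in the text

def select_scale(q_key: float) -> str:
    """Route a ROI to a detection branch; a lower q_key means higher keypoints quality.
    This mirrors one plausible reading of the keypoints criterion, not the paper's equation."""
    if q_key <= Q_EDGE:
        return "detail"   # enough robust keypoints for full 3D recovery
    if q_key <= Q_BLOCK:
        return "edge"     # only edges/corners survive -> pseudo 3D branch
    return "block"        # a blurred pixel block -> 2D branch
```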
3.3. Extraction of 3D Kinematic Parameters
This section describes the extraction process of 3D kinematic parameters at the detail scale.
3.3.1. Depth Estimation
The depth value is the key to recovering the 3D structure of the ROI. Current methods for obtaining depth maps can be broadly classified into three kinds: laser measurement, stereo vision and image-based estimation. Owing to its low cost and high accuracy, depth estimation from images is popular and is thus adopted in the present research. Specifically, we choose the ViP-DeepLab [38] method for its state-of-the-art performance among current monocular depth estimation methods.
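For readers who want to reproduce this stage without the ViP-DeepLab pipeline, the sketch below substitutes the publicly available MiDaS monocular depth model loaded through torch.hub. This is not the estimator used in the paper, and MiDaS outputs relative inverse depth rather than metric depth, so it only illustrates where the depth map enters the pipeline; the input file name is hypothetical.

```python
import cv2
import torch

# Stand-in depth estimator (MiDaS via torch.hub); NOT the ViP-DeepLab model used in the paper.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("roi_frame.png"), cv2.COLOR_BGR2RGB)  # hypothetical ROI crop
with torch.no_grad():
    pred = midas(transform(img))                       # (1, H', W') relative inverse depth
    depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False).squeeze().numpy()
# 'depth' is only defined up to scale/shift here; a metric estimator such as ViP-DeepLab
# is required for the absolute depth values used in Section 3.3.2.
```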
3.3.2. Parameter Extraction
Here, we identify the 3D translation and rotation kinematic parameters, collected in a kinematic vector composed of the translation velocity, the acceleration, the angular velocity and the angular acceleration.
The reference frame is a right-handed 3D camera coordinate system, which also serves as the world coordinate system, shown in Figure 4a. Its origin is located at the optical center of the camera; the Z-axis points away along the optical axis, in correspondence with the depth direction; the X-axis is consistent with the u-axis of the pixel coordinate system in Figure 4b; and the Y-axis is consistent with the v-axis of the pixel plane coordinate system.
Consider the pixel coordinates (u_i, v_i) of the i-th feature point (SURF descriptor [39]) extracted in the area Rec(t, Mov) at the current time t, together with its depth value d_i, and let the aligned points in the neighboring frames t − 1 and t + 1, with their depth values, be defined analogously. Moreover, let f denote the focal length, F the frame rate and N the total number of feature points. Then, according to the camera geometry, the coordinates of the i-th feature point in 3D Euclidean space are recovered from its pixel coordinates and depth value.
According to the EPnP (efficient perspective-n-point) method [40], with more than four aligned point pairs in Rec(t, Mov), we can determine the translation vector and the rotation matrix R by solving the camera projection function, which involves the intrinsic parameters of the camera and the homogeneous barycentric coordinates of the feature points (refer to [40] for more details).
Then, with the translation vector and the rotation matrix R between neighboring frames, the translation velocity, expressed in terms of its components, is obtained by scaling the inter-frame translation by the frame rate F. By applying the central difference to the velocity, the acceleration is calculated.
For the rotation matrix R, the rotation angle θ between two neighboring frames can be extracted by the Rodrigues transformation, θ = arccos((tr(R) − 1)/2), where tr(R) refers to the trace of R. The anti-symmetric matrix of the rotation axis n (a unit vector) is given by (R − R^T)/(2 sin θ), where R^T refers to the transpose of R. Afterwards, the angular velocity is obtained by scaling the rotation angle about the axis n by the frame rate F.
Additionally, by applying the central difference to the angular velocity, we obtain the angular acceleration. Thus far, all the 3D kinematic parameters at the detail scale have been obtained.
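To make the chain of steps in this subsection concrete, the sketch below combines EPnP pose estimation (via OpenCV's solvePnP), Rodrigues' formula and central differencing to obtain the translation velocity, acceleration, angular velocity and angular acceleration from aligned keypoints and depths of three consecutive frames. It is a simplified reading of the procedure above, not the authors' code; the camera intrinsics, the frame rate and the data layout are assumed inputs.

```python
import numpy as np
import cv2

def backproject(uv, depth, K):
    """Recover 3D camera-frame points from pixel coordinates and depth (pinhole model)."""
    uv1 = np.hstack([uv, np.ones((len(uv), 1))])          # homogeneous pixels, (N, 3)
    return (np.linalg.inv(K) @ uv1.T).T * depth[:, None]  # scale unit-depth rays by depth

def relative_pose(pts3d_prev, uv_curr, K):
    """EPnP (needs >= 4 correspondences): pose of the current frame w.r.t. the previous one."""
    ok, rvec, tvec = cv2.solvePnP(
        pts3d_prev.astype(np.float64), uv_curr.astype(np.float64),
        K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.ravel()

def kinematics(frames, K, fps):
    """frames: three dicts for t-1, t, t+1 with aligned 'uv' (N, 2) and 'depth' (N,)."""
    poses = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        pts3d = backproject(prev["uv"], prev["depth"], K)
        poses.append(relative_pose(pts3d, curr["uv"], K))
    (R01, t01), (R12, t12) = poses

    v0, v1 = t01 * fps, t12 * fps                 # translation velocities between frames
    a = (v1 - v0) * fps                           # acceleration via differencing

    def ang_vel(R):
        theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1))   # Rodrigues angle
        if np.isclose(theta, 0):
            return np.zeros(3)
        axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
        axis /= (2 * np.sin(theta))               # unit rotation axis from the skew part
        return theta * fps * axis                 # angular velocity vector

    w0, w1 = ang_vel(R01), ang_vel(R12)
    alpha = (w1 - w0) * fps                       # angular acceleration
    return (v0 + v1) / 2, a, (w0 + w1) / 2, alpha
```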
3.4. Extraction of 2D and Pseudo 3D Kinematic Parameters
As mentioned above, when there is a sufficient number of pixels in the target ROI, all of the 3D kinematic parameters can be retrieved. However, there are also many cases where the depth information cannot be obtained, or where the robust aligned keypoint pairs are insufficient owing to a lack of pixels. In such cases, inferring the motion structure degenerates to a 2D problem.
This subsection extracts the kinematic parameters at the edge and block scales. The reference frame is consistent with the pixel coordinate system in Figure 4b.
When the 3D motion structure degenerates to in-plane motion, the feature point pairs transfer to unaligned corner points or pixel blocks. The 2D parameters at the block scale can thus be written as two derivative operators acting on the in-plane translation vector: the first and second time derivatives give the in-plane velocity and acceleration, respectively. As in the 3D case, we employ the central difference to calculate these parameters during implementation.
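A minimal sketch of this block-scale branch is shown below: the velocity and acceleration of the ROI centre on the pixel plane are obtained via central differences, matching the derivative-operator description above. The centroid track is assumed to come from the SRDCF tracker of Section 3.1, and the function name is illustrative.

```python
import numpy as np

def planar_kinematics(track_uv, fps):
    """track_uv: (T, 2) array of ROI centres (u, v) in pixels over T frames.
    Returns per-frame 2D velocity and acceleration via central differences."""
    track_uv = np.asarray(track_uv, dtype=np.float64)
    v = np.gradient(track_uv, axis=0) * fps      # central difference in the interior
    a = np.gradient(v, axis=0) * fps
    return v, a
```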
At the edge scale, the exact depth value cannot be calculated from Equation (3). However, following the intuition of “big when near”, information about the target distance is still accessible. To this end, the present work introduces a new parameter named the relative depth, used instead of the physical depth along the optical axis to describe an inferred value of the target distance. The relative depth is defined from the total number of pixels in Rec(t, Mov), together with a compatibility parameter that connects the kinematic parameters of the other spaces, and it also serves as a measure of scale. The relative depth models the object scale and is an approximation of the 3D depth. We refer to the newly reconstructed 3D structure as the pseudo 3D space, defined as the Cartesian product of the pixel plane and the reconstructed scale dimension, where a stretch factor is introduced for dimensional balance. The translation vector is then modeled in this pseudo 3D space, and, similar to Equations (4) and (5), we can calculate the translation parameters for the edge scale. The “pseudo” here refers to the approximation caused by the scale transition.
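Since the exact definitions of the relative depth and the stretch factor are not reproduced above, the sketch below only illustrates the “big when near” idea: the ROI pixel count is turned into a relative-depth channel that is appended to the 2D track before differencing. The square-root mapping and the stretch constant are assumptions for illustration, not the paper's formulas.

```python
import numpy as np

def pseudo3d_translation(track_uv, pixel_counts, fps, stretch=1.0):
    """Append a relative-depth channel, derived from the ROI pixel count, to the 2D track
    ("big when near": more pixels -> closer), then difference as in the 2D/3D cases.
    The count-to-depth mapping and the stretch constant are illustrative assumptions."""
    counts = np.asarray(pixel_counts, dtype=np.float64)
    rel_depth = stretch / np.sqrt(counts)            # assumed monotone proxy for distance
    traj = np.column_stack([np.asarray(track_uv, dtype=np.float64), rel_depth])
    v = np.gradient(traj, axis=0) * fps              # pseudo 3D translation velocity
    a = np.gradient(v, axis=0) * fps                 # pseudo 3D acceleration
    return v, a
```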
At the edge scale, the rotation parameters are obtained from an estimation based on edge alignment, since robust feature point alignment degenerates to edge alignment owing to the scale transition, as shown in Figure 3. According to [41,42], aligned edge pairs from neighboring frames (as Figure 5 shows) can be matched by the epipolar constraint and a region-growing strategy. A high-accuracy matchup induces reliable aligned edge point pairs, which results in a convergent optimization for Equation (3).
Finally, the rotation parameters at the block scale cannot be calculated, because almost all keypoints or edges contract into a blurred pixel block. Under such conditions, the rotational patterns vanish, and only the translation parameters are used for recognition.
3.5. The Compatibility for Three Scale Spaces
The preceding subsections have described the kinematic parameter extraction for the three scale spaces. This subsection demonstrates the compatibility and relationship between the parameters. The three extraction processes are uniformly illustrated in Figure 6, which shows three neighborhoods taken from the motion of a certain drone under the detail, edge and block scales, respectively. The kinematic parameters extracted in these neighborhoods are based on the local 3D, pseudo 3D and 2D coordinate systems; the figure also depicts the map from the input motion neighborhoods to the local spaces and the time-parameterized motion process.
From the perspective of geometry, we can find three pairs of homeomorphic maps and corresponding coordinate patches, namely charts, for the three scales, which gives rise to three coordinate patches: the local 3D Euclidean space, the local pseudo 3D space and the local 2D Euclidean space. When a drone belongs to the intersection of two or three coordinate patches, i.e., its motion lies on the boundary of the related scale ranges, the extraction results under the corresponding scales must meet compatibility requirements, and the kinematic parameters in the intersection neighborhoods should satisfy two conditions. Substituting the translation parameters into the map, the first condition (Equation (16)) can be written componentwise; following the same fashion, a corresponding relation can be drawn from the second condition. Consequently, for any motion belonging to an intersection, the translation values extracted under the overlapping scales coincide, and for points in the intersection the scale-related factor equals the focal length (under homogeneous coordinates). These two requirements ensure the compatibility of the kinematic parameters extracted from the three scales. They also indicate that, from the detail stage to the block stage, a structural degeneration process occurs, which causes the disappearance of feature points.
3.6. Drone Detection by GRU Network
Following the above process, the target has been abstracted as time-series sets of kinematic parameters. GRU networks [43] are then employed to solve the resulting drone recognition problem.
More specifically, as shown in Figure 1, three GRU networks are employed, one for each scale. Each network is trained separately and is composed of an update gate and a reset gate. The three input vectors share the same form but differ in their components. The output is encoded as a one-hot vector, and the object classes include “drones”, “birds”, “pedestrians”, “cars” and “others”, which are the moving objects commonly present in UAV scenes. The loss function is the cross entropy between the ground truth and the prediction.
The GRU unit we adopt has the same structure as the one in [43], containing the reset and update gates. The current activation is obtained by combining the previous activation with the candidate activation according to the outputs of the reset and update gates.
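A compact PyTorch sketch of one per-scale classifier consistent with this description (a GRU unit as in [43], five output classes, cross-entropy loss) is given below. The input dimension follows the 4 x space-dimension convention and the hidden size of 64 and batch size of 128 quoted in Section 4.4.1; the layer names, the use of the final hidden state and the optimizer choice are assumptions.

```python
import torch
import torch.nn as nn

CLASSES = ["drones", "birds", "pedestrians", "cars", "others"]

class MotionGRUClassifier(nn.Module):
    """One per-scale branch: a GRU over time-series kinematic parameters plus a linear head."""
    def __init__(self, space_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_size=4 * space_dim,   # v, a, angular v, angular a per axis
                          hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(CLASSES))

    def forward(self, x):                              # x: (batch, time, 4 * space_dim)
        _, h_n = self.gru(x)                           # h_n: (1, batch, hidden)
        return self.head(h_n.squeeze(0))               # class logits

# Training skeleton (hyperparameters follow Section 4.4.1; the optimizer is an assumption).
model = MotionGRUClassifier(space_dim=3)               # detail-scale branch (3D parameters)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(128, 60, 12)                           # a batch of 60-step parameter sequences
y = torch.randint(len(CLASSES), (128,))                # ground-truth class indices
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```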
3.7. The Hybrid MUSAK
Besides motion features, appearance-based DNN methods are without doubt the most commonly used in object detection tasks when the target is clearly visible. Through decision fusion, the proposed MUSAK method can cooperate with appearance-based methods, such as Faster R-CNN, to enhance its performance. A simple but effective way to implement this idea is to output the final detection probability by averaging the confidence scores from MUSAK and Faster R-CNN.
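The average-fusion rule described above reduces to a few lines. The sketch below assumes the two detectors return confidence scores for already-matched boxes; the acceptance threshold of 0.5 reuses the score threshold mentioned in Section 4.4.1 and is otherwise an assumption.

```python
def fuse_scores(musak_score: float, frcnn_score: float, threshold: float = 0.5):
    """Average fusion for the hybrid MUSAK: keep a detection when the mean of the two
    confidence scores exceeds the threshold (0.5 here; the acceptance rule is assumed)."""
    fused = 0.5 * (musak_score + frcnn_score)
    return fused, fused > threshold
```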
Following this average-fusion scheme, this paper employs the appearance-based Faster R-CNN [23] to form the hybrid MUSAK method. We choose the VGG16 architecture as the backbone and fine-tune the network on our MUD dataset. To improve recall, we set 15 types of anchors with five scales (2, 4, 8, 16, 32) and three ratios (0.5, 1, 2). More performance analysis of the hybrid MUSAK is presented in Section 4.
4. Experiments
In this section, the performance of the MUSAK method is evaluated and compared with other existing methods, and further analysis of MUSAK is conducted.
4.1. Setup and Datasets
The existing datasets for drone detection pay more attention to the diversity of appearance and illumination and are, to some degree, too clean: other important issues, such as multiple drone-like distractors, scale transitions and motion complexity, have not been considered.
To address this issue, we first construct a new dataset, named the Multiscale UAV Dataset (MUD). It comprises the popular Drone-vs-Bird dataset [4,5], MAV-VID [22], Anti-UAV [44], part of the UAV-Aircraft Dataset (UAD) [20], and several self-made video clips with complex backgrounds or multiple drone-like objects as interference. MUD is designed to enable better training and evaluation of drone detection methods working in real environments.
Specifically, MUD adds drone flight videos of indoor, urban and wild scenes. The newly added data not only contain basic annotations, such as target categories and bounding boxes, but also include depth information, flying height, camera angle and motion parameters for research on motion-based methods. (The dataset will be available soon at [45].) Example pictures from MUD are shown in Figure 7. The main acquisition equipment includes cameras (FE 24–240 mm), a GPS receiver and a laser range finder (0.2–50 m).
In addition, we introduce simulation data from the AirSim software for training MUSAK; such data have also been used for training appearance-based methods [25].
4.2. Extraction Results of Motion ROIs
We compare different motion extraction methods on 10 randomly selected groups of video clips with a ground truth of 250 ROIs. Using the number of bounding boxes (BB), recall and precision as indicators, the results are shown in Table 2.
ViBe+ achieves the best performance in both the recall and precision metrics. The implementation parameters are as follows: five pixels for the minimum size of a foreground hole, 10 for the sample size per pixel and eight for the updating factor, while the other parameters keep the default values from the previous work [35]. Exemplar results of motion ROI extraction via ViBe+ are illustrated in Figure 8, which visualizes the extracted ROIs in an urban scene and an indoor scene. The red windows in the top rows are the ROI bounding boxes, and the masked areas in the bottom rows are the ROIs.
Other motion extraction methods can also be employed as alternatives, depending on the target attributes.
4.3. The Extracted Kinematic Parameters
As described in Section 3.3, the kinematic parameters in 3D space can be extracted. Compared with the ground-truth values given by the UAV's real-time kinematic (RTK) differential system and a motion capture system (MCS), the extraction biases are shown in Figure 9. The onboard GPS ensures a certain vertical and horizontal positioning accuracy, and when the onboard visual module also works, both accuracies improve further.
Figure 9 shows the estimation biases for the kinematic parameters in terms of their components. The average biases of the four kinematic parameters are 4.5%, 7.7%, 9.1% and 15.4%. Overall, the estimation of the velocity has the highest accuracy, while that of the angular acceleration has the lowest. The translation biases are lower than the rotation biases, and the biases of the first-order parameters are lower than those of the second-order parameters.
The estimation of the rotation parameters is more sensitive to feature alignment: a minor feature point shift on the pixel plane may cause large biases in the extracted rotation parameters, so they are relatively inaccurate compared with the translation parameters. In addition, the biases of the second-order parameters are larger than those of the first-order ones, since they are derived from the first-order parameters and the estimation errors accumulate.
4.4. Drone Detection Results with MUSAK Method
In this subsection, we evaluate our MUSAK method and compare the proposed methods with other existing methods.
4.4.1. Detection Results by GRU
During initialization, we set the input dimension to 4 × D, where 4 refers to the four kinematic parameters (velocity, acceleration, angular velocity and angular acceleration) and D refers to the dimension of the corresponding space. The batch size was 128, and the hidden layer size was 64. In the training process, we re-estimate the accuracy indicator for each epoch and fine-tune the hyperparameters, including the time step and bias, ensuring a low loss value for each training batch. To eliminate the cumulative error of the system, the whole process is re-initialized every 60 s.
Figure 10 shows examples of the detection results at the three scales and in a scene with tiny objects. The outputs include bounding boxes and scores. Figure 10a–c shows the detection results in three different scenes (indoor, open air and urban), corresponding to the detail, block and edge scales, respectively. Our method handles multiclass classification over the five commonly present classes: drones, birds, pedestrians, cars and others. Different classes of objects are marked in different colors together with their class scores. The first row of images is feature-rich, while the last two rows have relatively poor appearance features; even so, MUSAK can still recognize moving drones during flight. Figure 10d presents the detection results for tiny objects (area < 100 px) with three kites as distractors, a case that is notoriously difficult for previous methods. When a typical motion pattern appears, the confidence score of the corresponding class increases significantly, while the scores of the other classes gradually decrease. A box is considered to contain an object if its score exceeds 0.5.
Table 3 presents the confusion matrix of the five classes (drones, birds, pedestrians, cars and others). The numbers in the table represent the ratio of true positive samples to the total number of samples. The detection accuracy for drones, pedestrians and cars is high, while the accuracy for birds is the lowest, with a high confusion rate; compared with birds, drones have a higher detection accuracy. The accuracy for pedestrians and cars is the highest, mainly because of their low motion complexity and high motion stability. In general, all the true positive rates (the diagonal elements) reach more than 58%.
4.4.2. Comparison with the Existing Methods
To compare the performance of different methods, following [46], we take the AP values (AP, AP_S, AP_M and AP_L) and the PR (precision-recall) curve as metrics. Two purpose-built metrics, the 95% point and the tail gradient, are introduced to describe the tendency of the PR curves. The 95% point, or knee point, refers to the location where the PR curve falls to 95% of the highest precision value of a given method, which indicates the precision and robustness of the method. The tail gradient is the absolute slope of the line linking the knee point and the point with a precision of 0.1 on the curve, which also reflects the robustness of the method.
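Both curve metrics can be computed directly from sampled (recall, precision) pairs; the sketch below follows the definitions above, with the interpolation and tie-breaking choices being assumptions.

```python
import numpy as np

def knee_and_tail_gradient(recall, precision):
    """95% point: first point (left to right) where precision drops to 95% of the curve's
    maximum. Tail gradient: absolute slope of the line joining the knee point and the point
    where precision reaches 0.1. Tie-breaking and the lack of interpolation are assumptions."""
    r = np.asarray(recall, dtype=float)
    p = np.asarray(precision, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]

    knee_idx = np.argmax(p <= 0.95 * p.max())          # first crossing below 95% of the max
    knee = (r[knee_idx], p[knee_idx])

    low_idx = np.argmax(p <= 0.1)                      # first point at/below precision 0.1
    if low_idx == knee_idx:                            # curve never reaches precision 0.1
        return knee, None
    tail_grad = abs((p[low_idx] - knee[1]) / (r[low_idx] - knee[0] + 1e-12))
    return knee, tail_grad
```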
We first present the PR curves of our methods and the major existing CV methods for drone detection in Figure 11. The PR curves are computed at IoU = 50% (Intersection over Union). In addition to the motion-based MUSAK and the hybrid MUSAK described in Section 3.7, seven other methods are compared: FlowNet [13], based on deep optical flow features; Srigrarom et al. [18], using trajectory features; Faster R-CNN for drones [23]; Craye & Ardjoune [26], using a combined framework with U-Net and ResNetV2; Rodriguez-Ramos et al. [22], based on an inattentional ConvLSTM; Kim et al. [47], based on YOLOv3 with an attention mechanism; and Rozantsev et al. [20], exploiting an improved Faster R-CNN with camera compensation. Generally, as the recall increases, the PR curve of each method starts with a steady initial stage in which the precision stays at a high level; after passing its knee point, it falls quickly and ends at a low precision. A further comparison is listed in Table 4.
Each curve in Figure 11 can be divided into a head part and a tail part, separated by the 95% point (knee point). The head part lies on the left of the curve and comprises the points with a precision above 95% of the highest precision of that curve. The appearance-based methods ([23,26,47]) and hybrid methods ([20,22] and hybrid MUSAK) have a higher precision in the head part than the motion-based methods ([13,18] and MUSAK). This is because the targets in the head part are usually feature-rich and easily detected by appearance. Since the Faster R-CNN-based methods, such as Rozantsev [20] and Faster R-CNN [23], perform well on head-part objects, our hybrid MUSAK, which combines MUSAK and Faster R-CNN, achieves the highest head precision of 0.92, a 2.2% increase over the previous SOTA method. Among the motion-based methods, MUSAK (0.73) surpasses FlowNet [13] (0.69) and Srigrarom [18] (0.71) owing to its refined kinematics-based framework.
As for the tail part, we calculate the tail gradient (TAIL GRAD for short) to evaluate the degradation speed of each method; a large tail gradient indicates severe performance degradation. The results are listed in Table 4. The appearance-based methods have larger tail gradients than the motion-based methods, which shows that motion features are more robust. The tail gradient of MUSAK (5.54) is lower than those of the appearance-based methods but higher than that of Srigrarom [18] (2.78). This is because Srigrarom [18] focuses on trajectory features, which are long-term motion characteristics, whereas MUSAK extracts both short-term and long-term motion features, and the short-term features in 3D space depend on precise keypoint alignment, which is often hard to achieve. On the other hand, the peak precision of Srigrarom [18] is lower.
The location of the 95% point is also an overall indicator of the precision and robustness of a method: a 95% point located towards the top right reflects relatively high precision and strong robustness. The 95% points of previous motion-based methods lie at recall < 0.40, and those of previous appearance-based methods range from recall = 0.45 to recall = 0.63; all of them lie to the left of our MUSAK and hybrid MUSAK. MUSAK moves the 95% point of motion-based methods from recall = 0.38 to recall = 0.78, a more than two-fold improvement, and hybrid MUSAK improves the value by 15.4%, from recall = 0.65 to recall = 0.75. These enhancements result from the well-directed motion-based detection scheme.
In addition to the metrics mentioned above, Table 4 also reports AP_S, AP_M and AP_L, which refer to the AP for small, medium and large objects, respectively; AP_L is expected to be larger than AP, AP_S smaller than AP, and larger AP values represent higher accuracy. In general, the motion-based methods have higher AP_S values than the appearance-based methods, while for AP_M and AP_L, the appearance-based methods score higher. Craye & Ardjoune [26] achieve the highest AP among the appearance-based methods owing to their refined preprocessing before classification. The proposed MUSAK attains the highest AP (0.656) among the motion-based methods, improving on the previous methods by more than 84.3% thanks to its well-directed modeling of the motion process. The hybrid MUSAK reports state-of-the-art performance with an AP of 0.785, at least a 14.1% increase over the previous SOTA method (Rodriguez-Ramos [22]). Compared with the other hybrid methods, the improvement comes mainly from better detection of small and medium objects.
5. Further Analysis for MUSAK
To further reveal the adaptivity and significance of our methods, the keypoints quality, temporal-spatial resolution and significance of the kinematic parameters are taken into consideration in this section.
5.1. The Impact of Keypoints Quality
The keypoints quality, described by the Q_key introduced in Section 3.2, reflects the definition of objects: a low value indicates a high keypoints quality as well as a high object definition. The impact of keypoints quality on the detection methods is presented as the relationship between AP and Q_key plotted in Figure 12.
In Figure 12, four curves of AP with respect to Q_key, for the appearance-based method (Faster R-CNN) and for MUSAK (the three detection branches and the synthesized result), illustrate how the object definition affects the performance of the methods.
In general, all the curves fall as the keypoints degenerate. The 3D motion, pseudo 3D motion and appearance curves follow an S-shaped decline, with a slight decrease in the head part followed by a steep drop in the tail. The 2D motion curve falls in a wave-like manner without any dramatic drops, which indicates that the 2D motion detection branch is more robust to keypoints quality. More specifically, the appearance curve has the highest AP value of 0.95 in the head part but the maximum downward gradient in the tail part, and it ends at the lowest value (0.11). The 3D/pseudo 3D branch curves possess relatively high AP values, approximately 20% below the head value of the appearance curve, but they fall more slowly and end approximately 90% higher. The 2D motion curve starts at a lower AP value of 0.53, stays at approximately 0.44 in the middle stage and ends at 0.315, roughly three times the final value of the appearance method, because the 2D motion detection branch extracts appearance-invariant motion features. In summary, all methods are affected by keypoints quality to some extent; the appearance method is more sensitive to it, while MUSAK is more robust, which results in better performance in low-definition scenarios.
5.2. The Impact of Temporal-Spatial Resolution
To further analyze the performance of MUSAK, we consider its temporal-spatial characteristics and examine each detection branch. The impact of the temporal-spatial resolution on the different detection methods and branches is shown in Figure 13, in which the temporal resolution is represented by the frame rate, while the spatial resolution is measured by the positioning accuracy (absolute relative error). Videos with different frame rates and positioning accuracies are collected for calculating the AP values.
Figure 13a presents the relationship between AP and frame rate (with motion blur eliminated). In general, the curves of the motion-based methods consist of a rapidly rising head followed by a stable ending stage with a slight decline in the tail, owing to the forgetting behavior of the GRU gates when handling long sequences, while the appearance curve is relatively flat. The 2D motion curve has the largest upward gradient and is the first to reach its peak, 0.47 at 15 fps. The pseudo 3D and 3D motion branches reach peak AP values of 0.59 and 0.67 at 20 and 23 fps, respectively. It can be concluded that the optimal frame rate is close to 25 fps.
Figure 13b presents the relationship between AP and positioning accuracy. The curve of the 3D branch starts to decline significantly beyond the point of error = 25%, while for the pseudo 3D branch, this critical point is approximately error = 45%. The 3D detection branch is the most sensitive to spatial resolution, while the 2D detection branch is the most robust; the 2D and pseudo 3D detection branches compensate for the loss caused by positioning errors in MUSAK.
Based on the above results and analyses, it can be concluded that MUSAK handles moving objects under different backgrounds and enhances the performance on dim drones. MUSAK requires object tracking and a coordinate transformation from the camera frame to the world frame to retrieve the relative motion of objects. For moving cameras, which enlarge the field of view, motion compensation should be introduced.
5.3. Significance of the Kinematic Parameters
In this part, we conduct an ablation experiment to analyze the significance of the kinematic parameters. Instead of using all the parameters mentioned in Section 3.3, different combinations of kinematic parameters, from single parameters to multi-parameter combinations, are adopted by MUSAK for detection. The results are presented in Figure 14.
Figure 14a shows the drone detection results with a single kinematic parameter; the X, Y and Z axes are defined as in Section 3.3.2. It can be seen from Figure 14a that the AP values of the second-order parameters (acceleration and angular acceleration) are higher than those of the first-order parameters (velocity and angular velocity). It is also evident that, among the X, Y and Z components, the Y components of the translation parameters (velocity and acceleration) yield the highest AP, while the Z components are highest for the rotation-related parameters (angular velocity and angular acceleration). Detection using the Y component of the acceleration achieves the top AP, followed by the Z component of the angular acceleration. Referring to the coordinate definition in Section 3.3.2, two important conclusions can be drawn: first, the translation parameters along the gravity direction and the rotation parameters about the optical axis are of great significance; second, the second-order kinematic parameters carry more of the motion characteristics of drones.
Figure 14b compares different combinations of kinematic parameters. We consider four two-parameter combination bases (first-order, second-order, translation and rotation parameters) and then integrate them with other parameters for detection. The AP of the second-order basis (0.48) is significantly higher than that of the first-order basis (0.31). Among the three-parameter combinations, the combination of angular velocity, acceleration and angular acceleration has the highest AP of 0.60. These results further confirm the significance of the second-order parameters.