*Article* **Long Distance Ground Target Tracking with Aerial Image-to-Position Conversion and Improved Track Association**

**Seokwon Yeom**

School of AI, Daegu University, Gyeongsan 38453, Korea; yeom@daegu.ac.kr

**Abstract:** A small drone is capable of capturing distant objects at a low cost. In this paper, long-distance (up to 1 km) ground target tracking with a small drone is addressed for oblique aerial images, and two novel approaches are developed. First, the coordinates of the image are converted to real-world coordinates based on the angular field of view, tilt angle, and altitude of the camera. Through this image-to-position conversion, the threshold on the actual object size and the center position of each detected object in real-world coordinates are obtained. Second, the track-to-track association is improved by adopting the nearest neighbor association rule to select the fittest track among multiple tracks in a dense track environment. Moving object detection consists of frame-to-frame subtraction and thresholding, a morphological operation, and false alarm removal based on object size and shape properties. Tracks are initialized by differencing between the two nearest points in consecutive frames. The measurement statistically nearest to the state prediction updates the target's state. With the improved track-to-track association, the fittest track is selected in the track validation region, and the directions of the displacement vector and the velocity vectors of the two tracks are tested with an angular threshold. In the experiment, a drone hovered at an altitude of 400 m capturing video for about 10 s. The camera was tilted 30° downward from the horizontal. Total track life (TTL) and mean track life (MTL) were obtained for 86 targets within approximately 1 km of the drone. The interacting multiple model (IMM)-CV and IMM-CA schemes were adopted with varying angular thresholds. The average TTL and MTL were obtained as 84.9–91.0% and 65.6–78.2%, respectively. The number of missing targets was 3–5; the average TTL and MTL were 89.2–94.3% and 69.7–81.0%, respectively, excluding the missing targets.

**Keywords:** small drone; long-distance surveillance; ground target tracking; track-to-track association; image-to-position conversion

#### **1. Introduction**

Small unmanned aerial vehicles (UAVs), or drones, are useful for security and surveillance [1,2]. One important task is to track moving vehicles with aerial video. A small drone captures video from a distance at a low cost [3]. No highly trained personnel are required to generate the video.

Ground targets can be tracked by a small drone with visual, nonvisual, or combined methods. Various deep learning methods with camera motion models were studied in [4]. Tracking performance can be degraded by small objects, large numbers of targets, and camera motion [5]. Deep learning-based object detection was combined with multi-object tracking and 3D localization in [6]. In [7], YOLO and a Kalman filter were used to detect and track high-resolution objects. Deep learning-based object detectors and trackers may require heavy computation and massive training data [8,9].

The background subtraction and the adaptive mean-shift and optical flow tracking were developed for video sequences captured by a drone in [10]. A mean-shift tracker based on particle filtering was utilized to track a small, fast-moving object in [11]. A SIFT feature-based tracker was developed for fast processing in [12]. Kernelized correlation filter-based target tracking was studied in [13]. In [14], object tracking was performed by handing over the camera from one drone to another. Ground targets were tracked with road geometry recovery in [15]. Usually, trackers based on video sequences either transmit high-resolution video streams to the ground or impose a high computational burden on the drone. Aerial video was processed with high-precision GPS data from vehicles in [16]. Bayesian fusion of vision and radio frequency sensors was studied for ground target tracking in [17]. In [18], computer vision-based airborne target tracking with GPS signals was studied. Object tracking from drones with nonvisible cameras can also be found in the literature. A boat was captured and tracked with the Kalman filter by a fixed-wing drone in [19]. A small vessel was tracked by adopting a colored-noise measurement model in [20]. In the nonvisual approach, high-cost sensors add more payload to the drone, or infrastructure is required on the ground or in a vehicle.

**Citation:** Yeom, S. Long Distance Ground Target Tracking with Aerial Image-to-Position Conversion and Improved Track Association. *Drones* **2022**, *6*, 55. https://doi.org/10.3390/drones6030055

Academic Editors: Diego González-Aguilera and Pablo Rodríguez-Gonzálvez

Received: 7 February 2022; Accepted: 22 February 2022; Published: 23 February 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this paper, moving vehicle tracking with a small drone at long distances (up to 1 km) is addressed. In previous work [21–24], the drone's camera was pointed directly at the ground, or the altitude of the drone was very low. When the camera is tilted, the field of view (FOV) can be extended, but constant scaling from pixel coordinates to actual positions can no longer be applied. Therefore, an image-to-position conversion is developed to change the integer coordinates of a pixel to its actual position, assuming that the angular field of view (AFOV), tilt angle, and camera (drone) altitude are known.

Moving object detection consists of frame subtraction and thresholding, morphological operation, and false alarm removal; falsely detected objects are removed using the object's actual size and two shape properties: squareness and rectangularity [23,24]. The minimum size of the extracted object is set constant, but the converted value in pixels changes depending on the distance from the drone.

Target tracking consists of three stages: track initialization, track maintenance, and track termination. Tracks are initialized with the difference between the two nearest measurements in consecutive frames following speed gating. Tracks are maintained by state estimation, measurement-to-track association (abbreviated as measurement association), and track-to-track association (abbreviated as track association). The nearest neighbor (NN) measurement association updates the state of a target with the measurement that is statistically closest to the prediction. The interacting multiple model (IMM) filter with constant velocity (CV) or constant acceleration (CA) motion models is adopted to handle the various maneuvers of the target [25,26]. Track association fuses multiple tracks into a single track [27]. In this paper, the track association scheme is improved to fuse multiple tracks by sequentially searching for the nearest track in a dense track environment. Figure 1 shows a schematic block diagram of object detection and target tracking by a small drone at a long distance. The pipeline starts with the image-to-position conversion and ends with the reverse process, position-to-image conversion.

**Figure 1.** Block diagram of moving object detection and multiple target tracking.

In the experiment, the drone hovered at a fixed position at a height of 400 m. The tilt angle was 60°, and video was captured for about 10 s. Figure 2 shows three sample scenes extracted from a sample frame. Each extracted area is 100 × 60 pixels and shows 2–3 vehicles at different resolutions. Overall, the frame has low resolution, and the targets are sometimes occluded by trees, structures, and other cars. Road lanes, traffic signs, and shadows can be included in the backgrounds. A total of 86 targets within approximately 1 km of the drone were investigated with the total track life (TTL) and mean track life (MTL) metrics. The average TTL and MTL were obtained as 84.9–91.0% and 65.6–78.2%, respectively, for various angular thresholds of the directional track association. The number of missing targets was 3–5; the average TTL and MTL were 89.2–94.3% and 69.7–81.0%, respectively, if the missing targets are excluded.

**Figure 2.** Three sample scenes showing targets: (**a**) sample frame; (**b**) sample scene 1; (**c**) sample scene 2; (**d**) sample scene 3.

The rest of the paper is organized as follows: the image-to-position conversion is described in Section 2. Section 3 explains multiple target tracking. Section 4 details the improved track association. Section 5 presents the experimental results. Discussion and conclusions follow in Sections 6 and 7, respectively.

#### **2. Image-to-Position Conversion**

Imaging is the projection of the three-dimensional real world onto a two-dimensional plane. Thus, when the camera is not pointed directly at the ground, the discrepancy between relative pixel positions in the image and positions in the real world becomes irregular. Since the coordinates of the image are not indexed in proportion to actual distance, constant scaling generates a larger discrepancy between image coordinates and real-world coordinates. Therefore, the integer coordinates in the $x$ and $y$ directions of the image are converted to real-world positions based on the AFOV, tilt angle, and altitude of the camera. As shown in Figure 3, the drone camera is positioned at $(0, 0, h)$ with a tilt angle $\theta_T$. The image size is $W \times H$ pixels. The actual position vector $\mathbf{x}_{ij}$ of the $(i, j)$ pixel is approximated as

$$\mathbf{x}_{ij} = \begin{pmatrix} x_{ij} \\ y_{ij} \end{pmatrix} \approx \begin{pmatrix} d_{H/2} \cdot \tan\left[\left(i - \frac{W}{2} + 1\right)\frac{a_x}{W}\right] \\ h \cdot \tan\left[\theta_T + \left(\frac{H}{2} - j\right)\frac{a_y}{H}\right] \end{pmatrix}, \quad i = 0, \ldots, W-1, \; j = 0, \ldots, H-1, \tag{1}$$

where $a_x$ and $a_y$ are the view angles of the camera in the $x$ and $y$ directions, respectively; $d_{H/2}$ is the distance from the camera to $(x_{W/2-1}, y_{H/2}, 0)$, given by $\sqrt{y_{H/2}^2 + h^2}$. It is noted that the land is assumed to be flat; thus, the position in the $z$ direction is zero. The actual pixel size is calculated as

$$\Delta(i, j) = \left| (x_i - x_{i+1}) \cdot (y_j - y_{j+1}) \right|. \tag{2}$$

**Figure 3.** Coordinate conversion from image to real-world: (**a**) x direction; (**b**) y direction.

In the experiments, the altitude $h$ is set to 400 m; $W$ and $H$ are 3840 and 2160 pixels; $a_x$ and $a_y$ are set to the AFOV of the camera, 70° and 40°, respectively; and $\theta_T$ is set to 60°. Thus, $y_{H/2}$ is calculated as 692.8 m by Equation (1), and $d_{H/2}$ is 800 m accordingly.

Figure 4 visualizes the coordinate conversion and the actual pixel size according to Equations (1) and (2), respectively. In Figure 4a, every 50th pixel is shown for better visualization. The maximum, median, and minimum pixel sizes in Figure 4b are 1.623 m², 0.1506 m², and 0.0561 m², respectively. This approximate conversion considers the $x$ and $y$ directions separately; thus, a simple reverse process from position to image is possible. It will be shown in the experiments that the detection result is significantly improved, although the coordinate conversion can be accompanied by various inevitable errors [28,29].
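As a concrete sketch, Equations (1) and (2) can be implemented as follows with the experimental parameters above; this is an illustrative reimplementation under the flat-ground assumption, and the function names are mine, not the paper's:

```python
import numpy as np

def image_to_position(i, j, W=3840, H=2160, ax=np.deg2rad(70),
                      ay=np.deg2rad(40), theta_t=np.deg2rad(60), h=400.0):
    """Approximate real-world (x, y) of pixel (i, j); Equation (1).
    The tilt angle theta_t is measured from the vertical (nadir)."""
    y_half = h * np.tan(theta_t)          # ground distance to the center row
    d_half = np.hypot(y_half, h)          # slant distance d_{H/2}
    x = d_half * np.tan((i - W / 2 + 1) * ax / W)
    y = h * np.tan(theta_t + (H / 2 - j) * ay / H)
    return x, y

def pixel_area(i, j, **kw):
    """Actual footprint of pixel (i, j) in square meters; Equation (2)."""
    x0, y0 = image_to_position(i, j, **kw)
    x1, _ = image_to_position(i + 1, j, **kw)
    _, y1 = image_to_position(i, j + 1, **kw)
    return abs((x0 - x1) * (y0 - y1))
```

With these parameters, the image center maps to $y_{H/2} = 400 \cdot \tan 60° \approx 692.8$ m, and the computed footprints reproduce the range reported in Figure 4b, from roughly 0.056 m² at the bottom center to about 1.62 m² at the far corners.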

**Figure 4.** Approximated conversion from image to real world, (**a**) visualization of the coordinate conversion; (**b**) actual pixel size.

#### **3. Multiple Target Tracking**

A block diagram of multiple target tracking is shown in Figure 5.

**Figure 5.** Block diagram of multiple target tracking.

Tracks are initialized with two-point differencing between the nearest neighbor measurements following maximum speed gating. For the measurement association, the speed gating process is performed first, followed by measurement gating based on a chi-square hypothesis test; then, the NN measurement selection follows. The NN rule is computationally efficient and was successfully applied to multiple ground target tracking by a drone [21–24]. The IMM filter is adopted to estimate the kinematic state of the target; it can efficiently handle various maneuvers of multiple targets. An IMM with a combined CV and CA scheme was contrived to track a single target in [24]. The IMM-CV scheme was applied to track 120 maneuvering aerial targets for an aerial early warning system in [30]. The motion models are analyzed in detail in [31].
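The gating and NN selection described above can be illustrated with a minimal sketch (not the paper's implementation): a chi-square gate on the squared Mahalanobis distance, then the nearest measurement inside the gate.

```python
import numpy as np

def nn_associate(z_pred, S, measurements, gate=9.21):
    """Return the index of the measurement statistically nearest to the
    predicted measurement z_pred, or None if none falls inside the
    chi-square gate. S is the innovation covariance; the default gate
    9.21 corresponds to the 99% chi-square point for 2 degrees of freedom."""
    S_inv = np.linalg.inv(S)
    best, best_d2 = None, np.inf
    for m, z in enumerate(measurements):
        v = np.asarray(z, float) - z_pred   # innovation
        d2 = v @ S_inv @ v                  # squared Mahalanobis distance
        if d2 <= gate and d2 < best_d2:
            best, best_d2 = m, d2
    return best
```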

In a multisensor environment, a track fusion method was developed assuming the target undergoes common process noise [27]. A practical approach for track association was developed for the Kalman filter in [22]. This approach was extended with the IMM filter, and the directional track association was proposed to consider the moving direction of the target in [24]; directional gating tests the maximum deviation between the directions of the tracks and the direction of the displacement vector between the tracks. In this paper, the NN track selection scheme is proposed and will be described in the next section.

There are three criteria for track termination in this paper. The first is being associated but not selected during track association. The others are exceeding the maximum number of frames without measurements and falling below the minimum target speed. The minimum-speed criterion is very effective when high clutter occurs on nonmoving false targets [23]. After track termination, the track's validity is tested with its track life length; if the track life length is shorter than the minimum track life length, the track is removed as a false track. More detailed target tracking processes are described in [22–24].

#### **4. Improved Track Association**

In this paper, the track association procedure is developed to select the fittest track in a dense track environment. For track *s*, the fittest track is selected as follows:

$$\hat{c}(s) = \underset{t = 1, \ldots, N_T(k),\, t \neq s}{\arg\min}\left[\hat{x}^s(k|k) - \hat{x}^t(k|k)\right]^T\left[T^{st}(k)\right]^{-1}\left[\hat{x}^s(k|k) - \hat{x}^t(k|k)\right], \quad s = 1, \ldots, N_T(k), \tag{3}$$

$$T^{st}(k) = P^s(k|k) + P^t(k|k) - P^{st}(k|k) - P^{ts}(k|k), \tag{4}$$

$$P^{st}(k|k) = \left[I - b^s(k)W^s(k)H\right]\left[F P^{st}(k-1|k-1) F^T + Q\right]\left[I - b^t(k)W^t(k)H\right]^T, \tag{5}$$

where $\hat{x}^s(k|k)$ and $\hat{x}^t(k|k)$ are the state vectors of tracks $s$ and $t$, respectively, at frame $k$; $N_T(k)$ is the number of tracks at frame $k$; $P^s(k|k)$ and $P^t(k|k)$ are the covariance matrices of tracks $s$ and $t$, respectively, at frame $k$; and $b^s(k)$ and $b^t(k)$ are binary numbers that become one when track $s$ or $t$, respectively, is associated with a measurement and are zero otherwise [27]. $F$, $H$, and $Q$ are the transition matrix, the measurement matrix, and the covariance of the process noise, respectively. It is noted that $T^{st}(k)$ is meaningless if its determinant is not positive. The cross-covariance in Equation (5) is a linear recursion whose initial condition is set at $P^{st}(0|0) = [0]_{N_x \times N_x}$, where $N_x$ is the dimension of the state vector (4 and 6 for the CV and CA motion models, respectively), and $W^t(k|k)$ is obtained as the combined filter gain of track $t$ as [24]:

$$W^t(k|k) = \sum_{j=1}^{M} W_j^t(k|k)\,\mu_j^t(k), \tag{6}$$

where $M$ is the number of modes of the IMM filter; $W_j^t(k|k)$ is the filter gain of the $j$-th mode-matched filter at frame $k$; and $\mu_j^t(k)$ is the mode probability of the $j$-th mode-matched filter at frame $k$. The following chi-square hypothesis test should be satisfied between tracks $s$ and $t$, since multiple tracks of the same target have error dependencies on each other [27]:

$$\left[\hat{x}^s(k|k) - \hat{x}^t(k|k)\right]^T\left[T^{st}(k)\right]^{-1}\left[\hat{x}^s(k|k) - \hat{x}^t(k|k)\right] \le \gamma_g, \tag{7}$$

where $\gamma_g$ is a gate threshold for the track validation region. The directional gating process tests the maximum deviation between the direction of the displacement vector and the directions of the track velocities as [24]:

$$\max\left(\cos^{-1}\frac{\left|\left\langle d^{st}(k|k),\, v^s(k|k)\right\rangle\right|}{\left\| d^{st}(k|k)\right\| \left\| v^s(k|k)\right\|},\; \cos^{-1}\frac{\left|\left\langle d^{st}(k|k),\, v^t(k|k)\right\rangle\right|}{\left\| d^{st}(k|k)\right\| \left\| v^t(k|k)\right\|}\right) \le \theta_g, \tag{8}$$

$$d^{st}(k|k) = \begin{bmatrix} \hat{x}^t(k|k) - \hat{x}^s(k|k) \\ \hat{y}^t(k|k) - \hat{y}^s(k|k) \end{bmatrix}, \quad v^s(k|k) = \begin{bmatrix} \hat{v}^{sx}(k|k) \\ \hat{v}^{sy}(k|k) \end{bmatrix}, \quad v^t(k|k) = \begin{bmatrix} \hat{v}^{tx}(k|k) \\ \hat{v}^{ty}(k|k) \end{bmatrix}, \tag{9}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product; $\theta_g$ is an angular threshold; $\hat{x}^t(k|k)$ and $\hat{y}^t(k|k)$ are the position components of the state $\hat{x}^t(k|k)$ in the $x$ and $y$ directions, respectively; and $\hat{v}^{tx}(k|k)$ and $\hat{v}^{ty}(k|k)$ are the corresponding velocity components.

After the fittest track is selected, the current state of track $s$ is replaced with the fused estimate and covariance if $\left|P^s(k|k)\right| \le \left|P^{\hat{c}(s)}(k|k)\right|$, as

$$\hat{x}^s(k|k) = \hat{x}^s(k|k) + \left[P^s(k|k) - P^{st}(k|k)\right]\left[P^s(k|k) + P^t(k|k) - P^{st}(k|k) - P^{ts}(k|k)\right]^{-1}\left[\hat{x}^t(k|k) - \hat{x}^s(k|k)\right], \tag{10}$$

$$P^s(k|k) = P^s(k|k) - \left[P^s(k|k) - P^{st}(k|k)\right]\left[P^s(k|k) + P^t(k|k) - P^{st}(k|k) - P^{ts}(k|k)\right]^{-1}\left[P^s(k|k) - P^{ts}(k|k)\right]. \tag{11}$$

The track selection process proposed in this paper is as follows: after track $s$ becomes a fused track, track $\hat{c}(s)$ becomes a potentially terminated track. That is, fusion only occurs if the determinant of the covariance matrix of track $s$ is less than that of the selected track; it is noted that a more accurate track has a smaller error (covariance). In the previous directional track association, track $\hat{c}(s)$ was instantly terminated, but in the procedure proposed in this paper, it is still eligible to be associated with other tracks that have not yet been fused. The detailed procedure of the track association is illustrated in Figure 6.

**Figure 6.** Illustration of track association at a frame: fused tracks in blue and potentially terminated tracks in yellow.

In Figure 6, there are initially three tracks, $s$, $t$, and $u$, at a certain frame. In Step 1, track $s$ searches for the fittest track for itself. Once track $t$ satisfies Equations (3), (7), and (8) and $\left|P^s(k|k)\right| \le \left|P^t(k|k)\right|$, tracks $s$ and $t$ are fused; track $s$ becomes the fused track, and track $t$ becomes the potentially terminated track. Otherwise, no change occurs, and we go to the next step. In Step 2, track $t$ searches for the fittest track, except for any already fused track, here track $s$. If track $t$ becomes the fused track after fusion with track $u$, then track $u$ becomes the potentially terminated track and is terminated at the final stage because no tracks remain to be considered for track $u$. Otherwise, in Step 3, track $u$ searches and can be fused with track $t$. Finally, all potentially terminated tracks are terminated. In the top row, tracks $s$ and $t$ and tracks $t$ and $u$ are fused, and track $u$ is terminated. In the bottom row, tracks $s$ and $t$ and tracks $u$ and $t$ are fused, and track $t$ is terminated. In the next frame, the remaining tracks $s$ and $t$ or tracks $s$ and $u$ can be fused if they originate from a single target. This track association procedure allows multiple fusions of one track, while fusion occurs at most once per track in a frame. It will be shown in the experiments that it can reduce the number of tracks significantly.
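The selection-and-fusion step above can be sketched as follows. For brevity, this sketch assumes the cross-covariances $P^{st}$ are zero (the paper propagates them via Equation (5)); the function names and the default gate values are illustrative, not from the paper.

```python
import numpy as np

def fittest_track(s, states, covs, gamma_g=9.49):
    """Equation (3) restricted to the gate of Equation (7), with P^{st} = 0.
    Returns the index of the statistically nearest track, or None."""
    best, best_d2 = None, np.inf
    for t in range(len(states)):
        if t == s:
            continue
        T = covs[s] + covs[t]                  # Equation (4) with P^{st} = 0
        d = states[s] - states[t]
        d2 = d @ np.linalg.inv(T) @ d
        if d2 <= gamma_g and d2 < best_d2:     # gate first, then nearest
            best, best_d2 = t, d2
    return best

def direction_gate(disp, vs, vt, theta_g):
    """Directional gating of Equation (8): both velocity directions must lie
    within theta_g (radians) of the displacement vector direction."""
    def angle(a, b):
        c = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(c, 0.0, 1.0))
    return max(angle(disp, vs), angle(disp, vt)) <= theta_g

def fuse(x_s, P_s, x_t, P_t):
    """Fused state and covariance, Equations (10) and (11) with P^{st} = 0."""
    K = P_s @ np.linalg.inv(P_s + P_t)
    return x_s + K @ (x_t - x_s), P_s - K @ P_s
```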

#### **5. Results**

In this section, the experimental results are detailed: the video and the moving object detection are described first, followed by multiple target tracking with the proposed strategies.

#### *5.1. Video Description and Moving Object Detection*

A video was captured by a Mavic Air 2 hovering at a fixed position at a frame rate of 30 fps. The frame size is 3840 × 2160 pixels. The scenes in the video include a highway interchange, a toll gate, road bridges, buildings, and trees. The drone was at an altitude of 400 m, and the tilt angle was set to 60°. Every second frame was processed for efficiency; thus, the effective frame rate was 15 fps. A total of 152 frames were considered, covering about 10 s. A total of 86 moving vehicles within a range of approximately 1 km of the drone appear in the entire video. The number of subtracted frames was 151, and the life length of a target present over the entire period was 150 frames due to the two-point differencing initialization. However, the life lengths of Targets 5, 8, 29–32, 54, 62, 83, and 84 are 92, 124, 42, 68, 104, 100, 146, 146, 136, and 100 frames, respectively, because they started late or stopped early. Some targets are occasionally occluded by a bridge, a toll gate, trees, and other vehicles. Some of them happen to be invisible because of shadows. The minimum target speed was set to 1 m/s; thus, very slow targets were not considered targets of interest. Figure 7a shows Targets 1–53, and Figure 7b shows Targets 54–86 in the first frame. Figure 7c shows the 1 km range with an outer circle and the view angles with blue and red lines and arcs according to distance. It is noted that Figure 7c was obtained manually to show the approximate coverage on a commercially available aerial map [32].

**Figure 7.** (**a**) Targets 1–53 at the first frame; (**b**) Targets 54–86 at the first frame; (**c**) 1 km range of the drone with approximated view angles.

For the object detection, the threshold after frame subtraction was set to 30. The structuring element for the morphological operation (closing) was set at $[1]_{2 \times 2}$. The minimum size of a basic rectangle for false alarm removal was set to 6 m², and the minimum squareness and rectangularity were set to 0.2 and 0.3, respectively. Figure 8a is the thresholded binary image after frame subtraction between Figure 7a and the next frame.

Figure 8b is the result of the morphological operation on Figure 8a. Figure 8c shows the basic rectangles of Figure 8b after removing false alarms. Figure 8d shows the centers of the basic rectangles, indicated by blue dots. The number of detections in Figure 8d is 127, including false alarms.
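The first detection steps (frame subtraction, thresholding, and closing with a 2 × 2 structuring element) can be sketched in NumPy as follows; this is an illustrative reimplementation, not the paper's code, and the size- and shape-based false alarm removal is omitted.

```python
import numpy as np

def dilate2x2(b):
    """Binary dilation with a 2 x 2 structuring element of ones."""
    out = b.copy()
    out[1:, :] |= b[:-1, :]
    out[:, 1:] |= b[:, :-1]
    out[1:, 1:] |= b[:-1, :-1]
    return out

def erode2x2(b):
    """Binary erosion with the same 2 x 2 structuring element."""
    out = b.copy()
    out[:-1, :] &= b[1:, :]
    out[:, :-1] &= b[:, 1:]
    out[:-1, :-1] &= b[1:, 1:]
    return out

def detect_mask(prev, curr, thresh=30):
    """Frame subtraction, thresholding, and morphological closing."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh
    return erode2x2(dilate2x2(diff))   # closing = dilation then erosion
```

Closing fills one-pixel gaps inside a moving blob so that a vehicle is detected as a single connected region rather than fragments.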

Figure 9a shows the detection results for all frames with the image-to-position conversion. Most blue dots in Figure 9a are along roads, coinciding with the trajectories of the vehicles. Figure 9b–d show the detection results for all frames without the image-to-position conversion; the pixel size was instead assumed constant at the maximum, median, and minimum sizes, respectively. The maximum, median, and minimum pixel sizes in Figure 4b are 1.623 m², 0.1506 m², and 0.0561 m², respectively; thus, the pixel-count thresholds corresponding to a basic rectangle of 6 m² are 4 pixels, 40 pixels, and 107 pixels in Figure 9b–d, respectively. If the threshold is too low, as in Figure 9b, more false alarms are detected; if the threshold is too high, as in Figure 9d, more detections are missed. With the median threshold, as in Figure 9c, some long-distance vehicles fail to be detected.

**Figure 8.** Object detection: (**a**) frame subtraction and thresholding; (**b**) morphological operation (closing) of (**a**); (**c**) basic rectangles after false alarm removal of (**b**); (**d**) 127 centers of basic rectangles of (**c**), indicated as blue dots.

**Figure 9.** Object detection results of 151 frames with: (**a**) image-to-position conversion; (**b**) maximum pixel size; (**c**) median pixel size; (**d**) minimum pixel size.

#### *5.2. Multiple Target Tracking*

The positions in Figure 9a become the inputs to the target-tracking stage. The sampling time is 1/15 s, since every second frame is processed. The IMM-CV and IMM-CA schemes are adopted with the image-to-position conversion and the proposed directional track association procedure. Table 1 shows the parameters for target tracking; it is noted that an angular threshold $\theta_g$ of 90° is equivalent to track association without directional gating.

**Table 1.** Parameters for target tracking.


For IMM-CV and IMM-CA without track association, totals of 340 and 314 valid tracks are generated, as shown in Figure 10a,b. The number of tracks of IMM-CV is reduced to 173, 185, and 192 by the directional track association when $\theta_g$ is 90°, 30°, and 20°, respectively, as shown in Figure 11a–c. For IMM-CA, 196, 208, and 209 tracks are generated for the directional track associations when $\theta_g$ is 90°, 30°, and 20°, respectively, as shown in Figure 12a–c.

**Figure 10.** All tracks without track association: (**a**) IMM-CV; (**b**) IMM-CA.

**Figure 11.** All tracks with track association, IMM-CV: (**a**) $\theta_g$ = 90°; (**b**) $\theta_g$ = 30°; (**c**) $\theta_g$ = 20°.

**Figure 12.** All tracks with track association, IMM-CA: (**a**) $\theta_g$ = 90°; (**b**) $\theta_g$ = 30°; (**c**) $\theta_g$ = 20°.

Two metrics, the TTL and the MTL, are employed to evaluate the tracking performance. They are defined, respectively, as [30]:

$$\text{TTL} = \frac{\text{Sum of lengths of tracks which have the same target ID}}{\text{Target life length} - 1},\tag{12}$$

$$\text{MTL} = \frac{\text{TTL}}{\text{Number of tracks associated in TTL}}. \tag{13}$$

A track's target ID is defined as the target with the most measurements on the track. The MTL becomes less than the TTL in the case of track breakage or overlap. The TTL and MTL are the same if only one track is generated for a target, and both are 0 if no track is generated for a target, i.e., the target is missing. Figures 13 and 14 show the TTL and MTL corresponding to Figures 11 and 12, respectively. Three targets (11, 17, 27) are missing in all cases. In addition, no tracks are set on Target 55 in Figure 13a, and on Targets 55 and 84 in Figure 14a–c.
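As I read Equations (12) and (13), the two metrics can be computed per target as follows (a hypothetical helper, not from the paper):

```python
def ttl_mtl(track_lengths, target_life_length):
    """TTL and MTL for one target.
    track_lengths: lengths (in frames) of all tracks carrying this target's ID;
    an empty list means the target is missing, so both metrics are 0."""
    if not track_lengths:
        return 0.0, 0.0
    ttl = sum(track_lengths) / (target_life_length - 1)   # Equation (12)
    mtl = ttl / len(track_lengths)                        # Equation (13)
    return ttl, mtl
```

For example, a target whose 149 trackable frames are covered by two tracks of lengths 100 and 49 has a TTL of 1.0 but an MTL of only 0.5, reflecting the track breakage.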

**Figure 13.** TTL and MTL of IMM-CV: (**a**) $\theta_g$ = 90°; (**b**) $\theta_g$ = 30°; (**c**) $\theta_g$ = 20°.

**Figure 14.** TTL and MTL of IMM-CA: (**a**) $\theta_g$ = 90°; (**b**) $\theta_g$ = 30°; (**c**) $\theta_g$ = 20°.

Tables 2 and 3 show the overall tracking performance of IMM-CV and IMM-CA, respectively. They show the number of tracks, the number of tracks associated with the targets of interest, the average TTL and MTL, the average TTL and MTL excluding missing targets, and the number of missing targets. The average TTL and MTL are 84.9–91.0% and 65.6–78.2%, respectively. Excluding missing targets, the average TTL and MTL are 89.2–94.3% and 69.7–81.0%, respectively.


**Table 2.** Tracking performance of IMM-CV.

**Table 3.** Tracking performance of IMM-CA.


Eight supplementary multimedia files (MP4 format) for Figures 10–12 are available online: the IMM-CV without track association for Figure 10a (Video S1); the IMM-CA without track association for Figure 10b (Video S2); the IMM-CV with $\theta_g$ = 90°, 30°, and 20° for Figure 11a–c (Videos S3–S5); and the IMM-CA with $\theta_g$ = 90°, 30°, and 20° for Figure 12a–c (Videos S6–S8). The black squares and numbers in the MP4 files are position estimates and track numbers, respectively, in the order they were initialized. For better visualization, odd numbers are shown in white and even numbers in yellow. The blue dots are the detection positions, including false alarms.

#### **6. Discussion**

The image-to-position conversion is an approximation showing a significant improvement in object detection. The reverse process is also possible, and the track is displayed in the frame after the reverse process.

The stability of the drone (camera) is important, especially for oblique images, where target positions can be concentrated in a small image region or easily occluded; stability helps prevent false detections that result in false tracks and false track associations.

The proposed track selection reduces the number of tracks from 340 to 173–192 for IMM-CV and from 314 to 196–209 for IMM-CA. A smaller angular threshold yields higher TTL and MTL while producing more tracks. The highest TTL and MTL excluding missing targets are 94.3% and 81.0%, respectively, for IMM-CA with $\theta_g$ = 20°. Some targets are still detected and tracked outside the range of interest, as shown in the videos.

The average numbers of missing targets are 3.33 and 5 for IMM-CV and IMM-CA, respectively. Targets 11 and 27 move too slowly, and Target 17 is occluded by trees and shadows. In the experiments, the surveillance area was around 0.53 km², more than twice as large as the area covered when the camera was pointed directly at the ground.

#### **7. Conclusions**

In this paper, two strategies were developed for multitarget tracking by a small drone. One is the image-to-position conversion based on the AFOV, tilt angle, and altitude of the camera. The other is the improved track association for densely distributed track environments. Both the IMM-CV and IMM-CA schemes achieve robust results in TTL and MTL.

The overall process is computationally efficient, as it does not require high-resolution video streaming or storage and training on large-scale data. This system is suitable for security and surveillance in civil and military applications such as threat detection, vehicle counting and chasing, and traffic control. The method can also be applied to tracking other objects, such as people or animals, over long distances. Target tracking using moving drones from various perspectives remains a subject of future study.

**Supplementary Materials:** The following are available online at https://zenodo.org/record/5932718, Video S1: IMM-CV, Video S2: IMM-CA, Video S3: IMM-CV-90, Video S4: IMM-CV-30, Video S5: IMM-CV-20, Video S6: IMM-CA-90, Video S7: IMM-CA-30, Video S8: IMM-CA-20.

**Funding:** This research was supported by Daegu University Research Grant 2019.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** Not applicable.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**

