*Article* **Real-Time Detection and Recognition of Multiple Moving Objects for Aerial Surveillance**

**Wahyu Rahmaniar <sup>1,2</sup>, Wen-June Wang <sup>1,</sup>\* and Hsiang-Chieh Chen <sup>3</sup>**


Received: 14 October 2019; Accepted: 17 November 2019; Published: 20 November 2019

**Abstract:** Detection of moving objects by unmanned aerial vehicles (UAVs) is an important application in the aerial transportation system. However, many problems must be handled, such as high-frequency jitter from UAVs, small objects, low-quality images, computation time reduction, and detection correctness. This paper considers the problem of the detection and recognition of moving objects in a sequence of images captured from a UAV. A new and efficient technique is proposed to achieve the above objective in real time and in real environments. First, the feature points between two successive frames are found to estimate the camera movement and stabilize the sequence of images. Then, regions of interest (ROIs) of the objects are detected as moving object candidates (foreground). Furthermore, static and dynamic objects are classified based on the dominant motion vectors in the foreground and background. Based on the experimental results, the proposed method achieves a precision rate of 94% and a processing speed of 47.08 frames per second (fps). The performance of the proposed method surpasses that of existing methods.

**Keywords:** moving object; image stabilization; object detection; optical flow; surveillance; UAVs

### **1. Introduction**

There has been increasing worldwide interest in unmanned aerial vehicles (UAVs) for surveillance in recent years due to their high mobility and flexibility. In general, a UAV with an attached surveillance camera flies over the mission area and can be controlled manually by an operator or automatically by using computer vision. One of the most important tasks in aerial surveillance is the detection of moving objects, which can be used to convey essential information in images, such as pedestrian detection and tracking [1–3], vehicle detection and tracking [4,5], object counting [6], estimation and recognition of object activity [7–9], human and vehicle interactions [10], intelligent transportation systems [11,12], traffic management [13,14], and autonomous robot navigation [15,16].

Several studies have proposed methods to detect moving objects using stationary cameras, such as the Gaussian Mixture Model (GMM) [17], the Bayesian background model [18], Markov Random Fields (MRF) [19,20], and frame differences [21,22]. These methods extract and identify moving objects by seeking pixel changes in each frame. However, such techniques rely on static pixels in the images and are not suitable for processing images from moving cameras, whose pixels are dynamic. This limitation excludes videos from moving platforms, e.g., aerial vehicles, mobile robots, and handheld cameras. Thus, the problem of detecting moving objects with a moving camera has attracted the attention of researchers in recent years [23].

Detecting moving objects from a UAV is difficult to implement in real time and in real environments. The difficulties include camera movement, dynamic backgrounds, abrupt motion of the objects or camera, rapid illumination changes, camouflage of stationary objects as moving objects, changes in moving object appearance, noise from low-quality images, and so on. Several approaches have been proposed to detect moving objects from moving cameras using object segmentation techniques. Saif et al. [24] presented a dynamic motion model using moment invariants and segmentation which extracts one frame per second, which is not fast enough for real-time detection. Their result also contains false detections, such as a parked car recognized as a moving object. Maier et al. [25] used the deviations between all pixels of the anticipated geometry of two or more consecutive frames to distinguish moving and static objects, but the result depended on the accuracy of the optical flow calculation and the amount of radial distortion. Kalantar et al. [26] proposed a moving object detection framework without explicitly overlaying frame pairs, where each frame is segmented into regions and subsequently represented as a region adjacency graph (RAG).

In our proposed method, we aim not only to detect moving objects accurately with a moving camera but also to do so in real time. Some previous studies used an optical flow approach to define the movement paths of pixels tracked across two consecutive frames. Wu et al. [27] used a coarse-to-fine threshold scheme on particle trajectories in the sequence of images to detect moving objects. The background movement is subtracted using an adaptive threshold method to obtain a fine foreground segmentation; then, mean-shift segmentation is used to refine the detected foreground. Cai et al. [28] combined brightness constancy relaxation and intensity normalization within the optical flow to extract moving objects from the background based on the growing region of the velocity field. In that case, the images were obtained from a robot competition arena with a homogeneous background. Minaeian et al. [29] used foreground estimation to segment moving targets through the integration of spatiotemporal differences and local motion history. However, these previous methods did not adequately demonstrate reliability in real-time processing.

This paper proposes a method for detecting multiple moving objects in a sequence of images taken by a UAV which can be applied in real-time applications. Detection and recognition are performed for different classes of objects, such as people and cars. In addition, the image sequences tested by this method may contain a complex background. The proposed method is reliable for object detection in images, and the processing time to obtain the foreground is shorter than that of the segmentation methods employed in previous studies [24–26]. Aerial image stabilization is proposed to reduce the mixing of camera and object movements, where the background moves due to the camera movement and the foreground moves due to both camera and object movement. Furthermore, unwanted camera movements make the motion vector field estimated between two consecutive frames incompatible with the actual situation. This situation differentiates the direction of the motion vectors of static objects from the background, even though the objects are part of the background. Thus, static objects tend to be recognized as moving objects. To solve such problems, the proposed method provides a motion vector classification to distinguish static and dynamic (moving) objects.

The remainder of the paper is organized as follows. Section 2 introduces materials and the main algorithm. Section 3 illustrates performance results using multiple videos taken from a UAV. Finally, conclusions are drawn in Section 4.

### **2. Materials and Method**

### *2.1. Materials*

The experiment was executed using Visual Studio C++ on a 3.40 GHz CPU with 8 GB RAM. The performance of the proposed method is evaluated using three aerial image sequences (action1, action2, and action3) obtained from the UCF aerial action dataset (http://crcv.ucf.edu/data/UCF\_Aerial\_Action.php) with a resolution of 960 × 540. These image sequences were recorded at flying altitudes ranging from 400 to 450 feet. Action1.mpg and action2.mpg were taken by the UAV at similar altitudes, where people and cars are the main objects in the image. Action3.mpg was taken at a higher altitude than the other videos, so the objects look smaller by comparison.

### *2.2. The Proposed Method*

The challenge of moving object detection with a moving camera is obvious. The proposed framework reduces the problem of distinguishing the foreground from a dynamic background to a simpler formulation. The systematic approach starts with image stabilization to reduce unwanted movement in the sequence of images. The unwanted movements are the motion of the camera as well as any vibration of the UAV. Inaccuracies in motion compensation can cause failures in the estimation of the background and foreground pixels [30]. However, even with image stabilization, the motion vectors of static objects (background) and moving objects (foreground) remain difficult to distinguish.

Additionally, in order to detect several moving objects with different sizes and speeds, we require a correct calculation of the motion vector fields. Furthermore, static and dynamic objects are distinguished based on their movement direction (MD). There are two kinds of MD to be estimated: the direction of the object's movement (foreground) and the direction of the background's movement. It should be noted that the background motion is caused by the camera's movement. Figure 1 illustrates how the movement of a UAV affects camera movement. The background movement corresponding to the motion of a moving camera is affected by UAV movements on the yaw, pitch, and roll axes. Hence, an efficient affine transformation is needed.

**Figure 1.** Unmanned aerial vehicles (UAV) movement modeling.

Figure 2 shows an overview of the structure of the system. The algorithm consists of three steps to accomplish the main task: Step 1 is aerial image stabilization, Step 2 is object detection and recognition, and Step 3 is the classification of motion vectors. The proposed algorithm handles each frame for moving object detection and recognition so that it can be used in real-time applications with online image processing.

Step 1: Image stabilization is performed to handle the unstable UAV platform. This step aligns each frame with the adjacent frame in a sequence of aerial images to eliminate the effect of camera movement. The stabilization consists of motion estimation and motion compensation. We use speeded-up robust features (SURF) [31–33] and an affine transformation [34] to estimate the camera movement based on the positions of features that match between the previous (*t* − 1) and current (*t*) frames. Then, the Kalman filter [35,36] is used to overcome the changes in frame position due to UAV movement such that the camera movement is compensated in each frame. This image transformation is applied to frame *t*, so it affects the resulting MD in the background and foreground.

Step 2: People and cars are detected in the images as the moving objects candidates or foreground. In this step, Haar-like features [37] and cascade classifiers [38,39] are used to detect and recognize the objects in the images and determine the region of interest (ROI) for the objects. This is followed by labeling the background and foreground.

**Figure 2.** System overview of the real-time moving object detection and recognition using UAV.

Step 3: Calculate the motion vectors from two consecutive images based on dense optical flow [40]. Background modeling is sometimes incompatible with actual camera movements due to UAV movements and camera transitions. It is noted that Step 1 makes the MD of static and dynamic objects easier to distinguish. MD is specified as the value of the most frequently occurring motion vector in frame *t*, which is calculated in the background and in each foreground. If a foreground has the same MD as the background, then the object is omitted from the foreground. Thus, the final result is the ROIs in the image showing the moving objects.

The details of each step are explained as follows.

### *2.3. Step 1: Aerial Image Stabilization*

This step uses an affine motion model to handle rotation, scaling, and translation. The affine model can be used to estimate movement between frames under certain conditions in the scene [41,42]. For every two successive frames, the previous frame is defined as *f*(*t* − 1) and the current frame is defined as *f*(*t*). In order to reduce the computation time, the image is reduced to 75% of its original size and converted to gray-scale, where *f̂*(*t*) denotes the new image obtained from *f*(*t*) with the above size and color. The local features on each frame are found using SURF [31] as the feature detector and descriptor. SURF uses an integral image [43] to compute different box filters to detect feature points in the image. If *f*(*t*) is an input image and *f*<sub>(*x*,*y*)</sub>(*t*) is the pixel value at location (*x*, *y*) of *f*(*t*), the integral image value *P*(*i*, *j*, *t*) is defined as

$$P(i,j,t) = \sum_{x=0}^{x \le i} \sum_{y=0}^{y \le j} f_{(x,y)}(t). \tag{1}$$
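As an illustration of Equation (1), the following is a minimal pure-Python sketch of the integral image and of the constant-time rectangle sum it enables (the function names are ours, not the paper's):

```python
def integral_image(f):
    """P[i][j] = sum of f[x][y] for all x <= i, y <= j (Eq. 1)."""
    h, w = len(f), len(f[0])
    P = [[0] * w for _ in range(h)]
    for i in range(h):
        row_sum = 0
        for j in range(w):
            row_sum += f[i][j]
            P[i][j] = row_sum + (P[i - 1][j] if i > 0 else 0)
    return P

def box_sum(P, i0, j0, i1, j1):
    """Sum of pixels in the rectangle [i0..i1] x [j0..j1] via four lookups."""
    total = P[i1][j1]
    if i0 > 0:
        total -= P[i0 - 1][j1]
    if j0 > 0:
        total -= P[i1][j0 - 1]
    if i0 > 0 and j0 > 0:
        total += P[i0 - 1][j0 - 1]
    return total
```

SURF evaluates its box filters through such sums, so the cost of a filter response is independent of the filter size.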

Haar wavelet responses [30] *d<sub>x</sub>* and *d<sub>y</sub>* are calculated in the *x*-direction and *y*-direction, respectively, around each feature point to form a descriptor vector presented as

$$v = \left(\sum d_x, \sum d_y, \sum |d_x|, \sum |d_y|\right). \tag{2}$$

Then, a 4 × 4 array of subregions, each contributing the four values in Equation (2), is constructed and centered on the feature point. Therefore, each feature point has a descriptor vector of length 64.

The Fast Library for Approximate Nearest Neighbors (FLANN) [44] is used to select a set of feature point pairs between *f̂*(*t* − 1) and *f̂*(*t*). Then, the minimum distance over all pairs of feature points is calculated using the Euclidean distance. A matching pair is a feature point pair with a distance less than 0.6. If the total number of matching pairs is more than three, the selected feature points are used for the next step. Otherwise, the previous trajectory is used as an estimate of the current movement.
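The matching step can be sketched as follows. This brute-force nearest-neighbor search is a stand-in for FLANN's approximate search; the 0.6 distance threshold is the paper's, while the function names are illustrative:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_features(desc_prev, desc_curr, max_dist=0.6):
    """Pair each previous-frame descriptor with its nearest current-frame
    descriptor; keep the pair only if the distance is below max_dist."""
    pairs = []
    for i, d_prev in enumerate(desc_prev):
        dists = [euclidean(d_prev, d_curr) for d_curr in desc_curr]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] < max_dist:
            pairs.append((i, j))
    return pairs
```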

In homogeneous coordinates, the relationship between a pair of feature points in *f̂*(*t* − 1) and *f̂*(*t*) is given by

$$\begin{bmatrix}x(t)\\y(t)\end{bmatrix} = H\begin{bmatrix}x(t-1)\\y(t-1)\\1\end{bmatrix} \tag{3}$$

where *H* is the homogeneous affine matrix given by

$$H = \begin{bmatrix} 1 + a_{11} & a_{12} & T_x \\ a_{21} & 1 + a_{22} & T_y \end{bmatrix} \tag{4}$$

where *a<sub>ij</sub>* are the parameters derived from the rotation angle θ, and *T<sub>x</sub>* and *T<sub>y</sub>* are the parameters of the translation *T* on the x-axis and y-axis, respectively. The affine matrix can be estimated as a least squares problem by

$$\begin{aligned} L &= mh, \\ L &= \begin{bmatrix} x(1)' & y(1)' & \dots & x(Q)' & y(Q)' \end{bmatrix}^T, \\ m &= \begin{bmatrix} M_0(1) & M_1(1) & \dots & M_0(Q) & M_1(Q) \end{bmatrix}^T, \\ h &= \begin{bmatrix} 1 + a_{11} & a_{12} & T_x & a_{21} & 1 + a_{22} & T_y \end{bmatrix}^T, \end{aligned} \tag{5}$$

where *q* = 1, ... , *Q* indexes the matched feature points, *M*<sub>0</sub>(*q*) = [*x*(*q*) *y*(*q*) 1 0 0 0], and *M*<sub>1</sub>(*q*) = [0 0 0 *x*(*q*) *y*(*q*) 1].

The optimal estimate *h*′ in Equation (5) can be found by using Gaussian elimination to minimize the Root Mean Squared Error (RMSE) calculated by

$$RMSE = \frac{1}{Q}\|L - mh'\| = \sqrt{\frac{\sum_{q=1}^{Q}\left(L_q - (mh')_q\right)^2}{Q^2}}. \tag{6}$$
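Equation (5) can be solved as sketched below: the M0(q)/M1(q) rows are stacked into m, and the normal equations (m<sup>T</sup>m)h = m<sup>T</sup>L are solved by Gaussian elimination, as the text states. This is our minimal reconstruction, not the authors' code:

```python
def gauss_solve(A, y):
    """Solve A h = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    h = [0.0] * n
    for r in range(n - 1, -1, -1):
        h[r] = (M[r][n] - sum(M[r][c] * h[c] for c in range(r + 1, n))) / M[r][r]
    return h

def estimate_affine(src, dst):
    """Estimate h = [1+a11, a12, Tx, a21, 1+a22, Ty] from matched points
    by stacking the M0/M1 rows of Eq. (5) and solving the normal equations."""
    m, L = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        m.append([x, y, 1.0, 0.0, 0.0, 0.0]); L.append(xp)  # M0(q) row
        m.append([0.0, 0.0, 0.0, x, y, 1.0]); L.append(yp)  # M1(q) row
    A = [[sum(r[i] * r[j] for r in m) for j in range(6)] for i in range(6)]
    y = [sum(r[i] * Lk for r, Lk in zip(m, L)) for i in range(6)]
    return gauss_solve(A, y)
```

In the paper, RANSAC then rejects outlier correspondences by re-running this estimate on consensus subsets.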

Because the affine transform cannot represent the three-dimensional motion which occurs in the image, outliers are generated in the motion estimation. To solve this problem, Random Sample Consensus (RANSAC) [45] is used to filter outliers during the estimation.

Next, the translation and rotation trajectories are compensated to generate a new set of transformations for each frame using the Kalman filter. The Kalman filter consists of two essential parts, prediction and measurement correction. The prediction step estimates the state of the trajectory ẑ(*t*) = [*T̂<sub>x</sub>*(*t*), *T̂<sub>y</sub>*(*t*), θ̂(*t*)] at *f̂*(*t*) as

$$\hat{z}(t) = z(t-1), \tag{7}$$

where the initial state is defined by *z*(0) = [0, 0, 0] and the error covariance can be estimated by

$$\hat{e}(t) = e(t-1) + \Omega_p \tag{8}$$

where the initial error covariance is defined by *e*(0) = [1, 1, 1] and Ω<sub>p</sub> is the noise covariance of the process. The optimum Kalman gain can be computed as follows

$$K(t) = \frac{\hat{e}(t)}{\hat{e}(t) + \Omega_m}, \tag{9}$$

where Ω<sub>m</sub> is the noise covariance of the measurement. The error covariance can be updated by

$$e(t) = (1 - K(t))\hat{e}(t). \tag{10}$$

Then, the measurement correction step compensates the trajectory state at *f̂*(*t*), which can be computed as

$$z(t) = \hat{z}(t) + K(t)(\Gamma(t) - \hat{z}(t)), \tag{11}$$

where the new state contains the compensated trajectory defined by *z*(*t*) = [*T̃<sub>x</sub>*(*t*), *T̃<sub>y</sub>*(*t*), θ̃(*t*)] and Γ(*t*) is the accumulation of the trajectory measurement that can be calculated as follows

$$\Gamma(t) = \sum_{\tau=1}^{t-1} \left[ \left( \overline{T}_x(\tau) + T_x(t) \right) \quad \left( \overline{T}_y(\tau) + T_y(t) \right) \quad \left( \overline{\theta}(\tau) + \theta(t) \right) \right] = \left[ \Gamma_x(t), \Gamma_y(t), \Gamma_\theta(t) \right]. \tag{12}$$

Therefore, a new trajectory can be obtained by

$$\left[\overline{T}_x(t), \overline{T}_y(t), \overline{\theta}(t)\right] = \left[T_x(t), T_y(t), \theta(t)\right] + \left[\sigma_x(t), \sigma_y(t), \sigma_\theta(t)\right], \tag{13}$$

where σ<sub>x</sub>(*t*) = *T̃<sub>x</sub>*(*t*) − Γ<sub>x</sub>(*t*), σ<sub>y</sub>(*t*) = *T̃<sub>y</sub>*(*t*) − Γ<sub>y</sub>(*t*), and σ<sub>θ</sub>(*t*) = θ̃(*t*) − Γ<sub>θ</sub>(*t*).
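The prediction-correction cycle of Equations (7)–(11) can be sketched in one dimension, applied independently to each trajectory component; the values used here for the noise covariances Ω<sub>p</sub> and Ω<sub>m</sub> are illustrative, not the paper's:

```python
def kalman_smooth(raw_motion, omega_p=4e-3, omega_m=0.25):
    """Smooth an accumulated trajectory: raw_motion holds the per-frame
    motion of one component (e.g., Tx). Returns the compensated states z(t)."""
    z, e = 0.0, 1.0        # initial state z(0) and error covariance e(0)
    gamma = 0.0            # accumulated trajectory measurement
    smoothed = []
    for motion in raw_motion:
        gamma += motion
        z_pred = z                          # prediction, Eq. (7)
        e_pred = e + omega_p                # Eq. (8)
        K = e_pred / (e_pred + omega_m)     # Kalman gain, Eq. (9)
        e = (1 - K) * e_pred                # Eq. (10)
        z = z_pred + K * (gamma - z_pred)   # correction, Eq. (11)
        smoothed.append(z)
    return smoothed
```

The difference between the smoothed state and the accumulated measurement then gives the correction terms σ of Equation (13).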

Then, *f*(*t*) is warped into the new image plane by applying the new trajectory in Equation (13), yielding the transformed current frame

$$\overline{f}(t) = f(t) \begin{bmatrix} \Phi(t) \cos \overline{\theta}(t) & -\Phi(t) \sin \overline{\theta}(t) \\ \Phi(t) \sin \overline{\theta}(t) & \Phi(t) \cos \overline{\theta}(t) \end{bmatrix} + \begin{bmatrix} \overline{T}_x(t) \\ \overline{T}_y(t) \end{bmatrix} \tag{14}$$

where Φ(*t*) is a scale factor computed by

$$\Phi(t) = \frac{\cos \overline{\theta}(t)}{\cos \left( \tan^{-1} \left( \frac{\sin \overline{\theta}(t)}{\cos \overline{\theta}(t)} \right) \right)}. \tag{15}$$
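Equations (14) and (15) amount to building a 2 × 3 warp matrix per frame; a minimal sketch (the function name is ours):

```python
import math

def stabilizing_warp(tx, ty, theta):
    """2x3 affine warp of Eq. (14) with the scale factor of Eq. (15)."""
    phi = math.cos(theta) / math.cos(math.atan2(math.sin(theta), math.cos(theta)))
    return [[phi * math.cos(theta), -phi * math.sin(theta), tx],
            [phi * math.sin(theta),  phi * math.cos(theta), ty]]
```

Each output pixel of the stabilized frame is then sampled from *f*(*t*) through this matrix (e.g., with cv::warpAffine in an OpenCV implementation).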

### *2.4. Step 2: Object Detection and Recognition*

In this step, the background and foreground are determined in each frame that was transformed in Step 1. The foreground is made up of the moving object candidates, namely people and cars, in the image. The foreground is detected and recognized using Haar-like features and a boosted cascade of classifiers with training and detection stages. The basic idea behind Haar-like features is to detect objects of various sizes in images. Figure 3 shows the templates of the Haar-like features, where each feature consists of two or three adjacent rectangular groups and can be scaled up or down. The pixel intensity values in the white and black groups are accumulated separately, so the difference between adjacent groups captures light and dark regions. Therefore, Haar-like features are suitable for encoding image information to find objects at different scales, in which some simple patterns are used to identify the existence of objects.

**Figure 3.** Haar-like features: (**a**) Edge, (**b**) line, (**c**) center-surround.

The Haar-like feature value is calculated as the weighted sum of the pixel gray level values, which are summed over the black rectangle and the entire feature area. Then, an integral image [43] is used to minimize the number of array references when summing the pixels in a rectangular area of an image. Figure 4a,b show examples of the main objects to be selected. Figure 4c shows examples of the additional objects to be selected, which are non-moving objects, i.e., road signs, fences, boxes, road patterns, grass patterns, power lines, roadblocks, and so on. The purpose of these additional objects is to reduce false detections, as such objects often tend to be recognized as foreground. Negative images are images of landscapes and roads taken by a UAV that contain no cars or people. In this study, the minimum and maximum sizes of positive images to be trained are 16 × 35 and 136 × 106, respectively.

**Figure 4.** Examples of positive images: (**a**) Person, (**b**) car, (**c**) non-moving object.
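As a concrete illustration, the following computes the two-rectangle "edge" feature of Figure 3a directly; a production detector would use integral-image lookups instead, and the function name is ours:

```python
def haar_edge_feature(img, x, y, w, h):
    """Two-rectangle 'edge' feature: sum of the white half minus the
    black half of a (2w x h) window whose top-left corner is (x, y)."""
    def rect_sum(x0, y0, rw, rh):
        return sum(img[r][c] for r in range(y0, y0 + rh)
                   for c in range(x0, x0 + rw))
    return rect_sum(x, y, w, h) - rect_sum(x + w, y, w, h)
```

A large response indicates a strong light-to-dark transition inside the window, which is the simple pattern each feature tests for.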

The AdaBoost algorithm [46] is used to combine the features of the selected classifiers. A classifier is chosen as the threshold to determine the best classification function for each feature. A training sample is set as (α<sub>s</sub>, β<sub>s</sub>), *s* = 1, 2, ... , *N*, where β<sub>s</sub> = 0 or 1 is the class label of the sample α<sub>s</sub> for negative or positive samples, respectively. Each sample is converted to gray-scale and then scaled down to the base resolution of the detector. The AdaBoost algorithm maintains a weight vector distributed over all training samples across iterations. The initial weight vector for all samples (α<sub>1</sub>, β<sub>1</sub>), ... , (α<sub>N</sub>, β<sub>N</sub>) is set as ω<sub>1</sub>(*s*) = 1/*N*. The error associated with a selected classifier is evaluated as

$$\varepsilon_i = \sum_{s=1}^{N} \omega_i(s), \text{ if } \lambda_i(\alpha_s) \neq \beta_s. \tag{16}$$

Here, λ<sub>i</sub>(α<sub>s</sub>) = 0 or 1 is the label assigned by the selected classifier for negative or positive samples, respectively, and *i* = 1, 2, ... , *I* is the iteration number. The selected classifier is used to update the weight vector as

$$\omega_{i+1}(s) = \omega_i(s)\,\delta_i^{1-r_s}, \quad \text{where } r_s = \begin{cases} 0, & \text{if } \alpha_s \text{ is classified correctly,} \\ 1, & \text{otherwise,} \end{cases} \tag{17}$$

and δ*<sup>i</sup>* is the weighting parameter set by

$$\delta_i = \frac{\varepsilon_i}{1 - \varepsilon_i}. \tag{18}$$

The final classifier stage *W*(α) is the labeled result of each region represented as

$$W(\alpha) = \begin{cases} 1, & \text{if } \sum_{i=1}^{I} \left[ \log\left(\frac{1}{\delta_i}\right) \lambda_i(\alpha) \right] \ge \frac{1}{2} \sum_{i=1}^{I} \log\left(\frac{1}{\delta_i}\right), \\ 0, & \text{otherwise.} \end{cases} \tag{19}$$
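The loop of Equations (16)–(19) can be sketched as follows. Weak classifiers are functions returning 0 or 1; the weight renormalization after each round is standard AdaBoost practice and an assumption on our part, as the paper does not state it:

```python
import math

def adaboost_train(samples, labels, weak_clfs, rounds):
    """Pick the weak classifier with the lowest weighted error each round
    (Eq. 16), re-weight the samples (Eqs. 17-18), and return the final
    strong classifier W (Eq. 19)."""
    N = len(samples)
    w = [1.0 / N] * N                       # omega_1(s) = 1/N
    chosen = []
    for _ in range(rounds):
        errs = [sum(wi for wi, a, b in zip(w, samples, labels) if clf(a) != b)
                for clf in weak_clfs]       # weighted errors, Eq. (16)
        best = min(range(len(errs)), key=errs.__getitem__)
        delta = errs[best] / (1.0 - errs[best])               # Eq. (18)
        w = [wi * (delta if weak_clfs[best](a) == b else 1.0)  # Eq. (17)
             for wi, a, b in zip(w, samples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]        # renormalize (assumed)
        chosen.append((weak_clfs[best], delta))

    def strong(alpha):                      # final stage W(alpha), Eq. (19)
        score = sum(math.log(1.0 / d) * clf(alpha) for clf, d in chosen)
        return 1 if score >= 0.5 * sum(math.log(1.0 / d) for _, d in chosen) else 0
    return strong
```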

Figure 5 shows a sub-window that slides over the image to identify regions containing an object. The region is labeled at each classifier stage either as positive (1) or negative (0). A region passes to the next stage if it is labeled as positive, which means that the region is recognized as an object; otherwise, the region is labeled as negative and rejected. The final stage yields the regions of the moving object candidates. The regions of non-moving objects are not displayed in the image and are used to evaluate the detected objects. If the region of a moving object candidate coincides with a non-moving object, then the region is eliminated from the foreground. Let the *n*-th foreground region be represented as

$$Obj[n] = [(x_{\min}(n), y_{\min}(n)), (x_{\max}(n), y_{\max}(n))], \tag{20}$$

where (*x*<sub>min</sub>(*n*), *y*<sub>min</sub>(*n*)) and (*x*<sub>max</sub>(*n*), *y*<sub>max</sub>(*n*)) are the minimum and maximum positions of the rectangular foreground pixel locations, respectively.

**Figure 5.** Cascade classifier for object detection and recognition.

False detections of moving object candidates are eliminated immediately by comparing their regions with those of non-moving objects. This speeds up the computation in the next step.
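A minimal sketch of this elimination using intersection-over-union between ROIs; the IoU criterion and its 0.5 threshold are our illustrative choice for deciding when two regions count as "the same":

```python
def iou(a, b):
    """Intersection-over-union of two ROIs ((xmin, ymin), (xmax, ymax))."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def filter_candidates(candidates, non_moving, thresh=0.5):
    """Drop any moving-object candidate whose ROI coincides with a
    detected non-moving object."""
    return [c for c in candidates
            if all(iou(c, nm) < thresh for nm in non_moving)]
```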

### *2.5. Step 3: Motion Vector Classification*

The Farneback optical flow [40] is adopted to obtain motion vectors from two consecutive images. The Farneback method uses a polynomial expansion to provide high speed and accuracy for field estimation. Suppose there is a 10 × 10 window *G*(*j*) and the pixel *j* is chosen inside the window. By using polynomial expansion, each pixel in *G*(*j*) can be approximated by a polynomial, the so-called "local coordinate system," at *f*(*t* − 1), which can be computed as follows

$$f_{lcs}^{p}(t-1) = p^T A(t-1)p + b^T(t-1)p + c(t-1), \tag{21}$$

where *p* is a vector, *A*(*t* − 1) is a symmetric matrix, *b*(*t* − 1) is a vector, and *c*(*t* − 1) is a scalar. The local coordinate system at *f*(*t*) can be defined by

$$f_{lcs}^{p}(t) = p^T A(t)p + b^T(t)p + c(t). \tag{22}$$

Then, a new signal is constructed at *f*(*t*) by a global displacement Δ(*t*) as *f*<sup>*p*</sup><sub>*lcs*</sub>(*t*) = *f*<sup>*p*−Δ(*t*)</sup><sub>*lcs*</sub>(*t* − 1). The relation between the local coordinate systems of the two input images is then

$$\begin{aligned} f_{lcs}^{p}(t) &= (p - \Delta(t))^T A(t-1)(p - \Delta(t)) + b^T(t-1)(p - \Delta(t)) + c(t-1) \\ &= p^T A(t-1)p + (b(t-1) - 2A(t-1)\Delta(t))^T p + \Delta^T(t)A(t-1)\Delta(t) - b^T(t-1)\Delta(t) + c(t-1). \end{aligned} \tag{23}$$

The coefficients can be equated in Equations (22) and (23) as

$$A(t) = A(t-1),\tag{24}$$

$$b(t) = b(t-1) - 2A(t-1)\Delta(t),\tag{25}$$

and

$$c(t) = \Delta^T(t)A(t-1)\Delta(t) - b^T(t-1)\Delta(t) + c(t-1). \tag{26}$$

Therefore, the displacement of each corresponding window in the ROI can be solved by

$$\Delta(t) = -\frac{1}{2}A^{-1}(t-1)(b(t) - b(t-1)). \tag{27}$$
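Equation (27) reduces to a 2 × 2 solve per window; a minimal sketch (the function name is ours):

```python
def displacement(A_prev, b_prev, b_curr):
    """Window displacement d = -1/2 * A(t-1)^-1 (b(t) - b(t-1)), Eq. (27)."""
    (a11, a12), (a21, a22) = A_prev
    det = a11 * a22 - a12 * a21
    rx, ry = b_curr[0] - b_prev[0], b_curr[1] - b_prev[1]
    return [-0.5 * ( a22 * rx - a12 * ry) / det,
            -0.5 * (-a21 * rx + a11 * ry) / det]
```

As a consistency check, substituting the result into Equation (25), b(*t*) = b(*t* − 1) − 2*A*(*t* − 1)Δ(*t*), reproduces the observed b(*t*).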

The displacement in Equation (27) is a translation for each corresponding ROI consisting of x-axis (Δ<sub>x</sub>(*t*)) and y-axis (Δ<sub>y</sub>(*t*)) components, so the angular value of the motion vector can be calculated by

$$\Delta_\theta(t) = \tan^{-1} \left( \frac{\Delta_{(x+1,y)}(t) - \Delta_{(x-1,y)}(t)}{\Delta_{(x,y+1)}(t) - \Delta_{(x,y-1)}(t)} \right) \times \frac{180}{\pi}. \tag{28}$$

Since the motion vector is calculated for each 10 × 10 pixel neighborhood, the total displacement is a matrix of size (*image_width*/10) × (*image_height*/10). Thus, the new *n*-th foreground region is determined by

$$Obj2[n] = \left[ \frac{(x_{\min}(n), y_{\min}(n))}{10}, \frac{(x_{\max}(n), y_{\max}(n))}{10} \right]. \tag{29}$$

Figure 6a shows regions marked with red and blue ROIs, representing the moving object candidates (foreground), identified as a person and a car, respectively. Figure 6b shows an example of the estimated motion vector distribution. In images taken by a static camera, the motion vectors in the background are zero, meaning the MD value is zero; there is no movement (represented by the direction of the arrows) between two consecutive frames. In our case (images taken by a moving camera), motion vectors in the background have several different directions, as shown in Figure 6b. The red ROI is a parked car classified as a non-moving object, where the motion vectors are similar to most motion vectors in the background. The blue ROI shows a walking person, classified as a moving object, where the motion vectors are different from most motion vectors in the background. Thus, the MD of each moving object candidate is obtained as the most frequent motion vector in its ROI. In the background, the MD is obtained as the most frequent motion vector in the image outside the foreground.

**Figure 6.** Optical flow estimation: (**a**) Original image, (**b**) motion vectors.

Figure 7 shows a flowchart of the classification of motion vectors and the selection of moving objects, which are implemented in Algorithm 1. In each ROI, motion vectors with equal angular values are grouped into the same class. If a motion vector is in the background, it is classified into Δ<sub>B</sub>[*B*], where *B* is the class index in the background and the total number of members of each class is denoted by *N<sub>B</sub>*[*B*]. If a motion vector is in the foreground, it is classified into Δ<sub>F</sub>[*n*, *F*[*n*]], where *F*[*n*] is the class index in the *n*-th foreground and the total number of members of each class is *N<sub>F</sub>*[*n*, *F*[*n*]]. Then, the MDs of the background Δ̄<sub>B</sub> and of the *n*-th foreground Δ̄<sub>F</sub>[*n*] are determined as the classes with the largest *N<sub>B</sub>*[*B*] and *N<sub>F</sub>*[*n*, *F*[*n*]], respectively. If Δ̄<sub>F</sub>[*n*] lies within the threshold around Δ̄<sub>B</sub>, then the object is identified as a non-moving object and is not considered a moving object candidate. Otherwise, the object is identified as a moving object. Finally, the image shows only the ROIs of the selected objects. The minimum and maximum MD threshold values relative to the background are −5 and +5, respectively. We chose these values because the MDs of the background and static objects may differ slightly, but not beyond the threshold range [−5, +5].
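The class counting and threshold test can be sketched as follows; the 10-degree binning is our illustrative discretization of "equal angular values," while the ±5 degree threshold is the paper's:

```python
from collections import Counter

def dominant_direction(angles, bin_size=10):
    """MD: the most frequent motion-vector direction, binned in degrees."""
    bins = Counter(int(a // bin_size) for a in angles)
    return bins.most_common(1)[0][0] * bin_size

def select_moving(bg_angles, fg_angle_sets, thresh=5):
    """Keep the n-th foreground only if its MD differs from the background
    MD by more than the [-5, +5] degree threshold."""
    md_bg = dominant_direction(bg_angles)
    return [n for n, angles in enumerate(fg_angle_sets)
            if abs(dominant_direction(angles) - md_bg) > thresh]
```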

**Algorithm 1.** The proposed classification for selecting moving objects.

- Motion vector: Δ<sub>θ</sub>
- Number of foregrounds: *n*
- Foreground region: *Obj*2[*n*]

**Figure 7.** Flowcharts to classify motion vectors and select moving objects.

### **3. Results and Discussion**

### *3.1. Result of Motion Vectors*

The tested images were unstable due to the movement of the UAV. This made the motion vectors of static (non-moving) and dynamic (moving) objects unsuitable for distinguishing the two. Figures 8 and 9 show the motion vectors obtained without and with image stabilization, respectively. Figures 8a and 9a show the motion vectors in the background. Figures 8b and 9b show the motion vectors in the ROI of a static object (car). Figures 8c and 9c show the motion vectors in the ROIs of dynamic objects (people). Figure 8 shows that, without stabilization, the motion vectors of the dynamic and static objects are almost the same and differ only slightly from the motion vectors in the background. Thus, the motion vectors obtained without image stabilization were incorrect.

**Figure 8.** Result of motion vectors without image stabilization: (**a**) Background, (**b**) car, (**c**) people.

Figure 9b shows that the motion vectors in the car (static object) are almost the same as the background. Figure 9c shows that the motion vectors in the people (dynamic objects) are very different from the background. Thus, the results of the motion vectors with image stabilization were very suitable to distinguish between static and dynamic objects.

**Figure 9.** Result of motion vectors with image stabilization: (**a**) Background, (**b**) car, (**c**) people.

### *3.2. Result of Moving Objects Detection*

Figures 10–12 show the results of the detection and recognition of moving objects. In some cases, false detections among the moving object candidates were eliminated because the motion vector classification identified them as undesirable and omitted them. Figures 10 and 11 show the sequences of images obtained from Action1 and Action2, respectively. Sometimes the algorithm did not detect a small object in the image. For example, the small car in Figure 11a was not detected as foreground. Although the motion vector classification identified the car as a moving object, the final result eliminated the car because its region was not recognized as foreground.

Figure 12 shows the results for the sequence of images obtained from Action3, which contains five people playing together and making small movements every once in a while. The detection results showed that when an object had only slight displacements, its motion vector was difficult to distinguish, so the object tended to be detected as a non-moving object.

**Figure 10.** The result of moving object detection in Action1: (**a**) Frame 25, (**b**) frame 100, (**c**) frame 210, (**d**) frame 405.

**Figure 11.** The result of moving object detection in Action2: (**a**) Frame 25, (**b**) frame 100, (**c**) frame 170, (**d**) frame 440.

**Figure 12.** The result of moving object detection in Action3: (**a**) Frame 5, (**b**) frame 60, (**c**) frame 120, (**d**) frame 300.

The computation performance is summarized in Table 1 in terms of frames per second (fps). The average processing speed is about 47.08 fps, which is faster than the previous methods in [23–28]. Table 2 shows the detection accuracy in terms of true positives (TP), false positives (FP), false negatives (FN), precision rate (PR), recall, and F-measure. TP is a detected region that corresponds to a moving object; FP is a detected region that does not correspond to any moving object; and FN is a moving-object region that is not detected. The accuracy measures are computed as

$$PR = \frac{TP}{TP + FP}, \tag{30}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{31}$$

$$\text{F-measure} = 2 \times \frac{PR \times \text{Recall}}{PR + \text{Recall}}.\tag{32}$$
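Equations (30)–(32) can be computed directly from the raw counts. The following snippet is a straightforward implementation; the counts used in the example are illustrative and are not taken from the paper's tables.

```python
def detection_metrics(tp, fp, fn):
    """Precision rate (PR), recall, and F-measure from detection counts,
    following Equations (30)-(32)."""
    pr = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * pr * recall / (pr + recall)
    return pr, recall, f_measure

# Illustrative counts (not from Table 2)
pr, recall, f = detection_metrics(tp=94, fp=6, fn=9)
print(round(pr, 2), round(recall, 2), round(f, 2))  # -> 0.94 0.91 0.93
```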

Although many studies have addressed the same problem (moving-object detection with a moving camera), the proposed method achieves real-time computation in a real environment with a complex background. The detection results also show that the proposed method detects moving objects with high accuracy even though the UAV undergoes unwanted motion and vibration. Table 3 compares the computation time and accuracy of the proposed method with the methods in [23–26,28]. The proposed method achieved an average precision rate of 0.94 and a recall of 0.91. Action1 had the highest PR and recall among the videos because it contained only a few, relatively large objects. Action2 had the lowest PR because it contained many objects resembling people and cars, such as trees, fences, road signs, houses, and bushes. Action3 had the lowest recall because some small objects in the video had only slight displacements.

**Table 1.** Computation time performance in frames per second (fps).



**Table 2.** Detection results performance.



The method in [27] reported neither the accuracy of the detected moving objects nor the computation time; it focused on using optical flow to describe the direction of pixel movement. However, the method in [27] is suitable only for images with a homogeneous background. In our case, the moving camera produces pixel movements on many background objects that have no correlation with the moving objects, a condition that occurs in image sequences with complex backgrounds such as our datasets. Thus, the method in [27] is not applicable to our datasets. In addition, we used a simple dense optical flow, which is sufficient to calculate the motion-vector fields between two consecutive frames and has a fast computation time. We then applied the classification, which can distinguish the motion vectors of static and dynamic objects, to determine MD in the background and foreground.
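The dense motion-vector field between two consecutive frames can be illustrated with a minimal example. The sketch below uses exhaustive block matching with a sum-of-absolute-differences cost as one concrete stand-in for a simple dense optical flow; the block size, search range, and cost function are our assumptions and do not necessarily match the paper's implementation.

```python
import numpy as np

def block_flow(prev, curr, block=8, search=4):
    """Dense motion-vector field between two consecutive grayscale frames,
    estimated by exhaustive block matching with a sum-of-absolute-differences
    (SAD) cost. Returns one (dy, dx) vector per block, in row-major order."""
    h, w = prev.shape
    vectors = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            ref = prev[y:y + block, x:x + block].astype(np.int32)
            # Start from zero displacement so flat regions stay at (0, 0).
            best = np.abs(ref - curr[y:y + block, x:x + block].astype(np.int32)).sum()
            best_v = (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        cand = curr[yy:yy + block, xx:xx + block].astype(np.int32)
                        sad = np.abs(ref - cand).sum()
                        if sad < best:
                            best, best_v = sad, (dy, dx)
            vectors.append(best_v)
    return np.array(vectors)

# A bright square shifted 2 px to the right between frames.
prev = np.zeros((32, 32), np.uint8); prev[8:16, 8:16] = 255
curr = np.zeros((32, 32), np.uint8); curr[8:16, 10:18] = 255
vecs = block_flow(prev, curr)
print(vecs[5])  # block containing the square (row 1, col 1 of the 4x4 grid) -> [0 2]
```

In practice a pyramidal or polynomial-expansion method (e.g., Farneback's algorithm) would be faster and denser; the exhaustive search here is chosen only for clarity.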

The proposed method can be applied to various moving objects, not only people and cars. In this work, we used people and cars to test the method because these objects are frequently investigated as moving objects captured by moving cameras [23–29]. High-frequency jitter, small objects, and low-quality images make moving-object detection from UAVs a difficult task, but the proposed framework resolves these problems. Furthermore, a machine-learning approach is used to detect and recognize the foreground because it can run on almost any processor without a GPU. The method is intended for use on a PC or an on-board system: if the images captured by the UAV can be transmitted to a ground station such as a PC via a wireless camera, or to an additional on-board computer such as a Raspberry Pi, then the images can be processed online and in real time.

Based on information from the datasets and previous studies [23–29], we conclude that the proposed algorithm is applicable under the following conditions: the UAV altitude is less than 500 feet and its speed is less than 15 m/s. In addition, based on our experimental results, the algorithm performed best at video frame rates below 50 fps.

### **4. Conclusions**

A novel method for detecting multiple moving objects using UAVs was presented in this paper. The main contribution of the proposed method is the detection and recognition of moving objects from a UAV with a moving camera, with excellent accuracy and real-time applicability. An image stabilization method was used to remove unwanted motion from the aerial images so that a significant difference in motion vectors could be obtained to distinguish between static and dynamic objects. The object detection used to determine the regions of moving-object candidates had a fast computation time and good accuracy on complex backgrounds. Some false detections were handled by motion-vector classification, in which an object whose movement direction is similar to that of the background is removed from the moving-object candidates. Based on comparisons over various sequences of aerial images, the proposed method is a promising candidate for real-time application in real environments.

**Author Contributions:** W.R. contributed to the conception of the study and wrote the manuscript, performed the experiment and data analyses, and contributed significantly to algorithm design and manuscript preparation. W.-J.W. and H.-C.C. helped perform the analysis with constructive discussions, writing, review, and editing.

**Funding:** We would like to thank the Ministry of Science and Technology of Taiwan for supporting this work under grant 108-2634-F-008-001.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
