## *2.2. Convolutional Neural Networks*

CNNs applied to image analysis usually constrain their input to a three-dimensional volume defined by three parameters: width, height and colour channels (depth). The network is composed of three types of layers: convolutional, activation and pooling. The task of each layer is:

- Convolutional layers apply a set of learned filters to the input, producing feature maps that capture local patterns.
- Activation layers apply a non-linear function element-wise, allowing the network to model non-linear relationships.
- Pooling layers downsample the feature maps, reducing their spatial resolution and providing a degree of translation invariance.

The concatenation of these structures allows the system to learn features at different levels of abstraction and to generate more precise predictions than traditional methods based on low-level features such as colour. Consequently, the application of convolutional neural networks has revolutionized image processing. A more in-depth explanation of these networks can be found in [10].
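As an illustration, one convolution–activation–pooling stage can be sketched in a few lines of NumPy; the 8 × 8 input, 3 × 3 kernel and 2 × 2 pooling size are arbitrary toy choices, not values from any of the cited architectures.

```python
import numpy as np

def conv2d(image, kernel):
    """Convolutional layer: valid 2-D cross-correlation with one filter."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Activation layer: element-wise non-linearity."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Pooling layer: downsample by taking the max of each size x size block."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# One conv -> activation -> pooling stage on a toy 8x8 single-channel image.
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
feature_map = max_pool(relu(conv2d(image, kernel)))
print(feature_map.shape)  # (3, 3): 8x8 -> conv 6x6 -> pool 3x3
```

Stacking several such stages is what lets a CNN progress from low-level to high-level features.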

## *2.3. CNN-Based Change Detection*

As mentioned in the introductory section, researchers have developed CNN-based methods to detect changes in images. Two methods have been studied and analysed to gain insight into the architectures and accuracy levels that constitute the current state of the art. In [3], both the background and foreground images are sliced into patches of reduced dimensions. The two patches (background and foreground) are concatenated along the depth axis, producing a single structure that is fed to the CNN. Convolution applied to images stacked depth-wise produces faster and more suitable results than stacking along any other dimension, as detailed in [11]. Given the concatenated structure as input, this model returns a binary black-and-white patch, with the detected foreground zones in white and the background zones in black. Nevertheless, the restriction on using a moving camera remains unsolved, as the system is intended for static acquisition devices. In [12], a variation of the Visual Geometry Group network (VGGNet) architecture is connected to a deconvolutional neural network. VGGNet is a state-of-the-art implementation developed by the Visual Geometry Group (VGG) at the University of Oxford [13]. The resulting architecture produces an image of the same dimensions as the input picture; to achieve this, the number of deconvolution layers matches the number of convolutional layers in the VGGNet. As a result, the model returns a binary image of identical size to the input image. The stated method does not employ a background image: it uses only the foreground and, for training purposes, the ground-truth images. Two conclusions can be drawn from this: first, without a background reference, the system can be used with moving cameras if the dataset is properly constructed; second, the algorithm has to learn a different background for each scene. Therefore, the system must be retrained for each scene, which slows processing in real-world applications. From the results of both methods, it may be concluded that CNNs have considerably improved the accuracy of traditional change detection algorithms.
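The depth-wise concatenation used in [3] can be illustrated with a short NumPy sketch; the 32 × 32 patch size and the constant-valued patches are illustrative assumptions, not values taken from the cited work.

```python
import numpy as np

# Hypothetical 32x32 RGB patches cropped from the background and the
# foreground frame at the same image location.
background_patch = np.zeros((32, 32, 3), dtype=np.uint8)
foreground_patch = np.full((32, 32, 3), 255, dtype=np.uint8)

# Stack along the depth (channel) axis: the CNN then receives a single
# 6-channel input instead of two separate 3-channel images.
cnn_input = np.concatenate([background_patch, foreground_patch], axis=-1)
print(cnn_input.shape)  # (32, 32, 6)
```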

## *2.4. Image Matching Techniques*

In stationary camera situations, acquiring a precise reference image is an effortless task, while in moving camera scenarios it remains a challenging problem. One of the recent advances in UAV technology is the automation of flights [14]. This system provides human–UAV interaction by allowing the UAV's routes to be programmed, specifying the coordinates at each point and other parameters such as speed, height, mode of flight or behaviour in case of possible events. However, this innovation is subject to the real-world conditions that affect the behaviour of UAVs. As explained in Section 1, the acquired images are assumed to vary between two flights on the same route. This can be caused by multiple factors, in particular variations in the UAV's height, position errors due to GPS precision, wind conditions or movement of the camera. Consequently, the reference pictures obtained on one of those flights are not circumscribed to exactly the same area. In this paper we have selected an approach based on the ORB (oriented FAST and rotated BRIEF) algorithm [15] to solve the alignment problem. The ORB descriptor is based on the BRIEF (binary robust independent elementary features) [16] and FAST (features from accelerated segment test) [17] algorithms. This descriptor extracts the key points of an image using the FAST method and employs a Harris corner measure [18] to select the optimal points. To obtain the orientation (Equation (3)), ORB uses a rotation matrix based on the intensity centroid [19] (Equation (2)), which is computed from the image moments (Equation (1)). The moments are obtained as:

$$m_{pq} = \sum_{x,y} x^p y^q\, I(x, y), \tag{1}$$

where *I*(*x*, *y*) represents the pixel intensity, *x* and *y* denote the coordinates of the pixel and, finally, *p* and *q* indicate the order of the moments. The centroid is obtained from the calculated moments as:

$$C = \left(\frac{m_{10}}{m_{00}},\ \frac{m_{01}}{m_{00}}\right), \tag{2}$$

Finally, a vector from the corner's centre to the centroid is constructed. The orientation is:

$$\theta = \operatorname{atan2}(m_{01}, m_{10}), \tag{3}$$

where atan2 is the quadrant-aware version of the arctangent, given by Equation (4):

$$\operatorname{atan2}(y, x) = \begin{cases} \arctan(y/x) & x > 0, \\ \arctan(y/x) + \pi & y \ge 0,\ x < 0, \\ \arctan(y/x) - \pi & y < 0,\ x < 0, \\ \pi/2 & y > 0,\ x = 0, \\ -\pi/2 & y < 0,\ x = 0, \\ \text{undefined} & x = y = 0. \end{cases} \tag{4}$$
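Equations (1)–(3) can be checked with a small NumPy sketch; the 5 × 5 toy patch and the pixel-coordinate convention (origin at the top-left corner) are illustrative assumptions.

```python
import numpy as np

def orientation(patch):
    """Patch orientation from the intensity centroid, Equations (1)-(3)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    m10 = (xs * patch).sum()      # first-order moment in x, Equation (1)
    m01 = (ys * patch).sum()      # first-order moment in y
    return np.arctan2(m01, m10)   # quadrant-aware arctangent, Equation (3)

# Toy patch with all intensity in the lower-right corner: the centroid lies
# on the main diagonal, so the orientation is pi/4.
patch = np.zeros((5, 5))
patch[4, 4] = 1.0
print(orientation(patch))  # ~0.7854 rad
```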

Other descriptors considered during development were the SIFT (scale-invariant feature transform) [20] and SURF (speeded-up robust features) [21] algorithms. The SIFT algorithm transforms a picture into feature vectors, each of which is invariant to image translation, scaling and rotation. It then compares each vector from the new image to those obtained from the reference picture and proposes candidate matches based on Euclidean distance. The SURF algorithm obtains its feature vectors from the sum of Haar wavelet responses around points of interest, which are detected beforehand using an integer approximation of the determinant of the Hessian matrix. However, the ORB descriptor performs two orders of magnitude faster than SIFT and several times faster than SURF. Because of this performance difference, we selected ORB to process our videos and reduce the computational cost of the system.
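Part of ORB's speed advantage comes from its binary descriptors, which are matched with Hamming distance instead of Euclidean distance. The sketch below performs brute-force Hamming matching between two sets of synthetic 256-bit descriptors; the descriptor length, noise level and distance threshold are arbitrary assumptions, and this is not the actual ORB implementation.

```python
import numpy as np

def hamming(d1, d2):
    """Hamming distance between two binary descriptors (number of differing bits)."""
    return int(np.count_nonzero(d1 != d2))

def match(descs_a, descs_b, max_dist=64):
    """Brute-force nearest-neighbour matching by Hamming distance,
    as used for binary descriptors such as ORB's."""
    matches = []
    for i, da in enumerate(descs_a):
        dists = [hamming(da, db) for db in descs_b]
        j = int(np.argmin(dists))
        if dists[j] <= max_dist:
            matches.append((i, j, dists[j]))
    return matches

rng = np.random.default_rng(1)
descs_a = rng.integers(0, 2, size=(4, 256), dtype=np.uint8)  # 256-bit descriptors
noise = rng.random((4, 256)) < 0.02                          # flip ~2% of the bits
descs_b = descs_a ^ noise                                    # slightly corrupted copies
print(match(descs_a, descs_b))
```

Because corrupted copies differ in only a few bits while unrelated descriptors differ in roughly half, each descriptor matches its own counterpart.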

# **3. Methodology**

The image alignment techniques employed in our approach are explained in Section 3.1. After that, the architecture of the deep neural network model is explained in detail in Section 3.3. The dataset generation process is described in Section 3.4. Subsequently, the training process is detailed in Section 3.5. Lastly, the post-processing step is described in Section 3.6, which explains how the desired image is obtained from the model output. In Figure 1, a block diagram of the system is shown to detail our pipeline.

**Figure 1.** Block diagram of the system. Both the reference and foreground images are introduced into our image alignment system. After that, a sliding-window algorithm is applied. The two resulting patches are concatenated along the depth axis. Finally, our CNN model predicts a grayscale image, which is post-processed to obtain the final binary patch depicted.
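The sliding-window step of the pipeline can be sketched as follows; the 32 × 32 window and the non-overlapping stride are illustrative choices, not necessarily the parameters used in our system.

```python
import numpy as np

def sliding_windows(image, size=32, stride=32):
    """Yield (row, col, patch) tuples covering the image with fixed-size windows."""
    h, w = image.shape[:2]
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            yield r, c, image[r:r + size, c:c + size]

# A 64x96 RGB frame yields a 2x3 grid of non-overlapping 32x32 patches.
image = np.zeros((64, 96, 3), dtype=np.uint8)
patches = list(sliding_windows(image))
print(len(patches))  # 6
```

The same (row, col) offsets are used on both the reference and foreground images, so the two patches of each pair cover the same scene area before being concatenated depth-wise.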
