#### *3.1. Image Alignment*

As stated in previous sections, our system is designed for moving cameras. Our objective is to compare two video sequences (background and foreground) to detect the regions that have changed between them. For implementation purposes, a reference video of the UAV's route is required, which is considered the background. More recent videos of the same route constitute our foreground scenarios. In real-world conditions, the new images will not be perfectly aligned with the reference video, for reasons such as limited GPS precision or weather variations. To solve this problem, an image alignment module has been developed using ORB [15], following an idea similar to the feature alignment performed in [22]. The reference image and the new route's image are compared: feature extraction is performed with the ORB algorithm to obtain the most significant regions of each picture, and a descriptor matcher from [23] relates the descriptors of both images by their distance. The matches are filtered and sorted to keep the closest corresponding keypoints of both images. Lastly, a geometric transformation is applied to generate a modified version of the acquired image that is aligned with the reference. An example of the results obtained by this module is depicted in Figure 2. As can be observed, the resulting image contains a black zone corresponding to pixels of the foreground image that are not covered by the reference. To prevent false positives caused by these black zones, only the central section of the image is automatically selected for change detection. The alignment module constitutes an innovation over implementations such as [3,12], which assume a static camera.

**Figure 2.** Example of our image alignment module's output for a foreground image that differs significantly from the reference.

#### *3.2. Sliding Window*

The image provided to our system can vary in size depending primarily on the UAV's camera or on processing-time requirements. For instance, the processing requirements of a real-time detection system differ from those of a post-flight analysis. Moreover, deep learning models usually struggle with very high-resolution images. Our approach to these problems is described in this subsection. First, the maximum input size is defined as a parameter of the system; if the input images exceed the defined dimensions, they are resized to the specified size. After that, the pictures are divided into small regions. This is a consequence of our network's requirement to process images in a reasonable time, and it also allows us to obtain pixel-level precision results.

To do so, we have employed a sliding window algorithm. This algorithm iterates over the image's dimensions, retrieving at each position a matrix of a specific size (the window) containing a region of the initial image. The window has identical width and height to avoid complicating the segmentation process. The other parameter of the algorithm is the step, which represents the distance, in each dimension, from the starting point of the window in iteration i to the starting point of the window in iteration i + 1. The step can be configured independently for each dimension and, together with the window size, is a crucial factor in the computational cost. With these two parameters we can control the overlapping of adjacent windows. If overlapping occurs, the system benefits from increased precision in the analysis; in our case, four predictions are generated for most of the image, except at the limits of each dimension. In conclusion, varying both parameters provides an efficient tool that adapts the algorithm's performance to multiple purposes: real-time segmentation, maximum-precision prediction or sample generation. The application of this algorithm to both the reference image and the foreground image is depicted in Figure 3.
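The algorithm above can be sketched in a few lines; the window and step values shown are illustrative:

```python
import numpy as np

def sliding_window(image, window=64, step=32):
    """Yield (x, y, patch) tuples covering `image` with square windows.

    A step smaller than the window produces overlapping patches; with
    step = window // 2, most pixels are covered by four windows.
    """
    h, w = image.shape[:2]
    for y in range(0, h - window + 1, step):
        for x in range(0, w - window + 1, step):
            yield x, y, image[y:y + window, x:x + window]
```

For example, on a 128 × 128 image a window of 64 with step 32 yields 9 overlapping patches, while step 64 yields 4 non-overlapping ones.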

**Figure 3.** Representation of the sliding window and depth concatenation process on a reference image and a foreground image.

#### *3.3. Deep Neural Network Architecture*

The model input is the concatenation of two images: the reference background scene and the updated scene image, which may contain changes. Both images are merged along the depth dimension, as is done in other state-of-the-art methods such as [3,4,12]. This input form allows the model to learn hierarchical features, making CNNs an effective tool to obtain relevant information from images. The CNN architecture is described in detail in Tables 1 and 2, and Figure 4 illustrates the complete architecture. The deep neural network is composed of four convolutional blocks with an increasing number of filters and a kernel size of 3 × 3 pixels. As our inputs consist of images with reduced dimensions, we employ this kernel size to extract features in as much detail as possible and obtain a precise detection. For the activation layers of the CNN, we have used the Rectified Linear Unit (ReLU) activation [24], which applies Equation (5) to its inputs. This function is widely implemented in CNNs, as mentioned in Section 2.2, because of its reduced computational cost and the acceleration of the optimization process. Each activation layer is followed by a max-pooling layer with a 2 × 2 kernel.

$$R(z) = \max(0, z) \tag{5}$$

After the last convolutional block, its output is flattened into a vector. The first fully connected layer is combined with a dropout layer [25] to avoid overfitting [26]. Continuing the structure, we employ batch normalization [27], which subtracts the mean of its inputs and divides them by the standard deviation, as in Equation (6):

$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \tag{6}
$$

denoting *µ<sub>B</sub>* as the average value of the batch, *σ<sub>B</sub>* as the standard deviation of the batch, and *ε* a constant added for numerical stability. The dropout layer deactivates part of the neurons of our densely connected layer during training. As a result, the model improves its generalization, as the layer is forced to predict the same output using different neurons. The aim of batch normalization is to increase the stability of the model by normalizing the output of the previous layer. As a consequence, the model is more adaptable to new scenarios, which is one of the essential features of the proposed system. As our output consists of pixels from changed regions or pixels from unchanged regions, we face a binary classification problem. Therefore, the final output layer uses a sigmoid activation function, as depicted in Equation (7):

$$
\sigma(z) = \frac{1}{1 + e^{-z}} \tag{7}
$$

The resulting normalized vector is the initial prediction of our system and constitutes the input to the post-processing methods described in Section 3.6. From this vector we construct a one-channel image that represents the changes between the reference image and the updated image.
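The architecture described above can be sketched in Keras as follows; the filter counts, dropout rate and dense-layer width are illustrative placeholders for the exact values listed in Tables 1 and 2:

```python
from tensorflow.keras import layers, models

def build_model(window=64, filters=(32, 64, 128, 256), dense_units=1024):
    """Sketch of the described architecture (illustrative hyperparameters)."""
    inputs = layers.Input(shape=(window, window, 6))  # two RGB images stacked in depth
    x = inputs
    # Four convolutional blocks: 3 x 3 kernels, ReLU (Equation (5)), 2 x 2 max-pooling.
    for f in filters:
        x = layers.Conv2D(f, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)                 # vectorize the last block's output
    x = layers.Dropout(0.5)(x)              # regularization against overfitting
    x = layers.Dense(dense_units, activation="relu")(x)
    x = layers.BatchNormalization()(x)      # Equation (6)
    # One sigmoid output (Equation (7)) per pixel of the 64 x 64 window.
    outputs = layers.Dense(window * window, activation="sigmoid")(x)
    return models.Model(inputs, outputs)
```

The 4096-element sigmoid output corresponds to the per-pixel change prediction for a 64 × 64 input window.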

**Figure 4.** Diagram of the deep neural network architecture used in our system, containing layer types and dimensions.

**Table 1.** Layer description of the convolutional neural networks (CNNs) which constitute the first part of our model.



**Table 2.** Layer description of the fully connected layers which define the final part of our model.

#### *3.4. Dataset*

To train our model, we have selected images from the CD2014 dataset [1]. This dataset contains images categorized into: "Baseline", "Dynamic Background", "Camera Jitter", "Intermittent Object Motion", "Shadows", "Thermal", "Bad Weather", "Low Framerate", "PTZ", "Night" and "Air Turbulence". We have picked "Bad Weather", "Dynamic Background" and "Intermittent Object Motion" for our training process. This selection is based on the similarity of these images to the real conditions where our model is to be applied. As stated before, the target of this system is to perform change detection on images obtained under variable weather and in dynamic scenarios. Therefore, the mentioned categories provide a solid base for our model to learn from.

To build our dataset, we have selected one background image from each category to serve as the background for every foreground picture. This provides a unique reference for the complete dataset and prepares the model for slight changes, as mentioned in previous sections. To do so, each background image has been replicated to obtain a reference for each foreground picture. Because of computational limitations, the input size of our model is restricted: two complete images could not be used as input. Therefore, each image must be divided into slices of reduced size. Image preprocessing has been applied to the CD2014 dataset using OpenCV [23]. This library allows us to resize the input images to dimensions that are multiples of our desired input size, 64 × 64. As a result, we generate blocks of 64 × 64 pixels from the three images: reference, target and ground truth. This is achieved using the sliding window algorithm explained in Section 3.2. The resulting patches are automatically labelled using the original image name, the category and the region position to form a unique identifier. With this implementation, the image order is preserved and training can be performed correctly.
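The patch extraction and labelling step can be sketched as follows; the identifier format and the helper name are illustrative assumptions:

```python
import numpy as np

def make_patches(reference, target, ground_truth, name, category, size=64):
    """Slice three aligned images into size x size blocks and label each patch
    with a unique identifier built from image name, category and position."""
    h, w = reference.shape[:2]
    h, w = (h // size) * size, (w // size) * size  # crop to multiples of the patch size
    patches = {}
    for y in range(0, h, size):
        for x in range(0, w, size):
            key = f"{category}_{name}_y{y}_x{x}"  # unique, order-preserving identifier
            patches[key] = (reference[y:y + size, x:x + size],
                            target[y:y + size, x:x + size],
                            ground_truth[y:y + size, x:x + size])
    return patches
```

Sorting the keys recovers the original patch order, which is what allows the full image to be reassembled after prediction.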

#### *3.5. Training*

Our system has been implemented in Keras [28] using TensorFlow as backend [29]. As mentioned in previous sections, the model input is the concatenation of the reference and the foreground image along the depth axis. The model output is a vector with a length of 4096 elements. To obtain this structure, we have to process the original ground-truth images, which are provided by the CD2014 dataset as grayscale images: black pixels represent unmodified regions, white pixels depict altered zones, and grey pixels indicate the border between an altered and an unaltered zone. The image processing has been performed using the ImageDataGenerator class from Keras and Numpy [30]. ImageDataGenerator allows Keras to select batches of data for training the model instead of loading the complete dataset into memory. As a result, we obtain an iterative component called a generator. These generator structures are widely implemented in Keras, and training methods are no exception; with this methodology, extensive datasets are easy to handle. Generators can also be easily customized. In our case, we have performed two significant customizations: first, we concatenate both input images into a unique structure of size 64 × 64 × 6 as the model input; then, the output images are transformed into vectors using the Numpy API. The final dataset is detailed in Table 3. In summary, the training dataset comprises 130,476 patches, 43,492 each for background, foreground and ground truth, with a proportion of 80% for training and 20% for evaluation. An additional 6750 patches from the "Intermittent Object Motion" category have been included for metrics evaluation. The training loss is illustrated in Figure 5: the X-axis represents the training steps (in thousands) and the Y-axis the loss value at a given step. The graph has been obtained using TensorBoard, which has been employed to analyse the training process.
The model has been trained on an NVIDIA 1080TI GPU for 12 epochs because of the implementation of early stopping [31].
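The two generator customizations described above can be sketched as follows, assuming a hypothetical `pair_generator` that yields batches of aligned reference, foreground and ground-truth patches (e.g. built on Keras' ImageDataGenerator):

```python
import numpy as np

def change_detection_batches(pair_generator):
    """Wrap a (reference, foreground, ground_truth) batch generator into the
    (input, target) form the model expects. Sketch under stated assumptions."""
    for ref, fg, gt in pair_generator:
        # Concatenate along the depth axis: (N, 64, 64, 3) x 2 -> (N, 64, 64, 6).
        x = np.concatenate([ref, fg], axis=-1).astype(np.float32) / 255.0
        # Flatten each 64 x 64 ground-truth patch into a 4096-element vector,
        # mapping white (changed) pixels to 1 and everything else to 0.
        y = (gt.reshape(gt.shape[0], -1) > 127).astype(np.float32)
        yield x, y
```

The binarization threshold (127) is an assumption; grey border pixels in the CD2014 ground truth must be mapped to one of the two classes or masked out.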

**Table 3.** Description of the complete dataset used to train our model. Columns represent the category of the images, the number of complete images used, the size of each image and the resultant number of 64 × 64 patches used to train the algorithm.


**Figure 5.** Training loss curve, representing the loss value per thousand steps.

#### *3.6. Post Processing*

After the deep learning model is applied to both images, the resulting grayscale patches are processed. As stated before, the system outputs a set of 64 × 64 grayscale images, whose number is proportional to the image dimensions as a result of the sliding window algorithm applied to the inputs (Section 3.2). The full-size image is obtained by inverting the sliding window algorithm: each 64 × 64 patch is placed in a new blank image at the position of the corresponding input patch in the original image. If overlapping was selected in the sliding window algorithm, the post-processing method divides the accumulated values by the overlapping factor during this composition. With this approach, overlapping allows us to trade precision against computational cost, yielding a more flexible implementation. After all the patches have been placed, we obtain a grayscale image with the dimensions of the system's inputs. This image is then introduced into the filtering component of the post-processing module. Figure 6 represents an example output prior to the filtering component.
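The inverse composition step can be sketched as follows, assuming the (x, y) patch origins produced by the sliding window of Section 3.2:

```python
import numpy as np

def compose_prediction(patches, image_shape, window=64):
    """Invert the sliding window: accumulate each predicted patch at its
    original position and divide by the per-pixel overlap count."""
    canvas = np.zeros(image_shape, np.float32)
    counts = np.zeros(image_shape, np.float32)
    for x, y, pred in patches:  # same (x, y) origins the sliding window produced
        canvas[y:y + window, x:x + window] += pred
        counts[y:y + window, x:x + window] += 1.0
    return canvas / np.maximum(counts, 1.0)  # average overlapping predictions
```

Dividing by the per-pixel count generalizes the fixed overlapping factor: pixels near the image limits, covered by fewer windows, are still averaged correctly.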

A pixel-intensity threshold is applied to filter possible noise effects such as blocking artifacts or insignificant detected changes in the image. Subsequently, a morphological dilation is performed on the resulting image. The objective of the dilation is to fill the possible gaps produced by noise in changed regions: since the threshold typically removes some information in order to guarantee the complete elimination of disturbances, the dilation expands the borders of the most relevant regions to recover the information lost in the thresholding step. The dilation operator follows the formula:

$$(f \oplus b)(\mathbf{x}) = \sup_{y \in E} \left[ f(y) + b(\mathbf{x} - y) \right] \tag{8}$$

denoting by *f(x)* the image and by *b(x)* the structuring function, both defined from the Euclidean space *E* into the set of real numbers.

The effect of this filtering on the images is depicted in Figure 7.

**Figure 6.** Image resulting from the combination of the 64 × 64 grayscale patches obtained from the CNN predictions.

**Figure 7.** Final output of our system after the filtering part of our post-processing module is applied to Figure 6.

#### **4. Experimental Results**

Model results are described in Section 4.1, along with the metrics used to analyse the model's performance. In Section 4.2, we compare our solution with several state-of-the-art implementations for background subtraction and change detection trained on the CD2014 dataset.

#### *4.1. Evaluation Metrics*

To compare our system with several state-of-the-art methods, we have selected multiple metrics that characterize the performance of the models. All of them are based on four elemental concepts. We define true positives (TP) as the correctly classified changed pixels. True negatives (TN) represent the unmodified pixels which have been correctly predicted. False positives (FP) are the pixels incorrectly classified as changed. Lastly, false negatives (FN) represent the incorrectly labelled background pixels. We have selected the metrics from [1], as all the methods based on this dataset compute them. These metrics, defined in terms of the previous concepts, are:


In Figure 8 we have represented examples of the model's performance in different image categories. Similarly, Figure 9 depicts examples of our model's performance on UAV imagery. Finally, Table 4 shows the results of the stated metrics for each scenario of our dataset. It should be noted that recall and precision are subsumed by the F-measure. From the metrics obtained, we can observe that the system achieves excellent scores in complex scenarios such as a snow blizzard.
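The four concepts defined in Section 4.1 translate directly into the reported metrics; a minimal sketch for binary masks:

```python
import numpy as np

def change_detection_metrics(prediction, ground_truth):
    """Compute recall, precision and F-measure from binary change masks,
    based on the TP/TN/FP/FN definitions given above."""
    pred = prediction.astype(bool).ravel()
    gt = ground_truth.astype(bool).ravel()
    tp = int(np.sum(pred & gt))    # changed pixels correctly detected
    fp = int(np.sum(pred & ~gt))   # background pixels flagged as changed
    fn = int(np.sum(~pred & gt))   # changed pixels missed
    recall = tp / max(tp + fn, 1)
    precision = tp / max(tp + fp, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"recall": recall, "precision": precision, "f_measure": f_measure}
```

The CD2014 evaluation [1] also reports specificity, error rates and percentage of wrong classifications, which follow the same pattern from TP, TN, FP and FN.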

**Figure 8.** Results of our system applied to CD2014 images. Column (**a**) represents the image where change detection is applied, column (**b**) represents the ground truth images given by the CD2014 dataset, (**c**) the prediction images from our model and (**d**) the original images with the bounding boxes obtained by the prediction process.

**Figure 9.** Results of our system applied to images acquired from an unmanned aerial vehicle (UAV), not included in the dataset. In this case, column (**a**) represents the foreground image, column (**b**) the predicted binary image and column (**c**) the original images with bounding boxes, as in the previous figure.


**Table 4.** Metrics scores of our proposed solution on each CD2014 selected category.

#### *4.2. Comparison with Other Change Detection Systems*

As mentioned in the introductory part of this section, we have selected various implementations to compare against our system. We have based our selection on the use of the CD2014 dataset, to obtain results that are as objective as possible. Moreover, we have chosen algorithms based on different technologies, such as traditional image processing techniques (SuBSENSE [8]), other traditional mathematical implementations (GMM [6]) and convolutional neural networks (ConvNet-IUTIS, ConvNet-GT [4]). Our purpose is to highlight the performance of CNNs over traditional methods; by also comparing against CNN-based implementations, we can draw valuable conclusions about our system's performance. The traditional solutions mentioned have been selected because of their impact on background subtraction, as they are frequently compared in most papers dealing with this subject. Table 5 presents the results of this comparison, using the F-measure as a reference for performance; only scenarios that have been tested with our model are compared. From the table, we can observe that our model achieves higher precision than other state-of-the-art implementations designed for static cameras. Therefore, our change detection component offers an accurate solution for this purpose. The combination of this development with the image alignment methods provides a state-of-the-art solution for detecting modified regions in images acquired by a moving camera.

**Table 5.** F-measure scores of our implementation and seven other state-of-the-art implementations on the CD2014 dataset, for the categories "Bad Weather", "Dynamic Background" and "Intermittent Object Motion".


#### **5. Discussion**

The primary purpose of this paper is to describe the development of a new system for change detection in UAV images using a combination of image alignment and CNNs. The most significant difference between our proposed solution and other state-of-the-art methods is the inclusion of the mentioned image alignment system, which adjusts images from moving cameras. Figure 10 depicts the improvement in precision produced by the image alignment component. An additional contribution of our system is the improvement in precision it provides in the studied scenarios, as reflected in Table 5. For these reasons, we can affirm that our CNN architecture and training have been implemented effectively.

As mentioned before, the system employs a reference sequence to detect the changes. This imposes constraints on the method, as the reference must represent the ideal status of the recorded scenario; that is to say, all the elements included in this sequence are considered background. Prior to this approach, a reconstruction method similar to [2] was implemented, but it was discarded because of its unsatisfactory precision and the computational cost involved.

An additional point to discuss is the multiple variations that can appear between two UAV flights. Flying at different heights is one of the variations that can occur in a real scenario. In this case, our system compensates for it by using the image alignment component to select the most relevant elements present in both the reference and foreground images, after which it automatically selects the central region of both images for analysis.

As can be observed in Table 5, the implementations from [4] have no F-measure score for the "Intermittent Object Motion" category, as these methods do not consider this particular type of scenario. However, the results they obtain in the other two categories position them as relevant state-of-the-art implementations against which to compare our system.

**Figure 10.** Effect of the image alignment component on the system's output. (**a**) shows the resulting image with image alignment applied, while (**b**) shows the results obtained without the component. As can be observed, the aligned image detects changed elements (in this case, the two red boxes) more precisely.

#### **6. Conclusions**

In this paper we have presented a change detection system for static and moving cameras using image alignment based on the ORB algorithm and convolutional neural networks. Because of the use of UAV imagery acquired by moving cameras, the problem of dynamic backgrounds has been addressed. As we have detailed throughout the paper, our major improvement over other state-of-the-art implementations consists of the use of an image alignment process, whose objective is to compensate for the possible variations between UAV flights described in previous sections. In addition, the inclusion of the sliding window algorithm reduces the computational cost of the CNN model by reducing the dimensions of the input images. Moreover, this method adds versatility to the system, as the sliding window can be adjusted to provide overlapping sections that improve accuracy at the expense of increased computational cost. As far as we know, a moving camera scenario has not been taken into account in any of the state-of-the-art methods for change detection compared in this paper; only dynamic backgrounds on static cameras have been studied in the mentioned implementations. Our system is capable of adapting to these conditions by using image alignment techniques and the concept of a reference video or image. Datasets with dynamic backgrounds have been selected to train the network, in order to achieve meaningful outcomes for real-world applications. The experimental results indicate precise detection in scenarios with adverse weather, such as snowfall or a blizzard, and the comparison with other state-of-the-art methods shows that our system is the most accurate in the studied scenarios.

#### *Future Work*

The precision of the reference's acquisition is crucial for the system's performance. As a solution, we are working on GPS data processing to improve the alignment system, as in [22]. In addition, we consider the inclusion of deep learning in the alignment process as another of our future lines of work, with the objective of comparing the performance of deep learning against our current ORB-based image alignment system.

**Author Contributions:** Data curation, V.G.R.; Formal analysis, J.A.R.F. and J.M.M.G.; Funding acquisition, N.S.A. and J.M.L.M.; Investigation, V.G.R.; Methodology, V.G.R., J.A.R.F., J.M.M.G. and N.S.A.; Project administration, N.S.A. and J.M.L.M.; Resources, N.S.A. and J.M.L.M.; Software, V.G.R.; Supervision, J.A.R.F., J.M.M.G. and F.Á.; Validation, V.G.R.; Visualization, V.G.R.; Writing–original draft, V.G.R.; Writing—review & editing, J.A.R.F., J.M.M.G., N.S.A., J.M.L.M. and F.Á.

**Funding:** This research was funded by the 5G Public Private Partnership (5G-PPP) framework and the EC funded project H2020 ICT-762013 NRG-5 (http://www.nrg5.eu) grant number 762013.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
