**SlimDeblurGAN-Based Motion Deblurring and Marker Detection for Autonomous Drone Landing**

### **Noi Quang Truong, Young Won Lee, Muhammad Owais , Dat Tien Nguyen, Ganbayar Batchuluun, Tuyen Danh Pham \* and Kang Ryoung Park**

Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea; noitq.hust@gmail.com (N.Q.T.); lyw941021@dongguk.edu (Y.W.L.); malikowais266@gmail.com (M.O.); nguyentiendat@dongguk.edu (D.T.N.); ganabata87@gmail.com (G.B.); parkgr@dongguk.edu (K.R.P.)

**\*** Correspondence: phamdanhtuyen@gmail.com; Tel.: +82-10-9264-4449; Fax: +82-2-2277-8735

Received: 11 June 2020; Accepted: 12 July 2020; Published: 14 July 2020

**Abstract:** Deep learning-based marker detection for autonomous drone landing is widely studied, owing to its superior detection performance. However, no previous study has addressed non-uniform motion-blurred input images, and most existing handcrafted and deep learning-based methods fail to operate on such challenging inputs. To solve this problem, we propose a deep learning-based marker detection method for autonomous drone landing that (1) introduces a two-phase framework of deblurring and object detection, adopting a slimmed version of the deblur generative adversarial network (DeblurGAN) model and a You Only Look Once version 2 (YOLOv2) detector, respectively, and (2) considers the balance between the processing time and accuracy of the system. To this end, we propose a channel-pruning framework for slimming the DeblurGAN model into SlimDeblurGAN, without significant accuracy degradation. The experimental results on the two datasets showed that our proposed method exhibited higher performance and greater robustness than the previous methods, in both deblurring and marker detection.

**Keywords:** unmanned aerial vehicle; autonomous landing; deep-learning-based motion deblurring and marker detection; network slimming; pruning model

#### **1. Introduction**

Unmanned aerial vehicles (UAVs) or drones are successfully used in several industries. They have a wide range of applications such as surveillance, aerial photography, infrastructure inspection, and rescue operations. These applications require that the onboard system can sense the environment, parse the scene, and react according to the parsing results. Scene parsing is a function that enables the system to understand the visual environment, such as recognizing the type of objects, the location of objects, and the regions of object instances in a scene. These problems are the main topics in computer vision—classification, object detection, and object segmentation. Object detection is a common topic and has attracted the most interest in recent studies. In object detection, traditional handcrafted feature-based methods showed limited performance [1–12]. A competitive approach is to apply deep-learning-based methods, which have gained popularity in recent years [13–16]. However, deploying deep learning models to a UAV onboard system raises new challenges—(1) difficulties of scene parsing in cases of low-resolution or motion-blurred input, (2) difficulties of deploying the model to an embedded system with limited memory and computation power, and (3) balancing between model accuracy and execution time.

Autonomous landing is a core function of an autonomous drone, and it has become an urgent problem to be solved in autonomous drone applications. Recently, deploying deep learning models to UAV systems has become more feasible, because of both the growth in computing power and the extensive studies of deep neural networks, which have achieved significant results in scene parsing tasks, such as object detection (e.g., faster region-based convolutional neural network (R-CNN) [17] and single-shot multibox detector (SSD) [18]). Therefore, the topic of autonomous drone landing has attracted much research interest, and the trend is toward autonomous landing by using deep-learning-based methods for tracking a guiding marker. Several state-of-the-art (SOTA) object detectors based on convolutional neural networks (CNNs) have been proposed and deployed successfully for marker detection in marker tracking tasks. You only look once (YOLO) models might be the most popular deep object detectors in practical applications, because their detection accuracy and execution time are well balanced. Nevertheless, those systems have low robustness and are prone to failure when dealing with low-resolution [16] or motion-blurred images [19]. Such inputs need to be preprocessed before being fed to the detector. Thus, using a combination of a few networks as a pipeline is a promising approach to achieve this goal. In addition, drone landing causes motion of the attached camera. Even if a drone has an antivibration damper gimbal, the recorded frames are affected by motion blurring, especially in the case of high-speed landing [20]. For this reason, marker detection with motion-blurred input is a critical problem that necessarily needs to be addressed.

Therefore, we propose an efficient motion deblurring and marker detection method for autonomous drone landing, which combines motion deblurring and object detection and applies a slimmed deblurring model to balance the system speed and accuracy on embedded edge devices. To this end, we trained the DeblurGAN network on our synthesized dataset and then pruned the model to obtain the slimmed version, SlimDeblurGAN. Moreover, we trained a variant of the YOLO detector on our synthesized dataset. Finally, we stacked SlimDeblurGAN and the detector, and then evaluated the system on a desktop PC and a Jetson TX2 embedded board. This research was novel compared to previous studies, in the following four ways:


#### **2. Related Works**

There are numerous studies on autonomous drone landing, which can be classified into two types—those not considering motion blurring and those considering motion blurring.

Not considering motion blurring: At the initial stages, researchers considered objects on the runway with a lamp to guide the UAV, to determine a proper landing area. Gui et al. [1] proposed a vision-based navigation method for UAV landing, by setting up a system in which a near-infrared (NIR) light camera was integrated with a digital signal processor, and a 940-nm optical filter was used to detect NIR light-emitting diode (LED) lamps on the runway. Their method had a significant advantage: it could work well not only in the daytime but also at nighttime. However, this method required a complicated setup of four LEDs on the runway. In addition, it could only be performed in a wide area; therefore, it failed to operate in narrow urban landing areas. Forster et al. [2] proposed a landing method that included generating a 3D terrain depth map from the images captured by a downward-facing camera, and determining a secure area for landing.

This method was shown to work well in both indoor and outdoor environments. Nevertheless, the depth estimation algorithm was only tested at a maximum range of 5 m, and the method exhibited a slow processing speed. Two limitations of markerless methods are the difficulty of spotting a proper area for landing and the requirement of complicated setups for the landing area.

To solve these problems, marker-based methods were proposed. According to the type of features used, marker-based methods could be categorized into two kinds—handcrafted feature-based and deep feature-based methods. One of the handcrafted feature-based approaches that was robust to low-light conditions adopted a thermal camera-based method. These methods have high performance, even in nighttime scenarios, by using the emission of infrared light from a target on the ground. However, such methods require the drone to carry an additional thermal camera, as thermal cameras are not available in conventional drone systems. Other handcrafted marker-based approaches are based on visible-light cameras. Lin et al. [4] proposed a method to track the relative position of the landing area, using a single visible-light camera-based method. They used an international H-pattern marker to guide a drone landing in a cluttered shipboard environment. The characteristic of this method was that it could restore the marker from partial occlusion, and correctly detect the marker from complicated backgrounds. Moreover, they adopted the Kalman filter to fuse the vision measurement with the inertial measurement unit (IMU) sensor outputs, to obtain a more accurate estimate. Following that approach, Lange et al. [5] introduced a method to control the landing position of autonomous multirotor UAVs. They also proposed a new hexagonal pattern of landing pads, including concentric white rings on a black background and an algorithm to detect the contour rings from the landing pads. In addition, they used auxiliary sensors such as the SRF10 sonar sensor (Robot Electronics, Norfolk, UK), which accurately measured the current altitude above the ground, and the Avago ADNS-3080 optical flow sensor (Broadcom Inc., San Jose, CA, USA), which output the UAV's current velocity. These methods have the same disadvantage as the previous one, mandatorily carrying additional hardware, such as the IMU sensor, sonar sensor, and optical flow sensor. Some previous studies investigated UAV landing on a moving platform [6,20]. These studies take account of the six-degrees of freedom (6-DOF) pose of the marker, by using special landing pads, like fiducial markers. They also investigated a landing scenario in which the markers were positioned on the deck of a ship or placed on a moving platform. Other than landing on a fixed area, this method not only solved the marker-tracking problem but also tackled the more challenging obstacle. However, it requires more calculations and the estimation of relative position between the UAV and the moving target. Hence, they used SOTA computer vision methods, including multisensor fusion, tracking, and motion prediction of landing target on the moving platform. Consequently, the limitation of such methods is the short working range, due to the limited working range of the hardware employed. In particular, a previous study adopted the fiducial AprilTag [21] marker as the landing pad, owing to its robustness in difficult situations, such as severe rotation, heavy occlusion, light variation, and low image resolution. Although this study successfully tracked the marker in daytime conditions, the maximum distance between the landing target and the UAV was only approximately 7 m.

Araar et al. [7] proposed a new solution for multirotor UAV landing, using a new landing pad and relative-pose-estimation algorithm. In addition, they adopted two filters (an extended Kalman filter and an extended H\_∞) to fuse the estimated pose and the inertial measurement. Although their method was highly accurate, it required information on the inertial measurements. Additionally, only indoor environment experiments were conducted, and the maximum working range was limited, owing to the drawback of the employed AprilTag marker. A novel idea was adopted in another study, taking advantage of cloud computing to overcome the limitation of the onboard hardware [11]. Specifically, the heavy computation tasks of computer vision were transferred to a cloud-based system, and the onboard system of the UAV only handled the returned results. Barták et al. [8] introduced an adequate handcrafted marker-based method for drone landing. Handcrafted feature-based techniques, such as blob pattern recognition, were adopted to identify and recognize the landing target. Control algorithms were also employed to navigate the drone to the appropriate target area. In this way, this method worked well in real-world environments. Nevertheless, their experiments were conducted only during daytime, and the maximum detection range was limited to 2 m. In an attempt to address autonomous UAV landing on a marine vehicle, Venugopalan et al. [9] proposed a method that adopted handcrafted feature-based techniques, like color detection, shape detection, pattern recognition, and image recognition, to track the landing target. Additionally, a searching and landing algorithm and a state machine-based method, were proposed. Their method worked well, with a success rate of over 75%, even in some difficult environmental conditions like oscillatory motion associated with the landing target or wind disturbance. However, the testing distance between the landing target and the UAV in their experiments was close. Wubben et al. [10] proposed the method for accurate landing of unmanned aerial vehicles, based on ground pattern recognition. In their method, a UAV equipped with a low-cost camera could detect ArUco markers sized 56 × 56 cm, from an altitude of up to 30 m. When the marker was detected, the UAV changed its flight behavior in order to land on the accurate position where the marker was located. Through experiments, they confirmed an average offset of only 11 cm from the target position, which vastly enhanced the landing accuracy, compared to the conventional global positioning system (GPS)-based landing, which typically deviated from the intended target by 1 to 3 m. Some researchers studied the autonomous landing of micro aerial vehicles (MAVs), using two visible-light cameras [12]. They performed a contour-based ellipse detection algorithm to track a circular landing pad marker in the images obtained from the forward-facing camera. When the MAV was close to the target position, the downward-facing camera was used because the fixed forward-facing camera view was limited. By using two cameras to extend the field of view of the MAV, the system could search for the landing pad even when it was not directly below the MAV. However, this method was only tested in an indoor scenario, which limited the working range.

In order to overcome the performance limitations of the handcrafted feature-based methods, deep feature-based methods were introduced, which exhibited high accuracy and increased detection range. Nguyen et al. [13] proposed a marker tracking method for autonomous drone landing, based on a visible-light camera on a drone. They proposed a variant of YOLOv2 named lightDenseYOLO to predict the marker location, including its center and direction. In addition, they introduced Profile Checker V2 to improve accuracy. As a result, their method could operate with a maximum range of 50 m. Similarly, Yu et al. [14] introduced a deep-learning-based method for MAV autonomous landing systems, and they adopted a variant of the YOLO detector to detect landmarks. The system achieved high accuracy of marker detection and exhibited robustness to various conditions, such as variations in landmarks under different lighting conditions and backgrounds. Despite achieving high performance in terms of detection range and accuracy, these methods did not consider input images under conditions like low-resolution and motion-blurred images. In another study, Polvara et al. [15] proposed a method based on deep reinforcement learning to solve the autonomous landing problem. Specifically, they adopted a hierarchy of double-deep Q-networks that were used as high-level control policies to reach the landing target. Their experiments, however, were only conducted in indoor environments.

Recently, Truong et al. [16] proposed a super-resolution reconstruction (SR) marker detection method for autonomous drone landing, by using a combination of SR and marker-detection deep CNNs to track the marker location. Their proposed method successfully handled the obstacle of low-resolution input. Moreover, they introduced a cost-effective solution for autonomous drone landing, as their system required only a low-cost, low-resolution camera sensor instead of expensive, high-resolution cameras. Furthermore, their proposed system could operate on an embedded system at real-time speed. However, they did not consider the case of motion blurring in the captured image. A low-resolution image is acquired by a low-resolution camera with a small number of pixels on the camera sensor, whereas motion blurring is caused by the f-number of the camera lens and the camera exposure time. A small f-number and a long exposure time cause a large amount of motion blurring in the captured image. Motion blurring frequently occurs in images captured by a drone camera, because the images are captured while the drone is moving or landing. Therefore, we propose a new method of motion deblurring and marker detection for drone landing, which is completely different from the previous work [16] that considered only SR of the low-resolution image from the drone camera, without motion deblurring. In addition, we propose a new network, SlimDeblurGAN, for motion deblurring, which is different from the previous work [16] that used a deep CNN with residual net skip connections and network-in-network (DCSCN) for SR.

Considering motion blurring: All previous methods exhibited promising solutions for autonomous landing. They conducted experiments based on various scenarios like indoor, outdoor, daytime, and nighttime, as well as difficult conditions like low light and low resolution of the input. However, input images affected by motion blur, which frequently occurs due to the movement of the drone, were not considered in their studies. Therefore, we propose a deep-learning-based motion deblurring and marker detection method for drone landing. The studies in [13,14] addressed marker detection by a drone camera but did not consider motion blurring in the captured image, which is different from our research considering motion deblurring. The research in [20] dealt with motion blurring in images captured by a UAV, but it did not measure the accuracy of marker detection or the processing speed on an actual embedded system for a drone. Different from this research, we measured the marker detection accuracies of our method and compared them with the state-of-the-art methods. In addition, we measured the processing speed of marker detection by our method on an actual embedded system suitable for onboard processing on a drone, and compared it with the state-of-the-art methods. The research in [19] studied the detection of motion-blurred vehicle logos. However, its target was only logo detection, which is different from our research on marker detection by a drone camera. Although the methods in [13,14,21] achieved 99% accuracy for landmark or marker detection in field experiments, they assumed only slow movement or landing of the drone, which did not generate motion blurring. However, in the actual case of drone movement or landing at normal speed, motion blurring occurs frequently, as mentioned in [20].
Table 1 presents a comparison of the proposed and previous methods.


**Table 1.** Summary of theoretical comparisons between the proposed method and previous studies on vision-based drone landing. 



#### **3. Proposed Method**

#### *3.1. Proposed Two-Phase Framework of Motion Deblurring and Marker Detection for Autonomous Drone Landing*

Our goal was to propose a model to accurately detect a marker object in a motion-blurred image $x^{blur}$, where the superscript *blur* indicates that the image $x$ is affected by motion blur. For that, a framework was considered that combined two models: a motion deblurring model acting as a preprocessing model ($P$) to predict the sharp image $\hat{y}^{sharp} = P(x^{blur}, \theta\_P)$, and a follow-up marker detection model ($S$) that predicted the marker object based on the predicted sharp image, $\hat{y} = S(\hat{y}^{sharp}, \theta\_S)$. Here, $\theta\_P$ is the set of trainable parameters of the preprocessing model ($P$), and $\theta\_S$ is the set of trainable parameters of the marker detection model ($S$). This framework was promising because the motion deblurring model helped to recover the blurred input image into a sharp image, on which the detector could easily act. In addition, it had several advantages, such as separate independent training, guaranteed model convergence in the framework, and leveraging SOTA models. Therefore, we proposed a two-phase framework for motion deblurring and marker detection, as shown in Figure 1. Phase I is a motion deblurring preprocessor $P$ that uses our proposed SlimDeblurGAN model, and Phase II is the marker detector $S$ that uses a YOLOv2 detector, which takes the motion-deblurred output from Phase I and outputs the predicted bounding box of the marker. The remainder of this section on the proposed method is organized according to the two phases.
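For clarity, the following minimal Python sketch shows how the two phases are composed at inference time; the callable names (`deblur_model`, `detector`) and the helper `detect_marker` are illustrative assumptions standing in for the trained SlimDeblurGAN and YOLOv2 models, not the authors' released code.

```python
def detect_marker(x_blur, deblur_model, detector):
    """Two-phase inference: Phase I removes motion blur, Phase II detects the marker.

    x_blur       : H x W x 3 motion-blurred frame from the drone camera
    deblur_model : callable implementing P(x_blur; theta_P) -> y_sharp (SlimDeblurGAN)
    detector     : callable implementing S(y_sharp; theta_S) -> boxes  (YOLOv2)
    """
    y_sharp = deblur_model(x_blur)   # Phase I: motion deblurring
    boxes = detector(y_sharp)        # Phase II: marker bounding boxes on the restored image
    return y_sharp, boxes
```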

#### *3.2. Phase I: Blind Motion Deblurring by SlimDeblurGAN*

Motion deblurring is a method of sharpening the blurring of an image caused by the motion of the object or the camera during the exposure time. Such methods are categorized into two kinds—blind and nonblind deblurring. Nonblind deblurring methods assume that the blur source is known, whereas blind deblurring methods suppose that the blur source is unknown, and they estimate both a latent sharp image and blur kernels. Kupyn et al. proposed DeblurGAN [22], which is a blind motion deblurring method that achieved SOTA performance while being faster than its closest competitor, DeepDeblur [23], by a factor of five. In this study, we did not directly use DeblurGAN in our framework; instead, we used a slim version that was pruned from the base model DeblurGAN. The pruning process is described in Section 3.2.2. Section 3.2.1 briefly explains the original DeblurGAN.


#### 3.2.1. Blind Motion Deblurring by DeblurGAN

The family of conditional generative adversarial networks (cGANs) [24] has been successfully applied in several image translation applications, such as super-resolution [25], style transfer [26], and motion deblurring. DeblurGAN was designed as a cGAN using the Wasserstein GAN gradient penalty (WGAN-GP) [27] as the critic function. Training GAN models requires finding a Nash equilibrium of a noncooperative two-player game [28]. Gradient descent sometimes finds this equilibrium and sometimes does not, and no reliable equilibrium-finding algorithm has been reported yet. These difficulties led to a novel idea, WGAN, which used an alternative objective function based on the Wasserstein distance instead of the traditional Jensen–Shannon divergence, because it helped to increase training stability [29]. Gulrajani et al. [27] then proposed WGAN-GP, an updated version that is robust to the choice of generator architecture. For this crucial reason, DeblurGAN adopted WGAN-GP as its critic function, which allowed DeblurGAN to use a lightweight CNN architecture as the generator. The DeblurGAN architecture included a generator network and a critic network, as shown in Figure 2.

**Figure 2.** DeblurGAN architecture. The generator recovers the latent sharp image from the blurred image. The critic outputs the distance between the restored and sharp images. The total loss consists of perceptual loss and WGAN loss from the critic. After training and channel pruning, only the generator is used in Phase I.

The generator was the same as that proposed by Johnson et al. [30] for style transfer tasks. It contained two convolution blocks, nine residual blocks [31] (ResBlocks), and two transposed convolution blocks. Each ResBlock had a convolution layer, instance normalization layer [32], and rectified linear unit (ReLU) [33] activation. In contrast to the original one proposed by Johnson et al. [30], the DeblurGAN generator had an additional global skip connection, which was referred to as ResOut. The detailed information of the generator architecture is shown in Table 2. The critic network architecture was identical to that of PatchGAN [34].

**Table 2.** DeblurGAN generator architecture (CL means convolutional layer).

| Layer Type | Number of Filters | Size of Filters | Number of Strides | Size of Output (Height × Width × Channel) |
|---|---|---|---|---|
| Input layer (Res) | - | - | - | 256 × 256 × 3 |
| 1st CL | 64 | 7 × 7 | 1 × 1 | 256 × 256 × 64 |
| 2nd CL | 128 | 3 × 3 | 2 × 2 | 256 × 256 × 128 |
| 3rd CL | 256 | 3 × 3 | 2 × 2 | 256 × 256 × 256 |
| ResBlock ([3 × 3, 1; 3 × 3, 1] × 9) | 256 | 3 × 3 | 1 × 1 | 256 × 256 × 256 |
| 1st transpose CL | 128 | 3 × 3 | 1 × 1 | 256 × 256 × 128 |
| 2nd transpose CL | 64 | 3 × 3 | 1 × 1 | 256 × 256 × 64 |
| Last CL | 3 | 7 × 7 | 1 × 1 | 256 × 256 × 3 |
| Output [Res + Last CL] | - | - | - | 256 × 256 × 3 |
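As a rough PyTorch illustration of the structure in Table 2, the sketch below shows one ResBlock (3 × 3 convolution, normalization, ReLU) with its local skip connection, together with the global ResOut connection; the use of `nn.InstanceNorm2d` and the padding choices follow the description above, but other details are assumptions rather than the authors' exact implementation.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """One generator ResBlock: conv -> instance norm -> ReLU -> conv -> instance norm,
    added to the block input (local skip connection)."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

def res_out(x_blur, backbone):
    """ResOut: the global skip connection adds the blurred input to the backbone output,
    so the generator only needs to learn the residual correction."""
    return x_blur + backbone(x_blur)
```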

DeblurGAN loss included content loss and adversarial loss, as shown in Equation (1):

$$
\mathcal{L} = \mathcal{L}\_{\text{GAN}} + \lambda \cdot \mathcal{L}\_{\text{X}} \tag{1}
$$

where the total loss $\mathcal{L}$ is the sum of the adversarial loss $\mathcal{L}\_{GAN}$ and the content loss $\mathcal{L}\_X$; the coefficient $\lambda$ denotes the balance between the two types of losses, and it was set to 100 in all experiments.

Adversarial loss was described as:

$$\mathcal{L}\_{\text{GAN}} = \sum\_{n=1}^{N} -D\_{\theta\_D} \left( \mathbf{G}\_{\theta\_G} \left( \mathbf{X}^{blur} \right) \right) . \tag{2}$$

$D\_{\theta\_D}$ and $G\_{\theta\_G}$ are the discriminator and generator, respectively, and $\theta\_D$ and $\theta\_G$ are their trainable parameters.

Content loss was the perceptual loss [30], which was defined as:

$$\mathcal{L}\_X = \frac{1}{W\_{i,j}H\_{i,j}} \sum\_{x=1}^{W\_{i,j}} \sum\_{y=1}^{H\_{i,j}} \left( \phi\_{i,j}\Big(\mathbf{X}^{\text{sharp}}\Big)\_{x,y} - \phi\_{i,j}\Big(\mathbf{G}\_{\theta\_G}\Big(\mathbf{X}^{\text{blur}}\Big)\Big)\_{x,y} \right)^2,\tag{3}$$

where $\phi\_{i,j}$ is the feature map obtained by the $i$-th convolution within the VGG19 network, pretrained on ImageNet [35], and $W\_{i,j}$ and $H\_{i,j}$ are the width and height of the feature maps, respectively. $\theta\_G$ is the set of trainable parameters of the generator ($G\_{\theta\_G}$).

The authors proved experimentally that, without this perceptual loss, or with the perceptual loss replaced by a simple mean square error (MSE), the network did not converge to a meaningful state [22].
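For reference, a minimal PyTorch sketch of the generator objective in Equations (1)–(3) is given below; the VGG19 feature slice (`features[:15]`) and the critic interface are illustrative assumptions, not the exact configuration reported in [22].

```python
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen ImageNet-pretrained VGG19 slice used as the perceptual feature extractor
# (the exact layer index is an assumption for illustration).
vgg_feat = vgg19(pretrained=True).features[:15].eval()
for p in vgg_feat.parameters():
    p.requires_grad = False

def generator_loss(critic, restored, sharp, lam=100.0):
    """Eq. (1): total loss = adversarial loss (Eq. (2)) + lam * perceptual content loss (Eq. (3))."""
    adv = -critic(restored).mean()                              # WGAN critic score, negated
    content = F.mse_loss(vgg_feat(restored), vgg_feat(sharp))   # MSE in VGG feature space
    return adv + lam * content
```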

#### 3.2.2. Proposed SlimDeblurGAN

As we adopted a two-phase process, the execution time of the proposed framework, $t\_{overall}$, was the sum of the processing times of the two phases, i.e., the processing time of motion deblurring, $t\_P$, and that of marker detection, $t\_D$, as illustrated in Equation (4).

$$t\_{\text{overall}} = t\_{\text{Phase I}} + t\_{\text{Phase II}} = \; t\_P + t\_D. \tag{4}$$

In Equation (4), the processing time of detection was much shorter than that of the motion deblurring model. Informatively, the processing time of the YOLOv2 detector was shorter than that of DeblurGAN, by almost 17 times. Therefore, a slimmed deblurring model P was crucial to reduce the execution time, and thus increased the execution speed of the overall system.

Considering the recently proposed methods for network lightening, such as using MobileNet [36] as a backbone, manually reducing the number of layers, network slimming [37], knowledge distillation [38], and dynamic computation [39], we settled on the network slimming proposed by Liu et al. [37]. This was a novel learning scheme for learning efficient convolutional networks, which reduced the model size, decreased the run-time memory footprint, and lowered the number of computing operations, without compromising accuracy. Essentially, the network slimming method is a technique to learn more compact CNNs. It directly imposed sparsity-induced regularization on the scaling factors in batch normalization layers, and the unimportant channels could thus be automatically identified during training, which could then be pruned. It is conceptually easy to understand; however, proposing a framework that can prune well for every network is challenging, as each network has its different components and irregular network architecture. Liu et al. applied a network slimming method to prune image classifier CNNs [37]. Zhang et al. [40] then extended the scheme to a coarse-grained method and successfully applied it to a slim YOLOv3 network. Inspired by the works of Liu and Zhang et al. [37,40], we proposed a model pruning procedure for pruning the DeblurGAN model to obtain SlimDeblurGAN, as shown in Figure 3 and Table 3.

**Figure 3.** Iterative procedure of pruning of the DeblurGAN model through sparsity training and channel pruning for SlimDeblurGAN.

Adapting DeblurGAN for model pruning. Our goal was to reduce the processing time of the proposed system by reducing the execution time of Phase I. This phase was a motion deblurring task that could be performed by the generator of DeblurGAN. In addition, only the generator was kept and employed in the testing time. Therefore, we conducted the process of training and pruning DeblurGAN to obtain SlimDeblurGAN. To this end, we proposed a slimming framework for pruning only the generator, while keeping the remaining part of the DeblurGAN. We adapted DeblurGAN for the pruning process, by modifying the generator architecture. In more detail, the original DeblurGAN generator used the instance normalization layer; however, we replaced all instance normalization layers with batch normalization (BN) layers and imposed L1 regularization on the BN layers.

Channel-level sparsity training of DeblurGAN. Sparsity could be implemented at different levels, such as the weight level, kernel level, layer level, or channel level. Among these, the channel level provided the best tradeoff between flexibility and ease of implementation. The idea of channel-level sparsity training was to adopt a scaling factor γ for each channel, where |γ| denoted the channel importance, and then to jointly train the network weights and the scaling factors. Because the BN layer already applies a trainable channel-wise scaling, the implementation of channel-level sparsity could leverage the BN layers.

Specifically, the BN layer was formulated as shown in the following equations:

$$\hat{z} = \frac{z\_{\text{in}} - \mu\_{\text{B}}}{\sqrt{\sigma\_{\text{B}}^2 + \epsilon}},\tag{5}$$

$$z\_{\text{out}} = \gamma \hat{z} + \beta,\tag{6}$$

where $z\_{\text{in}}$, $\mu\_{\text{B}}$, and $\sigma\_{\text{B}}^2$ are the input features, and the mean and variance of the input features in a minibatch, respectively, and $\gamma$ and $\beta$ denote the trainable scale factor and bias (scale and shift), respectively. The trainable scale factor $\gamma$ in the BN layer could be adopted as an indicator of channel importance. To impose sparsity regularization, a sparsity-induced penalty was added to the training objective ($loss\_{network}$), which was given as

$$L = \text{loss}\_{\text{network}} + \lambda \sum\_{\gamma} f(\gamma),\tag{7}$$

where $f(\gamma) = |\gamma|$ and $\lambda$ denotes the penalty factor.

**Table 3.** SlimDeblurGAN: DeblurGAN generator architecture after the model pruning process (CL—convolutional layer).

| Layer Type | Number of Filters | Size of Filters | Number of Strides | Size of Output (Height × Width × Channel) |
|---|---|---|---|---|
| Input layer (Res) | - | - | - | 256 × 256 × 3 |
| ResBlock 1: 4th CL | 107 | 3 × 3 | 1 × 1 | 256 × 256 × 107 |
| ResBlock 1: 5th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 2: 6th CL | 59 | 3 × 3 | 1 × 1 | 256 × 256 × 59 |
| ResBlock 2: 7th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 3: 8th CL | 24 | 3 × 3 | 1 × 1 | 256 × 256 × 24 |
| ResBlock 3: 9th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 4: 10th CL | 9 | 3 × 3 | 1 × 1 | 256 × 256 × 9 |
| ResBlock 4: 11th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 5: 12th CL | 3 | 3 × 3 | 1 × 1 | 256 × 256 × 3 |
| ResBlock 5: 13th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 6: 14th CL | 3 | 3 × 3 | 1 × 1 | 256 × 256 × 3 |
| ResBlock 6: 15th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 7: 16th CL | 3 | 3 × 3 | 1 × 1 | 256 × 256 × 3 |
| ResBlock 7: 17th CL | 251 | 3 × 3 | 1 × 1 | 256 × 256 × 251 |
| ResBlock 8: 18th CL | 13 | 3 × 3 | 1 × 1 | 256 × 256 × 13 |


Channel pruning. To achieve this goal, we adopted an expected pruning ratio $r$, which was the ratio of the number of channels expected to be pruned to the overall number of feature channels. Based on $r$ and the sorted list of all scaling factors $\gamma$, a global threshold $\hat{\gamma}$ was experimentally obtained, which determined whether a channel of the feature map was to be pruned. Feature channels whose scaling factors were smaller than the threshold $\hat{\gamma}$ were pruned.
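A compact PyTorch sketch of these two ingredients is shown below, under the assumption that every normalization layer of the adapted generator is an `nn.BatchNorm2d`: the L1 penalty of Equation (7) on the BN scale factors, and the global threshold derived from the expected pruning ratio r; the function names and the penalty factor value are illustrative.

```python
import torch
import torch.nn as nn

def sparsity_penalty(model, lam=1e-4):
    """Eq. (7): lam * sum of |gamma| over all BN channels (added to the network loss)."""
    return lam * sum(m.weight.abs().sum()
                     for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def global_threshold(model, prune_ratio):
    """Sort all |gamma| values and return the value below which a fraction `prune_ratio`
    of the channels falls; channels with smaller |gamma| are regarded as unimportant."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    k = int(prune_ratio * gammas.numel())
    return torch.sort(gammas).values[k]

def keep_masks(model, threshold):
    """Per-BN-layer boolean masks (True = keep the channel) used to build the pruned generator."""
    return {name: m.weight.detach().abs() > threshold
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}
```

During sparsity training, the penalty is simply added to the DeblurGAN objective (loss = loss_network + sparsity_penalty(generator)); after training, the masks determine which convolution filters survive the pruning step.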

Fine-tuning SlimDeblurGAN. After channel pruning, model fine-tuning was necessary to compensate for the temporary accuracy loss; the fine-tuned model showed even higher accuracy than the pruned model without fine-tuning.

Iterative pruning. Repeating the pruning procedure (as shown in Figure 3) helped to avoid over-pruning, which degrades the pruned model to a point that cannot be recovered by fine-tuning or by further pruning steps.

#### 3.2.3. Summarized Differences between the Original DeblurGAN and Proposed SlimDeblurGAN

We summarized the differences between the original DeblurGAN and the proposed SlimDeblurGAN, as follows.


#### *3.3. Phase II: Marker Detection by YOLOv2 Detector*

Deep object detectors have attracted much interest in recent years. Several SOTA deep object detectors were proposed, including Fast R-CNN [41], Faster R-CNN [17], R-FCN [42], RetinaNet [43], SSD [18], and the YOLO series (YOLO [44], YOLOv2 [45], and YOLOv3 [46]). Depending on whether an extra region proposal module is adopted, these could be categorized into two classes—two-stage and one-stage detectors. In particular, the YOLO series, which were one-stage detectors, were widely adopted in practical applications [44–46], because their accuracy and speed were well balanced. Therefore, we adopted YOLOv2, which uses Darknet19 (Figure 4) as the main part of its feature extraction, as the marker detector in Phase II.

**Figure 4.** YOLOv2 backbone convolutional neural networks (CNN) architecture. The backbone network is a combination of the first 18 layers of Darknet19 and the YOLOv2 header. The header includes four convolutional layers and a Reorg layer. The Reorg layer refers to reorganization, which manipulates the output feature map of the 13th convolution of Darknet19 to obtain a reorganized feature map. The Concat layer concatenates the reorganized feature map and the output feature map of the 20th convolution. The final output is the feature map of shape S × S × B × (5 + C).

In YOLOv2 [45], the Reorg layer was used to reshape the feature map, so that the width and height of the input feature map matched those of the other output feature map, and the two feature maps could be concatenated together. However, our marker datasets were quite different from other object detection datasets like COCO [47], because there was only one class to be detected (the marker), and its size varied from small to large, according to the height of the drone above the landing area. Therefore, it was necessary to adapt YOLOv2 to remove redundant computations. First, we considered the anchor boxes. As the marker was a circle-based shape, the ground truth bounding boxes were theoretically square. However, the image size of our dataset was 1280 × 720 pixels, and in training and testing, the input was resized to 320 × 320 pixels by bilinear interpolation [48]. Hence, the height-to-width ratio of the marker was changed to a certain ratio. Therefore, instead of choosing the anchor boxes by hand or using anchor boxes obtained from other datasets, we normalized all ground truth bounding boxes of the training dataset and clustered them through K-means clustering with an IoU-based distance metric, to obtain the proper anchor boxes for our dataset. The number of anchor boxes and their sizes could be determined by the elbow curve method on the graph of the number of clusters versus the mean IoU. Second, we set the input size of the backbone network to 320 × 320 pixels, close to the 256 × 256-pixel output of Phase I.

According to the YOLOv2 network [45], the output image was represented as S × S grids, and S was defined as 10. Therefore, the output feature map of the S × S grids was 10 × 10. B was the number of anchor boxes (in our experiment, B was defined as 3). In detail, each grid cell could have three anchor boxes for representing the detected object [45]. The output shape of the feature map of the YOLOv2 network was S × S × B × (5 + C) [45]. Here, "5" meant the center x, center y, width, height, and confidence of one anchor box. In addition, C was the number of class probabilities (in our case, C was one, because there was only one marker class) for one anchor box. Consequently, the output shape of the feature map of the YOLOv2 network (S × S × B × (5 + C)) became 10 × 10 × 3 × 6 in our case.
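The numpy sketch below illustrates how one anchor prediction of this 10 × 10 × 3 × 6 output tensor can be decoded into a bounding box, following the standard YOLOv2 parameterization (sigmoid center offsets, exponentially scaled anchor sizes); the variable names and the use of a sigmoid for the single-class probability are illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_cell(pred, cx, cy, anchor_w, anchor_h, S=10):
    """Decode one anchor prediction pred = (tx, ty, tw, th, to, p_marker) from grid cell (cx, cy).

    Returns a box (x, y, w, h) normalized to [0, 1] and the marker confidence.
    Anchor sizes are defined on the S x S grid, i.e., 10 times the normalized sizes of Table 7.
    """
    tx, ty, tw, th, to, p_marker = pred
    x = (cx + sigmoid(tx)) / S                    # box center relative to the whole image
    y = (cy + sigmoid(ty)) / S
    w = anchor_w * np.exp(tw) / S
    h = anchor_h * np.exp(th) / S
    confidence = sigmoid(to) * sigmoid(p_marker)  # objectness times marker-class probability
    return (x, y, w, h), confidence
```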

#### **4. Experimental Results**

#### *4.1. Experimental Environment and Datasets*

As there was no open dataset of blurred images acquired from landing drones, we synthesized images from the Dongguk drone camera database version 2 (DDroneC-DB2) [13] as the synthesized motion blur drone database 1 (SMBD-DB1), and we also obtained the real motion blur drone database 1 (RMBD-DB1), which contained real motion blur in the wild.

Training and testing were performed based on a two-fold cross-validation method on these two databases. All subsets were equally distributed to training and testing per fold. In the 1st fold validation, half of the images in the SMBD-DB1 were used for training, and the other half was for testing. This procedure was repeated by exchanging the training and testing data with each other in the 2nd fold validation. The average accuracy from the two-fold validations was determined as the final accuracy. The same rule was also applied to RMBD-DB1. The details of these two datasets are presented in the following section.

SMBD-DB1: This dataset was generated by following the idea proposed by Kupyn et al. [22], which was based on random trajectory generation. More specifically, motion-blurred images were generated by applying the motion-blurring kernels to the original images. The motion-blurring kernels were created by applying subpixel interpolation to the trajectory vector. Each trajectory vector, which was a complex-valued vector, corresponded to the discrete positions of an object undergoing 2D random motion in a continuous domain. Figure 5 illustrates four samples of SMBD-DB1, including the motion-blurring kernels, original images, and obtained blurred images. SMBD-DB1 contains 10,642 images generated from three sub-datasets (acquired in the morning, afternoon, and evening) of DDroneC-DB2, as shown in Table 4.
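The simplified Python sketch below mimics this synthesis step: it rasterizes a random-walk trajectory into a blur kernel and convolves each channel of a sharp frame with it. This is only a rough stand-in for the sub-pixel trajectory interpolation of Kupyn et al. [22]; the parameter values and the nearest-pixel rasterization are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def random_blur_kernel(size=17, steps=64, step_std=0.7, seed=None):
    """Rasterize a 2D random-walk trajectory into a normalized motion-blur kernel."""
    rng = np.random.default_rng(seed)
    pos = np.cumsum(rng.normal(0.0, step_std, size=(steps, 2)), axis=0)
    pos -= pos.mean(axis=0)                                # center the trajectory in the kernel
    kernel = np.zeros((size, size))
    idx = np.clip(np.round(pos + size // 2).astype(int), 0, size - 1)
    for r, c in idx:
        kernel[r, c] += 1.0
    return kernel / kernel.sum()

def apply_motion_blur(image, kernel):
    """Convolve each color channel of an H x W x 3 image with the blur kernel."""
    return np.stack([convolve2d(image[..., c], kernel, mode="same", boundary="symm")
                     for c in range(image.shape[-1])], axis=-1)
```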

RMBD-DB1: For validating our method in a real-world environment, we used an additional dataset that contained real motion blur in the wild. We recorded six drone-landing videos and obtained 2991 images. The details of RMBD-DB1 are shown in Table 5. RMBD-DB1 thus included real motion blur in images captured by a real drone camera, which allowed our method to be evaluated under realistic drone operation. The third case of Figure 6 shows the real motion blurring that occurred when the drone was landing fast.

**Figure 5.** Examples of SMBD-DB1. (**a**) Synthetically generated kernels, (**b**) ground truth (original images), and (**c**) motion-blurred images.

**Figure 6.** Examples of RMBD-DB1. (**a**) Ground truth (original images), and (**b**) motion-blurred images.

**Table 4.** Descriptions of SMBD-DB1.

| Sub-Dataset | Number of Images |
|---|---|
| Evening | 3848 |
| Total | 10,642 |

**Table 5.** Descriptions of RMBD-DB1.

| Sub-Dataset | Number of Images |
|---|---|
| Morning | 502 |
| Afternoon | 1557 |
| Evening | 932 |
| Total | 2991 |


#### *4.2. Training the Proposed Method*

The training of our method included two parts to train SlimDeblurGAN and YOLOv2, as explained in the following sections. All experiments, including both training and testing, were performed on a desktop computer with an Intel®Core™ i7-3770K CPU 3.5 GHz (4 cores) (Intel Corp., Santa Clara, CA, USA), 8 GB of main memory, and an NVIDIA GeForce 1070 graphics processing unit (GPU) card (1920 compute unified device architecture (CUDA) cores, and 8 GB of graphics memory) (NVIDIA Corp., Santa Clara, CA, USA).

#### 4.2.1. Training SlimDeblurGAN

We conducted a model pruning process, as mentioned in the previous section, to obtain SlimDeblurGAN. First, we created the base model and performed sparse training. Second, we repeated the iterative process of pruning and fine-tuning, until the resulting model showed a critical degradation of accuracy. We chose a model from among all pruned models throughout the pruning process, which had the best balance between accuracy and model size. Aiming to generate the base model, we replaced all instance normalizations by batch normalizations in the generator network for model pruning adaptation. As a result, we could increase the batch size to accelerate the training process. The training batch size of the DeblurGAN base model was chosen as 8, as it was the maximum batch size for which the model could be loaded in our training environment. We retained this batch size in the fine-tuning of the iterative pruning process. The number of parameters and accuracies of the resulting models obtained from the iterative channel pruning process are presented in Table 6. In this table, the peak signal-to-noise ratio (PSNR) [48] was widely used for mathematical measurements of image quality, based on the mean square error (MSE) between the pixels of ground truth image (Im(*i*,*j*)) and motion-deblurred image (Res(*i*,*j*)), as shown in Equations (8) and (9). The structural similarity measure (SSIM) index [49] could also predict the perceived quality of images. In detail, SSIM was the index showing the similarity between two images based on means, standard deviations, and the covariance of the two images.

$$\text{MSE} = \frac{\left(\sqrt{\sum\_{j=1}^{M} \sum\_{i=1}^{N} \left(\text{Im}(i,j) - \text{Res}(i,j)\right)^2}\right)^2}{\text{MN}},\tag{8}$$

$$\text{PSNR} = 10 \log\_{10} \left( \frac{255^2}{\text{MSE}} \right) \tag{9}$$

where *M* and *N* represent the width and height of image, respectively.
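A direct numpy translation of Equations (8) and (9), assuming 8-bit images with a peak value of 255, is shown below.

```python
import numpy as np

def psnr(ground_truth, restored):
    """PSNR in dB between a ground-truth image Im and a motion-deblurred image Res (Eqs. (8)-(9))."""
    mse = np.mean((ground_truth.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```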


**Table 6.** DeblurGAN channel pruning iterations.

The base model had 11.39 million parameters in the generator, and we performed sparse training from scratch with this model, which showed successful convergence with a PSNR of 20.59 and an SSIM of 0.49 on SMBD-DB1. We further performed channel pruning and fine-tuning with this trained base model. As a result, we obtained the first pruned model with the number of parameters in the generator reduced to 2.47 million. The fine-tuning yielded a PSNR of 21.46 and an SSIM of 0.39. Likewise, we performed the next channel pruning iterations, in which the base model was replaced with the resulting model from the previous iteration. The number of parameters in the generator after the second and third iterations decreased to 1.64 million and 1.14 million, respectively. Meanwhile, the accuracies of the resulting models degraded to 20.92 (PSNR) and 0.34 (SSIM) after the second iteration, followed by 18.16 (PSNR) and 0.26 (SSIM) after the third iteration. As shown in Table 6, the number of remaining parameters of the generator was dramatically reduced by 4.6 times after the first iteration, 6.9 times after the second iteration, and almost 10 times after the third iteration, compared to that of the base model. Although the accuracy increased after the first iteration, it decreased slightly after the second iteration, and critically degraded after the third iteration. We stopped the pruning process after the third iteration, as the observed degradation indicated over-pruning. We considered that the second pruned model had a good balance between the number of parameters and accuracy. Hence, we applied this slimmed model to the motion-deblurring phase of our system and referred to it as SlimDeblurGAN. The successful PSNR training with fine-tuning of the SlimDeblurGAN for 70 epochs appears in Figure 7.

**Figure 7.** Training PSNR of SlimDeblurGAN.


#### 4.2.2. Training Marker Detection CNN

In an attempt to facilitate the training process of the marker detection CNN, we considered the distribution of object bounding boxes in the training dataset, in order to generate a set of anchor boxes used by YOLOv2.

Generating anchor boxes: We performed K-means clustering on the bounding boxes of SMBD-DB1, based on the mean IoU distance with K from 2 to 9. As shown in Figure 8, we could determine the number of clusters by the elbow curve method. As a result, 3 was the best candidate for the number of clusters. In detail, the case of K from 2 to 3 showed the largest increment in the mean IoU and we chose 3 for the number of clusters in YOLOv2.
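A minimal numpy sketch of this clustering step is shown below, using 1 − IoU between normalized (width, height) pairs as the distance; the initialization and iteration count are illustrative assumptions, and the returned mean IoU corresponds to the quantity plotted in Figure 8.

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (w, h) pairs, assuming all boxes share the same top-left corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster normalized (w, h) ground-truth boxes into k anchors with 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(boxes, anchors), axis=1)  # nearest anchor = largest IoU
        anchors = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                            for i in range(k)])
    mean_iou = wh_iou(boxes, anchors).max(axis=1).mean()
    return anchors, mean_iou
```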

**Figure 8.** Mean IoU with respect to K for the optimal K determination.

The details of the three selected anchor boxes are shown in Table 7, and these anchor boxes are visualized in Figure 9. Notably, the actual size of the anchor boxes used in YOLOv2 depended on the grid size of the output feature map of the YOLOv2 backbone. In our experiments, we designed the backbone network to generate an output feature map grid of size 10 × 10. Therefore, the size of the anchor boxes was 10 times larger than the normalized size. The normalized bounding boxes are detailed in Table 7.


**Table 7.** Size of the three selected anchor boxes (\* the ranges of normalized height and width are from 0 to 1).

| Anchor Box | Normalized Height \* | Normalized Width \* | Size |
|---|---|---|---|
| Box 1 | 0.072 | 0.040 | Small |
| Box 2 | 0.152 | 0.086 | Medium |
| Box 3 | 0.371 | 0.212 | Large |

**Figure 9.** Anchor boxes.

#### *4.3. Testing the Proposed Method*

#### 4.3.1. Accuracy of Motion Deblurring of the Proposed SlimDeblurGAN

We conducted training and channel pruning processes to obtain SlimDeblurGAN on one fold and tested on the other fold of SMBD-DB1. In addition to measuring the testing accuracy, we measured the number of floating-point operations (FLOPs) to show the effectiveness of our proposed SlimDeblurGAN in terms of computation, compared to DeepDeblur [23], DeblurGAN [22], and DeblurGAN, using MobileNet as the backbone nets [50]. All parameters for others [22,23,50] were optimally selected by us with training data. The average results of the measurements from the two folds are presented in Table 8. The comparison of SlimDeblurGAN and the SOTA methods is also illustrated in Figure 10. As shown in this figure and table, DeblurGAN [22] showed the highest PSNR of 21.6; however, it also had a very high number of operations, at 99.3 Giga FLOPs. DeepDeblur [23] showed the lowest PSNR and highest FLOPs. Both the SlimDeblurGAN and the DeblurGAN model using MobileNet as a backbone had a small number of FLOPS, nearly one-sixth as much as that of DeblurGAN. However, DeblurGAN using MobileNet failed to maintain accuracy with a PSNR of 19.5, whereas SlimDeblurGAN had a slightly decreased accuracy, with a PSNR of 20.9. Therefore, we confirmed that our channel pruning process successfully generated a compact version of DeblurGAN with fewer FLOPs, yet high accuracy.

**Figure 10.** Comparison of DeepDeblur, DeblurGAN, DeblurGAN (MobileNet), and our proposed model of SlimDeblurGAN, in terms of motion deblurring accuracy and the number of FLOPs.


**Table 8.** Comparison of the number of FLOPs and accuracies of Deep Deblur [23], DeblurGAN [22], DeblurGAN using MobileNet [50], and our proposed model, SlimDeblurGAN.

Figure 11 presents examples of motion deblurring by the four methods, i.e., DeepDeblur [23], DeblurGAN [22], DeblurGAN (MobileNet) [50], and SlimDeblurGAN. As shown in the figure, the results of DeblurGAN (MobileNet) and DeepDeblur [23] were worse than those of the other methods, because the marker remained blurred and showed a ghosting effect in the motion-deblurred image. In contrast, the results obtained by DeblurGAN and SlimDeblurGAN were sharp, free of ghosting, and showed markers recognizable even by the human eye. Although the accuracy of DeblurGAN was slightly higher than that of SlimDeblurGAN, as presented in Table 8, the results obtained from both methods were almost the same in terms of perceptual comparison.

**Figure 11.** Examples of motion deblurring on SMBD-DB1. (**a**) Ground truth original image, (**b**) motion-blurred image, motion deblurring by (**c**) DeepDeblur, (**d**) DeblurGAN, (**e**) DeblurGAN (MobileNet), and (**f**) SlimDeblurGAN (ours).

#### 4.3.2. Accuracy of Marker Detection

As the most common way of evaluating the performance of an object detection system is to analyze the precision, recall, and F1 score at different IoU thresholds, we also evaluated our system in this way. These metrics were based on true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In our study, TP, FP, TN, and FN could be determined by the following case studies of detection results:

Case 1: The system could not detect the marker on the image. We considered this case to be FN, as presented in Figure 12a.

**Figure 12.** Examples of the detected results. The green boxes are ground truth labels; the red boxes are detected results. (**a**) The model cannot detect the maker on the image, and thus this case is considered as FN. (**b**) The detected result is not the marker but rather a marker-like object, and therefore this case is considered as FP. (**c**) The model can correctly detect the marker on the image; this case is considered as TP if the IoU is greater than or equal to the predefined threshold; otherwise, it is considered as FN and FP.

Case 2: The detected object was not the marker but rather a marker-like object (i.e., wrong detection). We considered this case to be FP, as shown in Figure 12b.

Case 3: The marker was detected by the system, as illustrated in Figure 12c; we considered the IoU between the detected bounding box and the ground truth bounding box. We compared the IoU score with the predefined threshold. If the IoU was greater than or equal to the predefined threshold, this case could be considered as TP; otherwise, it could be considered to be FP and FN.

Following these definitions, we counted the number of true positives (Num. of TP), the number of false positives (Num. of FP), and the number of false negatives (Num. of FN) on the testing dataset. As a result, the accuracies could be calculated based on Equations (10)–(12).

$$\text{Precision} = \frac{\text{Num. of TP}}{\text{Num. of TP} + \text{Num. of FP}},\tag{10}$$

$$\text{Recall} = \frac{\text{Num. of TP}}{\text{Num. of TP} + \text{Num. of FN}},\tag{11}$$

$$\text{F1 score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.\tag{12}$$
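The Python sketch below shows how these counts and metrics can be computed for one detection and one ground-truth box at a given IoU threshold; the box format (x_min, y_min, x_max, y_max) and helper names are illustrative. Following Case 3 above, a detection counts as TP when its IoU with the ground truth reaches the threshold, and otherwise as both FP and FN.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def prf1(num_tp, num_fp, num_fn):
    """Precision, recall, and F1 score as defined in Equations (10)-(12)."""
    precision = num_tp / (num_tp + num_fp)
    recall = num_tp / (num_tp + num_fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```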

Precision reflected the proportion of correct detections out of the total number of detections, whereas recall indicated the proportion of correct detections to the total number of ground truth marker boxes in the testing dataset. There was a trade-off between precision and recall. If the model learned to predict with high precision, it tended to overfit, which caused a reduction in recall. In contrast, if the model learned to predict all markers in the dataset for a high recall, it became more general in marker detection, which caused underfitting and degraded precision. The higher the precision, the lower the recall, and vice versa. The F1 score was therefore a better single metric, as it was based on both precision and recall, as shown in Equation (12). In our experiments, we evaluated the detection performance of five methods—(1) YOLOv2, (2) a combination of DeepDeblur and YOLOv2, (3) a combination of DeblurGAN and YOLOv2, (4) a combination of a modified version of DeblurGAN that used MobileNet as the backbone and YOLOv2, and (5) our proposed method—a combination of SlimDeblurGAN and YOLOv2. The measurements were conducted at different IoU thresholds on both motion-blur datasets. All parameters for the other methods [21–23,45,50] were optimally selected by us with the training data.

Testing accuracies on SMBD-DB1: The marker detection results on SMBD-DB1 are shown in Tables 9–11 and Figure 13. The precision, recall, and F1 scores of the five methods were very high at the low IoU thresholds and decreased with increasing IoU. At first glance, we could see that without any motion deblurring preprocessing, YOLOv2 exhibited low accuracies of marker detection on the motion-blurred input, as shown by the blue curves in Figure 13. The reason for this was that the motion blur strongly distorted the feature, pattern, and shape of markers in the images, making them difficult to detect. Obviously, the ghost effects in motion-blurred images were obstacles to accurately detecting the markers. Therefore, YOLOv2 could only detect the marker in images that were less affected by motion blur; otherwise, it failed. In an attempt to overcome this problem, YOLOv2 was trained directly on motion-blurred images, which could help YOLOv2 to increase its recall by learning to generalize its detection. This approach, however, decreased the precision, owing to the aforementioned tradeoff between them. Hence, the system was less accurate in distinguishing markers between marker-like objects, which caused a reduction in precision. However, with the help of highly accurate motion deblurring models such as DeblurGAN or SlimDeblurGAN, YOLOv2 obtained high precision, recall, and F1 score. Apart from DeblurGAN and SlimDeblurGAN, using DeblurGAN (MobileNet) made YOLOv2 yield lower detection accuracy than the other methods, including YOLOv2, without any deblurring preprocessing method. The failure of the DeblurGAN (MobileNet) indicated that adopting the MobileNet architecture in the backbone of the SOTA model to reduce the computation cost was not always effective. Our proposed SlimDeblurGAN showed a higher detection accuracy than the other methods, confirming that we successfully generated a compact and highly accurate motion-deblurring model of SlimDeblurGAN. In detail, the detection result by SlimDeblurGAN + YOLOv2 was slightly higher than that by the DeblurGAN + YOLOv2, as shown in Tables 9–11 and Figure 13. However, the number of parameters and FLOPs of SlimDeblurGAN were less by factors of 10 and 6, respectively, than those of DeblurGAN, as shown in Tables 6 and 8.



**Table 10.** Comparisons of recall on SMBD-DB1.


**Table 11.** Comparisons of the F1 score on SMBD-DB1.



**Figure 13.** Marker detection results on SMBD-DB1 for the five methods at different IoU thresholds: (**a**) precision, (**b**) recall, and (**c**) F1 score.

Figure 14 presents examples of the detection results on SMBD-DB1; the green boxes show the ground truths, and the red boxes show the detected boxes. As shown in this figure, the marker in the motion-blurred image could not be detected by YOLOv2 (without the additional motion deblurring stage) and could not be recognized even by the human eye. The images in Figure 14b–e show detection results on the output images of the different deblurring methods—DeepDeblur, DeblurGAN, DeblurGAN (MobileNet), and SlimDeblurGAN, respectively.

**Figure 14.** Examples of detection results on SMBD-DB1. The green boxes represent the ground truth bounding boxes and the red boxes represent the boxes detected by (**a**) YOLOv2, (**b**) DeepDeblur and YOLOv2, (**c**) a combination of DeblurGAN and YOLOv2, (**d**) a combination of DeblurGAN (MobileNet) and YOLOv2, and (**e**) a combination of SlimDeblurGAN and YOLOv2 (ours).

Although the detection results of Figure 14b–e showed that the marker in the image could be successfully detected, the detected bounding boxes by different methods were different. Specifically, the detected boxes by DeblurGAN (c) and by SlimDeblurGAN (e) were closer to the ground truth than those by DeepDeblur (b) and by DeblurGAN (MobileNet) (d). From these results, we could confirm that the motion deblurring method can overcome the challenge of motion-blurred input to an object-detection system. The more accurate the motion-deblurring method, the more accurate the detection result becomes.

Testing accuracies on RMBD-DB1: Table 12, Table 13, Table 14 present the detection accuracies of precision, recall, and F1 score, respectively, on the RMBD-DB1 dataset. In addition, the comparative graphs of the experimental results are presented in Figure 15. As shown in these tables and figure, the methods combining motion deblurring and marker detection showed better detection results than the method without motion deblurring, and our proposed method yielded a better result than the SOTA methods. By testing on the RMBD-DB1 realistic motion blur dataset, we could confirm that our proposed system can work well in the real-world environment.



**Figure 15.** Marker detection results on RMBD-DB1 for five methods at different IoU thresholds: (**a**) precision, (**b**) recall, and (**c**) F1 score.



**Table 13.** Comparisons of recall on RMBD-DB1.


**Table 14.** Comparisons of the F1 score on RMBD-DB1.


Figure 16 presents examples of detection results on RMBD-DB1 performed by five methods, i.e., YOLOv2, a combination of DeblurGAN and YOLOv2, a combination of DeblurGAN (MobileNet) and YOLOv2, a combination of DeepDeblur and YOLOv2, and our proposed framework combining SlimDeblurGAN and YOLOv2. The motion blur was due to the downward movement of the drone, which caused the failure of object detection by YOLOv2 (without importing the additional motion deblurring stage), as shown in Figure 16a. However, the combinations of motion deblurring models and the YOLOv2 detector successfully detected the marker, as shown in Figure 16b–d, and our method showed more accurate results of marker detection than the other methods, as shown in Figure 16e.

**Figure 16.** Examples of detection results on RMBD-DB1. The green boxes represent the ground truth bounding boxes and the red boxes represent the boxes detected by (**a**) YOLOv2, (**b**) DeepDeblur and YOLOv2, (**c**) a combination of DeblurGAN and YOLOv2, (**d**) a combination of DeblurGAN (MobileNet) and YOLOv2, and (**e**) a combination of SlimDeblurGAN and YOLOv2 (ours).

#### 4.3.3. Comparisons on Processing Speed and Discussion

We measured the processing speed of our method on both an embedded system and a desktop computer. The specifications of the desktop computer are explained in Section 4.2, and a Jetson TX2 system was used as the embedded system, as shown in Figure 17. The Jetson TX2 embedded system is a fast, power-efficient device optimized for artificial intelligence (AI). It includes an NVIDIA Pascal™-family GPU (256 CUDA cores) with 8 GB of memory and features various standard hardware interfaces that facilitate integration into a wide range of products, such as UAVs and autonomous vehicles [51]. As the board was pre-flashed with a Linux development environment, we installed the Ubuntu 16.04 operating system, which was a convenient environment for training and testing deep learning models, as recommended by NVIDIA. Our proposed SlimDeblurGAN and YOLOv2, as well as the comparative algorithms, were implemented on the desktop computer with TensorFlow 1.14 [52], the CUDA® toolkit (ver. 10.0) [53], and the NVIDIA CUDA® deep neural network library (CUDNN) (ver. 7.6.2) [54]. They were also implemented on the Jetson TX2 system with TensorFlow 1.12 [52], the CUDA® toolkit (ver. 9.0) [53], and NVIDIA CUDNN (ver. 7.3) [54]. The full specifications of the Jetson TX2 embedded system are presented in Table 15.


**Figure 17.** Jetson TX2 embedded system.

**Table 15.** Specifications of the Jetson TX2 embedded system.


We measured the processing time of each phase separately. Table 16 presents the processing time per image and the FPS of four motion deblurring methods—DeepDeblur, DeblurGAN, DeblurGAN (MobileNet), and SlimDeblurGAN. The processing speed of SlimDeblurGAN on the desktop computer was extremely fast, at approximately 98 FPS, and it was approximately 54.6 FPS on the Jetson TX2 system. SlimDeblurGAN had the highest processing speed in both the desktop and Jetson TX2 environments.

**Table 16.** Processing time (ms) per image with FPS of motion deblurring networks on two platforms.
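The per-stage latencies and FPS values reported here can be reproduced with a simple wall-clock loop such as the sketch below; `deblur_model` and `detector` are placeholders for any callable deblurring network and detector (e.g., SlimDeblurGAN and YOLOv2), not the authors' actual inference code.

```python
import time
import numpy as np

def measure_fps(stage_fn, images, warmup=5):
    """Average per-image latency (ms) and FPS of one processing stage."""
    for img in images[:warmup]:            # warm-up runs, excluded from timing
        stage_fn(img)
    start = time.perf_counter()
    for img in images:
        stage_fn(img)
    elapsed = time.perf_counter() - start
    ms_per_image = 1000.0 * elapsed / len(images)
    return ms_per_image, 1000.0 / ms_per_image

def pipeline(img, deblur_model, detector):
    """Two-phase inference: motion deblurring followed by marker detection."""
    return detector(deblur_model(img))

if __name__ == "__main__":
    frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(100)]
    identity = lambda x: x                 # stand-in for a real model
    ms, fps = measure_fps(lambda im: pipeline(im, identity, identity), frames)
    print(f"{ms:.2f} ms/image, {fps:.1f} FPS")
```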


In addition, we measured the total processing time per image of our method, including the YOLOv2 detector, as shown in Table 17. In the desktop environment, YOLOv2 achieved a fast speed of 50 FPS, and it was still fast on the Jetson TX2 board, at approximately 32.3 FPS. As a result, the total processing speed of our proposed method could be 33.1 FPS on the desktop computer and 20.3 FPS on the Jetson TX2 embedded system. In addition, our method was faster than the previous method, as shown in Tables 17 and 18. YOLOv2 [45] and our previous method [21] were already applied to autonomous drone landing, and our method outperformed these algorithms, as shown in Tables 9–14, 16, and 18 and Figures 13 and 15, which confirmed the necessity of motion blur restoration by the proposed method for accurate marker detection. In [21], DSCN + YOLOv2 was compared with DSCN + lightDenseYOLO, which confirmed that DSCN + lightDenseYOLO proposed in [21] outperformed DSCN + YOLOv2. Therefore, we compared our method (SlimDeblurGAN + YOLOv2) with DSCN + lightDenseYOLO proposed in [21]. In Figure 14, the green boxes represent the ground truth bounding boxes, and the red boxes represent the detected boxes. There was no red box in the lower image of Figure 14a, which meant that YOLOv2 could not detect the marker in the motion-blurred image. The same result could also be observed in Figure 16a. As shown in Tables 9–14 and Figures 13 and 15, YOLOv2 without the motion deblurring method showed lower marker detection accuracies than our proposed method.


**Table 17.** Total processing time (ms) per image by our method on two platforms.

**Table 18.** Total processing time (ms) per image by the previous method [21] on two platforms.


Although the improvement in precision by the proposed method on SMBD-DB1 was only 0.5% compared to the previous work [21] (Table 9), the improvements in recall and F1 score were 2% and 1.3%, respectively, as shown in Tables 10 and 11. In addition, on RMBD-DB1, the improvements in precision, recall, and F1 score of the proposed method were 11%, 12%, and 11.7%, respectively, compared to the previous work [21], as shown in Tables 12–14.

#### **5. Conclusions**

We introduced a deep-learning-based marker detection method for autonomous drone landing that considers motion deblurring, by proposing a two-phase framework. To the best of our knowledge, this study was the first to consider the performance of a combination of motion deblurring and marker detection for autonomous drone landing. In addition, we considered the balance between accuracy and execution time by adopting our proposed motion-deblurring network and the real-time object detector YOLOv2. To this end, we proposed SlimDeblurGAN, obtained by channel pruning, to lighten the pretrained DeblurGAN model without significant degradation of accuracy; the resulting model was significantly faster than the original version. We adopted these models in our system by training them from scratch and testing them on our two motion-blurred datasets acquired from landing drones. We confirmed experimentally that our system could operate well on non-uniform motion-blurred input and that it could be applied to an embedded system with low processing power. For our future work, we plan to combine the two networks of motion deblurring and marker detection into one model with shallower layers and fewer parameters, which could reduce the processing time. In addition, we plan to apply our network to other applications, such as pedestrian detection at a distance for intelligent surveillance camera environments, object detection in satellite images, small object detection, and moving object detection.

**Author Contributions:** N.Q.T. and T.D.P. implemented the overall system of motion deblurring and marker detection, and wrote this paper. Y.W.L., M.O., D.T.N., G.B., and K.R.P. helped with the experiments and analyzed the results. All authors have read and agreed to the published version of the manuscript.

**Acknowledgments:** This work was supported in part by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, through the Basic Science Research Program under Grant NRF-2018R1D1A1B07041921; in part by the NRF funded by the Ministry of Science and ICT (MSIT), through the Basic Science Research Program under Grant NRF-2019R1A2C1083813; and in part by the NRF funded by the MSIT, through the Bio and Medical Technology Development Program under Grant NRF-2016M3A9E1915855.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Hough-Space-Based Automatic Online Calibration Method for a Side-Rear-View Monitoring System**

#### **Jung Hyun Lee and Dong-Wook Lee \***

Department of Electronics and Electrical Engineering, Dongguk University, Seoul 04620, Korea; jhlee36@dongguk.edu

**\*** Correspondence: dlee@dongguk.edu; Tel.: +82-2-2260-3350

Received: 18 May 2020; Accepted: 11 June 2020; Published: 16 June 2020

**Abstract:** We propose an automatic camera calibration method for a side-rear-view monitoring system in natural driving environments. The proposed method assumes that the camera is always located near the surface of the vehicle so that it always shoots a part of the vehicle. This method utilizes photographed vehicle information because the captured vehicle always appears stationary in the image, regardless of the surrounding environment. The proposed algorithm detects the vehicle from the image and computes the similarity score between the detected vehicle and the previously stored vehicle model. Conventional online calibration methods use additional equipment or operate only in specific driving environments. On the contrary, the proposed method is advantageous because it can automatically calibrate camera-based monitoring systems in any driving environment without using additional equipment. The calibration range of the automatic calibration method was verified through simulations and evaluated both quantitatively and qualitatively through actual driving experiments.

**Keywords:** side-rear-view monitoring system; automatic online calibration; Hough-space

#### **1. Introduction**

In recent years, vision-based Advanced Driver Assistance Systems (ADAS) based on cameras have been developed continuously to provide safety and convenience to motorists. Vision-based ADAS employ the intrinsic and extrinsic parameters of cameras to provide a specific Field Of View (FOV). A Surround View Monitoring System, one of the vision-based ADAS, uses camera parameters to generate a bird's eye view image with a FOV that contains all of the information around the vehicle [1,2]. A panoramic rear-view system also uses camera parameters to stitch the side-view image and the rear-view image to generate a panoramic image [3,4]. These vision-based ADAS transform the captured image using camera parameters to generate the desired FOV image.

A side-rear-view monitoring system should especially provide a reliable FOV so that the driver can glean adequate information, and most countries legally specify the reliability of the FOV. However, even though vehicles of the same model are equipped with the same side-rear-view monitoring system devices, each monitoring system provides a different FOV due to manufacturing tolerances. The FOV also changes when the same monitoring system device is mounted on different vehicles. Therefore, a side-rear-view monitoring system has to calibrate the camera to provide a uniform FOV even when various factors change. To provide a consistent FOV according to the laws and circumstances of each country, control over the intrinsic and extrinsic parameters of cameras is required.

Camera calibration is one of the most useful methods for estimating the intrinsic and extrinsic parameters of a camera [5–11]. Camera calibration can improve camera performance by overcoming manufacturing tolerance limitations. Looser manufacturing tolerances can allow for lower cost and higher yield. Additionally, the parameters estimated by calibration can be used to transform pixel-based metrics into physically based ones. This geometrical transformation enables FOV control.

Camera calibration for ADAS can be categorized into offline, self, and online calibration. Offline calibration methods use photographed targets, such as checker patterns on a floor or wall [12–18]. This method is inconvenient because the size and position of the targets should be regulated depending on the location, orientation, and FOV of the camera. To this end, automobile manufacturers need to secure specialized facilities. Online calibration does not use specific targets and requires a moving camera. Traditional online calibration methods employ additional devices, such as encoders, Light Detection and Ranging (LiDAR) systems, odometry devices, and Inertial Measurement Units (IMU) [19–22], to overcome the absence of specific targets. Other online calibration methods, called self calibration [23–25], do not use additional devices but require specific information about the road surface, such as lane markers [26–29]. However, it is not always possible to obtain such specific information in a natural driving environment. In addition, self calibration methods have a constraint that some camera parameters must be known. Therefore, offline calibration must precede them, because the vehicle will not be operating on roads with lanes before it is sold. Unquestionably, the primary purpose of traditional online calibration is recalibration.

We propose a side-rear-view camera calibration method that is possible even if we do not know any camera parameters. Therefore, it does not require an offline calibration preprocessing step. In this method, a vision-based ADAS camera mounted near the side surface of a vehicle constantly photographs the vehicle. We call the part of the captured vehicle "Reflected-Vehicle Area (RVA)", and we can extract the RVA not influenced by the driving environment. Therefore, the RVA is an essential prerequisite for our method.

A segmentation method based on artificial intelligence, such as deep learning, is one possible solution for detecting the RVA [30]. However, deep learning requires a huge amount of data covering the type of vehicle, the camera parameters, and various driving environments. Collecting these data is very inconvenient and difficult. In order to overcome this inconvenience, we utilize widely known and uncomplicated image processing techniques to detect the RVA.

The proposed method detects the boundary of a reflected vehicle and computes the interior of the boundary as the RVA. The boundary of the reflected vehicle can be represented by any curve in the captured image. Random Sample Consensus (RANSAC) is a useful curve-fitting method. However, RANSAC is not always able to identify the optimal curve from moderately contaminated data [31]. Therefore, we eliminate contaminated data to the extent possible before utilizing RANSAC.

After the RVA is detected, the proposed method computes a similarity score between the detected RVA and a stored vehicle model. The similarity score can be calculated by reprojection error minimization. To minimize the reprojection error, we have to extract and match the features from the captured image and the stored image using the Scale-Invariant Feature Transform (SIFT) or Speeded Up Robust Features (SURF) [32]. However, it is difficult to extract adequate numbers of feature points from two-dimensional vehicle images, which are required to apply image matching. The challenge is mainly due to the fact that a feature point is a corner where two straight lines with different slopes meet, whereas the RVA consists mostly of smooth curves. Image-template-matching methods can compute the similarity score without requiring feature point extraction. The template size should be small to facilitate the utilization of ring projection in conventional template-matching methods [33,34], but the RVA is too large from the viewpoint of applying ring projection. To solve this problem, Yang et al. studied large-scale rotation-invariant template matching [35]. This method uses color information, but the color of the RVA changes continuously because it reflects the surrounding environment. We propose a large-scale rotation-invariant template-matching method that computes the similarity score by using edge information instead of color information. The proposed algorithm utilizes the normalized 2D cross-correlation and the Hough space expressed in the Hesse normal form.

The rest of this paper is organized as follows: Section 2 reviews related works, and Section 3 describes the essential procedures of the proposed method. Section 4 presents simulation and experimental results. Finally, we conclude with a summary of the work in Section 5.

#### **2. Related Works**

This literature review focuses only on how to calibrate the parameters of a vehicle camera since camera calibration has been extensively researched for a long time in a wide range of fields. The previous methods can be classified as offline and online calibration according to which features are used. Additionally, online calibration can be categorized according to whether additional devices are used.

#### *2.1. Offline Calibration*

Offline calibration methods estimate camera parameters using special patterns consisting of edges, circles, or lines. Since these methods use precisely drawn patterns aligned with a camera-mounted vehicle, it is possible to accurately estimate the camera parameters. However, accurate calibration cannot be performed without special facilities aligning a vehicle and patterns in the precise location.

The A&G company [13] provides calibration facilities that can align the vehicle and the calibration patterns for highly accurate camera calibration. Xia et al. [14] used multiple patterns and cameras by minimizing the reprojection error. Mazzei et al. [15] also minimized the reprojection error of checkerboard patterns' corner locations to calibrate the extrinsic parameters of the front view camera. Hold et al. [16] used a similar method using circle patterns on the ground. This method minimized the reprojection error of the centers of the circles. Tan et al. [17] drew an H-shaped pattern that consisted of two parallel lines and one line perpendicular to the vehicle. Li et al. [18] also used an H-shaped pattern to calibrate a rear-view camera.

#### *2.2. Online Calibration with Additional Devices*

Online calibration can estimate camera parameters using various sensors to utilize natural features instead of artificial patterns while driving. However, in terms of side-rear-view monitoring system calibration, this method has several drawbacks. Since the side-rear-view camera is looking at the horizon behind a vehicle rather than the road surface and part of the captured image is obscured by the driver's vehicle, it is difficult to detect enough natural features. Therefore, feature-based algorithms are inappropriate for calibration of a side-rear-view monitoring system.

Wang et al. [19] proposed a camera-encoder system to estimate extrinsic parameters. They obtained the distance that camera traveled through the encoder and calculated the Euclidean distance between matched image feature points using feature extracting and matching algorithms. This method can estimate extrinsic parameters by comparing the Euclidean distance of matched image feature points with the camera movement distance. Schneider et al. [20] also utilized odometry, camera, and matched feature points for estimating intrinsic parameters. Chien et al. [21] used visual odometry and LiDAR for online calibration. Visual odometry determines equivalent odometry information using feature extracting and matching algorithms. Li et al. [22] utilized IMU to calibrate the camera. The data measured by the IMU is fed into a processor, which calculates the position.

#### *2.3. Online Calibration without Additional Devices*

Online calibration without additional devices extracts and matches natural feature points in image sequences. Since there is no other assistant equipment, these methods highly depend on feature extracting and matching algorithms. All of the papers introduced in this section utilized road lanes as the feature points.

Xu et al. [26] and de Paula et al. [27] utilized the detection of two symmetrical lanes to calibrate cameras of the lane departure warning systems and augmented reality systems, respectively. Wang et al. [28] detected two symmetrical dotted lanes for online calibration. However, a side-rear-view camera captures few or no symmetrical lanes. Choi et al. [29] proposed the recalibration method for around view monitoring systems. This method can calibrate only when the road lanes around the vehicle are detected. However, road lanes near the vehicle are not captured by the side-rear-view camera.

#### **3. Automatic Online Calibration**

Automatic online calibration is a method that automatically calibrates the camera's orientation and location in natural driving environments. However, the calibration method cannot change the orientation of a fixed camera in monitoring systems, which leads to deformed images. Therefore, we have to convert camera parameters into image deforming parameters.

The camera parameters can be classified into intrinsic and extrinsic parameters. Intrinsic parameters describe the optical properties of the camera, and extrinsic parameters describe the orientation and location of the camera. Since the optical properties of a manufactured camera, such as the image sensor size, image sensor resolution, and distance between the image sensor and the lens, hardly change, we assume that the intrinsic parameters are constant. Online calibration focuses only on the camera orientation because the orientation has considerably more influence on the image than the camera position [29,36]. Therefore, we also exclude the camera location parameters, which are among the extrinsic parameters, from the variables.

The camera orientation can be expressed in terms of its roll, pitch, and yaw angles, as shown in Figure 1. When the camera rotates in the roll direction, the subject rotates in the image. When the camera rotates in the yaw and pitch directions, the subject moves in the horizontal and vertical directions, respectively, in the image. Therefore, the roll direction corresponds to image rotation, while the pitch and yaw directions correspond to image translation. By using this relationship, we can express camera orientation as image deforming parameters: image rotation and image translation parameters.

**Figure 1.** Parameterization of camera orientation.

We compare and analyze the RVA of a pre-uploaded 3D vehicle model and the RVA of a captured image in order to estimate the parameters. An RVA detection step has to precede the comparative analysis. RVA data in the image space are converted into the Hough space to estimate the image rotation parameters. We can estimate the image translation parameters using two RVA data with no image rotation difference. Figure 2 shows the procedure of the proposed automatic online calibration.

#### *3.1. Reflected-Vehicle Area Detection*

RVA is a part of the driver's vehicle photographed in the image. The algorithm for detecting RVA consists of two steps. The first step involves preprocessing to improve the accuracy of the second step and to eliminate, to the extent possible, the data that are not related to the reflected vehicle. In the second step, we utilized RANSAC to find the reflected-vehicle boundary and determine the inside of this boundary as the RVA. RANSAC is an iterative curve-fitting method for estimating the parameters of a mathematical model and classifying data into inliers and outliers. Inliers are the data whose distribution can be explained by some set of model parameters, and outliers are the data that do not fit the model. Therefore, outliers do not influence the estimated parameters. For this reason, RANSAC is used for outlier detection as well [37].

**Figure 2.** Block diagram of automatic online calibration.

In the first step, we eliminate outliers to improve the accuracy of RANSAC, which is inversely proportional to the number of outliers. We assume that the edge points of the reflected vehicle are inliers and all other points are outliers. The edge points of RVA always appear stationary because the moving speed and direction of the camera installed on the vehicle are the same as those of the vehicle. Therefore, we detected edges that do not change over time, and we call this process "static edge detection". To detect static edge points, we capture multiple images over a certain time period and detect the edges of each captured image using the Sobel filter. Thereafter, we could detect static edges by applying the logical AND operation to each pixel coordinate. Figure 3 shows an example of the detection of static edges.

Edge detection using the Sobel filter does not guarantee robust results because it uses fixed filter parameters to detect the edges of dynamic images. However, the proposed static edge detection can overcome this problem by accumulating edge information from multiple images. Therefore, we must capture an adequate number of images to eliminate the static edge points outside RVA to the extent possible. The static edge image in Figure 4a confirms that the static edge points inside RVA also form a curve as distinct as the reflected-vehicle boundary. Therefore, if there are several static edge points in each row of the image, only the leftmost static edge point is set as the candidate of the reflected-vehicle boundary. This process not only eliminates the static edge points inside the RVA but also represents the candidates as a bijective function. Figure 4b shows the candidate points of the reflected-vehicle boundary. In this figure, most of the static edge points inside RVA are not candidates. After determining the candidate group, we utilize RANSAC to detect RVA in the second step.
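A minimal sketch of this static edge detection step is given below, assuming OpenCV and NumPy are available: edges are extracted from each captured frame with the Sobel operator, thresholded, and combined with a pixel-wise logical AND so that only edges present in every frame survive; one boundary candidate per row is then taken as the leftmost static edge pixel. The threshold value is an assumption for illustration, not the value used by the authors.

```python
import cv2
import numpy as np

def static_edges(frames, threshold=60):
    """Keep only the edge pixels that appear in every captured frame."""
    mask = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
        edge = (cv2.magnitude(gx, gy) > threshold).astype(np.uint8)
        mask = edge if mask is None else cv2.bitwise_and(mask, edge)  # logical AND
    return mask  # 1 where an edge was present in all frames

def leftmost_candidates(static_edge_mask):
    """One candidate boundary point per row: the leftmost static edge pixel."""
    candidates = []
    for row in range(static_edge_mask.shape[0]):
        cols = np.flatnonzero(static_edge_mask[row])
        if cols.size:
            candidates.append((row, cols[0]))
    return np.array(candidates)  # (row, col) pairs
```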

**Figure 3.** Concept of static edge detection. (**a**) First captured image; (**b**) second captured image; (**c**) edge image of (**a**); (**d**) edge image of (**b**); and (**e**) static edge image after logical AND processing.

In the first step, we eliminated most of the static edge points, except for the points of the reflected-vehicle boundary. We utilized RANSAC to categorize the candidates into the reflected-vehicle boundary (inliers) and the others (outliers) in the second step. RANSAC is an iterative method involving two phases: hypothesis generation and hypothesis evaluation, as shown in Figure 5.

RANSAC generates the hypothesis for line fitting by randomly sampling data and estimating parameters using the sampled data. The highest score of hypothesis evaluation is given when all randomly picked-up data are inliers. Therefore, hypothesis generation and evaluation must be iterated so that all randomly picked-up data are inliers.




**Figure 4.** Example of reflected-vehicle area detection using Random Sample Consensus (RANSAC). (**a**) Static edge image; (**b**) determined candidates of the reflected-vehicle boundary; (**c**) estimated reflected-vehicle boundary using RANSAC; (**d**) identified reflected-vehicle boundary points from (**b**); (**e**) eliminated static edge points outside Reflected-Vehicle Area (RVA); and (**f**) static edge points inside RVA.

The probability *p* that at least one randomly sampled subset consists only of inliers after *N* iterations is

$$p = 1 - (1 - \gamma^s)^N \tag{1}$$

where γ is the number of inliers divided by the number of points in the data, *s* is the number of samples selected each time, and *N* is the number of iterations. Equation (2), obtained by solving Equation (1) for *N*, can be used to determine the number of iterations.

$$N = \frac{\log(1 - p)}{\log(1 - \gamma^s)}\tag{2}$$

We can determine the variables γ and *s* experimentally, but the probability *p* can only be determined empirically. After hypothesis generation, RANSAC calculates the distance from each datum to the estimated curve and counts the number of inliers within a predefined threshold to evaluate the hypothesis. To predefine the threshold, we assumed that the error follows a normal distribution. In statistics, the empirical rule is expressed as follows: *X* is an observation from a normally distributed random variable, µ is the mean of the distribution, and σ is its standard deviation.

$$\begin{array}{l} \Pr(\mu - 1\sigma \le X \le \mu + 1\sigma) \approx 0.6827\\ \Pr(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 0.9545\\ \Pr(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 0.9973 \end{array} \tag{3}$$

We obtained the standard deviation of the inliers σ and then predefined the threshold between 2σ and 3σ so that RANSAC can select inliers with a probability of 95% or higher. Finally, we could detect the reflected-vehicle boundary by using RANSAC.

The boundary of a reflected vehicle can be represented by a smooth curve. However, the boundary changes depending on the vehicle type and camera parameters. Therefore, we used a third-order polynomial as the model for RANSAC, as shown below.

$$f(v) = a\_0 + a\_1 v + a\_2 v^2 + a\_3 v^3 \tag{4}$$

where *v* is the horizontal coordinate of an RVA point. Additionally, the order of the equation may be increased as needed.
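A compact sketch of the boundary fitting described in this subsection is shown below: the number of RANSAC iterations follows Equation (2), and a third-order polynomial (Equation (4)) is fitted to the candidate points. The inlier ratio, success probability, and distance threshold are assumed values for illustration, not the authors' settings.

```python
import numpy as np

def ransac_iterations(gamma, s=4, p=0.99):
    """Equation (2): iterations so that, with probability p, at least one
    random sample of s points contains only inliers."""
    return int(np.ceil(np.log(1.0 - p) / np.log(1.0 - gamma ** s)))

def ransac_cubic(points, threshold=2.0, gamma=0.5, p=0.99):
    """Fit u = a0 + a1*v + a2*v^2 + a3*v^3 (Equation (4)) to (v, u) candidates."""
    v, u = points[:, 0].astype(float), points[:, 1].astype(float)
    rng = np.random.default_rng(0)
    best_coeffs, best_inliers = None, 0
    for _ in range(ransac_iterations(gamma, s=4, p=p)):
        idx = rng.choice(len(v), size=4, replace=False)    # hypothesis generation
        coeffs = np.polyfit(v[idx], u[idx], deg=3)
        residuals = np.abs(np.polyval(coeffs, v) - u)      # hypothesis evaluation
        inliers = int(np.sum(residuals < threshold))
        if inliers > best_inliers:
            best_coeffs, best_inliers = coeffs, inliers
    return best_coeffs  # highest-order coefficient first (NumPy convention)
```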

Figure 4c shows a reflected-vehicle boundary curve estimated using RANSAC. The blue points in Figure 4b are the candidates of a reflected-vehicle boundary, and these points are used as the input data for RANSAC. Figure 4c shows the curve with the most inliers, as estimated using RANSAC, and the blue points in Figure 4d are the candidates identified as inliers by RANSAC.

We assume that the interior of an estimated reflected-vehicle boundary is RVA. Figure 4e,f show the static edge points outside RVA and the static edge points inside RVA, respectively.

In the next section, we present an automatic calibration method that employs these static edge points.

#### *3.2. RVA Comparative Analysis to Estimate Parameters*

We estimated the image rotation and translation parameters that represent the camera orientation by comparing the RVA with the stored vehicle model. The stored vehicle model must be converted into an edge image to compare it with the RVA consisting of static edge points. Vehicle manufacturers may provide three-dimensional (3D) vehicle model data; if that is not the case, we can construct the data by using a 3D scanner. Then, we can regulate the camera position, orientation, and FOV and shoot the 3D vehicle in 3D virtual space. The Unity program is useful for regulating the virtual camera and for capturing images with it in 3D virtual space [38]. We applied edge detection to images captured using the virtual camera to obtain a reflected-vehicle edge image of the stored vehicle model. Figure 6 shows the process of converting a 3D vehicle model into an edge image by using the Unity program.


**Figure 6.** Process of converting a 3D vehicle model into an edge image by using the Unity program. (**a**) 3D virtual space of the Unity program; (**b**) photograph captured using a virtual camera; (**c**) reflected-vehicle area of image captured using a virtual camera; and (**d**) edge image of a 3D vehicle model in the reflected-vehicle area.

After converting the edge image from the 3D vehicle model, we utilized the Hough space to compare the converted edge image of the 3D vehicle model with the result of the reflected-vehicle area detection described in Section 3.1. The Hough space is a set of values that transform the edge points of the RVA into the Hesse normal form [39]. Equation (5) represents the Hesse normal form.

$$
r = x\cos\theta + y\sin\theta
\tag{5}
$$

The coordinate (*x*, *y*) can be expressed as (*r*, θ) by using Equation (5), and we can visualize (*r*, θ) as a curve. Figure 7 shows a visualized curve corresponding to an image space point in the Hough space. We assumed that this curve can be expressed as *r* = *h*(θ).

**Figure 7.** Parameterization of image space and Hough space. (**a**) A point in the image space, and (**b**) a curve corresponding to an image space point in the Hough space.
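The conversion of an edge point into its Hough-space curve *r* = *h*(θ) (Equation (5)) can be written compactly as below; the one-degree angular grid is an assumption for illustration.

```python
import numpy as np

THETAS = np.deg2rad(np.arange(0, 180))  # assumed 1-degree resolution

def hough_curve(x, y, thetas=THETAS):
    """r = h(theta) for a single image point (x, y), Hesse normal form."""
    return x * np.cos(thetas) + y * np.sin(thetas)

def hough_curves(points, thetas=THETAS):
    """Stack the curves of many edge points: one row per point, one column per theta."""
    pts = np.asarray(points, dtype=float)
    return pts[:, 0:1] * np.cos(thetas) + pts[:, 1:2] * np.sin(thetas)
```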

If the coordinate (*x*, *y*) is rotated by ∆θ and moved to (*x*′, *y*′), a degree-shift of ∆θ occurs in the Hough space, and if (∆*x*, ∆*y*) image translation occurs, *r*-shifting occurs in the Hough space, as shown in Figure 8. This phenomenon indicates that the parameter θ and the (∆*x*, ∆*y*) image translation are independent of each other. Therefore, the Hesse normal form can be re-expressed by considering that image rotation and translation occur simultaneously:

$$r + \Delta r = x'\cos(\theta + \Delta\theta) + y'\sin(\theta + \Delta\theta) \tag{6}$$


**Figure 8.** Parameterization of image space and Hough space by means of similarity transformation. (**a**) Rotation transformation in image space; (**b**) rotation transformation in Hough space; (**c**) translation transformation in image space; and (**d**) translation transformation in Hough space.

By using Equation (6), the Hough space curve *r* = *h*(θ) can be re-expressed as *r* + ∆*r* = *h*(θ + ∆θ). We can estimate the rotational similarity by comparing the difference between *h*(θ) and *h*(θ + ∆θ). *r* = *h*(θ) denotes a curve in the Hough space corresponding to one point in the image space. Many points exist in the image space, so we calculate the variance of *h*(θ) corresponding to each θ to solve this problem.

$$v(\theta) = \frac{1}{N-1} \sum\_{i=1}^{N} \left| h\_i(\theta) - \mu\_h \right|^2 \tag{7}$$

where *v*(θ) is the variance of *h*(θ) corresponding to θ, *N* is the number of edge points, *h<sub>i</sub>*(θ) is *h*(θ) corresponding to the *i*th edge point, and µ<sub>h</sub> is the average of *h<sub>i</sub>*(θ) over the *N* edge points. Figure 9 shows an example of how (*r*, θ) of the Hough space and the variance *v*(θ) are changed by image transformation. Figure 9g–i show that the variance *v*(θ) is shifted in the vertical direction owing to image rotation, and the amplitude of *v*(θ) is stretched in the horizontal direction owing to image scaling. Moreover, the effect of image translation is rarely seen in the Hough space. Therefore, we can estimate the rotational similarity between Figure 9a,b by computing the degree-shifting between the variances *v*(θ), as shown in Figure 9g,h.

We utilized normalized cross-correlation to calculate the degree-shifting between *v<sub>m</sub>*(θ) and *v<sub>c</sub>*(θ), where *v<sub>m</sub>*(θ) is a curve corresponding to the 3D vehicle model, and *v<sub>c</sub>*(θ) is a curve corresponding to a static edge image of the RVA. Normalization is applied to calculate the degree-shifting when the amplitude difference between the two signals is large, as shown in Figure 9g,i. The normalized cross-correlation is one of the proper solutions for estimating the relationship between two signals, and it is expressed as follows:

$$R(\varphi) = \frac{1}{K}\sum\_{k} \frac{(v\_m(\theta\_k))^{*} \, v\_c(\theta\_k + \varphi)}{\sigma\_{v\_m} \sigma\_{v\_c}} \tag{8}$$

where σ<sub>vm</sub> is the variance of *v<sub>m</sub>*(θ), σ<sub>vc</sub> is the variance of *v<sub>c</sub>*(θ), ∗ is the complex conjugate, and *K* is the length of the valid signals. Then, we can obtain the rotational similarity ∆θ by using Equation (9).

$$\Delta\theta = \operatorname\*{argmax}\_{\varphi} R(\varphi) \tag{9}$$
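A sketch of this rotation estimate, reusing the Hough-space curves from the previous sketch: the variance curve of Equation (7) is computed for the vehicle-model edge image and for the captured RVA edge image, and the degree shift maximizing their normalized cross-correlation is taken as ∆θ. The zero-mean normalization and the circular shift are simplifying assumptions of this sketch.

```python
import numpy as np

def variance_curve(curves):
    """Equation (7): variance of h_i(theta) over all edge points, per theta."""
    return np.var(curves, axis=0, ddof=1)

def rotational_similarity(v_model, v_captured, thetas_deg):
    """Equations (8)-(9): degree shift that maximizes the normalized correlation."""
    vm = (v_model - v_model.mean()) / (v_model.std() + 1e-12)
    vc = (v_captured - v_captured.mean()) / (v_captured.std() + 1e-12)
    scores = [np.mean(vm * np.roll(vc, -shift)) for shift in range(len(thetas_deg))]
    return thetas_deg[int(np.argmax(scores))]  # estimated delta-theta in degrees

# Usage with the earlier sketch (model_points / captured_points are edge points):
# v_m = variance_curve(hough_curves(model_points))
# v_c = variance_curve(hough_curves(captured_points))
# delta_theta = rotational_similarity(v_m, v_c, np.arange(0, 180))
```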

**Figure 9.** Example of Hough space data changed by image transformation. (**a**) A test image; (**b**) result of rotating and translating image (**a**); (**c**) result of scaling and translating image (**b**); (**d**) result of converting data (**a**) to the Hough space, where row is θ and column is *r*; (**e**) result of converting data (**b**); (**f**) result of converting data (**c**); (**g**) variance of (**d**) corresponding to θ, where row is θ, and column is variance *v*(θ); (**h**) variance of (**e**); and (**i**) variance of (**f**).

If the camera image is calibrated using the estimated rotational similarity, only the translational similarity remains to be determined. We can obtain the translational similarity from the normalized 2D cross-correlation, which is widely used in computer vision [40].

$$\gamma(u, v) = \frac{\sum\_{x,y}\left[I\_m(x,y) - \mu\_{I\_m}\right]\left[\hat{I}\_c(x-u, y-v) - \mu\_{\hat{I}\_c}\right]}{\left\{\sum\_{x,y}\left[I\_m(x,y) - \mu\_{I\_m}\right]^{2} \sum\_{x,y}\left[\hat{I}\_c(x-u, y-v) - \mu\_{\hat{I}\_c}\right]^{2}\right\}^{0.5}} \tag{10}$$

where γ(*u*, *v*) denotes the normalized 2D cross-correlation data at (*u*, *v*), *I<sub>m</sub>* is the edge image of the 3D vehicle model, *Î<sub>c</sub>* is the rotation-corrected camera image, and µ<sub>Im</sub> and µ<sub>Îc</sub> are the averages of *I<sub>m</sub>* and *Î<sub>c</sub>*, respectively. Then, we can compute the translational similarity by using Equation (11).

$$(\Delta x, \Delta y) = \operatorname\*{argmax}\_{u,v}\, \gamma(u, v) \tag{11}$$

Finally, we can construct a similarity matrix from ∆*x*, ∆*y*, and ∆θ, and calibrate the image captured by the camera as follows:

$$\mathbf{H}\_{\rm S} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix} = \begin{bmatrix} \cos\Delta\theta & -\sin\Delta\theta & \Delta x \\ \sin\Delta\theta & \cos\Delta\theta & \Delta y \\ 0 & 0 & 1 \end{bmatrix} \tag{12}$$


where **H**<sub>S</sub> is the similarity matrix, **R** is the image rotation matrix, and **t** is the image translation vector, used in place of the 3D camera orientation.
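The remaining steps can be sketched as follows, assuming OpenCV: the translation (∆x, ∆y) is found by an exhaustive normalized cross-correlation over a small shift window (a slow but faithful stand-in for Equation (10)), and the similarity matrix of Equation (12) is then applied with an affine warp. The search-window size is an assumption for illustration.

```python
import numpy as np
import cv2

def translational_similarity(model_edges, rotated_edges, max_shift=50):
    """Equations (10)-(11): shift maximizing the normalized 2D cross-correlation."""
    im = model_edges.astype(np.float32)
    a = im - im.mean()
    best, best_shift = -np.inf, (0, 0)
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            ic = np.roll(rotated_edges.astype(np.float32), (dy, dx), axis=(0, 1))
            b = ic - ic.mean()
            score = (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12)
            if score > best:
                best, best_shift = score, (dx, dy)
    return best_shift  # (delta_x, delta_y) in pixels

def similarity_matrix(delta_theta_deg, delta_x, delta_y):
    """Equation (12), written as the 2x3 affine form expected by cv2.warpAffine."""
    t = np.deg2rad(delta_theta_deg)
    return np.array([[np.cos(t), -np.sin(t), delta_x],
                     [np.sin(t),  np.cos(t), delta_y]], dtype=np.float32)

def calibrate(image, delta_theta_deg, delta_x, delta_y):
    """Warp the captured image with the estimated similarity transform."""
    h, w = image.shape[:2]
    m = similarity_matrix(delta_theta_deg, delta_x, delta_y)
    return cv2.warpAffine(image, m, (w, h))
```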

#### **4. Simulation and Experimental Results**

We performed several simulations and experiments to illustrate the performance of the proposed method. The purpose of the first experiment was to determine the number of captured images for static edge detection. The second experiment was performed and repeated to validate the effect of the driving environment on the automatic calibration. We compared our method with previous methods in the third experiment. The final experiment indicated the constraints of the proposed method. For these experiments, we installed High-Definition Low-Voltage Differential Signaling (HD LVDS) cameras with a 60-degree FOV and a resolution of 1280 px × 720 px on the vehicle's left- and right-side mirrors. These values were chosen due to their similarity to the human angle of view. The camera was equipped with a three-axis goniometer to change its orientation, as shown in Figure 10a. We also produced and installed grabber equipment to acquire Controller Area Network (CAN) data and LVDS camera images, as shown in Figure 10b. The proposed algorithm was implemented in C++ on a portable PC. We analyzed the CAN data via the car's On-Board Diagnostic II (OBDII) port to detect the vehicle speed and captured images only when the vehicle was in motion.

**Figure 10.** (**a**) Camera and goniometer used in the experiment and (**b**) grabber equipment for synchronizing and acquiring Controller Area Network (CAN) data and Low-Voltage Differential Signaling (LVDS) camera data.

#### *4.1. Experiments for Determining An Appropriate Number of Captured Images*

We compared the results of reflected-vehicle edge detection by changing the number of captured images to determine the appropriate number of captured images required for the purpose. Figure 11 shows the results of reflected-vehicle edge detection as a function of the number of captured images. The static edge points outside RVA were eliminated as the number of captured images increased. However, as the number of captured images increased, the time and memory costs increased as well. Due to this tradeoff relationship, we repeated this experiment in different driving environments and generalized the relationship between the number of captured images and the number of static edge points outside RVA.

Figure 12 shows the relationship between the number of static edge points outside RVA and the number of captured images. If more than 15 captured images were used, the change in the number of static edge points outside RVA converged to zero, as shown in Figure 12. Therefore, at least 15 captured images should be used so that the proportion of static edge points outside RVA is less than 50%. Furthermore, we could eliminate a greater number of static edge points outside RVA by capturing more than 15 images, depending on the operating time and the computing power of the equipment.

**Figure 11.** Left side: *n*th captured image, where *n* is the number of images captured; right side: static edge points of reflected-vehicle edge detection according to *n*.

**Figure 12.** (**a**) Number of static edge points outside RVA according to the number of captured images, where the black dashed line indicates that the ratio of the number of static edge points outside RVA to the number of static edge points is 0.5 and (**b**) change in the number of static edge points outside RVA according to the number of captured images.

#### *4.2. Field Experiments for Quantitative and Qualitative Evaluation*

We conducted experiments to verify each of the algorithms applied to the proposed method in natural driving environments. We used a goniometer and artificially regulated the camera orientation by 5° per axis. Figure 13 shows the process of the proposed method. Figure 13c shows the results of static edge detection and RVA detection. The static edge points outside RVA have been appropriately eliminated. Figure 13a shows one of the captured images, Figure 13b shows the result of automatic calibration of the captured images, and the green curves indicate the boundary of the 3D vehicle model. As shown in Figure 13b, the green curve almost matches the boundary of the reflected vehicle in the calibrated image. This result indicates that the proposed automatic calibration is apt for a side-rear-view monitoring system and that the proposed automatic calibration is accurate even when the camera orientation changes.

**Figure 13.** Field experiment result with a speed limit of 80 km/h. (**a**) Captured image, where the green line denotes the boundary of the 3D vehicle model; (**b**) calibrated image, where the green line denotes the boundary of the 3D vehicle model; (**c**) RVA detection, where red points are static edge points inside RVA and gray points are static edge points outside RVA; and (**d**) edge image of 3D vehicle model, where the green line denotes the boundary of the 3D vehicle model.

We repeated the driving test in various environments without changing the camera orientation and the edge image of the 3D vehicle model to verify whether the proposed method can provide consistent results. We drove through a school campus with a speed limit of 20 km/h, a city road with a speed limit of 50 km/h, a speedway with a speed limit of 80 km/h, and an indoor parking lot during the day and night. Figure 14 shows the results of the experiment in each environment. The static edge points outside RVA appeared near the horizon when driving at a speed of 50 km/h or more. These static edges mostly indicate a horizontal vanishing line whose position hardly changed in the image. On the school campus or in the underground parking lot, static edge points outside RVA were randomly scattered. Therefore, reflected-vehicle edge detection was essential when driving on a natural road with horizon views.

The static edge points inside RVA were more evident in the daytime than in the nighttime. Naturally, the edge of the car was more clearly visible in a bright environment than in a dark environment. Moreover, RVA detection results were evident in the dark when high-speed roads and parking lots were well lit. On the contrary, fewer static edge points inside RVA were detected in the campus during the night-time because of poorer lighting conditions compared to those in the other environments. Nevertheless, the proposed method took advantage of the Hough space (see Section 3) to ensure that the calibration can be performed even when there are only a few static edge points inside RVA.



**Figure 14.** Results obtained using the proposed method in different experimental environments where the red points are the static edge points inside RVA, and the green line denotes the boundary of the 3D vehicle model.

We must know all camera parameters except the rotation parameters to implement the existing methods, whereas our method operates with unknown camera parameters. Moreover, the ground truths of the camera parameters installed in the vehicle were not available [19,41–43]. Therefore, we repeated this experiment 100 times and used precision, recall, and Root Mean Squared Error (RMSE) as the quantitative evaluation indexes. Furthermore, we experimented with a 150-degree FOV camera with lens distortion, a 115-degree FOV camera with lens distortion, and a 115-degree FOV camera without lens distortion to confirm the applicability of the algorithm to other ADAS cameras.

#### 4.2.1. Precision, Recall, and RMSE

Precision and recall can be obtained by calculating true positive (TP), false positive (FP), false negative (FN), and true negative (TN) [44].

$$\begin{array}{l}\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\\\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\end{array} \tag{13}$$

TP, FP, FN, and TN are defined in Table 1, where *I<sub>m</sub>* denotes the edge image of the 3D vehicle model, *Î<sub>c</sub>* is a calibrated image, *S<sub>m</sub>* is the RVA of *I<sub>m</sub>*, (*S<sub>m</sub>*)<sup>c</sup> is the area outside the RVA of *I<sub>m</sub>*, *S<sub>c</sub>* is the RVA of *Î<sub>c</sub>*, and (*S<sub>c</sub>*)<sup>c</sup> is the area outside the RVA of *Î<sub>c</sub>*.

**Table 1.** Definitions of true positive (TP), false positive (FP), false negative (FN), and true negative (TN) for computing precision and recall.


Figure 15 shows the TP, FP, FN, and TN areas visually. TP is the number of intersection points between *S<sub>m</sub>* and *S<sub>c</sub>*. The red region FN denotes the number of intersection points between *S<sub>m</sub>* and (*S<sub>c</sub>*)<sup>c</sup>. We could calculate FP and TN in the same manner. The RMSE is defined as follows.

$$\begin{array}{l} \text{RMSE}\_{\text{rot}} = \sqrt{\frac{1}{N} \sum\_{n} \left| \Delta \theta(n) - \mu\_{\Delta \theta} \right|^{2}}, \text{ where } \left| \Delta \theta(n) \right| < 0.5\pi\\ \text{RMSE}\_{\text{trans},x} = \sqrt{\frac{1}{N} \sum\_{n} \left| \Delta x(n) - \mu\_{\Delta x} \right|^{2}}\\ \text{RMSE}\_{\text{trans},y} = \sqrt{\frac{1}{N} \sum\_{n} \left| \Delta y(n) - \mu\_{\Delta y} \right|^{2}} \end{array} \tag{14}$$

where *N* denotes the number of experiments; ∆θ, ∆*x*, and ∆*y* are the parameters of a similarity matrix estimated using the proposed method, and µ<sub>∆θ</sub>, µ<sub>∆x</sub>, and µ<sub>∆y</sub> denote the averages of ∆θ, ∆*x*, and ∆*y*, respectively. The range of ∆θ(*n*) is limited because the side-rear-view camera cannot capture the rear view if |∆θ(*n*)| exceeds 0.5π. Table 2 shows the average and the RMSE of each parameter calculated from 100 repeated experiments in different environments, and Figure 16 shows the parameter, precision, and recall values calculated over 100 experiments. If the precision and recall are 1, the two images are identical. We can see that the averages of precision and recall were 0.9758 and 0.9239, and both values were close to 1. This means that the edge image of the 3D vehicle model and the calibrated image were almost identical. The RMSE of rotational similarity was less than 1°, and the RMSE values of the x- and y-axis translational similarities were 4.9041 and 13.4763 px, respectively. An RMSE value close to zero indicates that the experimental results are not affected by changes in the driving environment. Since we experimented in various environments without changing the camera orientation and the edge image of the 3D vehicle model, these quantitative results verified that the proposed method could perform online calibration in most environments with RVA.
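The averages and RMSE values in Table 2 can be computed from the 100 per-run estimates with a few lines of NumPy (Equation (14)); the arrays below are placeholders for the parameters estimated in each repetition.

```python
import numpy as np

def rmse_about_mean(values):
    """Equation (14): RMSE of repeated parameter estimates about their mean."""
    values = np.asarray(values, dtype=float)
    return np.sqrt(np.mean((values - values.mean()) ** 2))

# Hypothetical per-run estimates from the 100 repeated experiments:
# delta_theta, delta_x, delta_y = ...  # degrees, px, px
# print(rmse_about_mean(delta_theta), rmse_about_mean(delta_x), rmse_about_mean(delta_y))
```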

**Figure 15.** Visualization and parameterization of TP, FP, FN, and TN. (**a**) Captured image, where the blue line is the boundary of the RVA of the captured image and *S<sub>c</sub>* is the RVA of the calibrated image; (**b**) TP = *S<sub>m</sub>* ∩ *S<sub>c</sub>*, TN = (*S<sub>m</sub>*)<sup>c</sup> ∩ (*S<sub>c</sub>*)<sup>c</sup>, FN = *S<sub>m</sub>* ∩ (*S<sub>c</sub>*)<sup>c</sup>, and FP = (*S<sub>m</sub>*)<sup>c</sup> ∩ *S<sub>c</sub>*; and (**c**) edge image of the 3D vehicle model, where the green line is the boundary of the 3D vehicle model and *S<sub>m</sub>* is the RVA of the edge image of the 3D vehicle model.

**Table 2.** Average and Root Mean Squared Error (RMSE) of the quantitative results of 100 repeated experiments in different environments.


#### 4.2.2. Experiments with Various Cameras

The FOV of cameras used in ADAS depends on their purpose. For example, forward collision warning systems and parking assistance systems commonly use a narrow-angle camera and a wide-angle camera, respectively. In some cases, lens distortion may occur. In order to verify that the proposed algorithm can work in these various conditions, we experimented with three types of cameras: a 150-degree FOV camera with lens distortion, a 115-degree FOV camera with lens distortion, and a 115-degree FOV camera without lens distortion. Camera orientation was manually regulated by 5 degrees per axis.

As shown in Figures 17 and 18, changes in FOV did not significantly affect the experimental results. Additionally, Figure 19 shows that the proposed method could calibrate both the lens distorted-image and lens distortion corrected-image. These qualitative results indicate that the proposed method could perform online calibration even if cameras' FOV and lens distortion were changed. Therefore, our method has the potential to be applied to various ADAS cameras.

#### *4.3. Comparison with Previous Methods*

As aforementioned in Section 2, camera calibration can be categorized according to which features and devices were used: offline calibration, online calibration with additional devices, and online calibration without additional devices. Table 3 shows the comparison of the related works and the proposed method from the viewpoint of side-rear-view monitoring system calibration. Offline calibration is an inconvenient and restrictive method because the driver has to visit a large service center equipped with special facilities, and it cannot be conducted in natural driving environments. In addition, these facilities increase the price of offline calibration-based products. Likewise, the additional devices of online calibration also increase the price.

**Figure 16.** The result of 100 repeated experiments. (**a**) Estimated rotational similarity; (**b**) *x*-axis translational similarity; (**c**) *y*-axis translational similarity; (**d**) precision values; and (**e**) recall values.

Online calibration is convenient because it can automatically calibrate cameras in a natural driving situation. However, traditional online calibration can hardly calibrate side-rear-view cameras due to constraints. Those methods must extract features such as lanes from captured images, but the side-rear-view camera, looking at the horizon behind the vehicle, does not capture traffic lanes around the vehicle. In contrast to those methods, the proposed method can calibrate the side-rear-view camera using the RVA, which is being photographed at all times in natural driving environments.



**Figure 17.** Experimental results with a 150-degree Field Of View (FOV) camera with lens distortion in different orientation conditions, where the green line denotes the boundary of the 3D vehicle model.


**Figure 18.** Experimental results with a 115-degree FOV camera with lens distortion in different orientation conditions, where the green line denotes the boundary of the 3D vehicle model.


**Figure 19.** Experimental results with 115-degree FOV camera and lens distortion correction in different orientation conditions, where the green line denotes the boundary of the 3D vehicle model.


**Table 3.** Comparison of the related works and the proposed method.

Unfortunately, since there is no previous method that can calibrate a side-rear-view monitoring system in natural driving environments, it is impossible to conduct a quantitative performance comparison of the previous and proposed methods on the same dataset. However, the comparison summarized in Table 3 clearly explains that the proposed method is superior to the other previous methods in terms of side-rear-view camera calibration. Moreover, we can utilize the RVA information instead of the calibration patterns to implement offline calibration for calculating the similarity score and aligning images.

The similarity matrix consisting of image rotation and translation parameters can be estimated by minimizing an algebraic distance, called reprojection error, between matched feature points.

$$\hat{\mathbf{H}}_{\mathrm{S}} = \operatorname*{argmin}_{\mathbf{H}_{\mathrm{S}}} \sum_{i} \left\| \check{\mathbf{m}}_{i} - \mathbf{H}_{\mathrm{S}}\,\mathbf{m}_{i} \right\|^{2} \tag{15}$$

where $\mathbf{H}_{\mathrm{S}}$ is the similarity matrix, $\mathbf{m}_i$ is the *i*-th feature point inside the RVA in a captured image, $\check{\mathbf{m}}_i$ is the *i*-th feature point of the 3D vehicle model corresponding to $\mathbf{m}_i$, and $\hat{\mathbf{H}}_{\mathrm{S}}$ is the estimated similarity matrix. We solved Equation (15) using the Levenberg–Marquardt method, one of the maximum likelihood estimation methods [45]. Experiments were performed using three types of cameras.
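To make the estimation step concrete, the following is a minimal sketch, under our own assumptions, of how Equation (15) could be minimized with a Levenberg–Marquardt solver: the similarity is parameterized by a rotation angle, a scale, and a translation, and SciPy stands in for whatever implementation the authors used. All function and variable names are illustrative, not the authors' code.

```python
# Hypothetical sketch of Equation (15): estimate the similarity transform H_S that maps the
# captured-image feature points m_i onto the corresponding 3D-vehicle-model points m_check_i
# by minimizing the reprojection error with Levenberg-Marquardt (SciPy).
import numpy as np
from scipy.optimize import least_squares

def residuals(params, m, m_check):
    """params = [theta, s, tx, ty]; m and m_check are (N, 2) arrays of matched points."""
    theta, s, tx, ty = params
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    projected = s * (m @ R.T) + np.array([tx, ty])   # H_S applied to every m_i
    return (m_check - projected).ravel()             # stacked residuals (m_check_i - H_S m_i)

def estimate_similarity(m, m_check):
    x0 = np.array([0.0, 1.0, 0.0, 0.0])              # identity transform as initial guess
    sol = least_squares(residuals, x0, args=(m, m_check), method="lm")
    theta, s, tx, ty = sol.x
    return np.array([[s * np.cos(theta), -s * np.sin(theta), tx],
                     [s * np.sin(theta),  s * np.cos(theta), ty],
                     [0.0,                0.0,               1.0]])
```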

The proposed method can estimate parameter values similar to those of the RVA-based offline calibration method, as seen in Table 4. Furthermore, the experimental results with the 150-degree camera in Figure 20 show that the RVA boundary of the image calibrated by our method fits the green line more closely than those of the previous methods. Additionally, the other calibrated images in Figure 20 show that all RVA boundary locations of both methods almost fit the green line. These experimental results indicate that our method can provide results similar to those of the RVA-based offline calibration method, even under conditions where the previous method cannot operate due to a lack of feature points.


**Table 4.** Quantitative performance comparison.

#### *4.4. Limitation of Calibration*

The proposed method compared the static edge points inside the RVA of a captured image with those of a 3D vehicle model image to calculate the similarity between the two images. Therefore, the static edge points of the RVA are an essential factor, but the RVA can be altered by various factors. To investigate the effect of the RVA range, we repeated the experiment while gradually decreasing the RVA range. Since the goniometer limits how far the camera orientation can be changed, we decreased the RVA of the 3D vehicle model image instead of changing the camera orientation, as shown in Figure 21. We expected the calibrated images corresponding to Figure 21 to be translated along the *x*- and *y*-axis directions. Therefore, if the rotation parameter changes, or if the translation parameter differs from the decreased RVA value, the calibration fails.


**Figure 20.** Qualitative performance comparison.

 

**Figure 21.** Images with gradually decreasing RVA of a 3D vehicle model image to verify the minimum required RVA, where the green lines are the boundaries of a 3D vehicle model image.

Figure 22 shows the calibrated images corresponding to the images in Figure 21. We can see that the rotation parameter of the bottom-right image in Figure 22 differed from those of the other calibrated images, and the vehicle boundary in all calibrated images in Figure 22 almost matched the green curves representing the boundaries of the 3D vehicle model. Accordingly, the calibration failed only in the bottom-right case. RVA in case of the failure had no static edge points inside it, unlike the other cases. This means that the calibration can fail when it uses only RVA boundary data. Through this experiment, we confirmed that the elements that serve as static edge points inside RVA (i.e., door handle or pillar) must be photographed for automatic calibration of the side-rear-view monitoring system.

**Figure 22.** Calibrated images corresponding to Figure 21, where the green lines denote the boundaries of a 3D vehicle model image.

#### **5. Conclusions**

 

We proposed an automatic online calibration method for the monitoring system of a vehicle equipped with a side-rear-view camera. The proposed method has the following advantages. The first advantage is that it can be used to automatically calibrate the camera while driving, without using additional sensors or artificial markers; therefore, no specialized facilities are required for calibration. In addition, there is no constraint that offline calibration must be performed before automatic calibration, which is the case for conventional methods. The next advantage is that it provides consistent results even when the driving environment changes. This is possible because we eliminate irrelevant data before utilizing RANSAC to provide consistent results in various driving environments. The third advantage is that the proposed method facilitates large-scale template matching by using information about edge points instead of color information, because the method uses the Hough space. This advantage solves the problem of traditional large-scale template-matching methods that use color information, as well as the problem that the RVA color changes depending on the vehicle color and the driving environment. The last advantage is that the calibration requires only RVA information. Therefore, the proposed method can potentially be used to calibrate most cameras mounted on a vehicle.

Based on this potential, we expect the proposed automatic online calibration method to be applied not only to side-rear-view monitoring systems but also to various vision-based ADAS. These advantages indicate that the proposed method can provide convenience to motorists who require recalibration, and it can increase profits for vehicle manufacturers by reducing the usage of special facilities. As a disadvantage, the proposed method estimates the similarity instead of camera orientation. This disadvantage sometimes induces affine transformation errors. These errors can be solved by using a planar vehicle model, but it is difficult to overcome this disadvantage with the proposed method because it employs a 3D vehicle model. The results of experiments conducted in various driving environments indicate that the proposed automatic calibration method is suitable for use in real-world applications.

**Author Contributions:** J.H.L. developed the algorithm, and performed the experiments. D.-W.L. developed the system architecture and analyzed the experimental results. All authors wrote the paper together. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **SD-VIS: A Fast and Accurate Semi-Direct Monocular Visual-Inertial Simultaneous Localization and Mapping (SLAM)**

#### **Quanpan Liu, Zhengjie Wang \* and Huan Wang**

School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China; liuquanpan@126.com (Q.L.); WH978992767@163.com (H.W.)

**\*** Correspondence: wangzhengjie@bit.edu.cn; Tel.: +86-186-1164-6723

Received: 17 February 2020; Accepted: 7 March 2020; Published: 9 March 2020

**Abstract:** In practical applications, achieving a good balance between high accuracy and computational efficiency is the main challenge for simultaneous localization and mapping (SLAM). To address this challenge, we propose SD-VIS, a novel fast and accurate semi-direct visual-inertial SLAM framework, which can estimate camera motion and the structure of surrounding sparse scenes. In the initialization procedure, we align the pre-integrated IMU measurements and visual images and calibrate the metric scale, initial velocity, gravity vector, and gyroscope bias by using multiple view geometry (MVG) theory based on the feature-based method. At the front-end, keyframes are tracked by the feature-based method and used for back-end optimization and loop closure detection, while non-keyframes are utilized for fast-tracking by the direct method. This strategy gives the system not only the better real-time performance of the direct method, but also the high accuracy and loop closure detection ability of the feature-based method. At the back-end, we propose a sliding window-based tightly-coupled optimization framework, which obtains more accurate state estimation by minimizing the visual and IMU measurement errors. In order to limit the computational complexity, we adopt the marginalization strategy to fix the number of keyframes in the sliding window. Experimental evaluation on the EuRoC dataset demonstrates the feasibility and superior real-time performance of SD-VIS. Compared with state-of-the-art SLAM systems, we achieve a better balance between accuracy and speed.

**Keywords:** visual-inertial; semi-direct SLAM; multi-sensor fusion

#### **1. Introduction**

Simultaneous localization and mapping (SLAM) plays an important role in self-driving cars, virtual reality, unmanned aerial vehicles (UAV), augmented reality, and artificial intelligence [1,2]. This technology can provide reliable state estimation for UAVs and self-driving cars in GPS-denied environments by relying on onboard sensors. Various types of sensors can be utilized in SLAM, such as stereo cameras, lidar, inertial measurement units (IMU), and monocular cameras. However, they have significant disadvantages when used individually: the metric scale of a stereo camera can be obtained directly from the fixed baseline length, but it can only be estimated accurately within a limited depth range [3]; lidar has high precision indoors, but it encounters reflection problems from glass surfaces outdoors [4]; cheap IMUs are extremely susceptible to bias and noise [5]; and a monocular camera cannot estimate the absolute metric scale [6]. This paper mainly studies a monocular visual-inertial navigation system (VINS) based on multi-sensor fusion [7], which has the advantages of small size, light weight, and observable scale, roll, and pitch angles.

According to the method of image information processing, there are two categories of SLAM: the feature-based method and the direct method. The standard process of the feature-based method proceeds in three steps [8–10]. Firstly, a group of sparse point or line features is extracted from each image, and feature matching between adjacent frames is performed using invariant feature descriptors. Secondly, the camera motion and the 3D positions of the sparse feature points are estimated using multiple view geometry theory. Finally, the camera motion and the 3D positions of the sparse feature points are optimized by minimizing the visual re-projection errors. However, the feature-based method has the following disadvantages: feature extraction and matching are very time-consuming; fewer features can be extracted in low-texture environments; and repeated-texture environments cause incorrect feature matching. Therefore, the accuracy and robustness of the feature-based method depend on correct feature extraction and matching.

The direct method considers the entire image, or some pixels with a large gradient, and directly estimates the camera motion and scene structure by minimizing the photometric error [11–14]. Therefore, in low-texture and repeated-texture environments, the performance of the direct method is better than that of the feature-based method. In addition, without feature extraction and matching, the calculation speed of the direct method is faster than that of the feature-based method. However, because the photometric error function is highly non-convex, it is difficult for the direct method to converge when large baseline motion or image blur occurs. Due to the inability to perform effective loop closure detection, the cumulative drift generated during long-term operation is still an unresolved problem [15].

Academic research on SLAM is very extensive. Well-known methods include PTAM, SVO, VINS-Mono, LSD-SLAM, ORB-SLAM, and DSO. PTAM was one of the most representative systems in the early stage of feature-based SLAM algorithms; it first proposed dividing tracking and mapping into two parallel threads [16]. After that, most feature-based SLAM algorithms adopted this idea, including ORB-SLAM. ORB-SLAM is the most successful feature-based SLAM, which uses ORB features in tracking, mapping, re-location, and loop closure detection [17]. VINS-Mono is a compelling multi-sensor monocular visual-inertial SLAM based on tight coupling and non-linear optimization [18]. PL-VIO is an extension of VINS-Mono with line features, which optimizes the re-projection errors of point and line features in a sliding window [19]. VINS-Fusion is a multi-sensor fusion platform based on non-linear optimization, which is the continuation of VINS-Mono in the direction of multi-sensor fusion [20]. LSD-SLAM is a novel real-time monocular SLAM method based on the direct method that can create large-scale semi-dense maps in real time on a laptop without GPU acceleration [21]. Based on LSD-SLAM, DSO adds complete photometric calibration and uniformly samples pixels with large gradients throughout the image, thereby improving tracking accuracy and robustness [11]. On the basis of DSO, LDSO adds loop closure detection, which makes up for DSO's inability to perform loop closure detection and eliminates the cumulative error generated during long navigation [22]. Stereo DSO is an improved version of DSO that realizes real-time estimation of motion and 3D structure with high accuracy and robustness on a moving stereo camera [3]. Lee [7] presents a new implementation method for efficient simultaneous localization and mapping using a forward-viewing monocular vision sensor; the method is developed to be applicable in real time on a low-cost embedded system for indoor service robots.

In recent years, SLAM systems that combine the feature-based method and the direct method have become popular. Semi-direct visual odometry (SVO) is an effective hybrid method that combines the advantages of the feature-based method and the direct method [23]. However, since the algorithm was originally designed for a drone's downward-looking camera, it easily loses camera tracking in other settings. Lee [24] proposed a loosely-coupled method combining ORB-SLAM and DSO to improve positioning accuracy. However, its front-end and back-end are almost independent and cannot share estimation information to further improve the pose precision. In [25,26], different semi-direct approaches were proposed for stereo odometry. Both methods use feature-based tracking to obtain a motion prior and then perform direct semi-dense or sparse alignment to refine the camera pose. SVL [27] can be considered a combination of ORB-SLAM and SVO: the ORB-SLAM approach is adopted for keyframes, and SVO is adopted for non-keyframes. Therefore, SVL can achieve a good balance between speed and accuracy according to the camera motion and environment. Our method is motivated by SVL, but we go one step further in real-time performance. Specifically, thanks to the keyframe selection strategy and the sliding window-based back-end, we only need to extract new feature points on the keyframes and track them with the KLT sparse optical flow algorithm, which further reduces the computational complexity while ensuring accuracy.

In this paper, we present SD-VIS, a novel fast and accurate semi-direct visual-inertial SLAM framework, which combines the exactness of feature-based method and quickness of direct method. The keyframes in SD-VIS are tracked by feature-based method, which is used for sliding window-based non-linear optimization and loop closure detection. This strategy solves the problem of drift in the long-term operation and ensures the robustness of the algorithm in case of large baseline motion and image blur. Non-keyframes are tracked by the direct method, and the distance between adjacent non-keyframes is minimal, which ensures the convergence of error function. Compared with the direct method, SD-VIS exhibits the function of loop closure detection and solves the problem of drift in long-term operation. Compared with the feature-based method, SD-VIS can achieve the same accuracy while maintaining a faster speed.

#### **2. System Framework Overview**

Figure 1 demonstrates the framework of the semi-direct vision-inertial SLAM system. Sensor data comes from a monocular camera and IMU. IMU measurements between two consecutive images are pre-integrated, and the pre-integration is used as the constraint of IMU between two images (Section 3.2). In the initialization procedure, we detect the feature points on each image and track them with KLT sparse optical flow algorithm [15].

**Figure 1.** The semi-direct visual-inertial SLAM system framework

In the following visual-inertial alignment, we align the pre-integrated IMU measurements and visual images and calibrate out the metric scale, initial velocity, gravity vector, and gyroscope bias by using multiple view geometry (MVG) theory based on the feature-based method. (Section 3.3). After initialization, keyframe selection will be performed based on the IMU pre-integration and previous feature matching results. The previous feature matching refers to the feature matching between the last frame and the penultimate frame in the sliding window, and the matching is completed before the current frame arrives. Non-keyframes are used for fast-tracking and localization by direct method [11], and keyframes are tracked by feature-based method [18] and used for non-linear optimization and loop closure detection (Section 4). In the following tight coupling optimization framework, we can get more accurate state estimation by minimizing visual re-projection error, IMU residual, prior information from marginalization, and re-location information from loop closure detection (Section 5).

#### **3. IMU Measurements and Visual-Inertial Alignment**

#### *3.1. Definition of Symbols*

Figure 2 shows the definition of symbols in the semi-direct visual-inertial SLAM framework. C, B, and W are the camera coordinate system, the IMU body coordinate system, and the world coordinate system, respectively. We define $\mathbf{T}_{WB} = (\mathbf{R}^{W}_{B}, \mathbf{P}^{W}_{B})$ as the motion of B relative to W. $\mathbf{T}_{BC} = (\mathbf{R}^{B}_{C}, \mathbf{P}^{B}_{C})$ represents the extrinsic parameters between C and B, which can be calibrated in advance. $\mathbf{T}_{t,t+1} \in SE(3)$ represents the motion from time $t$ to time $t+1$ in the camera coordinate system C, and $\mathbf{Z}_{t,t+1}$ represents a pre-integrated IMU measurement between the camera coordinate systems $C_t$ and $C_{t+1}$.

**Figure 2.** Symbol definition of algorithm

The 3D points $\mathbf{F}_1, \mathbf{F}_2 \in \mathbb{R}^{3}$ represent spatial feature points observed simultaneously by the camera coordinate systems $C_t$ and $C_{t+1}$. $\mathbf{P}_1, \mathbf{P}_2, \mathbf{P}_3, \mathbf{P}_4 \in \mathbb{R}^{2}$ are the projections of the feature points on the image coordinate system. We adopt the traditional pinhole camera model to map $\mathbf{F}_1$ from the camera coordinate system to the image coordinate system by the projection function $\pi: \mathbb{R}^{3} \rightarrow \mathbb{R}^{2}$:

$$\mathbf{P}_1 = \pi(\mathbf{F}_1) = \begin{bmatrix} f_u\,\dfrac{x_c}{z_c} + c_u \\ f_v\,\dfrac{y_c}{z_c} + c_v \end{bmatrix},\qquad \mathbf{F}_1 = \begin{bmatrix} x_c & y_c & z_c \end{bmatrix}^{T} \tag{1}$$

where $[f_u\ f_v]^{T}$ and $[c_u\ c_v]^{T}$ are the camera intrinsic parameters.
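As an illustration only, here is a small sketch of the projection function $\pi$ of Equation (1) and its inverse; the default intrinsic values are placeholders, not the parameters of any camera used in the paper.

```python
# Illustrative pinhole projection (Equation (1)) and back-projection pi^{-1}.
import numpy as np

def project(F, fu=460.0, fv=460.0, cu=376.0, cv=240.0):
    """F = [xc, yc, zc] in the camera frame; returns the 2D image point."""
    xc, yc, zc = F
    return np.array([fu * xc / zc + cu,
                     fv * yc / zc + cv])

def back_project(p, depth, fu=460.0, fv=460.0, cu=376.0, cv=240.0):
    """Inverse projection: pixel coordinates plus depth back to a 3D point in the camera frame."""
    u, v = p
    return depth * np.array([(u - cu) / fu, (v - cv) / fv, 1.0])
```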

#### *3.2. IMU Pre-Integration*

In the back-end optimization and visual-inertial alignment, the constraints of vision and IMU need to be optimized in the same frame, so the IMU measurements between two adjacent frames need to be integrated into one constraint.

The IMU outputs the 3-axis angular velocity $\widetilde{\boldsymbol{\omega}}$ and 3-axis acceleration $\widetilde{\boldsymbol{\alpha}}$, which include bias and Gaussian white noise:

$$
\widetilde{\boldsymbol{\omega}} = \boldsymbol{\omega}^{\mathbf{B}} + \mathbf{b}\_{\boldsymbol{\omega}}^{\mathbf{B}} + \mathbf{n}\_{\boldsymbol{\omega}}^{\mathbf{B}} \tag{2}
$$

$$\widetilde{\boldsymbol{\alpha}} = \mathbf{R}\_{\rm W}^{\rm B} \left( \boldsymbol{\alpha}^{\rm W} + \mathbf{g}^{\rm W} \right) + \mathbf{b}\_{\alpha}^{\rm B} + \mathbf{n}\_{\alpha}^{\rm B} \tag{3}$$

where $\mathbf{n}^{B}_{\omega} \sim N(0, \sigma^{2}_{\omega})$ and $\mathbf{n}^{B}_{\alpha} \sim N(0, \sigma^{2}_{\alpha})$ represent the Gaussian white noise, $\mathbf{g}^{W} = [0, 0, g]^{T}$ is the gravity vector, $\mathbf{R}^{B}_{W}$ represents the rotation matrix from W to B, and $\mathbf{b}^{B}_{\omega}$, $\mathbf{b}^{B}_{\alpha}$ represent the biases of the gyroscope and accelerometer.

We define $\mathbf{P}^{W}_{B_i}$, $\mathbf{V}^{W}_{B_i}$, $\mathbf{q}^{W}_{B_i}$ as the translation, velocity, and rotation quaternion in the IMU body coordinate system $B_i$. If we know the IMU measurements during $t \in [i, j]$, we can calculate $\mathbf{P}^{W}_{B_j}$, $\mathbf{V}^{W}_{B_j}$, $\mathbf{q}^{W}_{B_j}$ in the coordinate system $B_j$:

$$\mathbf{P}^{W}_{B_j} = \mathbf{P}^{W}_{B_i} + \mathbf{V}^{W}_{B_i}\Delta t + \iint_{t\in[i,j]} \left[\mathbf{R}^{W}_{B_t}\left(\widetilde{\boldsymbol{\alpha}}_t - \mathbf{b}^{B}_{\alpha_t}\right) - \mathbf{g}^{W}\right]\delta t^{2} \tag{4}$$

$$\mathbf{V}^{W}_{B_j} = \mathbf{V}^{W}_{B_i} + \int_{t\in[i,j]} \left[\mathbf{R}^{W}_{B_t}\left(\widetilde{\boldsymbol{\alpha}}_t - \mathbf{b}^{B}_{\alpha_t}\right) - \mathbf{g}^{W}\right]\delta t \tag{5}$$

$$\mathbf{q}^{W}_{B_j} = \mathbf{q}^{W}_{B_i} \otimes \int_{t\in[i,j]} \frac{1}{2}\,\Omega\!\left(\widetilde{\boldsymbol{\omega}}_t - \mathbf{b}^{B}_{\omega_t}\right)\mathbf{q}^{B_i}_{B_t}\,\delta t \tag{6}$$

where:

$$
\Omega(\omega) = \begin{bmatrix} -\omega\_{\times} & \omega \\ -\omega^T & 0 \end{bmatrix}, \ \omega\_{\times} = \begin{bmatrix} 0 & -\omega\_z & \omega\_y \\ \omega\_z & 0 & -\omega\_x \\ -\omega\_y & \omega\_x & 0 \end{bmatrix}.
$$

From formulas (4)–(6), we can see that the IMU body state $\mathbf{P}^{W}_{B_j}$, $\mathbf{V}^{W}_{B_j}$, $\mathbf{q}^{W}_{B_j}$ is related to the IMU body coordinate system $B_i$. In the back-end tightly-coupled optimization, we continuously and iteratively update the IMU state variables in the sliding window. When the IMU body state at time $t = i$ is iteratively updated, we need to recalculate the state at time $t = j$, which is very time-consuming. We adopt IMU pre-integration to avoid this unnecessary time consumption. Formulas (4)–(6) can be written as:

$$\mathbf{R}^{B_i}_{W}\mathbf{P}^{W}_{B_j} = \mathbf{R}^{B_i}_{W}\left(\mathbf{P}^{W}_{B_i} + \mathbf{V}^{W}_{B_i}\Delta t - \frac{1}{2}\mathbf{g}^{W}\Delta t^{2}\right) + \boldsymbol{\theta}^{B_i}_{B_j} \tag{7}$$

$$\mathbf{R}^{B_i}_{W}\mathbf{V}^{W}_{B_j} = \mathbf{R}^{B_i}_{W}\left(\mathbf{V}^{W}_{B_i} - \mathbf{g}^{W}\Delta t\right) + \boldsymbol{\beta}^{B_i}_{B_j} \tag{8}$$

$$\mathbf{q}^{B_i}_{W} \otimes \mathbf{q}^{W}_{B_j} = \boldsymbol{\gamma}^{B_i}_{B_j} \tag{9}$$

where:

$$\boldsymbol{\theta}^{B_i}_{B_j} = \iint_{t\in[i,j]} \left[\mathbf{R}^{B_i}_{B_t}\left(\widetilde{\boldsymbol{\alpha}}_t - \mathbf{b}^{B}_{\alpha_t}\right)\right]\delta t^{2} \tag{10}$$

$$\boldsymbol{\beta}^{B_i}_{B_j} = \int_{t\in[i,j]} \left[\mathbf{R}^{B_i}_{B_t}\left(\widetilde{\boldsymbol{\alpha}}_t - \mathbf{b}^{B}_{\alpha_t}\right)\right]\delta t \tag{11}$$

$$\boldsymbol{\gamma}^{B_i}_{B_j} = \int_{t\in[i,j]} \frac{1}{2}\,\Omega\!\left(\widetilde{\boldsymbol{\omega}}_t - \mathbf{b}^{B}_{\omega_t}\right)\mathbf{q}^{B_i}_{B_t}\,\delta t \tag{12}$$


From formulas (10)–(12), the pre-integrated measurements $\boldsymbol{\theta}^{B_i}_{B_j}$, $\boldsymbol{\beta}^{B_i}_{B_j}$, $\boldsymbol{\gamma}^{B_i}_{B_j}$ and the states in the IMU body coordinate system $B_i$ are independent of each other. This means that when the states in the $B_i$ coordinate system are iteratively updated, there is no need to recalculate the states in the $B_j$ coordinate system. Since the IMU pre-integrated measurements $\boldsymbol{\theta}^{B_i}_{B_j}$, $\boldsymbol{\beta}^{B_i}_{B_j}$, $\boldsymbol{\gamma}^{B_i}_{B_j}$ are affected by the biases, when the biases are updated iteratively, we update the pre-integrated measurements by the first-order approximation method:

$$\boldsymbol{\theta}^{B_i}_{B_j} \approx \overline{\boldsymbol{\theta}}^{B_i}_{B_j} + \mathbf{J}^{\theta}_{b_\alpha}\,\delta\mathbf{b}^{B}_{\alpha} + \mathbf{J}^{\theta}_{b_\omega}\,\delta\mathbf{b}^{B}_{\omega} \tag{13}$$

$$\boldsymbol{\beta}^{B_i}_{B_j} \approx \overline{\boldsymbol{\beta}}^{B_i}_{B_j} + \mathbf{J}^{\beta}_{b_\alpha}\,\delta\mathbf{b}^{B}_{\alpha} + \mathbf{J}^{\beta}_{b_\omega}\,\delta\mathbf{b}^{B}_{\omega} \tag{14}$$

$$\boldsymbol{\gamma}^{B_i}_{B_j} \approx \overline{\boldsymbol{\gamma}}^{B_i}_{B_j} \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\mathbf{J}^{\gamma}_{b_\omega}\,\delta\mathbf{b}^{B}_{\omega} \end{bmatrix} \tag{15}$$

where $\mathbf{J}^{\theta}_{b_\alpha}$, $\mathbf{J}^{\theta}_{b_\omega}$, $\mathbf{J}^{\beta}_{b_\alpha}$, $\mathbf{J}^{\beta}_{b_\omega}$, and $\mathbf{J}^{\gamma}_{b_\omega}$ are the Jacobian matrices of the pre-integrated measurements with respect to the biases.
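As an illustration of Equations (10)–(12), the sketch below propagates the three pre-integrated terms from raw gyroscope and accelerometer samples with a simple Euler scheme; it omits the noise-covariance and bias-Jacobian propagation used in a full implementation, and all names are assumptions.

```python
# Rough sketch of discrete IMU pre-integration between keyframes i and j (Equations (10)-(12)).
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(gyro, accel, dt, b_w, b_a):
    """gyro, accel: (N, 3) raw IMU samples between the two keyframes; dt: sample period."""
    theta = np.zeros(3)           # position-like pre-integrated term (double integral)
    beta = np.zeros(3)            # velocity-like pre-integrated term (single integral)
    gamma = Rotation.identity()   # rotation from B_i to the current sample frame B_t
    for w, a in zip(gyro, accel):
        a_rot = gamma.apply(a - b_a)                    # R^{B_i}_{B_t} (a_t - b_a)
        theta += beta * dt + 0.5 * a_rot * dt ** 2
        beta += a_rot * dt
        gamma = gamma * Rotation.from_rotvec((w - b_w) * dt)
    return theta, beta, gamma
```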

#### *3.3. Visual-Inertial Alignment*

The convergence speed and effect of nonlinear visual-inertial SLAM systems depend heavily on reliable initial values. Therefore, in the initialization procedure of SD-VIS, we align the pre-integrated IMU measurements with the visual image to complete the system initialization.

#### 3.3.1. Gyroscope Bias Correction

We regard the camera coordinate system $C_0$ as the world coordinate system. We detect the feature points on each image and track them with the KLT sparse optical flow algorithm; the rotations $\mathbf{R}^{C_0}_{C_t}$ and $\mathbf{R}^{C_0}_{C_{t+1}}$ of two adjacent frames $C_t$ and $C_{t+1}$ can then be estimated by visual structure from motion (SfM). The rotation increment $\widetilde{\boldsymbol{\gamma}}^{B_t}_{B_{t+1}}$ between the IMU body coordinate systems $B_t$ and $B_{t+1}$ can be estimated by IMU pre-integration. We obtain the following formula:

$$\min_{\delta\mathbf{b}^{B}_{\omega_t}} \sum_{t} \left\| {\mathbf{R}^{C_0}_{C_{t+1}}}^{-1} \otimes \mathbf{R}^{C_0}_{C_t} \otimes \boldsymbol{\gamma}^{B_t}_{B_{t+1}} \right\|^{2} \tag{16}$$

where:

$$\boldsymbol{\gamma}^{B_t}_{B_{t+1}} \approx \widetilde{\boldsymbol{\gamma}}^{B_t}_{B_{t+1}} \otimes \begin{bmatrix} 1 \\ \frac{1}{2}\mathbf{J}^{\gamma}_{b_\omega}\,\delta\mathbf{b}^{B}_{\omega_t} \end{bmatrix} \tag{17}$$

We solve the above least-squares problem to obtain the initial calibration of the gyroscope bias and use it to update $\boldsymbol{\theta}^{B_i}_{B_j}$, $\boldsymbol{\beta}^{B_i}_{B_j}$, and $\boldsymbol{\gamma}^{B_i}_{B_j}$.
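The toy sketch below conveys the idea behind Equation (16): find the gyroscope bias that best reconciles the relative rotations from visual SfM with the integrated gyroscope rotations. For brevity it simply re-integrates the raw gyroscope samples for each candidate bias instead of applying the first-order Jacobian update of Equation (17); the data layout and names are assumptions.

```python
# Toy gyroscope-bias calibration: minimize the disagreement between visual relative rotations
# and gyro-integrated relative rotations over all adjacent frame pairs.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def gyro_rotation(gyro, dt, bias):
    """Integrate raw gyro samples (N, 3) between two frames into a relative rotation."""
    R = Rotation.identity()
    for w in gyro:
        R = R * Rotation.from_rotvec((w - bias) * dt)
    return R

def residuals(bias, visual_rel_rotations, gyro_segments, dt):
    res = []
    for R_vis, gyro in zip(visual_rel_rotations, gyro_segments):
        R_imu = gyro_rotation(gyro, dt, bias)
        res.append((R_vis.inv() * R_imu).as_rotvec())   # rotation disagreement for this pair
    return np.concatenate(res)

def calibrate_gyro_bias(visual_rel_rotations, gyro_segments, dt):
    sol = least_squares(residuals, np.zeros(3),
                        args=(visual_rel_rotations, gyro_segments, dt))
    return sol.x
```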

#### 3.3.2. Gravity Vector, Initial Velocity, and Metric Scale Correction

We define the variables that need to be calibrated as:

$$\mathcal{X}_{I} = \begin{bmatrix} \mathbf{V}^{B_0}_{B_0} & \mathbf{V}^{B_1}_{B_1} & \dots & \mathbf{V}^{B_n}_{B_n} & \mathbf{g}^{C_0} & s \end{bmatrix} \tag{18}$$

where $\mathbf{V}^{B_k}_{B_k}$ is the initial velocity in the IMU body coordinate system, $\mathbf{g}^{C_0}$ is the gravity vector in the camera coordinate system, and $s$ represents the metric scale of the semi-direct visual-inertial SLAM framework.

Suppose we have obtained the pre-calibrated extrinsic parameters $\mathbf{T}_{BC} = (\mathbf{R}^{B}_{C}, \mathbf{P}^{B}_{C})$; then we can transform the pose $\mathbf{T}_{C_0 C_t} = (\mathbf{R}^{C_0}_{C_t}, \mathbf{P}^{C_0}_{C_t})$ from the camera coordinate system to the IMU body coordinate system:

$$\mathbf{q}\_{\mathbf{B}\_{\mathbf{t}}}^{\mathbf{C}\_{0}} = \mathbf{q}\_{\mathbf{C}\_{\mathbf{t}}}^{\mathbf{C}\_{0}} \otimes \mathbf{q}\_{\mathbf{C}}^{\mathbf{B}-1} \tag{19}$$

$$\rm{sP}\_{\rm B\_t}^{\rm{C\_0}} = \rm{sP}\_{\rm C\_t}^{\rm{C\_0}} - \rm{R}\_{\rm B\_t}^{\rm{C\_0}} P\_{\rm C}^{\rm{B}} \tag{20}$$

Considering two adjacent keyframes $B_t$ and $B_{t+1}$, formulas (7) and (8) can be rewritten as:

$$\boldsymbol{\Theta}\_{\mathbf{B}\_{\mathrm{t}+1}}^{\mathbf{B}\_{\mathrm{t}}} = \mathbf{R}\_{\mathrm{C}\_{0}}^{\mathbf{B}\_{\mathrm{t}}} \left( \mathbf{s} \Big( \mathbf{P}\_{\mathrm{B}\_{\mathrm{t}+1}}^{\mathbf{C}\_{0}} - \mathbf{P}\_{\mathrm{B}\_{\mathrm{t}}}^{\mathbf{C}\_{0}} \Big) - \mathbf{R}\_{\mathrm{B}\_{\mathrm{t}}}^{\mathbf{C}\_{0}} \mathbf{V}\_{\mathrm{B}\_{\mathrm{t}}}^{\mathbf{B}\_{\mathrm{t}}} \Delta \mathbf{t} + \frac{1}{2} \mathbf{g}^{\mathbf{C}\_{0}} \Delta \mathbf{t}^{2} \right) \tag{21}$$

$$\boldsymbol{\beta}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{\mathbf{t}}} = \mathbf{R}\_{\mathbf{C}\_{0}}^{\mathbf{B}\_{\mathbf{t}}} \left( \mathbf{R}\_{\mathbf{B}\_{t+1}}^{\mathbf{C}\_{0}} \mathbf{V}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{\mathbf{t}+1}} - \mathbf{R}\_{\mathbf{B}\_{\mathbf{t}}}^{\mathbf{C}\_{0}} \mathbf{V}\_{\mathbf{B}\_{\mathbf{t}}}^{\mathbf{B}\_{\mathbf{t}}} + \mathbf{g}^{\mathbf{C}\_{0}} \Delta \mathbf{t} \right) \tag{22}$$

We combine formulas (19)–(22) to get the following formula:

$$
\overline{\mathbf{Z}}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{t}} = \begin{bmatrix} \boldsymbol{\Theta}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{t}} - \mathbf{P}\_{\mathbf{C}}^{\mathbf{B}} + \mathbf{R}\_{\mathbf{C}\_{0}}^{\mathbf{B}\_{t}} \mathbf{R}\_{\mathbf{B}\_{t+1}}^{\mathbf{C}} \mathbf{P}\_{\mathbf{C}}^{\mathbf{B}} \\ \boldsymbol{\upbeta}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{t}} \end{bmatrix} = \mathbf{H}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{t}} \boldsymbol{\upchi}\_{\mathbf{I}} + \mathbf{n}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{t}} \tag{23}
$$

where:

$$\mathbf{H}\_{\mathbf{B}\_{t+1}}^{\mathbf{B}\_{t}} = \begin{bmatrix} -\mathrm{I}\Delta\mathbf{t} & \mathbf{0} & \frac{1}{2}\mathbf{R}\_{\mathbf{C}}^{\mathbf{B}\_{t}}\Delta\mathbf{t}^{2} & \mathbf{R}\_{\mathbf{C}\_{0}}^{\mathbf{B}\_{t}}\left(\mathbf{P}\_{\mathbf{C}\_{t+1}}^{\mathbf{C}\_{0}} - \mathbf{P}\_{\mathbf{C}\_{t}}^{\mathbf{C}\_{0}}\right) \\\ -\mathbf{I} & \mathbf{R}\_{\mathbf{C}\_{0}}^{\mathbf{B}\_{t}}\mathbf{R}\_{\mathbf{B}\_{t+1}}^{\mathbf{C}\_{0}} & \mathbf{R}\_{\mathbf{C}\_{0}}^{\mathbf{B}\_{t}}\Delta\mathbf{t} & \mathbf{0} \end{bmatrix} \tag{24}$$

In the above formula, $\mathbf{R}^{C_0}_{B_t}$, $\mathbf{R}^{C_0}_{B_{t+1}}$, $\mathbf{P}^{C_0}_{B_t}$, and $\mathbf{P}^{C_0}_{B_{t+1}}$ can be obtained through visual SfM:

$$\min\_{\mathbf{X}\_{\mathbf{I}}} \sum \| \overline{\mathbf{Z}}\_{\mathbf{B}\_{\mathbf{t}+1}}^{\mathbf{B}\_{\mathbf{t}}} - \mathbf{H}\_{\mathbf{B}\_{\mathbf{t}+1}}^{\mathbf{B}\_{\mathbf{t}}} \chi\_{\mathbf{I}} \|^{2} \tag{25}$$

Solving the above formula, we can calibrate the initial velocity for each keyframe, gravity vector, and absolute metric scale. After estimating the scale, we will adjust the translation vector of the vision SfM to make the system have an observable scale.
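Assuming the per-keyframe-pair blocks of Equation (24) have already been padded to the full width of the unknown vector, the problem of Equation (25) reduces to a single linear least-squares solve, as in this sketch (shapes and names are illustrative):

```python
# Sketch of the visual-inertial alignment solve (Equation (25)): stack the linear constraints
# Z = H * chi from all adjacent keyframe pairs and recover velocities, gravity, and scale.
import numpy as np

def solve_alignment(H_blocks, Z_blocks):
    """H_blocks: per-pair matrices padded to shape (6, 3*(n+1) + 4); Z_blocks: length-6 vectors."""
    A = np.vstack(H_blocks)
    b = np.concatenate(Z_blocks)
    chi, *_ = np.linalg.lstsq(A, b, rcond=None)
    velocities = chi[:-4].reshape(-1, 3)   # initial velocity of each keyframe body frame
    gravity = chi[-4:-1]                   # gravity vector expressed in C0
    scale = chi[-1]                        # metric scale of the visual SfM
    return velocities, gravity, scale
```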

#### 3.3.3. Gravity Vector Refinement

We can know the magnitude of the gravity vector in advance, so we refer to the VINS-Mono [18] method to re-parameterize the gravity vector obtained in Section 3.3.2 with two variables in tangent space, and perform further optimization.

After obtaining the accurate gravity vector, we can rotate the coordinate system C0, which is temporarily the world coordinate system, to the real world coordinate system W. However, since the yaw angle in the visual-inertial SLAM system is unobservable, the yaw angle of the C<sup>0</sup> coordinate system remains unchanged during the rotation process. At this time, the initialization procedure of the semi-direct visual-inertial SLAM system is completed.

#### **4. Visual Measurements**

#### *4.1. Keyframe Selection*

We have three different keyframe selection strategies; satisfying any one of them makes the current frame a keyframe. The first and third strategies are based on the feature matching results of the last frame and the penultimate frame in the sliding window, which are matched before the current frame arrives. The first selection strategy is the number of tracked feature points. No new feature points are extracted when tracking non-keyframes, and the translational motion of the camera leads to a decrease in tracked feature points. If the number of tracked points in the last frame in the sliding window is less than 70% of the minimum tracking-point threshold, the current frame is treated as a keyframe. The second selection strategy is related to IMU pre-integration. If the translation distance between the last two adjacent frames in the sliding window, calculated from the IMU pre-integration, exceeds a preset threshold, the current frame is also considered a new keyframe. The third selection strategy is the average parallax of the feature points tracked between the last frame and the penultimate frame in the sliding window. The translation or rotation of the camera causes parallax; when the average parallax exceeds the threshold, the current frame is also regarded as a keyframe.
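A simplified sketch of these three criteria, with placeholder thresholds rather than the values used in SD-VIS, might look like this:

```python
# Illustrative keyframe decision combining the three criteria described above.
def is_keyframe(tracked_points, min_track_threshold,
                imu_translation, translation_threshold,
                average_parallax, parallax_threshold):
    # 1) too few feature points survive tracking in the last frame of the sliding window
    if tracked_points < 0.7 * min_track_threshold:
        return True
    # 2) the IMU pre-integration reports a large translation between the last two frames
    if imu_translation > translation_threshold:
        return True
    # 3) the average parallax of tracked features between the last two frames is large
    if average_parallax > parallax_threshold:
        return True
    return False
```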

#### *4.2. Keyframes Tracking*

If the current frame is treated as a keyframe, we first use the FAST feature detector [28] to add new feature points in the last frame in the sliding window and then use the KLT sparse optical flow algorithm to track them in the current frame (Figure 3). At least 200 feature points are maintained in each frame. Since there is no need to compute feature-point descriptors, the optical flow method saves time. In addition, we use RANSAC [29] with the fundamental matrix model to eliminate outliers generated during tracking.

**Figure 3.** The top chart (**a**) shows how to track a keyframe. The black triangle represents the newly extracted feature points on the 10th frame (last frame) of the sliding window, while the green dotted line represents tracking them in the keyframe by KLT sparse optical flow algorithm. The bottom chart (**b**) shows how to track a non-keyframe. The red dotted line indicates that the feature points on the 10th frame of the sliding window are tracked on the non-keyframe by the direct method.

#### *4.3. Non-Keyframes Tracking*


If the current frame is considered a non-keyframe, we use direct image alignment to estimate the relative pose $\mathbf{T}_{t,t+1}$ between the current frame and the last frame in the sliding window. The initial value of the relative pose can be obtained directly from the IMU pre-integration. The feature points observed in the last frame are projected into the current frame according to the estimated pose $\mathbf{T}_{t,t+1}$. Under the hypothesis of photometric invariance, if the same feature point is observed by two adjacent frames, the photometric values of the projection points on the two adjacent frames are equal (Figure 4). Therefore, we can optimize the relative pose $\mathbf{T}_{t,t+1}$ by minimizing the photometric error between image blocks (4 × 4 pixels):

$$\mathbf{T}_{t,t+1} = \operatorname*{argmin}_{\mathbf{T}} \iint_{R} \rho\big[\delta I(\mathbf{T}, \mathbf{u})\big]\,d\mathbf{u} \tag{26}$$


where $\rho[\cdot] = \frac{1}{2}\|\cdot\|^{2}$, $\mathbf{u}$ is the position of a feature point in the last frame of the sliding window, and $R$ is the image region in which the depth is known in the last frame and the back-projected points are visible in the current frame.

**Figure 4.** Adjusting the relative pose $\mathbf{T}_{t,t+1}$ between the current frame and the last frame in the sliding window means moving the re-projection position $\mathbf{u}'$ of the feature points on the current frame.

The photometric error $\delta I$ is:

$$\delta I(\mathbf{T}_{t,t+1}, \mathbf{u}) = I_{t+1}\Big(\pi\big(\mathbf{T}_{t,t+1}\cdot\pi^{-1}(\mathbf{u}, d_p)\big)\Big) - I_{t}(\mathbf{u}) \quad \forall\, \mathbf{u} \in R \tag{27}$$

where $d_p$ is the depth of the feature point in the last frame in the sliding window, and $I_k$ represents the intensity image of the k-th frame.

We use the inverse compositional formulation [30] of the photometric error, which avoids unnecessary Jacobian derivations. The update step $\mathbf{T}(\boldsymbol{\xi})$ for the last frame in the sliding window is:

$$\delta I(\boldsymbol{\xi}, \mathbf{u}) = I_{t+1}\Big(\pi\big(\mathbf{T}_{t,t+1}\cdot\pi^{-1}(\mathbf{u}, d_p)\big)\Big) - I_{t}\Big(\pi\big(\mathbf{T}(\boldsymbol{\xi})\cdot\pi^{-1}(\mathbf{u}, d_p)\big)\Big) \quad \forall\, \mathbf{u} \in R \tag{28}$$

We solve it with an iterative Gauss–Newton method and update $\mathbf{T}_{t,t+1}$ in the following way:

$$T\_{t,t+1} \leftarrow T\_{t,t+1} \cdot \mathbf{T}(\boldsymbol{\xi})^{-1} \tag{29}$$

After image alignment, we obtain the optimized relative pose $\mathbf{T}_{t,t+1}$ between the current frame and the last frame in the sliding window. We define all 3D points observed in all frames in the sliding window as the local map, and project the local map onto the current frame to find the 3D points visible in the current frame. Due to inaccuracies in the visible 3D point positions and the camera pose, there will be errors in the projected positions in the current frame. To make the projected positions more accurate, the current frame needs to be aligned with the local map. The feature matching step optimizes the positions of all the projection points in the current frame by minimizing the photometric errors of the projection blocks (5 × 5 pixels) in the current frame and the reference frame (Figure 5):

$$\mathbf{u}'_{i} = \operatorname*{argmin}_{\mathbf{u}'_{i}} \frac{1}{2}\left\| I_{t+1}\big(\mathbf{u}'_{i}\big) - A_{i}\cdot I_{r}\big(\mathbf{u}_{i}\big) \right\|^{2} \quad \forall\, i \tag{30}$$

Solving the above formula with an iterative Gauss–Newton method, we obtain the update of the projection block positions. The reference frame $r$ is usually far away from the current frame, so we apply an affine warping $A_i$ to the reference patch.

Through image alignment and feature matching, we obtain the implicit result of direct motion estimation: feature correspondences with sub-pixel accuracy. Note that when tracking non-keyframes with the direct method, no new feature points are extracted. In the back-end optimization, we combine the IMU residual, visual re-projection error, prior information, and re-localization information to optimize the camera pose and 3D point positions again.

**Figure 5.** Adjusting the position of the projection block $\mathbf{u}'_i$ on the current frame to minimize the photometric error of the projection block in the current frame and the reference frame in the sliding window.
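A much-simplified sketch of the direct alignment idea in Equations (26)–(29): refine the relative pose by minimizing photometric residuals of features with known depth. It uses single-pixel residuals instead of 4 × 4 patches, nearest-neighbour intensity lookups, and a generic least-squares solver with a numerical Jacobian in place of the Gauss–Newton / inverse-compositional scheme; every name here is illustrative.

```python
# Toy direct image alignment between the last keyframe and a non-keyframe.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def back_project(u, depth, K):
    fu, fv, cu, cv = K
    return depth * np.array([(u[0] - cu) / fu, (u[1] - cv) / fv, 1.0])

def project(X, K):
    fu, fv, cu, cv = K
    return np.array([fu * X[0] / X[2] + cu, fv * X[1] / X[2] + cv])

def intensity(img, u):
    # nearest-neighbour lookup; a real implementation would interpolate bilinearly
    x, y = int(round(u[0])), int(round(u[1]))
    return float(img[np.clip(y, 0, img.shape[0] - 1), np.clip(x, 0, img.shape[1] - 1)])

def photometric_residuals(pose, img_last, img_cur, feats, depths, K):
    R, t = Rotation.from_rotvec(pose[:3]), pose[3:]
    res = []
    for u, d in zip(feats, depths):
        X_cur = R.apply(back_project(u, d, K)) + t     # T_{t,t+1} * pi^{-1}(u, d_p)
        res.append(intensity(img_cur, project(X_cur, K)) - intensity(img_last, u))
    return np.array(res)

def align(img_last, img_cur, feats, depths, K, pose0=None):
    pose0 = np.zeros(6) if pose0 is None else pose0    # e.g., seeded from IMU pre-integration
    sol = least_squares(photometric_residuals, pose0,
                        args=(img_last, img_cur, feats, depths, K))
    return sol.x                                       # refined [rotation vector, translation]
```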

#### **5. Sliding Window-based Tightly-coupled Optimization Framework**

After tracking non-keyframes and keyframes, we proceed with a sliding window-based tightly-coupled optimization framework for high accuracy and robust state estimation. In the optimization framework, we combined IMU residual, visual re-projection error, prior information, and re-localization information to optimize the camera pose and 3D point position again.

#### *5.1. Formulation*

The state variables to be estimated by SD-VIS are defined as:


$$\mathcal{X} = \begin{bmatrix} \mathbf{X}_0 & \mathbf{X}_1 & \dots & \mathbf{X}_n & \mathbf{X}^{B}_{C} \end{bmatrix},\qquad \mathbf{X}_k = \begin{bmatrix} \mathbf{P}^{W}_{B_k} & \mathbf{V}^{W}_{B_k} & \mathbf{q}^{W}_{B_k} & \mathbf{b}^{B}_{\alpha} & \mathbf{b}^{B}_{\omega} \end{bmatrix},\quad k \in [0, n] \tag{31}$$

$$\mathbf{X}^{B}_{C} = \begin{bmatrix} \mathbf{P}^{B}_{C} & \mathbf{q}^{B}_{C} \end{bmatrix}$$

where $\mathbf{X}_k$ includes the translation, velocity, and rotation quaternion of the k-th IMU body coordinate system with respect to the world coordinate system, as well as the biases of the accelerometer and gyroscope, and $n$ represents the size of the sliding window. By minimizing the sum of the IMU residuals, visual re-projection errors, prior information, and re-localization information in the sliding window, we can obtain a robust and accurate semi-direct visual-inertial SLAM system:

$$\min_{\mathcal{X}} \left\{ \sum \left\| \mathbf{r}_{B}\big(\widetilde{\mathbf{Z}}^{B_k}_{B_{k+1}}, \mathcal{X}\big) \right\|^{2}_{\mathbf{P}^{B_k}_{B_{k+1}}} + \sum \left\| \mathbf{r}_{C}\big(\widetilde{\mathbf{Z}}^{C_j}_{F}, \mathcal{X}\big) \right\|^{2}_{\mathbf{P}^{C_j}_{F}} + \left\| \mathbf{r}_{p} - \mathbf{H}_{p}\,\mathcal{X} \right\|^{2} + \sum \left\| \mathbf{r}_{C}\big(\widetilde{\mathbf{Z}}^{C_L}_{F}, \mathcal{X}\big) \right\|^{2}_{\mathbf{P}^{C_L}_{F}} \right\} \tag{32}$$

where $\mathbf{r}_{B}(\widetilde{\mathbf{Z}}^{B_k}_{B_{k+1}}, \mathcal{X})$, $\mathbf{r}_{C}(\widetilde{\mathbf{Z}}^{C_j}_{F}, \mathcal{X})$, $\{\mathbf{r}_{p}, \mathbf{H}_{p}\}$, and $\mathbf{r}_{C}(\widetilde{\mathbf{Z}}^{C_L}_{F}, \mathcal{X})$ are the IMU residuals, visual re-projection errors, prior information, and re-localization information, respectively.


#### *5.2. IMU Residuals*

According to the formulas (4)–(6) in Section 3.2, we can get the IMU measurement residual:

$$\mathbf{r}_{B}\big(\widetilde{\mathbf{Z}}^{B_i}_{B_j}, \mathcal{X}\big) = \begin{bmatrix} \delta\boldsymbol{\theta}^{B_i}_{B_j} \\ \delta\boldsymbol{\beta}^{B_i}_{B_j} \\ \delta\boldsymbol{\gamma}^{B_i}_{B_j} \\ \delta\mathbf{b}^{B}_{\alpha} \\ \delta\mathbf{b}^{B}_{\omega} \end{bmatrix} = \begin{bmatrix} \mathbf{R}^{B_i}_{W}\big(\mathbf{P}^{W}_{B_j} - \mathbf{P}^{W}_{B_i} - \mathbf{V}^{W}_{B_i}\Delta t + \frac{1}{2}\mathbf{g}^{W}\Delta t^{2}\big) - \widetilde{\boldsymbol{\theta}}^{B_i}_{B_j} \\ \mathbf{R}^{B_i}_{W}\big(\mathbf{V}^{W}_{B_j} - \mathbf{V}^{W}_{B_i} + \mathbf{g}^{W}\Delta t\big) - \widetilde{\boldsymbol{\beta}}^{B_i}_{B_j} \\ 2\Big[{\mathbf{q}^{W}_{B_i}}^{-1} \otimes \mathbf{q}^{W}_{B_j} \otimes \big(\widetilde{\boldsymbol{\gamma}}^{B_i}_{B_j}\big)^{-1}\Big]_{xyz} \\ \mathbf{b}^{B_i}_{\alpha} - \mathbf{b}^{B_j}_{\alpha} \\ \mathbf{b}^{B_i}_{\omega} - \mathbf{b}^{B_j}_{\omega} \end{bmatrix}_{15\times 1} \tag{33}$$

where $[\cdot]_{xyz}$ extracts the vector part of a quaternion, and $[\widetilde{\boldsymbol{\theta}}^{B_i}_{B_j}, \widetilde{\boldsymbol{\beta}}^{B_i}_{B_j}, \widetilde{\boldsymbol{\gamma}}^{B_i}_{B_j}]^{T}$ are the IMU pre-integrated measurements between two adjacent keyframes $B_i$ and $B_j$.

#### *5.3. Visual Re-Projection Errors*

When the feature point $\mathbf{F}_1$ is first observed in the i-th image, the visual re-projection error in the j-th image can be defined by the pinhole camera model as:

$$\mathbf{r}_{C}\big(\widetilde{\mathbf{Z}}^{C_j}_{F_1}, \mathcal{X}\big) = \pi^{-1}\!\left(\begin{bmatrix} \widetilde{u}^{C_j}_{F_1} \\ \widetilde{v}^{C_j}_{F_1} \end{bmatrix}\right) - \mathbf{R}^{C}_{B}\left(\mathbf{R}^{B_j}_{W}\left(\mathbf{R}^{W}_{B_i}\left(\mathbf{R}^{B}_{C}\,\pi^{-1}\!\left(\begin{bmatrix} \widetilde{u}^{C_i}_{F_1} \\ \widetilde{v}^{C_i}_{F_1} \end{bmatrix}\right) + \mathbf{P}^{B}_{C}\right) + \mathbf{P}^{W}_{B_i} - \mathbf{P}^{W}_{B_j}\right) - \mathbf{P}^{B}_{C}\right) \tag{34}$$

where $(\widetilde{u}^{C_i}_{F_1}, \widetilde{v}^{C_i}_{F_1})$ and $(\widetilde{u}^{C_j}_{F_1}, \widetilde{v}^{C_j}_{F_1})$ represent the pixel coordinates of the feature point $\mathbf{F}_1$ projected onto the i-th and j-th frame images, respectively, and $\pi^{-1}$ is the back-projection function of the pinhole camera model.

#### *5.4. Marginalization Strategy*

In order to limit the computational complexity of SD-VIS, the back-end adopts a sliding window-based tightly-coupled optimization framework, and we use the marginalization strategy [31] to keep the sliding window operating correctly. As shown in Figure 6, if the current frame is determined to be a keyframe, the frame remains in the sliding window, and the oldest frame is marginalized out. When the oldest frame is marginalized, the feature points that are observed only by the oldest frame are discarded directly, and the other visual and inertial measurements associated with that frame are removed from the sliding window by the Schur complement. The new prior information constructed by the Schur complement is added to the existing prior information. If the current frame is not a keyframe, the last frame in the sliding window is marginalized, and all visual measurements related to that frame are removed directly from the sliding window.
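The core of this step is the Schur complement on the information matrix of the sliding-window problem. A minimal sketch, assuming the information matrix H and vector b have already been assembled and that `marg`/`keep` index the variables to eliminate and to retain:

```python
# Marginalize a block of states via the Schur complement and return the resulting prior.
import numpy as np

def marginalize(H, b, marg, keep):
    Hmm = H[np.ix_(marg, marg)]
    Hmk = H[np.ix_(marg, keep)]
    Hkm = H[np.ix_(keep, marg)]
    Hkk = H[np.ix_(keep, keep)]
    Hmm_inv = np.linalg.inv(Hmm + 1e-9 * np.eye(len(marg)))   # small damping for stability
    H_prior = Hkk - Hkm @ Hmm_inv @ Hmk    # prior information on the remaining states
    b_prior = b[keep] - Hkm @ Hmm_inv @ b[marg]
    return H_prior, b_prior
```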

#### *5.5. Re-Localization*

Because the global 3D position and yaw angle are unobservable, there are inevitable accumulated errors in a visual-inertial SLAM system. To eliminate the accumulated error, we introduce the re-localization module. After a keyframe is tracked successfully, loop closure detection can judge whether the SLAM system has visited the place before. We utilize DBoW2, a state-of-the-art bag-of-words place recognition approach, for loop closure detection. When a loop is detected, the re-localization module can effectively align the current sliding window, thus eliminating the accumulated error. For a detailed description of re-localization, readers may refer to [18].


**Figure 6.** Marginalization strategy of SD-VIS.

#### **6. Experiment**

We evaluate the accuracy, robustness, and real-time performance of SD-VIS on the EuRoC dataset [32]. The SD-VIS method is compared with the state-of-the-art vision SLAM methods, such as VINS-mono [18] and VINS-Fusion [20]. In Section 6.1, the accuracy and robustness of SD-VIS are evaluated, and the experimental results show that the accuracy and robustness of the proposed method reach the same level as the state-of-the-art method. Section 6.2 evaluates real-time performance. The experimental results show that the proposed method achieves a good balance between accuracy and real-time performance. Section 6.3 evaluates the loop closure detection capability and verifies the overall feasibility of the SLAM system.

#### *6.1. Accuracy and Robustness Evaluation*

In the experiments on the EuRoC dataset, we adopt the open source tool EVO [33] to evaluate the performance of SD-VIS. By comparing the estimated value with the actual value, we calculate the absolute pose error (APE) as an index of the evaluation algorithm [34]. Table 1 shows the root mean square error (RMSE) of the translation on the EuRoC dataset. For fairness, the following algorithms do not use the loop closure detection module.
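For reference, once the estimated and ground-truth trajectories have been time-associated and aligned (EVO performs that association and alignment in our experiments), the translation RMSE of the APE reduces to the following computation; the sketch is only illustrative:

```python
# Translation RMSE of the absolute pose error between matched trajectory positions.
import numpy as np

def translation_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: (N, 3) arrays of associated, aligned positions."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```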


**Table 1.** The root mean square error (RMSE) results of the translation (m).

As can be seen from Table 1, in terms of accuracy, SD-VIS, VINS-Mono, and VINS-Fusion are at the same level. The accuracy of SD-VIS is slightly lower than that of VINS-Mono when moving at low and medium speed in environments with abundant feature points (such as MH\_01\_easy and MH\_03\_medium). This is due to the susceptibility of the direct method to illumination changes when tracking non-keyframes. We observed that some datasets exhibit strong exposure changes between images; therefore, the tracking effect of the direct method is reduced in these cases. In addition, to further improve the real-time performance of the algorithm, we only extract new feature points in keyframes, so the number of feature points in non-keyframes is smaller than in keyframes, which also has a small negative effect. Although the accuracy is not significantly better than that of the traditional method, SD-VIS achieves the same level of accuracy as VINS-Mono while greatly improving real-time performance. Therefore, our algorithm is well suited to small unmanned platforms with limited computing resources. The accuracy of SD-VIS is higher than that of VINS-Mono and VINS-Fusion when moving fast in low-texture environments (such as V2\_03\_difficult). This is due to the excellent performance of the direct method in low-texture environments. In addition, the keyframe selection strategy tends to generate more keyframes during fast motion, which also improves the accuracy of the algorithm. Figure 7 shows the trajectory heat maps estimated by SD-VIS, VINS-Mono, and VINS-Fusion in MH\_01\_easy. Figure 8 shows the change of the translation absolute pose error over time in MH\_01\_easy, MH\_04\_medium, and V2\_03\_difficult. From Figures 7 and 8, we conclude that the accuracy and robustness of our algorithm reach the level of the state-of-the-art algorithms, and that it performs better especially in the initialization procedure and in low-texture environments.

**Figure 7.** The trajectory heat map estimated by SD-VIS, VINS-Mono, and VINS-Fusion in MH\_01\_easy.

**Figure 8.** The change of translation absolute pose error with time in MH\_01\_easy, MH\_04\_medium, and V2\_03\_difficult. Blue, green, and red represent SD-VIS, VINS-Mono, and VINS-Fusion, respectively.

#### *6.2. Real-Time Performance Evaluation*

In this section, we evaluate the real-time performance of SD-VIS. We compared the average time required to track an image (Table 2).


**Table 2.** Average time (ms) spent tracking an image.

As can be seen from Table 2, the image tracking of ORB-SLAM2 [11] uses the feature-based method to extract and match the ORB features of each frame, which takes a long time. However, VINS-Mono uses the optical flow method to track FAST features, which saves the calculation of feature descriptors, so the time consumption is less than ORB-SLAM2. VINS-Fusion is a stereo visual-inertial fusion SLAM algorithm, and image tracking also takes a long time. In SD-VIS, non-keyframes are used for fast-tracking and localization by direct method, and keyframes are tracked by feature-based method and used for back-end optimization and loop closure detection. This algorithm saves a lot of time and minimizes the average time of SD-VIS tracking images. Due to the keyframe selection strategy, in some low-speed motion scenes (such as MH\_01\_easy and V1\_01\_easy), the number of keyframes will be less, and the time to track a frame of the image will be reduced. In some fast-moving scenes (such as V1\_02\_medium and V2\_03\_difficult), as the number of keyframes increases, the time required to track an image will be increased.

In summary, the reason why we can obtain good real-time performance is due to the use of KLT sparse optical flow algorithm when tracking keyframes, which eliminates the calculation of descriptors and feature matching. In addition, for non-keyframes, only the direct method is used to track existing feature points, and new feature points are not extracted. Due to the close distance between two adjacent non-keyframes, the direct method of image alignment and feature matching can quickly converge.

Compared with the feature-based method, we use the direct method to track non-keyframes and accelerate the algorithm without reducing accuracy or robustness. As shown in Figure 9, in MH\_02\_easy, 26% of the frames are determined to be keyframes, while 74% are determined to be non-keyframes. The time consumption of tracking keyframes is 65% of the total, while that of non-keyframes is only 35%. Combined with Section 6.1, we can conclude that, compared with the state-of-the-art SLAM systems, we achieve a better balance between quickness and exactness.

**Figure 9.** The left picture (**a**) shows a comparison of the number of keyframes and non-keyframes. The right picture (**b**) shows the comparison of time consumption between tracking keyframes and non-keyframes

#### *6.3. Loop Closure Detection Evaluation*

Finally, in order to verify the integrity and feasibility of the proposed algorithm, we evaluate the loop closure detection capability of SD-VIS. As can be seen from Figures 10 and 11, the accuracy of SD-VIS with loop detection is improved obviously. Compared with the direct method, SD-VIS exhibits the function of loop closure detection and solves the problem of drift in long-term operation.

**Figure 10.** The trajectory heat map estimated by SD-VIS and SD-VIS-LOOP in V1\_01\_easy.

**Figure 11.** The change of translation absolute pose error with time in V1\_01\_easy. Blue and green represent SD-VIS and SD-VIS-LOOP respectively.

#### **7. Conclusions**

We present SD-VIS, a novel fast and accurate semi-direct visual-inertial SLAM framework, which combines the exactness of feature-based method and quickness of direct method. Compared with the state-of-the-art feature-based method, we use the direct method to track non-keyframes and accelerate the algorithm without reducing the accuracy and robustness. Compared with the direct method, SD-VIS exhibits the function of loop closure detection and solves the problem of drift in long-term operation. We get a better balance between accuracy and speed, so the algorithm is more suitable for the platform with limited computing resources. In the future, we will extend the algorithm to support more types of multi-sensor fusion to increase its robustness in complex environments.

**Author Contributions:** Conceptualization, Q.L. and H.W.; methodology, Q.L.; software, Q.L. and H.W.; validation, Q.L., Z.W. and H.W.; formal analysis, Q.L.; investigation, Q.L.; resources, Q.L. and H.W.; data curation, Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L. and Z.W.; visualization, Q.L.; supervision, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Defense Innovation Fund.

**Acknowledgments:** We express our sincere gratitude to the editors and reviewers, whose help and suggestions greatly improved our manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **Camera Calibration with Weighted Direct Linear Transformation and Anisotropic Uncertainties of Image Control Points**

## **Francesco Barone 1,\*, Marco Marrazzo 2,\* and Claudio J. Oton 1**


Received: 12 November 2019; Accepted: 30 December 2019; Published: 20 February 2020

**Abstract:** Camera calibration is a crucial step for computer vision in many applications. For example, adequate calibration is required in infrared thermography inside gas turbines for blade temperature measurements, for associating each pixel with the corresponding point on the blade 3D model. The blade has to be used as the calibration frame, but it is always only partially visible, and thus, there are few control points. We propose and test a method that exploits the anisotropic uncertainty of the control points and improves the calibration in conditions where the number of control points is limited. Assuming a bivariate Gaussian 2D distribution of the position error of each control point, we set uncertainty areas of the control points' positions, which are ellipses (with specific axis lengths and rotations) within which the control points are supposed to be. We use these ellipses to set a weight matrix to be used in a weighted Direct Linear Transformation (wDLT). We present the mathematical formalism for this modified calibration algorithm, and we apply it to calibrate a camera from a picture of a well-known object in different situations, comparing its performance to the standard DLT method and showing that the wDLT algorithm provides a more robust and precise solution. We finally discuss the quantitative improvements of the algorithm by varying the magnitude of the random deviations in the control points' positions and with partial occlusion of the object.

**Keywords:** camera calibration; DLT; PnP; weighted DLT; uncertainty; covariance; robustness

#### **1. Introduction**

Many computer vision applications, such as robotics, photogrammetry, or augmented reality, require camera calibration algorithms. The calibration is the estimation of the parameters of the camera model from given photos and videos, acquired with the camera. Camera parameters are both extrinsic and intrinsic: extrinsic parameters are pose dependent, while the intrinsic ones are related to the intrinsic properties of the camera itself.

A camera model is a mathematical description of the projection of a 3D point in the real world on the 2D image plane. The pinhole camera model assumes no distortion and a small aperture size; thus, the projection is linear, and it is fully described by a 3 × 4 projection matrix (or camera matrix). Control points are usually used for the parameters' estimation. They are points whose coordinates are known both in the 3D real world and in the 2D image plane. How the control points are chosen influences the estimation of parameters: a poor choice of control points requires a robust estimation algorithm. The most common procedures exploit a photo of an object whose geometry and pose in space are known. A chessboard pattern is usually used because corners in the chessboard pattern are very easy to identify and its geometry is simple.

When such a pattern is not available, auto-calibration algorithms could be used: they require multiple images to extract information to cope with the lack of a known object in the scene. In this case, features from multiple frames of a moving target (or moving camera) have to be extracted, then the correspondences between features of the different images have to be found [1,2]. Several video frames could be used for accurate auto-calibration procedures, especially offline when there are no constraints on execution times [3]. Plane surfaces are also commonly found in a scene, and some works use them to calibrate cameras [4,5].

Some calibration procedures aim for a one-off estimation of the intrinsic parameters of the camera, which will not vary in time. Once the intrinsic parameters are known, an online algorithm has to estimate only the camera pose. Given a set of *n* control points and given the intrinsic parameters, the estimation of the camera pose is called the Perspective-*n*-Points problem (P*n*P). Three is the smallest number of correspondences that yields a finite number of solutions [6,7]. P*n*P with three correspondence points is called P3P.

There are many different solutions to P*n*P [8,9], and some of them are very efficient with an *O*(*n*) complexity [10,11]. Other algorithms also provide some internal parameters, such as the focal length [12,13]. The sensitivity to the control points is critical. In particular, the RANSAC algorithm [14] became very popular because of its robustness to outliers. A recent algorithm developed by Ferraz (EPP*n*P [15]) reaches even better performance and higher robustness. He also integrated the uncertainties of the feature points [16], given the internal parameters.

If the intrinsic parameters are unknown, a complete calibration is required. Many analytical methods were developed in the past, especially by photogrammetrists [17]. The best known is the Direct Linear Transformation (DLT [18]) algorithm, which can compute the projection matrix from at least six control points. If there are more than six points, DLT minimizes the least squared error (LSQ) of the reprojected points. Then, from the projection matrix, it is possible to extract the parameters of the pinhole camera model [2]. Because of LSQ, DLT is not robust to outliers [19]. To reduce gross errors, Molnár [20] used the Huber estimator, based on a weight function that limits the accounting of errors for outliers. Uncertainties were also integrated in the EPP*n*P ([15]) algorithm to solve the P*n*P problem, but it assumed that the internal parameters were known [16].

Current research for camera calibration faces many different problems, such as: automatic calibration methods [21], multiple cameras' calibration [22], and distortion parameters' estimation. For non-linear cameras' calibration, Bouguet developed a general method, which included radial and tangential distortion [23], while Mei extended the algorithm also to omnidirectional cameras by means of multiple views of a planar pattern [24]. Further advances for omnidirectional camera calibration were done by Scaramuzza, who developed and implemented a fully automatic and iterative procedure for non-linear camera calibration [25,26].

Compared to DLT, non-linear camera models have the disadvantage of needing multiple images to estimate the distortion parameters; indeed, trying to estimate many parameters from only a few points of a single image leads to an ill-conditioned problem. A large set of control points is available especially when calibration patterns are used and corners are easy to detect. In particular applications, however, the number of available control points is necessarily low. For example, even if calibration patterns for an infrared camera exist [27], they cannot be used for an infrared camera inside a gas turbine.

Temperature mapping of gas turbine rotor blades is extremely important both for condition health monitoring and blade design optimization. Infrared thermography allows contactless measurements of the rotor blade surface temperature in gas turbines [28], but the mapping between the 2D image and the 3D blade requires adequate camera calibration. Thus, the blade itself has to be used as the calibration frame: indeed, its geometry is well known, and usually, an accurate 3D model is available. Nevertheless, the blade is always only partially in the field of view.

Moreover, camera lenses are subject to high temperature variations, and the intrinsic parameters, such as the focal length, change with temperature; thus, offline calibration of the intrinsic parameters cannot be done.

Manual calibration is required because infrared images of blades usually have low quality and low resolution; thus, the images have a limited number of features (such as corners and edges), and the number of control points is very limited.

To solve this problem, in this paper, we propose a method for camera calibration suitable for a low number (less than 10) of control points with anisotropic uncertainty. Assuming a different bivariate Gaussian distribution of the position error of each control point, its anisotropic uncertainty is determined by the uncertainty ellipse, within which the control point is estimated to be. The proposed method, based on direct linear transformation, consists of minimizing the weighted reprojection error of control points. The weight matrix is chosen according to the axis lengths and axis rotation of the uncertainty ellipses. Because of that, the presented weighted variation of the DLT (wDLT) includes the uncertainty information, improving the average accuracy, especially with few control points.

The wDLT algorithm we propose requires a photo of a well known 3D object (such as the rotor blade in a gas turbine).

Uncertainty ellipses could also be used to constrain a point along the direction of an edge, or a segment: in this case, the uncertainty is high along the edge, while it is low in the direction perpendicular to it. Exploiting the uncertainty becomes necessary when there are only a few control points, and some of them are almost exact, while the other ones are approximated.

In the following sections, we show the DLT algorithm for camera calibration and how to set the weight matrix for wDLT according to the uncertainty areas of the control points. Then, we compare the performance of three algorithms (DLT, Bouguet's method [23], and wDLT) using the reprojection error on 21 real images with only seven control points each. We evaluate the robustness of the algorithms by introducing fictitious errors to the control points. Infrared images of rotor blades inside gas turbines always show only a partial view of the blade, because of small spaces and the design constraints of the camera system; thus, we also simulate different scenarios where the object is partially occluded, making it difficult to find the control points manually and leading to an accuracy error. We also evaluate the sensitivity to the uncertainty values. The results show that wDLT reaches higher average accuracy when the control points are perturbed with fictitious errors, and it allows camera calibration in some evaluated scenarios where the standard DLT and Bouguet's method fail.

#### **2. Materials and Methods**

In this section, we first describe the weighted direct linear transformation algorithm and how we propose to choose the weight matrix; then, we describe the tests we performed on real images. The implementations of the DLT and wDLT algorithms were realized with MATLAB R2018b [29]; Bouguet's method was performed by using the functions and scripts of the Camera Calibration toolbox for MATLAB [23].

#### *2.1. The Weighted Direct Linear Transformation Algorithm*

Considering the pinhole approximation, the camera model maps the 3D world points to 2D image points through a linear projection operation, which is expressed by the following relation:

$$
\lambda \hat{\mathbf{x}}\_i = \mathbf{P} \mathbf{X}\_i \tag{1}
$$

where $\mathbf{X}\_i = [X\_i, Y\_i, Z\_i, 1]^T$ is the $i$th world point in homogeneous coordinates, $\hat{\mathbf{x}}\_i = [x\_i, y\_i, z\_i]^T$ is a 3D representation of the $i$th projected point, $\mathbf{P} \in \Re^{3\times 4}$ is the projection matrix, and $\lambda \in \Re$ is a free parameter. The parameter $\lambda$ is usually set equal to the reciprocal of $z\_i$, such that $\tilde{\mathbf{x}}\_i = \lambda\hat{\mathbf{x}}\_i = [u\_i\ v\_i\ 1]^T$ becomes the $i$th image point in homogeneous coordinates. Applying the DLT algorithm (see Appendix A), for each correspondence between $\mathbf{X}\_i$ and $\tilde{\mathbf{x}}\_i$, two equations can be written:

$$
\begin{bmatrix}
-\mathbf{X}\_i^T & \mathbf{0}^T & u\_i \mathbf{X}\_i^T \\
\mathbf{0}^T & -\mathbf{X}\_i^T & v\_i \mathbf{X}\_i^T
\end{bmatrix}
\begin{bmatrix}
\mathbf{p}\_1^T \\
\mathbf{p}\_2^T \\
\mathbf{p}\_3^T
\end{bmatrix} = \mathbf{M}\_i \mathbf{p} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{2}
$$

where $\mathbf{p}\_1$, $\mathbf{p}\_2$, and $\mathbf{p}\_3$ are the rows of the projection matrix $\mathbf{P}$, which have to be determined. Writing the equations for $N$ correspondences, the following equation system is obtained:

$$\mathbf{M}\mathbf{p} = \begin{bmatrix} \mathbf{M}\_1 \\ \vdots \\ \mathbf{M}\_N \end{bmatrix} \begin{bmatrix} \mathbf{p}\_1^T \\ \mathbf{p}\_2^T \\ \mathbf{p}\_3^T \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix} \tag{3}$$

Considering that $\mathbf{p}$ is a 12-element array and each correspondence adds two equations, at least six correspondences are needed. Thus, for $N \geq 6$, the LSQ solution of the homogeneous linear system $\mathbf{M}\mathbf{p} = \mathbf{0}$ is the eigenvector of $\mathbf{M}^T\mathbf{M}$ corresponding to the minimum eigenvalue.
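
For concreteness, the construction of Equations (2) and (3) and the LSQ solution can be sketched as follows (a minimal Python/NumPy illustration, not the authors' MATLAB implementation; the names `build_Mi` and `dlt` are illustrative):

```python
import numpy as np

def build_Mi(X, x):
    """Two DLT equations (Equation (2)) for one correspondence.
    X: 3D world point (3,), x: 2D image point (u, v)."""
    Xh = np.append(np.asarray(X, dtype=float), 1.0)   # homogeneous world point
    u, v = x
    zeros = np.zeros(4)
    return np.vstack([
        np.hstack([-Xh, zeros,  u * Xh]),
        np.hstack([zeros, -Xh,  v * Xh]),
    ])

def dlt(world_pts, image_pts):
    """Estimate the 3x4 projection matrix P from N >= 6 correspondences (Equation (3))."""
    M = np.vstack([build_Mi(X, x) for X, x in zip(world_pts, image_pts)])
    # LSQ solution of M p = 0: eigenvector of M^T M with minimum eigenvalue,
    # equivalently the right singular vector of M with the smallest singular value.
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(3, 4)
```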

#### 2.1.1. Weights of the Equations

The proposed approach is based on modeling the error on the image plane as a bivariate Gaussian distribution, as Ferraz et al. [16] did for the P*n*P problem. The covariance matrix $\mathbf{C}\_i$ of the error distribution can be written as follows:

$$\mathbf{C}\_{i} = \begin{bmatrix} \sigma\_{x\_i}^{2} & 0\\ 0 & \sigma\_{y\_i}^{2} \end{bmatrix} \mathbf{R}(\theta\_{i}) \tag{4}$$

where $\mathbf{R}(\theta\_i) \in \Re^{2\times 2}$ is the rotation matrix that rotates by an angle $\theta\_i$, and $\sigma\_{x\_i}$ and $\sigma\_{y\_i}$ are the standard deviations along the principal axes.

The weight matrix is defined by the following equation:

$$\mathbf{W}\_{i} = \begin{bmatrix} 1/\sigma\_{x\_i} & 0\\ 0 & 1/\sigma\_{y\_i} \end{bmatrix} \mathbf{R}(\theta\_{i}) \tag{5}$$

Each equation of (3) is left-multiplied by a weight matrix $\mathbf{W}\_i$ that depends on the covariance matrix of the expected error distribution for the $i$th point. This yields the following equation:

$$\mathbf{\hat{M}p} = \mathbf{W}\mathbf{M}p = \begin{bmatrix} \mathbf{W}\_1\mathbf{M}\_1 \\ \vdots \\ \mathbf{W}\_N\mathbf{M}\_N \end{bmatrix} \begin{bmatrix} \mathbf{p}\_1^T \\ \mathbf{p}\_2^T \\ \mathbf{p}\_3^T \end{bmatrix} = \begin{bmatrix} \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix} \tag{6}$$

where $\mathbf{W}$ is a block diagonal matrix whose diagonal blocks are the $\mathbf{W}\_i$.

The LSQ solution of the homogeneous linear system in Equation (6) is the eigenvector of $\hat{\mathbf{M}}^T\hat{\mathbf{M}}$ corresponding to the minimum eigenvalue. The error to be minimized is weighted by the matrix $\mathbf{W}$, so that, for the $i$th point, the reprojection error along the axis $x\_i$ (or $y\_i$), which is tilted by an angle $\theta\_i$ with respect to the horizontal (or vertical) image axis, is accounted for less if $\sigma\_{x\_i}$ (or $\sigma\_{y\_i}$) is high, and vice versa.

#### 2.1.2. Covariance Matrix Estimation

The error is considered to be a bivariate Gaussian distribution whose mean is the $i$th control point and whose covariance matrix is $\mathbf{C}\_i$. Here, the estimation of $\mathbf{C}\_i$ is done by the manual assessment of the confidence region, i.e., the region within which each point almost surely lies. For a bivariate distribution, this region is an ellipse, fully determined by the lengths $l\_{x\_i}$ and $l\_{y\_i}$ of its two orthogonal principal axes and the direction $\theta\_i$ of the $x\_i$ axis. The standard deviation of the bivariate Gaussian distribution is set to a third of the length of the corresponding axis: $\sigma\_{x\_i} = l\_{x\_i}/3$ and $\sigma\_{y\_i} = l\_{y\_i}/3$. Thus, the covariance matrix is computed using Equation (4). The rotation matrix $\mathbf{R}(\theta\_i)$ aligns the image axes to the axes of the confidence ellipse of the $i$th point. For our purposes, it is not the absolute values of $\sigma\_{x\_i}$ and $\sigma\_{y\_i}$ that are relevant, but their ratio.
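
Continuing the sketch above (and reusing its `build_Mi`), the weighted system of Equation (6) can be assembled from the confidence ellipses as follows; again a hedged Python/NumPy illustration rather than the authors' MATLAB code, with illustrative names.

```python
import numpy as np  # build_Mi is defined in the DLT sketch above

def weight_from_ellipse(lx, ly, theta):
    """W_i (Equation (5)) from the confidence-ellipse axis lengths and rotation;
    sigma is a third of the axis length, as in Section 2.1.2."""
    sx, sy = lx / 3.0, ly / 3.0
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return np.diag([1.0 / sx, 1.0 / sy]) @ R

def wdlt(world_pts, image_pts, ellipses):
    """Weighted DLT: each correspondence contributes the block W_i M_i of Equation (6).
    ellipses: one (lx, ly, theta) tuple per control point."""
    blocks = [weight_from_ellipse(lx, ly, th) @ build_Mi(X, x)
              for X, x, (lx, ly, th) in zip(world_pts, image_pts, ellipses)]
    M_hat = np.vstack(blocks)
    # Eigenvector of M_hat^T M_hat with minimum eigenvalue = last right singular vector.
    _, _, Vt = np.linalg.svd(M_hat)
    return Vt[-1].reshape(3, 4)
```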

#### *2.2. Tests on Real Images*

The following tests aimed to compare the wDLT performance with the standard DLT and Bouguet's method (BOU), highlighting the advantages of using uncertainty information when the number of control points is low. The approaches would reach similar results if the uncertainties of all points were equal (i.e., if the covariance matrix $\mathbf{C}\_i$ were the same scalar matrix for each $i$) or if the number of points were equal to six, which is the minimum needed by both algorithms.

For the test, we used 21 images<sup>1</sup> of a metal parallelepiped (see Figure 1a) whose size was 1 × 3 × 5 cm. The size of the block was accurate up to ±0.01 mm, but its edges and vertices were blunt. The block stood on a table, lying on one of its two 3 × 5 cm faces.

**Figure 1.** The 21 images<sup>1</sup> of the metal block used for testing the DLT, Bouguet's method (BOU), and weighted DLT (wDLT) algorithms. The control points (black dots) are manually selected on the images. They correspond to the seven visible vertices of the parallelepiped. The reprojection error, using DLT to calibrate the camera, is 2.5 ± 1.3 pixels.

The images were recorded varying the point of view with respect to the object. Each image had 3000 × 3000 pixels, and the same seven vertices of the block were visible, while the eighth was always hidden. Vertices were labeled with capital letters, as shown in Figure 1b.

The geometry of the object in the scene was known; thus, a 3D model of the block was generated. The seven visible vertices of the block were chosen as control points, because they were easy to recognize: from the 3D model of the object, the positions of the vertices were exactly known, while on the images, they were set manually. The DLT was applied: the reprojection error was 2.5 ± 1.3 pixels (average ± STD), and the maximum error was 6.9 pixels for a single point. The average and the standard deviation of the reprojection error were first computed among points of the same image, and the results were averaged across different images. The average reprojection error for BOU was 2.5 ± 1 pixels without distortion and 2.2 ± 1.2 with second order radial distortion approximation. Adding more distortion parameters led to an ill conditioned estimation problem and numerical errors. The error of a few pixels on a high resolution image could be considered negligible, and the additional improvement obtained by including radial distortion was very slight. The two tests we performed are described below.
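
The per-image averaging just described can be written compactly; a small sketch (illustrative names), which with the DLT errors reported above would return the 2.5 ± 1.3 pixel figure:

```python
import numpy as np

def summarize_reprojection_errors(errors_per_image):
    """errors_per_image: one 1D array per image with the per-point reprojection errors (pixels).
    Returns (mean of per-image averages, mean of per-image standard deviations)."""
    means = [np.mean(e) for e in errors_per_image]
    stds = [np.std(e) for e in errors_per_image]
    return float(np.mean(means)), float(np.mean(stds))
```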

<sup>1</sup> Photo Credits: *Copyright 2020 Baker Hughes Company. All rights reserved.*

#### 2.2.1. Robustness to Random Error

In order to evaluate the robustness to random errors, a first test was performed: the calibration was carried out with DLT, BOU, and wDLT using perturbed control points. For each image, $n\_E$ control points were randomly chosen, and an error of magnitude $E$ was added to each of them in a random direction. Because of the randomness, the test was repeated 10 times for each image, with $n\_E = \{1, 2, 3\}$ and $E = \{10, 20, 30, 40\}$ pixels. For each point, the standard deviation was set to $\sigma\_{x\_k} = \sigma\_{y\_k} = 1$, except for the perturbed points, where $\sigma\_{x\_p} = \sigma\_{y\_p} = \{5, 8, 12\}$. In this test, the two axes were equal, so their rotation had no influence and the confidence ellipses were circles. An example with $n\_E = 1$ and $\sigma\_{x\_p} = \sigma\_{y\_p} = 8$ is shown in Figure 2: the real positions of the vertices (unperturbed control points) are shown as black circles, and the perturbed ones in blue. For wDLT, the confidence regions are shown as well. The green crosses are the reprojected points for DLT (Figure 2a), BOU (Figure 2b), and wDLT (Figure 2c). If no distortion was estimated, the reprojected points of the BOU algorithm were very close to the DLT ones.
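
The perturbation step can be sketched as follows (a hedged illustration; the random-generator choice and the function name are assumptions, not the authors' code):

```python
import numpy as np

def perturb_control_points(image_pts, n_E, E, rng=None):
    """Add an error of magnitude E (pixels) in a random direction to n_E randomly chosen points."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(image_pts, dtype=float).copy()
    idx = rng.choice(len(pts), size=n_E, replace=False)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_E)
    pts[idx] += E * np.column_stack([np.cos(angles), np.sin(angles)])
    return pts, idx   # perturbed points and the indices that were perturbed
```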

**Figure 2.** Example: the calibration is performed on the seventh image<sup>1</sup> with an error E of 10 and 40 pixels on one control point. The black circles are the unperturbed positions of the control points, while the perturbed ones are shown in blue. In Figure 2c, the confidence region is shown as well: the radius of the circles is 24 pixels, corresponding to σ = 8. Here, wDLT can recover an error of 40 pixels on a single perturbed point, while DLT and BOU cannot.

For each image, the average, the standard deviation, and the maximum of the reprojection error were computed. The reprojection error was the Euclidean distance on the image plane between the reprojections of the seven visible vertices and the unperturbed control points.

#### 2.2.2. Occluded Vertices Scenario

The objective of the second test was to simulate a manual choice of the control points when reference points of the objects in the scene (such as corners) are occluded in the image. We performed the calibration without using some of the vertices' positions in the images. Instead of using control points corresponding to the occluded vertices, the median points of the block edges were used in addition. The world points corresponding to the medians were exactly known, while the corresponding points on the image were not. In order to have comparable results, each additional control point on the images was chosen systematically, using the following equation:

$$
\begin{bmatrix} u\_{XY} \\ v\_{XY} \end{bmatrix} = \begin{bmatrix} u\_X \\ v\_X \end{bmatrix} + m \begin{bmatrix} u\_Y - u\_X \\ v\_Y - v\_X \end{bmatrix} \tag{7}
$$

where $[u\_{XY}, v\_{XY}]$ are the coordinates of the point along the edge $XY$, $[u\_X, v\_X]$ and $[u\_Y, v\_Y]$ are the coordinates of the two vertices $X$ and $Y$ of the edge, and $m \in [0, 1]$ is a dimensionless parameter. The coordinates of the control point $[u\_{XY}, v\_{XY}]$ vary with $m$ between the two vertices along the edge $XY$.
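
For illustration, Equation (7) is simply a linear interpolation between the two vertex pixels; a tiny sketch with made-up (hypothetical) vertex coordinates:

```python
import numpy as np

def edge_point(vX, vY, m):
    """Additional control point along edge XY (Equation (7)), m in [0, 1]."""
    vX, vY = np.asarray(vX, dtype=float), np.asarray(vY, dtype=float)
    return vX + m * (vY - vX)

# Hypothetical vertex pixel coordinates, only to show the three m values used in the tests.
for m in (0.45, 0.50, 0.55):
    print(m, edge_point([1200.0, 800.0], [1600.0, 1400.0], m))
```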

Two scenarios were simulated:

– Scenario 1, with two occluded vertices;

– Scenario 2, with three occluded vertices (see Figures 3 and 4).
For each scenario and for each image, three sets of control points with *m* = {0.45, 0.50, 0.55} were computed. Figure 3 shows, for both scenarios, the control points on the vertices in black, the occluded vertices in red, and in blue, the additional control points along the edge for *m* = {0.45, 0.50, 0.55}. The reprojection error was the Euclidean distance between the real vertices and the reprojected vertices (i.e., without considering the error on the additional control points).

**Figure 3.** Scenarios with two and three occluded vertices: the black dots are the visible vertices, the red dots the occluded vertices (not used in calibration), and blue dots the additional points along the edges. The figures show the additional control points chosen along the edges with *m* = {0.45, 0.50, 0.55}.

wDLT was tested with different values of uncertainty. For control points that were on vertices, the axis lengths of the confidence ellipses were the same ($l\_{x\_i} = l\_{y\_i} = 1$). Instead, for control points that were not on vertices but on edges, the axes of the confidence ellipses were set equal to one in the direction perpendicular to the edge ($l\_{y\_i} = 1$) and equal to 3, 15, and $10^6$ in the direction parallel to the edge ($l\_{x\_i} = \{3, 15, 10^6\}$). We chose these values in order to evaluate the sensitivity of wDLT, with different ellipse eccentricities, to the estimation of the uncertainty $\sigma$.

Figure 4 shows, for the second scenario and for different ratios $l\_{x\_i}/l\_{y\_i}$, the control points, the occluded vertices, and the confidence regions of the additional control points. As an example, the reprojected control points and reprojected vertices are also shown. For the sake of a clear representation, all the confidence ellipses are drawn 50 times larger than their real size.

**Figure 4.** Scenario 2 with three occluded points and m = 0.50: the black circles are the visible vertices, the red ones the occluded vertices (not used in calibration), and the blue ellipses the confidence ellipses of the additional control points on the edges. Each ellipse is drawn 50 times larger than its real size to make it clearly visible. The green pluses and green crosses are, respectively, the reprojected control points and the reprojected occluded vertices. While DLT and BOU reach a good alignment of the control points (black and blue circles), the reprojection error on the occluded vertices is high. Instead, wDLT reaches a better calibration because the error on the reprojection of the occluded vertices is low.

#### **3. Results**

#### *3.1. Robustness to Error*

The first test aimed to evaluate the robustness of the three algorithms (DLT, BOU, and wDLT); thus, the reprojection error was calculated with control points perturbed by random errors and with different values of uncertainty for wDLT. We varied both the magnitude of the error E and the number of points $n\_E$ affected by it. BOU calibration was performed without distortion estimation, because not all the random configurations of control points led to a solution; this was due to the estimation problem becoming ill-conditioned when the data did not contain enough information.

The reprojection error for random perturbations of control points (see Section 2.2.1) is summarized in the charts in Figure 5: vertical bars show the mean absolute error (MAE); the standard deviation (STD) is indicated by the black vertical lines, which show the ±1*σ* range; and red squares are the maximum absolute errors (MaxErr). The whole distributions of the error are shown in Appendix B.

**Figure 5.** Reprojection error of the calibration with DLT, BOU, and wDLT with $n\_E$ perturbed control points and error E; bars represent the mean absolute error (MAE), the black lines the average STD of the errors, and the red squares the average of the maximum error.

As expected, the DLT and BOU performance decreases with a high error E and number of perturbed points $n\_E$ (see Figure 5a). While the DLT and BOU algorithms rely only on the positions of the control points, wDLT also uses the uncertainty of the points' positions. Because a higher uncertainty was set for the perturbed points, wDLT assigned less weight to the information about their positions and relied more on the positions of the others. In this way, the average error for each E and $n\_E$ was lower for wDLT. Moreover, with wDLT, the MAE did not vary significantly when the uncertainty values $\sigma\_{x\_p}$ and $\sigma\_{y\_p}$ of the perturbed control points changed.

With only one of the seven control points perturbed ($n\_E = 1$), the error was almost fully recovered, and the performance was similar to that of DLT with no perturbed points. Figure 2c shows an example where, with wDLT, the reprojected point was almost unaffected by the position of the perturbed control point and was near the real vertex. When more than one of the seven points was perturbed, the error was no longer recovered. The reason is that the algorithms need at least six control points: with $n\_E = 1$ and high uncertainty, wDLT can rely only on the remaining six low-uncertainty points, whereas with $n\_E > 1$, fewer than six points are reliable, so wDLT also needs the information of the perturbed points. However, wDLT still assigns more weight to the points with low uncertainty, and the average error decreases as well.

It is worth noting that when the perturbation was small (E = 10), the maximum error did not improve with the exploitation of the uncertainty by wDLT. The reason was that the value of E was comparable with the uncertainty of the unperturbed points; thus, the best performance was achieved by setting in wDLT an equal uncertainty for all the points, which was equivalent to using DLT or BOU.

Summarizing: if the estimated uncertainty σ was too low for a subset of control points, wDLT would take the high-error points more into account, and the MAE would increase because of the wrong information in the input; if the uncertainty was too high, the algorithm would not consider important information from those points, and the MAE would increase as well. Therefore, in order to obtain a better MAE than DLT and BOU, it was enough to assign the higher uncertainty values to the points that were actually more uncertain, regardless of the exact ratio between the uncertainties.

#### *3.2. Occluded Vertices Scenario*

The results of the second test are shown in the charts in Figure 6. As for the first test, the BOU estimation of distortion parameters was not feasible; thus, they were not estimated. In both scenarios (with two or three occluded points), the algorithms showed the same behaviors, but the errors with three occluded points were always higher than the ones with only two occluded points, as expected.

**Figure 6.** Reprojection error of the calibration with DLT, BOU, and wDLT with different settings of the uncertainty *σ* for points along the edges. The bars represent the average error, the black lines the average STD of the errors, and the red squares the average of the maximum error.

The results with *m* = 0.45 were better than the others because they corresponded to the points that were nearer to the real ones. Indeed, points with *m* = 0.45 were chosen along the edge BD and nearer to D than to B, along the edge DE and nearer to E than to D, and along EF and nearer to E than to F; due to the perspective of the image, these points were nearer to the median of the edges, which was the point that was considered in the 3D model (see Section 2.2.2). For the same reason, the worst results were obtained with *m* = 0.55. Figure 4 shows the reprojected vertices for the seventh image: both DLT and BOU minimized the error on control points (black and blue circles), but because blue circles had position errors, the alignment was wrong, which was clear by looking at the reprojection of occluded vertices. For wDLT, the confidence ellipses allowed the reprojected points to slide more along their major axes; thus, the results were better because the high variance along the edges produced a better estimation of the positions of the occluded vertices.

With a very high uncertainty set along the edge ($\sigma = 10^6$), wDLT reached a similar MAE regardless of the position of the point along the edge (i.e., of the value of m in our parametrization of its position). The reason is that, with a very high σ along the direction of the edge, the resulting system of Equation (6) constrained the median points to lie on the line of each edge but left them free to move along it. Thus, wDLT did not take into account whether the positions of those points slid along the edges.

#### **4. Discussion**

The proposed method was a weighted-DLT-based algorithm, suitable when the uncertainty areas of the control points on the image are known or can be estimated. It did not require knowledge of the exact values of uncertainty, because a gross estimation still improved the mean absolute error. To the best of our knowledge, no one has yet presented a calibration algorithm that exploits the anisotropic uncertainty of the control points by setting a weight matrix according to the lengths and rotation of the axes of the uncertainty ellipses.

wDLT was also useful when there was no calibration pattern in the scene and the choice of the control points was manual. In this kind of situation, there were only a few control points: some of them corresponded to corners, and their uncertainty was low; others could be on edges, and their uncertainty was low only in the direction perpendicular to the edge; other points could be assigned in correspondence to a wide area, and their uncertainty was higher. If there were only a few corners, it was necessary to exploit as much information as possible about the points.

For infrared thermography in gas turbines, the image of the rotor blade was only partial and had low resolution. Many defects and artifacts were present on the image, and manual calibration was necessary. The lack of corners, reference points (the shape of the blade was smooth and curvy), and clear edges made the choice of control points very hard. By allowing the use of uncertainty ellipses (instead of single points), this step became easier and the calibration more accurate.

Some works presented different calibration algorithms that could cope with outliers, but they required many points to be extracted. Tens of points are usually given from automatic feature extraction algorithms, and usually, many outliers are present. The presented results showed the advantages of using uncertainty information in the calibration problem. In the presented cases, the obtained error was lower compared to the DLT algorithm and Bouguet's method.

The cost of this improvement was the knowledge of the uncertainty. In the first test, we assumed that the points with higher error were known. A large improvement was reached only with $n\_E = 1$, but almost the same result could have been obtained by applying DLT to the six points without errors (which we assumed were known). For $n\_E > 1$, there were only minor improvements.

The second test showed that the wDLT algorithm was useful when there were partially occluded objects with few visible corners. Introducing anisotropy in the uncertainty area of the edge points strongly improved the estimation of the positions of the hidden vertices.

The main advantage of this algorithm was that it offered the possibility of including anisotropic uncertainty information in the calibration, improving the solution. Even though this method did not provide a precise way to estimate the absolute uncertainty of the individual points, when these values were manually set, a strong improvement was observed as soon as the relative differences between uncertainties were assigned.

Infrared camera calibration for gas turbine thermography by using single control points and standard DLT led to gross and unacceptable errors in camera parameters, whereas by setting uncertainty areas, instead of control points, the camera could be properly calibrated by using weighted DLT.

#### **5. Conclusions**

We presented a method to set the weight matrix in a weighted Direct Linear Transformation (wDLT) algorithm, in order to take into account the anisotropic uncertainties of the positions of individual control points. Instead of control points, we proposed to use uncertainty ellipses with different axis lengths and angles. This feature became important when few control points were available. To demonstrate the effectiveness of the algorithms, we performed two tests: the first, to evaluate the robustness to random error, and the second, to simulate a scenario with few control points. We showed that wDLT with the uncertainty performed better (lower MAE) in both tests. In particular, in the second test, the calibration failed both with DLT and Bouguet's method when several vertices were hidden, while wDLT still provided a successful calibration in these cases.

A suitable application of wDLT is in infrared thermography in gas turbines, where only a few control points can be chosen accurately, and then, manual selection of them is required. Accuracy in infrared camera calibration allows associating the correct temperature with the 3D model of the object in the field of view.

**Author Contributions:** Conceptualization, methodology, software, validation, formal analysis, investigation, and draft preparation, F.B.; manuscript review and editing, C.J.O. and M.M.; supervision, C.J.O. and M.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This work was done in collaboration with Baker Hughes, through the Scuola Sant'Anna—Baker Hughes Joint Lab "Advanced Sensors for Turbomachinery".

**Conflicts of Interest:** The authors F.B. and C.J.O. declare no conflict of interest. The author M.M. declares that he works for Baker Hughes.

#### **Appendix A. Direct Linear Transformation**

The Direct Linear Transformation (DLT) algorithm aims to solve the problem of determining the pinhole camera parameters from at least six correspondences between 2D image points and 3D world points. A camera model maps each point of the 3D world to a point of the 2D image through a projection operation. The pinhole camera model makes the assumption that the aperture size of the camera is small, so that it can be considered as a point. Thus, the ray of light has to pass across a single point, the camera center, and there are no lenses, no distortion, and an infinite depth of field.

The pinhole camera model is fully represented by a linear projection, which is expressed by the following equation:

$$
\lambda \hat{\mathbf{x}}\_i = \mathbf{P} \mathbf{X}\_i \tag{A1}
$$

where $\mathbf{X}\_i = [X\_i, Y\_i, Z\_i, 1]^T$ is the $i$th world point in homogeneous coordinates, $\hat{\mathbf{x}}\_i = [x\_i, y\_i, z\_i]^T$ is a 3D representation of the $i$th projected point, $\mathbf{P} = [p\_{ij}] \in \Re^{3\times 4}$ is the projection matrix, or camera matrix, and $\lambda \in \Re$ is a free parameter. The parameter $\lambda$ is usually set equal to the reciprocal of $z\_i$, such that:

$$\tilde{\mathbf{x}}\_{i} = \lambda \hat{\mathbf{x}}\_{i} = \frac{1}{z\_{i}} \hat{\mathbf{x}}\_{i} = \begin{bmatrix} \frac{x\_{i}}{z\_{i}} & \frac{y\_{i}}{z\_{i}} & 1 \end{bmatrix}^{T} = \begin{bmatrix} u\_{i} & v\_{i} & 1 \end{bmatrix}^{T} = \begin{bmatrix} \mathbf{x}\_{i}^{T} & 1 \end{bmatrix}^{T} \tag{A2}$$

where $\tilde{\mathbf{x}}\_i$ is the $i$th projected point $\mathbf{x}\_i$ in the image plane in homogeneous coordinates.

Combining Equations (A1) and (A2), the coordinates of the image point $\mathbf{x}\_i$ are expressed by the following relationship:

$$\mathbf{x}\_{i} = \begin{bmatrix} u\_{i} \\ v\_{i} \end{bmatrix} = \begin{bmatrix} x\_{i}/z\_{i} \\ y\_{i}/z\_{i} \end{bmatrix} = \begin{bmatrix} \mathbf{p}\_{1}\mathbf{X}\_{i}/\mathbf{p}\_{3}\mathbf{X}\_{i} \\ \mathbf{p}\_{2}\mathbf{X}\_{i}/\mathbf{p}\_{3}\mathbf{X}\_{i} \end{bmatrix} \tag{A3}$$

where $\mathbf{p}\_1$, $\mathbf{p}\_2$, and $\mathbf{p}\_3$ are the rows of the projection matrix $\mathbf{P}$. This relationship is not linear in the unknown parameters $\mathbf{p}\_1$, $\mathbf{p}\_2$, and $\mathbf{p}\_3$.
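
A short numerical illustration of Equations (A1) and (A3), using a made-up projection matrix (illustrative values only):

```python
import numpy as np

P = np.array([[800.0,   0.0, 640.0, 0.0],   # made-up pinhole projection matrix
              [  0.0, 800.0, 360.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])
X = np.array([0.5, -0.2, 4.0, 1.0])          # world point in homogeneous coordinates
x_hat = P @ X                                 # Equation (A1)
u, v = x_hat[:2] / x_hat[2]                   # Equation (A3): divide by p3 X
print(u, v)                                   # -> 740.0 320.0
```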

Through DLT, the nonlinear problem becomes linear in the unknown parameters, rearranging Equation (A3) as follows:

$$
\begin{bmatrix} u\_i \mathbf{p}\_3 \mathbf{X}\_i \\ v\_i \mathbf{p}\_3 \mathbf{X}\_i \end{bmatrix} = \begin{bmatrix} \mathbf{p}\_1 \mathbf{X}\_i \\ \mathbf{p}\_2 \mathbf{X}\_i \end{bmatrix}
\Rightarrow \begin{bmatrix} -\mathbf{X}\_i^T \mathbf{p}\_1^T + \mathbf{0}^T \mathbf{p}\_2^T + u\_i \mathbf{X}\_i^T \mathbf{p}\_3^T \\ \mathbf{0}^T \mathbf{p}\_1^T - \mathbf{X}\_i^T \mathbf{p}\_2^T + v\_i \mathbf{X}\_i^T \mathbf{p}\_3^T \end{bmatrix}
= \begin{bmatrix} -\mathbf{X}\_i^T & \mathbf{0}^T & u\_i \mathbf{X}\_i^T \\ \mathbf{0}^T & -\mathbf{X}\_i^T & v\_i \mathbf{X}\_i^T \end{bmatrix} \begin{bmatrix} \mathbf{p}\_1^T \\ \mathbf{p}\_2^T \\ \mathbf{p}\_3^T \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{A4}
$$

These homogeneous equations can be solved linearly. Combining the two equations for each point $i$, the homogeneous linear system shown in Equation (3) is obtained.

#### **Appendix B. Error Distribution for Random Perturbation of Control Points**

In this section, the error distributions for the test described in Section 2.2.1 are shown. Figure A1a,b show the error distribution using the DLT algorithm and Bouguet's method, while Figure A1c–e show the distributions for the wDLT algorithm with different σ settings for the perturbed points. Each bar chart represents the error distribution for a different $n\_E$ and E.

**Figure A1.** Distribution of the reprojection error of the calibration with DLT, BOU, and wDLT with $n\_E$ perturbed control points and error E.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

*Article*

## **A Self-Assembly Portable Mobile Mapping System for Archeological Reconstruction Based on VSLAM-Photogrammetric Algorithm**

## **Pedro Ortiz-Coder \*,† and Alonso Sánchez-Ríos \*,†**

Department of Graphic Expression, University Centre of Mérida, University of Extremadura, 06800 Mérida, Spain

**\*** Correspondence: peorco04@alumnos.unex.es (P.O.-C.); schezrio@unex.es (A.S.-R.)

† These authors contributed equally to this work.

Received: 19 July 2019; Accepted: 9 September 2019; Published: 12 September 2019

**Abstract:** Three-dimensional (3D) models are widely used in clinical applications, geosciences, cultural heritage preservation, and engineering; this, together with new emerging needs such as building information modeling (BIM), drives the development of new data capture techniques and devices with a low cost and a reduced learning curve that allow non-specialized users to employ them. This paper presents a simple, self-assembly device for 3D point cloud data capture with an estimated base price under €2500; furthermore, a workflow for the calculations is described that includes a Visual SLAM-photogrammetric threaded algorithm implemented in C++. Another purpose of this work is to validate the proposed system in BIM working environments. To achieve this, in outdoor tests, several 3D point clouds were obtained and the coordinates of 40 points were measured by means of this device, with data capture distances ranging between 5 and 20 m. Subsequently, these coordinates were compared to the coordinates of the same targets measured by a total station. The Euclidean average distance errors and root mean square errors (RMSEs) ranged between 12–46 mm and 8–33 mm, respectively, depending on the data capture distance (5–20 m). Furthermore, the proposed system was compared with a commonly used photogrammetric methodology based on Agisoft Metashape software. The results obtained demonstrate that the proposed system satisfies (in each case) the tolerances of 'level 1' (51 mm) and 'level 2' (13 mm) for point cloud acquisition in urban design and historic documentation, according to the BIM Guide for 3D Imaging (U.S. General Services).

**Keywords:** self-assembly device; 3D point clouds; accuracy analysis; VSLAM-photogrammetric algorithm; portable mobile mapping system; low-cost device; BIM

#### **1. Introduction**

The tridimensional modeling of an object starts with its original design or with the process of acquiring the data necessary for its geometric reconstruction. In both cases, the result is a 3D virtual model that can be visualized and analyzed interactively on a computer [1,2]. In many cases, the process continues with the materialization of the model in the form of a prototype, which serves as a sample of what will be the final product, allowing us to check if its design is correct, thus changing the traditional manufacturing or construction industry [3–5].

The applications of 3D models (virtual or prototype) are numerous and widely used; they are usually used in the scope of clinical applications [6,7], geosciences [8–13], cultural heritage preservation [14,15] and engineering [16].

In this context, to address this wide variety of application areas, both the data capture techniques and devices and the specific software for data processing and management tend to be simplified, so that they are accessible to the greatest number of users, even those with limited knowledge of 3D measurement technologies.

In this sense, the classical methods of photogrammetry are combined with new techniques and procedures usually adopted in other areas [17], such as visual odometry (VO), simultaneous localization and mapping (SLAM), and visual SLAM (VSLAM). These are normally used to solve localization and mapping problems in robotics and autonomous systems [18–21], but the combination of photogrammetry techniques with methodologies based on instruments like terrestrial or aerial laser scanners has also obtained successful results [22,23].

These combined methods provide support and analytical robustness for the development of low/middle-cost capture systems, usually based on tablets or mobile devices that incorporate inertial sensors, absolute positioning, and low-cost cameras, which can achieve medium 3D positional accuracy in compliance with the technical requirements of a wide range of applications, at a low cost and with a reduced learning curve [24–28]. As a result, handheld mobile mapping systems have appeared in recent years, using different technologies to perform 3D reconstructions through fully automated processes [27,28]. Among them are systems based exclusively on images, which require a fully automated process that takes into account the usual technical constraints of photogrammetry and the free user movements during data capture [17]. In this field, different lines of research have been developed, depending on whether the final result is obtained in real time [29,30] or not. In the first case, the reduction in the time needed for data processing is the most important factor in the approach to the research objectives (even at the expense of a reduction in metric accuracy); in the second, however, metric accuracy is the most important factor, although the temporal cost is higher [17,31,32].

There are many commercial mobile mapping systems for urban, architectural, or archaeological applications with highly accurate results [33]. Those systems are based on the integration of different sensors such as inertial measurement units (IMUs), line scanners [28], cameras, Global Navigation Satellite System (GNSS) receivers [34], odometers, and others. The price and complexity of those systems are normally high [35].

The classical applications require a known level of data accuracy and quality; however, the emerging needs of Industry 4.0, building information modeling (BIM), and digital transformation, together with the appearance of new devices and information processing techniques, pose new challenges and research opportunities in this field. Each capture method has its advantages and drawbacks, offering a particular level of quality in its results; in this sense, numerous investigations have linked these parameters, allowing users to choose the most cost-effective approach [35], for example by evaluating the laser scanner and vision-based reconstruction among different solutions for progress-monitoring inspection in construction and concluding (among other findings) that both are appropriate for spatial data capture. Other works [36] assess 3D sensing technologies, including photo/video-grammetry, laser scanning, and range images, for detailed content assessment (low, medium, or high) in BIM working environments, or [37] compare photo/video-grammetry capture techniques with laser scanning, considering aspects such as accuracy, quality, time efficiency, and the cost needed for collecting data on site. The combination of data capture methods has also been traditionally analyzed; thus, [38] presents a combined laser scanning/photogrammetry approach to optimize data collection, cutting around 75% of the time required to scan the construction site.

A common aspect, taken into account in most of the research, is the evaluation of point cloud accuracy, which has been addressed in three different ways [39]: (a) by defining levels of quality parameters defined in national standards and guidelines from countries such as the United States, Canada, the United Kingdom, and the Scandinavian countries, which lead BIM implementation in the world [40], such as the U.S. General Services Administration (GSA) BIM Guide for 3D Imaging [41], which sets the quality requirements of point clouds in terms of their level of accuracy (LOA) and level of detail (LOD); (b) by evaluating the quality parameters of a point cloud in terms of accuracy and completeness [37]; or (c) by following three quality criteria: reference system accuracy, positional accuracy, and completeness [42].

Furthermore, in the specific environment of 3D indoor models, [43] proposes a method that provides suitable criteria for the quantitative evaluation of geometric quality in terms of completeness, correctness, and accuracy, by defining parameters to optimize a scanning plan so as to minimize data collection time while ensuring that the desired level of quality is satisfied; in some cases, this is done with the implementation of an analytical sensor model that uses a "divide and conquer" strategy based on segmentation of the scene [44], or one that captures the relationships between data collection parameters and data quality metrics [45]. In other cases, the influence of the scan geometry is considered in order to optimize measurement setups [46], or different known methods for obtaining accurate 3D models are compared, as in the work of [47], in the context of cultural heritage documentation.

This paper extends past work on classical photogrammetry solutions, adopting an approach for outdoor environments based on a simple, hand-held, self-assembly, image-based capture device that consists of two cameras: the data of the first camera are used to calculate, in real time, the path followed by the device using a VSLAM algorithm, while the second camera records a high-resolution video that is used for the scene reconstruction with photogrammetric techniques. Finally, after simple data collection and fully automated processing, a 3D point cloud with associated color is obtained.

To determine the effectiveness of the proposed system, we evaluated it at one study site, outdoors, on the facades of the Roman Aqueduct of Miracles, in terms of the requirements laid down in the GSA BIM Guide for 3D Imaging. In this experiment, we obtained 3D point clouds under different data capture conditions, which varied according to the distance between the device and the monument; the measurements acquired by a total station served to compare the coordinates of fixed points in both systems and, therefore, to determine the LOA of each point cloud. The results obtained, with root mean square errors (RMSEs) between 8 and 33 mm, confirm the feasibility of the proposed system for urban design and historic documentation projects, in the context of the allowable dimensional deviations in BIM and CAD deliverables.

This paper is divided into four sections. Following the Introduction, the portable mobile mapping system is described, including the proposed algorithm schema for the computations; then, a case study in which the system is applied is described in Section 2. The results are presented in Section 3 and, finally, the conclusions are presented in Section 4.

#### **2. Materials and Methods**

This study was conducted with a simple, self-assembly prototype specifically built for data capture (Figure 1), which consists of two cameras from the Imaging Source Europe GmbH company (Bremen, Germany): camera A (model DFK 42AUC03) and camera B (model DFK 33UX264), fixed to a platform with their optical axes parallel. Each camera incorporated a lens: for camera A, the model was the TIS-TBL 2.1 C, from the Imaging Source Europe GmbH company, and for camera B, the Fujinon HF6XA–5M, from FUJIFILM Corporation (Tokyo, Japan). The technical characteristics of the cameras and lenses appear in Tables 1 and 2, respectively. Both cameras were connected to a laptop (with an Intel Core i7 7700HQ CPU and 16 GB of RAM, running Windows 10 Home), via USB 2.0 (camera A) and USB 3.0 (camera B). This beta version of the prototype had an estimated base price under €2500.

**Figure 1.** (**a**) 3D printing process of the prototype case; (**b**) cameras A and B with their placement inside the case; and (**c**) the final portable mobile mapping system prototype.

**Table 1.** Main technical characteristics of cameras used in the prototype (from the Imaging Source Europe GmbH company).



**Table 2.** Main technical data of lenses used in the prototype (from the Imaging Source Europe GmbH company and FUJIFILM Corporation).

The calibration process of the cameras was carried out with a checkerboard target (60 cm × 60 cm) using a complete single-camera calibration method [48], which provided the main internal calibration parameters: the focal length, the radial and tangential distortions, the optical center coordinates, and the camera axis skews. In addition, to determine the parameters relating the position of one camera to that of the other, we designed the following practical test: we placed 15 targets, used as ground control points, on two perpendicular walls and measured the coordinates of each target with a TOPCON robotic total station, with an accuracy of 1" when measuring angles (ISO 17123-3:2001) and 1.5 mm + 2 ppm when measuring distances (ISO 17123-4:2001). After running the observations with the prototype, we used a seven-parameter transformation over the 15 targets to determine the relative position of one camera with respect to the other [17].
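
As a hedged illustration of what estimating a seven-parameter (scale, rotation, translation) transformation between two sets of corresponding 3D points can look like, the sketch below uses the standard Umeyama/Horn closed form; this is not necessarily the exact procedure the authors used, and the function and variable names are illustrative.

```python
import numpy as np

def similarity_transform(src, dst):
    """Closed-form estimate of a 7-parameter transform (scale s, rotation R, translation t)
    such that dst ~ s * R @ src + t, from corresponding 3D points (Umeyama/Horn)."""
    src, dst = np.asarray(src, dtype=float), np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    H = src_c.T @ dst_c / len(src)                 # cross-covariance (transposed)
    U, S, Vt = np.linalg.svd(H)
    D = np.eye(3)
    if np.linalg.det(Vt.T @ U.T) < 0:              # guard against an improper rotation
        D[2, 2] = -1.0
    R = Vt.T @ D @ U.T
    s = np.trace(np.diag(S) @ D) / src_c.var(axis=0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Usage on the 15 targets (placeholder names for the two coordinate sets):
# s, R, t = similarity_transform(targets_in_camera_A_frame, targets_in_camera_B_frame)
```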

Camera A and camera B had different configuration parameters defining image properties such as brightness, gain, and exposure, among others. In order to automate the capture procedure, the automatic parameter options were chosen. In this way, the data collection was automatic and the user did not need to follow special rules, since the system accepted convergent or divergent turns of the camera, stops, or changes in speed. The algorithm processed all these data properly using the proposed methodology.

During the capture (Figure 2), the user needed to watch the VSLAM tracking on the computer screen in real time. In this way, the user could verify that no movement was too fast and that no object interrupted the camera view, either of which would prevent the tracking from continuing. In such a case, the user had to return to a known place and continue the tracking from that point.

**Figure 2.** In-field data capture. The white arrow indicates the user's direction of movement, parallel to the monument façade, followed during this test.

#### *Workflow of the Proposed Algorithm for the Computation*

The application of the VSLAM technique on a lightweight device, normally with limited computing capabilities, requires the implementation of a VSLAM algorithm with a low computational cost to achieve effective results. The technical literature provides a framework that consists of the following basic modules: the initialization module, to define a global coordinate system, and the tracking and mapping modules, to continuously estimate camera poses. In addition, two further modules are used for a more reliable and accurate result: the re-localization module, which is used when, due to fast device motion or some disruption in the data capture, the camera pose must be computed again, and the global map optimization module, which estimates and removes the accumulative errors in the map produced during camera movements.

The characteristics of the VSLAM-photogrammetric algorithm, including its identified strong and weak points, depend on the methodology used for each module, which sets its advantages and limitations. In our case, we proposed the following sequential workflow (Figure 3), divided into four threaded processes, which have been implemented in C++.

**Figure 3.** General workflow of the algorithm implemented in C++ for the computations.

Basically, the four processes consisted of the following: (I) a VSLAM algorithm to estimate both motion and structure, applied to the frames obtained from camera A; (II) a selection and filtering process of the frames obtained with camera B; (III) the application of an image segmentation algorithm; and finally, (IV) a classical photogrammetric process applied to obtain the 3D point cloud. Each process is explained in more detail below.

The first process (I) started with the simultaneous acquisition of videos with cameras A and B, at 25 FPS and 4 FPS, respectively. On the frames from camera A, an ORB descriptor [49] was used for object recognition, detection, and matching. This descriptor builds on the FAST key-point detector and the BRIEF descriptor, offering good performance at low cost, and was therefore appropriate for our case. An ORB-SLAM algorithm was then applied to estimate the camera positioning and compute the trajectory [50]; this is an accurate monocular SLAM system that works in real time, can be applied in indoor and outdoor scenarios, and has modules for loop closure detection, re-localization (to recover from situations where the system becomes lost), and fully automatic initialization, taking into account the calibration parameters of the camera. Accordingly, our process was carried out in three steps, as follows [50]. The first step was tracking, which calculated the camera position for each frame, selected keyframes, and decided which frames were added to the list. The second was local mapping, which optimized the keyframes, incorporating those being taken and removing redundant ones; with these data, through a local bundle adjustment, it was possible both to increase the quality of the final map and to reduce the computational complexity of the running processes and of the subsequent steps. The third was loop closing, which searched each new keyframe for redundant areas the camera had already passed; the similarity transformation of the drift accumulated in the loop was calculated, the two ends of the loop were aligned [50], the duplicate points were merged, and the trajectory was recalculated and optimized to achieve overall consistency. The result of this process was a text file with UNIX time stamps and the camera poses of the selected keyframes.
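As an illustration of the descriptor stage only, the sketch below extracts ORB features from two consecutive camera-A frames and matches them with a brute-force Hamming matcher using OpenCV; it is a minimal example under the assumption that OpenCV is available, not a reproduction of the full ORB-SLAM tracking, local mapping, and loop-closing pipeline, and the frame file names are placeholders.

```cpp
// Minimal sketch of the ORB detection/matching step on camera-A frames,
// based on OpenCV. File names are placeholders.
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
    cv::Mat prev = cv::imread("frame_000.png", cv::IMREAD_GRAYSCALE);
    cv::Mat curr = cv::imread("frame_001.png", cv::IMREAD_GRAYSCALE);
    if (prev.empty() || curr.empty()) { std::cerr << "missing frames\n"; return 1; }

    // ORB = FAST key points + rotated BRIEF descriptors (binary, cheap to match).
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);
    std::vector<cv::KeyPoint> kpPrev, kpCurr;
    cv::Mat descPrev, descCurr;
    orb->detectAndCompute(prev, cv::noArray(), kpPrev, descPrev);
    orb->detectAndCompute(curr, cv::noArray(), kpCurr, descCurr);

    // Brute-force Hamming matching with cross-check, usual for binary descriptors.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(descPrev, descCurr, matches);
    std::cout << "tentative matches between consecutive frames: " << matches.size() << "\n";
    return 0;
}
```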

The above information, together with the frames recorded by camera B, was used to start the second process (II), in which the images obtained by camera B were selected and filtered; this consisted of directly deleting images whose baseline was very small and which therefore made it difficult to compute an optimal relative orientation [17,51,52]. The filtering was performed in three consecutive steps. The first was a filter based on keyframe coincidence, which consisted of keeping a number β of frames (in our case β = 2) from camera B between each two consecutive keyframes from camera A, while the remaining frames were removed; to run this filter, the cameras had to be synchronized by UNIX time. The second step applied the so-called AntiStop filter, which removed frames obtained when the camera was static, or moving so little that it recorded images of the same zone; we describe such frames as redundant, and they should therefore be eliminated. To determine the redundant frames, it was assumed that cameras A and B were synchronized and that the coordinates of the projection center of each frame, computed in (I), were known. We then calculated the distances between the projection centers of every two consecutive keyframes *i* and *j* (*Dij*), the mean value of all the distances between consecutive keyframes (*Dm*), and the minimum distance (*Dmin*) separating a stopped device from a moving one, given by the expression:

$$D_{\min} = D_m \cdot p,$$

where *p* is a parameter that depends on the data capture conditions (in our case, after performing several tests, we set *p* = 0.7). Finally, the keyframes taken by camera B for which the distance between the projection centers of two consecutive keyframes was less than the minimum distance (*Dij* < *Dmin*) were removed.
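The following sketch is one possible reading of the AntiStop filter: each camera-B keyframe is compared with the previously kept keyframe and discarded when the separation between their projection centers is below *Dmin* = *Dm*·*p*. The comparison against the last kept keyframe (rather than strictly consecutive pairs) and the data layout are assumptions made for illustration.

```cpp
// Minimal sketch of the AntiStop filter: given the projection centres of the
// camera-B keyframes (already expressed in the global frame computed in (I)),
// drop a keyframe when the distance to the previously kept keyframe falls
// below D_min = D_m * p. Variable names and values are illustrative.
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

using Vec3 = std::array<double, 3>;

double dist(const Vec3& a, const Vec3& b) {
    return std::sqrt((a[0]-b[0])*(a[0]-b[0]) + (a[1]-b[1])*(a[1]-b[1]) + (a[2]-b[2])*(a[2]-b[2]));
}

std::vector<std::size_t> antiStopFilter(const std::vector<Vec3>& centres, double p) {
    std::vector<std::size_t> kept;
    if (centres.size() < 2) {
        for (std::size_t i = 0; i < centres.size(); ++i) kept.push_back(i);
        return kept;
    }
    // Mean distance D_m between consecutive keyframes.
    double dm = 0.0;
    for (std::size_t i = 1; i < centres.size(); ++i) dm += dist(centres[i-1], centres[i]);
    dm /= static_cast<double>(centres.size() - 1);
    const double dmin = dm * p;  // D_min = D_m * p (p = 0.7 in the paper's tests)

    kept.push_back(0);
    for (std::size_t i = 1; i < centres.size(); ++i)
        if (dist(centres[kept.back()], centres[i]) >= dmin) kept.push_back(i);
    return kept;
}

int main() {
    // Synthetic projection centres (metres); the third keyframe is nearly static.
    std::vector<Vec3> centres{{0,0,0}, {0.5,0,0}, {0.52,0,0}, {1.1,0,0}, {1.8,0,0}};
    for (std::size_t i : antiStopFilter(centres, 0.7)) std::printf("keep keyframe %zu\n", i);
}
```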

The third step, called the divergent self-rotation filter, removed the keyframes captured by camera B that met two conditions: the rotation angles of the camera, ω*<sup>i</sup>* (*X* axis) and κ*<sup>i</sup>* (*Z* axis) (Figure 4), increased or decreased steadily during data capture, by at least ±9° (in our case) over at least three consecutive frames, and, in addition, their projection centers were very close to each other; this proximity was computed with the same procedure used for the AntiStop filter, but with a different value of *p* (in our case, *p* = 0.9).
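A hedged sketch of the divergent self-rotation check for a triplet of consecutive camera-B keyframes is given below; it assumes that "increase or decrease steadily by at least ±9°" means a monotone per-step change of that magnitude in both ω and κ, and that the proximity test reuses a *Dmin* value computed as in the AntiStop filter with *p* = 0.9. These interpretations, as well as the data structures, are illustrative.

```cpp
// Minimal sketch of the divergent self-rotation filter for three consecutive
// camera-B keyframes: flag the triplet when both rotation angles (omega about X,
// kappa about Z) change monotonically by at least the threshold while the
// projection centres stay closer together than dMin.
#include <array>
#include <cmath>
#include <cstdio>

struct KeyframePose {
    std::array<double, 3> centre;  // projection centre (m)
    double omega;                  // rotation about the X axis (deg)
    double kappa;                  // rotation about the Z axis (deg)
};

bool monotoneChange(double a, double b, double c, double minStep) {
    const bool increasing = (b - a) >= minStep && (c - b) >= minStep;
    const bool decreasing = (a - b) >= minStep && (b - c) >= minStep;
    return increasing || decreasing;
}

double dist(const std::array<double,3>& a, const std::array<double,3>& b) {
    return std::sqrt((a[0]-b[0])*(a[0]-b[0]) + (a[1]-b[1])*(a[1]-b[1]) + (a[2]-b[2])*(a[2]-b[2]));
}

// True when the keyframes of the triplet should be removed.
bool divergentSelfRotation(const KeyframePose& k0, const KeyframePose& k1,
                           const KeyframePose& k2, double dMin, double angleStep = 9.0) {
    const bool rotating = monotoneChange(k0.omega, k1.omega, k2.omega, angleStep) &&
                          monotoneChange(k0.kappa, k1.kappa, k2.kappa, angleStep);
    const bool nearlyStatic = dist(k0.centre, k1.centre) < dMin &&
                              dist(k1.centre, k2.centre) < dMin;
    return rotating && nearlyStatic;
}

int main() {
    KeyframePose a{{0.00, 0, 0},  0.0,  0.0};
    KeyframePose b{{0.02, 0, 0}, 10.0, 11.0};
    KeyframePose c{{0.03, 0, 0}, 21.0, 23.0};
    std::printf("remove triplet: %s\n",
                divergentSelfRotation(a, b, c, /*dMin=*/0.3) ? "yes" : "no");
}
```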

**Figure 4.** Divergent self-rotation in the (**a**) *X*-axis; and (**b**) *Z*-axis.

The next process (III) was segmentation, which aimed to obtain more significant and easier-to-analyze images for the subsequent photogrammetric process. It started by searching for the homologous points of the keyframes resulting from the filtering carried out in (II) [53–55]; matching was performed between each image, the previous one, and the next one. The resulting images were stored in a set called a "segment". This process generated one or more mutually independent segments, each of which had a number of homologous points and a point distribution appropriate for a correct orientation (in our case, at least 200 points, with at least 10% of them in each quadrant of the image; in addition, if a segment did not contain at least three images, it was discarded and its images were removed).
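As a sketch of the acceptance test used when building a segment, the code below checks that an image shares at least 200 homologous points with its neighbors and that at least 10% of those points fall in each image quadrant; the data layout and the synthetic point set are assumptions for illustration, and the additional rule of discarding segments with fewer than three images would be applied at a higher level.

```cpp
// Minimal sketch of the per-image acceptance test for building a "segment":
// enough homologous points, reasonably distributed over the four quadrants.
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Point2D { double x, y; };

bool wellDistributed(const std::vector<Point2D>& pts, double width, double height,
                     std::size_t minPoints = 200, double minFractionPerQuadrant = 0.10) {
    if (pts.size() < minPoints) return false;
    std::array<std::size_t, 4> perQuadrant{0, 0, 0, 0};
    for (const auto& p : pts) {
        const int col = p.x < width  / 2.0 ? 0 : 1;
        const int row = p.y < height / 2.0 ? 0 : 1;
        ++perQuadrant[row * 2 + col];
    }
    for (std::size_t n : perQuadrant)
        if (static_cast<double>(n) < minFractionPerQuadrant * static_cast<double>(pts.size()))
            return false;
    return true;
}

int main() {
    // Synthetic, roughly uniform matches on a 1280 x 960 image.
    std::vector<Point2D> pts;
    for (int i = 0; i < 300; ++i)
        pts.push_back({(i % 20) * 64.0 + 10.0, (i / 20) * 60.0 + 10.0});
    std::printf("image accepted: %s\n", wellDistributed(pts, 1280, 960) ? "yes" : "no");
}
```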

The last process (IV) was the photogrammetric process, structured in three steps. The first was to compute a relative image orientation [53], setting the first image as the origin of the relative reference system and using the homologous points of each segment together with algorithms leading to direct solutions [17,51,53]; then, a bundle adjustment was applied to the oriented images to avoid divergences [56], providing the coordinates of the camera poses and the computed tie points. The second step consisted of adjusting the camera poses in each segment to adapt them to the overall trajectory computed in (I); this was performed using least-squares techniques [57] in each segment and a three-dimensional transformation [10] to correct the positions of camera B with respect to camera A.

In the third step, the scene was reconstructed using the MICMAC software [54] in order to obtain dense, colored point clouds. MICMAC is free, open-source photogrammetry software developed by the French National Mapping Agency (IGN) and the National School of Geographic Sciences (ENSG) [58]. This software generates a depth map from a main image and a series of secondary images to obtain parallax values. The calculation assumed that the scene could be described by a single function *Z* = *f*(*X*, *Y*) (with *X*, *Y*, *Z* being Euclidean coordinates), and several MICMAC parameters were used to compute the dense correlation and obtain the colored dense point cloud [54,55,59], which was the final result of the process.

#### **3. Accuracy Assessment and Results**

This work determined the accuracy of a set of point clouds obtained with the prototype in order to validate the device for BIM work environments. Additionally, the results were compared with a usual photogrammetric procedure, using a reflex camera and photogrammetric software (Agisoft Metashape [60]), in order to weigh the advantages and disadvantages of the proposed prototype with respect to this known methodology. For this, an experimental test was carried out on the Roman aqueduct of "The Miracles" in the city of Mérida (Spain). This monument, built in the first century A.C., has a total length of 12 km, combining underground sections and elevated sections with arches. The test was carried out on a stretch of arches 23 m high and 60 m wide, performing three data captures at different observation distances (5, 12, and 20 m) from the prototype to the base of the monument (Figure 5).

**Figure 5.** (**a**) Scheme with the data capture trajectories and (**b**) the areas covered by a frame at prototype-monument distances of 5, 12, and 20 m. The figure in (**b**) is a 3D model (mesh) generated with the Meshlab software [61] from the 20 m point cloud, made only for visualization purposes.

In this test, the data collection was carried out in such a way that the user moved in a direction perpendicular to the camera's optical axis (Figure 2), avoiding divergent turns, since this kind of movement was not necessary in this case. This prevented the algorithm from applying the divergent self-rotation filter in an unnecessary situation.

In order to evaluate the metric quality of the measurements obtained with the prototype and with the Agisoft Metashape photogrammetric procedure, a control network was established for the dimensional control study, following the procedures described in [62] and [63]. The network served as a set of reference points and consisted of artificial and natural targets whose three-dimensional coordinates in a local coordinate system were obtained with a second measuring instrument (more precise than the device under evaluation). In this case, a Pentax V-227N total station (Pentax Ricoh Imaging Company, Ltd., Tokyo, Japan) was used, with an accuracy of 7″ for angular measurements (ISO 17123-3:2001) and 3 mm ± 2 ppm for distance measurements (ISO 17123-4:2001), with which a total of 40 uniformly distributed targets were measured (Figure 6).

Then, the method proposed by [62] was used, in which the accuracy of the 3D point cloud was quantified according to the Euclidean average distance error (δ*avg*) as:

$$\delta_{\text{avg}} = \frac{1}{n} \sum_{i=1}^{n} \left| \mathbf{R} a_i - T - b_i \right| \tag{1}$$

where *a<sup>i</sup>* is the *i*th checkpoint measured by the prototype, *b<sup>i</sup>* is the corresponding reference point acquired by the total station, and *R* and *T* are the rotation and translation parameters of the 3D Helmert transformation. The quality of the 3D point cloud was also evaluated by the root mean square error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( a_i^t - b_i \right)^2} \tag{2}$$

where *a<sup>t</sup><sub>i</sub>* denotes the point *a<sup>i</sup>* after the 3D conformal transformation that brings the model coordinates into the same system as the reference points.
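The two metrics of Equations (1) and (2) reduce to simple sums over checkpoint/reference pairs; the sketch below computes them under the assumption that the model points have already been transformed into the reference system (i.e., the values are the *a<sup>t</sup><sub>i</sub>*), with purely illustrative coordinates.

```cpp
// Minimal sketch of the two accuracy metrics of Equations (1) and (2), assuming
// the checkpoints have already been brought into the reference system by the
// 3D Helmert transformation. Coordinate values are illustrative.
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

using Vec3 = std::array<double, 3>;

double norm(const Vec3& v) { return std::sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]); }

int main() {
    // Transformed checkpoints a_i^t and reference points b_i (metres).
    std::vector<Vec3> a{{1.002, 2.001, 0.499}, {3.498, 1.003, 0.998}, {5.001, 2.502, 1.503}};
    std::vector<Vec3> b{{1.000, 2.000, 0.500}, {3.500, 1.000, 1.000}, {5.000, 2.500, 1.500}};

    double sumDist = 0.0, sumSq = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const Vec3 d{a[i][0]-b[i][0], a[i][1]-b[i][1], a[i][2]-b[i][2]};
        sumDist += norm(d);           // |a_i^t - b_i| for delta_avg
        sumSq   += norm(d) * norm(d); // squared distance for the RMSE
    }
    const double n = static_cast<double>(a.size());
    std::printf("delta_avg = %.4f m, RMSE = %.4f m\n", sumDist / n, std::sqrt(sumSq / n));
}
```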

**Figure 6.** (**a**,**c**) Reference points spread on the two fronts of the monument. (**b**) Target model used in the test and (**d**) total station Pentax V-227N used to measure the network coordinates.

As mentioned in Section 1 of this paper, point cloud accuracy can be evaluated according to different criteria. In our case, we used the GSA BIM Guide for 3D Imaging criteria, which defines four levels of detail (LOD), with the dimensions of the smallest recognizable feature ranging from 13 mm × 13 mm to 152 mm × 152 mm, and which also defines the level of accuracy (LOA) associated with each LOD, with tolerances ranging from 3 to 51 mm, understood as the allowable dimensional deviation of the deliverable from the truth (obtained by other, more precise means). In the case of a point cloud, the guide specifies that the distance between two points in the model must be compared with the true distance between the same two points and must be less than or equal to the specified tolerance; the guide also defines the area of interest as a hierarchical system of scales in which each scan is registered, depending on the LOD. In Table 3, we summarize the data quality parameters defined by the GSA for registering point clouds.
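As a small illustration of the GSA acceptance rule just described, the following sketch compares the distance between two points measured in the point cloud with the true distance between the same two points surveyed with the total station, and checks it against the 51 mm ('level 1') and 13 mm ('level 2') tolerances; the coordinates are invented for the example.

```cpp
// Minimal sketch of the GSA BIM Guide acceptance test: the deviation between a
// model distance and the corresponding true distance must not exceed the LOA
// tolerance of the chosen LOD.
#include <array>
#include <cmath>
#include <cstdio>

using Vec3 = std::array<double, 3>;

double distance(const Vec3& a, const Vec3& b) {
    return std::sqrt((a[0]-b[0])*(a[0]-b[0]) + (a[1]-b[1])*(a[1]-b[1]) + (a[2]-b[2])*(a[2]-b[2]));
}

bool withinTolerance(const Vec3& p1, const Vec3& p2,   // points from the cloud (m)
                     const Vec3& r1, const Vec3& r2,   // same points surveyed (m)
                     double toleranceMm) {
    const double deviationMm = std::fabs(distance(p1, p2) - distance(r1, r2)) * 1000.0;
    return deviationMm <= toleranceMm;
}

int main() {
    const Vec3 p1{0.010, 0.000, 0.002}, p2{4.985, 0.003, 2.001};  // cloud measurement
    const Vec3 r1{0.000, 0.000, 0.000}, r2{5.000, 0.000, 2.000};  // total-station reference
    std::printf("meets 51 mm (level 1) tolerance: %s\n",
                withinTolerance(p1, p2, r1, r2, 51.0) ? "yes" : "no");
    std::printf("meets 13 mm (level 2) tolerance: %s\n",
                withinTolerance(p1, p2, r1, r2, 13.0) ? "yes" : "no");
}
```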

**Table 3.** Data quality parameters defined by U.S. General Services Administration (GSA) for registering point clouds. (Unit: Millimeters).


In order to complete the study, another photogrammetric system was analyzed under conditions similar to those of the prototype (Figure 7). The camera used was a Canon EOS 1300D with an EF-S 18–55 mm lens, used only at the 18 mm focal length in this experiment. Several images were taken for each distance (35 images at 5 m, 41 at 12 m, and 43 at 20 m), and the camera was configured with a resolution of 2592 pixels × 1728 pixels, so that the results could be compared fairly with the proposed approach, which has a similar image resolution. The reflex camera's parameters (shutter, diaphragm, ISO, etc.) were set to automatic mode during the test to match the conditions of the prototype test. The pictures were taken from the same trajectories previously followed by the prototype, at the same distances from the aqueduct: 5, 12, and 20 m. These circumstances increased the time spent in the field during data capture, as can be seen in Table 4, because the user had to focus each image and ensure that each picture was taken with enough overlap and quality. In contrast, the prototype cameras also used automatic parameters, which, together with the proposed methodology, allowed the user to capture data continuously, without stopping to take images. The images from the Canon camera were processed with the Agisoft Metashape 1.5.4 software [60], commercialized by Agisoft LLC, St. Petersburg, Russia (Figure 8).

The point cloud density of each system was also measured. Two point clouds, one per system, were processed using the same 10 images of the aqueduct taken at a distance of 5 m. The density [37] of the point cloud was 328 points/dm<sup>2</sup> for the proposed prototype system and 332 points/dm<sup>2</sup> for the Agisoft Metashape photogrammetric software.

**Table 4.** Comparison between the proposed approach and the camera with Agisoft Metashape software in terms of the time spent in the field for data capture and the processing time on the same laptop (Intel Core i7 7700HQ CPU, 16 GB RAM, Windows 10 Home operating system). Distance values are measured from the camera to the monument.


With the prototype and the VSLAM-photogrammetric algorithm, we computed the average error and the RMSE in each direction (*x*, *y*, and *z*) for each data capture distance, listed in Table 5, with overall accuracies of 12, 26, and 46 mm for 5, 12, and 20 m, respectively, and RMSEs on each axis ranging from 5 to 8 mm (5 m), 10 to 21 mm (12 m), and 30 to 38 mm (20 m) (Figure 7). These results satisfy the error tolerance of 'level 1' (51 mm) for data capture distances of 12–20 m and of 'level 2' (13 mm) for data capture distances of about 5 m.

**Table 5.** Accuracy assessment results, with the root mean square errors (RMSEs) and average errors for data capture distances from 5 to 20 m from the camera to the monument, for the prototype with the VSLAM-photogrammetric algorithm and for the Canon camera with Agisoft Metashape software. The RMSE values were computed for the three vector components: *X*, *Y*, and *Z*.


**Figure 7.** Evolution of the average errors and RMSEs for distances of 5, 12, and 20 m from the camera to the monument. The results are shown for both systems: the prototype with the VSLAM-photogrammetric algorithm, and the Canon camera with Agisoft Metashape software.

**Figure 8.** Comparison between the point clouds obtained with both systems (with a data capture distance of 12 m): (**a**) prototype with the VSLAM-photogrammetric algorithm; and (**b**) Canon EOS 1300D camera with Agisoft Metashape software.

The point clouds obtained at the different observation distances are shown in Figure 9. Small holes or missing parts can be seen in these point clouds. This is due to the camera's trajectory, since the camera needs to point directly at all the desired areas and capture a minimum number of images to perform an optimal triangulation. No filter was applied to the results shown in Figure 9.


**Figure 9.** Point clouds obtained at the distances established in the experimental test: (**a**) 5 m; (**b**) 12 m; and (**c**) 20 m. The images show the central part of the colored point clouds resulting from the test. The point clouds have not been filtered or edited.


#### **4. Conclusions**

The major innovations of this study are as follows. First, the proposed approach for 3D data capture and the implementation of the VSLAM-photogrammetric algorithm were materialized in a functional, low-cost prototype, which was verified in an experimental test whose results have been presented in the context of the BIM work environment.

Second, the results obtained in the experimental test comply with the precision requirements of the GSA BIM Guide for 3D Imaging for point cloud capture with a resolution (minimum artifact size) of 152 mm × 152 mm at observation distances of approximately 20 m. For distances between 5 and 12 m, better accuracies and resolutions were achieved.

Third, the possibility of using the instrument at different distances facilitates data capture in shaded areas or areas with difficult access. This, together with the fact that the device was designed for outdoor data collection, makes it suitable for urban design and historic documentation, which are usually carried out in outdoor environments, registering information for plans, sections, elevations, and details, and producing a 3D point cloud in PLY format (position: *x*, *y*, *z*; color: R, G, B), following the GSA PBS (Public Building Service) CAD standards (2012) and the GSA BIM Guide for 3D Imaging standards.

In order to deepen the understanding of the proposed approach, it was compared with a well-known photogrammetric methodology consisting of a Canon EOS 1300D reflex camera and the Agisoft Metashape software. The results of the comparison test provided interesting conclusions:


A new handheld, image-based mobile mapping system has been presented in this paper. The proposed methodology does not degrade the known photogrammetric process (in accuracy, processing time, or point cloud density), but it offers a new, easier, and faster way to capture data in the field, based on continuous data capture and fully automatic processing, without human intervention in any phase.

**Author Contributions:** Conceptualization, P.O.-C.; data curation, P.O.-C. and A.S.-R.; formal analysis, A.S.-R.; investigation, P.O.-C.; methodology, P.O.-C.; resources, P.O.-C.; software, P.O.-C.; supervision, A.S.-R.; validation, A.S.-R.; writing—original draft, P.O.-C.; writing—review, A.S.-R.

**Funding:** This research received no external funding.

**Acknowledgments:** We are grateful to the "Consorcio de la Ciudad Monumental de Mérida" for allowing the work in this monument.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
