1. Introduction
The Portuguese Exclusive Economic Zone (EEZ) incorporates three large areas associated with the continent (327,667 km²), the Azores (953,633 km²), and Madeira (446,108 km²) [1,2]. As a coastal state, Portugal may exercise sovereign rights over managing natural resources and other economic activities (e.g., tidal power) and is also responsible for conserving living resources and fighting pollution [3].
To assess compliance with existing regulations and laws, many technologies can be used to perform surveillance, e.g., Synthetic Aperture Radar (SAR) [4], the Satellite Automatic Identification System (S-AIS) [5], or the Vessel Monitoring System (VMS) [6]. Using SAR images, we can detect non-AIS-transmitting vessels [7]. Nevertheless, the covered area depends on the satellite location, and the detection accuracy depends on the algorithm used [8]. Exploring and using complementary technologies is essential for constant surveillance over the largest possible area.
Currently, Fast Patrol Boats (FPBs) are essential for acting in the field and are still extensively used for surveillance in Portuguese territorial waters. Extending the FPB surveillance capability using fixed-wing Unmanned Aerial Vehicles (UAVs) is essential. Still, the small ship dimensions add an extra challenge, mainly in take-off and landing operations [9,10,11,12,13]. Small-sized fixed-wing UAVs (typically with a payload below 5 kg) can usually be launched by hand [10,14], leaving most of the operational risk in the landing stage. Landing safety is essential to conclude UAV-based missions successfully. There are three major causes of UAV accidents [15]: human, material, and environmental (e.g., weather or illumination changes). The vast majority of registered UAV accidents are due to human factors [16,17,18], so investing in the automation of the most complex maneuvers is essential.
When operating UAVs at sea, we must adapt the system to the existing environment. It is essential to guarantee a waterproof platform, and the communication range and reliability must also be high [19,20]. The basic architecture of a UAV system is composed of a Ground Control Station (GCS), a communications data link, and a vehicle platform [21] (Figure 1).
The UAV landing retention system also plays an essential part in the complexity of the landing operation, and it must be perfectly adapted to the vehicle's size and weight. A net-based retention system (Figure 2) presents a good compromise between complexity and effectiveness [10,22], as it can be rapidly mounted on any ship without requiring structural changes. If we do not want to capture the UAV directly on the ship, we can also use an external retention system, for instance, one based on quadcopters [23] using a net [24] or a line [25].
We aimed to land a UAV on a moving platform (ship) in an outdoor environment subject to varying meteorological conditions, e.g., illumination, wind, and ship motion. The UAV is considered a cooperative rigid object with a simple autopilot that maintains a constant trajectory to the landing area. UAVs are also evolving with the development of new structural designs [26], new materials [27], new payloads [28,29], and optimized radar cross-sections [30]. The ship is also considered cooperative and maneuvers to adjust the relative speed of the UAV and, consequently, the relative wind. Depending on the ship's superstructures, wind vortices can arise that make the UAV adopt an erroneous trajectory.
An automated landing control system can be based on onboard sensors [21], but we usually do not have enough processing power onboard a small UAV. As an option, we can also rely on the Global Positioning System (GPS), but this system can be affected by jamming or spoofing [31,32,33]. In this article, we used a ground-based monocular Red, Green, and Blue (RGB) vision system with the camera on the ship deck [13,34]. Using a ground-based system, we can access more processing power and implement algorithms to perform the autonomous landing of a UAV with a simple autopilot. Since we used a ground-based Graphics Processing Unit (GPU) system [11,35], which depends only on the GPU processing capabilities, we can easily upgrade it and access more power without restrictions. On the other hand, UAV onboard-based algorithms require high processing capabilities, which are not easily available in small-sized UAVs. To successfully perform an autonomous landing maneuver, knowing the UAV's position and orientation is crucial to predict and control its trajectory. Pose estimation and tracking are widely studied (and still open) problems in Computer Vision (CV), with new methods constantly emerging [36,37].
When we know the object we want to detect and have its 3D model or Computer-Aided Design (CAD) model, we can use it to retrieve knowledge from the captured image. We can use Sequential Monte Carlo (SMC) methods, or Particle Filters (PFs) [38], to establish the 2D/3D correspondence, estimate the UAV pose, and perform tracking over time [39]. A PF represents the pose distribution by a set of weighted hypotheses (particles) that explicitly test the object's projection on the image with a given pose [40]. Although a large number of particles is desirable for fast convergence, particle evaluation is usually very computationally demanding [35], and we need a good compromise between speed and accuracy. Some examples of the pose tracking results obtained with our method can be seen in Figure 3, where the obtained pose estimation is represented in red.
The adopted tracking architecture is divided into three stages: (i) Pose Boosting; (ii) Tracking; (iii) Pose Optimization [41,42]. In the Pose Boosting stage, we used Machine Learning (ML) algorithms to detect the UAV in the captured frame and a pre-trained database to generate representative pose samples [12,43]. In the Tracking stage, we used a 3D-model-based tracking approach that combines the UAV CAD model with a PF [42]. In the Pose Optimization stage, since we used a sub-optimal similarity metric to approximate the likelihood function iteratively, we included optimization steps to improve the estimate [41].
This article presents the following main developments (innovations): (i) the analysis, testing, and comparison on the same dataset of two different PF implementations using pose optimization: (a) an Unscented Bingham Filter (UBiF) [41,42,44] and (b) an Unscented Bingham–Gauss Filter (UBiGaF) [42,45]; (ii) the implementation of a new tree-based similarity metric approach to obtain faster and more accurate weight estimation; (iii) a deeper analysis and evaluation of how the optimization steps can decrease the filter convergence time; (iv) a validation and comparison between methods using a realistic synthetic dataset. In this article, we did not compare the UBiF and UBiGaF with more traditional methods such as the Unscented Kalman Filter (UKF) for orientation estimation since, in [41,42], we already made this comparison and showed that these implementations outperform the UKF. As far as we know, there are no publicly available datasets or other ground-based model-based UAV tracking approaches, making a comparison with other state-of-the-art methods impossible.
The main developments regarding our previous work are: (i) in [42], we proposed a tracking architecture using a UBiGaF without pose optimization, which is explored in this article; (ii) in [41], we applied optimization steps to the UBiF, which are applied here also to the UBiGaF; (iii) in [43], we proposed a distance transform similarity metric, which is modified here into a tree-based approach to decrease the processing time without a loss in accuracy.
This article is organized as follows. Section 2 presents the related work concerning pose estimation, pose tracking, and UAV tracking. Section 3 details the overall implemented system, including the pose boosting, tracking, and optimization stages. Section 4 presents the experimental results. Finally, Section 5 presents the conclusions and explores ideas for future research.
3. Overall System Description
The final objective of an autonomous landing system is to use the sensor information to control the trajectory of the UAV. In this article, the main focus was on the use of the camera information (sensor) to perform pose tracking (measurement) over time (Figure 4).
We used a ground-based monocular RGB camera system to sense the real world (capture image frames), using a frame rate of 30 Frames Per Second (FPS) and an image size of pixels (width × height) [11,35]. Then, we combined that information with algorithms that use the UAV CAD model as a priori information. In this specific application, the camera should be located near the landing area to be able to estimate the UAV's pose relative to the camera (translation and orientation) and perform tracking (Figure 5).
Additionally, we only need to ensure communication with the UAV to send the needed trajectories to a simple onboard autopilot, and to ensure that the net-based retention system is installed and has the characteristics needed to avoid damage to the UAV or ship (Figure 6). In a small-sized FPB, the expected landing area (the available landing area of the net-based retention system) is about m, which is about 2.5 times larger than the used UAV model's wingspan (Table 1).
We used a small-sized fixed-wing UAV (Figure 7), with the characteristics described in Table 1. Due to its size and weight, the UAV's take-off can easily be performed by hand, and all the focus will be on the landing maneuver, as described before. It is essential to develop autonomous methods that are more reliable than a human-in-the-loop, decreasing the probability of an accident.
During the development of the adopted approach, we considered that (i) the UAV is a rigid body; (ii) the UAV's mass and mass distribution remain constant during operation; (iii) the UAV's reference frame origin is located at its Center Of Gravity (COG). The UAV's state is represented in the camera reference frame according to [41]:

$$ \mathbf{x} = \begin{bmatrix} \mathbf{p} & \mathbf{v} & \mathbf{q} & \boldsymbol{\omega} \end{bmatrix}^{\top}, $$

where $\mathbf{p} \in \mathbb{R}^{3}$ is the linear position, $\mathbf{v} \in \mathbb{R}^{3}$ is the linear velocity, $\mathbf{q}$ is the angular orientation (quaternion representation), and $\boldsymbol{\omega} \in \mathbb{R}^{3}$ is the angular velocity, all according to the camera reference frame represented in Figure 8.
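To make the state layout concrete, the following minimal Python sketch shows one possible container for this state; the field names and the quaternion convention are our illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class UAVState:
    """UAV state in the camera reference frame (illustrative names)."""
    p: np.ndarray = field(default_factory=lambda: np.zeros(3))  # linear position [m]
    v: np.ndarray = field(default_factory=lambda: np.zeros(3))  # linear velocity [m/s]
    q: np.ndarray = field(default_factory=lambda: np.array([1.0, 0.0, 0.0, 0.0]))  # quaternion (w, x, y, z), assumed convention
    w: np.ndarray = field(default_factory=lambda: np.zeros(3))  # angular velocity [rad/s]

    def as_vector(self) -> np.ndarray:
        # Stack the components into a single 13-D state vector.
        return np.concatenate([self.p, self.v, self.q, self.w])
```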
The implemented system is divided into the following three stages:
Pose Boosting (Section 3.1)—In this stage, the UAV is detected in the captured frame using deep learning, and filter initialization is performed using a pre-trained database. In each iteration, information from the current frame (using the pre-trained database) is also used to add particle diversity based on the most recent observation;
Tracking (Section 3.2)—In this stage, we used a 3D-model-based approach based on a PF to perform tracking. By applying temporal filtering, we improved the accuracy by minimizing the estimation error;
Pose Optimization (Section 3.3)—Since we used a similarity metric to obtain the particle weights, we added a refinement step that uses the current frame (current time instant) and the current estimate to search for a better solution.
3.1. Pose Boosting
This stage was initially inspired by the Boosted Particle Filter (BPF) [88], which extends the Mixture Particle Filter (MPF) [89] by incorporating Adaptive Boosting (AdaBoost) [90]. We adopted the approach described in [41,42,43,87], characterized by the following two stages:
Detection—In this stage, we detect the UAV in the captured frame using the You Only Look Once (YOLO) [91] object detector. Target detection is critical since we were operating in an outdoor environment where other objects may be present and affect the system's reliability;
Hypotheses generation—From the Regions Of Interest (ROIs) obtained in the detection stage, we cannot infer the UAV's orientation, but only its 3D position. To deal with this, we obtained the UAV's Oriented Bounding Box (OBB) and compared it with a pre-trained database of synthetically generated poses to obtain the pose estimates (Figure 10), as sketched below.
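A minimal sketch of this two-stage procedure is given below; the callables `detect_fn` and `db_query_fn` stand in for the YOLO detector and the pre-trained pose database, and all names (including the size-based depth remark) are our assumptions, not the authors' API:

```python
def pose_boosting(frame, detect_fn, db_query_fn, n_hypotheses=100):
    # detect_fn: frame -> list of (x, y, w, h, score) boxes (e.g., from YOLO),
    #            with integer pixel coordinates.
    # db_query_fn: image crop -> ranked list of candidate quaternions from the
    #              pre-trained database of synthetically rendered poses.
    boxes = detect_fn(frame)
    if not boxes:
        return []                                  # no UAV detected in this frame
    x, y, w, h, _ = max(boxes, key=lambda b: b[4])  # keep the best detection
    crop = frame[y:y + h, x:x + w]
    quats = db_query_fn(crop)[:n_hypotheses]
    # Depth could be coarsely inferred from the apparent size; here we only
    # pair the 2-D box centre with each candidate orientation.
    return [((x + w / 2.0, y + h / 2.0), q) for q in quats]
```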
3.2. Tracking
We adopted a PF approach based on the Unscented Particle Filter (UPF) [
92], whose standard scheme is divided into three stages: (i) initialization, (ii) importance sampling, and (iii) importance weighting and resampling. The initialization was performed in the pose boosting stage described in
Section 3.1. The main difference, when compared with the standard UKF is in the importance sampling stage, where we used a UKF for the translational motion and a UBiF/UBiGaF for the rotational motion to be able to incorporate the current observation and generate a better proposal distribution. The adopted tracking stage is divided into the following two stages [
41,
42]:
Proposal (
Section 3.2.1)—In this stage, we generate the proposal distribution, which should be as close as possible to the true posterior distribution;
Approximate weighting and resampling (
Section 3.2.2)—In this stage, we approximate the particle weights (approximate weighting) by using a Distance-Transform-based similarity metric. After evaluating the particles, we apply a resampling scheme to replicate the high-weight ones and eliminate the low-weight ones.
3.2.1. Proposal
Motion filtering has the purpose of using measurements affected by uncertainty over time to generate results closer to reality. We adopted the motion model (translational and rotational) described in [12,41,42], where a UKF is applied to the translational filtering and a UBiF/UBiGaF to the rotational filtering. The Bingham (Bi) distribution is an antipodally symmetric distribution [93] used by the UBiF to better quantify the existing uncertainty without correlation between angular velocity and attitude. On the other hand, the UBiGaF uses a Bingham–Gauss (BiGa) distribution [45], which takes this correlation into account in the filtering structure. The UBiF update step is simple to implement since the product of two Bi distributions remains a Bi distribution after renormalization. The same does not happen with the BiGa distribution, which needs an update step based on the Unscented Transform (UT) [42,45,94]. We already explored both implementations in [12,41,42], showing a clear performance improvement when compared with traditional methods, e.g., the UKF.
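For intuition, the Bingham density over unit quaternions is proportional to exp(qᵀMZMᵀq), with M an orthogonal matrix of principal directions and Z a diagonal matrix of non-positive concentration parameters; because the exponent is quadratic in q, the density assigns the same value to q and −q, which is exactly the antipodal symmetry mentioned above. A minimal numerical check (our own illustrative code, not the filter implementation):

```python
import numpy as np

def bingham_unnormalized(q, M, Z):
    """Unnormalized Bingham density on the unit quaternion sphere.
    q: unit quaternion, shape (4,); M: 4x4 orthogonal matrix of principal
    directions; Z: shape (4,) non-positive concentration parameters."""
    C = M @ np.diag(Z) @ M.T
    return float(np.exp(q @ C @ q))

# Antipodal symmetry holds by construction: f(q) == f(-q).
M = np.linalg.qr(np.random.randn(4, 4))[0]       # random orthogonal matrix
Z = np.array([-10.0, -5.0, -1.0, 0.0])           # concentrations (largest is 0 by convention)
q = np.random.randn(4); q /= np.linalg.norm(q)   # random unit quaternion
assert np.isclose(bingham_unnormalized(q, M, Z),
                  bingham_unnormalized(-q, M, Z))
```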
3.2.2. Approximate Weighting and Resampling
Usually, the object characteristics and the search space complexity are the primary sources of complexity in tracking. We used a small-sized fixed-wing UAV (Figure 11), which is very difficult to detect at long distances since it is designed for military surveillance operations. On the other hand, the search space is a vast outdoor environment, making the problem very complex, since we wanted to start detecting the UAV at least 80 m from the landing area.
The UAV model symmetries are another problem since we used a 3D-model-based approach, where we need to clearly discriminate the correct UAV pose in the captured frame. In Figure 12, we can see that symmetric poses have almost the same appearance, affecting the obtained results since we used a pixel-based similarity metric. The $\phi$ angle represents the rotation around the camera x-axis, the $\theta$ angle the rotation around the camera y-axis, and the $\psi$ angle the rotation around the camera z-axis, as described in Figure 8. In this article, we implemented the Distance Transform (DT) similarity metric described in [43,95]. This similarity metric computes the distance to the closest edge between the DT [96] of the captured frame and the edge map [97] of the pose hypothesis to compare. The DT similarity metric is obtained according to

$$ S_{DT} = e^{-\lambda s}, $$

where $\lambda$ is a fine-tuning parameter and $s$ is given by

$$ s = \frac{1}{k} \sum_{i=1}^{B} DT(f)_{i} \, E(h)_{i}, $$

where $k$ is the number of edge pixels of the hypothesis to compare, $B$ is the total number of image pixels, $h$ is the pose hypothesis image, $f$ is the captured frame, $DT(f)$ is the DT of the captured frame, and $E(h)$ is the edge map of the pose hypothesis image to compare.
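A possible OpenCV-based sketch of this metric is shown below; the Canny thresholds, the distance transform mask size, and the value of λ are illustrative assumptions:

```python
import cv2
import numpy as np

def dt_similarity(frame_gray, hypothesis_edges, lam=0.1):
    """Sketch of the DT similarity metric (parameter values are ours).
    frame_gray: captured frame (grayscale, uint8);
    hypothesis_edges: binary edge map of the rendered pose hypothesis."""
    # Edge map of the captured frame (Canny thresholds are illustrative).
    frame_edges = cv2.Canny(frame_gray, 100, 200)
    # Distance transform: distance of every pixel to the closest frame edge
    # (edges must be zero-valued pixels for cv2.distanceTransform).
    dt = cv2.distanceTransform(255 - frame_edges, cv2.DIST_L2, 3)
    k = max(int(np.count_nonzero(hypothesis_edges)), 1)
    # Mean distance from hypothesis edge pixels to the closest frame edge.
    s = float(dt[hypothesis_edges > 0].sum()) / k
    return float(np.exp(-lam * s))   # closer edges -> higher similarity
```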
Some of the most time-demanding operations in 3D-model-based pose estimation are particle rendering (pose hypothesis rendering) and evaluation [35]. Considering that, at each instant (system iteration), we evaluate hundreds (or even thousands) of particles, we realized that we needed to optimize these operations. A complex problem is normally divided into smaller ones, and the same approach was adopted here regarding the UAV CAD model. We divided the UAV CAD model into the following smaller parts (Figure 13): (i) wing, (ii) elevator, and (iii) body. Since the adopted similarity metric can be computed independently for each UAV part, we can adopt a particle evaluation strategy where each part is evaluated independently, discarding the evaluation sooner if needed, or combining the obtained weights.
To optimize the particle evaluation, a tree-based similarity metric approach is used (Figure 14), where each part is evaluated independently and sequentially (wing, elevator, and then fuselage) and compared with a predefined threshold $\tau$ (which can be different for each part). If the obtained weight $S$ is smaller than the predefined threshold, the evaluation stops early, avoiding rendering the complete UAV CAD model. This allows us to make better use of the available processing power, only performing a complete model rendering and analysis for the promising particles. The adopted tree-based similarity metric for each particle is given by

$$ S = \lambda_{w} S_{w} + \lambda_{e} S_{e} + \lambda_{f} S_{f}, $$

where $\lambda_{w}$, $\lambda_{e}$, and $\lambda_{f}$ are fine-tuning parameters, $S_{w}$ is the DT similarity metric for the wing part, $S_{e}$ is the DT similarity metric for the elevator part, and $S_{f}$ is the DT similarity metric for the fuselage part. By adjusting the threshold levels $\tau$ and the fine-tuning parameters, we can guarantee that the sum of the evaluations is equivalent to the evaluation of the whole model alone. This method can speed up the particle evaluation without a loss in accuracy.
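A sketch of this early-exit evaluation is given below; the callable-based interface, the per-part thresholds, and the weighted-sum combination follow our reading of the equation above and are not the authors' code:

```python
def tree_similarity(part_score_fns, thresholds, lambdas):
    """Tree-based (early-exit) particle evaluation sketch.
    part_score_fns: ordered callables, one per CAD part (wing, elevator,
    fuselage), each rendering and scoring only that part via the DT metric;
    thresholds/lambdas: per-part cut-offs and fine-tuning weights (ours)."""
    total = 0.0
    for score_fn, tau, lam in zip(part_score_fns, thresholds, lambdas):
        s = score_fn()            # render + evaluate only this part
        if s < tau:
            return 0.0            # discard the particle early, skip full render
        total += lam * s          # accumulate the weighted partial score
    return total
```

Only particles whose wing (and then elevator) scores pass their thresholds ever pay the cost of rendering the remaining parts, which is where the speed-up comes from.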
As described in [13], the resampling strategy that obtained the best results in the problem at hand was resampling reallocation, which was the resampling scheme used in the experimental results presented in Section 4.
3.3. Pose Optimization
In this stage, we performed a local search (refinement steps) in the particle neighborhood to optimize the used similarity metric using the current frame information. The adopted similarity metric is multimodal (it has more than one peak) and cannot be optimized using gradient-based methods [41,43]. We adopted the Particle Filter Optimization (PFO) approach described in [43], which is based on PF theory but uses the same input (image frame) in each iteration. Using a zero-velocity model, the PFO method is similar to a PF applied repeatedly to the same frame. By making the added noise decrease over time, we can obtain better estimates and reduce the needed convergence time. A minimal sketch of this procedure follows.
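The sketch below assumes particles represented as flat NumPy arrays and a generic similarity function; the iteration count, initial noise scale, and decay factor are illustrative assumptions:

```python
import numpy as np

def pfo_refine(particles, frame, similarity_fn, n_iters=5,
               sigma0=0.05, decay=0.5, rng=None):
    """Particle Filter Optimization sketch: re-weight and resample on the
    SAME frame, with perturbation noise that shrinks every iteration."""
    rng = rng or np.random.default_rng()
    sigma = sigma0
    for _ in range(n_iters):
        # Zero-velocity model: only additive noise perturbs the particles.
        moved = [p + sigma * rng.standard_normal(p.shape) for p in particles]
        w = np.array([similarity_fn(p, frame) for p in moved], dtype=float)
        w = (w + 1e-12) / (w + 1e-12).sum()   # guard against all-zero weights
        idx = rng.choice(len(moved), size=len(moved), p=w)
        particles = [moved[i] for i in idx]
        sigma *= decay   # decreasing noise concentrates the local search
    return particles
```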