1. Background and Motivations
Unmanned aerial vehicles (UAVs) are a well-established tool for surveyors and engineers, as they can provide cost-effective, high-resolution aerial imagery for mapping and 3D reconstruction of natural and artificial structures [1]. One important limitation of current UAV technology is its dependency on Global Navigation Satellite System (GNSS) coverage, as positioning information is generally required for guidance. Good reception of the GNSS signals is also required for cm-level mapping without the need to survey several ground control points [2,3]. This dependency severely limits the applicability of UAVs whenever the sky is largely occluded: the geometry of the visible GNSS constellation degrades and the position fix becomes irregular, inaccurate (due to multipath effects), and unreliable. This is very common in mountainous environments, e.g., see Figure 1, or near large natural or artificial structures, such as bridges and dams, in urban canyons, etc. In such conditions, autonomous flight is risky and most commercial platforms prevent take-off.
Many researchers are currently focusing on developing navigation solutions for GNSS-denied environments. Most of these are vision-based or involve some form of multi-sensor fusion, visual-inertial navigation being the most popular [4,5,6]. These methods rely on exteroceptive sensors and are thus inherently dependent on the environment (e.g., enough texture has to be present in the scene), and it is difficult to certify or guarantee their performance. This is why many excellent research prototypes exist, whereas no commercial product currently ships with full-fledged visual-inertial navigation.
If the environment can be freely structured, e.g., by placing landmarks, beacons, or other sensors, positioning information similar to that provided by GNSS may be obtained. Once such landmarks have been identified, the navigation problem can be solved, e.g., as presented in the works by the authors of [7,8]. Another solution is given by ultrawideband (UWB) positioning [9,10]. This technology offers a low-cost replacement for GNSS in indoor environments and can reach submeter accuracy in well-designed scenarios. An example of the application of UWB ranging to the cooperative navigation of multiple UAVs can be found in the work by the authors of [11]. In visible light positioning systems, multiple light-emitting diode (LED) beacons, placed at known positions, are imaged with a conventional camera [12,13]. Each beacon is identified thanks to an ID that is modulated over the beacon light intensity: the encoded signal can be recovered from a single frame by exploiting the rolling shutter property of certain imaging sensors [14]. However, for this to work, the light source must span several pixels on the imaging sensor, restricting the operation to close range relative to the camera resolution. The opposite idea consists of placing multiple small, point-wise active or passive targets on the moving platform and tracking them from multiple static cameras placed at known positions. The targets form high-contrast image features, possibly thanks to artificial IR or UV illumination and specific target surfaces, that are easily detected, e.g., by means of template matching algorithms [15]. This approach is well understood and widespread in 3D Motion Capture Systems (MOCAPs) [16], and many commercial implementations exist. The very high precision achieved by such systems, up to 1:15000 with respect to the volume diagonal [17], comes at the price of using several cameras with converging viewpoints and a very accurate intrinsic and extrinsic calibration. These systems require a heavily structured environment, and it is not straightforward to set them up outdoors, e.g., because natural light dominates the artificial one.
In this work, we focus on an alternative solution based on cooperative navigation that replaces GNSS positioning without the need to structure the environment. This is suited for outdoor operations where the GNSS position fix is unreliable because of occlusions or multipath effects. We consider two UAVs (see Figure 2): the first one, D1, flies outside of the GNSS-denied area, e.g., higher up or far away from any tall structure occluding the line of sight to the GNSS satellites. The second drone, D2, flies in an area where the GNSS signals are not available, but in line-of-sight with respect to D1. This is common and realistic in many outdoor scenarios, for example, when flying in close proximity to tall objects occluding a large part of the sky. D2 navigates using the position solution estimated by D1 while tracking a known pattern of high-power LEDs placed on the D2 fuselage: the absolute position of D2 is obtained by composing the absolute position and orientation of D1, determined by its internal INS/GNSS navigation filter, with the relative D1-to-D2 position, computed via LED tracking. A feasibility study and the implications of such a scheme on the accuracy of the 3D models reconstructed from D2 imagery were discussed in the work by the authors of [18], whereas the first prototype was presented in the work by the authors of [19]. A similar concept was explored in [20], where multiple “father” UAVs help to localize a “son” UAV, fusing opportunistic GNSS observations with collaborative measurements. With respect to that work, we present and evaluate a real-world implementation of a visual ranging algorithm that reduces the requirements to only one “father” UAV.
As in many other collaborative navigation systems [21,22,23], the key element is how the robots gain knowledge of their relative position in real-time. Many methods have been explored, from laser ranging to UWB, whereas the most widely employed one relies on optical targets. Planar coded targets [24,25,26] generally consist of a high-contrast geometric feature, such as a square or a circle, in which a code is embedded to exclude false matches and to distinguish between multiple targets present in the scene. The a priori knowledge of the physical dimensions of the target, along with the intrinsic camera calibration, makes it possible to determine the relative pose of the target with respect to the camera from a single image. One well-known method is to locate known points on the target and then solve the Perspective-n-Points (PnP) problem [27] using the established set of 2D-to-3D correspondences. Planar targets are widely employed in computer vision and robotics, as 3D landmarks, for camera calibration, and in several other applications in which the environment can be structured to ease tasks such as, for instance, navigation and object manipulation. One drawback is that many pixels are typically needed to recover the target code, limiting their use to close- and medium-range applications. Second, conventional planar targets are typically printed on rigid surfaces, which are not applicable to flying drones because of dimensional and aerodynamic limitations.
3. Image Formation Model
In this section, we derive an image formation model for a moving point-wise light source, considering a nonideal lens system. For simplicity, we develop the model in the one-dimensional case, the 2D case being an intuitive extension.
At any given moment, a point-wise light source projects to a location x (e.g., in pixels) on the imaging sensor. Suppose now that its apparent motion is uniform, with velocity v, and centered at the origin. If the lens system is ideal, the light energy at any continuous location x of the imaging sensor during the exposure time $T_e$ is given by

$$E_1(x) = \frac{c}{|v|}\,\Pi\!\left(\frac{x}{v\,T_e}\right), \qquad (1)$$

where $\Pi(\cdot)$ is the rectangular function, $T_e$ is the exposure time, and c is a proportionality constant. Note that $\int E_1(x)\,\mathrm{d}x = c\,T_e$, i.e., the total energy depends on the exposure time but not on the motion. The measured intensity of each pixel is typically proportional to the integral of the light energy over the pixel surface. In the case of a nonideal lens system, the actual energy distribution is obtained by convolving Equation (1) with the Point Spread Function (PSF) of the lens system. Here we assume that the PSF can be modeled with a zero-mean Gaussian density $g_\sigma(x)$ of standard deviation $\sigma$ [29]:

$$E_2(x) = (E_1 * g_\sigma)(x) = \frac{c}{|v|}\left[\Phi_\sigma\!\left(x + \frac{|v|\,T_e}{2}\right) - \Phi_\sigma\!\left(x - \frac{|v|\,T_e}{2}\right)\right], \qquad (2)$$

where $\Phi_\sigma$ is the cumulative distribution function of $g_\sigma$ and we have used the fact that the convolution of a rectangular function with a Gaussian density is a difference of Gaussian cumulative distribution functions. When the light source apparent velocity with respect to the camera is small, or when the exposure time is very short, Equation (2) can be approximated with

$$E_3(x) = c\,T_e\,g_{\tilde\sigma}(x), \qquad \tilde\sigma^2 = \sigma^2 + \frac{(v\,T_e)^2}{12}. \qquad (3)$$

In fact, it can be shown that $E_2(x) \to c\,T_e\,g_{\sigma}(x)$ and $\tilde\sigma \to \sigma$ when $|v|\,T_e \ll \sigma$, i.e., when the motion or the exposure time are small. In this limit, no approximation is introduced in Equation (3). When $|v|\,T_e$ is large compared to $\sigma$, this does not hold. In order to quantify the impact of this approximation, we first compare $E_2(x)$ and $E_3(x)$ in two cases. In the first case, shown in Figure 4a, the effects of motion blur are very well captured by Equation (3). In the second case, see Figure 4b, we consider a faster motion and a sharper lens system; here, the approximation becomes coarser. A numerical study of the difference between $E_2(x)$ and $E_3(x)$ as a function of $\sigma$ and $v\,T_e$ is shown in Figure 4c: with a reasonably sharp lens system, the approximation error introduced in Equation (3) remains below 10% for apparent motions of less than 5 pixels during one exposure.
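To give a concrete sense of the kind of comparison summarized in Figure 4, the following Python sketch numerically evaluates the exact blurred-rectangle profile of Equation (2) against its Gaussian approximation of Equation (3); the function names and the specific parameter values are our own illustrative choices, not those used for the figure.

```python
import numpy as np
from scipy.special import erf

def e_exact(x, v, t_e, sigma, c=1.0):
    """Rectangular motion profile convolved with a Gaussian PSF (Equation (2))."""
    phi = lambda u: 0.5 * (1.0 + erf(u / (np.sqrt(2.0) * sigma)))  # Gaussian CDF
    return c / abs(v) * (phi(x + v * t_e / 2.0) - phi(x - v * t_e / 2.0))

def e_gauss(x, v, t_e, sigma, c=1.0):
    """Moment-matched Gaussian approximation (Equation (3))."""
    s2 = sigma**2 + (v * t_e) ** 2 / 12.0
    return c * t_e * np.exp(-0.5 * x**2 / s2) / np.sqrt(2.0 * np.pi * s2)

x = np.linspace(-10.0, 10.0, 2001)                 # sensor coordinate, px
for blur_px, sigma in [(1.0, 1.5), (5.0, 0.8)]:    # apparent motion v*T_e and PSF width, px
    v, t_e = blur_px, 1.0                          # express the motion directly in px per exposure
    err = np.max(np.abs(e_exact(x, v, t_e, sigma) - e_gauss(x, v, t_e, sigma)))
    rel = err / np.max(e_exact(x, v, t_e, sigma))
    print(f"motion {blur_px:.1f} px, sigma {sigma:.1f} px -> max relative error {rel:.1%}")
```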
4. Photometric PnP
In the following, we employ the image formation model presented in the previous section to extend the classical iterative algorithms for the solution of the Perspective-n-Points problem [27]: if an image formation model is available for the image features associated with each known object point, the image position of such features can be determined jointly with the object pose by minimizing the photometric error with respect to the model. This is useful to better localize object points in the image when they consist of a limited number of pixels and may be blurred by motion or by a nonideal or imperfectly focused lens system.
We first review the conventional PnP algorithm. First, a set of n known 3D points in object space are matched with their corresponding projections in the image. In classical iterative algorithms, such as in the work by the authors of [30], or later extensions, the reprojection error is defined as the difference between the positions of the points measured in the image and the ones predicted based on the current estimate of the object pose. This error is minimized, in the least-squares sense, yielding the final object pose. More precisely, the predicted image coordinates of the i-th point, $\hat{\mathbf{p}}_i = (\hat u_i, \hat v_i)$, are given by the pinhole camera model:

$$\lambda_i \begin{bmatrix} \hat u_i \\ \hat v_i \\ 1 \end{bmatrix} = K \left( R_{CL}\,\mathbf{X}_i + \mathbf{t}_{CL} \right), \qquad (4)$$

where $\mathbf{X}_i$ are the object coordinates of the i-th point; $T_{CL} = (R_{CL}, \mathbf{t}_{CL})$ is the pose of the object reference frame, L, with respect to the camera, C; and K is the $3 \times 3$ intrinsic camera calibration matrix. As usual, the scale factor $\lambda_i$ is obtained from the third component of Equation (4) and eliminated. Lens distortion can be corrected with the well-known Brown model [31], which is omitted for brevity. Given the corresponding image observations for each point i, $\tilde{\mathbf{p}}_i$, and their a priori uncertainty, $\sigma_{px}$, the object pose can be determined by solving

$$\hat T_{CL} = \arg\min_{T_{CL}} \; \sum_{i=1}^{n} \frac{\left\| \tilde{\mathbf{p}}_i - \hat{\mathbf{p}}_i(T_{CL}) \right\|^2}{\sigma_{px}^2}. \qquad (5)$$
The nonlinear optimization problem in Equation (5) is typically solved by means of the Levenberg–Marquardt (LM) algorithm. An initial guess for the camera pose can be obtained by means of the direct linear transform [32,33]. Particular care needs to be taken in handling $T_{CL}$ during the optimization, as the rotation component, $R_{CL}$, belongs to SO(3), the special orthogonal group, i.e., the group of rotations in three dimensions, which is a non-Euclidean space. $R_{CL}$ is typically over-parameterized (e.g., 4D in the case of unit quaternions or 9D in the case of rotation matrices) and the constraints existing within the parameterization must be preserved during optimization. This problem is, however, well understood, and manifold-aware variations of LM have been proposed, e.g., see the work by the authors of [34].
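For reference, this conventional pipeline (DLT-like initialization followed by iterative LM refinement) is available off the shelf in OpenCV; the snippet below is a minimal sketch, with placeholder values for the LED coordinates, the measured image points, and the camera matrix.

```python
import numpy as np
import cv2

# Known 3D coordinates of the signalized points in the object (target) frame L, in meters.
object_points = np.array([[0.325, 0.0, 0.0], [-0.325, 0.0, 0.0],
                          [0.0, 0.28, 0.0], [0.0, -0.28, 0.05]], dtype=np.float64)
# Measured image coordinates of the same points, in pixels (placeholder values).
image_points = np.array([[1220.4, 980.1], [1180.7, 978.9],
                         [1200.2, 1005.6], [1201.0, 955.3]], dtype=np.float64)

K = np.array([[2320.0, 0.0, 1296.0],    # intrinsic calibration matrix (placeholder values)
              [0.0, 2320.0, 972.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                      # Brown distortion coefficients, zero if pre-corrected

# DLT-like initialization followed by iterative (LM) refinement of the pose T_CL.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R_CL, _ = cv2.Rodrigues(rvec)           # rotation of frame L expressed in the camera frame C
print("relative position of the target in the camera frame [m]:", tvec.ravel())
```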
The input image coordinates, $\tilde{\mathbf{p}}_i$, are measured separately for each point i in a prior image processing step, for example, by feature or template matching, corner detection, etc. The problem with this approach is that the accuracy of such image measurements is hardly below one pixel, especially in the presence of motion blur and/or when the image features related to the object points consist of just a few pixels (e.g., because the distance is large compared to the camera resolution). In such cases, it is also difficult to establish a consistent a priori $\sigma_{px}$.
In this work, we propose to determine the image coordinates of the object points jointly with the pose of L. Instead of minimizing the reprojection error, as defined in Equation (5), we minimize the photometric error with respect to an image formation model. In our case, each known point on the target is signalized with a small yet powerful light source. This allows us to use the image formation model derived in Section 3. We consider a small square patch of $w \times w$ pixels centered at an (integer) initial guess of the i-th light source position. The intensity of a pixel $(u, v)$ within this patch is given by

$$\hat I(u, v) = a\,\exp\!\left(-\frac{1}{2}\begin{bmatrix} u - u_i \\ v - v_i \end{bmatrix}^{\!\top} \Sigma^{-1} \begin{bmatrix} u - u_i \\ v - v_i \end{bmatrix}\right) + b, \qquad (6)$$

where $\mathbf{p}_i = (u_i, v_i)$ are the unknown image coordinates of the i-th point-wise light source. Equation (6) is the straightforward extension of Equation (3) to the two-dimensional case: a is an unknown proportionality constant, b is the background intensity, and $\Sigma$ is a positive definite matrix encoding the blur kernel. We assume that the target is small in image space, that the background is uniform, and that all the light sources have similar intensity. This means that a and b are unknown but common to each object point. The same holds for $\Sigma$, provided that no fast rotation occurs around the viewing ray through the target center.
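The patch model of Equation (6) is straightforward to evaluate numerically; the following sketch is one possible vectorized implementation, with the patch size and parameter values chosen only for illustration.

```python
import numpy as np

def patch_model(center_uv, a, b, Sigma, patch_origin, w=9):
    """Predicted intensities of a w x w patch for a point-wise light source (Equation (6)).

    center_uv   : (u_i, v_i), sub-pixel image coordinates of the light source
    a, b        : peak intensity above background, and background intensity
    Sigma       : 2x2 positive definite blur kernel (lens PSF plus motion blur)
    patch_origin: integer (u, v) of the patch's top-left pixel
    """
    u0, v0 = patch_origin
    uu, vv = np.meshgrid(np.arange(u0, u0 + w), np.arange(v0, v0 + w), indexing="xy")
    d = np.stack([uu - center_uv[0], vv - center_uv[1]], axis=-1)   # offsets to the source
    quad = np.einsum("...i,ij,...j->...", d, np.linalg.inv(Sigma), d)
    return a * np.exp(-0.5 * quad) + b

# Example: a slightly motion-blurred spot predicted in a 9 x 9 patch.
Sigma = np.array([[2.5, 0.8], [0.8, 1.2]])
patch = patch_model((104.3, 87.6), a=180.0, b=20.0, Sigma=Sigma, patch_origin=(100, 83))
print(patch.shape, patch.max().round(1))
```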
The target pose is determined along with the image coordinates of the points and the unknown parameters of the image formation model as

$$\min_{T_{CL},\,\mathbf{p}_i,\,a,\,b,\,\Sigma} \;\; \sum_{i=1}^{n} \left[ \frac{\left\| \mathbf{p}_i - \hat{\mathbf{p}}_i(T_{CL}) \right\|^2}{\sigma_{px}^2} \; + \sum_{(u,v)\,\in\,\mathcal{P}_i} \frac{\left( I(u,v) - \hat I(u,v;\,\mathbf{p}_i,a,b,\Sigma) \right)^2}{\sigma_{I}^2} \right], \qquad (7)$$

where $\hat{\mathbf{p}}_i$ and $\hat I(u,v)$ are as defined in Equations (4) and (6), $I(u,v)$ is the pixel intensity as measured in the image, $\mathcal{P}_i$ is the patch around the i-th point, and $\sigma_I$ is the a priori uncertainty of the intensity measurements. The geometric error component is equivalent to Equation (5): it is defined in terms of the unknown image positions of the points (and not with respect to image measurements) and constrains them to be consistent with the three-dimensional structure of the light source array. Indeed, $\sigma_{px}$ no longer corresponds to the a priori uncertainty of the image measurements: it acts as a weight between the geometric and the photometric error components and accounts for uncertainties in the 3D positions of the object points and for the imaging system not perfectly satisfying the pinhole camera model.
The details of the solution of the optimization problem in Equation (7) are analogous to those of conventional PnP. However, further parameters have to be initialized: initial guesses for a and b can be obtained by averaging the central and the corner pixels of each image patch, respectively, whereas $\Sigma$ can be set to the identity matrix. In continuous operation, the values obtained for the last processed frames are used.
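A simple way to prototype the joint adjustment of Equation (7) is to stack the photometric and geometric residuals and hand them to a generic least-squares solver, as in the sketch below. This is only an illustrative implementation under simplifying assumptions: the rotation is parameterized with a Rodrigues vector rather than a manifold-aware update, the blur kernel is parameterized through its Cholesky factor to keep it positive definite, and all names and default weights are ours.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def unpack(x, n):
    """Split the parameter vector: pose, image coordinates, a, b, Cholesky factor of Sigma."""
    rvec, tvec = x[0:3], x[3:6]
    pts_uv = x[6:6 + 2 * n].reshape(n, 2)
    a, b = x[6 + 2 * n], x[7 + 2 * n]
    L = np.array([[x[-3], 0.0], [x[-2], x[-1]]])
    return rvec, tvec, pts_uv, a, b, L @ L.T

def residuals(x, obj_pts, patches, patch_origins, K, sigma_px, sigma_I, w):
    n = len(obj_pts)
    rvec, tvec, pts_uv, a, b, Sigma = unpack(x, n)
    proj, _ = cv2.projectPoints(obj_pts, rvec, tvec, K, np.zeros(5))   # Equation (4)
    geo = (pts_uv - proj.reshape(n, 2)).ravel() / sigma_px             # geometric term of Eq. (7)
    Sinv = np.linalg.inv(Sigma)
    photo = []
    for i in range(n):                                                 # photometric term of Eq. (7)
        u0, v0 = patch_origins[i]
        uu, vv = np.meshgrid(np.arange(u0, u0 + w), np.arange(v0, v0 + w), indexing="xy")
        d = np.stack([uu - pts_uv[i, 0], vv - pts_uv[i, 1]], axis=-1)
        quad = np.einsum("...i,ij,...j->...", d, Sinv, d)
        pred = a * np.exp(-0.5 * quad) + b                             # Equation (6)
        photo.append((patches[i] - pred).ravel() / sigma_I)
    return np.concatenate([geo] + photo)

def photometric_pnp(obj_pts, patches, patch_origins, rvec0, tvec0, uv0, a0, b0, K,
                    sigma_px=1.0, sigma_I=30.0, w=9):
    """Joint pose, image coordinate, and photometric parameter estimation.

    The default sigma_px / sigma_I ratio reflects the empirical weighting discussed later
    (a residual of 30 intensity units weighs like 1 pixel in the geometric constraint).
    """
    x0 = np.concatenate([rvec0, tvec0, uv0.ravel(), [a0, b0], [1.0, 0.0, 1.0]])  # Sigma ~ identity
    sol = least_squares(residuals, x0, method="lm",
                        args=(obj_pts, patches, patch_origins, K, sigma_px, sigma_I, w))
    return unpack(sol.x, len(obj_pts)), sol
```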
In Figure 5, we depict the results based on two real images, with emphasis on the shape of the determined blur kernel; a comparison of the predicted versus actual image intensities is shown in Figure 6.
5. Target Detection
To provide the inputs for the solution of the Photometric PnP problem, we first need to locate the relevant image features in the image, and then to establish the association of each image feature with the corresponding object point. In this work, we solve this nontrivial segmentation task by relying on specific aspects of the application at hand: the object points are signalized with point-wise light sources and are imaged from medium to large distances with a monochrome camera. Moreover, the tilt between the target and the camera is small, as both UAVs are hovering or gently moving in formation flight. Our solution first generates a set of promising image locations, then reduces this set based on the predicted appearance of a point-wise light source in the image, and finally establishes the association between image features and object points relying on the a priori knowledge of the geometry of the target. The details are given in the following.
Since the point-wise light sources appear as bright spots in the image, provided that a proper exposure time is set (see Section 6), the first step is to threshold the image and to cluster connected bright pixels together. However, other objects in the scene may also appear bright in the image, such as highly reflective or white surfaces. To eliminate most of these candidates, we rely on the photometric model developed in Section 4: from the latest estimates of the photometric model parameters (or from default assignments, for the first frame), we can generate the expected appearance of a light source in terms of a $w \times w$ patch, see Figure 6, middle row. Indeed, the parameters a and $\Sigma$ encode the shape and the size that a bright spot should have in order to correspond to one of the lights signalizing the target. All the clusters found are ranked according to their similarity with respect to the generated patch (in terms of squared intensity difference) and only the K most similar ones are kept. Here, K is a parameter of the algorithm, chosen to be 5–10 times the number of object points N.
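A possible implementation of this candidate generation step, using OpenCV connected components, is sketched below; the threshold, patch size, and K are illustrative values, and the reference template is assumed to be generated from the current photometric model parameters (e.g., with the patch model sketched in Section 4).

```python
import numpy as np
import cv2

def detect_candidates(img, template, threshold=200, K=40, w=9):
    """Find bright clusters and keep the K most similar to the expected LED patch."""
    _, mask = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
    n_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
    scored = []
    h = w // 2
    for lbl in range(1, n_labels):                      # label 0 is the background
        cu, cv_ = centroids[lbl]
        u0, v0 = int(round(cu)) - h, int(round(cv_)) - h
        if u0 < 0 or v0 < 0 or u0 + w > img.shape[1] or v0 + w > img.shape[0]:
            continue                                     # skip clusters too close to the border
        patch = img[v0:v0 + w, u0:u0 + w].astype(np.float64)
        score = np.sum((patch - template) ** 2)          # squared intensity difference
        scored.append((score, (cu, cv_)))
    scored.sort(key=lambda s: s[0])
    return [c for _, c in scored[:K]]                    # K best candidate locations
```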
In the last step, we introduce the a priori knowledge of the shape of the target to exclude the remaining outliers and to establish the association between each image feature and the corresponding object point. For each pair of candidate image features, $f_a$ and $f_b$, we pretend that these correspond to object points 1 and 2. Based on this assumption, we compute the 2D transformation T (translation, rotation, and scale) that maps the image coordinates of $f_a$ and $f_b$ to the object coordinates of points 1 and 2. Such a transformation is unique and can be computed in closed form. Then, for each remaining image feature, $f_c$, we apply T to obtain its expected object coordinates. If the association $f_a \to 1$, $f_b \to 2$ is correct, we expect to find an object point approximately at those coordinates, which we check by iterating through all the remaining object points. If we find an element for which the discrepancy is below a given threshold, we consider this a match. The pair that scores the highest number of matches, plus the matches themselves, gives us the searched association between image features and object points. Note that we have assumed that a translation, rotation, and scale is sufficient to approximately map image to object coordinates. In general, a homography would be needed instead if the object points are coplanar. However, in the application at hand, the relative tilt between the two UAVs, and thus between the camera and the target, is small, making the perspective effect negligible.
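The pairwise exhaustive search can be sketched as follows. The closed-form similarity estimate treats 2D points as complex numbers; the distance threshold (in object units) and all function names are our own illustrative choices.

```python
import numpy as np
from itertools import permutations

def similarity_from_two_points(p1, p2, q1, q2):
    """Closed-form 2D similarity (scale, rotation, translation) mapping p1->q1 and p2->q2."""
    zp1, zp2 = complex(*p1), complex(*p2)
    zq1, zq2 = complex(*q1), complex(*q2)
    s = (zq2 - zq1) / (zp2 - zp1)          # combined scale and rotation
    t = zq1 - s * zp1                      # translation
    return lambda p: s * complex(*p) + t

def associate(features, object_pts, tol=0.05):
    """Return the best feature -> object point association found by exhaustive pair search."""
    best = []
    obj = [complex(*q) for q in object_pts]
    for ia, ib in permutations(range(len(features)), 2):
        T = similarity_from_two_points(features[ia], features[ib],
                                       object_pts[0], object_pts[1])
        matches, used = [(ia, 0), (ib, 1)], {0, 1}
        for ic in range(len(features)):
            if ic in (ia, ib):
                continue
            z = T(features[ic])            # expected object coordinates of feature ic
            dists = [abs(z - q) if j not in used else np.inf for j, q in enumerate(obj)]
            j = int(np.argmin(dists))
            if dists[j] < tol:             # an object point lies close enough: a match
                matches.append((ic, j))
                used.add(j)
        if len(matches) > len(best):
            best = matches
    return best                            # list of (feature index, object point index)
```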
The last step of the image segmentation algorithm performs an exhaustive search through all the possible image feature to object point associations. Indeed, there is no apparent difference (e.g., in color or shape) between the light sources signalizing the object points, and we cannot rely on other distinctive image features to identify the location of the target with high confidence, as instead happens with most planar targets, e.g., a black and white surface with an encoded identifier, as in the work by the authors of [24]. Moreover, one or more of the features corresponding to object points may fail to rank among the best K ones according to the similarity to the reference image patch. The presented algorithm can tolerate some missing points, even though the more points are missing, the higher the chance of a wrong detection. The complexity of the algorithm is polynomial in K, the number of image features kept after the photometric test, and in N, the number of object points. This moderately high complexity is bearable by a small embedded computer in real-time, provided that K is kept relatively small. This is desirable also because a weaker photometric filter, and thus more candidate bright spots, increases the possibility of a wrong match: for instance, a set of bright stones on the ground could lie in a configuration comparable to the one of the object points on the target. This problem can be mitigated by increasing the number of object points. However, their density in image space must be kept low over the whole range of operating distances, otherwise multiple points may lie within the same image patch considered for the photometric PnP adjustment, violating the model assumptions.
6. Automated Exposure Adjustment
In Equations (3) and (6), we introduced an image formation model for point-wise light sources that accounts for motion blur and for a Gaussian lens point spread function. However, in practice, the dynamic range of the imaging sensor is limited: saturated pixels are outliers with respect to that model and cannot be employed in the least-squares adjustment. This means that the exposure time, $T_e$, has to be adapted to maintain a sufficient apparent brightness of the light sources while limiting pixel saturation. In the following, we discuss how an optimal exposure time can be computed in real-time based on the current estimates of the parameters of the image formation model.
It is well known that the light intensity per unit of surface decreases with the square of the distance from the source. Additionally, the light sources are generally directional, so that the emitted power decreases drastically as the looking angle increases. The total light energy E reaching the sensor during the exposure time $T_e$ can be modeled with

$$E = \frac{k\,T_e\,\gamma(\cdot)}{d^2}, \qquad (8)$$

where d is the distance from the target and $\gamma(\cdot)$ is a function modeling the non-uniform light emission, which is thus dependent on the relative camera pose with respect to the target. k is an unknown proportionality coefficient depending on the intensity of the light sources. After the light source array has been successfully detected and measured in an image, an estimate of k can be obtained from the final a, b, and $\Sigma$, as in Equation (7). The total energy of one light source above the background is

$$\hat E = \frac{1}{\eta} \sum_{(u,v)} \left( \hat I(u,v) - b \right) \approx \frac{2\pi\, a \sqrt{\det \Sigma}}{\eta},$$

where $\hat I(u,v)$ is as in Equation (6) and $\eta$ is a coefficient depending on the camera gain and quantum efficiency. Imposing $\gamma(\cdot) = 1$ and $\eta = 1$ (arbitrarily), we obtain

$$\hat k = \frac{2\pi\, a \sqrt{\det \Sigma}\; d^2}{T_e}.$$
We continuously determine k over multiple frames with an exponential moving average filter, so that local unmodeled effects are accounted for. Once k has been determined with sufficient accuracy, Equation (8) can be solved for $T_e$, using the last known parameters, to determine the exposure time needed to achieve the desired E. The desired value of E is chosen such that the number of saturated pixels is minimal, based on an average $\Sigma$. Indeed, for the same E, the peak intensity is a function of $\Sigma$, as in the case of motion the same light is spread over multiple pixels. While the relative camera pose typically changes slowly, the blur kernel $\Sigma$ does not, as it is highly dependent on the angular velocity of the camera and on vibrations.
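A compact real-time implementation of this loop could look like the sketch below; it follows Equation (8) and the estimate of k derived above under the simplifying assumptions γ(·) = η = 1, with the smoothing factor and the desired energy chosen arbitrarily for illustration.

```python
import numpy as np

class ExposureController:
    """Track the proportionality coefficient k and compute the next exposure time."""

    def __init__(self, alpha=0.2, e_desired=3000.0):
        self.alpha = alpha        # exponential moving average smoothing factor (illustrative)
        self.e_des = e_desired    # desired total light energy per source (arbitrary units)
        self.k = None             # current estimate of the proportionality coefficient

    def update(self, a, Sigma, distance, t_exp):
        """Update k from the latest photometric fit and return the next exposure time."""
        energy = 2.0 * np.pi * a * np.sqrt(np.linalg.det(Sigma))  # measured energy above background
        k_meas = energy * distance**2 / t_exp                     # Equation (8) solved for k
        self.k = k_meas if self.k is None else (
            (1.0 - self.alpha) * self.k + self.alpha * k_meas)    # EMA over multiple frames
        return self.e_des * distance**2 / self.k                  # Equation (8) solved for T_e

# Example: after a successful photometric PnP fit at 40 m with a 2 ms exposure.
ctrl = ExposureController()
Sigma = np.array([[2.0, 0.3], [0.3, 1.5]])
t_next = ctrl.update(a=180.0, Sigma=Sigma, distance=40.0, t_exp=0.002)
print(f"next exposure time: {t_next * 1e3:.2f} ms")
```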
Clearly, for a given k (which is related to the intensity of the light sources), there exists a distance for which the computed exposure time becomes excessive. The background features (e.g., the terrain) reflect the sunlight, and the more $T_e$ is increased, the more the highly reflective points in the background will look like the target light sources in the image. This can be handled, up to a certain density of candidates, by means of a clever image segmentation algorithm which takes into account the geometry of the array to exclude possible outliers. However, any such algorithm will break above a certain distance.
7. Experimental Evaluation
We mark each tip of the arms of a hexacopter, plus two other points on the fuselage, with high-power LEDs, as in Figure 3. The maximum distance between two LEDs is 65 cm. We place the copter on the ground and fly a second one, equipped with a nadir-looking camera, above the first. The flight pattern has a “butterfly” shape (⋈), is centered at the target position, and enlarges with the elevation (which tops out at 100 m). The camera has a resolution of 5 MP, with a pixel size of a few μm, and the focal length of the lens is 8 mm, so that 1 px corresponds to a few centimeters at 100 m distance. The flight was performed under sunny conditions.
We first evaluate the impact of the exposure adjustment algorithm: $T_e$ needs to be controlled in real-time as a function of the current estimates of k, $\Sigma$, a, and b to achieve an optimal LED intensity in the image. In Figure 7, we show that as soon as the exposure adjustment algorithm is engaged, E, as defined in Equation (8), remains constant regardless of the camera pose: the farther we move from the light sources, the higher the exposure time has to be set. In the considered range of distances, the background is still substantially less bright than the light sources, which ensures that the thresholding and clustering algorithm works properly.
Next, we discuss the precision of the determined image coordinates. In Photometric PnP, the image coordinates $\mathbf{p}_i$ are explicit unknowns in the least-squares adjustment, so we can evaluate their a posteriori uncertainty:

$$C_{\hat{\mathbf{x}}} = \hat\sigma_0^2 \left( J^\top P\, J \right)^{-1}, \qquad \hat\sigma_0^2 = \frac{\mathbf{e}^\top P\,\mathbf{e}}{R},$$

where $\mathbf{e}$ is the residuals vector, R is the problem redundancy, i.e., the number of observations minus the degrees of freedom, P is the weight matrix built from the a priori measurement uncertainties, and J is the Jacobian of the residuals with respect to the unknowns.
The a priori measurement uncertainties include the relative weight between the photometric and the geometric constraints (see again Equation (7)); this weight was determined empirically such that a residual of 30 units in pixel intensity has the same weight as a residual of 1 pixel in the geometric constraint. As a comparison, we run a classical iterative PnP algorithm (we used the well-known implementation available in OpenCV), employing the centroids of the bright clusters as image measurements. In PnP, only $T_{CL}$ is estimated, so we determine the a posteriori uncertainty of the image measurements by means of covariance propagation. Note that for PnP we do not need to specify an a priori uncertainty for the image observations (which would be arbitrary), as it cancels out and the a posteriori covariance does not depend on it. The results show that Photometric PnP allows the location of the light sources to be determined with better, sub-pixel precision compared with classical methods, as shown in Figure 8.
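With a solver such as the one sketched earlier for the photometric PnP prototype, the a posteriori uncertainty of the estimated image coordinates can be extracted directly from the returned residuals and Jacobian; the snippet below assumes the unknown ordering used in that prototype (pose first, then the 2n image coordinates).

```python
import numpy as np

def a_posteriori_std(sol, n_points):
    """A posteriori standard deviations of the estimated image coordinates.

    sol      : result of scipy.optimize.least_squares on the weighted residuals
    n_points : number of object points (their image coordinates are unknowns 6 .. 6 + 2n - 1)
    """
    e, J = sol.fun, sol.jac                       # weighted residuals and their Jacobian
    redundancy = e.size - sol.x.size              # observations minus degrees of freedom
    sigma0_sq = (e @ e) / redundancy              # a posteriori variance factor
    cov = sigma0_sq * np.linalg.inv(J.T @ J)      # covariance of all unknowns
    idx = slice(6, 6 + 2 * n_points)              # entries holding the image coordinates
    return np.sqrt(np.diag(cov)[idx]).reshape(n_points, 2)
```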
More accurate image coordinates translate into a better determination of the relative target position with respect to the camera, $\mathbf{t}_{CL}$. We show this by comparing the position results with those of a classical PnP algorithm. The reference for this comparison is obtained as follows. We exploit the fact that the target is static and we orient all the images with the well-known bundle adjustment software Pix4D Mapper, with the scale being fixed by camera position priors from a GPS receiver and by ground control points. Such an adjustment yields the reference camera poses and the target position with respect to a global frame W, out of which the reference $\mathbf{t}_{CL}$ is computed. As hundreds of images are adjusted together, this reference is more precise than the per-frame estimates of $\mathbf{t}_{CL}$, which are instead determined from a single frame only. To eliminate one possible source of bias in the results, we use the same camera calibration in all the experiments. The results of the comparison are reported in Table 1. It is possible to see that the results are practically unbiased, except for the Z component, and that the photometric PnP algorithm outperforms classical PnP in all the statistics. Notably, we reduce the standard deviation of the error by a factor of two in the Z component, which is the most sensitive to the accuracy of the image measurements. An equivalent comparison for the orientation estimates, $R_{CL}$, is not reported here, as all planar targets suffer from pose ambiguity, meaning that in certain circumstances two orientations of the target would produce the same image projection of the known points [35], which complicates the comparison. Nevertheless, both algorithms determine roll, pitch, and yaw with small standard deviations.