Article

LandNet: Combine CNN and Transformer to Learn Absolute Camera Pose for the Fixed-Wing Aircraft Approach and Landing

Siyuan Shen, Guanfeng Yu, Lei Zhang, Youyu Yan and Zhengjun Zhai
1 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
2 AVIC Xi’an Aeronautics Computing Technique Research Institute, Xi’an 710068, China
3 School of Software, Northwestern Polytechnical University, Xi’an 710072, China
4 School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(4), 653; https://doi.org/10.3390/rs17040653
Submission received: 2 December 2024 / Revised: 27 January 2025 / Accepted: 12 February 2025 / Published: 14 February 2025

Abstract

Camera localization approaches often degrade in challenging environments characterized by illumination variations and significant viewpoint changes, presenting critical limitations for fixed-wing aircraft landing applications. To address these challenges, we propose LandNet, a novel absolute camera pose estimation network specifically designed for airborne scenarios. Our framework processes images from a forward-looking aircraft camera to directly predict the 6-DoF camera pose, which subsequently enables aircraft pose determination through a rigid transformation. As a first step, we design two encoders, one built on a Transformer and one on CNNs, to capture complementary spatial and temporal features. Furthermore, a novel Feature Interactive Block (FIB) is employed to fully utilize spatial clues from the CNN encoder and temporal clues from the Transformer encoder. We also introduce a novel Attentional ConvTrans Fusion Block (ACFB) to fuse the feature maps from the CNN encoder and the Transformer encoder, which enhances the image representations and thereby improves the accuracy of the estimated camera pose. Finally, two Multi-Layer Perceptron (MLP) heads estimate the camera position and orientation, respectively. The estimated camera pose can then be used to recover the position and orientation of the aircraft through the rigid connection between the airborne camera and the aircraft. Experimental results on simulation and real flight data demonstrate the effectiveness of the proposed method.

1. Introduction

Fixed-wing aircraft landing is one of the most important and dangerous procedures during the whole flight. Therefore, it is necessary to accurately estimate the aircraft’s 6-DoF (Degrees of Freedom) pose during this phase. The Global Navigation Satellite System (GNSS) and the Inertial Navigation System (INS) are two classical measuring instruments used to estimate the 6-DoF pose of the aircraft.
In recent years, vision navigation research and applications have seen remarkable success. A review of the literature reveals two primary approaches based on the type of visual information used: motion estimation in unknown environments and motion estimation using known structured ground features. Motion estimation in unknown environments typically involves estimating the system’s pose by integrating the motion between consecutive images with inertial data. For example, Reference [1] extracted surrounding features in the current frame and matched them with features in a pre-built visual map; the authors also developed an invariant Extended Kalman Filter (EKF) for map-based visual–inertial navigation. Gallo and Barrientos [2] presented a long-distance Global Navigation Satellite System (GNSS)-denied visual–inertial navigation system, which applied Visual Odometry (VO) and inertial data to achieve accurate position estimation. Lee et al. [3] identified features or points of interest in the surroundings and incorporated the unknown time delays into the Visual–Inertial Odometry (VIO) system using an Extended Kalman Filter (EKF). The research in [4] applied the SR-UKF [5] to fuse Direct Sparse Odometry (DSO) [6] and an inertial measurement unit to estimate the motion of the aircraft in the landing phase. The works presented in [7,8,9] detected features in the surrounding environment and tracked them using the Lucas–Kanade method, generating keyframe-based estimates of pose changes. Their back-end systems then fused visual and inertial measurements through either filtering-based or optimization-based Visual–Inertial Odometry (VIO). Unlike the feature-based approaches mentioned above, the authors of [10,11,12] minimized a pose graph that incorporated photometric reprojection errors, IMU pre-integration errors, and relative pose errors to estimate motion, without the need for feature detection. However, these visual–inertial Simultaneous Localization and Mapping (VI-SLAM) or VIO methods operate within a local frame, making accumulation errors, or drift, inevitable in motion estimation. As a result, their navigation accuracy does not meet the requirements for the approach and landing of fixed-wing aircraft.
Motion estimation based on known structured features leverages observed characteristics with known geometric or geographic information, resulting in higher-precision, drift-free pose estimation. For example, the authors of [13] presented a neural mapping method for Unmanned Aerial Vehicle (UAV) landing. They proposed a method to estimate the UAV pose based on the runway’s line features, using a flexible coarse-to-fine runway-line-detection method; the UAV’s pose was then acquired through a neural radiance field (NeRF). The articles [14,15] employed manually designed, structured landmarks for the correction of VIO; by leveraging the global geographic information of these landmarks, they enhanced the accuracy of navigation systems. Wang et al. [16] adopted Fast R-CNN [17] to detect the runway region in the images and then calculated the position and attitude from the detection results. Yu et al. [18,19] designed a visual observation function based on runway corners with prior geographic information for visual–inertial combined navigation. The work in [20] applied optical flow and the horizon line to estimate the attitude of a fixed-wing aircraft in the landing phase. Grof et al. [21] introduced an estimation method for runway-relative positioning with IMU–camera data fusion. Shang et al. [22] determined the relative position/attitude parameters from the object–image conjugate relationship of the runway sideline and fused inertial information to improve the measurements.
Camera localization refers to the process of estimating the absolute camera pose (position and orientation) from an image or a sequence of images. Current image-based localization methods can be broadly categorized into three main approaches: structure-based localization, retrieval-based methods, and deep regression-based techniques:
  • Structure-Based Localization Methods: These methods leverage the 3D structure of the scene and camera geometry [23]. They establish correspondences between pixels in the image and 3D scene points by matching descriptors extracted from the test image with descriptors associated with the 3D points [24,25]. The camera pose is then calculated using algorithms such as Random Sample Consensus (RANSAC) and Perspective-n-Points (PnP) [26]. However, these methods struggle with thermal images due to their low resolution and uneven noise, which significantly affects performance.
  • Image Retrieval-Based Methods: These methods estimate the camera pose by matching a query image with geo-tagged images from a pre-built database. While effective in some cases, these approaches have high storage requirements and can produce inaccurate results due to the high similarity of thermal images, which makes it challenging to distinguish between different scenes.
  • Deep Regression-Based Methods: These methods utilize Convolutional Neural Networks (CNNs) to learn a mapping between input images and their corresponding poses. Early work such as PoseNet [27] regressed camera poses from a single image by adding a multi-layer perceptron (MLP) head to a GoogLeNet backbone. Subsequent methods, like PoseGAN [28], improved on this by incorporating geometric structures to enhance pose estimation. MS-Trans [29] introduced transformers with a complete encoder–decoder structure to regress camera poses across multiple scenes. TransBoNet [29] employed a transformer bottleneck to estimate the absolute camera pose. MapNet [30] incorporated relative geometric constraints to improve pose estimation, though this differs from our approach in several key aspects. ALNet [31] utilized a local discrepancy perception module and an adaptive channel attention module to refine pose estimation. Shavit et al. [32] employed a transformer encoder for camera pose regression, while MambaLoc [32] introduced a selective state-space model for visual localization.
Deep regression-based methods have also been applied to localization for flying robots. For instance, RCPNet [33] learned the relative pose for UAVs, while Baldini [34] integrated visual and inertial data to estimate UAV poses. UAVPNet [35] adopted CNNs for object detection and pose regression in UAVs.
In this paper, we propose a novel end-to-end Absolute Camera Pose Regression network named LandNet for predicting the 6-DOF camera pose in the fixed-wing aircraft landing procedure. Furthermore, the predicted camera pose can be used to compute the aircraft’s pose through a rigid connection between the airborne camera and the aircraft. The main contributions of our method are summarized as follows:
  • We propose LandNet, a novel hybrid architecture combining Vision Transformer (ViT) and Convolutional Neural Networks (CNN) for 6-DoF camera relocalization using single images. LandNet is designed to efficiently generalize to large-scale environments, providing robustness for fixed-wing aircraft landing applications.
  • We propose a Feature Interaction Block that fully leverages both spatial and temporal information to enhance image representations for absolute pose regression.
  • An Attentional ConvTrans Fusion Block is designed to effectively integrate multi-scale and multi-level information to improve camera prediction.
  • We evaluate the proposed LandNet using real flight data. The experimental results show that the method performs effectively in fixed-wing aircraft landing scenarios, especially in terms of the accuracy of the predicted orientation.
The remainder of this paper is organized as follows: Section 2 discusses the preliminary knowledge of the coordinate definitions. The methodology is presented in Section 3. Section 4 describes the experimental details, including flight data acquisition, training, and testing. Section 5 concludes the paper.

2. Preliminary Knowledge

2.1. Attitude Transformation

2.1.1. From Quaternion to Euler Angle

A Hamilton quaternion and its unit norm are defined as follows:
$$ q = w + x\mathbf{i} + y\mathbf{j} + z\mathbf{k} $$
$$ \|q\| = \sqrt{w^{2} + x^{2} + y^{2} + z^{2}} = 1 $$
The Euler angle can be computed from the quaternion, which is expressed as follows:
$$ \begin{bmatrix} \xi \\ \varphi \\ \theta \end{bmatrix} = \begin{bmatrix} \arctan\dfrac{2(wx + yz)}{1 - 2(x^{2} + y^{2})} \\[4pt] \arcsin\!\big(2(wy - zx)\big) \\[4pt] \arctan\dfrac{2(wz + xy)}{1 - 2(y^{2} + z^{2})} \end{bmatrix} $$
where ξ , φ , and θ represent roll angle, pitch angle, and yaw angle, respectively.

2.1.2. From Euler Angle to Quaternion

Given a pitch angle φ, roll angle ξ, and yaw angle θ, the quaternion is computed as follows:
$$ \begin{bmatrix} w \\ x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \cos\frac{\varphi}{2}\cos\frac{\xi}{2}\cos\frac{\theta}{2} + \sin\frac{\varphi}{2}\sin\frac{\xi}{2}\sin\frac{\theta}{2} \\ \sin\frac{\varphi}{2}\cos\frac{\xi}{2}\cos\frac{\theta}{2} - \cos\frac{\varphi}{2}\sin\frac{\xi}{2}\sin\frac{\theta}{2} \\ \cos\frac{\varphi}{2}\sin\frac{\xi}{2}\cos\frac{\theta}{2} + \sin\frac{\varphi}{2}\cos\frac{\xi}{2}\sin\frac{\theta}{2} \\ \cos\frac{\varphi}{2}\cos\frac{\xi}{2}\sin\frac{\theta}{2} - \sin\frac{\varphi}{2}\sin\frac{\xi}{2}\cos\frac{\theta}{2} \end{bmatrix} $$
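For reference, the two conversions above can be written compactly in code. The following is a minimal NumPy sketch assuming the common ZYX (yaw–pitch–roll) rotation order; the function names are illustrative, and the clipping of the arcsine argument is added only to guard against numerical round-off.

```python
import numpy as np

def quaternion_to_euler(w, x, y, z):
    """Unit Hamilton quaternion -> (roll, pitch, yaw) in radians, cf. Equation (3)."""
    roll = np.arctan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2.0 * (w * y - z * x), -1.0, 1.0))  # clip guards round-off
    yaw = np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw

def euler_to_quaternion(roll, pitch, yaw):
    """(roll, pitch, yaw) in radians -> unit quaternion (w, x, y, z), cf. Equation (4)."""
    cr, sr = np.cos(roll / 2.0), np.sin(roll / 2.0)
    cp, sp = np.cos(pitch / 2.0), np.sin(pitch / 2.0)
    cy, sy = np.cos(yaw / 2.0), np.sin(yaw / 2.0)
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return w, x, y, z
```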

2.2. Coordinates Definitions

As shown in Figure 1, the following frames are used in our methodology and experiments: the world coordinate frame, the World Geodetic System 1984 (WGS84) frame { G }, the Earth-Centered, Earth-Fixed frame (ECEF) { E }, the navigation frame (East–North–Up) { N }, the body frame (Right–Forward–Down) { B }, and the camera frame { C }. The definition of the world coordinate frame used in this study and the transformation matrices between the different frames are presented in the following subsections.

2.2.1. WGS84 to ECEF

Given a point A in the WGS84 frame, expressed as $[\,\mathrm{lat}\;\;\mathrm{lon}\;\;\mathrm{alt}\,]^{T}$, its position in the ECEF frame is obtained using the following equations:
$$ \begin{aligned} X_{ecef} &= (N + \mathrm{alt})\cos(\mathrm{lat})\cos(\mathrm{lon}) \\ Y_{ecef} &= (N + \mathrm{alt})\cos(\mathrm{lat})\sin(\mathrm{lon}) \\ Z_{ecef} &= \big(N(1 - f)^{2} + \mathrm{alt}\big)\sin(\mathrm{lat}) \end{aligned} $$
$$ N = \frac{a}{\sqrt{1 - f(2 - f)\sin^{2}(\mathrm{lat})}} $$
where lat, lon, and alt represent the latitude, longitude, and altitude of point A, respectively; a is the equatorial radius; and f is the flattening of the Earth.
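A minimal NumPy sketch of Equations (5) and (6) is given below; the WGS84 constants are the standard values and are not taken from the paper.

```python
import numpy as np

# Standard WGS84 constants (assumed; not given explicitly in the paper)
WGS84_A = 6378137.0              # equatorial radius a [m]
WGS84_F = 1.0 / 298.257223563    # flattening f

def wgs84_to_ecef(lat_deg, lon_deg, alt_m):
    """Convert geodetic (lat, lon, alt) to ECEF coordinates, cf. Equations (5)-(6)."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    # prime-vertical radius of curvature N
    n = WGS84_A / np.sqrt(1.0 - WGS84_F * (2.0 - WGS84_F) * np.sin(lat) ** 2)
    x = (n + alt_m) * np.cos(lat) * np.cos(lon)
    y = (n + alt_m) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - WGS84_F) ** 2 + alt_m) * np.sin(lat)
    return np.array([x, y, z])
```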

2.2.2. ECEF to ENU

As shown in Figure 2, the transformation from the ECEF frame to the ENU frame is performed with respect to a reference point. The transformation from the reference point $(X_r, Y_r, Z_r)$, with longitude λ and latitude φ, to an estimated position $(X_m, Y_m, Z_m)$ expressed in the ENU frame is calculated as follows:
$$ \begin{bmatrix} X_E \\ Y_N \\ Z_U \end{bmatrix} = \begin{bmatrix} -\sin\lambda & \cos\lambda & 0 \\ -\sin\varphi\cos\lambda & -\sin\varphi\sin\lambda & \cos\varphi \\ \cos\varphi\cos\lambda & \cos\varphi\sin\lambda & \sin\varphi \end{bmatrix} \begin{bmatrix} X_m - X_r \\ Y_m - Y_r \\ Z_m - Z_r \end{bmatrix} $$
where λ is the longitude of the reference point; φ is the latitude of the reference point.
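The following is a minimal sketch of Equation (7); the signs of the rotation entries follow the standard ECEF-to-ENU matrix.

```python
import numpy as np

def ecef_to_enu(p_ecef, ref_ecef, ref_lat_deg, ref_lon_deg):
    """Express an ECEF point in the local ENU frame of a reference point, cf. Equation (7)."""
    lam = np.radians(ref_lon_deg)   # longitude of the reference point
    phi = np.radians(ref_lat_deg)   # latitude of the reference point
    r = np.array([
        [-np.sin(lam),                np.cos(lam),                 0.0],
        [-np.sin(phi) * np.cos(lam), -np.sin(phi) * np.sin(lam),  np.cos(phi)],
        [ np.cos(phi) * np.cos(lam),  np.cos(phi) * np.sin(lam),  np.sin(phi)],
    ])
    return r @ (np.asarray(p_ecef) - np.asarray(ref_ecef))
```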

2.2.3. ENU to Body

As shown in Figure 1, the navigation frame and the body frame are defined using the East–North–Up (ENU) and Forward–Right–Up (FRU) coordinates; to simplify the discussion, we omit the translation between the two frames. The rotation matrix $C_n^b$ between the ENU frame and the body frame is expressed as follows:
$$ C_n^b = \begin{bmatrix} \cos\xi & 0 & -\sin\xi \\ 0 & 1 & 0 \\ \sin\xi & 0 & \cos\xi \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\varphi & \sin\varphi \\ 0 & -\sin\varphi & \cos\varphi \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} $$
where ξ, φ, and θ represent the roll angle, pitch angle, and yaw angle, respectively. This transformation is shown in Figure 3.

2.2.4. Body to Camera

As shown in Figure 1, the rotation matrix $C_b^c$ from the body frame to the camera frame is calculated as follows:
$$ C_b^c = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} $$

2.2.5. World Coordinate

We establish a world coordinate system with runway point A as the origin, as illustrated in Figure 1. The camera position in the ECEF frame is then transformed into this world coordinate system using Equation (7).

2.3. Procedure for the Fixed-Wing Aircraft Landing

As shown in Figure 4, the landing procedure of a fixed-wing aircraft typically comprises two distinct phases: an instrument flight phase and a visual flight phase. The instrument landing phase extends from the initial approach to the Decision Altitude (DA), while the visual phase commences at DA and continues through touchdown on the runway. During the visual phase, pilots assume manual control of the aircraft for landing operations. Consequently, the provision of precise navigation information is crucial to ensure safe landing operations. The proposed pose estimation method is specifically designed to enhance navigation accuracy, particularly during the critical phase when the aircraft descends below 515 feet, where precise spatial awareness is essential for the execution of a safe landing.

3. Methodology

In this section, we present a comprehensive description of our proposed LandNet framework. The section begins with an overview of the network’s overall architecture, followed by detailed explanations of its key components: the Initial Convolutional Blocks, CNN Encoder, Transformer Encoder, Feature Interaction Block, and Attentional ConvTrans Fusion Block. Finally, we elaborate on the design principles and implementation details of the loss function.

3.1. Network Architecture

Visual descriptors, such as local features and global representations, have long been studied as complementary counterparts [36]. Convolutional Neural Networks (CNNs) hierarchically acquire local features via consecutive convolutional operations; thus, local cues are retained in the feature maps. The Transformer [37] is capable of aggregating global representations through cascaded self-attention mechanisms.
In this paper, we combine CNNs and Transformers to take advantage of both local and global features. As illustrated in Figure 5, the proposed LandNet is composed of an Initial Convolutional Block (ICB), two parallel branches, a Feature Interactive Block that bridges the two branches, an Attentional ConvTrans Fusion Block that aggregates them, and MLP heads for the regression of position and orientation.

3.2. Initial Convolutional Blocks

The Initial Convolutional Block comprises a 7 × 7 convolution with stride 2 followed by a 3 × 3 convolution, which generates a feature map $F_{\text{initial}} \in \mathbb{R}^{64 \times \frac{H}{4} \times \frac{W}{4}}$, where 64 is the number of channels. The $F_{\text{initial}}$ features are then fed into two separate branches for further processing.
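A minimal PyTorch sketch of such an Initial Convolutional Block is shown below. The stride of the 3 × 3 convolution, the use of batch normalization, and the ReLU activations are assumptions made so that the stated 64 × H/4 × W/4 output is produced; the paper does not spell out these details.

```python
import torch
import torch.nn as nn

class InitialConvBlock(nn.Module):
    """7x7 stride-2 convolution followed by a 3x3 convolution, producing a 64 x H/4 x W/4 map.
    The stride of 2 in the 3x3 convolution is an assumption made to reach H/4 x W/4."""
    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.conv7 = nn.Conv2d(in_channels, out_channels, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn7 = nn.BatchNorm2d(out_channels)
        self.conv3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn7(self.conv7(x)))   # -> 64 x H/2 x W/2
        x = self.relu(self.bn3(self.conv3(x)))   # -> 64 x H/4 x W/4
        return x

# F_initial = InitialConvBlock()(torch.randn(1, 3, 224, 224))  # -> (1, 64, 56, 56)
```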

3.3. CNN Encoder

To obtain contextual features and certain spatial information using Convolutional Neural Networks, the CNN Encoder applies a feature pyramid architecture, where the resolution of the feature map decreases and the channel number increases with an increase in depth. As shown in Figure 6, the CNN encoder uses two types of residual structures according to whether downsampling is required.
The CNN encoder is split into four stages. Each stage is composed of multiple convolution blocks and each convolution block contains n bottlenecks [38]. Each bottleneck contains a 1 × 1 down projection convolution, a 3 × 3 spatial convolution, a 1 × 1 up-projection convolution, and a residual connection between the input and output of the bottleneck.
Let $C_1(\cdot)$ and $C_2(\cdot)$ represent the residual structures in Figure 6a and Figure 6b, respectively:
$$ \begin{aligned} C_1(x) &= \sigma\big(\mathrm{BN}\big(W_2\,\sigma(\mathrm{BN}(W_1 x))\big) + x\big) \\ C_2(x) &= \sigma\big(\mathrm{BN}\big(W_2\,\sigma(\mathrm{BN}(W_1 x))\big) + W_s x\big) \end{aligned} $$
where $W_1$ and $W_2$ represent the weights of the two 3 × 3 convolution operations in $C_1$ and $C_2$, $W_s$ represents the weight of the 1 × 1 convolution operation, BN denotes batch normalization, and σ is the Rectified Linear Unit (ReLU) activation function. In our implementation, we sequentially stack two residual structures to construct the CNN encoder.
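A minimal PyTorch sketch of the two residual structures in Equation (10) is given below; the stride used for downsampling in $C_2$ is an assumption.

```python
import torch.nn as nn

class ResidualC1(nn.Module):
    """Residual structure without downsampling (Figure 6a): identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False),  # W1
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False),  # W2
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class ResidualC2(nn.Module):
    """Residual structure with downsampling (Figure 6b): 1x1 projection shortcut W_s."""
    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),  # W1
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False),      # W2
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)  # W_s
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```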
The remaining stages are composed of several convolutional blocks, and each block consists of several regular consecutive 1 × 1 convolutions and 3 × 3 convolutions. The detailed information inside the CNN encoder is shown in Table 1.

3.4. Transformer Encoder

Our proposed LandNet employs the Transformer encoder to capture global context and long-range dependencies through the Multi-Head Self-Attention Mechanism (MHSA). This capability is particularly advantageous when handling significant viewpoint changes during aircraft landings. The Transformer aligns features from various image regions by assigning attention to the relevant areas, regardless of their spatial position, thereby enhancing pose accuracy.
The Transformer branch is designed following the Vision Transformer (ViT) architecture [39]. It comprises N repeated Transformer blocks, each containing two core components: a multi-head self-attention (MHSA) module and a multi-layer perceptron (MLP) block. The MLP block consists of a fully connected (fc) up-projection layer followed by an fc down-projection layer. To ensure stable training and improved performance, Layer Normalization (LayerNorm) is applied before each layer, and residual connections are incorporated in both the self-attention layer and the MLP block.
For tokenization, the feature maps produced by the initial convolutional module are compressed into non-overlapping 14 × 14 patch embeddings through a linear projection layer. This projection is implemented as a 4 × 4 convolutional layer with a stride of 4, effectively transforming the input feature maps into a sequence of patch tokens.
The scaled dot-product attention mechanism maps a query matrix Q , key matrix K , and value matrix V to an attention matrix. A scaling factor is used to prevent very small gradients.
The scaled dot-product attention can be expressed as follows:
$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{C}}\right)V $$
Multi-head attention jointly attends to information from diverse representation subspaces; the outputs of the different heads are concatenated to improve the modelling capability:
$$ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)W^{O} $$
where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, $W^{O}$ is the output projection matrix, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the corresponding parameter matrices.
The Transformer branch is built by stacking multiple Transformer encoders. Given the embedded feature $z_0$, the Transformer encoder structure with L layers can be described as follows:
$$ \begin{aligned} z_l' &= \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, 2, \ldots, L \\ z_l &= \mathrm{MLP}(\mathrm{LN}(z_l')) + z_l', \quad l = 1, 2, \ldots, L \end{aligned} $$
where LN(·) denotes layer normalization. With L identical layers, the encoder output $z_L \in \mathbb{R}^{N \times C}$ has the same size as the input $z_0 \in \mathbb{R}^{N \times C}$. The Transformer encoder is illustrated in Figure 7.
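A minimal PyTorch sketch of one such pre-LN Transformer block (Equation (12)) is shown below; the embedding dimension and number of heads follow the final configuration in Table 5, while the GELU activation and the 4× MLP expansion ratio are assumed ViT defaults.

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """Pre-LN block: z'_l = MHSA(LN(z_{l-1})) + z_{l-1};  z_l = MLP(LN(z'_l)) + z'_l."""
    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                 # fc up-projection, activation, fc down-projection
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:   # z: (batch, N tokens, C)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # multi-head self-attention + residual
        z = z + self.mlp(self.norm2(z))                    # MLP block + residual
        return z
```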

3.5. Feature Interactive Block

A feature map $F \in \mathbb{R}^{C \times H \times W}$ (where C, H, and W represent the channel, height, and width, respectively) is obtained from the CNN encoder, and patch embeddings $P \in \mathbb{R}^{K \times E}$ are obtained from the Transformer branch, where K and E represent the number of image patches and the embedding dimension. A Feature Interactive Block (FIB) is then designed to eliminate the misalignment between the two representations.
When the feature map $F \in \mathbb{R}^{C \times H \times W}$ is fed to the Transformer branch, our FIB module first adopts a downsampling operation to complete the spatial dimension alignment. After that, feature concatenation is performed to fuse the local and global features.
On the other hand, when $P \in \mathbb{R}^{K \times E}$ from the Transformer is fed into the CNN branch, our FIB performs an upsampling operation on the patch embeddings to align them with the spatial size of the feature map.
Furthermore, a regular 1 × 1 convolution is applied to adjust the channel dimension. After the channel and spatial dimensions are aligned, an element-wise addition is employed to complete the feature enhancement for the CNN encoder. An illustration of the FIB is shown in Figure 8.
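A minimal PyTorch sketch of the two alignment directions described above is given below. The use of adaptive average pooling for downsampling, bilinear interpolation for upsampling, and the 14 × 14 patch grid are assumptions consistent with Section 3.4; how the enlarged channel dimension after concatenation is consumed is left to the surrounding layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureInteractiveBlock(nn.Module):
    """Bidirectional alignment between the CNN feature map (C x H x W)
    and the Transformer patch embeddings (K x E); sizes are illustrative."""
    def __init__(self, cnn_channels: int, embed_dim: int, patch_grid: int = 14):
        super().__init__()
        self.patch_grid = patch_grid
        self.tokens_to_cnn = nn.Conv2d(embed_dim, cnn_channels, kernel_size=1)  # 1x1 channel alignment

    def cnn_to_transformer(self, feat, tokens):
        # downsample the CNN map to the patch grid, flatten to tokens, then concatenate (K must equal 14*14)
        x = F.adaptive_avg_pool2d(feat, self.patch_grid)       # (B, C, 14, 14)
        x = x.flatten(2).transpose(1, 2)                        # (B, K, C)
        return torch.cat([tokens, x], dim=-1)                   # (B, K, E + C): fused local + global

    def transformer_to_cnn(self, tokens, feat):
        # reshape tokens to a 2-D map, upsample to the CNN spatial size, 1x1 conv, element-wise add
        b, k, e = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, e, self.patch_grid, self.patch_grid)
        x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        return feat + self.tokens_to_cnn(x)
```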

3.6. Attentional ConvTrans Fusion Block

Element-wise addition is the simplest way to fuse the output of the Transformer branch with the output of the CNN branch. However, this fusion method ignores the relationships inside the feature maps, which leads to a poor fusion effect. To solve this problem, motivated by Geng et al. [40], an Attentional ConvTrans Fusion Block (ACFB) was developed to fuse the feature maps from the two branches and improve the feature quality. The structure of the proposed ACFB is shown in Figure 9.
The procedure of the ACFB is expressed as follows:
$$ F_{out} = \mathrm{Concat}\big(F_{CNN} \otimes f_s,\; F_{Transformer} \otimes f_c\big) $$
where F C N N and F T r a n s f o r m e r are the CNN feature map and Transformer feature map, respectively, f s is the spatial attention module, and f c is the channel attention module. ⊗ refers to matrix multiplication.
Let $f_{conv}$ represent the convolution operation, and let $f_{mean}$ and $f_{max}$ be the mean and maximum operations, respectively. The spatial attention map in our ACFB module is computed as follows:
$$ \mathrm{Spatial} = \sigma\big(f_{conv}\big(\mathrm{Concat}(f_{mean}, f_{max})\big)\big) $$
Following SENet [41], the channel attention module is used to refine the Transformer feature map.
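A minimal PyTorch sketch of the ACFB is shown below, reading the ⊗ in Equation (13) as a broadcast multiplication of the attention maps with the feature maps. The 7 × 7 convolution in the spatial attention and the reduction ratio of 16 in the SENet-style channel attention are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """f_s in Equation (13): sigmoid(conv(concat(channel-mean, channel-max)))."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        mean_map = x.mean(dim=1, keepdim=True)          # f_mean over channels
        max_map, _ = x.max(dim=1, keepdim=True)         # f_max over channels
        return torch.sigmoid(self.conv(torch.cat([mean_map, max_map], dim=1)))

class ChannelAttention(nn.Module):
    """f_c in Equation (13): SENet-style squeeze-and-excitation [41]."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))                 # squeeze + excitation
        return w.view(b, c, 1, 1)

class ACFB(nn.Module):
    """Concatenate the attention-refined CNN and Transformer feature maps."""
    def __init__(self, cnn_channels: int, trans_channels: int):
        super().__init__()
        self.f_s = SpatialAttention()
        self.f_c = ChannelAttention(trans_channels)

    def forward(self, f_cnn, f_trans):                  # both given as (B, C, H, W) maps
        return torch.cat([f_cnn * self.f_s(f_cnn), f_trans * self.f_c(f_trans)], dim=1)
```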

3.7. Loss Function Design

Our camera localization method is optimized to minimize the deviation between the ground truth ( x t r u t h , q t r u t h ) and the predicted position and orientation ( x ^ p r e d i c t e d , q ^ p r e d i c t e d ). The Euclidean distance is used to evaluate the position loss L x and the attitude loss L q , which are calculated as follows:
$$ L_x = \left\| x_{truth} - \hat{x}_{predicted} \right\|_2, \qquad L_q = \left\| q_{truth} - \frac{\hat{q}_{predicted}}{\|\hat{q}_{predicted}\|} \right\|_2 $$
where $\hat{q}_{predicted}$ is normalized to a unit vector to ensure that it can be transformed into a valid rotation matrix, and $x_{truth}$ and $\hat{x}_{predicted}$ are both transformed into the ENU frame. Considering the scale difference between the two types of losses, two learned parameters $s_x$ and $s_q$ are used to balance them. The loss function is then defined as follows:
$$ L = L_x \exp(-s_x) + s_x + L_q \exp(-s_q) + s_q $$
where L represents the overall loss.
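A minimal PyTorch sketch of this loss is given below; the initial values s_x = 3 and s_q = 0 follow Section 4.6.3, while the batch-mean reduction is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseLoss(nn.Module):
    """Learnable weighting between position and orientation losses (Equations (14)-(15))."""
    def __init__(self, s_x: float = 3.0, s_q: float = 0.0):
        super().__init__()
        self.s_x = nn.Parameter(torch.tensor(s_x))   # learned position weight parameter
        self.s_q = nn.Parameter(torch.tensor(s_q))   # learned orientation weight parameter

    def forward(self, x_pred, x_true, q_pred, q_true):
        q_pred = F.normalize(q_pred, p=2, dim=-1)          # enforce unit quaternion
        l_x = torch.norm(x_true - x_pred, dim=-1).mean()   # Euclidean position error L_x
        l_q = torch.norm(q_true - q_pred, dim=-1).mean()   # Euclidean orientation error L_q
        return l_x * torch.exp(-self.s_x) + self.s_x + l_q * torch.exp(-self.s_q) + self.s_q
```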

4. Experiments

4.1. Datasets

4.1.1. Simulation Data

The landing procedure of a fixed-wing UAV equipped with a forward-looking camera was simulated using Unreal Engine. The coordinate system definition during the simulation was consistent with that in Section 2.2. The Unreal Engine simulation of the aircraft landing scene is shown in Figure 10. The simulated images and corresponding camera poses were used for training and validation of the algorithm.

4.1.2. Flight Data Acquisition

As shown in Figure 11, our flight platform consists of an FLIR camera, an Inertial Measurement Unit (IMU), a recorder for flight-related data, and a Differential Global Positioning System (DGPS). The IMU and recorder were installed in the cabin, which is considered the aircraft’s center of gravity. Prior to flight, the rotation matrix and translation vector between the FLIR camera and IMU were calibrated. The DGPS provides high-accuracy position and orientation data, which served as the ground truth for our method. The calibration data between the camera and IMU are presented in Table 2. The data-collection flight test was conducted with an infrared camera onboard at Pucheng Airport, Shaanxi Province, China, and the weather conditions were foggy. A total of three flight sessions were carried out, collecting images below a flight altitude of 515 feet, with 1404 images taken per session. The distribution of images according to flight altitude is as follows: 27% of the images were taken at altitudes between 515 and 400 feet, 20% at altitudes between 400 and 300 feet, 20% at altitudes between 300 and 200 feet, and 33% at altitudes below 200 feet. Images captured by the airborne FLIR camera are shown in Figure 12.

4.2. Data Preparation

It is necessary to transform the aircraft’s attitude information into the camera’s attitude. The camera pose can be obtained from the aircraft’s pose using Equations (5)–(9) through the rigid connection. After that, we split all the images and the corresponding pose data into a training set and a test set.

Data Processing

We begin by processing the position data within a world coordinate system, applying normalization to scale the data to a standardized range.
In addition, the images captured by the airborne forward-looking camera undergo a series of transformations. First, rescaling is performed to ensure uniformity in the size of all training and test images, which is essential to ensure they are compatible with the CNN inputs. To further enhance performance in large-view scenarios for the LandNet model, data augmentation techniques such as random rotation and random cropping are applied.
Moreover, during the training phase, adjustments to the brightness, contrast, and saturation of input images are made. These modifications increase the variability of the dataset, enabling LandNet to train on images with diverse visual conditions, thereby improving its robustness.
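A sketch of such a preprocessing pipeline with torchvision is shown below; the image sizes, rotation range, jitter strengths, and normalization statistics are illustrative assumptions rather than the exact values used in the paper.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline matching the preprocessing described above.
train_transform = T.Compose([
    T.Resize((256, 256)),                                          # rescale to a uniform size
    T.RandomRotation(degrees=10),                                  # random rotation
    T.RandomCrop(224),                                             # random cropping
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # brightness/contrast/saturation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

test_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```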

4.3. Training Details

All training and test experiments are performed on the PyTorch 2.6 [42] platform with a single NVIDIA RTX 3090 GPU under Windows 11. We set the batch size to 16, the learning rate to $1 \times 10^{-3}$, the weight decay to $5 \times 10^{-4}$, and the number of epochs to 300.

4.4. Testing Metric Definitions

The accuracy of the position and orientation estimates was used to evaluate the effectiveness of the proposed method in our experiments. Specifically, we compared the aircraft position and orientation errors in the navigation frame (East–North–Up) against the differential GPS (DGPS) values.
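For completeness, a minimal NumPy sketch of the per-axis RMSE computed against the DGPS ground truth is given below; the array shapes are assumptions.

```python
import numpy as np

def rmse(pred: np.ndarray, truth: np.ndarray) -> np.ndarray:
    """Per-axis Root Mean Square Error between predictions and DGPS ground truth.
    Both arrays have shape (num_frames, 3): ENU positions or roll/pitch/yaw angles."""
    return np.sqrt(np.mean((pred - truth) ** 2, axis=0))

# example usage: position_rmse = rmse(predicted_enu, dgps_enu)  # -> [dX, dY, dZ]
```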

4.5. Experimental Results

4.5.1. Simulation Results

The simulated images and corresponding camera poses were used for training and validation of the algorithm. PoseNet [27], IRPNet [43], Atloc [44], HyperPoseNet [45], and TransPose [46] were used to compare the accuracy of position and orientation errors with our LandNet below a flight altitude of 500 feet using the Root Mean Square Error (RMSE). The experimental results are reported in Table 3.
Δ X , Δ Y , and Δ Z represent the differences between the predicted values and the ground truth. Δ ξ (deg), Δ φ , and Δ θ refer to the differences in the Euler angles between the estimated values and the ground truth. Table 3 shows that our proposed LandNet achieved the smallest errors in the X, Y, and Z positions (7.69 m, 7.36 m, and 0.96 m, respectively). This indicates that LandNet performs better in accurately predicting the 3D coordinates compared to the other methods listed.
In terms of orientation, LandNet has a relatively low angular error ( 0 . 218 ), which suggests that it offers better rotation accuracy, particularly when compared to models like PoseNet ( 0 . 328 ) and TransPose ( 0 . 295 ).
Additionally, our proposed LandNet offers well-balanced performance across all metrics (position and orientation). While models such as HyperPoseNet and TransPose come close in specific aspects, LandNet consistently obtained lower position and orientation errors, which suggests a more stable and reliable overall performance.

4.5.2. Results on Real-Flight Data

We conducted an experimental comparison of our approach with contemporary localization schemes, which fall into two classes: (i) CNN-based camera pose regression methods, including PoseNet [27], IRPNet [43], Atloc [44], and HyperPoseNet [45]; and (ii) Transformer-based camera pose regression methods, including SitPose [47], TransPoseNet [46], TransencoderNet [48], and our LandNet. The experimental results are presented in Table 4.
ΔX, ΔY, and ΔZ represent the position estimation errors in the world frame along the respective axes. Similarly, Δξ, Δφ, and Δθ denote the orientation estimation errors of the camera’s roll, pitch, and yaw angles, respectively. These orientation errors are derived from the quaternion output of the proposed LandNet model.
As shown in Table 4, our proposed LandNet employs a hybrid architecture that combines CNNs and Transformers, effectively leveraging the strengths of both models. This design enables LandNet to achieve competitive performance across varying flight altitudes. For instance, LandNet demonstrates exceptional accuracy in the Z-direction, achieving an error of just 0.2 m within the 515–400 ft flight height range, significantly outperforming the other methods under the same conditions.
To be specific, PoseNet has a ΔZ of 0.9 m, IRPNet shows a ΔZ error of 1.9 m, and Atloc has a ΔZ of 1.1 m. The relatively low ΔZ of 0.2 m from LandNet demonstrates its superior navigation capabilities for fixed-wing landings.
While LandNet’s performance in the X and Y directions (errors of 6.2 m and 3.5 m, respectively) is not as outstanding as that in the Z-direction, it still compares favourably with the other models in the 515–400 ft range. For example, PoseNet shows errors of 12.5 m and 5.7 m, and IRPNet shows errors of 12.3 m and 10.5 m in the X and Y directions; LandNet’s errors are much lower, especially in the X direction. TransPose has an X-direction error of 11.3 m, and TransencoderNet shows an error of 8.8 m. LandNet also achieves among the smallest ΔX, ΔY, and ΔZ errors across all height ranges.
LandNet also performs well in minimizing Euler angle errors. In the 515–400 ft range, its Euler angle errors are 0.02°, 0.01°, and 0.03°, showing that it outperforms many other methods in terms of orientation accuracy. At lower heights (below 200 ft), LandNet achieves Euler angle errors of 0.016°, 0.002°, and 0.001°, which are significantly better than those of the other methods.
Compared to other CNN-based and Transformer-based methods such as PoseNet, IRPNet, and TransPose, LandNet provides lower ΔX, ΔY, ΔZ, and Euler angle errors. Even Transformer-based methods such as SitPose and TransencoderNet, which perform well in specific ranges, cannot match LandNet’s overall accuracy.
In particular, SitPose exhibited significant errors in the real-flight data. This may be due to its nature as a relative camera pose regression method, where it learns the motion between adjacent image frames. This approach can lead to substantial errors, especially in landing scenarios.
In conclusion, LandNet is best suited for the landing phase of fixed-wing aircraft, which demands high precision in vertical measurements.
We also provide a visual comparison with PoseNet, HyperPoseNet, TransPoseNet, and our proposed LandNet. The visual comparison results are shown in Figure 13 and Figure 14.
From Figure 13, it can be observed that, compared to PoseNet, HyperPoseNet, and TransPoseNet, the proposed LandNet achieved the position estimates closest to the differential GPS ground truth throughout the entire aircraft landing process. To further evaluate LandNet’s performance, we divided the flight into four stages (515–400 ft, 400–300 ft, 300–200 ft, and below 200 ft), as illustrated in Figure 14.
As shown in Figure 14, during the 515–400 ft altitude phase, TransPoseNet exhibited significant estimation errors. This may stem from its exclusive reliance on the Transformer architecture, which lacks the ability to capture local spatial features. In the 400–300 ft phase, both PoseNet and HyperPoseNet demonstrate a degraded performance. This is likely due to the aircraft’s sharp turn to align with the runway during this stage, which induces substantial viewpoint changes that challenge the robustness of CNN-based methods in handling large perspective variations.
In the 300–200 ft phase, all baseline methods show noticeable performance deterioration, whereas the proposed LandNet makes relatively accurate position estimations. This highlights LandNet’s enhanced robustness due to its hybrid design that synergistically combines the strengths of CNNs (local feature extraction) and Transformers (global context modeling). These results validate the effectiveness of LandNet in addressing complex dynamic scenarios during critical landing phases.

4.6. Ablation Study

To assess the effectiveness of the proposed module in LandNet, we conducted a series of ablation studies. We compared the performance of each module against the ground truth in terms of RMSE.

4.6.1. Ablation on Parameters Inside the CNN and Transformer Branches

First, we investigated the parameters within both the CNN and Transformer branches. For the CNN branch, we adjusted the number of channels in the first stage. For the Transformer branch, we explored different numbers of embedding dimensions and attention heads. The experimental setup and corresponding results are presented in Table 5.
We observe that the accuracy improves as we increase the parameters in either the CNN or the Transformer branch. Increasing the parameters in the CNN branch yields greater improvements, but also increases the computational cost.

4.6.2. Ablation Study on the FIB and ACFB Modules

As shown in Table 6, we calculated the accuracy of orientation and position using RMSE for flight altitudes below 515 feet. Our results indicate that the proposed ACFB module positively impacts both orientation and position accuracy, yielding an improvement of approximately 1 meter in position.

4.6.3. Sensitivity Study for s_x and s_q

The hyperparameters s x and s q are learned parameters that control the balance between the two losses. In the camera pose regression task, quaternions are commonly used to represent orientation due to their continuous differentiability. Since we expect quaternion values to be much smaller in magnitude, their noise should be significantly lower than the position noise. Therefore, s q should be much smaller than s x , meaning that the orientation loss should be weighted much higher than the position loss.
To investigate how performance varies with changes in the hyperparameters s x and s q , we conducted a sensitivity study. Specifically, we selected values for s x and s q from predefined ranges, trained the model with these hyperparameters on the training set, and reported the pose prediction error on the test set. The experimental settings and results are summarized in Table 7. Additionally, we compared the accuracy of position and orientation errors below a flight altitude of 515 feet using the Root Mean Square Error (RMSE).
As shown in Table 7, the three groups of s x and s q exhibited a similar performance in terms of orientation accuracy. However, higher values of s q can negatively impact position accuracy. Therefore, we set s x = 3 and s q = 0 as the initial values for our LandNet.

5. Conclusions

In this paper, we presented a novel absolute camera pose estimation network called LandNet for fixed-wing aircraft landings. Our proposed LandNet utilizes images from the forward-looking camera to estimate the 6-DOF camera pose.
We first introduced the basic concepts related to the landing procedures of fixed-wing aircraft, including coordinate definitions and the overall landing process. Following this, we presented the architecture of our hybrid network, which combines CNN and Transformer models. We then discussed the Feature Interactive Method, which enhances the mutual feature representation capabilities of both networks. An Attentional ConvTrans block was introduced to fuse the feature maps from the CNN and Transformer. Finally, two MLP pose regressors were employed to estimate the camera’s position and orientation. As a result, the fixed-wing aircraft’s attitude and position can be derived through rigid connection using pre-calibrated parameters. Experimental results using simulation data and real flight data demonstrated the effectiveness of the proposed method. However, the estimated position error in the eastward and northward directions remains significant. To address this issue, we plan to implement the following measures in the future:
Enhanced Data Augmentation: To handle image feature changes across altitude ranges, introduce more diverse augmentation techniques such as varying lighting conditions, simulating different textures, or introducing perspective shifts. This would help the model better adapt to various conditions during training.
Improved Data Collection and Preprocessing: To address data quality issues, improve the data collection process to ensure that the dataset represents a wide range of altitudes and conditions. Additionally, perform data cleaning to remove noise and ensure that all data are consistent and accurate.
Model Architecture Refinement: Deeper layers, the Adaptive Attention Mechanism, and fine-tuning operations should be considered in the future to improve the accuracy of LandNet.

Author Contributions

Conceptualization, S.S. and Z.Z.; methodology, S.S. and G.Y.; software, S.S. and Y.Y.; validation, Z.Z. and L.Z.; writing—original draft preparation, S.S., G.Y. and L.Z.; writing—review and editing, S.S. and Y.Y.; supervision, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because of commercial restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
6-DoF   Six Degrees of Freedom
GNSS    Global Navigation Satellite System
INS     Inertial Navigation System
UAV     Unmanned Aerial Vehicle
DSO     Direct Sparse Odometry
VO      Visual Odometry
ECEF    Earth-Centered, Earth-Fixed frame
DA      Decision Altitude
CNNs    Convolutional Neural Networks

References

  1. Zhang, Z.; Song, Y.; Huang, S.; Xiong, R.; Wang, Y. Toward consistent and efficient map-based visual-inertial localization: Theory framework and filter design. IEEE Trans. Robot. 2023, 39, 2892–2911. [Google Scholar] [CrossRef]
  2. Gallo, E.; Barrientos, A. Long-Distance GNSS-Denied Visual Inertial Navigation for Autonomous Fixed-Wing Unmanned Air Vehicles: SO (3) Manifold Filter Based on Virtual Vision Sensor. Aerospace 2023, 10, 708. [Google Scholar] [CrossRef]
  3. Lee, K.; Johnson, E.N. Latency compensated visual-inertial odometry for agile autonomous flight. Sensors 2020, 20, 2209. [Google Scholar] [CrossRef] [PubMed]
  4. Song, X.; Zhang, L.; Zhai, Z.; Yu, G. A Multimode Visual-Inertial Navigation Method for Fixed-wing Aircraft Approach and Landing in GPS-denied and Low Visibility Environments. In Proceedings of the 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), San Diego, CA, USA, 8–12 September 2019; pp. 1–10. [Google Scholar]
  5. Van Der Merwe, R.; Wan, E.A. The square-root unscented Kalman filter for state and parameter-estimation. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 6, pp. 3461–3464. [Google Scholar]
  6. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  7. Ellingson, G.; Brink, K.; McLain, T. Relative navigation of fixed-wing aircraft in GPS-denied environments. NAVIGATION J. Inst. Navig. 2020, 67, 255–273. [Google Scholar] [CrossRef]
  8. Seiskari, O.; Rantalankila, P.; Kannala, J.; Ylilammi, J.; Rahtu, E.; Solin, A. HybVIO: Pushing the limits of real-time visual-inertial odometry. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 701–710. [Google Scholar]
  9. Wang, Z.; Pang, B.; Song, Y.; Yuan, X.; Xu, Q.; Li, Y. Robust visual-inertial odometry based on a kalman filter and factor graph. IEEE Trans. Intell. Transp. Syst. 2023, 24, 7048–7060. [Google Scholar] [CrossRef]
  10. Guan, W.; Chen, P.; Xie, Y.; Lu, P. Pl-evio: Robust monocular event-based visual inertial odometry with point and line features. IEEE Trans. Autom. Sci. Eng. 2023, 21, 6277–6293. [Google Scholar] [CrossRef]
  11. Cao, S.; Lu, X.; Shen, S. GVINS: Tightly coupled GNSS–visual–inertial fusion for smooth and consistent state estimation. IEEE Trans. Robot. 2022, 38, 2004–2021. [Google Scholar] [CrossRef]
  12. Leutenegger, S. Okvis2: Realtime scalable visual-inertial slam with loop closure. arXiv 2022, arXiv:2202.09199. [Google Scholar]
  13. Liu, X.; Li, C.; Xu, X.; Yang, N.; Qin, B. Implicit Neural Mapping for a Data Closed-Loop Unmanned Aerial Vehicle Pose-Estimation Algorithm in a Vision-Only Landing System. Drones 2023, 7, 529. [Google Scholar] [CrossRef]
  14. Hou, B.; Ding, X.; Bu, Y.; Liu, C.; Shou, Y.; Xu, B. Visual Inertial Navigation Optimization Method Based on Landmark Recognition. In Proceedings of the International Conference on Cognitive Computation and Systems, Beijing, China, 17–18 December 2022; pp. 212–223. [Google Scholar]
  15. Huang, L.; Song, J.; Zhang, C. Observability analysis and filter design for a vision inertial absolute navigation system for UAV using landmarks. Optik 2017, 149, 455–468. [Google Scholar] [CrossRef]
  16. Wang, Z.; Zhao, D.; Cao, Y. Visual navigation algorithm for night landing of fixed-wing unmanned aerial vehicle. Aerospace 2022, 9, 615. [Google Scholar] [CrossRef]
  17. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  18. Yu, G.; Zhang, L.; Zou, C.; Liu, Y.; Cheng, Y. A Robust and Real-time Visual-Inertial Pose Estimation for Fixed-wing Aircraft Landing. In Proceedings of the 32nd Congress of the International Council of the Aeronautical Sciences, Shanghai, China, 6–10 September 2021; pp. 4398–4409. [Google Scholar]
  19. Yu, G.; Zhang, L.; Shen, S.; Zhai, Z. Real-time vision-inertial landing navigation for fixed-wing aircraft with CFC-CKF. Complex Intell. Syst. 2024, 10, 8079–8093. [Google Scholar] [CrossRef]
  20. Dusha, D.; Mejias, L.; Walker, R. Fixed-wing attitude estimation using temporal tracking of the horizon and optical flow. J. Field Robot. 2011, 28, 355–372. [Google Scholar] [CrossRef]
  21. Grof, T.; Bauer, P.; Hiba, A.; Gati, A.; Zarándy, Á.; Vanek, B. Runway relative positioning of aircraft with IMU-camera data fusion. IFAC-PapersOnLine 2019, 52, 376–381. [Google Scholar] [CrossRef]
  22. Shang, K.; Li, X.; Liu, C.; Ming, L.; Hu, G. An Integrated Navigation Method for UAV Autonomous Landing Based on Inertial and Vision Sensors. In Proceedings of the CAAI International Conference on Artificial Intelligence, Beijing, China, 27–28 August 2022; pp. 182–193. [Google Scholar]
  23. Macario Barros, A.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A comprehensive survey of visual slam algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
  24. Li, Q.; Cao, R.; Zhu, J.; Hou, X.; Liu, J.; Jia, S.; Li, Q.; Qiu, G. Improving synthetic 3D model-aided indoor image localization via domain adaptation. ISPRS J. Photogramm. Remote Sens. 2022, 183, 66–78. [Google Scholar] [CrossRef]
  25. Sattler, T.; Zhou, Q.; Pollefeys, M.; Leal-Taixe, L. Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3302–3312. [Google Scholar]
  26. Zhuang, S.; Zhao, Z.; Cao, L.; Wang, D.; Fu, C.; Du, K. A robust and fast method to the perspective-n-point problem for camera pose estimation. IEEE Sens. J. 2023, 23, 11892–11906. [Google Scholar] [CrossRef]
  27. Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar]
  28. Liu, K.; Li, Q.; Qiu, G. PoseGAN: A pose-to-image translation framework for camera localization. ISPRS J. Photogramm. Remote Sens. 2020, 166, 308–315. [Google Scholar] [CrossRef]
  29. Shavit, Y.; Ferens, R.; Keller, Y. Coarse-to-fine multi-scene pose regression with transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 14222–14233. [Google Scholar] [CrossRef]
  30. Brahmbhatt, S.; Gu, J.; Kim, K.; Hays, J.; Kautz, J. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2616–2625. [Google Scholar]
  31. Gao, H.; Dai, K.; Wang, K.; Li, R.; Zhao, L.; Wu, M. ALNet: An adaptive channel attention network with local discrepancy perception for accurate indoor visual localization. Expert Syst. Appl. 2024, 250, 123792. [Google Scholar] [CrossRef]
  32. Wang, J.; Zhou, K.; Markham, A.; Trigoni, N. MambaLoc: Efficient Camera Localisation via State Space Model. arXiv 2024, arXiv:2408.09680. [Google Scholar]
  33. Yang, C.; Liu, Y.; Zell, A. RCPNet: Deep-learning based relative camera pose estimation for UAVs. In Proceedings of the 2020 International Conference on Unmanned Aircraft Systems (ICUAS), Athens, Greece, 1–4 September 2020; pp. 1085–1092. [Google Scholar]
  34. Baldini, F.; Anandkumar, A.; Murray, R.M. Learning pose estimation for UAV autonomous navigation and landing using visual-inertial sensor data. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020; pp. 2961–2966. [Google Scholar]
  35. Shan, P.; Yang, R.; Xiao, H.; Zhang, L.; Liu, Y.; Fu, Q.; Zhao, Y. UAVPNet: A balanced and enhanced UAV object detection and pose recognition network. Measurement 2023, 222, 113654. [Google Scholar] [CrossRef]
  36. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural. Inf. Process. Syst. 2017, 5998–6008. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  40. Geng, P.; Lu, J.; Zhang, Y.; Ma, S.; Tang, Z.; Liu, J. TC-Fuse: A Transformers Fusing CNNs Network for Medical Image Segmentation. CMES-Comput. Model. Eng. Sci. 2023, 137, 2001–2023. [Google Scholar] [CrossRef]
  41. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  42. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural. Inf. Process. Syst. 2019, 32, 721. [Google Scholar]
  43. Shavit, Y.; Ferens, R. Do we really need scene-specific pose encoders? In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3186–3192. [Google Scholar]
  44. Wang, B.; Chen, C.; Lu, C.X.; Zhao, P.; Trigoni, N.; Markham, A. Atloc: Attention guided camera localization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10393–10401. [Google Scholar]
  45. Ferens, R.; Keller, Y. Hyperpose: Camera pose localization using attention hypernetworks. arXiv 2023, arXiv:2303.02610. [Google Scholar]
  46. Shavit, Y.; Ferens, R.; Keller, Y. Paying attention to activation maps in camera pose regression. arXiv 2021, arXiv:2103.11477. [Google Scholar]
  47. Leng, K.; Yang, C.; Sui, W.; Liu, J.; Li, Z. Sitpose: A Siamese Convolutional Transformer for Relative Camera Pose Estimation. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1871–1876. [Google Scholar]
  48. Shavit, Y.; Ferens, R.; Keller, Y. Learning single and multi-scene camera pose regression with transformer encoders. Comput. Vis. Image Underst. 2024, 243, 103982. [Google Scholar] [CrossRef]
Figure 1. Coordinates’ definitions in the fixed-wing aircraft landing. A, B, C, and D are the runway vertices.
Figure 2. Illustration of the ECEF and ENU coordinates.
Figure 3. Transform matrix between the navigation frame and the body frame.
Figure 4. Illustration of aircraft landing procedures. Points A, B, and C correspond to altitudes of 1000 feet, 200 feet, and 100 feet, respectively.
Figure 5. Overall architecture of the proposed camera localization network.
Figure 6. Two types of residual structures. (a) Residual structure without downsampling. (b) Residual structure with downsampling.
Figure 7. Illustration of the Transformer encoder.
Figure 8. Illustration of the proposed FIB.
Figure 9. Structure of the proposed ACFB.
Figure 10. The simulated landing scene of the UAV.
Figure 11. Data acquisition platform.
Figure 12. Images captured by the FLIR camera.
Figure 13. Landing trajectories.
Figure 14. Trajectory comparisons at various flight altitudes.
Table 1. Detailed information for the CNN encoder.

Stage | Kernel Size × Channels | Repeated Times
s1 | [1 × 1 × 64; 3 × 3 × 64; 1 × 1 × 256] | 1
   | [1 × 1 × 64; 3 × 3 × 64; 1 × 1 × 256] | 3
s2 | [1 × 1 × 128; 3 × 3 × 128; 1 × 1 × 512] | 4
s3 | [1 × 1 × 256; 3 × 3 × 256; 1 × 1 × 1024] | 3
s4 | [1 × 1 × 256; 3 × 3 × 256; 1 × 1 × 1024] | 1
Table 2. The calibrated parameters of the IMU and FLIR cameras.

Parameter Type | Specific Parameter | Value
Camera intrinsic parameters | Pixel size | 0.025 μm
 | Resolution | 640 × 512
 | Focal length | f_x = 1010.7 pixel, f_y = 1009.5 pixel
 | Radial distortion | k_1 = 0.3408, k_2 = 0.1238
IMU installation on the aircraft | Position | [0.0704, 0.4742, 7.2863] m
Camera installation | Attitude | [1.101882, 5.366247, 0.070693] deg
 | Position | [0.007, 0.190, 12.229] m
Table 3. The accuracy on the simulated datasets.

Method | ΔX (m) | ΔY (m) | ΔZ (m) | Δξ (deg) | Δφ (deg) | Δθ (deg)
PoseNet [27] | 13.9 | 9.2 | 1.3 | 0.235 | 0.381 | 0.328
IRPNet [43] | 11.1 | 10.2 | 1.5 | 0.286 | 0.394 | 0.271
Atloc [44] | 10.8 | 8.5 | 1.4 | 0.301 | 0.372 | 0.249
HyperPoseNet [45] | 8.9 | 9.1 | 1.19 | 0.201 | 0.453 | 0.232
TransPose [46] | 8.4 | 8.7 | 1.26 | 0.257 | 0.371 | 0.295
LandNet | 7.6 | 7.3 | 0.9 | 0.110 | 0.339 | 0.218
Table 4. The comparison results on flight data.

Flight Height (ft) | Category | Method | ΔX (m) | ΔY (m) | ΔZ (m) | Δξ (deg) | Δφ (deg) | Δθ (deg)
515–400 | CNN-based | PoseNet [27] | 12.5 | 5.7 | 0.9 | 0.070 | 0.080 | 0.020
 | | IRPNet [43] | 12.3 | 10.5 | 1.9 | 0.018 | 0.009 | 0.038
 | | Atloc [44] | 9.5 | 7.9 | 1.1 | 0.010 | 0.033 | 0.025
 | | HyperPoseNet [45] | 9.2 | 5.8 | 1.8 | 0.020 | 0.010 | 0.030
 | Transformer-based | SitPose [47] | 12.8 | 35.7 | 3.34 | 0.058 | 0.060 | 0.203
 | | TransPose [46] | 9.4 | 8.7 | 1.6 | 0.053 | 0.048 | 0.051
 | | TransencoderNet [48] | 8.3 | 9.3 | 1.1 | 0.045 | 0.034 | 0.029
 | | LandNet | 6.2 | 3.5 | 0.2 | 0.020 | 0.010 | 0.030
400–300 | CNN-based | PoseNet [27] | 6.5 | 3.9 | 1.1 | 0.002 | 0.020 | 0.050
 | | IRPNet [43] | 13.6 | 6.4 | 0.9 | 0.021 | 0.004 | 0.029
 | | Atloc [44] | 7.5 | 5.2 | 2.6 | 0.004 | 0.006 | 0.008
 | | HyperPoseNet [45] | 11.1 | 5.9 | 1.2 | 0.006 | 0.003 | 0.008
 | Transformer-based | SitPose [47] | 10.0 | 7.5 | 0.4 | 0.009 | 0.033 | 0.104
 | | TransPose [46] | 9.3 | 5.5 | 1.2 | 0.036 | 0.048 | 0.029
 | | TransencoderNet [48] | 8.8 | 6.6 | 0.9 | 0.032 | 0.031 | 0.032
 | | LandNet | 7.0 | 3.7 | 0.7 | 0.003 | 0.002 | 0.007
300–200 | CNN-based | PoseNet [27] | 9.2 | 4.4 | 1.9 | 0.003 | 0.020 | 0.003
 | | IRPNet [43] | 22.2 | 10.6 | 0.88 | 0.016 | 0.001 | 0.014
 | | Atloc [44] | 11.8 | 7.2 | 2.5 | 0.004 | 0.001 | 0.006
 | | HyperPoseNet [45] | 7.3 | 3.8 | 1.7 | 0.006 | 0.004 | 0.040
 | Transformer-based | SitPose [47] | 13.5 | 8.5 | 1.1 | 0.025 | 0.016 | 0.073
 | | TransPose [46] | 9.5 | 5.3 | 1.3 | 0.021 | 0.030 | 0.015
 | | TransencoderNet [48] | 8.3 | 5.6 | 0.9 | 0.022 | 0.027 | 0.019
 | | LandNet | 7.5 | 2.8 | 0.9 | 0.003 | 0.001 | 0.006
Below 200 | CNN-based | PoseNet [27] | 14.5 | 6.0 | 1.4 | 0.015 | 0.017 | 0.030
 | | IRPNet [43] | 9.2 | 5.8 | 1.8 | 0.021 | 0.007 | 0.033
 | | Atloc [44] | 5.1 | 4.9 | 1.8 | 0.005 | 0.001 | 0.009
 | | HyperPoseNet [45] | 15.0 | 3.9 | 2.2 | 0.017 | 0.002 | 0.030
 | Transformer-based | SitPose [47] | 12.5 | 11.4 | 1.2 | 0.015 | 0.003 | 0.015
 | | TransPose [46] | 9.3 | 6.0 | 1.1 | 0.021 | 0.033 | 0.018
 | | TransencoderNet [48] | 6.3 | 7.3 | 0.8 | 0.020 | 0.019 | 0.021
 | | LandNet | 4.9 | 6.5 | 0.9 | 0.016 | 0.002 | 0.001
Table 5. Ablation settings and parameter results.

Transformer Branch (Embedding Dimension / Heads) | CNN Branch (Channels) | ΔX (m) | ΔY (m) | ΔZ (m) | Δξ (deg) | Δφ (deg) | Δθ (deg)
384 / 6 | 64 | 12.34 | 9.12 | 1.91 | 0.035 | 0.025 | 0.063
384 / 6 | 128 | 11.08 | 8.52 | 1.58 | 0.032 | 0.016 | 0.040
384 / 6 | 512 | 10.98 | 7.92 | 1.12 | 0.026 | 0.013 | 0.032
576 / 9 | 256 | 10.65 | 8.31 | 1.56 | 0.023 | 0.011 | 0.036
768 / 12 | 256 | 9.34 | 6.07 | 0.93 | 0.020 | 0.005 | 0.031
Table 6. Ablation study results on the FIB and ACFB.

Module (FIB / ACFB) | ΔX (m) | ΔY (m) | ΔZ (m) | Δξ (deg) | Δφ (deg) | Δθ (deg)
 | 9.34 | 6.07 | 0.93 | 0.020 | 0.005 | 0.031
 | 10.45 | 6.78 | 1.22 | 0.026 | 0.006 | 0.030
 | 9.56 | 7.12 | 1.32 | 0.030 | 0.011 | 0.038
Table 7. Experimental results for s_x and s_q.

s_x | s_q | ΔX (m) | ΔY (m) | ΔZ (m) | Δξ (deg) | Δφ (deg) | Δθ (deg)
3 | 0 | 9.34 | 6.07 | 0.93 | 0.020 | 0.005 | 0.031
0 | 0 | 14.93 | 7.74 | 1.29 | 0.024 | 0.006 | 0.033
0 | 2 | 12.20 | 7.62 | 0.91 | 0.022 | 0.006 | 0.03
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
