Article

Pose Tracking and Object Reconstruction Based on Occlusion Relationships in Complex Environments

School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9355; https://doi.org/10.3390/app14209355
Submission received: 27 July 2024 / Revised: 2 October 2024 / Accepted: 10 October 2024 / Published: 14 October 2024
(This article belongs to the Special Issue Technical Advances in 3D Reconstruction)

Abstract
For the reconstruction of objects during hand–object interactions, accurate pose estimation is indispensable. Improving the precision of pose estimation enhances the accuracy of the 3D reconstruction results. Recently, pose tracking techniques are no longer limited to individual objects, enabling advances in the reconstruction of objects that interact with other objects. However, most methods struggle to handle incomplete target information in complex scenes and mutual interference between objects in the environment, leading to a decrease in pose estimation accuracy. We propose an improved algorithm built upon the existing BundleSDF framework, which enables more robust and accurate tracking by considering the occlusion relationships between objects. First, to detect changes in occlusion relationships, we segment the target and compute dual-layer masks. Second, a rough pose is estimated through feature matching, and a key-frame pool, maintained on the basis of occlusion relationships, is introduced for pose optimization. Finally, the estimated results of historical frames are used to train a neural object field that assists the subsequent pose-tracking process. Experiments on the HO-3D dataset verify that our method significantly improves the accuracy and robustness of object tracking under frequent interactions, providing new ideas for object pose-tracking tasks in complex scenes.

1. Introduction

Object pose estimation is crucial for accurate 3D reconstruction, aiming to infer the position and orientation of objects in the camera coordinate system from given RGB or depth images. More accurate pose tracking and estimation results can significantly enhance the quality of 3D reconstruction. Meanwhile, with the rapid development of robotics [1,2,3], autonomous driving [4,5], and augmented reality [6,7] in recent years, object pose detection and tracking technologies have gained significant attention.
The core task of object pose estimation is to compute the transformation between the camera coordinate system and the object coordinate system. Research on single-frame object pose estimation can be divided into two categories: instance-level and category-level object pose estimation. Instance-level pose estimation aims to estimate the six degrees of freedom (DoF) pose of the target object, consisting of two components: a 3-DoF spatial translation and a 3-DoF rotation. Category-level pose estimation builds upon instance-level methods by overcoming the constraints of computer-aided design (CAD) models and generalizing to a class of objects with similar geometric structures; it therefore also requires the estimation of 3-DoF scale information. Pose tracking extends object pose estimation into the temporal dimension, generally referring to the continuous updating of the position and orientation of the target object in a sequence of consecutive images or video frames. This process involves extracting the geometric shape and texture features of objects, perceiving object positions and movements, and emphasizing the continuity of the estimation results. Although current methods handle issues such as object motion and drift caused by long-term tracking reasonably well, they are mostly limited to relatively simple scenarios. In complex scenarios, challenges remain for pose tracking, such as difficulties in accurately segmenting targets and occlusions between targets and the environment or other objects.
Object pose tracking can be categorized into instance-level and category-level approaches. Instance-level tracking relies on specific 3D models, such as CAD models [8], to track known objects with high precision. In contrast, category-level tracking does not depend on CAD models and aims to generalize across objects within the same category. Because the estimated parameters are unified as 6-DoF poses (the target scale is determined by the first frame), these approaches are discussed together. Unlike single-frame object pose estimation, object pose tracking does not independently estimate the pose of the target in each frame of the image. Instead, it is considered a motion process, where the pose of the new frame is adjusted based on the estimated pose of the previous frame. At the same time, new information is introduced into the system, and historical estimation results are optimized through consistency constraints. Wang et al. [9] proposed a method that takes single-frame RGB-D images as input and estimates the pose of objects by tracking the cumulative changes in relative pose over time. Specifically, the method first employs an anchor-based key-point generation scheme to adaptively generate key points from previous frames and the current frame. Then, using two sets of ordered key points and previously computed instance poses, it derives the current estimated pose of the object, including six degrees of freedom of pose information. Deng et al. [10] utilize Rao–Blackwellized particle filters to sample object poses and estimate the discrete distribution of rotations for each particle using a precomputed codebook. This method can effectively track object poses and is more robust to motion blur and occlusion. However, its efficiency significantly decreases when objects are heavily occluded, or measurement results deviate significantly from synthetically generated training data. EPro-PnP [11] achieves end-to-end training of camera poses by transforming the perspective-n-point (PnP) problem into a probability density prediction task. By learning the 2D–3D correlation based on ground truth poses, this method can stably and flexibly train pose estimation networks, surpassing the performance of traditional methods. Additionally, the concept of EPro-PnP can be applied to other geometric optimization problems and has certain theoretical generalizations. The latest research, BundleSDF [12], links the pose graph optimization process with the geometric representation of objects using a key-frame pool, making it highly adaptable to significant motion, partial occlusion, and textureless surfaces. Furthermore, with online learning of the neural object field, BundleSDF enables pose tracking for arbitrary objects.
Although current pose tracking achieves considerable accuracy in scenarios with no occlusion or occasional small occlusions, it still faces significant challenges in situations where the target interacts frequently with the environment (such as hands, robotic arms, etc.) or in complex scenes with long-term random occlusions. Existing pose tracking solutions [13,14,15,16], especially those that rely on modeling [17], face severe challenges in such scenarios. These challenges mainly manifest in the following aspects:
  • Difficulties in Segmentation. For tracking unknown objects, accurate segmentation of the target is crucial. The segmentation difficulty increases significantly when occluding objects are moving relative to the target object.
  • Boundary Ambiguity. When occluding objects appear in front of the target, even if the segmentation module can accurately separate them in a timely manner, the mask recorded at the boundary between the two objects will only represent the boundary of the occluding object. If not properly handled, this situation may introduce ambiguity.
  • Feature Matching Problems. Traditional tracking methods typically rely on matching relationships between key points in adjacent frames to estimate relative displacement. However, this approach heavily relies on the texture features of the target object’s surface. When an occlusion occurs in the scene, some selected feature points from the previous frame may be lost due to occlusion.
Our method is an improved version of BundleSDF [12]; its core lies in deep learning-based object pose tracking, aimed at enhancing the performance of object pose estimation in complex environments. We found that frequent occlusions can severely affect pose tracking. The baseline method uses segmentation masks to separate the target object from the environment in order to eliminate environmental interference, and in the subsequent geometric shape analysis, the segmentation mask is assumed to represent the geometric boundary of the object in the current view. However, foreground occlusion can mislead this process and cause tracking failure. Therefore, to address the challenge of effectively tracking targets in complex environments with frequent occlusions, we introduce the concept of dual-layer masks and design a target pose-tracking algorithm based on the occlusion relationship between the foreground and the target object. Firstly, a video segmentation network is applied to segment the target object region in the current frame; then, unlike the baseline method, the depth relationship is analyzed to compute the dual-layer mask of the current frame, effectively detecting foreground occluders while segmenting the target object. Secondly, a feature-matching network is used to roughly estimate the pose, followed by pose graph optimization with historical frames. During this process, as in BundleSDF, a neural object field is trained from multi-view images, recording the geometric appearance of the target to assist the subsequent optimization process. Finally, the tracking algorithm is optimized based on the occlusion relationship, using a novel adaptive key-frame maintenance strategy that detects changes in the strength of occlusion relationships as the basis for retaining effective frames in the system. The proposed method enables the model to accurately estimate the target pose under frequent occlusions. Comparative studies conducted by changing the baseline key-frame maintenance strategy demonstrate that the proposed method significantly improves the accuracy and robustness of tracking in scenarios with frequent occlusions.
Our main contributions can be summarized as follows:
  • We propose the concept of dual-layer masks and utilize depth relationships to compute the dual-layer mask of the current frame, which enables effective detection of foreground occluding objects while segmenting the target object.
  • We propose a novel key-frame selection strategy. This strategy detects changes in occlusion relationships and uses them as the basis for retaining effective frames in the system. Additionally, since this new strategy is highly efficient, our method can still achieve real-time performance.
  • Besides conducting experiments on the hand–object interaction dataset HO-3D [18], we also created some indoor interaction data for tracking experiments. The results of experiments on both the HO-3D dataset and our custom dataset demonstrate that our key-frame selection strategy significantly improves the robustness and accuracy of object tracking, leading to improved reconstruction results.

2. Methods

The overall process is illustrated in Figure 1. Given an input RGB-D stream $\{F_t\}_{t=1}^{N}$, along with a segmentation mask for the object of interest provided only in the first frame, our method tracks the 6-DoF pose of the object across subsequent frames and reconstructs a textured 3D model, even under conditions of severe occlusion. Our approach employs a neural object field, like BundleSDF [12], to represent the reconstructed object and, importantly, does not require instance-level CAD models of the object or category-level priors.

2.1. Neural Object Field

To extend the method to track arbitrary objects, a novel neural object field (NOF) [12] is introduced to enhance the optimization process. Unlike prior instance-level or category-level tracking methods, this approach eliminates the need for specific CAD models or geometric priors associated with object categories. The neural object field represents a neural signed distance field (SDF) centered on the object, which learns the object’s multi-view consistent 3D shape and appearance while optimizing the poses of memory frames. Since the model is trained independently for each video, it is capable of learning effectively for each new object without the need for pre-training.
Similar to BundleSDF [12], we apply multi-resolution hash encoding [20] to the input $x$ before passing it to the network. The surface normal at a point in the object field can be derived by taking the first-order derivative of the signed distance field, expressed as $n(x) = \frac{\partial \Omega(x)}{\partial x}$. This is implemented using automatic differentiation in PyTorch.
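As a concrete illustration of this derivative, the following minimal PyTorch sketch computes normals from a generic SDF network via automatic differentiation; `sdf_net` is a placeholder for any module mapping 3D points to signed distances, not the actual BundleSDF implementation.

```python
import torch

def sdf_normals(sdf_net, x):
    """Surface normals as the gradient of the signed distance field Omega
    with respect to the query points x (shape [N, 3])."""
    x = x.clone().requires_grad_(True)
    sdf = sdf_net(x)                      # [N, 1] signed distance values
    grad, = torch.autograd.grad(
        outputs=sdf,
        inputs=x,
        grad_outputs=torch.ones_like(sdf),
        create_graph=True,                # keep the graph so normals can enter later losses
    )
    return torch.nn.functional.normalize(grad, dim=-1)
```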
Given the object pose $\xi$ of a memory frame, an image is rendered by emitting rays through each pixel. The 3D points along each ray are sampled at different positions, defined as $x_i(r) = o(r) + t_i\, d(r)$, where $o(r)$ is the ray origin (i.e., the camera focal point) and $d(r)$ is the ray direction, both of which depend on the object pose $\xi$. The parameter $t_i > 0$ controls the position along the ray. The color $c$ of a ray $r$ is computed by integrating over the near-surface region:
$$c(r) = \int_{z(r)-\lambda}^{z(r)+0.5\lambda} \omega(x_i)\, f\big(x_i, n(x_i), d(x_i)\big)\, dx$$

$$\omega(x_i) = \frac{1}{\big(1 + e^{\alpha \Omega(x_i)}\big)\big(1 + e^{-\alpha \Omega(x_i)}\big)}$$
where $\omega(x_i)$ is the bell-shaped probability density function [21] that depends on the distance from the point to the implicit object surface, $z(r)$ is the depth reading of the pixel through which ray $r$ passes, and $\lambda$ is the truncation distance.
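In practice, the integral is approximated by a discrete weighted sum over the samples of each ray, as in the sketch below; the per-ray normalization of the weights and the value of $\alpha$ are practical assumptions rather than values taken from the paper.

```python
import torch

def render_ray_colors(sdf, colors, alpha=10.0):
    """Approximate the color integral c(r) with a discrete sum over ray samples.

    sdf:    [R, S] signed distances Omega(x_i) at S samples along R rays
    colors: [R, S, 3] appearance predictions f(x_i, n(x_i), d(x_i))
    alpha:  sharpness of the bell-shaped weight (illustrative value)
    """
    # Bell-shaped weight that peaks at the zero level set of the SDF
    w = 1.0 / ((1.0 + torch.exp(-alpha * sdf)) * (1.0 + torch.exp(alpha * sdf)))
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)   # normalize weights per ray
    return (w.unsqueeze(-1) * colors).sum(dim=-2)  # [R, 3] rendered ray colors
```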
To achieve efficient rendering, we adopt the ray sampling strategy proposed in [12]. Initially, we uniformly sample $N$ points along each ray within the occupancy voxel bounds and up to $z(r) + 0.5\lambda$. To further improve the quality of reconstruction, additional samples are distributed around the surface. Rather than employing importance sampling based on SDF predictions, which would require multiple forward passes through the network [22,23], we sample $N'$ points from a normal distribution centered on the depth reading, $\mathcal{N}(z(r), \lambda^2)$. This approach results in a total of $N + N'$ samples per ray, avoiding the computational overhead of additional queries to the multi-resolution hash encoding or the network itself.
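The following sketch illustrates this hybrid sampling scheme under simplifying assumptions: the occupancy-voxel test is omitted, uniform samples are drawn from the near plane up to $z(r) + 0.5\lambda$, and the sample counts are illustrative.

```python
import torch

def sample_ray_points(ray_o, ray_d, depth, lam, n_uniform=64, n_near=16):
    """Sample points along rays r(t) = o + t*d for volumetric rendering.

    ray_o, ray_d: [R, 3] ray origins and directions (object frame)
    depth:        [R]    depth reading z(r) for each ray
    lam:          truncation distance lambda
    """
    R = ray_o.shape[0]
    # Uniform samples between the camera and z(r) + 0.5*lambda
    t_far = (depth + 0.5 * lam).unsqueeze(-1)                                   # [R, 1]
    t_uniform = torch.rand(R, n_uniform, device=ray_o.device) * t_far
    # Extra samples drawn from N(z(r), lambda^2) around the observed surface
    t_near = depth.unsqueeze(-1) + lam * torch.randn(R, n_near, device=ray_o.device)
    t = torch.sort(torch.cat([t_uniform, t_near], dim=-1), dim=-1).values       # [R, N+N']
    return ray_o.unsqueeze(1) + t.unsqueeze(-1) * ray_d.unsqueeze(1)            # [R, N+N', 3]
```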

2.2. Key-Frame Pool Strategy Based on Dual-Layer Mask

In object pose tracking, most existing methods address environmental interference by applying masks to exclude all image content outside the target object, treating the mask’s boundary as the object’s edge. However, these methods often overlook foreground occlusions on the target object, potentially leading to biased pose estimations. To overcome this limitation, we propose a dual-layer masking approach. The first layer is generated by a segmentation network to isolate the background, while the second layer is computed by using depth information to mark foreground occlusion. This dual-layer mask effectively accounts for occlusions and provides a more accurate representation of the object. Once generated, the dual-layer mask is used to update the key-frame pool, further improving the accuracy of pose estimation.

2.2.1. Dual-Layer Mask Generation

To generate the first layer of the background mask, we utilize a video segmentation network to segment the object region in the current frame. For online segmentation of the target mask, we employ XMem [24], as it does not require prior knowledge about the target object or interactive objects (such as hands) and is adaptable to various scenes and objects. XMem effectively segments the object without the need for explicit annotations or scene-specific tuning.
The second layer, the foreground occlusion mask, is computed from depth information. In the current camera view, an occluded area must satisfy the condition that the depth of the occluding surface is smaller than that of the target object's surface. To achieve this, we first filter the depth values of the target object with a threshold set at three times the standard deviation to eliminate outliers caused by depth-sensor noise. We then compute the mean depth of the target object's surface and use it as a threshold to separate pixels with smaller depth values from the background. Next, we apply an erosion operation to the intermediate result to disconnect weakly connected components in the environment, retaining only the most probable foreground occlusion mask. Finally, starting from the edge of the target object mask, we search for the connected components adjacent to the target object mask, which represent the potential occlusion area. The results of this process are illustrated in Figure 2, and a sketch of the computation is given below.
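A minimal NumPy/OpenCV sketch of this second-layer computation, assuming a metric depth image and a binary target mask; the 3σ outlier filter, mean-depth threshold, erosion, and boundary-adjacency search follow the description above, while the kernel size and the exact connectivity handling are simplifications.

```python
import cv2
import numpy as np

def foreground_occlusion_mask(depth, target_mask, kernel_size=5):
    """Second-layer mask: candidate foreground occluders in front of the target.

    depth:       HxW depth image in meters (0 where invalid)
    target_mask: HxW uint8 binary mask of the target from the segmentation network
    """
    obj_depth = depth[(target_mask > 0) & (depth > 0)]
    # Discard outliers beyond three standard deviations (depth-sensor noise)
    mu, sigma = obj_depth.mean(), obj_depth.std()
    obj_depth = obj_depth[np.abs(obj_depth - mu) < 3 * sigma]
    mean_depth = obj_depth.mean()

    # Pixels closer to the camera than the target surface are occlusion candidates
    candidates = ((depth > 0) & (depth < mean_depth) & (target_mask == 0)).astype(np.uint8)

    # Erode to break weak connections with unrelated background structures
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    candidates = cv2.erode(candidates, kernel)

    # Keep only connected components adjacent to the target mask boundary
    num, labels = cv2.connectedComponents(candidates)
    boundary = (cv2.dilate(target_mask, kernel) > 0) & (target_mask == 0)
    occlusion = np.zeros_like(candidates)
    for lbl in range(1, num):
        comp = labels == lbl
        if (comp & boundary).any():
            occlusion[comp] = 1
    return occlusion
```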

2.2.2. Key-Frame Pool

In our approach, we use a key-frame pool to store historical information, as opposed to relying on a merged global model, which helps to reduce drift in long-term tasks [12]. The key idea behind the creation of the key-frame pool is to preserve observations of the target object from different viewpoints as comprehensively as possible. This allows for a more sparse allocation of memory frames in space, while still maintaining a sufficient level of multi-view consistency. However, under this strategy, the system tends to prioritize earlier observations and discard later ones from the same viewpoint, regardless of frame quality. This poses a risk for tracking in complex scenes. To address this issue, we introduce a dynamic memory pool, which retains a subset of historical frames for subsequent optimization and error correction. Additionally, we propose a more universally applicable maintenance strategy to ensure robustness across different scenarios.
The principle guiding the update of the key-frame pool is to capture as much object appearance information as possible while maintaining frame storage sparsity. The pose of the current frame, denoted as $\xi_t$, is updated accordingly. If the updated frame can maintain consistency with the other historical frames in the pool and provides information not currently present, it is added to the key-frame pool. This newly added frame then participates in the subsequent optimization processes and in the learning of the neural object field.
In scenarios with long-term frequent occlusion, maintaining a compact and accurate key-frame pool is essential for ensuring robustness against interference. To achieve this, we propose a key-frame pool maintenance strategy based on occlusion relationships, as outlined in Algorithm 1. The process starts with adding the first frame $F_0$ by default, which establishes a reference coordinate system for the new object. The next 10 frames are also added directly to the key-frame pool, allowing for a rapid understanding of the target object's appearance. After this initialization phase, each new incoming frame undergoes a coarse pose estimation and is subject to pose graph optimization with historically stored frames that share co-visibility relationships. This step updates the initial pose of the new frame. More specifically, the pose $\xi_t$ of the current frame is compared with the poses of existing frames in the key-frame pool. First, the optimization error function ensures that adding the new frame will not disrupt the consistency of the pool. Then, if the current frame provides observations from a new viewpoint or reveals previously occluded regions, it is added to the key-frame pool. For planar objects, rotations around the camera's optical axis do not yield new information and are thus ignored. This strategy helps distribute key frames more sparsely in space while preserving the essential multi-view consistency information.
Algorithm 1 Self-Adaptive Key-Frame Pool Maintenance Strategy
Input: Current frame $F_t$, frame pool $P$.
Output: Updated frame pool $P$.
1. Initialize the frame pool $P$ with the first frame $F_0$ and set its canonical coordinate system.
2. Compare $F_t$ with the frames in $P$ to update its initial pose $\xi_t$.
3. if the observation of $F_t$ comes from a new viewpoint or reveals previously occluded areas then
4.   Add $F_t$ to $P$.
5. end if
6. if the occlusion relationship of $F_t$ has changed then
7.   if the foreground occlusion area of $F_t$ is less than that of the last frame or of the nearest key frame then
8.     Add $F_t$ to $P$.
9.     Replace the frame in $P$ that has the same viewing angle as $F_t$ but a larger foreground occlusion area with $F_t$.
10.   end if
11. end if
When the occlusion relationships of the current frame $F_t$ change, the system employs an adaptive update strategy. If the foreground occlusion area in $F_t$ is smaller than in the previous frame or the nearest key frame, it suggests that the occluding object may have moved, resulting in reduced occlusion of the target object. In other words, compared to $F_{t-1}$, $F_t$ may capture more useful information, prompting the system to add this frame to the key-frame pool. Additionally, if the viewing angle $\xi_t$ of the current frame $F_t$ matches the viewing angle $\xi_i$ of a historical frame $F_i$ in the pool, but the foreground occlusion area in $F_t$ is smaller, the system replaces $F_i$ with $F_t$. Beyond the weakening of occlusion, movement of the occluding object is also considered. This study calculates the difference between the foreground masks of consecutive frames. The difference mask, which is used to calculate occlusion changes, is illustrated in Figure 3. Figure 3a displays the original RGB image. To identify changes, the foreground masks of two frames are superimposed (Figure 3b), and the overlapping areas are removed. The remaining regions (Figure 3c) indicate the relative change in foreground occlusion between the two frames. Next, the non-zero elements and their ratio are computed and compared against a threshold to determine whether there has been a significant movement or change in the foreground occluding object. If a significant change is detected, the new frame is added to the key-frame pool, as sketched below.
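A sketch of this difference test; how the non-zero count is normalized (here, by the union of the two masks) and whether the two thresholds are combined with a logical OR are assumptions, with the threshold values taken from the hyperparameter study in Section 3.5.3.

```python
import numpy as np

def occlusion_changed(fg_mask_prev, fg_mask_curr,
                      pct_threshold=0.05, abs_threshold=500):
    """Detect a significant change of the foreground occluder between two frames.

    fg_mask_prev, fg_mask_curr: HxW binary foreground-occlusion masks
    pct_threshold, abs_threshold: values from the hyperparameter study (Table 4)
    """
    # Superimpose the two masks and drop the overlapping area (symmetric difference)
    diff = np.logical_xor(fg_mask_prev > 0, fg_mask_curr > 0)
    changed_pixels = int(diff.sum())
    union = int(np.logical_or(fg_mask_prev > 0, fg_mask_curr > 0).sum())
    ratio = changed_pixels / max(union, 1)
    # Either threshold firing is taken as a significant occluder movement
    return changed_pixels > abs_threshold or ratio > pct_threshold
```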
By utilizing the sensitivity to occlusion changes provided by the dual-layer masks, the system quickly supplements valid information in the scene and replaces low-quality key frames affected by occlusion.

2.3. Optimization

2.3.1. Two-Stage Pose Optimization

Pose estimation is approached in two stages, similar to the method used in BundleSDF [12]. The first stage involves estimating a coarse pose using random sample consensus (RANSAC), while the second stage refines this pose through pose graph optimization.
In the first stage, feature matching is performed between two frames to obtain an initial pose estimate. Specifically, for the video segmentation results, a transformer-based feature matching network [25] is employed. This network establishes feature correspondences between the current frame $F_t$ and the previous frame $F_{t-1}$ using RGB images, from which the relative change in object pose between $F_t$ and $F_{t-1}$ is approximately calculated. The RGB feature correspondences are filtered with RANSAC [26] and a least-squares fit [27]. The coarse pose estimate $\xi_t$ for the current frame is determined by selecting the hypothesis that maximizes the number of inliers, as sketched below.
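The sketch below illustrates the general RANSAC-plus-least-squares pattern for this stage on 3D–3D feature correspondences, using the Kabsch solution for the rigid fit; it is not the paper's exact implementation, and the iteration count and inlier threshold are illustrative.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) aligning src to dst (both Nx3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def ransac_pose(src, dst, iters=500, inlier_thresh=0.01):
    """Coarse relative pose from matched 3D points (simplified RANSAC + least squares)."""
    best_R, best_t = np.eye(3), np.zeros(3)
    best_inliers = np.zeros(len(src), bool)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)   # minimal sample
        R, t = kabsch(src[idx], dst[idx])
        err = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inliers = err < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refine on all inliers of the best hypothesis
    if best_inliers.sum() >= 3:
        best_R, best_t = kabsch(src[best_inliers], dst[best_inliers])
    return best_R, best_t, best_inliers
```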
In the second stage, after obtaining frame $F_t$ and its coarse pose estimate $\xi_t$, $K$ frames are selected from the memory pool for pose graph optimization, and the optimized pose is output as the final estimate. These $K$ frames form the subset $P_{pg} \subset P$ of key frames that have the maximum observation overlap with $F_t$. In the pose graph $G = (V, E)$, the nodes consist of $F_t$ and the selected subset of key frames, such that $V = \{F_t\} \cup P_{pg}$ and $|V| = K + 1$. The objective is to determine the poses that minimize the total loss of the pose graph, given by:
$$L_{pg} = \omega_s L_s(t) + \sum_{i \in V,\, j \in V,\, i \neq j} \big[ \omega_f L_f(i, j) + \omega_p L_p(i, j) \big]$$
where $L_f$ and $L_p$ denote pairwise edge losses [17], and $L_s$ represents an additional unary loss, which is detailed below.
The loss functions $L_f$ and $L_p$ used in the pose graph optimization are defined as follows. $L_f$ measures the Euclidean distance between RGB-D feature correspondences $p_m$ and $p_n$ in $\mathbb{R}^3$, where $\xi_i$ denotes the object pose in frame $F(i)$ and $\rho$ is the robust Huber loss [28]. Specifically:
$$L_f(i, j) = \sum_{(p_m, p_n) \in C_{i,j}} \rho\big( \big\| \xi_i^{-1} p_m - \xi_j^{-1} p_n \big\|_2 \big)$$
Here, $T_{ij}$ denotes the transformation from frame $F(i)$ to frame $F(j)$, given by $\xi_i \xi_j^{-1}$; $\pi_j$ represents the perspective projection onto the image $I_j$ associated with $F(j)$; $\pi_{D_j}^{-1}$ is the inverse projection mapping used to lift pixel positions in the depth image $D_j$ back to 3D; and $n_i(p)$ is the normal vector of point $p$ in the point cloud corresponding to image $I_i$. The dense point-to-plane reprojection loss $L_p$ is given by:
$$L_p(i, j) = \sum_{p \in I_i} \rho\Big( n_i(p) \cdot \big( T_{ij}^{-1}\, \pi_{D_j}^{-1}\big( \pi_j(T_{ij}\, p) \big) - p \big) \Big)$$
In addition, we integrate object information from the neural object field (NOF) to jointly optimize the pose for improved accuracy. The unary loss $L_s$ associated with the NOF is incorporated into the pose graph optimization process:

$$L_s(t) = \sum_{p \in I_t} \rho\Big( \Omega\big( \xi_t^{-1}\, \pi_D^{-1}(p) \big) \Big)$$

Here, $\Omega(\cdot)$ denotes the signed distance function of the neural object field. This loss quantifies the alignment between the neural implicit shape and the observed frames. Note that $L_s$ is considered only after the initial training of the neural object field has converged, and the NOF parameters are fixed during this optimization step.
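To make the edge terms concrete, the following PyTorch sketch evaluates the feature loss $L_f$ for one pair of frames; the poses are assumed to be 4 × 4 matrices mapping object coordinates to camera coordinates, and the Huber threshold is illustrative.

```python
import torch

def huber(x, delta=0.01):
    """Robust Huber penalty applied to residual norms (delta in meters, illustrative)."""
    abs_x = x.abs()
    return torch.where(abs_x < delta, 0.5 * abs_x ** 2, delta * (abs_x - 0.5 * delta))

def feature_loss(xi_i, xi_j, p_m, p_n):
    """L_f: distance between correspondences mapped into the object frame.

    xi_i, xi_j: 4x4 object poses of frames i and j (assumed object-to-camera convention)
    p_m, p_n:   Nx3 matched 3D points expressed in the respective camera frames
    """
    def to_object(xi, p):
        T = torch.linalg.inv(xi)                  # xi^{-1}: camera frame -> object frame
        return p @ T[:3, :3].T + T[:3, 3]
    residual = (to_object(xi_i, p_m) - to_object(xi_j, p_n)).norm(dim=-1)
    return huber(residual).sum()
```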

2.3.2. Neural Object Field Optimization

At the beginning of each training cycle, the neural object field retrieves all memory frames from the memory pool and begins learning. When training converges, the optimized poses are written back to the memory pool to assist subsequent online pose graph optimization. To optimize the neural object field, we adopt the loss functions from BundleSDF [12]. The trainable parameters include the multi-resolution hash encoder, the networks $\Omega$ and $\Phi$, and the object pose updates within the tangent space, parameterized in the Lie algebra as $\delta\bar{\xi} \in \mathbb{R}^{(|P|-1) \times 6}$, where the pose of the first memory frame is fixed as an anchor.
The uncertain free-space loss $L_u$ is designed to ignore rays passing through segmentation masks or rays with uncertain depth. By over-predicting a small positive value, the model can quickly converge once reliable observations become available. The empty-space loss $L_e$ trains the model to predict the truncation distance for points far from the surface, ensuring that distant points are handled correctly. The near-surface loss $L_{surf}$ ensures that points near the surface predict the signed distance function (SDF) correctly.
$$L_u = \frac{1}{|X|} \sum_{x \in X} \big( \Omega(x) - \epsilon \big)^2$$

$$L_e = \frac{1}{|X|} \sum_{x \in X} \big| \Omega(x) - \lambda \big|$$

$$L_{surf} = \frac{1}{|X_{surf}|} \sum_{x \in X_{surf}} \big( \Omega(x) + d_x - d_D \big)^2$$
The color loss $L_c$ is used to learn appearance, and the Eikonal regularization [29] $L_{eik}$ is applied to the SDF in near-surface space:
$$L_c = \frac{1}{|R|} \sum_{r \in R} \big\| c(r) - \hat{c}(r) \big\|_2$$

$$L_{eik} = \frac{1}{|X_{surf}|} \sum_{x \in X_{surf}} \big( \| \nabla \Omega(x) \|_2 - 1 \big)^2$$
The total loss is
$$L = w_u L_u + w_e L_e + w_{surf} L_{surf} + w_c L_c + w_{eik} L_{eik}$$
and we set all w to 1 empirically.
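A compact PyTorch sketch of these terms is given below; the sample sets, the small positive target $\epsilon$ for uncertain space, and the tensor shapes are assumptions consistent with the definitions above, not the exact implementation.

```python
import torch

def nof_losses(sdf_u, sdf_e, sdf_surf, d_x, d_D, grad_surf, c_pred, c_obs, lam, eps=0.01):
    """Training losses of the neural object field (all weights w set to 1, as in the paper).

    sdf_u, sdf_e, sdf_surf: SDF predictions at uncertain, empty-space and near-surface samples
    d_x, d_D:               sample depth along the ray and observed depth reading
    grad_surf:              SDF gradients at near-surface samples (for the Eikonal term)
    c_pred, c_obs:          rendered and observed ray colors
    lam:                    truncation distance lambda; eps is an illustrative small target
    """
    L_u    = ((sdf_u - eps) ** 2).mean()                      # uncertain free space
    L_e    = (sdf_e - lam).abs().mean()                       # empty space -> truncation value
    L_surf = ((sdf_surf + d_x - d_D) ** 2).mean()             # near-surface SDF supervision
    L_c    = (c_pred - c_obs).norm(dim=-1).mean()             # color / appearance
    L_eik  = ((grad_surf.norm(dim=-1) - 1.0) ** 2).mean()     # Eikonal regularization
    return L_u + L_e + L_surf + L_c + L_eik
```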

3. Experiments

3.1. Datasets

To train and comprehensively evaluate the proposed method, we utilize not only the HO-3D dataset but also data specifically recorded with RealSense cameras.
HO-3D [18] is a dataset containing RGB-D videos of interactions between human hands and YCB objects [30] captured at close range using RealSense cameras. The ground truth is generated through multi-view registration. We used the latest version, HO-3D_v3, and conducted testing on the official evaluation set. It includes four different objects, 13 video sequences, and a total of 20,428 frames.
While the HO-3D dataset is the most widely used dataset for object pose-tracking research, it contains relatively few instances in which the target object is blocked by moving occluders. Therefore, we captured RGB-D video data using RealSense cameras, comprising 6 different objects, 12 video sequences, and a total of 10,265 frames. Videos such as video_switch, video_firm_cushion, and video_box involve a large amount of interaction with the objects, and the surface textures of the objects are weak. To test the robustness of the method when occlusions change frequently and when the object's own texture information is unreliable, we also recorded a long-duration RGB-D video: video_cup. In this video, the target object is fully exposed in the frame for only 20% of the time, with less than 50% of its surface visible for the rest of the time. These data assess whether the method can rapidly and accurately perceive changes in occlusion relationships, maximize the use of limited information, and make reliable pose predictions. Finally, to highlight the method's ability to withstand moving occlusions, we also recorded a video, video_boxes, in which the camera and the target object remain relatively stationary while occluding objects move randomly between them. While ground-truth poses are not available for these data, the target objects can be fully reconstructed in advance via scanning. Therefore, to provide a comprehensive and objective analysis of the results, this section focuses on evaluating the quality of the reconstructed models.

3.2. Experimental Setup

This section introduces the key parameter settings used in the experiments. In the pose graph optimization, the overall loss is composed of three terms, $L_f$, $L_p$, and $L_s$, with equal weights set to 1 in the experiments. Additionally, many decisions in the algorithm involve hyperparameters, among which the values of the three coefficients related to key-frame addition are discussed in the ablation experiments. The values shown in Table 1 represent the best parameter combination in the experimental environment. In the subsequent comparisons with BundleSDF, we used the public implementation of BundleSDF. For our method, we applied the improvements over the baseline described in Section 2 and used a new set of hyperparameters to achieve the best performance. The impact of these hyperparameters is discussed in Section 3.5.3.

3.3. Evaluation Indices

We mainly evaluate the results from two aspects: pose estimation and shape reconstruction. For the evaluation of the estimated object pose, we calculate the percentage of the area under the curve (AUC) for the ADD and ADD-S metrics:
$$\text{ADD} = \frac{1}{|M|} \sum_{x \in M} \big\| (Rx + t) - (\tilde{R}x + \tilde{t}) \big\|_2$$

$$\text{ADD-S} = \frac{1}{|M|} \sum_{x_1 \in M} \min_{x_2 \in M} \big\| (Rx_1 + t) - (\tilde{R}x_2 + \tilde{t}) \big\|_2$$
where $M$ is the set of model points, $(R, t)$ is the ground-truth pose, and $(\tilde{R}, \tilde{t})$ is the estimated pose. We use the ground-truth pose of the first frame to define the canonical coordinate system of each video for pose evaluation.
For the evaluation of 3D shape reconstruction, we measure the similarity between the generated mesh and the ground-truth model using the Chamfer distance. The Chamfer distance between two point clouds $M_1$ and $M_2$ is calculated as:
$$d_{CD} = \frac{1}{|M_1|} \sum_{x_1 \in M_1} \min_{x_2 \in M_2} \| x_1 - x_2 \|_2 + \frac{1}{|M_2|} \sum_{x_2 \in M_2} \min_{x_1 \in M_1} \| x_1 - x_2 \|_2$$
Specifically, we first uniformly sample the model to obtain a 3D point cloud and then calculate the Chamfer distance between the two point clouds. For all methods, we use the same resolution (5 mm) to sample points for evaluation. For our self-collected dataset, due to the lack of ground truth poses, we focus on using the reconstructed results for evaluation. In the framework used in this paper, the accuracy of reconstruction and tracking are interdependent and complementary. Correct understanding of geometric shapes often stems from accurate pose estimation, and accurate modeling simplifies the tracking task.
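For reference, the per-frame metrics can be computed with a brute-force sketch such as the following (the reported AUC percentages are then accumulated over these per-frame distances); the pairwise-distance matrices are fine for point clouds sampled at 5 mm resolution but memory-hungry for very dense clouds.

```python
import numpy as np

def add_metric(model_pts, R, t, R_gt, t_gt):
    """ADD: mean distance between model points under estimated and ground-truth poses."""
    pred = model_pts @ R.T + t
    gt   = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def add_s_metric(model_pts, R, t, R_gt, t_gt):
    """ADD-S: symmetric variant using the closest ground-truth point for each prediction."""
    pred = model_pts @ R.T + t
    gt   = model_pts @ R_gt.T + t_gt
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean()

def chamfer_distance(pts1, pts2):
    """Symmetric Chamfer distance between two sampled point clouds."""
    d = np.linalg.norm(pts1[:, None, :] - pts2[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```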

3.4. Results and Analysis

3.4.1. HO-3D Dataset

This section continues to visualize the pose estimation results by converting them into object bounding boxes in the images. As shown in Figure 4, the red bounding boxes represent the predicted results of our method, while the green bounding boxes represent the annotated ground truth poses. By comparing the similarity between the two, the effectiveness of the pose estimation can be directly observed. In terms of qualitative analysis, the projected bounding boxes of our method’s estimation results closely match the ground truth bounding boxes, indicating a high degree of overlap. This suggests that the predicted results of our method accurately reflect the true pose of the target object in the current frame, providing effective pose information.
The results of estimating each object in the dataset are presented in this section, with efforts made to select results with different angles and significant motion between frames in chronological order. However, due to space constraints, the section divides the results into time segments, extracting results from different time intervals rather than focusing on a specific time period to ensure a comprehensive display of the entire tracking process. From the results, it is evident that our method accurately predicts the current pose of the target object regardless of its orientation. Furthermore, there is no apparent decrease in tracking accuracy over time, indicating that our method overcomes the drift problem commonly encountered in long-term tracking. Particularly noteworthy are frames from video SM1 (Figure 4a) and video MPM14 (Figure 4b), where a large portion of the object’s surface is occluded by interacting objects (human hands), yet our method still makes correct predictions. This is because, at this point, the algorithm has detected the presence of foreground occlusion, and the key-frame addition strategy becomes more proactive in evaluating each frame. Additionally, it continuously monitors changes in foreground occlusion, responding to sudden decreases in the visible portion of the object’s surface. When such situations occur, they leverage information from key frames in similar viewpoints to assist pose optimization, as there is usually more usable texture in similar viewpoints. Moreover, thanks to the modeling of the target’s geometric shape during tracking, even when tracking objects with weak texture or smooth surfaces, such as AP12 (Figure 4c) and SB11 (Figure 4d), which have uniform colors and smooth surfaces, making textureless areas extensive and unfavorable for key-point matching, our method still maintains stable tracking under these circumstances.
In terms of quantitative evaluation, this study employs ADD, ADD-S [12], and Chamfer distance (CD) as evaluation indices, tabulating the results of target pose estimation for all frames in the HO-3D dataset, as shown in Table 2. Our method achieves 92.56% and 96.51% on the ADD and ADD-S metrics, respectively, with a CD of only 0.50. Performance across all three metrics surpasses BundleSDF and significantly exceeds NICE-SLAM and SDF-2-SDF. Specifically, on the HO-3D dataset, NICE-SLAM achieves an ADD-S of only 25.68, while SDF-2-SDF performs slightly better, reaching 34.88; both are less than half of our method's score.

3.4.2. Self-Made Indoor Interaction Dataset

To further validate the generalization of our method to various objects and its ability to track object poses during interaction actions in indoor scenes from both first-person and third-person perspectives, experiments on object tracking were conducted on a self-made indoor interaction dataset. Partial results are shown in Figure 5. Overall, the sizes of the bounding boxes match the true scales of the target objects, and the bounding boxes fit closely with the target objects, accurately reflecting their orientations. When horizontally observing the tracking results of each object over time, the motion of each object includes but is not limited to translation, rotation, occlusion by hands or other objects, and so on. The bounding boxes predicted by our method accurately reflect the poses of the target objects regardless of how they move or rotate. Vertically, various objects are distinguished by their sizes, textures, surface materials, etc., and our method successfully adapts to different objects, demonstrating its excellent generalization performance once again.
We employ the open-source software Visual SfM 0.5.26 for pre-reconstruction of the tracked objects from multiple viewpoints. The correctly reconstructed results serve as the ground truth for this experiment and are used for calculating the Chamfer distance. A comparison between the reconstruction results of our method and BundleSDF is shown in Table 3, following the same order as Figure 5. From the data in the table, it can be observed that, for all objects, our method outperforms the comparative method BundleSDF, with an average reconstruction loss less than half that of BundleSDF. In particular, for the reconstruction of the switch, our method achieves a loss of only 0.21, significantly better than BundleSDF's 1.58. However, it is worth mentioning that neither method achieves a satisfactory reconstruction of the airpods. Objectively, this is due to the small size of the airpods, which further reduces the visible area when handheld, and their smooth surface, which lacks any usable texture. Nevertheless, even in such circumstances, our method maintained tracking throughout the continuous 475-frame video.

3.5. Comparison and Analysis

For a detailed discussion on the impact of occlusion on object pose tracking, this section sets up two sets of comparative experiments for analysis and validation.

3.5.1. Comparative Experiments on Moving Occluders

To verify the effectiveness of the occlusion detection algorithm proposed in our research, this section conducts a comparative experiment. The experiment contrasts our method with the state-of-the-art method in pose tracking algorithms—BundleSDF. The experimental setup involves a stationary tracking target (red box) relative to the camera, while a white box is consistently present in the scene, positioned between the camera and the target object, serving as an occluder. Additionally, this occluder intermittently moves, altering the occlusion relationship between the two objects. It is important to note that throughout the entire recording process, the target object (red box) remains occluded, with no frame showing the front of the target object fully exposed in the scene.
The experimental results are obtained under the optimal parameter settings and represent the final output of the optimization process, as shown in Figure 6. The left side visualizes the tracking results (yellow boxes represent the target object boundary boxes predicted by the method), while the right side displays rendered views of the reconstructed models from two angles. BundleSDF and our method’s results are arranged in corresponding pairs. As the tracking progresses, the occluder (white box) transitions from a vertical to a horizontal orientation and undergoes random lateral movements. Firstly, in terms of discerning the motion of the target object, BundleSDF is notably influenced by the foreground object (white box) in the third image, resulting in a significant deviation of the boundary box as the foreground object moves to the left of the frame. In contrast, our method maintains overall stability. Additionally, throughout the tracking process, there are significant differences in the positions of the boundary boxes drawn by the two methods. Specifically, the boundary boxes drawn by BundleSDF are not parallel to the plane where the target object resides in the scene; instead, they exhibit a certain degree of tilt and fail to accurately reflect the pose of the target object. In contrast, our method consistently maintains boundary boxes tightly around the target object throughout the tracking process, unaffected by the movement of the occluder.
From the reconstruction results, since the target object (red box) only appears frontally in the video data, ideally, the reconstruction result should be a rectangular plane. When viewed from the front, BundleSDF’s reconstruction result outlines a rectangular plane overall, but there is a hole in the middle of the rectangle, occupying approximately 20% of the entire front face, and the shape of the hole is rectangular. Due to the occlusion by the white box, the area where the hole is located is not captured by the camera for a considerable period. Additionally, due to the factor mentioned earlier (segmentation leading to unclear geometric boundaries), the method mistakenly considers the boundary of the occluder as the geometric boundary of the target object, resulting in the formation of the hole. When viewed from the side, BundleSDF’s reconstruction result is not a flat plane but rather, from the waist of the plane, there is an additional plane connected to the main body, forming a Y shape. Combined with the tracking process, the reason for this situation is that during the movement of the occluder, due to the inaccurate segmentation result, the key points used to calculate the relative pose change are sampled onto the moving occluder, causing the model to think that the target object is in motion. This incorrect calculation of the object’s movement (which is actually stationary but misinterpreted as moving) leads to a bifurcation in the model.
The method described in our research does not suffer from the aforementioned issues. From the tracking results, regardless of how the foreground occluder moves, there is no interference, and the pose of the target object is accurately predicted throughout the process. This is also confirmed by the reconstruction results, which show a complete rectangle without any holes on the front side and a smooth surface, consistent with the original shape of the target object. When viewed from the side, there are no layers or additional structures, completely matching the ideal reconstruction results. This indicates that the occlusion-aware pose-tracking method described in this paper has the ability to overcome interference from moving occluders and can continuously and stably track the pose of the target object even with limited segmentation accuracy.

3.5.2. Frequent Interaction Comparative Experiment

Different from the previous experiment where the structure was fixed and the motion was uniform, in practical applications such as AR and robotics, interactions with complex structures like human limbs and robotic arms from a first-person perspective are more common. Therefore, in this section, we mainly discuss the effectiveness of our method in such complex scenarios. The data used in this section include video_switch and video_cup, both of which contain challenging situations representative of pose estimation tasks.
The characteristics of video_switch are as follows:
  • The object has a smooth black screen on the front, which lacks texture information and also exhibits specular reflection, causing significant interference with feature extraction and matching.
  • Over 90% of the video time involves interactions with human hands, resulting in prolonged and random occlusions.
  • Rapid rotational motion: In the video, the object undergoes full-angle rotations to showcase its entire appearance, including the front and back. Each rotation action is completed within one to two seconds, spanning no more than 50 frames (at a frame rate of 30).
The tracking results are shown in Figure 7. Although there is no ground truth bounding box to directly calculate the 3D IoU for evaluating pose prediction results, we can still assess the correctness of the prediction results based on the closeness and alignment between the bounding boxes and the target objects. In terms of closeness, our method achieves seamless alignment between the bounding boxes and the surface of the target objects. However, BundleSDF is affected by frequent interactions and fails to accurately lock onto the target object, resulting in erroneous spatial predictions and causing the bounding boxes to completely deviate from the target object. Regarding alignment, as seen in Figure 7b, the orientation of the bounding boxes predicted by our method consistently matches the current orientation of the objects, indicating that our method can accurately estimate the rotation of the target objects. In contrast, for the comparative method (Figure 7a), while the orientations of the bounding boxes in the first two rows generally represent the orientations of the target objects, the method loses track of the target when the object flips to the other side. The bounding boxes do not adjust their orientation with the rotation of the target object, leading to errors.
Figure 8 presents observations of the reconstruction results for the video_switch dataset, comparing the two methods from three different angles. From a frontal view, the mesh reconstructed by our method essentially reproduces the shape of the target object (the switch console) and preserves some details, including the joysticks and the boundary between the controllers and the screen. In contrast, the comparative method exhibits some extraneous structures, and there is a hole near the left joystick, barely discernible as the outline of the object, resulting in poor performance. From side and top views, the deficiencies of BundleSDF are even more pronounced. The reconstructed mesh contains numerous structures that do not belong to the target object. This may be attributed to the reliance of reconstruction on accurate pose estimation. Repeatedly projecting the depth image onto incorrect poses can incorrectly assemble the object surface (for example, the plane extending from the bottom of the model in view 3 is formed by incorrectly stitching the screen of the target). While our method’s reconstruction also includes a small amount of extraneous structure, it overall preserves the appearance of the original object, further confirming the accuracy of our method in object tracking.
The characteristics of the video_cup dataset are as follows:
  • The target object is a red cup with a smooth outer surface and no specific texture. The cup is symmetrical.
  • Throughout the entire process, the target object is never fully exposed in the frame (even when taken out from behind the occluding object, there are also fingers blocking).
  • The video includes rapid movements of the target object being quickly taken out from behind the occluding object (the switch gaming machine) and promptly placed back.
Additionally, camera movement is present. Although camera movement also produces the relative pose changes discussed in this paper, it more often leads to image blur, as depicted in Figure 9.
The tracking and reconstruction results are shown in Figure 10. The evaluation of the pose tracking result is based on the proximity and alignment of the bounding boxes. Firstly, for the tracking results obtained with BundleSDF, there is a noticeable error in estimating the object size, as the bounding box only encompasses the upper half of the cup. Combined with the reconstruction results, it can be inferred that this error occurred when the cup was taken out from behind the occluder, and the method failed to capture the lower half of the cup in time, resulting in an incorrect judgment of the overall shape of the target object. Our method overcomes this difficulty well, even though the lower half of the cup only briefly appears. The change in occlusion allows the method to perceive the existence of more parts of the target object in a timely manner, indicating the need to add key frames to improve the completeness of the reconstruction. This conclusion is corroborated by the reconstruction results: the overall contour of the cup reconstructed by our method conforms to the tall and slender cup shape, and the notch appearing in the middle is due to occlusion by the fingers during the handheld process.
Image blur caused by fast camera motion or shaking is an occasional event; however, it is often in exactly such situations that object tracking methods are most needed to provide stable and effective pose estimation results. In Figure 9, three frames with significant motion are selected to compare the robustness of the two methods under these conditions. The first column on the left shows the tracking results of BundleSDF, and the second column shows a local magnification of each frame. From the images, it can be observed that all three frames exhibit varying degrees of blurring. The combination of intense motion and blurry images directly leads to incorrect pose estimation by BundleSDF, and even complete deviation, resulting in tracking failure. Under these conditions, our method also faces significant challenges. In the first and third frames, the predicted pose deviates to the left, and in the second frame, the target's orientation is judged incorrectly. Overall, although there is some fluctuation, the bounding boxes still enclose the target object, indicating that the method provides relatively reasonable predictions even under these conditions.

3.5.3. Hyperparameters Effects

To validate the effectiveness of the occlusion-based key-frame detection algorithm proposed in our method and ensure that the method achieves optimal performance under the current parameter settings, the following experiment is designed. This experiment mainly adjusts three coefficients: the sensitivity coefficient used to determine whether the occluded area of the current frame has significantly decreased, the percentage threshold used to determine whether there is a significant change in the occluded area of the current frame, and the absolute threshold.
Sensitivity Coefficient: This coefficient is used to determine whether the occluded area of the current frame has significantly decreased. It specifies the threshold reduction percentage between the occluded area of the current frame and the occluded area of the previous key frame, below which the current frame will be considered as a new key frame.
Percentage Threshold: This threshold is used to determine whether the difference between the occluded area of the current frame and that of the previous key frame is significant. If the difference exceeds a certain percentage, then the current frame is considered as a new key frame.
Absolute Threshold: Similar to the percentage threshold, the absolute threshold is used to determine whether the difference between the occluded area of the current frame and that of the previous key frame is significant. However, in this case, it is determined by the total sum of non-zero pixels in the difference image. If this sum exceeds a certain absolute threshold, then the current frame is considered as a new key frame.
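A sketch of the combined decision rule follows; the exact way the sensitivity coefficient is applied (here, as a multiplicative factor on the reference occlusion area) is our interpretation, and the default values correspond to the best combination (0.8, 0.05, 500) found in the experiments below.

```python
def should_add_keyframe(area_curr, area_ref, diff_pixels, total_pixels,
                        sensitivity=0.8, pct_threshold=0.05, abs_threshold=500):
    """Decision rule combining the three coefficients for occlusion-based key-frame addition.

    area_curr, area_ref: foreground-occlusion areas of the current frame and the reference key frame
    diff_pixels:         non-zero pixels of the foreground mask difference
    total_pixels:        number of pixels used to normalise the difference
    """
    # Occlusion clearly weakened: the occluded area dropped below a fraction of the reference
    if area_curr < sensitivity * area_ref:
        return True
    # Occluder moved significantly: either threshold on the mask difference fires
    if diff_pixels > abs_threshold or diff_pixels / max(total_pixels, 1) > pct_threshold:
        return True
    return False
```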
In the experimental setup, the three factors are condensed into two. Because the interaction between the percentage threshold and the absolute threshold is minimal and the two thresholds act as double insurance in the system, their specific values can be treated as a single factor and are presented as combinations. This design allows multiple factors and levels to be considered simultaneously, even within a relatively small experimental scale, so that each factor and its levels are adequately covered and the interactions between parameters can be explored effectively. The system systematically tests the effects of different parameter combinations to find the optimal one. The experiment adjusts the parameters with a fixed step length, adjusting the absolute threshold and the percentage threshold synchronously. The results are shown in Table 4, where the horizontal direction represents different values of the sensitivity coefficient, and the vertical direction represents combinations of the percentage threshold and the absolute threshold.
The experiment comprises a total of nine groups, evaluating the accuracy of the method under each parameter setting to ultimately determine the optimal parameter combination as (0.8, 0.05, 500), where the sensitivity coefficient is set to 0.8, the percentage threshold is 0.05, and the absolute threshold is 500. Since the experimental results are based on limited data, with image formats all being 480 × 640 and most scenes being indoor environments, and with the target object occupying at least 20% of the frame, adjustments to the parameter settings should be made accordingly if these conditions change.

4. Conclusions and Discussion

With the decreasing cost of 3D data acquisition devices, tasks relying on depth data for object pose estimation and tracking have gained broad prospects for development and industrial applications. However, they also face challenges such as complex and diverse environments, occlusions caused by object clustering, and segmentation difficulties. Our research focuses on the problem of tracking occluded objects and proposes a method based on occlusion relationship analysis, which significantly reduces the negative impact of foreground occlusion on object pose estimation. We propose a pose-tracking framework based on occlusion relationships for object pose-tracking tasks in complex environments. By analyzing the pain points of existing research, which struggle with pose estimation in complex environments due to occlusions and inaccurate segmentation, the concept of dual-layer masks is introduced. Depth information is used to detect changes in occlusion relationships, which are then used to maintain a pool of key frames. The input RGB-D images undergo a series of steps, including segmentation masking, rough pose estimation, and pose graph optimization. Additionally, a neural object field is trained during this process, which assists in the pose graph optimization after convergence. In the experimental section, the proposed method demonstrates accuracy comparable to the BundleSDF method on the HO-3D dataset. Furthermore, through targeted self-made datasets, it is shown that even in scenarios with moving occlusions, frequent interactions, and significant motion blur, the proposed method maintains algorithm robustness and pose estimation accuracy.
It is important to acknowledge that there are still some limitations and areas for improvement in the current work. Our method focuses on the pose tracking of individual objects. However, when multiple objects are present in an image, the current approach segments individual objects using masks and independently calculates the pose of each object without fully utilizing the correlation information between objects. In real world scenarios, adjacent objects often have certain correlations; they may be located on the same plane or follow a certain logic, such as different objects in the same space–time not overlapping with each other. These correlations can not only help constrain and optimize the pose of each object but also elevate the isolated prediction of individual objects to an understanding of the entire 3D scene.

Author Contributions

Conceptualization, X.Z. and Y.Z. (Yuekun Zhang); methodology, Y.Z. (Yuekun Zhang); software, Y.Z. (Yuekun Zhang); validation, Y.Z. (Yuekun Zhang); formal analysis, Y.Z. (Yuekun Zhang); investigation, Y.Z. (Yuekun Zhang); resources, X.Z.; data curation, Y.Z. (Yuekun Zhang); writing—original draft preparation, Y.Z. (Yaqing Zhou); writing—review and editing, X.Z. and Y.Z. (Yuekun Zhang); visualization, Y.Z. (Yuekun Zhang); supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (U23A20312, 62072366).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bousmalis, K.; Irpan, A.; Wohlhart, P.; Bai, Y.; Kelcey, M.; Kalakrishnan, M.; Downs, L.; Ibarz, J.; Pastor, P.; Konolige, K.; et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 4243–4250. [Google Scholar]
  2. James, S.; Wohlhart, P.; Kalakrishnan, M.; Kalashnikov, D.; Irpan, A.; Ibarz, J.; Levine, S.; Hadsell, R.; Bousmalis, K. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12627–12637. [Google Scholar]
  3. Morrison, D.; Corke, P.; Leitner, J. Closing the Loop for Robotic Grasping: A Real-time, Generative Grasp Synthesis Approach. arXiv 2018, arXiv:1804.05172. [Google Scholar]
  4. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2019, 37, 362–386. [Google Scholar] [CrossRef]
  5. Levinson, J.; Askeland, J.; Becker, J.; Dolson, J.; Held, D.; Kammel, S.; Kolter, J.Z.; Langer, D.; Pink, O.; Pratt, V.; et al. Towards fully autonomous driving: Systems and algorithms. In Proceedings of the 2011 IEEE Intelligent Vehicles Symposium (IV), Baden-Baden, Germany, 5–9 June 2011; pp. 163–168. [Google Scholar] [CrossRef]
  6. Cipresso, P.; Giglioli, I.A.C.; Raya, M.A.; Riva, G. The Past, Present, and Future of Virtual and Augmented Reality Research: A Network and Cluster Analysis of the Literature. Front. Psychol. 2018, 9, 2086. [Google Scholar] [CrossRef] [PubMed]
  7. Ibáñez, M.B.; Delgado-Kloos, C. Augmented reality for STEM learning: A systematic review. Comput. Educ. 2018, 123, 109–123. [Google Scholar] [CrossRef]
  8. Fan, Z.; Zhu, Y.; He, Y.; Sun, Q.; Liu, H.; He, J. Deep learning on monocular object pose detection and tracking: A comprehensive overview. ACM Comput. Surv. 2022, 55, 1–40. [Google Scholar] [CrossRef]
  9. Wang, C.; Martín-Martín, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; Zhu, Y. 6-pack: Category-level 6d pose tracker with anchor-based keypoints. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10059–10066. [Google Scholar]
  10. Deng, X.; Mousavian, A.; Xiang, Y.; Xia, F.; Bretl, T.; Fox, D. PoseRBPF: A Rao–Blackwellized particle filter for 6-D object pose tracking. IEEE Trans. Robot. 2021, 37, 1328–1342. [Google Scholar] [CrossRef]
  11. Chen, H.; Wang, P.; Wang, F.; Tian, W.; Xiong, L.; Li, H. Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2781–2790. [Google Scholar]
  12. Wen, B.; Tremblay, J.; Blukis, V.; Tyree, S.; Müller, T.; Evans, A.; Fox, D.; Kautz, J.; Birchfield, S. Bundlesdf: Neural 6-dof tracking and 3d reconstruction of unknown objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 606–617. [Google Scholar]
  13. Merrill, N.; Guo, Y.; Zuo, X.; Huang, X.; Leutenegger, S.; Peng, X.; Ren, L.; Huang, G. Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14881–14890. [Google Scholar] [CrossRef]
  14. Salas-Moreno, R.F.; Newcombe, R.A.; Strasdat, H.; Kelly, P.H.; Davison, A.J. SLAM++: Simultaneous Localisation and Mapping at the Level of Objects. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 1352–1359. [Google Scholar] [CrossRef]
  15. McCormac, J.; Clark, R.; Bloesch, M.; Davison, A.J.; Leutenegger, S. Fusion++: Volumetric Object-Level SLAM. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 32–41. [Google Scholar]
  16. Wada, K.; Sucar, E.; James, S.; Lenton, D.; Davison, A.J. MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  17. Wen, B.; Bekris, K. BundleTrack: 6D Pose Tracking for Novel Objects without Instance or Category-Level 3D Models. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 8067–8074. [Google Scholar] [CrossRef]
  18. Hampali, S.; Rad, M.; Oberweger, M.; Lepetit, V. Honnotate: A method for 3d annotation of hand and object poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3196–3206. [Google Scholar]
  19. Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
  20. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Trans. Graph. 2022, 41, 1–15. [Google Scholar] [CrossRef]
  21. Azinovic, D.; Martin-Brualla, R.; Goldman, D.B.; Niebner, M.; Thies, J. Neural RGB-D Surface Reconstruction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6280–6291. [Google Scholar] [CrossRef]
  22. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  23. Wang, P.; Liu, L.; Liu, Y.; Theobalt, C.; Komura, T.; Wang, W. NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 27171–27183. [Google Scholar]
  24. Cheng, H.K.; Schwing, A.G. Xmem: Long-term video object segmentation with an Atkinson-Shiffrin memory model. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: New York, NY, USA, 2022; pp. 640–658. [Google Scholar]
  25. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
  26. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  27. Arun, K.S.; Huang, T.S.; Blostein, S.D. Least-Squares Fitting of Two 3-D Point Sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, PAMI-9, 698–700. [Google Scholar] [CrossRef] [PubMed]
  28. Huber, P.J. Robust estimation of a location parameter. In Breakthroughs in Statistics: Methodology and Distribution; Springer: New York, NY, USA, 1992; pp. 492–518. [Google Scholar]
  29. Gropp, A.; Yariv, L.; Haim, N.; Atzmon, M.; Lipman, Y. Implicit geometric regularization for learning shapes. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020. [Google Scholar]
  30. Calli, B.; Walsman, A.; Singh, A.; Srinivasa, S.; Abbeel, P.; Dollar, A.M. Benchmarking in manipulation research: Using the Yale-CMU-Berkeley object and model set. IEEE Robot. Autom. Mag. 2015, 22, 36–52. [Google Scholar] [CrossRef]
  31. Zhu, Z.; Peng, S.; Larsson, V.; Xu, W.; Bao, H.; Cui, Z.; Oswald, M.R.; Pollefeys, M. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12786–12796. [Google Scholar]
  32. Slavcheva, M.; Kehl, W.; Navab, N.; Ilic, S. Sdf-2-sdf: Highly accurate 3d object reconstruction. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: New York, NY, USA, 2016; pp. 680–696. [Google Scholar]
Figure 1. Overview of our system. First, we compute the image mask of the target object, then feed the mask segmentation result into the feature-matching network to obtain a coarse pose estimation using the Umeyama [19] algorithm with the feature-matching result of the previous frame. Second, we use a dual-layer mask-based strategy to select frames from the key-frame pool that have a strong co-visibility relationship with the current frame and perform joint optimization with the current frame to obtain the final pose estimation result.
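As the caption indicates, the coarse pose is computed with the Umeyama algorithm [19] from matched points between adjacent frames. A minimal rigid (scale-free) version is sketched below; it assumes the matched features have already been back-projected to 3D using the depth map and camera intrinsics.

```python
import numpy as np

def umeyama_rigid(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares rigid transform mapping src to dst (Umeyama [19], scale fixed to 1).

    src, dst: (N, 3) arrays of matched 3D points, e.g. feature matches from the
    previous and current frame back-projected with the depth map.
    Returns a 4 x 4 homogeneous transform.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance and its SVD
    H = dst_c.T @ src_c / src.shape[0]
    U, _, Vt = np.linalg.svd(H)

    # Reflection correction keeps the estimate a proper rotation
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    t = mu_dst - R @ mu_src

    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

In a full pipeline this closed-form estimate would typically be wrapped in RANSAC [26] so that outlier feature matches do not corrupt the coarse pose.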
Figure 2. Visualization of the dual-layer mask results (yellow: target object mask; pink: detected foreground occlusions).
Figure 3. Mask differences between adjacent frames together with the original RGB image. (a) The original RGB image of the scene. (b) The foreground-occlusion masks of adjacent frames superimposed: white indicates overlapping areas, and gray indicates non-overlapping areas. (c) The change in foreground-occlusion masks between adjacent frames (white areas).
Figure 4. The partial results of our method on the HO-3D dataset are visualized above. Each row represents the results of a video in the dataset, where the green bounding box indicates the ground truth pose, and the red bounding box represents the predicted results of our method.
Figure 5. Visualizations of partial results of our method on the self-made indoor interaction dataset are presented. Each row represents the results of one video in the dataset.
Figure 6. Experiment with moving occluders: qualitative comparison with BundleSDF on the custom dataset. On the left are the visualized results of the estimated poses of the target (yellow box) by the two methods, while on the right, the final reconstructed meshes of the two methods are displayed.
Figure 7. The partial pose tracking results for the video_switch dataset are shown below: yellow wireframes represent the pose estimation results of the method. (a) The results of the comparative method, BundleSDF; (b) the results of our method.
Figure 8. The comparison of reconstruction results for the video_switch dataset. The first row depicts the reconstruction results by BundleSDF, while the second row shows the reconstruction results by our method. From left to right, the observations are from the frontal, side, and top views of the object, respectively.
Figure 9. The partial results of the tracking for video_cup are shown above. The left two columns display the original images and local enlargements of the pose estimation results by BundleSDF, while the right two columns show the original images and local enlargements of the pose estimation results by our method.
Figure 10. Comparison of tracking and reconstruction results for the video_cup dataset. The first row shows the results obtained with BundleSDF, while the second row presents the output from our method.
Table 1. Key parameter setups.
Parameter Category | Parameter Name | Value
Loss Term | ω_f | 1.0
Loss Term | ω_p | 1.0
Loss Term | ω_s | 1.0
Hyperparameters | Point Cloud Visibility Threshold | 0.1
Hyperparameters | Sensitivity Coefficient | 0.8
Hyperparameters | Ratio Threshold | 0.05
Hyperparameters | Absolute Threshold | 500
Table 2. The comparative results of different methods on the HO-3D dataset.
Method | ADD (%) ↑ | ADD-S (%) ↑ | CD (cm) ↓
BundleSDF [12] | 92.27 | 96.43 | 0.51
NICE-SLAM [31] | 10.42 | 25.68 | 54.91
SDF-2-SDF [32] | 15.19 | 34.88 | 9.65
Ours | 92.56 | 96.51 | 0.50
The symbol ↑ indicates that higher values are better, while ↓ indicates that lower values are better. Bold values indicate the best performance.
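For reference, ADD and ADD-S are computed from per-frame distances between model points transformed by the predicted and ground-truth poses; the percentages reported in Table 2 are then typically obtained by integrating recall over a range of distance thresholds. A sketch of the per-frame distances, assuming sampled model points and 4 × 4 pose matrices, is given below.

```python
import numpy as np
from scipy.spatial import cKDTree

def transform(points: np.ndarray, T: np.ndarray) -> np.ndarray:
    """Apply a 4 x 4 rigid transform to (N, 3) model points."""
    return points @ T[:3, :3].T + T[:3, 3]

def add_distance(model_pts: np.ndarray, T_pred: np.ndarray, T_gt: np.ndarray) -> float:
    """ADD: mean distance between corresponding transformed model points."""
    diff = transform(model_pts, T_pred) - transform(model_pts, T_gt)
    return float(np.linalg.norm(diff, axis=1).mean())

def add_s_distance(model_pts: np.ndarray, T_pred: np.ndarray, T_gt: np.ndarray) -> float:
    """ADD-S: mean closest-point distance, tolerant to symmetric objects."""
    pred = transform(model_pts, T_pred)
    gt = transform(model_pts, T_gt)
    nn_dist, _ = cKDTree(gt).query(pred, k=1)
    return float(nn_dist.mean())
```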
Table 3. Comparison of reconstruction accuracy on the self-made dataset.
Scene | BundleSDF CD (cm) ↓ | Ours CD (cm) ↓
firm_cushion | 2.13 | 0.75
toy | 1.67 | 0.31
box | 1.75 | 0.52
airpods | 2.60 | 2.56
switch | 1.58 | 0.21
cup | 1.66 | 0.47
mean | 1.90 | 0.80
The symbol ↓ indicates that lower values are better. Bold values indicate the best performance.
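The Chamfer distance (CD) reported in Tables 2 and 3 measures the agreement between points sampled from the reconstructed and ground-truth surfaces. A common symmetric formulation is sketched below; the sampling density and averaging convention may differ from the exact evaluation used here.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(recon_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point sets in the same
    units (cm here): the average of nearest-neighbour distances in both directions."""
    d_recon_to_gt, _ = cKDTree(gt_pts).query(recon_pts, k=1)
    d_gt_to_recon, _ = cKDTree(recon_pts).query(gt_pts, k=1)
    return float(0.5 * (d_recon_to_gt.mean() + d_gt_to_recon.mean()))
```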
Table 4. Hyperparameter effects.
Percentage Threshold | Absolute Threshold | Sensitivity 0.7: ADD-S ↑ / CD ↓ | Sensitivity 0.8: ADD-S ↑ / CD ↓ | Sensitivity 0.9: ADD-S ↑ / CD ↓
0.025 | 400 | 97.022% / 0.470 | 97.114% / 0.469 | 96.355% / 0.498
0.05 | 500 | 97.144% / 0.471 | 97.214% / 0.449 | 97.100% / 0.469
0.1 | 600 | 96.040% / 0.459 | 96.952% / 0.472 | 96.840% / 0.466
The symbol ↑ indicates that higher values are better, while ↓ indicates that lower values are better. Bold values indicate the best performance.
