Article

Depth Estimation of a Deformable Object via a Monocular Camera

1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen 518055, China
3 Guangdong Provincial Key Laboratory of Robotics and Intelligent System, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
4 Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2019, 9(7), 1366; https://doi.org/10.3390/app9071366
Submission received: 28 January 2019 / Revised: 21 March 2019 / Accepted: 26 March 2019 / Published: 1 April 2019

Abstract:
Depth estimation of 3D deformable objects has become increasingly important for intelligent applications. In this paper, we propose a feature-based approach for accurate depth estimation of a deformable 3D object with a single camera, which reduces the depth estimation problem to a pose estimation problem. The proposed method first reconstructs the target object. With this 3D reconstruction as an a priori model, only one monocular image is required afterwards to estimate the target object's depth accurately, regardless of pose changes or deformation of the object. Experiments are conducted on a NAO robot and a human to evaluate the depth estimation accuracy of the proposed method.

1. Introduction

In human–robot cooperation, deep reinforcement learning (DRL) can be used to train a robot to undertake a task. In a bolt-screwing task, for instance, the human partner's arms may become obstacles for the robot during the working process (Figure 1). Training the robot's obstacle-avoidance capability with a DRL method requires a huge number of samples, which are usually hard to generate. One way to obtain them is to reconstruct 3D imagery [1,2,3] of a human worker executing the task: the reconstructed sequence of the human's arms can then serve as moving obstacles to train the obstacle avoidance capability of the robot in a virtual environment (Figure 2).
In the scenarios above, a common prerequisite is accurate pose information of the human or the robot. However, when an object is projected onto the camera plane, its depth along the optical axis is lost, which can make two objects that are actually far apart appear close to each other [4]. Without correct depth information, the pose cannot be estimated correctly. Laser scanners can provide depth information, but they are usually expensive. Low-cost devices such as the Kinect are prevailing alternatives, but their accuracy is limited: the Kinect has an error of at least 4 mm, a dead zone within 0.5 m, and its measurements become less accurate as the distance increases.
Many previous studies propose different approaches to estimate depth from images. Among them, a few researchers use optimization methods to handle the problem. Ranftl et al. propose a motion segmentation method that produces a dense depth map from two consecutive frames of a single monocular camera; they segment the optical flow field into a set of motion models, from which the scene is reconstructed by minimizing a convex program [5]. Smith et al. present a method that estimates depth from a single polarisation image by solving a large, sparse system of linear equations [6]. Karsch et al. propose a technique that automatically generates plausible depth maps from videos using non-parametric depth sampling [7]. Optimization methods need manually defined constraints to guarantee the accuracy of the resulting depth estimates, so they rely on human experience to provide effective constraints.
Approaches that apply learning-based methods to manually extracted features are another promising alternative. Saxena et al. propose a Markov Random Field (MRF) learning algorithm to handle monocular cues such as texture gradients and variations, defocus, etc., and incorporate these cues into a stereo system to obtain depth estimates [8]. Ma et al. adapt the ResNet-50 network by transfer learning to tackle depth estimation from a single image [9]. Haim et al. propose a phase-coded aperture camera for depth estimation; they equip the camera with an optical phase mask that produces unambiguous depth-related color characteristics in the captured image [10]. Gan et al. present a convolutional neural network architecture that emphasizes the relationships between different image locations and incorporates both absolute and relative features [11]. These methods depend on the extracted features, which in some scenarios may not carry sufficient cues about depth.
Recently, approaches that use deep learning to generate depth maps from images have become prevalent. Fu et al. propose a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem [12]. Jiao et al. propose an approach that handles depth estimation and semantic labeling simultaneously; they introduce an attention-driven loss for network supervision and a synergy network to learn the relevance between the two tasks [13]. Godard et al. train an unsupervised deep neural network with binocular stereo data to address the lack of ground-truth depth data in traditional depth estimation methods, proposing a novel training loss that enables high-quality single-image depth estimation [14]. Inspired by the concept of autoencoders, Garg et al. train the first convolutional neural network end-to-end from scratch for single-view depth estimation in an unsupervised manner [15]. Deep learning methods are powerful regression tools; however, they usually require expensive platforms to run and many training samples related to the target object in order to achieve accurate depth estimation.
Different from the aforementioned approaches, the method proposed in this paper estimates depth by means of 3D reconstruction. At the very beginning, a camera is moved around the target object to obtain its reconstruction. Afterwards, only a single monocular image is required to accurately estimate the depth of the object, no matter how it moves or deforms in front of the camera. The proposed method is therefore well suited to scenarios where the target object is not rigid and accurate depth information is necessary. With the point cloud of the target object (in a certain static pose) reconstructed beforehand, the proposed method can estimate the pose and reconstruct the point cloud of the object in other poses from a single input RGB image. This work can be applied not only to humans and humanoid objects, but to other deformable objects as well.
The remainder of this paper is structured as follows. Section 2, Section 3 and Section 4 introduce the three modules of the proposed approach. Section 5 presents the overall flow of the proposed method. Experimental evaluations on a NAO robot and a human are provided in Section 6. Section 7 concludes the paper.

2. 3D Labeled Reconstruction

This section introduces the a priori model of the target object. The a priori model is the 3D reconstruction (stored as a point cloud [16,17,18]) of the target object in a stationary pose, with a SIFT feature vector attached to each cloud point. The a priori model is therefore built in two steps. First, we use a traditional 3D reconstruction approach (for static objects) to reconstruct the target object from multiple images. Second, we use the SIFT algorithm [19] to extract feature vectors from the collected images and attach them to the corresponding 3D points of the reconstructed point cloud.

2.1. 3D Reconstruction with Multiple Images

Given a static object, we move a single camera (with focal length $f$) around the object to reconstruct a point cloud. Let $N$ be the total number of images captured by the camera and $M$ the total number of points on the object surface. The orientation and position of the camera with respect to the world frame at the $i$th instant are represented by a rotation matrix $R_i$ and a translation vector $t_i$. Denote by $P_j = [X_j\ Y_j\ Z_j]^T$ the $j$th point on the object surface in the world frame, by $P_j^i = [X_j^i\ Y_j^i\ Z_j^i]^T$ the same point in the camera frame at the $i$th instant, and by $m_j^i = [x_j^i\ y_j^i]^T$ its image coordinate at the $i$th instant (for notational simplicity, $m_j^i$ is still defined, as a placeholder value, when $P_j$ is occluded from the camera at the $i$th instant). The following holds [20]:
$$ \begin{bmatrix} P_j^i \\ 1 \end{bmatrix} = \begin{bmatrix} R_i & t_i \\ 0 & 1 \end{bmatrix} \begin{bmatrix} P_j \\ 1 \end{bmatrix}, \tag{1} $$
$$ Z_j^i \begin{bmatrix} m_j^i \\ 1 \end{bmatrix} = \begin{bmatrix} f_i I_{2\times 2} & O_{2\times 1} \\ O_{1\times 2} & 1 \end{bmatrix} P_j^i. \tag{2} $$
Let
$$ v_j^i = Z_j^i \begin{bmatrix} m_j^i \\ 1 \end{bmatrix} - \begin{bmatrix} f_i I_{2\times 2} & O_{2\times 1} \\ O_{1\times 2} & 1 \end{bmatrix} \begin{bmatrix} R_i & t_i \end{bmatrix} \begin{bmatrix} P_j \\ 1 \end{bmatrix}. \tag{3} $$
Subsequently, the desired result in this step is the 3D reconstruction $\{P_j^*\}_{j=1}^{M}$ of the following form:
$$ \{P_j^*\}_{j=1}^{M} = \arg\min_{\{P_j\}_{j=1}^{M}} \sum_{i=1}^{N} \sum_{j=1}^{M} \left\| v_j^i \right\|. \tag{4} $$
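To make Equations (3) and (4) concrete, the following is a minimal numerical sketch (not the authors' implementation) of the point-wise residual and a least-squares version of the minimization, written in Python. It assumes the camera poses $(R_i, t_i)$ and the focal length are already known (e.g., from a standard structure-from-motion pipeline); all variable names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(P_flat, observations, f):
    """Stack the residuals v_j^i of Eq. (3) over all visible (i, j) pairs.

    observations: list of (R_i, t_i, j, m_ij) tuples, one per visible point,
    where m_ij is the observed 2D image coordinate of point j at instant i."""
    P = P_flat.reshape(-1, 3)                 # current estimate of the points {P_j}
    K = np.array([[f, 0.0, 0.0],
                  [0.0, f, 0.0],
                  [0.0, 0.0, 1.0]])           # the matrix [f*I, O; O, 1] of Eq. (2)
    res = []
    for R_i, t_i, j, m_ij in observations:
        P_cam = R_i @ P[j] + t_i              # Eq. (1): world frame -> camera frame
        Z = P_cam[2]                          # depth Z_j^i along the optical axis
        res.append(Z * np.append(m_ij, 1.0) - K @ P_cam)   # Eq. (3)
    return np.concatenate(res)

def reconstruct_points(P_init, observations, f):
    """Approximate Eq. (4): minimize the stacked residuals in a least-squares sense."""
    sol = least_squares(reprojection_residuals, P_init.ravel(),
                        args=(observations, f))
    return sol.x.reshape(-1, 3)
```

Note that `least_squares` minimizes the sum of squared residuals rather than the sum of norms written in Equation (4); for a sketch the two lead to very similar reconstructions.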

2.2. SIFT Features to Label the 3D Reconstruction

Given a two-dimensional image $I(x, y)$, the SIFT algorithm [19] extracts effective key points through the LoG operator. By computing the gradients in the neighborhood of each key point, a corresponding descriptor vector is obtained to characterize it. We use the SIFT algorithm to find a set of feature points (denoted $\{m_s^i\}_{s=1}^{S_i}$) and their corresponding descriptor vectors (denoted $\{l_s^i\}_{s=1}^{S_i}$) in the image captured at the $i$th instant, which jointly yield the two-tuple set $\{(m_s^i, l_s^i)\}_{s=1}^{S_i}$. Executing the same operation on all the images, we finally obtain $\{\{(m_s^i, l_s^i)\}_{s=1}^{S_i}\}_{i=1}^{N}$ (where $S_i$ is the total number of feature points returned by the SIFT algorithm for the image captured at the $i$th instant).
Subsequently, we need to attach the descriptor vectors to the corresponding 3D points on the surface of the reconstructed point cloud. From Equations (1) and (2), it can be deduced that
$$ P_j = R_i^T \begin{bmatrix} \dfrac{d_j^i}{f_i} I_{2\times 2} & O_{2\times 1} \\ O_{1\times 2} & d_j^i \end{bmatrix} \begin{bmatrix} m_j^i \\ 1 \end{bmatrix} - R_i^T t_i, \tag{5} $$
where $d_j^i = Z_j^i$ is the depth of $P_j$ in the camera frame at the $i$th instant.
Using Equation (5), we can determine the 3D point $P_s^i$ on the reconstructed point cloud corresponding to the feature point $m_s^i$, which gives the two-tuple set $\{\{(P_s^i, l_s^i)\}_{s=1}^{S_i}\}_{i=1}^{N}$. For notational simplicity, any $P_j$ that is occluded from the camera at the $i$th instant, or whose corresponding $m_j^i$ is not a key point, is assigned the descriptor vector $l_j^i = 0$. Therefore, after the $N$th instant we obtain $\{\{(P_j^i, l_j^i)\}_{j=1}^{M}\}_{i=1}^{N}$. The required 3D labeled reconstruction is then
$$ \{(P_j^*, l_j^*)\}_{j=1}^{M} = \{(P_j^*, \bar{l}_j)\}_{j=1}^{M}, \tag{6} $$
where $\bar{l}_j$ is the average over all the non-zero descriptor vectors (i.e., $\{l_j^i : i \le N,\ l_j^i \neq 0\}$) associated with the 3D point $P_j$ within the first $N$ sampling instants.
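As a concrete illustration of this labeling step, the sketch below extracts SIFT descriptors with OpenCV and averages them per 3D point as in Equation (6). The `correspondences` mapping (keypoint index to 3D point index, obtained via Equation (5)) is assumed to be computed elsewhere; this is an illustrative sketch under those assumptions, not the authors' code.

```python
import numpy as np
import cv2

def label_point_cloud(images, correspondences, num_points):
    """Attach an averaged SIFT descriptor to each reconstructed 3D point.

    correspondences[i] maps a keypoint index of image i to the index j of the
    3D point it back-projects to via Eq. (5); points that are never observed
    keep a zero descriptor, as in the text."""
    sift = cv2.SIFT_create()
    sums = np.zeros((num_points, 128))            # SIFT descriptors are 128-D
    counts = np.zeros(num_points, dtype=np.int64)

    for i, img in enumerate(images):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        if descriptors is None:
            continue
        for k, desc in enumerate(descriptors):
            j = correspondences[i].get(k)         # 3D point hit by keypoint k, if any
            if j is not None:
                sums[j] += desc
                counts[j] += 1

    labels = np.zeros_like(sums)                  # \bar{l}_j of Eq. (6)
    seen = counts > 0
    labels[seen] = sums[seen] / counts[seen][:, None]
    return labels
```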

3. Skeleton-Based Topological Segmentation

This section introduces how to equip the reconstructed point cloud with a robust topological segmentation, so as to handle the case where the target object is not rigid. Each sub-point-cloud produced by the topological segmentation is expected to be rigid. A topological segmentation based on the surface of the target object is easily influenced by surface noise and is therefore not robust. The proposed topological segmentation is thus executed in two steps. First, we extract the skeleton of the reconstructed point cloud [21] and segment the skeleton based on its curvature. Second, we dilate the sub-skeletons [22] to obtain the sub-point-clouds, which are the result of the topological segmentation.
Skeleton extraction and segmentation. Given an object denoted $\Omega = \{P_j^*\}_{j=1}^{M}$, we denote by $CORE(\Omega)$ the set of all maximally inscribed spheres in $\Omega$, none of which is tangent to the noisy surface. The skeleton of $\Omega$ is then denoted $S(\Omega)$.
After extracting the skeleton of the reconstructed point cloud, we segment it according to its curvature. Given $C(\Omega) \subset S(\Omega)$, an equivalence relation $\sim_C$ induced by $C(\Omega)$ is defined as follows: for $p_1, p_2 \in S(\Omega)$, $p_1 \sim_C p_2$ if and only if $p_1$ and $p_2$ lie on a curve segment whose two ends are points of $C(\Omega)$ and no other point of $C(\Omega)$ lies on the same curve segment.
Thus, the curve segments determined by $C(\Omega)$ are the equivalence classes [23] $S_{C(\Omega)}$, defined as
$$ S_{C(\Omega)} = S(\Omega)/\!\sim_C \; = \; \big\{ \{ q \in S(\Omega) : p \sim_C q \} : p \in S(\Omega) \big\}. \tag{7} $$
It is clear that the curve segments in $S_{C(\Omega)}$ are separated from each other. In this paper, we propose two categories of points, denoted $C_1(\Omega)$ and $C_2(\Omega)$, such that $C(\Omega) = C_1(\Omega) \cup C_2(\Omega)$ determines $\sim_C$ in Equation (7).

3.1. The First Category $C_1(\Omega)$

Suppose all the points of the skeleton $S(\Omega)$ form a set $\{p_k\}_{k=1}^{K}$ ($K$ is the total number of elements in $S(\Omega)$). We use the set $\{e_{mn}\}_{0 < m \le K,\, 0 < n \le K}$ to represent the connectivity of each pair of points of $S(\Omega)$: $e_{mn} = 1$ if $p_m$ and $p_n$ are adjacent to each other, and $e_{mn} = 0$ otherwise. Subsequently, the first category of points $C_1(\Omega)$ is defined as
$$ C_1(\Omega) = \left\{ p_m \in \{p_k\}_{k=1}^{K} \; : \; \sum_{n=1}^{K} e_{mn} > 2 \ \text{or} \ \sum_{n=1}^{K} e_{mn} = 1 \right\}. \tag{8} $$
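In discrete form, Equation (8) simply selects the skeleton's branch points (degree greater than two) and end points (degree one) from the adjacency matrix. A short illustrative sketch, assuming the adjacency matrix $e_{mn}$ is available as a NumPy array:

```python
import numpy as np

def first_category(adjacency):
    """Select branch points (degree > 2) and end points (degree == 1) of the
    skeleton, following Equation (8). `adjacency` is the K x K matrix e_mn."""
    degree = adjacency.sum(axis=1)
    return np.flatnonzero((degree > 2) | (degree == 1))
```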

3.2. The Second Category $C_2(\Omega)$

Similarly to $S_{C(\Omega)}$ (Equation (7)), the skeleton $S(\Omega)$ segmented by $C_1(\Omega)$ is
$$ S_{C_1(\Omega)} = S(\Omega)/\!\sim_{C_1}. \tag{9} $$
We parameterize the $c$th curve segment (supposing $C$ curve segments in total) of $S_{C_1(\Omega)}$ as $r_c(u_c)$, where $u_c \in [0, L_c]$ is the arc-length parameter and $L_c$ is the length of the $c$th curve segment. The Frenet formulas [24] of the $c$th curve are then
$$ \frac{\mathrm{d}}{\mathrm{d}u_c} \begin{bmatrix} r_c \\ \alpha_c \\ \beta_c \\ \gamma_c \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \kappa_c & 0 \\ -\kappa_c & 0 & \tau_c \\ 0 & -\tau_c & 0 \end{bmatrix} \begin{bmatrix} \alpha_c \\ \beta_c \\ \gamma_c \end{bmatrix}, \tag{10} $$
where $\alpha_c$, $\beta_c$ and $\gamma_c$ are, respectively, the unit tangent, unit normal and unit binormal vectors of $r_c$, and $\kappa_c$ and $\tau_c$ are its curvature and torsion.
We construct a quantity $\delta_c$ that grows with both $\kappa_c$ and $\tau_c$, together with a threshold $\upsilon$. Subsequently, the second category of points $C_2(\Omega)$ is defined as
$$ C_2(\Omega) = \big\{\, p = r_c(u_c) \; : \; \delta_c \geq \upsilon \,\big\}_{c=1}^{C}. \tag{11} $$

3.3. Skeleton Dilation with Constraints

Supposing $B$ is a structuring element [20] given as a subset of $\mathbb{R}^3$, the dilation of $S(\Omega)$ by $B$ after $T$ iterations is defined as
$$ B^{T}\!\left(S(\Omega)\right) = \underbrace{B \oplus \cdots \oplus B}_{T} \oplus\, S(\Omega). \tag{12} $$
The dilation operation is executed to segment the reconstructed point cloud by the sub-skeletons. Therefore, the sub-point-clouds must remain separated from each other, and the dilation must stop when it reaches the surface of the reconstructed point cloud. Thus, in each iteration $t \in \{1, \dots, T\}$, we remove the points that violate the following two constraints:
$$ B^{t}\!\left(S_{c_1}(\Omega)\right) \cap B^{t}\!\left(S_{c_2}(\Omega)\right) = \varnothing, \tag{13} $$
$$ B^{t}\!\left(S_{c_1}(\Omega)\right) \cap \bar{\Omega} = \varnothing, \tag{14} $$
where $S_{c_1}(\Omega)$ and $S_{c_2}(\Omega)$ are two distinct sub-skeletons in $S_{C(\Omega)}$ and $\bar{\Omega}$ is the space exterior to $\Omega$.
Moreover, for the dilation of each sub-skeleton $S_{c_1}(\Omega)$ under the constraints of Equations (13) and (14), the corresponding total number of iterations $T$ satisfies
$$ B^{T+1}\!\left(S_{c_1}(\Omega)\right) = B^{T}\!\left(S_{c_1}(\Omega)\right), \qquad B^{T}\!\left(S_{c_1}(\Omega)\right) \neq B^{T-1}\!\left(S_{c_1}(\Omega)\right). \tag{15} $$
Then, the dilation result of the sub-skeletons in $S_{C(\Omega)}$ under the constraints of Equations (13)–(15) forms an equivalence relation on $\Omega$, i.e., the desired topological segmentation.
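The constrained dilation of Equations (12)–(15) can be sketched on a voxelized representation as follows. This is an illustrative implementation using `scipy.ndimage.binary_dilation`, not the authors' code; it assumes the sub-skeletons and the object interior are given as boolean voxel grids.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def segment_by_dilation(sub_skeletons, inside, structure=None):
    """Grow each sub-skeleton into a sub-point-cloud by constrained dilation.

    sub_skeletons: list of boolean voxel grids, one per sub-skeleton of S_C(Omega).
    inside: boolean voxel grid of the reconstructed object; growth never leaves it,
            which enforces Eq. (14). Voxels already claimed by another region are
            not entered (Eq. (13)), and the loop stops when no region changes any
            more (Eq. (15))."""
    regions = [s.copy() for s in sub_skeletons]
    changed = True
    while changed:
        changed = False
        for idx, r in enumerate(regions):
            others = np.zeros_like(inside)
            for k, q in enumerate(regions):
                if k != idx:
                    others |= q                       # voxels owned by other regions
            grown = (binary_dilation(r, structure=structure) & inside & ~others) | r
            if not np.array_equal(grown, r):
                regions[idx] = grown
                changed = True
    return regions
```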

4. 3D Reconstruction at the ith Instant

This section introduces how to quickly reconstruct the dynamic object with a single RGB camera. For each frame captured by the camera, we first extract all the feature points. By matching these feature points to those attached to the reconstructed point cloud, we determine which sub-point-cloud each feature point corresponds to. The pose of each sub-point-cloud can then be estimated from the correspondences between the cloud points and the image feature points; this pose estimation problem is handled by solving a nonlinear optimization. The reconstruction therefore reduces to the reorganization of the sub-point-clouds with updated poses.
Denote the image captured at the $i$th instant ($i > N$) as $\{m_j^i\}_{j=1}^{M}$. Using the SIFT algorithm, we extract its feature points and their descriptor vectors. From Section 2.2, we have a labeled 3D point cloud with descriptor vectors attached. We can therefore match the descriptors of the captured image against those of the cloud to find their correspondences, denoted as the two-tuple set $\{(m_i^{f_i}, P_{f_i}^*)\}_{f_i=j_1}^{j_F}$, in which $m_i^{f_i}$ and $P_{f_i}^*$ are an image feature point and a cloud point whose descriptor vectors are similar. In addition, $\{(m_i^{f_i}, P_{f_i}^*)\}_{f_i=j_1}^{j_F}$ yields a bijective map $\varphi_i: \{m_i^{f_i}\}_{f_i=j_1}^{j_F} \to \{P_{f_i}^*\}_{f_i=j_1}^{j_F}$.
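One possible way to establish this correspondence set and the map $\varphi_i$ is brute-force descriptor matching with a ratio test, sketched below. It is illustrative only: the ratio-test threshold is an assumption, and Appendix A describes the group-based refinement actually used to improve matching.

```python
import numpy as np
import cv2

def match_image_to_cloud(img_descriptors, cloud_descriptors, ratio=0.75):
    """Match SIFT descriptors of the current frame against the descriptors
    attached to the point cloud (Section 2.2), keeping unambiguous matches only.

    Returns pairs (image keypoint index, cloud point index) defining phi_i."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(img_descriptors.astype(np.float32),
                           cloud_descriptors.astype(np.float32), k=2)
    pairs = []
    for candidates in knn:
        if len(candidates) < 2:
            continue
        m, n = candidates
        if m.distance < ratio * n.distance:      # Lowe's ratio test
            pairs.append((m.queryIdx, m.trainIdx))
    return pairs
```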
Let $\mathcal{T} = \{\,\{q \in \Omega : q \sim p\} : p \in \Omega\,\}$ be the basis for the topology of the space $\Omega = \{P_j^*\}_{j=1}^{M}$, where $\sim$ is the equivalence relation given by the topological segmentation of Section 3. The basis for the topology (denoted $\mathcal{T}^i$) of the subset $\{P_{f_i}^*\}_{f_i=j_1}^{j_F} \subset \{P_j^*\}_{j=1}^{M}$ is then
$$ \mathcal{T}^i = \left\{ T_c \cap \{P_{f_i}^*\}_{f_i=j_1}^{j_F} \; : \; T_c \in \mathcal{T} \right\}. \tag{16} $$
We can further obtain the basis for the topology (denoted $\mathcal{T}_m^i$) of the set $\{m_i^{f_i}\}_{f_i=j_1}^{j_F}$ as
$$ \mathcal{T}_m^i = \left\{ \varphi_i^{-1}(T_m^i) \; : \; T_m^i \in \mathcal{T}^i \right\}. \tag{17} $$
Based on our design, each element of $\mathcal{T}$ is a rigid component of $\Omega$. Thus, when the object represented by $\Omega$ moves stochastically, all the points in a component $T_c \in \mathcal{T}$ undergo the same rigid transformation, i.e., there exists a single pair $(R_i^c, t_i^c)$ such that the component at the $i$th instant, denoted $T_c(i)$, satisfies
$$ T_c(i) = R_i^c T_c + t_i^c, \tag{18} $$
where $R_i^c$ is a three-dimensional rotation matrix, $t_i^c$ is a three-dimensional translation vector, and the set operation $R_i^c T_c + t_i^c$ is defined as
$$ R_i^c T_c + t_i^c = \left\{ R_i^c p + t_i^c \; : \; p \in T_c \right\}, \quad T_c \in \mathcal{T}, \tag{19} $$
where each $p$ is written as a column vector.
Defining $T_c^i = T_c \cap \{P_{f_i}^*\}_{f_i=j_1}^{j_F}$ and noting that $T_c^i \subset T_c$, Equation (18) gives the actual coordinates of $T_c^i$ at the $i$th instant as
$$ T_c^i(i) = R_i^c T_c^i + t_i^c. \tag{20} $$
Then, we can express $\varphi_i^{-1}(T_c^i)$ as
$$ \varphi_i^{-1}(T_c^i) = \left\{ m_i(p) \; : \; p \in T_c^i(i) \right\}, \tag{21} $$
where $m_i(\cdot)$ (based on Equation (2)) is the operator that projects a three-dimensional point to its two-dimensional image coordinate at the $i$th instant, defined as
$$ m_i(p) = \frac{f_i}{[O_{1\times 2}\ \ 1]\, p}\, [I_{2\times 2}\ \ O_{2\times 1}]\, p, \qquad \forall p \in \mathbb{R}^3. \tag{22} $$
Thus, we can compute $(R_i^c, t_i^c)$ for $T_c^i$ by solving the following optimization problem:
$$ (R_i^c, t_i^c) = \arg\min_{R_i^c,\, t_i^c} \sum_{p \in T_c^i} \left\| \varphi_i^{-1}(p) - m_i\!\left(R_i^c\, p + t_i^c\right) \right\|^2. \tag{23} $$
Finally, the raw 3D reconstruction $\{\tilde{P}_j^i\}_{j=1}^{M}$ ($i > N$) based on $\{m_i^{f_i}\}_{f_i=j_1}^{j_F}$ at the $i$th instant is
$$ \{\tilde{P}_j^i\}_{j=1}^{M} = \bigcup_{T_c \in \mathcal{T}} \left( R_i^c T_c + t_i^c \right). \tag{24} $$

5. Approach Overview

Specifically, we utilize a single camera to solve the dynamic 3D object reconstruction problem. The problem can be formulated as follows: given the images $\{\{m_j^i\}_{j=1}^{M}\}_{i=1}^{N}$ captured over time (the camera pose and position change at each instant $i \le N$ so that a static 3D reconstruction of the object can be satisfactorily achieved) and an image captured at an instant $i > N$, the expected result is the dense reconstruction of the object $\{P_j^i\}_{j=1}^{M}$ at that same instant $i > N$.
Accordingly, we fully utilize the 3D information acquired from the earlier frames and reduce the dynamic object reconstruction problem to a reorganization problem. The proposed approach consists of three main steps.
In the first step, we obtain the static 3D reconstruction of the target object in a stationary pose. Specifically, the point cloud of the target object is acquired with an existing static 3D reconstruction method. Meanwhile, we extract the SIFT features of each image used for the reconstruction and attach the feature descriptors to the corresponding points of the reconstructed point cloud.
In the second step, we find a topological segmentation of the reconstructed point cloud such that each topological part moves rigidly during the object's motion. Point cloud segmentation is itself an open problem, since it is difficult to define a standard for a satisfactory segmentation. In this paper, we transform the point cloud segmentation problem into a skeleton segmentation problem, i.e., the segmentation of the point cloud follows from the segmentation of its skeleton. This is based on the observation that the object's skeleton is much more stable under perturbation than its surface. We therefore segment the skeleton into several sub-skeletons based on its curvature and torsion, and then dilate each sub-skeleton to determine the corresponding topological part of the point cloud.
In the third step, when a new image is captured, we extract its SIFT features and match them to those attached to the point cloud in the first step. These matches establish the correspondence between the image and the point cloud; based on the topological segmentation of the point cloud, the correspondence between the image and each topological part can also be computed. The pose and position of each topological part can then be deduced, and the reconstruction result is obtained by a simple reorganization of these topological parts. The overall flow of the proposed approach is illustrated in Figure 3.
Construction of the rotation matrix. Since an analytic expression of the rotation matrix $R_i^c$ in Equations (18), (20), (23) and (24) is required by the optimization program of Equation (23), we use a quaternion to construct the rotation matrix. A quaternion has the form
$$ q = w + x \cdot i + y \cdot j + z \cdot k. \tag{25} $$
Then, the corresponding rotation matrix $R_i^c$ in the optimization problem of Equation (23) is
$$ R_i^c = \begin{bmatrix} 1 - 2(y^2 + z^2) & 2(xy - zw) & 2(xz + yw) \\ 2(xy + zw) & 1 - 2(x^2 + z^2) & 2(yz - xw) \\ 2(xz - yw) & 2(yz + xw) & 1 - 2(x^2 + y^2) \end{bmatrix}, \tag{26} $$
with the constraint
$$ w^2 + x^2 + y^2 + z^2 = 1, \tag{27} $$
which constitutes a simple constrained nonlinear optimization problem.
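For illustration, the sketch below parameterizes $R_i^c$ by a quaternion as in Equations (25)–(27) and solves the per-component problem of Equation (23) with a nonlinear least-squares solver. Instead of imposing Equation (27) as a hard constraint, the quaternion is normalized inside the residual, a common equivalent trick; this is an assumed implementation, not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares

def quat_to_rot(q):
    """Rotation matrix of Equation (26) from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

def project(p, f):
    """Pinhole projection m_i(p) of Equation (22)."""
    return f * p[:2] / p[2]

def estimate_component_pose(points_3d, points_2d, f):
    """Solve Equation (23) for one rigid component: find (R, t) mapping the
    matched cloud points onto their observed image coordinates."""
    def residuals(x):
        q = x[:4] / np.linalg.norm(x[:4])      # enforce Eq. (27) by normalization
        R, t = quat_to_rot(q), x[4:]
        return np.concatenate([project(R @ P + t, f) - m
                               for P, m in zip(points_3d, points_2d)])
    x0 = np.array([1.0, 0, 0, 0, 0, 0, 1.0])   # identity rotation, object in front of camera
    sol = least_squares(residuals, x0)
    q = sol.x[:4] / np.linalg.norm(sol.x[:4])
    return quat_to_rot(q), sol.x[4:]
```

Applying the estimated $(R_i^c, t_i^c)$ of every component to its full sub-point-cloud and taking the union, as in Equation (24), yields the raw reconstruction at the $i$th instant.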

6. Experiments and Discussion

This section presents the experimental results of the proposed approach. We evaluate the approach through a set of experiments on a NAO robot (Figure 4) and a human.

6.1. Experiments on a NAO Robot

The NAO robot is an autonomous, programmable humanoid robot developed by Aldebaran Robotics (France), with a height of 58 cm and 25 degrees of freedom. We use a NAO robot as the target object for 3D reconstruction. During the experiment, the NAO robot continuously changes its pose, so it behaves as a deformable object. We reconstruct the NAO robot to test the depth accuracy of the proposed algorithm.
This experiment uses a single monocular camera to reconstruct the NAO robot in its dynamic state. To guarantee the effectiveness of the proposed method, several specific processing procedures are listed in Appendix A, Appendix B and Appendix C.
The other procedures of the proposed approach follow the descriptions and formulas in Section 2, Section 3 and Section 4.
We compare the proposed method with the approaches in [14,25]. The depth estimation result of the proposed method is shown in Figure 5. Note that the proposed method only estimates the depth of the target object, whereas the previous approaches estimate the depth of the whole image; therefore, we only compare the depth estimation accuracies over the image regions where the target object appears. The four accuracy metrics of [26], listed in Table 1, are used.
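For reference, the four metrics of Table 1 can be computed as follows; this is a small illustrative sketch in which `pred` and `gt` are assumed to be flattened arrays over the masked object region only.

```python
import numpy as np

def depth_metrics(pred, gt):
    """The four accuracy metrics of Table 1, evaluated on the object mask."""
    abs_rel  = np.mean(np.abs(pred - gt) / gt)
    sq_rel   = np.mean((pred - gt) ** 2 / gt)
    rmse     = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log
```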
The depth accuracies of the proposed method and the two compared approaches are shown in Table 2.

6.2. Experiments on a Human

Experiments are also conducted on a human to verify the proposed approach. We first produce the 3D reconstruction of the human in a still pose. Subsequently, the human changes his pose and the camera captures a monocular image of each pose. We use these images to test the proposed approach and the previous algorithms [14,25]. The program used for this experiment is the same as that of the NAO robot experiment.
The depth estimation results of the proposed method and the compared approaches are shown in Figure 6, and the corresponding depth accuracies are given in Table 3.
As in the NAO robot experiments, the results validate the effectiveness of the proposed method. The proposed method uses the 3D point cloud reconstructed ahead of time as an a priori model and then relies on image features to estimate the pose changes of this model, which is why it can estimate the depth information accurately.

7. Conclusions

In this paper, we propose a feature-based approach to accurately estimate the depth of a deformable object via a monocular camera. The proposed approach first reconstructs the target object in its initial pose as an a priori model. Afterwards, only one monocular image is required to accurately estimate the depth of the target object, no matter how its pose changes. Experiments are conducted on a NAO robot and a human to evaluate the accuracy of the proposed approach. In future work, we aim to accurately estimate the depth of other instances of the same kind of deformable object by reconstructing only a single instance of that kind as the a priori model.

Author Contributions

The manuscript was written through contributions of all authors. All authors have approved the final version of the manuscript. Conceptualization, methodology and writing (original draft preparation): G.J.; software, validation and writing (review and editing): S.J.; investigation, resources, software and supervision: Y.O.; investigation, resources and validation: S.Z.

Funding

This work was supported by the National Natural Science Foundation of China Grant No. U1613210, U1813208, the Shenzhen Fundamental Research Programs (JCYJ2016428154842603, JCYJ20170413165528221), and the Shenzhen Engineering Laboratory for Integration of Interventional Diagnosis and Treatment.

Conflicts of Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence this work, and no professional or other personal interest of any nature in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Appendix A. SIFT Matching Accuracy

Despite its effectiveness validated in [19], the SIFT algorithm cannot match features between two images accurately enough, especially when the corresponding object is deformed between the two images. To improve the feature matching result (a crucial prerequisite of our approach), we propose the following engineering improvements:
For a given feature point, we collect the 10 other feature points closest to it in the image; these 11 feature points form a group. In this way, each feature point in the image defines a group with itself at the center. It is reasonable to assume that, if two feature points from two images match each other, their corresponding groups match each other as well.
When judging whether two feature points from different images match, we compare their groups (see the sketch below). Specifically, we first look for the pair of points, one from each group, that match best (the matching degree is determined by the SIFT descriptor vectors of the feature points). We record their matching degree and delete the two points from their groups, then look for the next best-matching pair in the same way. This is repeated until one of the groups is empty. The accumulated matching degrees of the paired points serve as the similarity of the original two feature points.
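A possible realization of this greedy group matching is sketched below; it is illustrative only, and the particular matching degree (the negative Euclidean distance between SIFT descriptors) is an assumption.

```python
import numpy as np

def group_similarity(group_a, group_b):
    """Accumulated matching degree between two 11-point groups (Appendix A).

    group_a and group_b are sequences of SIFT descriptors; the matching degree
    of a pair is taken here as the negative L2 distance (higher is better)."""
    a, b = list(group_a), list(group_b)
    total = 0.0
    while a and b:
        dists = np.array([[np.linalg.norm(da - db) for db in b] for da in a])
        ia, ib = np.unravel_index(np.argmin(dists), dists.shape)
        total += -dists[ia, ib]        # best remaining pair, then remove it
        a.pop(ia)
        b.pop(ib)
    return total
```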

Appendix B. Discrete Computation of Curvature and Torsion

In Equation (11), since $\delta_c$ should grow with both $\kappa_c$ and $\tau_c$, we compute $\delta_c$ in the discrete setting as follows: for any voxel (i.e., a 3D point stored in the computer), we choose the fifth point away from it along the skeleton on each side and construct two vectors pointing from the voxel to the two chosen voxels. For simplicity, the cosine of the angle between the two vectors is taken as $\delta_c$.
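A direct transcription of this rule is sketched below; it is illustrative only, assumes the skeleton points of one curve segment are stored in traversal order, and does not handle indices near the segment ends.

```python
import numpy as np

def discrete_delta(skeleton_points, idx, offset=5):
    """Discrete curvature proxy of Appendix B: the cosine of the angle between
    the two vectors pointing from voxel `idx` to the voxels `offset` steps
    away on either side along the skeleton curve."""
    p = skeleton_points[idx]
    v1 = skeleton_points[idx - offset] - p
    v2 = skeleton_points[idx + offset] - p
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```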

Appendix C. Result Optimization for Smoothness

The result of Equation (23) can represent the pose of the target object, but the joint between two connected topological parts may be too coarse to serve as the final result. We therefore propose a simple algorithm to smooth the joint between two connected topological parts, as follows:
When the local point cloud is segmented into two topological parts, the correspondence between the points on both borders is known. After solving the optimization problem of Equation (23), the relative pose of the two topological parts is also available, so the new correspondence between the border points can be deduced. We interpolate points evenly along the line segment whose ends are each pair of corresponding border points. The interpolated points, together with the original pairs, define several included angles formed by connecting neighboring points. Treating these included angles as variables and solving an optimization problem that minimizes their variance yields a smoothed reconstruction with smoothly connected topological parts.

References

  1. Xu, G.; Chen, J.Y.; Li, X.T. 3-D Reconstruction of Binocular Vision Using Distance Objective Generated From Two Pairs of Skew Projection Lines. IEEE Access 2017, 5, 27272–27280. [Google Scholar] [CrossRef]
  2. Chu, P.M.; Cho, S.; Fong, S.; Park, Y.W.; Cho, K. 3D Reconstruction Framework for Multiple Remote Robots on Cloud System. Symmetry 2017, 9, 55. [Google Scholar] [CrossRef]
  3. Xu, G.; Yuan, J.; Li, X.T.; Su, J. 3D reconstruction of laser projective point with projection invariant generated from five points on 2D target. Sci. Rep. 2017, 7, 7049. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, G.; Zhang, X.Y.; Li, X.T.; Su, J.; Hao, Z.B. Global Calibration Method of a Camera Using the Constraint of Line Features and 3D World Points. Meas. Sci. Rev. 2016, 16, 190–196. [Google Scholar] [CrossRef] [Green Version]
  5. Ranftl, R.; Vineet, V.; Chen, Q.; Koltun, V. Dense Monocular Depth Estimation in Complex Dynamic Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  6. Smith, W.A.P.; Ramamoorthi, R.; Tozza, S. Linear Depth Estimation from an Uncalibrated, Monocular Polarisation Image. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar]
  7. Karsch, K.; Liu, C.; Kang, S.B. Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2144–2158. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Saxena, A. Depth estimation using monocular and stereo cues. In Proceedings of the International Joint Conference on Artifical Intelligence, Hyderabad, India, 6–12 January 2007. [Google Scholar]
  9. Depth Estimation from Single Image Using CNN-Residual Network. Available online: http://cs231n.stanford.edu/reports/2017/pdfs/203.pdf (accessed on 30 August 2017).
  10. Haim, H.; Elmalem, S.; Giryes, R.; Bronstein, A.M.; Marom, E. Depth Estimation From a Single Image Using Deep Learned Phase Coded Mask. IEEE Trans. Comput. Imaging 2018, 4, 298–310. [Google Scholar] [CrossRef]
  11. Gan, Y.; Xu, X.; Sun, W.; Lin, L. Monocular Depth Estimation with Affinity, Vertical Pooling, and Label Enhancement. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  12. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. arXiv, 2018; arXiv:1806.02446. [Google Scholar]
  13. Jiao, J.; Cao, Y.; Song, Y.; Lau, R.W.H. Look Deeper into Depth: Monocular Depth Estimation with Semantic Booster and Attention-Driven Loss. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018. [Google Scholar]
  14. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  15. Garg, R.; Bg, V.K.; Carneiro, G.; Reid, L. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. arXiv, 2016; arXiv:1603.04992. [Google Scholar]
  16. Wang, G.H.; Chu, Y.B. A New Oren-Nayar Shape-from-Shading Approach for 3D Reconstruction Using High-Order Godunov-Based Scheme. Algorithms 2018, 11, 75. [Google Scholar] [CrossRef]
  17. Zhu, W.; Chang, X.; Wang, Y.B.; Zhai, H.Y.; Yao, Z.X. Reconstruction of Hydraulic Fractures Using Passive Ultrasonic Travel-Time Tomography. Energies 2018, 11, 1321. [Google Scholar] [CrossRef]
  18. Xu, G.; Yuan, J.; Li, X.T.; Su, J. Optimization reconstruction method of object profile using flexible laser plane and bi-planar references. Sci. Rep. 2018, 8, 1526. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  19. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef] [Green Version]
  20. Stockman George, C. Computer Vision; Prentice Hall: Upper Saddle River, NJ, USA, 2001. [Google Scholar]
  21. Jalba, A.; Sobiecki, A.; Telea, A. An Unified Multiscale Framework for Planar, Surface, and Curve Skeletonization. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 30–45. [Google Scholar] [CrossRef] [PubMed]
  22. Rodriguez, J.; Ayala, D. Erosion and Dilation on 2D and 3D Digital Images: A new size-independent approach. In Proceedings of the Vision Modeling & Visualization Conference, Stuttgart, Germany, 21–23 November 2001. [Google Scholar]
  23. Munkres, J. Introduction to Topology; Saunders College Pub: Philadelphia, PA, USA, 1983. [Google Scholar]
  24. Kreyszig, E. Differential Geometry; University of Toronto Press: Toronto, ON, Canada, 1959. [Google Scholar]
  25. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. arXiv, 2016; arXiv:1606.00373. [Google Scholar]
  26. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. Available online: https://papers.nips.cc/paper/5539-depth-map-prediction-from-a-single-image-using-a-multi-scale-deep-network.pdf (accessed on 10 September 2014).
Figure 1. The tool used in a human–robot cooperation scenario for screwing a bolt. Since the human and the robot screw the cross-shaped tool together, the human's arms may become obstacles for the robot during the bolt-screwing process.
Figure 2. 3D reconstruction for training a cooperative robot, which provides training samples and allows the robot to learn to avoid obstacles in a virtual environment.
Figure 3. The framework of the proposed method.
Figure 4. Experiments on a NAO robot. (a) shows the robot as the target object used to evaluate the algorithms; (b) shows the a priori model reconstructed at the beginning.
Figure 5. The depth estimation results for the NAO robot by the approaches in [14,25] and the proposed method. We use the ground truth to build a mask so as to only display the depth information corresponding to the image region of the NAO robot.
Figure 6. The depth estimation results for the human volunteer by the approaches in [14,25] and the proposed method.
Table 1. The four accuracy metrics used for evaluation [26].

Metric | Equation
Abs Relative Difference | $\frac{1}{T} \sum_{y \in T} |y - y^*| / y^*$
Squared Relative Difference | $\frac{1}{T} \sum_{y \in T} (y - y^*)^2 / y^*$
RMSE (linear) | $\sqrt{\frac{1}{T} \sum_{y \in T} (y - y^*)^2}$
RMSE (log) | $\sqrt{\frac{1}{T} \sum_{y \in T} (\log y - \log y^*)^2}$
Table 2. Quantitative comparison on the aspects of accuracy and efficiency (NAO robot).

Approaches | Abs Relative Difference | Squared Relative Difference | RMSE (linear) | RMSE (log)
The approach in [25] | 4.8693 | 510.8358 | 452.1486 | 3.3712
The approach in [14] | 1.1972 | 53.7397 | 182.3361 | 1.2484
The proposed approach | 0.9987 | 31.7892 | 152.0740 | 1.0743
Table 3. Quantitative comparison on the aspects of accuracy and efficiency (human).

Approaches | Abs Relative Difference | Squared Relative Difference | RMSE (linear) | RMSE (log)
The approach in [25] | 5.9473 | 638.9376 | 367.0673 | 3.7256
The approach in [14] | 2.0792 | 197.2335 | 216.7147 | 2.1638
The proposed approach | 1.8274 | 83.5462 | 187.4539 | 2.0132
