1. Introduction
Humans are able to predict the three-dimensional geometry of an object from a single 2D image based on prior knowledge of the target object and related environments. However, this is quite a challenging task for machines without such prior knowledge because of the drastic variation of object shape, texture, lighting conditions, environments, and occlusions. In traditional 3D reconstruction approaches, such as Structure from Motion (SfM) [1] and Simultaneous Localization and Mapping (SLAM), visual appearance consistency across multiple views is utilized to infer the lost three-dimensional information by finding multi-view corresponding point pairs. Extracting dense corresponding point pairs is not a trivial task due to texture-less regions, large differences in viewpoints, and self-occlusion [2]. A complete 3D shape can be observed as long as the multiple images cover the target object from all viewing angles. 3D reconstruction from 2D images is a challenging task that has been studied for a long time. Traditional approaches, such as structure from motion (SfM) and simultaneous localization and mapping (SLAM), require multiple RGB images of the same target scene [3,4]. To recover the 3D structure, dense features are extracted and matched to minimize re-projection errors [5]. However, detecting corresponding pairs is difficult when the distances between viewpoints are large. Furthermore, scanning all aspects of an object with 2D color images is computationally expensive, especially with occluded or concave surface regions [3]. Taking advantage of the availability of large-scale synthetic data, deep learning-based networks have been introduced to reconstruct 3D shapes from single- or multiple-view RGB images. LSM [6] and 3D-R2N2 [7] propose RNN-based networks to predict a voxel representation of the 3D shape from single- or multiple-view images, where input views are processed sequentially using a shared encoder. Extracted view features are refined incrementally as more input views become available. In these methods, the information from earlier input views is hardly passed to the final refined features for 3D reconstruction. DeepMVS [8] and RayNet [9] apply max and average pooling, respectively, over the features of all viewpoints to accumulate outstanding features from unordered multiple views. While these methods alleviate the limitation of RNNs, they lose useful view-specific information. Recently, for more effective accumulation of features from all viewpoints, AttSets [10] uses an attentional aggregation module that predicts a weight matrix over the view features. However, feature aggregation from multiple views is a difficult task in latent space. To overcome the difficulties of feature aggregation, Pix2Vox++ [11] proposes a multi-scale context-aware fusion module that exploits context information across multiple view images. In this method, the context-aware fusion module is applied after coarse volume prediction, and the 3D reconstruction is finally refined using a 3D auto-encoder. However, all these methods learn shape priors implicitly and are sensitive to noisy input or the heavy occlusion frequently found in real-world images. Recently, deep learning methods have also been able to recover 3D shapes in the form of meshes and point clouds. Reference [12] predicts a dense point cloud from multiple viewpoint images using 2D convolutional operations and performs geometric reasoning with 2D projection optimization. Pixel2Mesh++ [13] uses a graph convolutional network to recover the 3D mesh of an object from cross-view information.
Different from multi-view 3D reconstruction, single-view reconstruction has to predict a complete 3D shape from a single-view observation and the corresponding prior knowledge. Therefore, complete 3D shape reconstruction from a single 2D image is a challenging task. To resolve the problem, several methods have attempted to reconstruct a 3D shape from a 2D view, where the 2D view is either silhouettes [14], shading [15,16], or texture [17]. In real-world scenarios, these methods are not practical for shape reconstruction because of their presumptions on camera calibration, smooth surfaces, a single light source, and so forth [18]. To address these issues, most state-of-the-art methods map input images to a latent representation and obtain the corresponding 3D shape by decoding the latent representation. A joint 2D–3D embedding network proposed in [19], named TL-Network, extracts a good latent feature representation using a 3D auto-encoder, and the latent features are inferred from 2D input images. However, the method reconstructs the 3D shape only at a very low resolution. With the great achievements of generative adversarial networks (GANs) [20] and variational auto-encoders (VAEs) [21], 3DVAE-GAN [22] uses a GAN and a VAE to reconstruct a 3D shape from a single 2D image. MarrNet [23] and ShapeHD [24] use an intermediate step of estimating silhouettes, depth, and normals of 2D images and reconstruct the 3D shape from the depth and normal maps. To reconstruct a high-resolution 3D shape, OGN [25] proposes an octree-based learning representation. Matryoshka networks [26] use nested shape layers obtained by recursively decomposing a 3D shape. Recently, several methods have attempted to reconstruct 3D shapes with different representations such as point clouds [27], meshes [28], and signed distance fields [29]. PSG [27] proposes a method to retrieve a point cloud from a single 2D image using a simple deep neural network. Pixel2Mesh [28] reconstructs triangular meshes of the 3D shape. Taking advantage of fine-grained 3D part annotations [30], several approaches [30,31,32] reconstruct 3D shapes by combining 3D parts in a hierarchical scheme.
Reflection symmetry is a useful characteristic of both natural and man-made objects: it supports better description and understanding, and it often makes objects visually attractive. Reference [33] shows the importance of symmetry and virtual views in 3D object recognition. Reflection symmetry has been employed in diverse computer vision tasks such as novel view synthesis [34], texture inpainting [35], and shape recovery [36]. Symmetry correspondences are utilized for 3D reconstruction with different representations such as curves [37], points [38], and deep implicit fields [39]. To find the symmetry correspondences, these methods need either the camera pose or the symmetry plane as given inputs. References [40,41] propose symmetry detection methods for 3D reconstruction. On the other hand, reference [39] proposes an implicit function for 3D reconstruction and recovers local details by projecting 3D points onto the 2D image and applying symmetry fusion. The symmetry prior, present in a considerable number of object types, is one of the most powerful clues for inferring a complete 3D shape from single-view images [42]. Symmetry properties have provided additional knowledge about 3D structures and the entire geometry of objects in 3D reconstruction tasks [43,44,45,46].
In this work, we propose Sym3DNet for single-view 3D reconstruction. Sym3DNet recovers occluded and unseen symmetric shapes by sharing symmetry features inside the network, enabling 3D reconstruction in complex environments. Additionally, we incorporate a 3D shape perceptual loss that improves the reconstruction of realistic 3D structures at both the global and local level. Extensive quantitative and qualitative evaluations on the ShapeNet and Pix3D datasets show the outstanding performance of the proposed 3D reconstruction from a single image.
2. Proposed Method
To apply a symmetry prior to 3D reconstruction, the view-orientation of the 3D output with respect to the input view is critical. In view-centric coordinates, the orientation of the reconstructed 3D output is aligned to the 3D camera space of the input image, as shown in Figure 1a. On the other hand, in object-centric coordinates, the orientation of the 3D output is aligned to the canonical view, as shown in Figure 1b. In the canonical view, the orientations of all 3D shapes are aligned in 3D space, as shown in Figure 1c. In order to include a symmetry prior inside a network, we have to know the symmetry correspondences of the 3D output. To find the symmetry correspondences in view-centric coordinates, symmetry plane detection is essential [41]. In view-centric coordinates, the global reflection symmetry plane of an object varies along with the input view. NeRD [40] proposes learning-based reflection symmetry plane detection from a single-view shape image; the method can be extended to pose estimation, depth estimation, and 3D reconstruction tasks. It performs well on clean synthetic object images with uniform backgrounds. However, symmetry plane detection on real-world images with arbitrary backgrounds becomes a more challenging task. Therefore, incorporating an additional symmetry plane detection step makes the performance of 3D reconstruction dependent on the performance of symmetry plane detection. In contrast, the global reflection symmetry plane of 3D shapes in object-centric coordinates can be defined in the canonical view. Ladybird [39] proposes deep implicit field-based 3D reconstruction in which local image features of corresponding symmetry points are fused to solve the problem of self-occlusion. To extract local features from an input image, camera extrinsic parameters are either given or predicted from the input view, which restricts the scope of single-view 3D reconstruction applications. Predicting camera parameters is often very difficult for real-world images, and the reconstruction performance is affected by the camera parameter prediction performance.
To exploit the advantage of symmetry in view-centric coordinates, most methods have to solve the symmetry detection problem, which is itself challenging in real-world images due to self-occlusion, occlusions by other objects, and complicated backgrounds. On the other hand, methods that project 3D points onto 2D to collect symmetry correspondences have to know the camera pose, which is another challenging task under occlusions and complicated backgrounds. In this work, we utilize the advantage of the canonical view and propose voxel-wise symmetry fusion inside our network to share the symmetry prior. Symmetry fusion helps the network recover missing shapes in 3D reconstruction.
Instead of using camera information or a separate symmetry detection step [39,40], we focus on a canonical-view 3D reconstruction network in which the global symmetry plane of 3D shapes is learned implicitly and aligned, exploiting the benefits of the symmetry properties of 3D objects in 3D reconstruction. We propose a Symmetric 3D Prior Network (Sym3DNet) for single-view 3D reconstruction. Sym3DNet is composed of two jointly trained encoder branches: a 3D shape encoder and a 2D image encoder. A 3D decoder is connected to the encoders so that the input 3D volume and 2D image are represented in the same latent space and the 3D volume is reconstructed from the latent representation via the 3D decoder. To incorporate the symmetric prior, a symmetry fusion module is attached to the 3D decoder and a perceptual loss is calculated. A pair of symmetric voxels share their receptive features in the symmetry fusion module. To penalize non-symmetric shapes in our network, we introduce a perceptual loss that has two terms: global and local. We pass the reconstructed 3D shape and the corresponding ground truth to the pre-trained 3D shape encoder and calculate the global perceptual loss based on the latent representations of the predicted and original 3D shapes. In this way, the global perceptual loss penalizes not only reconstructed shapes deviating from the original shapes but also the prediction of perceptually unpleasant (i.e., visually uncomfortable) shapes. Many encoder–decoder networks [6,7,10,11] suffer from limited performance in the reconstruction of local details. To recover local details, the local perceptual loss is calculated from randomly chosen partial 3D regions of the reconstructed 3D object and their ground truths. For the local perceptual loss, we use a 3D shape encoder pre-trained with randomly cropped small 3D shapes. In our experimental evaluation, the proposed Sym3DNet achieves outstanding single-view 3D reconstruction performance in terms of both efficiency and accuracy on ShapeNet [47], a synthetic dataset, and Pix3D [48], a real-world dataset. The proposed method also generalizes to unseen shape reconstruction.
The goal of our method is to predict a voxelized 3D shape from a single 2D image. The output of the 3D shape prediction is represented by a 3D voxel grid where 1 indicates the surface of the shape and 0 indicates empty space. Our method maps the input RGB image to a latent representation of a 3D shape and decodes the representation to obtain the corresponding volumetric shape. The proposed method optimizes the latent space representation of a 3D shape so that the original 3D shape can be reconstructed by the 3D decoder. At the same time, the latent space representation of the 3D shape has to be obtainable from the corresponding 2D color image via the 2D image encoder.
First, the 3D shape encoder and decoder shown in Figure 2 are trained as a 3D auto-encoder. The 3D shape encoder takes the 3D voxel representation $gt$ as input and transforms the 3D shape into a latent feature vector $z$. The 3D decoder takes $z$ as input and decodes the latent feature vector into a voxel representation of the 3D shape. Then, to ensure that the same latent features $z$ can be obtained from the respective 2D image, our network contains the 2D image encoder shown in Figure 2. The 2D image encoder of our network takes a 2D RGB image $I$ as input, transforms the 2D image into latent features $\hat{z}$, and tries to map them to the corresponding $z$ in the second stage of training. Finally, we fine-tune the 2D image encoder and the 3D decoder with perceptual and reconstruction losses.
The proposed Sym3DNet consists of two jointly trained encoders: a 2D image encoder and a 3D shape encoder. In particular, the two encoders map a 3D shape and the corresponding 2D image into the same point of the latent space. The 3D output is generated by the 3D decoder from the latent space. As our output is in the canonical view, we apply a symmetry fusion module in the 3D decoder to provide symmetry feature fusion to our model, as shown in Figure 2. The symmetry feature fusion module shares structural shape information between corresponding regions of reflection symmetry and improves the prediction of the object surface. We use a perceptual loss that treats the 3D shape encoder as a high-level 3D shape feature extractor to differentiate between the original shape and the predicted shape. We employ a local perceptual loss, which tries to differentiate the local shape, and a global perceptual loss, which keeps the naturalness of the global 3D shape in the reconstruction (Figure 2).
2.1. Symmetry Prior Fusion
The reflection symmetry plane is the x-y plane of the object-centric coordinates in which 3D objects are aligned with the canonical view. For symmetric 3D objects, the reconstructed outputs of the 3D decoder are aligned with the canonical view, as shown in Figure 1. The features of the 3D shape in the receptive field of the 3D decoder are aligned with the reconstructed output, so the local receptive field of the decoder is symmetric with respect to the x-y plane as well. If B is the voxel located on the other side of the symmetric structure of voxel A, then we share the feature of B inside the receptive field of the 3D decoder with A by feature concatenation, as shown in Figure 2. The concatenated features recover local shapes from the symmetry prior by considering the neighborhood patterns of symmetry fusion, as shown in Figure 3. On the left of Figure 3, a symmetric object has a missing voxel k whose symmetry counterpart h is reconstructed correctly. In this case, voxel h has strong features to be predicted as an object surface (Figure 3a), and the neighbor features of voxel k also support voxel h being predicted as the object surface. On the other hand, voxel k has weak features and would be predicted as empty space. However, after symmetry fusion is applied, the strong neighbor features from h influence k to be predicted as the object surface, as shown in Figure 3b.
As the symmetry features are explicitly added (concatenated) to the network's own features, the symmetry fusion module does not degrade the reconstruction of 3D shapes that are partially non-symmetric. In Figure 3c,d, we show that the neighborhood patterns of symmetry fusion are able to identify the non-symmetric parts of an object. In Figure 3c, voxel q lies on a non-symmetric structure of the object, and the weak neighbor features of s are not able to force q to be predicted as empty space because the neighbor features of q are strong. In Figure 3d, voxel s lies on the other side of the non-symmetric structure of the object, and the weak neighbor features of s cause s to be predicted as empty space even though the neighbor features of q are strong. To include the neighborhood patterns, we add the symmetry fusion module prior to the prediction layer of the 3D decoder.
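For concreteness, the voxel-wise symmetry fusion described above can be sketched as a mirror-and-concatenate operation on the decoder feature map. The following is a minimal PyTorch-style sketch (the paper does not specify a framework); the module name SymmetryFusion, the choice of spatial axis, and the tensor layout (N, C, D, H, W) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SymmetryFusion(nn.Module):
    """Concatenate each voxel's features with the features of its
    reflection-symmetric counterpart (mirror along one spatial axis)."""

    def __init__(self, sym_dim: int = 2):
        super().__init__()
        # sym_dim is the spatial axis perpendicular to the symmetry plane.
        # For a feature tensor of shape (N, C, D, H, W), sym_dim=2 mirrors
        # along D; which axis is correct depends on the voxel layout.
        self.sym_dim = sym_dim

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        mirrored = torch.flip(features, dims=[self.sym_dim])
        return torch.cat([features, mirrored], dim=1)  # channel-wise concat

# Example: fuse a (batch, 256, 16, 16, 16) decoder feature map.
fusion = SymmetryFusion(sym_dim=2)
x = torch.randn(1, 256, 16, 16, 16)
y = fusion(x)  # y.shape == (1, 512, 16, 16, 16)
```

Because the fusion is a concatenation rather than a replacement, the subsequent prediction layer can learn to down-weight the mirrored features wherever the shape is not symmetric, which matches the behavior illustrated in Figure 3c,d.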
2.2. Reconstruction Loss
In the first step of training, the 3D shape encoder and 3D decoder are trained to obtain the latent representation of 3D shapes. Both networks are optimized with the reconstruction loss $\mathcal{L}_{rec}$, which is defined as the binary cross-entropy (BCE) between the reconstructed object $p$ and the ground truth 3D object $gt$. The BCE loss is the mean of the voxel-wise binary cross-entropy between $p$ and $gt$. More formally, we define $\mathcal{L}_{rec}$ as follows:

$$\mathcal{L}_{rec} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, gt_i \log(p_i) + (1 - gt_i)\log(1 - p_i) \,\right],$$ (1)

where $N$ represents the total number of voxels, and $gt_i$ and $p_i$ denote the ground truth occupancy and the corresponding predicted occupancy of the $i$-th voxel, respectively.
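A direct rendering of Equation (1) is sketched below in PyTorch-style Python; the function name and the clamping constant are illustrative choices, and torch.nn.functional.binary_cross_entropy would compute the same quantity.

```python
import torch

def reconstruction_loss(p: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean voxel-wise binary cross-entropy between predicted occupancies p
    (sigmoid outputs in [0, 1]) and ground truth occupancies gt (0 or 1)."""
    eps = 1e-7  # numerical stability for the logarithms
    p = p.clamp(eps, 1.0 - eps)
    bce = -(gt * torch.log(p) + (1.0 - gt) * torch.log(1.0 - p))
    return bce.mean()
```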
In the second step of training, only the 2D image encoder is trained, with an $L_2$ loss between the latent features $\hat{z}$ from the 2D image encoder and the latent features $z$ from the 3D shape encoder, as shown in Equation (2). Here, the latent features $z$ are considered as the ground truth for the corresponding 3D shape:

$$\mathcal{L}_{lat} = \frac{1}{n}\sum_{i=1}^{n} \left( \hat{z}_i(I) - z_i \right)^2,$$ (2)

where $I$ represents the input image, $i$ represents the index of the latent features, and $n$ is the number of latent features. Both the 2D image encoder and the 3D shape encoder output latent features of the same size.
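Equation (2) corresponds to a mean squared error over the 128 latent features. A minimal sketch is given below, assuming the target latents from the 3D shape encoder are detached so that only the 2D image encoder is updated in this stage, as described above; the function name is an illustrative choice.

```python
import torch

def latent_matching_loss(z_hat: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Mean squared error between the latent features predicted by the
    2D image encoder (z_hat) and those of the 3D shape encoder (z),
    following Equation (2). z is treated as a fixed target."""
    return ((z_hat - z.detach()) ** 2).mean()
```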
In the final step of the training, we initialize the 2D image encoder from the second step and the 3D decoder from the first step. Finally, we train the 2D image encoder and the 3D decoder with both the reconstruction and latent losses, along with the perceptual loss $\mathcal{L}_{per}$.
2.3. Perceptual Loss
It is a difficult task to transfer local details observed by the 2D encoder to the 3D decoder through the latent space. To provide local guidance, we propose a local perceptual loss $\mathcal{L}_{local}$ between a partial region $X$ of $gt$ and the corresponding region $X_p$ of $p$. We pass both $X$ and $X_p$ to a pretrained 3D shape encoder and obtain the respective latent representations $z^{X}$ and $z^{X_p}$ of the partial regions.
We apply a 3D voxel mask $M$ to both $gt$ and $p$ to obtain $X = M \odot gt$ and $X_p = M \odot p$, where $\odot$ represents element-wise multiplication. The size of the 3D bounding box $M$ is randomly picked under the condition that the region within the bounding box of $gt$ contains part of the object surface.
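One possible way to realize the random mask $M$ is rejection sampling of axis-aligned boxes until the masked ground truth contains part of the object surface. The sketch below fills in implementation details the text does not specify (box size bounds, sampling distribution), so it should be read as an assumption rather than the exact procedure.

```python
import torch

def random_local_crop_mask(gt: torch.Tensor, min_size: int = 8) -> torch.Tensor:
    """Sample a random 3D box mask M such that the masked region of gt
    contains part of the object surface; box size bounds are assumptions."""
    D, H, W = gt.shape[-3:]
    while True:
        d = torch.randint(min_size, D + 1, (1,)).item()
        h = torch.randint(min_size, H + 1, (1,)).item()
        w = torch.randint(min_size, W + 1, (1,)).item()
        z0 = torch.randint(0, D - d + 1, (1,)).item()
        y0 = torch.randint(0, H - h + 1, (1,)).item()
        x0 = torch.randint(0, W - w + 1, (1,)).item()
        M = torch.zeros_like(gt)
        M[..., z0:z0 + d, y0:y0 + h, x0:x0 + w] = 1.0
        if (M * gt).sum() > 0:  # reject boxes containing no object surface
            return M

# X = M * gt and X_p = M * p are then fed to the pretrained 3D shape encoder.
```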
In the case of 3D reconstruction from a single view, the goal is to recover a 3D shape that visually matches the perfect undistorted 3D shape. A single 2D view may have multiple counterpart shapes that are perceptually acceptable. Using only the reconstruction loss, the output can suffer from a mean-shape problem across such counterparts. To solve the problem, we apply a global perceptual loss $\mathcal{L}_{global}$, which is defined as follows:

$$\mathcal{L}_{global} = \frac{1}{n}\sum_{i=1}^{n} \left( z_i(p) - z_i(gt) \right)^2,$$ (3)

where $z_i(gt)$ are the latent features of the 3D ground truth shape $gt$, $i$ is the index of the latent features, and $p$ represents the predicted shape. Usually, a perceptual loss is calculated from a network trained for object classification. In our method, we use the same 3D shape encoder for the perceptual loss calculation. Our perceptual loss includes both $\mathcal{L}_{local}$ and $\mathcal{L}_{global}$. We define the perceptual loss $\mathcal{L}_{per}$ as follows:

$$\mathcal{L}_{per} = \mathcal{L}_{global} + \mathcal{L}_{local}.$$ (4)
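Putting the pieces together, the perceptual loss can be sketched as follows. The use of an L2 distance for the local term, the unweighted sum of the two terms, and the helper random_local_crop_mask (from the sketch earlier in this section) reflect assumptions where the text leaves details open; shape_encoder and local_shape_encoder stand for the two pretrained 3D shape encoders and are hypothetical names.

```python
import torch

def perceptual_loss(p, gt, shape_encoder, local_shape_encoder):
    """Global + local perceptual loss (sketch). shape_encoder is the pretrained
    3D shape encoder; local_shape_encoder is the encoder pretrained on randomly
    cropped partial shapes. Both map (N, 1, 32, 32, 32) voxels to latent vectors."""
    # Global term: distance between latents of predicted and ground truth shapes.
    with torch.no_grad():
        z_gt = shape_encoder(gt)
    z_p = shape_encoder(p)
    l_global = ((z_p - z_gt) ** 2).mean()

    # Local term: distance between latents of matching partial regions.
    M = random_local_crop_mask(gt)  # from the sketch above
    with torch.no_grad():
        z_x = local_shape_encoder(M * gt)
    z_xp = local_shape_encoder(M * p)
    l_local = ((z_xp - z_x) ** 2).mean()

    return l_global + l_local  # assumed unweighted sum
```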
2.4. Network Architecture
The 3D shape encoder takes 32 × 32 × 32 3D voxels as input. The global latent features are learned directly from the input voxels through the 3D shape encoder, which consists of four 3D convolution blocks. Each block contains a 3D convolution, batch normalization, and ReLU activation. The four 3D convolution layers are configured with kernel sizes of (4, 4, 4, 4), strides of (2, 2, 2, 1), and channels of (128, 256, 512, 128), respectively. Our 3D shape encoder produces 128 latent features.
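The encoder configuration above can be written down as a minimal PyTorch-style sketch. The padding values (1, 1, 1, 0) are assumptions chosen so that a 1 × 32 × 32 × 32 input reduces to a 128-D latent vector; the class name is illustrative.

```python
import torch
import torch.nn as nn

class ShapeEncoder3D(nn.Module):
    """3D shape encoder: four Conv3d blocks (kernel 4, strides 2/2/2/1,
    channels 128/256/512/128) with batch norm and ReLU. Padding values are
    assumptions so that a 1x32x32x32 input yields a 128-D latent vector."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv3d(1, 128, kernel_size=4, stride=2, padding=1),    # 32 -> 16
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.Conv3d(128, 256, kernel_size=4, stride=2, padding=1),  # 16 -> 8
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.Conv3d(256, 512, kernel_size=4, stride=2, padding=1),  # 8 -> 4
            nn.BatchNorm3d(512), nn.ReLU(inplace=True),
            nn.Conv3d(512, latent_dim, kernel_size=4, stride=1),      # 4 -> 1
            nn.BatchNorm3d(latent_dim), nn.ReLU(inplace=True),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        return self.blocks(voxels).flatten(start_dim=1)  # (N, 128)

# z = ShapeEncoder3D()(torch.randn(1, 1, 32, 32, 32))  # z.shape == (1, 128)
```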
After obtaining the global latent feature vector $z$ from the 3D shape encoder, we feed $z$ to the 3D decoder to reconstruct the respective 3D voxels from the latent space, as shown in Figure 2. The 3D decoder is designed with four deconvolution blocks, each including a transposed 3D convolution, batch normalization, and ReLU activation. The four transposed 3D convolutions have kernel sizes of (4, 4, 4, 4), strides of (1, 2, 2, 2), and channels of (512, 256, 256, 1), respectively. We apply an element-wise logistic sigmoid function to each voxel in the output layer. Prior to the prediction layer, we add our symmetry fusion module, as shown in Figure 2.
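A corresponding sketch of the 3D decoder is given below. The reshape of the latent vector to a 1 × 1 × 1 volume, the padding values, the placement of the symmetry fusion at the 16³ feature map, and the doubled input channels of the final layer (caused by the fusion concatenation) are assumptions; SymmetryFusion refers to the sketch in Section 2.1.

```python
import torch
import torch.nn as nn

class ShapeDecoder3D(nn.Module):
    """3D decoder: four ConvTranspose3d layers (kernel 4, strides 1/2/2/2,
    channels 512/256/256/1) with batch norm and ReLU, a symmetry fusion module
    before the final prediction layer, and a voxel-wise sigmoid output."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose3d(latent_dim, 512, 4, stride=1),      # 1 -> 4
            nn.BatchNorm3d(512), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(512, 256, 4, stride=2, padding=1),  # 4 -> 8
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 256, 4, stride=2, padding=1),  # 8 -> 16
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
        )
        self.fusion = SymmetryFusion(sym_dim=2)                    # 256 -> 512 channels
        self.predict = nn.Sequential(
            nn.ConvTranspose3d(512, 1, 4, stride=2, padding=1),    # 16 -> 32
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.up(z.view(z.size(0), -1, 1, 1, 1))
        x = self.fusion(x)
        return self.predict(x).squeeze(1)  # (N, 32, 32, 32) occupancy probabilities
```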
The 2D image encoder takes an RGB image $I$ as input. The latent feature vector $\hat{z}$ is extracted by the 2D image encoder from a single-view RGB image to learn the respective $z$. In our implementation, ResNet [49] is used for the 2D image encoder. The 2D image encoder takes a 224 × 224 image as input and extracts 128 latent features. At test time, we remove the 3D shape encoder and keep only the 2D image encoder with the 3D decoder.
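The 2D image encoder can be sketched with an off-the-shelf ResNet backbone whose final fully connected layer is replaced to output 128 latent features. The specific ResNet depth (ResNet-18 here) is an assumption, as the text only states that ResNet [49] is used.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageEncoder2D(nn.Module):
    """2D image encoder: a ResNet backbone mapping a 224x224 RGB image to a
    128-D latent vector. ResNet-18 is an assumed choice of backbone."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, latent_dim)
        self.backbone = backbone

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)  # (N, 128)

# At test time only the 2D image encoder and the 3D decoder are used, e.g.:
# p = ShapeDecoder3D()(ImageEncoder2D()(torch.randn(1, 3, 224, 224)))
```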