Article

3D Shape Generation via Variational Autoencoder with Signed Distance Function Relativistic Average Generative Adversarial Network

Faculty of Information Science and Technology, Multimedia University, Malacca 75450, Malaysia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(10), 5925; https://doi.org/10.3390/app13105925
Submission received: 13 March 2023 / Revised: 9 April 2023 / Accepted: 17 April 2023 / Published: 11 May 2023

Abstract

3D shape generation is widely applied in various industries to create, visualize, and analyze complex data, designs, and simulations. Typically, 3D shape generation uses a large dataset of 3D shapes as the input. This paper proposes a variational autoencoder with a signed distance function relativistic average generative adversarial network, referred to as 3D-VAE-SDFRaGAN, for 3D shape generation from 2D input images. The generative adversarial network (GAN) and the variational autoencoder (VAE) are the algorithms typically used to generate realistic 3D shapes. However, it is very challenging to train a stable 3D shape generation model using a VAE-GAN. This paper proposes an efficient approach to stabilize the training process of the VAE-GAN to generate high-quality 3D shapes. A 3D mesh-based shape is first generated using a 3D signed distance function representation by feeding a single 2D image into the 3D-VAE-SDFRaGAN network. The signed distance function is used to maintain inside–outside information in the implicit surface representation. In addition, a relativistic average discriminator loss function is employed as the training loss function. The polygon mesh surfaces are then produced via the marching cubes algorithm. The proposed 3D-VAE-SDFRaGAN is evaluated on the ShapeNet dataset. The experimental results indicate a notable enhancement in qualitative performance, as evidenced by the visual comparison of the generated samples, as well as in the quantitative performance evaluated using the chamfer distance metric. The proposed approach achieves an average chamfer distance score of 0.578, demonstrating superior performance compared to existing state-of-the-art models.

1. Introduction

Advancements in Artificial Intelligence (AI) technology have enabled the use of 3D shapes in the computer vision (CV) and computer graphics (CG) domains. The demand for fine details and accurate 3D shapes in robotics, 3D games, medical imaging, and virtual and augmented reality applications has necessitated the development of robust deep learning algorithms, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), to generate 3D mesh-based shapes.
Recent advancements in unsupervised learning with deep neural networks, trained on robust 3D datasets [1,2,3,4], have led to more detailed representations of shape space in deep learning models. These advancements enable realistic 3D shape generation for applications such as 3D scene understanding, data augmentation, visualization, simulation, analysis, and interpretation in the field of AI technology [5,6,7]. However, convolutional neural network (CNN)-based generative models that use voxels [6,7,8] and point-cloud [9,10,11] data representations as inputs often yield unsatisfactory results due to low resolution and high memory usage with voxels and the unstructured nature of point-clouds. On the other hand, using mesh data representation as input for 3D-GANs and 3D-VAEs leads to the generation of high-resolution, compact, and computationally efficient 3D shapes. However, the irregular and complex organization of mesh data representation [12,13,14] makes it challenging for it to be used as input for learning in a CNN.
Recently, several variations of the 3D-VAE-GAN [6,8,15] have been proposed to address the problem of 3D shape generation. These models have the ability to produce accurate 3D shape samples, classify 3D shapes, and reconstruct 3D shapes from 2D images. They are based on the well-known original GAN architecture [16] and have a tendency towards unstable training. While these models excel in creating 3D shapes from single classes, their outputs tend to be coarse, and training on complex data distributions with multiple object classes in various poses can be challenging. This paper aims to improve upon existing VAE-GAN models by generating 3D mesh-based shapes using a 3D-CNN framework and jointly training over multiple classes to achieve fully data-driven modeling.
To generate high-quality 3D mesh-based shapes from 2D images and capture a more complex distribution, this paper presents an innovative deep learning approach by integrating a relativistic average loss function with a signed distance function-based VAE-GAN framework, referred to as 3D-VAE-SDFRaGAN.
The 3D-VAE-SDFRaGAN method is a novel approach for generating 3D shapes that combines the strengths of VAEs and GANs. By using a signed distance function (SDF), this method can overcome the limitations of traditional 3D shape generation methods and create more realistic shapes with smoother surfaces. The relativistic average GAN (RaGAN) is also employed to enhance the stability and quality of the generated shapes.
This approach has broad potential applications in several fields, including computer vision, computer graphics, gaming, robotics, medical imaging, simulation, and virtual reality. The generated 3D shapes offer improved communication, better design, and an enhanced understanding of complex structures and concepts.
The proposed 3D-VAE-SDFRaGAN aims to produce a signed distance function field on a gridded domain by utilizing a deconvolution layer. In doing so, it helps to produce a polygon mesh surface reconstruction with a higher quality 3D shape generation and stabilizes the GAN training process through the proposed relativistic average GAN loss function. The main contributions of this work are as follows:
  • VAE and GAN are integrated to enable the simultaneous learning of encoding, generating, and comparing data. The proposed method explicitly learns the latent spaces of 2D images, which are then used to generate corresponding signed distance functions of objects and reconstruct them into high-quality 3D mesh-based shapes.
  • A relativistic average GAN loss function is proposed to enhance the stability of the GAN training process, leading to a better convergence with the VAE-GAN framework.
  • The proposed 3D-VAE-SDFRaGAN model not only generates high-quality 3D shapes from their corresponding 2D images but also achieves superior results compared to other state-of-the-art 3D shape generative methods.
This paper is structured as follows: related works are presented in Section 2. The concept of the SDF, data pre-processing, and model architecture for the proposed method are discussed in Section 3. Section 4 describes the training procedure and the results of the experiment, while Section 5 discusses the limitations of the proposed model and future directions. Finally, the conclusion is presented in Section 6.

2. Related Work

2.1. Modeling and Generation of 3D Shapes

Early studies on 3D geometric shape models with recognizable 2D characteristics can be traced back to the pioneering works of Roberts [17] and Marr [18]. Despite the advent of more advanced methods, traditional approaches that use symmetric parts [19], skeletons [20], and CAD wireframes [21] continue to be popular. For instance, Miao et al. [22] proposed SymmSketch, a progressive method that uses self-symmetric shape components, mutual-symmetric shape components, and a computational theory for recovering 3D construction curves to produce symmetric 3D free-form shapes. Huang et al. [23] explored the generation of 3D shapes using pre-trained templates that generate object structure and surface geometry. Jones et al. [24] proposed ShapeAssembly, which combined the benefits of neural and procedural shape modeling and generated complete objects by synthesizing and assembling multiple subparts. Another example is the Shape Part Slot Machine developed by Wang et al. [25], which retrieved and combined pre-existing high-quality part meshes to create new shapes. Unlike these models, the proposed 3D-VAE-SDFRaGAN generates 3D mesh-based shapes in an end-to-end unsupervised manner without the need to collect and assemble shapes from a database.

2.2. 3D Shape Generation via Deep Neural Networks

Recently, researchers have made significant advancements in creating distributions that can depict shapes without relying on prior knowledge. This has been achieved by learning complex functions from data and parameterizing them to generate distributions [7,26]. A number of algorithms have been proposed to achieve this, including deep convolutional autoencoders [27], autodecoders [28], variational autoencoders [9], convolutional neural networks [29], recurrent neural networks [30], deep belief networks [3], and generative adversarial networks [6,31]. These algorithms use various data representations, such as voxels, point clouds, and meshes.
Wu, Z. et al. [3] used a volumetric data representation as input to perform 3D shape completion and recognition tasks. Wu, J. et al. [6] proposed the use of a generative adversarial network (GAN) to generate 3D shapes from a probabilistic latent space. Gadelha et al. [32] introduced PrGANs, a 3D generative model trained within the GAN framework that matches the input distributions of 2D views. Zhu et al. [8] proposed an architecture that used a GAN framework and incorporated a 2D image enhancer network to efficiently train the 3D model generator network. The output of the architecture was a voxel-based 3D shape, which is computationally expensive and requires a large memory footprint.
Li et al. [33] introduced SP-GAN, a novel generative model that represents 3D shapes as point clouds. The input latent vector is disentangled into a global prior (sphere points) and a local prior (random latent vector), and the generator network uses style embedding and adaptive instance normalization to develop the 3D shape. Gao et al. [34] proposed GET3D, which leverages the success of differentiable surface modeling, differentiable rendering, and training GAN models from 2D images. GET3D produces high-quality textured 3D meshes with complex topologies, rich geometric features, and detailed textures. The model is trained using adversarial losses and uses a rasterization-based differentiable renderer to obtain 2D RGB images and silhouettes. However, GET3D relies only on generated data for evaluation purposes and needs to be evaluated on real-world data for a more accurate assessment. Shen et al. [35] proposed GINA-3D, which uses implicit neural representations to generate 3D models. It uses generative adversarial networks (GANs) to learn from a dataset of 3D scans of real-world objects and generates 3D models that are both realistic and diverse. GINA-3D introduces a novel neural network architecture that can efficiently represent 3D geometry in an implicit form and uses a combination of adversarial and perceptual losses to improve the quality of the generated 3D models. In contrast, the proposed 3D-VAE-SDFRaGAN can generate 3D shapes with fine details directly from 2D images.
In contrast to the existing works, our proposed model seeks to use a similar framework but with a signed distance function as a 3D data representation and a RaGAN loss function to train the model to produce mesh-based 3D shapes. The generated 3D shape is of better quality and is less computationally expensive with a smaller memory footprint. To the best of our knowledge, no work has directly mapped 2D images using the VAE-GAN framework and relativistic average GAN loss function to 3D mesh-based shapes.

2.3. 3D Shape Generation with Signed Distance Functions

The signed distance function (SDF) is a data representation that has recently gained popularity in the computer vision and graphics domains due to its ability to overcome certain limitations of voxels, point clouds, and meshes when used as input data for CNNs. The SDF represents the structural relationship of distances to 3D surfaces and has been used to train various deep learning algorithms, such as GANs, VAEs, and auto-decoder (AD) algorithms, to generate detailed 3D mesh-based shapes. In the field of 3D mesh-based shape generation, Jiang and Marcus [36] proposed a 3D-GAN-based hierarchical detail-enhancing mesh shape generation method using the SDF data representation. They generated SDF fields on a gridded domain to create higher-quality polygon mesh surfaces, but their network was limited by being driven by uninformative random vectors and trained only on 3D data. Kingkan et al. [37] proposed a 3D-VAE-GAN-based framework that mapped point clouds directly to other 3D shape representations. Zheng et al. [38] proposed SDF-StyleGAN, which extends StyleGAN, the popular generative model for 2D images, to the 3D domain using an implicit SDF-based approach. The network generates new shapes by sampling from the learned SDF representation and then extracting the surface mesh, and it is trained using a combination of adversarial and reconstruction losses to produce high-quality 3D shapes with controllable attributes. The 3D-VAE-SDFRaGAN, however, has an advantage in generating a greater variety of shapes by interpolating between different SDF representations. In contrast to these works, the proposed 3D-VAE-SDFRaGAN framework is trained on 2D images and 3D models concurrently to enhance the quality of 3D mesh-based shape generation and stabilize the training process. To date, no existing study uses a 3D-VAE-GAN framework with a relativistic average GAN loss function to map 2D images directly to 3D mesh-based shapes. A comparative analysis between the proposed model and related works on 3D shape generation with signed distance functions is offered below.

2.4. Improving GAN Training for Quality 3D Model Generation

Several recent studies have demonstrated the effectiveness of generative adversarial networks (GANs) in various applications [6,8,39]. However, GANs can be challenging to train because they involve a non-convex game with high-dimensional continuous parameters. The conventional approach to training GANs involves finding low values of the cost functions using gradient descent techniques, rather than finding the Nash equilibrium of the game, which can result in non-convergence or instability during training. To address these issues, various techniques have been proposed to stabilize GAN training [40]. Smith and Meger [7] proposed the 3D-VAE-IWGAN algorithm, which generates realistic 3D models by leveraging the Wasserstein distance with a gradient penalty as the loss function; IWGAN stabilized GAN training and encouraged model convergence. Furthermore, Chen et al. [41] proposed InfoGAN, which adds variational mutual information maximization to the GAN loss function, and trained it on images of 3D faces and chairs. Although these methods showed faster convergence and generated higher-quality results compared to the standard GAN loss function, they have not been applied to generate 3D mesh-based shapes. In this study, the relativistic average GAN loss function is proposed to stabilize the training process and generate high-quality 3D mesh-based shapes from 2D images.
In a nutshell, many of the existing works use data representations such as voxels, point clouds, and meshes, while others use the signed distance function (SDF) representation for creating detailed 3D mesh-based shapes. The proposed 3D-VAE-SDFRaGAN uses a framework similar to previous works but with a signed distance function as the 3D data representation and a RaGAN loss function to stabilize the training. The generated 3D shapes are of better quality and computationally less expensive than those of previous works. Table 1 presents a comparative analysis between the proposed model and related works. Table 2 provides a comparative analysis of the proposed model with related works on improving GAN algorithms.

3. Proposed 3D-VAE-SDFRaGAN

The steps involved in the proposed 3D-VAE-SDFRaGAN for 3D shape generation are outlined in this section. A 3D signed distance function field and a 2D image are the inputs to the system. A geometry analysis pipeline is presented to convert the mesh-based shape data into a 3D signed distance function (SDF) field as a format that can be used for learning. Next, the process of creating the 2D image dataset is described. The variational autoencoder (VAE) [42] and the generative adversarial network (GAN) [16] are utilized and integrated to generate 3D shapes in this work. A brief background on these techniques is provided to give a better understanding of their implementation in our system. Additionally, a relativistic average GAN loss function is proposed to establish a mapping between the 2D images and the signed distance functions. As a result, realistic 3D shapes that correspond to the input 2D images can be generated by the network.

3.1. 2D Image and 3D SDF Dataset Generation

ShapeNet [1] was used as the dataset in the experiments. The dataset covers 55 common object categories and comprises 51,300 3D models. Furthermore, the 2D dataset utilized the images of Choy et al. [30], which contain rendered images of ShapeNet 3D models from 23 different views. Information about the 2D images used in our experiment is provided in Table 3.
To preprocess the 3D signed distance function (SDF) field, SDFs were first derived from the 3D meshes [43]. Motivated by [36,37], the network output takes the form of an SDF. An SDF is a subset of implicit functions that assigns to a 3D point a real value expressing the structural relationship and distance to the 3D surface, rather than a likelihood. The SDF data representation, a 3D representation that uses signed values to encode a mesh object's inside–outside characteristics, has become popular for mesh generation with deep convolutional neural networks, as it is not limited by a fixed topology, unlike meshes and point clouds. Additionally, the SDF representation offers a higher effective resolution than voxel grids.
Given a spatial point p ∈ ℝ³, the signed distance function k(p) ∈ ℝ encodes the point's distance to its closest surface point, with a negative sign (−) when p lies inside the object and a positive sign (+) when it lies outside. In mathematical terms, given a set of points in a 3D Euclidean space, Ω, where Ω is a non-zero-volume open set with a smooth enclosed boundary, δΩ, the signed distance function, k, is defined as
$$k(x) = \begin{cases} -\mathrm{dis}(x, \delta\Omega), & \text{if } x \in \Omega \\ 0, & \text{if } x \in \delta\Omega \\ \mathrm{dis}(x, \delta\Omega), & \text{if } x \in \Omega^{c} \end{cases} \tag{1}$$
where δΩ denotes the boundary of Ω. The distance from a point x in the 3D Euclidean space to the boundary δΩ is defined as the infimum of the distances from x to all points on the boundary:
$$\mathrm{dis}(x, \delta\Omega) = \inf_{y \in \delta\Omega} \mathrm{dis}(x, y) \tag{2}$$
The sign of a point x with respect to the boundary, δΩ, is defined as follows:
$$\mathrm{sign}(x, \delta\Omega) = \begin{cases} -1, & \text{if } x \in \Omega \\ 0, & \text{if } x \in \delta\Omega \\ 1, & \text{if } x \in \Omega^{c} \end{cases} \tag{3}$$
The signed distance function in (1), which is the product of (2) and (3), can also be written as
$$k(x) = \mathrm{dis}(x, \delta\Omega) \cdot \mathrm{sign}(x, \delta\Omega) \tag{4}$$
To convert a triangular mesh into an SDF, the triangular mesh is first centered and normalized. Next, a 3D unit grid of resolution 64³ is established around the geometry. The calculation of the point-to-mesh distance is performed using an axis-aligned bounding box (AABB) tree, and the winding number of each point is computed to determine the sign of each point in the grid. Figure 1 illustrates examples of the transformation of 3D meshes into SDFs and 2D images. In the first row, 2D images of the car, table, sofa, and cabinet are shown. The chair and table meshes with their SDF views from different angles are shown in the second and third rows. The distance between each point and the surface is depicted in an SDF colormap. The colors in this colormap range from dark blue (inside) to yellow (outside).
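To make this preprocessing concrete, the snippet below sketches the mesh-to-SDF conversion in Python. It is a minimal sketch, not the authors' pipeline: it assumes the open-source trimesh library, whose signed_distance query (positive inside, by trimesh's convention, hence the negation) stands in for the AABB-tree and winding-number computation described above, and the function and variable names are illustrative only.

```python
import numpy as np
import trimesh

def mesh_to_sdf_grid(mesh_path, resolution=64):
    """Convert a triangular mesh into a dense SDF grid (negative inside, positive outside)."""
    mesh = trimesh.load(mesh_path, force='mesh')

    # Center the mesh and scale it into a unit cube.
    mesh.apply_translation(-mesh.bounds.mean(axis=0))
    mesh.apply_scale(1.0 / mesh.extents.max())

    # Regular 64^3 grid of query points around the normalized geometry.
    ticks = np.linspace(-0.5, 0.5, resolution)
    grid = np.stack(np.meshgrid(ticks, ticks, ticks, indexing='ij'), axis=-1)
    points = grid.reshape(-1, 3)

    # trimesh reports positive distances inside the surface, so negate to match Equation (1).
    sdf = -trimesh.proximity.signed_distance(mesh, points)
    return sdf.reshape(resolution, resolution, resolution)
```

In practice, the resulting values would likely be truncated or rescaled into [−1, 1] so that the training targets match the Tanh output range of the SDF-generator network described in Section 3.3.2.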

3.2. VAE, GAN, and VAE-GAN

A variational autoencoder (VAE) is an extended version of an autoencoder that imposes additional restrictions on the latent variables. This restriction turns the network into a model that learns its input data through a latent variable model, so the VAE learns the parameters of the probability distribution that models the data. The VAE network comprises encoder and decoder networks: the encoder network compresses the input data into a latent representation over the latent distribution p(z) regulated by the prior, while the decoder network generates a new instance of the input data from the latent representation. The VAE's weights are trained simultaneously by minimizing the reconstruction loss and the Kullback–Leibler divergence between the learned latent distribution and the prior.
On the other hand, a generative modeling algorithm called the generative adversarial network (GAN) utilizes a generator network and a discriminator network to generate new instances of data. The generator network transforms a random vector drawn from a Gaussian distribution into a data sample that is similar to the training dataset, while the discriminator network evaluates if the generated data sample is similar to the training dataset. The generator and discriminator networks train each other by competing against each other until the generator produces high-quality data.
The integration of VAE and GAN to improve the quality of data generation is becoming increasingly popular. The combination is computationally more efficient and produces higher-quality generated data. For instance, Larsen et al. [44] proposed a combination of VAE and GAN for 2D image generation, which resulted in improved feature representation and similarity measures compared to using either VAE or GAN alone. Wu, J. et al. [6] combined 3D-VAE-GAN to generate a voxel-based 3D model from 2D images, while Kingkan and Hashimoto [37] combined 3D-VAE-GAN to generate SDFs from 3D point clouds. Another work by Smith and Meger [7] combined 3D-VAE-IWGAN to perform voxel-based 3D model generation, 3D model reconstruction, and 3D shape completion from a 2D image. In this work, the authors propose a similar framework to learn the latent spaces of 2D images and map them to their respective signed distance functions.

3.3. 3D-VAE-SDFRaGAN Framework

A novel approach called 3D-VAE-SDFRaGAN is proposed to learn the latent representation of a 2D image and generate 3D mesh-based shapes that correspond to the 2D image. The key innovation of this approach is the use of the learned latent space of a 2D image from the 2D-VAE network as input to the SDF-generator network. A relativistic average GAN loss function is used to train the SDF-generator network, allowing it to generate high-quality 3D shapes from both the 2D image and the SDF. The architecture of the 3D-VAE-SDFRaGAN consists of three main components: a 2D-image encoder network (E), an SDF-generator network (G_sdf), and an SDF-discriminator network (D_sdf). The 2D-image encoder network compresses the 2D image into a latent representation that serves as input to the SDF-generator network, which generates 3D mesh-based shapes from this representation. The generated shapes are evaluated by the SDF-discriminator network, which provides feedback to the SDF-generator network to continually improve its ability to generate high-quality 3D shapes. Figure 2 depicts the overall architecture of the 3D-VAE-SDFRaGAN and the interaction of its three components: (a) the encoder network, which converts a 2D image to latent space; (b) the generator network, which produces an SDF from a 2D image latent vector; and (c) the relativistic discriminator network, which evaluates the likelihood of an actual SDF being more authentic than a randomly selected generated SDF from the generator. The 2D image is encoded through the 2D-image encoder network in Figure 2a to obtain its latent vector (z_img). The latent vector (z_img) is concatenated with random noise (z) before being sent to the G_sdf network, as shown in Figure 2b, to generate fake SDFs (sdf_gen, sdf_z). Next, (sdf_gen, sdf_z) are concatenated with (sdf_real) as input for D_sdf, as shown in Figure 2c, which returns the prediction score.

3.3.1. 2D-Image Encoder Network (E)

The latent representation of a 2D image is learned by the encoder network in the proposed 3D-VAE-SDFRaGAN framework. This network is composed of five convolutional layers with channels of {64, 128, 256, 512, 400}, kernel sizes of {11, 5, 5, 5, 8}, and strides of {4, 2, 2, 2, 1}. To ensure the output of the convolutional layers is free of negative values and has a consistent mean and variance, ReLU activation functions and batch normalization layers are included between the convolutional layers. The last layer of the encoder network outputs a 400-dimensional latent vector, which is split into a 200-dimensional mean latent vector and a 200-dimensional variance latent vector. A 200-dimensional latent vector is then sampled from the Gaussian distribution by the encoder network's sampling layer. The loss function of the encoder network (L_E) is composed of two parts: the KL divergence loss (L_KL) and the reconstruction loss (L_R), defined as
$$L_{KL} = D_{KL}\big(q(z_{img} \mid img) \,\|\, p(z)\big) \tag{5}$$
$$L_{R} = \big\| G(E(img)) - sdf_{real} \big\|_{2} \tag{6}$$
where L_KL is the KL divergence loss between the learned latent distribution (z_img) and the prior distribution p(z), a uniform distribution over [−1, 1]; sdf_real is the SDF of a 3D shape from the training set; img is the corresponding 2D image; and q(z_img | img) denotes the variational distribution of the latent representation.
The KL divergence loss measures the difference between the learned latent distribution q(z_img | img) and the prior distribution p(z). The reconstruction loss measures the difference between the output of the SDF-generator network when fed with the mean latent vector, G(E(img)), and the ground-truth SDF of the 3D shape from the training set, sdf_real. The overall loss function encourages the encoder network to generate a latent representation that both follows the prior distribution and accurately reconstructs the SDF of the 3D shape. To allow the SDF-generator network to sample z_img from the same distribution as p(z), the KL divergence constrains q(z_img | img) to be as similar to p(z) as possible.
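As a concrete reading of the layer configuration above, the following PyTorch sketch builds the 2D-image encoder with the stated channels, kernel sizes, and strides and splits the 400-dimensional output into a 200-dimensional mean and a 200-dimensional (log-)variance. It is a minimal sketch rather than the authors' code: the padding values (chosen so a 256×256 RGB input collapses to a 1×1 feature map) and the log-variance parameterization are assumptions.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """2D-image encoder E: RGB image -> sampled 200-D latent vector z_img (plus mu, log_var)."""
    def __init__(self, latent_dim=200):
        super().__init__()
        channels = [3, 64, 128, 256, 512, 400]
        kernels  = [11, 5, 5, 5, 8]
        strides  = [4, 2, 2, 2, 1]
        paddings = [4, 2, 2, 2, 0]   # assumed: reduces a 256x256 input to a 1x1 feature map
        layers = []
        for i in range(5):
            layers.append(nn.Conv2d(channels[i], channels[i + 1], kernel_size=kernels[i],
                                    stride=strides[i], padding=paddings[i]))
            if i < 4:  # batch normalization and ReLU between the convolutional layers
                layers += [nn.BatchNorm2d(channels[i + 1]), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.latent_dim = latent_dim

    def forward(self, img):
        h = self.features(img).flatten(start_dim=1)                    # (B, 400)
        mu, log_var = h[:, :self.latent_dim], h[:, self.latent_dim:]
        z_img = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
        return z_img, mu, log_var
```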

3.3.2. SDF-Generator and SDF-Discriminator Networks

The SDF-generator network (G_sdf) is composed of five transpose convolution layers with channel numbers {512, 256, 128, 64, 1}, kernel sizes of {4, 4, 4, 4, 4}, and strides of {1, 2, 2, 2, 2}. Between the transpose convolution layers, ReLU and batch normalization layers are utilized, except for the last layer, which applies a Tanh function to map the output into [−1, 1]. The output of the generator network (sdf_gen) is an SDF stored in a 64³ matrix with values in the range of [−1, 1]. The triangular mesh surfaces are generated from the SDF matrix using the marching cubes algorithm (MCA) [45].
The SDF-discriminator network (D_sdf) is similar to the SDF-generator network and uses leaky ReLU as the activation function. The sigmoid function is applied at the last layer to compress the output to [0, 1]. It consists of five 3D-convolution layers with channel numbers {64, 128, 256, 512, 1}, kernel sizes of {4, 4, 4, 4, 4}, and strides of {2, 2, 2, 2, 1}. The loss function for SDFRaGAN is
$$L_{SDFRaGAN} = -\,\mathbb{E}_{sdf_{real} \sim P}\big[\log \tilde{D}(sdf_{real})\big] - \mathbb{E}_{z \sim Q}\big[\log\big(1 - \tilde{D}(z)\big)\big] - \mathbb{E}_{z_{img} \sim Q}\big[\log\big(1 - \tilde{D}(z_{img})\big)\big] \tag{7}$$
where
$$\begin{aligned} \tilde{D}(sdf_{real}) &= \mathrm{sigmoid}\big(C(sdf_{real}) - \mathbb{E}_{z \sim Q}\, C(z, z_{img})\big) \\ \tilde{D}(z) &= \mathrm{sigmoid}\big(C(z) - \mathbb{E}_{sdf_{real} \sim P}\, C(sdf_{real})\big) \\ \tilde{D}(z_{img}) &= \mathrm{sigmoid}\big(C(z_{img}) - \mathbb{E}_{sdf_{real} \sim P}\, C(sdf_{real})\big) \end{aligned} \tag{8}$$
The total loss function L_total used in the proposed 3D-VAE-SDFRaGAN framework is the sum of three components: the reconstruction loss (L_R), the cross-entropy adversarial loss (L_SDFRaGAN), and the KL divergence (L_KL), which imposes a limit on the distribution of the output of the 2D encoder network:
$$L_{total} = L_{SDFRaGAN} + \gamma_{1} L_{KL} + \gamma_{2} L_{R} \tag{9}$$
where γ1 and γ2 are the weights of L_KL and L_R, respectively.
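The two 3D networks and the relativistic average loss of Equations (7) and (8) can be sketched as follows in PyTorch. This is an illustrative reading rather than the authors' implementation: the padding values are chosen so the generator expands a 1³ latent code to a 64³ volume and the discriminator reduces it back to a single score, and, as in Equation (8), the critic returns the raw score C(·) with the sigmoid applied to the relativistic difference inside the loss.

```python
import torch
import torch.nn as nn

class SDFGenerator(nn.Module):
    """G_sdf: 200-D latent vector -> 64^3 SDF volume with values in [-1, 1]."""
    def __init__(self, latent_dim=200):
        super().__init__()
        chans, strides, pads = [latent_dim, 512, 256, 128, 64, 1], [1, 2, 2, 2, 2], [0, 1, 1, 1, 1]
        layers = []
        for i in range(5):
            layers.append(nn.ConvTranspose3d(chans[i], chans[i + 1], kernel_size=4,
                                             stride=strides[i], padding=pads[i]))
            if i < 4:
                layers += [nn.BatchNorm3d(chans[i + 1]), nn.ReLU(inplace=True)]
        layers.append(nn.Tanh())                      # maps the final output into [-1, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1, 1))

class SDFDiscriminator(nn.Module):
    """D_sdf critic: 64^3 SDF volume -> raw score C(.); sigmoid is applied inside the loss."""
    def __init__(self):
        super().__init__()
        chans, strides, pads = [1, 64, 128, 256, 512, 1], [2, 2, 2, 2, 1], [1, 1, 1, 1, 0]
        layers = []
        for i in range(5):
            layers.append(nn.Conv3d(chans[i], chans[i + 1], kernel_size=4,
                                    stride=strides[i], padding=pads[i]))
            if i < 4:
                layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.net = nn.Sequential(*layers)

    def forward(self, sdf):
        return self.net(sdf).view(sdf.size(0))

def ragan_discriminator_loss(c_real, c_fake, eps=1e-7):
    """Eqs. (7)-(8): score real SDFs against the average fake and fakes against the average real."""
    d_real = torch.sigmoid(c_real - c_fake.mean())
    d_fake = torch.sigmoid(c_fake - c_real.mean())
    return -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())
```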

4. Experiments

In this section, the training procedure is first discussed. Then, a comparison is made between our proposed 3D-VAE-SDFRaGAN and several state-of-the-art generative models. The qualitative and quantitative results are also presented. Qualitative evaluation involves the visual inspection of the generated 3D shapes, while quantitative evaluation involves the utilization of the chamfer distance (CD) metric. The performance of the proposed model is determined by comparing the results with those of the state-of-the-art models.

4.1. Training Procedure

The pair (img_i, sdf_real_i) drawn from the training dataset is used to train the proposed 3D-VAE-SDFRaGAN framework, where img_i is the 2D image and sdf_real_i is the corresponding signed distance function of the 3D object. During training, the latent representation (z_img), which represents the image feature, is produced by the image encoder after encoding a 2D image img_i. The SDF-generator network, which receives z_img as a 200-dimensional vector, generates sdf_gen as output. A random vector z is sampled from a uniform distribution p(z) and fed to the SDF-generator network to produce sdf_z. Both generated SDFs (sdf_gen, sdf_z), along with sdf_real, serve as the input to the SDF-discriminator network for classification purposes. The SDF-discriminator network distinguishes whether an SDF was generated by the SDF-generator network or is a real SDF from the dataset. Learning rates of 10⁻⁵, 10⁻³, and 10⁻³ are used during training for the discriminator, generator, and encoder networks, respectively. For optimization purposes, we use the Adam optimizer with β = 0.5. The proposed network is trained separately on each class of objects with a batch size of 64, and during the experiment, the values of γ1 and γ2 are set to 1 and 100, respectively. We used the ShapeNet dataset to train and test the proposed 3D-VAE-SDFRaGAN, with a split of 80% for the training set and 20% for the test set [30,46,47]. The model was trained for 1500 epochs on the table, lamp, sofa, and cabinet categories and for 1700 epochs on the car and chair categories. The experiment was conducted on a single Nvidia GeForce GTX 1080 GPU, with a total training time of 192 h; the combined VAE-GAN architecture imposes a relatively high computational burden, and the limited resources of a single GPU resulted in long training times. The following loss functions are used in the encoder, generator, and discriminator networks during training:
$$L_{E} = \gamma_{1} L_{KL} + \gamma_{2} L_{R} \tag{10}$$
$$L_{G} = \mathbb{E}_{z \sim Q}\big[\log\big(1 - \tilde{D}(z)\big)\big] + \mathbb{E}_{z_{img} \sim Q}\big[\log\big(1 - \tilde{D}(z_{img})\big)\big] + \gamma_{2} L_{R} \tag{11}$$
$$L_{D} = L_{SDFRaGAN} \tag{12}$$
The parameters of the discriminator network are updated only when its accuracy on the current batch is less than 0.8. Figure 3 illustrates the training process flow of our proposed framework. During training, a random noise vector z is sampled from a uniform distribution p(z), and the signed distance function sdf_real is taken from the dataset. Two new signed distance functions are additionally generated: sdf_gen from the encoded image latent vector z_img, and sdf_z from the random noise z.
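Putting the pieces together, the loop below sketches one training iteration with the hyperparameters listed above (learning rates of 10⁻⁵/10⁻³/10⁻³, Adam with β = 0.5, γ1 = 1, γ2 = 100, and the 0.8 accuracy gate on the discriminator update). It reuses the ImageEncoder, SDFGenerator, SDFDiscriminator, and ragan_discriminator_loss sketches from Section 3.3; the data loader, the accuracy computed from the sign of the raw critic scores, and the non-saturating relativistic form of the generator term are our assumptions rather than the authors' exact procedure.

```python
import torch

# Assumes the sketches from Section 3.3 and a `loader` yielding (image, sdf_real) batches.
encoder, generator, discriminator = ImageEncoder(), SDFGenerator(), SDFDiscriminator()
opt_e = torch.optim.Adam(encoder.parameters(), lr=1e-3, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-5, betas=(0.5, 0.999))
gamma1, gamma2 = 1.0, 100.0

for img, sdf_real in loader:
    z_img, mu, log_var = encoder(img)
    z = torch.rand(img.size(0), 200) * 2 - 1          # random noise over [-1, 1]
    sdf_gen, sdf_z = generator(z_img), generator(z)
    fakes = torch.cat([sdf_gen, sdf_z])

    # Discriminator step, skipped once its batch accuracy reaches 0.8.
    c_real, c_fake = discriminator(sdf_real), discriminator(fakes.detach())
    acc = 0.5 * ((c_real > 0).float().mean() + (c_fake < 0).float().mean())
    if acc.item() < 0.8:
        opt_d.zero_grad()
        ragan_discriminator_loss(c_real, c_fake).backward()
        opt_d.step()

    # Joint encoder/generator step following Eq. (9); the adversarial term uses the
    # non-saturating relativistic form as a stable stand-in for Eq. (11).
    c_fake = discriminator(fakes)
    g_adv = -torch.log(torch.sigmoid(c_fake - discriminator(sdf_real).mean().detach()) + 1e-7).mean()
    l_kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    l_rec = torch.mean((sdf_gen - sdf_real) ** 2)
    loss = g_adv + gamma1 * l_kl + gamma2 * l_rec
    opt_e.zero_grad(); opt_g.zero_grad()
    loss.backward()
    opt_e.step(); opt_g.step()
```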

4.2. Performance Evaluation

A framework named 3D-VAE-SDFRaGAN, which can infer a 3D shape from its associated 2D image, is proposed in this paper. The results of the proposed model, which learns to build SDFs of chair, table, car, and sofa from their associated 2D images, are shown in Figure 4. The marching cubes algorithm is employed to extract triangular surfaces from the SDFs, which are then smoothed using Laplacian smoothing to generate the final 3D shapes, as presented in Figure 4c.
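The surface-extraction step referenced above can be illustrated briefly; the sketch assumes scikit-image's marching-cubes implementation (a Lewiner-style variant) and trimesh's Laplacian filter, which are stand-ins for the exact post-processing used in the experiments.

```python
import trimesh
from skimage import measure

def sdf_to_mesh(sdf_volume, smoothing_iterations=10):
    """Extract the zero level set of a 64^3 SDF grid and apply Laplacian smoothing."""
    verts, faces, _, _ = measure.marching_cubes(sdf_volume, level=0.0)
    mesh = trimesh.Trimesh(vertices=verts, faces=faces)
    trimesh.smoothing.filter_laplacian(mesh, iterations=smoothing_iterations)  # smooths in place
    return mesh
```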
Several state-of-the-art models are compared against the proposed 3D-VAE-SDFRaGAN, following the evaluation process outlined in [46]. The volumetric outputs of the 3D-R2N2 model are converted into a mesh representation using the marching cubes algorithm, and the quality of the generated model is compared with the proposed model. The chamfer distance (CD) [9,46] is used for evaluating the accuracy of the 3D-VAE-SDFRaGAN model, which measures the difference between the generated 3D mesh and the ground truth mesh for the six categories presented in Table 4. Also, the proposed model is compared with state-of-the-art models in Table 5.
Table 4 presents the results of this evaluation: the performance of the proposed 3D-VAE-SDFRaGAN was evaluated by computing the CD scores for the six categories. The chamfer distance (CD) is a standard evaluation metric in 3D deep learning, used to assess the similarity between two meshes by calculating the distance between the points of each mesh and the nearest surface point of the other mesh. The computation is performed by determining, for each point in each mesh, the closest point in the opposite mesh and then summing the squared distances between the two sets of points. A smaller CD score indicates a more accurately generated object. The CD is defined as
$$d_{CD}(S_{1}, S_{2}) = \sum_{x \in S_{1}} \min_{y \in S_{2}} \| x - y \|_{2}^{2} + \sum_{y \in S_{2}} \min_{x \in S_{1}} \| x - y \|_{2}^{2} \tag{13}$$
where S1 and S2 are the original mesh and the generated mesh, respectively, and x and y are vertices of the original mesh and the generated mesh, respectively.
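For reference, Equation (13) can be evaluated with a nearest-neighbour query over the two vertex (or sampled point) sets; the SciPy-based sketch below is a generic illustration, not the exact evaluation script used in the experiments.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_s1, points_s2):
    """Symmetric chamfer distance of Eq. (13) between two point sets (e.g., mesh vertices)."""
    d_12, _ = cKDTree(points_s2).query(points_s1)   # nearest point in S2 for each point of S1
    d_21, _ = cKDTree(points_s1).query(points_s2)   # nearest point in S1 for each point of S2
    return np.sum(d_12 ** 2) + np.sum(d_21 ** 2)
```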
The experimental results demonstrate that the proposed model surpasses the other state-of-the-art methods in all categories. This is attributed to the significant improvement in the training stability of the 3D-VAE-SDFRaGAN network, achieved by incorporating a relativistic discriminator and a non-standard GAN loss function without any additional computational cost. The results therefore exhibit superior quality in generating 3D shapes, and a comparative analysis with other models was performed.
The proposed model generates objects with finer details than 3D-R2N2 can achieve from different model views [30], which demonstrates its efficacy. Additionally, the 3D-R2N2 algorithm is based on a 3D autoencoder that inherits the limitations of autoencoder algorithms and is primarily used for reconstruction, unlike the proposed method, which is designed for the generation task. The proposed model can also generate objects with diverse topologies, a feature that the N3MR approach lacks [48]. Furthermore, the proposed model addresses a key issue with the SIF method by utilizing the SDF data representation to accurately capture the intricate details of an object's structure [47], and it produces well-detailed 3D shapes, a property sacrificed by the MeshSDF framework [51]. Finally, the proposed 3D-VAE-SDFRaGAN not only scales effectively but also produces highly detailed shape representations that were previously unattainable with the DISN model, whose shape representation is limited to a fixed-length feature vector that only captures the global shape and cannot represent fine details [52]. Figure 5, Figure 6 and Figure 7 present the CD performance on the chair and cabinet categories and the average chamfer distance of the proposed model compared with other state-of-the-art models.
Figure 8 shows a qualitative comparison, for visual evaluation, of the generated samples and the learned representations of the proposed model with other 3D GAN-based models: (a) samples of a car, chair, and table generated by J. Wu et al. [6]; (b) samples of a car, chair, and table generated by J. Zhu et al. [8]; (c) samples of a chair and table generated by Smith et al. [7]; (d) samples of a car, chair, and table generated by Jiang et al. [36]; (e) samples of a car, chair, and table generated by Kingkan et al. [37]; and (f) samples of a car, chair, and table generated by our model. It can be observed that the smooth meshed surfaces recovered from the signed distance function fields produced by the proposed 3D-VAE-SDFRaGAN framework and 2D image features yield noticeably better qualitative performance. The results of Kingkan et al. [37] are compared with the generated samples in Figure 8f, despite the complete 3D data used in their work. This shows that readily available, low-cost 2D images paired with their corresponding SDFs, as in the proposed method, are a better dataset option for training 3D mesh-based generator models with the relativistic average GAN loss function. Moreover, the combination of 2D images and the SDF dataset aids better mesh-based 3D shape generation with the proposed 3D-VAE-SDFRaGAN model. In addition, the proposed model, which is simple yet efficient, was evaluated by comparing its trainable parameters to those of the two-stage implementation of Jiang et al. [36]: their model has 176,260,554 trainable parameters, while the proposed model has only 41,645,290, making it computationally efficient. The proposed model produces appealing results while being trained end-to-end using only a low-cost 2D image input, whereas Jiang et al.'s model involves a computationally expensive two-stage process for generating SDFs.
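For completeness, trainable parameters in a PyTorch model are usually counted as shown below; whether the figures quoted above were obtained exactly this way is an assumption.

```python
def count_trainable_parameters(model):
    """Total number of trainable parameters of a torch.nn.Module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g., summed over the encoder, generator, and discriminator sketched in Section 3.3:
# total = sum(count_trainable_parameters(m) for m in (encoder, generator, discriminator))
```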

5. Limitations and Future Directions

One limitation of 3D-VAE-SDFRaGAN is that it requires a large amount of training data and computational resources, which can be a challenge for many researchers and practitioners. Another limitation is that the performance of the proposed model is sensitive to the choice of hyperparameters, such as the size of the latent space, the learning rate, and the regularization parameters. The model is also limited in its ability to handle complex shapes and geometries, particularly those that involve non-uniform scaling or non-rigid deformation.
One potential direction for future research is to explore ways to improve the accuracy and detail of the generated 3D models, perhaps by incorporating additional data sources or refining the network architecture. Another promising direction is to investigate ways to extend the model to handle more complex shapes and geometries, such as those found in medical imaging or robotics applications. It may also be useful to explore ways to incorporate semantic information or other forms of prior knowledge into the model to improve its ability to generate models that are relevant to specific domains or applications.

6. Conclusions

In this paper, a novel approach for generating 3D shapes using 2D images as input is proposed. The proposed method, 3D-VAE-SDFRaGAN, leverages the power of variational autoencoders (VAEs) to learn the latent space of the 2D images and maps it to the corresponding 3D shapes represented by signed distance functions (SDFs). The experimental results demonstrate that the latent space of 2D images has a significant impact on the performance of the model and determines the amount of information that can be transferred from the 2D image encoder network to the SDF-generator network. To further improve the performance of the model, a relativistic average GAN loss function is incorporated into the proposed 3D-VAE-SDFRaGAN framework. This improves the stability of the model during training and results in higher-quality shape generation. The proposed method successfully generates smooth surface attributes of the corresponding SDF using features extracted from the 2D image. The proposed 3D-VAE-SDFRaGAN model exhibits superior quantitative performance over state-of-the-art methods, achieving an average chamfer distance score of 0.578 across the selected categories of the ShapeNet dataset. Additionally, the proposed model exhibits superior performance compared to existing methods, particularly in terms of qualitative output, which is supported by a visual comparison of the generated samples.
As the underlying task involves generating 3D shapes with generative modeling algorithms, accuracy metrics such as the F1 score, which are typically employed in classification tasks, are not suitable for evaluation. Instead, the chamfer distance (CD), a standard evaluation metric in 3D deep learning and in mesh generation tasks in particular [46,49,50], was used to quantify the accuracy of the generated shapes.

Author Contributions

Conceptualization, E.A.A. and K.M.L.; methodology, E.A.A. and K.M.L.; software, E.A.A. and K.M.L.; validation, E.A.A. and K.M.L.; formal analysis, E.A.A.; investigation, E.A.A.; resources, E.A.A.; data creation, E.A.A. and K.M.L.; writing—original draft preparation, E.A.A.; writing—review and editing, K.M.L., S.-C.C. and C.P.L.; visualization, E.A.A. and K.M.L.; supervision, K.M.L., S.-C.C. and C.P.L.; project administration, K.M.L.; funding acquisition, K.M.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research in this work was supported by Telekom Malaysia Research and Development under grant number RDTC/221045 and Multimedia University Graduate Research Assistant Scheme (MMUI/190044) Malaysia.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv 2015, arXiv:1512.03012. [Google Scholar]
  2. Xiang, Y.; Kim, W.; Chen, W.; Ji, J.; Choy, C.; Su, H.; Mottaghi, R.; Guibas, L.; Savarese, S. Objectnet3D: A large scale database for 3D object recognition. In Proceedings of the Computer Vision—ECCV 2016 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 160–176. [Google Scholar] [CrossRef]
  3. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar] [CrossRef]
  4. Fu, H.; Jia, R.; Gao, L.; Gong, M.; Zhao, B.; Maybank, S.; Tao, D. 3D-FUTURE: 3D Furniture Shape with TextURE. Int. J. Comput. Vis. 2021, 129, 3313–3337. [Google Scholar] [CrossRef]
  5. Rezende, D.J.; Ali Eslami, S.M.; Mohamed, S.; Battaglia, P.; Jaderberg, M.; Heess, N. Unsupervised learning of 3D structure from images. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 5003–5011. [Google Scholar]
  6. Wu, J.; Zhang, C.; Xue, T.; Freeman, W.T.; Tenenbaum, J.B. Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling. CVGIP Graph. Model. Image Process. 2016, 53, 157–185. [Google Scholar] [CrossRef]
  7. Smith, E.; Meger, D. Improved Adversarial Systems for 3D Object Generation and Reconstruction. In Proceedings of the Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 87–96. [Google Scholar]
  8. Zhu, J.; Xie, J.; Fang, Y. Learning Adversarial 3D Model Generation with 2D Image Enhancer; AAAI: Palo Alto, CA, USA, 2018; pp. 7615–7622. [Google Scholar]
  9. Fan, H.; Guibas, L. DeepPointSet: A Point Set Generation Network for 3D Object Reconstruction from a Single Image Supplementary Material. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1–4. [Google Scholar]
  10. Yang, G.; Huang, X.; Hao, Z.; Liu, M.Y.; Belongie, S.; Hariharan, B. Pointflow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4540–4549. [Google Scholar] [CrossRef]
  11. Liu, H.; Yuan, H.; Hou, J.; Hamzaoui, R.; Gao, W. PUFA-GAN: A Frequency-Aware Generative Adversarial Network for 3D Point Cloud Upsampling. IEEE Trans. Image Process. 2022, 31, 7389–7402. [Google Scholar] [CrossRef] [PubMed]
  12. Tan, Q.; Gao, L.; Lai, Y.K.; Xia, S. Variational Autoencoders for Deforming 3D Mesh Models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5841–5850. [Google Scholar] [CrossRef]
  13. Feng, Y.; Feng, Y.; You, H.; Zhao, X.; Gao, Y. MeshNet: Mesh neural network for 3D shape representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; pp. 8279–8286. [Google Scholar] [CrossRef]
  14. Cheng, S.; Bronstein, M.; Zhou, Y.; Kotsia, I.; Pantic, M.; Zafeiriou, S. MeshGAN: Non-linear 3D Morphable Models of Faces. arXiv 2019, arXiv:1903.10384. [Google Scholar]
  15. Li, H.; Zheng, Y.; Wu, X.; Cai, Q. 3D Model Generation and Reconstruction Using Conditional Generative Adversarial Network. Int. J. Comput. Intell. Syst. 2019, 12, 697. [Google Scholar] [CrossRef]
  16. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Adv. Neural Inf. Process. Syst. 2014, 177, 326–333. [Google Scholar] [CrossRef]
  17. Roberts, L.G. Machine Perception of Three-Dimensional Solids. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1963. [Google Scholar]
  18. Marr, D.; Ullman, S. A Computational Investigation into the Human Representation and Processing of Visual Information; Technical Report; Henry Holt and Co. Inc.: New York, NY, USA, 1982. [Google Scholar]
  19. Sung, M.; Kim, V.G.; Angst, R.; Guibas, L. Data-Driven Structural Priors for Shape Completion. ACM Trans. Graph. 2015, 34, 1–11. [Google Scholar] [CrossRef]
  20. Sundar, H.; Silver, D.; Gagvani, N.; Dickinson, S. Skeleton based shape matching and retrieval. In Proceedings of the SMI 2003: Shape Modeling International 2003, Seoul, Republic of Korea, 12–16 May 2003; pp. 130–139. [Google Scholar] [CrossRef]
  21. Li, C.; Zia, M.Z.; Tran, Q.H.; Yu, X.; Hager, G.D.; Chandraker, M. Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5465–5474. [Google Scholar]
  22. Miao, Y.; Hu, F.; Zhang, X.; Chen, J.; Pajarola, R. SymmSketch: Creating symmetric 3D free-form shapes from 2D sketches. Comput. Vis. Media 2015, 1, 3–16. [Google Scholar] [CrossRef]
  23. Huang, H.; Kalogerakis, E.; Marlin, B. Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. Eurographics Symp. Geom. Process. 2015, 34, 25–38. [Google Scholar] [CrossRef]
  24. Jones, R.K.; Barton, T.; Xu, X.; Wang, K.; Jiang, E.; Guerrero, P.; Mitra, N.J.; Ritchie, D. ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis. ACM Trans. Graph. 2020, 39, 3417812. [Google Scholar] [CrossRef]
  25. Wang, K.; Guerrero, P.; Kim, V.; Chaudhuri, S.; Sung, M.; Ritchie, D. The Shape Part Slot Machine: Contact-Based Reasoning for Generating 3D Shapes from Parts. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Technical Report; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 610–626. [Google Scholar]
  26. Kar, A.; Tulsiani, S.; Carreira, J.; Malik, J. Category-specific object reconstruction from a single image. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1966–1974. [Google Scholar] [CrossRef]
  27. Girdhar, R.; Fouhey, D.F.; Rodriguez, M.; Gupta, A. Learning a predictable and generative vector representation for objects. In Proceedings of the Computer Vision—ECCV 2016 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 484–499. [Google Scholar] [CrossRef]
  28. Park, J.J.; Florence, P.; Straub, J.; Newcombe, R.; Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 165–174. [Google Scholar] [CrossRef]
  29. Dosovitskiy, A.; Springenberg, J.T.; Brox, T. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1538–1546. [Google Scholar]
  30. Choy, C.B.; Xu, D.; Gwak, J.Y.; Chen, K.; Savarese, S. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Proceedings of the Computer Vision—ECCV 2016 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 628–644. [Google Scholar] [CrossRef]
  31. Denton, E.; Chintala, S.; Szlam, A.; Fergus, R. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–10. [Google Scholar]
  32. Gadelha, M.; Maji, S.; Wang, R. 3D shape induction from 2D views of multiple objects. In Proceedings of the 2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, 10–12 October 2017; pp. 402–411. [Google Scholar] [CrossRef]
  33. Li, R.; Li, X.; Hui, K.H.; Fu, C.W. SP-GAN: Sphere-guided 3D shape generation and manipulation. ACM Trans. Graph. 2021, 40, 3459766. [Google Scholar] [CrossRef]
  34. Gao, J.; Shen, T.; Wang, Z.; Chen, W.; Yin, K.; Li, D.; Litany, O.; Gojcic, Z.; Fidler, S. GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images. Adv. Neural Inf. Process. Syst. 2022, 35, 31841–31854. [Google Scholar]
  35. Shen, B.; Yan, X.; Qi, C.R.; Najibi, M.; Deng, B.; Guibas, L.; Zhou, Y.; Anguelov, D. GINA-3D: Learning to Generate Implicit Neural Assets in the Wild. arXiv 2023, arXiv:2304.02163. [Google Scholar]
  36. Jiang, C.M.; Marcus, P. Hierarchical Detail Enhancing Mesh-Based Shape Generation with 3D Generative Adversarial Network. In Proceedings of the 2006 IEEE International Conference on Acoustics Speech and Signal Processing, Toulouse, France, 14–19 May 2006. [Google Scholar] [CrossRef]
  37. Kingkan, C.; Hashimoto, C. Generating Mesh-based Shapes from Learned Latent Spaces of Point Clouds with VAE-GAN. In Proceedings of the International Conference on Pattern Recognition, Beijing, China, 20–24 August 2018; pp. 308–313. [Google Scholar] [CrossRef]
  38. Zheng, X.Y.; Liu, Y.; Wang, P.S.; Tong, X. SDF-StyleGAN: Implicit SDF-Based StyleGAN for 3D Shape Generation. Comput. Graph. Forum 2022, 41, 52–63. [Google Scholar] [CrossRef]
  39. Wang, W.; Huang, Q.; You, S.; Yang, C.; Neumann, U. Shape Inpainting Using 3D Generative Adversarial Network and Recurrent Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2298–2306. [Google Scholar]
  40. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498. [Google Scholar]
  41. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv 2016, arXiv:1606.03657. [Google Scholar]
  42. Doersch, C. Tutorial on Variational Autoencoders. arXiv 2016, arXiv:1606.05908. [Google Scholar]
  43. Osher, S.; Fedkiw, R. Signed Distance Functions. In Level Set Methods and Dynamic Implicit Surfaces. Applied Mathematical Sciences; Springer: New York, NY, USA, 2003; Volume 153, pp. 17–22. [Google Scholar] [CrossRef]
  44. Larsen, A.B.L.; Sønderby, S.K.; Larochelle, H.; Winther, O. Autoencoding Beyond Pixels Using a Learned Similarity Metric. Int. Conf. Mach. Learn. 2015, 29, 1558–1566. [Google Scholar]
  45. Lewiner, T.; Lopes, H.; Vieira, A.W.; Tavares, G. Efficient Implementation of Marching Cubes’ Cases with Topological Guarantees. J. Graph. Tools 2003, 8, 1–15. [Google Scholar] [CrossRef]
  46. Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images. In Proceedings of the Computer Vision—ECCV 2018 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 55–71. [Google Scholar] [CrossRef]
  47. Genova, K.; Cole, F.; Vlasic, D.; Sarna, A.; Freeman, W.; Funkhouser, T. Learning shape templates with structured implicit functions. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7153–7163. [Google Scholar] [CrossRef]
  48. Kato, H.; Ushiku, Y.; Harada, T. Neural 3D Mesh Renderer. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  49. Remelli, E.; Lukoianov, A.; Richter, S.R.; Guillard, B.; Bagautdinov, T.; Baque, P.; Fua, P. MeshSDF: Differentiable Iso-Surface Extraction. Adv. Neural Inf. Process. Syst. 2020, 33, 22468–22478. [Google Scholar]
  50. Xu, Q.; Wang, W.; Ceylan, D.; Mech, R.; Neumann, U. DISN: Deep Implicit Surface Network for High-Quality Single-View 3D Reconstruction. arXiv 2019, arXiv:1905.10711. [Google Scholar]
  51. Trevithick, A.; Yang, B. GRF: Learning a General Radiance Field for 3D Representation and Rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seattle, WA, USA, 13–19 June 2020; pp. 15182–15192. [Google Scholar]
  52. Genova, K.; Cole, F.; Sud, A.; Sarna, A.; Funkhouser, T. Local Deep Implicit Functions for 3D Shape. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4856–4865. [Google Scholar] [CrossRef]
Figure 1. Examples of 2D images and the transformation of 3D meshes into signed distance functions (SDFs).
Figure 2. The overall architecture of the proposed 3D-VAE-SDFRaGAN network. (a) the encoder network, which converts a 2D image to latent spaces; (b) the generator network, which produces an SDF from a 2D image latent vector; and (c) the relativistic discriminator network, which evaluates the likelihood of an actual SDF being more authentic than a randomly selected generated SDF from the generator.
Figure 3. Training process flow for our proposed framework.
Figure 4. Examples of 2D images and SDFs of objects, and the generated meshes. The 2D images and SDFs are shown in (a,b), respectively. The generated mesh-based shapes are shown in (c).
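The mesh-based shapes in (c) are obtained by running the marching cubes algorithm on the generated SDF volume. The following sketch illustrates this final extraction step using scikit-image's marching cubes implementation; it is an illustration under the stated assumptions, not the exact extraction code used in the paper, and the helper name sdf_to_mesh is hypothetical.

```python
# A minimal sketch: extract a polygon mesh from a generated SDF volume
# (a NumPy array, e.g. 64x64x64) by triangulating its zero level set.
import numpy as np
from skimage import measure
import trimesh

def sdf_to_mesh(sdf, level=0.0):
    # marching_cubes returns vertices in voxel coordinates plus triangle faces.
    verts, faces, normals, _ = measure.marching_cubes(sdf, level=level)
    return trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)

# Example usage: mesh = sdf_to_mesh(generated_sdf); mesh.export('shape.obj')
```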
Figure 5. Chamfer distance of the proposed model and several state-of-the-art methods on the chair category.
Figure 6. Chamfer distance of the proposed model and several state-of-the-art methods on the cabinet category.
Figure 7. Average chamfer distance of the proposed model and several state-of-the-art methods.
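For reference, the chamfer distance reported in Figures 5–7 and Table 4 is computed between point sets sampled from the generated and ground-truth shapes. The sketch below shows one common symmetric formulation (mean squared nearest-neighbour distance in both directions); the exact sampling density, scaling, and averaging used in the paper may differ.

```python
# A minimal sketch of a symmetric chamfer distance between two point clouds.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    """Sum of mean squared nearest-neighbour distances in both directions."""
    d_ab, _ = cKDTree(points_b).query(points_a)  # each point in A to its nearest in B
    d_ba, _ = cKDTree(points_a).query(points_b)  # each point in B to its nearest in A
    return np.mean(d_ab ** 2) + np.mean(d_ba ** 2)
```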
Figure 8. The qualitative comparison of the generated samples and the learned representations of the proposed model with other 3D GAN-based models for visual evaluation. (a) Samples generated by J. Wu et al. [6]. (b) Samples generated by J. Zhu et al. [8]. (c) Samples generated by Smith et al. [7]. (d) Samples generated by Jiang et al. [36]. (e) Samples generated by Kingkan et al. [37]. (f) Samples generated by our model.
Table 1. Comparative analysis between the proposed model and related works.
| Criteria | Approach | Architecture | Training Data | Input Data Type | Output Data Type | Performance | Computational Complexity | Limitation | Future Work |
|---|---|---|---|---|---|---|---|---|---|
| SymmSketch [22] | Algorithmic | Symmetry-aware algorithm | 2D sketches | 2D sketches | Symmetric 3D shapes | Fast generation of symmetric 3D shapes with limited input | Low | Not suitable for non-symmetric shapes | Improving the ability to generate 3D shapes from multiple viewpoints |
| AS3DS-DLGM [23] | Deep learning | Generative model of surfaces | 3D shapes within a family | 3D meshes | 3D meshes | Accurate generation of diverse 3D shapes within a family | High | Requires a large amount of training data | Exploring ways to generate higher-quality shapes with more user control |
| ShapeAss [24] | Deep learning | CNNs, LSTM | Database of 3D parts and their assembly rules | Parts and assembly rules | 3D meshes | High structural integrity of complex 3D shapes | High | Requires a large library of parts and complex reasoning algorithms | Investigating methods for reducing the computational complexity |
| 3D ShapeNet [3] | Deep-learning-based volumetric shape representation | CNN | Volumetric 3D shapes | Voxelized 3D shapes | Voxelized 3D shapes | Can generate high-quality 3D shapes with complex structures | High | Can be memory-intensive and computationally expensive | Improving the ability to generate 3D shapes from multiple viewpoints |
| SP-GAN [33] | Sphere-guided 3D shape generation | GAN with a sphere-guided generator | Volumetric 3D shapes | 2D image + sphere representation | Voxelized 3D shapes | Can generate high-quality 3D shapes with better user control | High | Generated shapes may not be visually consistent with the input data | Investigating methods for reducing the computational complexity |
| PrGAN [32] | 3D shape generation from 2D views and manipulation | Multi-view CNN | 2D images of multiple objects from various viewpoints | 2D images | Voxelized 3D shapes | Can generate 3D shapes from limited 2D views with moderate quality | Moderate | The generated shapes may be limited by the quality and diversity of the input images | Exploring ways to generate higher-quality shapes with more user control |
| GET3D [34] | Generative model of 3D textured shapes learned from images | GAN with a novel 3D convolutional architecture | 3D textured shapes from various viewpoints | RGB images | 3D textured meshes | Can generate high-quality 3D textured meshes | High | Computationally intensive and requires significant resources, including high-end GPUs and large amounts of memory | Investigating ways to improve the texture quality and resolution of generated meshes |
| 3D-VAE-GAN [6] | Generative adversarial modeling of 3D shapes | GAN with a 3D generator and discriminator | 2D images and corresponding 3D shapes | Voxelized 3D shapes | Voxelized 3D shapes | Can generate high-quality 3D shapes with a probabilistic latent space | High | The generated shapes may be limited by the quality and diversity of the input images | Exploring ways to better control the probabilistic latent space for generating 3D shapes |
| MSVAE-GAN [37] | Generative model | VAE-GAN | Point cloud + SDF | Point cloud | Meshes | Can generate high-quality 3D shapes with complex structures | Moderate | Requires a large amount of training data to accurately capture the underlying shape distribution | Incorporation of semantic labels |
| HDEM-3DGAN [36] | Generative model | 3D-GAN | Latent vector + SDF | SDF | Meshes | Can generate 3D shapes from limited 2D views with moderate quality | High | Computationally expensive; it may be challenging to incorporate user-defined constraints or preferences into the generated shapes | Improved hierarchical representation |
| Ours | Generative model | VAE-SDFRaGAN | 2D images + SDF | 2D image | Meshes | Can generate high-quality 3D shapes with better user control | Moderate | Requires a large amount of training data and computational resources | Incorporation of texture and colour |
Table 2. Comparative analysis between the proposed model and related works on improving GAN algorithms.
| Criteria | 3D-VAE-IWGAN [7] | InfoGAN [41] | Ours |
|---|---|---|---|
| Approach | Generative model | Generative model | Generative model |
| Architecture | VAE-IWGAN-based | GAN-based | VAE-SDFRaGAN-based |
| Training Data | 2D images and corresponding 3D voxels | Latent vector and 3D face dataset | 2D images and corresponding 3D shapes |
| Input Data Type | 2D images | 3D faces, 2D images | 2D images |
| Output Data Type | 3D voxels | Meshes | Meshes |
| Performance | High-quality 3D shape generation, smooth interpolation between shapes, controllable output via latent space manipulation | Disentangles latent factors for interpretable representations, can learn without explicit supervision | Fast training and generation, good visual quality, can handle large-scale 3D shapes |
| Adversarial Loss | Wasserstein GAN (WGAN) | GAN | Relativistic Average GAN (RaGAN) |
| Training Approach | Iterative optimization | Adversarial and mutual information | Iterative optimization |
| Benefits | High-quality 3D object generation, improved stability | Interpretable latent code, improved stability | Improved 3D shape generation, improved stability |
| Latent Code Interpretability | Limited (WGAN) | High | High |
| Limitation | May suffer from mode collapse and instability during training | May require a larger number of training samples to achieve optimal performance | May require more complex pre-processing steps to convert 3D shapes into 2D images and signed distance functions |
| Potential Applications | Computer graphics, gaming, medical imaging, and virtual reality | Image synthesis, feature extraction | Computer graphics, gaming, medical imaging, and virtual reality |
Table 3. Information about the 2D images used in our experiment.
| Categories | Size (Pixels) |
|---|---|
| Chair, Table, Car, Lamp, Sofa, Cabinet | 64 × 64 |
Table 4. Performance comparison of the proposed method with the state-of-the-art methods on the ShapeNet dataset (chamfer distance; lower is better).
| Category | 3D-R2N2 [30] | SIF [47] | N3MR [48] | MeshSDF [49] | DISN [50] | Ours |
|---|---|---|---|---|---|---|
| Chair | 1.432 | 1.540 | 2.084 | 0.590 | 0.754 | 0.589 |
| Table | 1.116 | 1.570 | 2.383 | 1.070 | 1.329 | 0.672 |
| Car | 0.845 | 1.080 | 2.298 | 0.960 | 0.492 | 0.491 |
| Lamp | 4.009 | 3.420 | 3.013 | 1.490 | 2.273 | 0.662 |
| Sofa | 1.135 | 0.800 | 3.512 | 0.780 | 0.871 | 0.566 |
| Cabinet | 0.750 | 1.100 | 2.555 | 0.780 | 1.130 | 0.314 |
| Average | 1.545 | 1.585 | 2.641 | 0.945 | 1.142 | 0.578 |
Table 5. Comparative analysis between the proposed model and state-of-the-art models.
| Criteria | 3D-R2N2 [30] | SIF [47] | N3MR [48] | MeshSDF [49] | DISN [50] | Ours |
|---|---|---|---|---|---|---|
| Approach | Multi-view reconstruction using 3D convolutional neural networks | Reconstruction using implicit functions and shape templates | Rendering 3D shapes from learned feature maps | Reconstruction using implicit functions and iso-surface extraction | Reconstruction using implicit functions and single-view input | Learning a generative model for 3D shapes via 2D images |
| Architecture | 2D-CNN + 3D-LSTM + 3D-DCNN | Multi-layer perceptron | Neural network with geometric encoding | Multi-layer perceptron | Convolutional neural network | VAE-SDFRaGAN |
| Dataset | ShapeNet and PASCAL 3D | ShapeNet | ShapeNet | ShapeNet | ShapeNet | ShapeNet |
| Training Data | Set of 2D views with corresponding 3D shapes | 3D shapes | 3D mesh models | 3D meshes represented as SDFs | 3D shapes | 2D images and corresponding 3D shapes |
| Input Data Type | 2D images of an object from different viewpoints | Stack of depth images around the mesh | 3D mesh | SDF | 2D images | 2D images |
| Output Data Type | Volumetric 3D shape | Learned shape templates | 3D mesh | 3D meshes | 3D meshes | 3D meshes |
| Performance | Utilizes a 3D recurrent neural network to generate a 3D voxel grid from a 2D image or a sequence of images | Learns a 3D shape template from a set of 3D shapes using implicit functions | Uses a neural network to estimate the visibility of each vertex and generates the 2D image using a differentiable renderer | Uses a differentiable iso-surface extraction algorithm to generate the SDF and trains a neural network to approximate the SDF | Uses a deep neural network to estimate the implicit function and generates the 3D mesh using marching cubes | Can generate high-quality 3D meshes with complex structure |
| Computational Complexity | High | Low | High | Low | Low | High |
| Limitation | Limited to objects that can be approximated by a set of primitive shapes | Limited to learning shapes that can be described by implicit functions | Limited to rendering objects that have a corresponding mesh representation | Limited to shapes that can be represented by a signed distance function | Limited to learning shapes that can be represented by implicit functions | Requires a large amount of training data and computational resources |
| Potential Future Work | Improving the accuracy and resolution of the reconstructed 3D models | Incorporating texture information into the model | Improving the efficiency and speed of the rendering process | Improving the model’s robustness to handle noise and outliers in the input data | Investigating the use of more complex neural network architectures to improve the model’s accuracy and efficiency | Incorporating texture and semantic labels |