1. Introduction
The generation and understanding of three-dimensional (3D) geometries hold significant importance across diverse domains, from computer graphics to robotics and virtual reality. Recent advancements in deep learning (DL) and generative modeling have propelled research in the area of 3D shape generation. Notably, techniques such as Variational Autoencoders (VAEs) [1,2,3], 3D Generative Adversarial Networks (3D-GANs) [4,5,6], and 3D stable diffusion [7,8,9,10] have shown promise in autonomously producing realistic and diverse 3D shapes. Shape grammars have also been demonstrated as a powerful approach for formal model generation, providing a rule-based framework for generating complex geometric structures and enforcing constraints within objects [11,12,13]. Each methodology has its advantages and limitations. This article seeks to fuse these two methodologies to achieve the best of both worlds for novel 3D shape synthesis.
Despite the progress achieved, challenges and limitations persist in DL-based 3D shape generation. Figure 1 showcases several failure instances of these methodologies. VAEs, as evidenced in Figure 1a, often struggle to capture the complex, high-dimensional distributions of 3D shapes. GANs often suffer from mode collapse during training, where the generator produces a limited diversity of shapes or collapses to a few modes, failing to capture the full distribution of the data, as shown in Figure 1b. Mode collapse can make GANs difficult to train and lead to the generation of unrealistic or repetitive shapes, limiting the variety and quality of the generated outputs. Figure 1c shows a car model generated by 3D stable diffusion techniques, which may struggle to preserve fine-grained details and local geometric features, leading to information loss in complex shapes or overly smoothed shapes.
Shape grammars have been used commonly in computer graphics and design to describe the generation of complex shapes through a set of production rules [14,15]. These rules define how basic shapes or components can be combined and manipulated to form more intricate structures. Shape grammars provide a systematic approach to generating shapes by specifying the relationships between various components and enforcing constraints to ensure the coherence and consistency of the generated designs.
Shape grammar rules can be implemented using a modeling language as the formal syntax and vocabulary [16,17,18,19]. Users can define programs that describe objects as a semantic hierarchy of 3D shape elements, where each element may be a semantic group of objects, e.g., a floor of a building, or an indivisible object, e.g., a brick within a building. Each indivisible object is modeled in terms of its geometry and appearance.
Shape grammars offer a means to encode the implicit generation rules and geometric constraints inherent in objects, which remain challenging for deep learning models to grasp. Common objects typically follow predictable generation rules that are satisfied by all instances of these objects. For example, tables usually feature a top surface and multiple supporting legs connected to the top, and cars typically have four wheels on two sides that can roll. Deep learning networks, however, find it challenging to comprehend these constraints, hindering their ability to accurately generate such objects. Shape grammars provide a solution by implementing these rules to construct objects and allowing users to define the parameters of these rules.
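To make this idea concrete, the following Python sketch (illustrative only; it is not PSML syntax, and all names are hypothetical) shows how a rule-based constructor bakes constraints such as leg self-similarity, self-symmetry, and free-standing stability directly into the generation procedure, leaving only a few free parameters:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned 3D primitive: center (x, y, z) and size (sx, sy, sz)."""
    center: tuple
    size: tuple

def build_table(length, width, height, top_thickness=0.05, leg_size=0.06):
    """Rule-based table constructor (hypothetical sketch, not PSML syntax).

    The rules guarantee the constraints regardless of parameter values:
    all four legs are identical (self-similarity), placed at mirrored
    corner offsets (self-symmetry), reach the ground (free-standing),
    and touch the top (connectivity). Only five scalars remain free.
    """
    parts = []
    # Rule 1: the top is a slab centered over the footprint.
    parts.append(Box(center=(0, 0, height - top_thickness / 2),
                     size=(length, width, top_thickness)))
    # Rule 2: four identical legs at mirrored corner offsets.
    dx, dy = (length - leg_size) / 2, (width - leg_size) / 2
    leg_height = height - top_thickness
    for sx in (-1, 1):
        for sy in (-1, 1):
            parts.append(Box(center=(sx * dx, sy * dy, leg_height / 2),
                             size=(leg_size, leg_size, leg_height)))
    return parts

table = build_table(length=2.0, width=1.0, height=0.75)  # 5 primitives total
```

In this setting, a DL model need only estimate the five scalar arguments; every shape it can produce is a valid table by construction.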
Bridging deep learning techniques with shape grammars presents significant potential. Users can define shape programs and convey shape rules to artificial intelligence (AI) systems. By employing DL networks to learn the parameters of these rules rather than the rules themselves, AI models gain the ability to internalize such constraints. This capability leads to disentangled representations within the latent space, where each dimension corresponds to a meaningful attribute of the data. This promises enhanced controllability and interpretability of the generated shapes.
This article proposes a novel fusion of 3D shape representation using shape grammars and DL model estimation. Shapes are represented as a formal shape grammar using the Procedural Shape Modeling Language (PSML) [19], which applies a sequence of rules to construct a 3D geometric model as a collection of 3D primitives. In contrast to competing approaches from the DL literature, the inclusion of dynamic parameterized formal shape models promises to allow DL applications to more accurately represent the structure of commonplace objects.
The proposed method merges the powerful representation of PSML for many important shape-understanding contexts with the unprecedented capabilities for non-linear estimation provided by recent advances in AI through deep learning neural models and their training methodologies. Prior efforts have seen much success in deterministically applying shape grammars in procedural modeling contexts [11,20] to generate impressive geometric models with direct human input and control. However, the inverse problem, i.e., extracting reliable fits of shape grammar parameters from measured depth, RGB, LiDAR, or other data, has eluded researchers [21,22,23]. The promise of using the power of AI estimation to solve this inverse problem has not been investigated. This article explores the extent to which current DL models are capable of estimating shape grammar parameters from measured data. Our results demonstrate that deep learning models effectively address this new problem, achieving reliable and accurate parameter estimation.
In this article, we demonstrate several benefits of our approach that fuses shape models with DL estimation which are listed below:
Shape estimates are guaranteed to satisfy complex geometric shape and physical constraints, including self-symmetry, self-similarity, and free-standing stability properties.
Shape estimates are guaranteed to satisfy important geometric model properties by providing water-tight, i.e., manifold, polygon models that require a small number of triangle primitives to describe the basic object geometry.
Shape estimates provide a highly compact parametric representation of objects, allowing objects to be efficiently shared over communication links.
User-provided shape programs allow human-in-the-loop control over DL estimates. Aspects of this control include specifying lists of candidate objects, the shape variations that each object can exhibit, and the level of detail or, equivalently, dimension of the latent representation of the shape. These aspects of our approach allow humans to more easily control the DL estimate outputs and also enable humans to more easily interpret DL estimate results, which we collectively refer to as “human-in-the-loop” benefits.
Users can control the complexity and diversity of DL-estimated shapes for each object and for each object component directly through the construction of the DL network.
Object models can be used to synthesize training data for DL systems, improving over current 3D model databases which use static 3D models and therefore lack geometric diversity. Object models can be combined to generate extremely large synthetic 2D/3D datasets having rich geometric diversity and including important annotations to support a wide variety of 3D and 2D DL applications.
An example of the proposed DL fusion is provided that detects objects and estimates their parametric representation given a PSML shape grammar. Key metrics for the model estimates are shown that demonstrate the benefits of this approach.
These contributions open the door to the integration of shape-grammar-based data generation methods with deep learning techniques for 3D object/scene understanding. User-defined shape programs offer various benefits and advantages over competing approaches which are demonstrated by the results of this study.
4. Results
This section presents the results of three experiments. The results demonstrate the benefits offered by the PSML shape generation method and its fusion with deep learning techniques.
4.1. Comparison with Other Generative Methods
This experiment was conducted to demonstrate the advantages of 3D models generated using the PSML programs over the models generated by other competing methods. Specific cases of shapes generated by different methods were analyzed. From a wide array of possible algorithms, three algorithms representing VAEs, GANs, and stable diffusion, respectively, were evaluated against the PSML approach: (1) 3D Shape Variational Autoencoder (3DSVAE) [1], (2) 3D Generative Adversarial Network (3DGANs) [4], and (3) Score Jacobian Chaining (SJC) [9]. While many algorithms are available in the literature, the selected algorithms provide a representative sampling of generative methods for 3D shapes.
Figure 10 illustrates the comparison between 3D shapes generated using PSML and those generated using other approaches, including VAE, GANs, and stable diffusion. The examples for comparison were sourced from their respective papers. The couch model generated by 3DSVAE (Figure 10a) lacks the structural characteristics of a couch object, offering only a rough approximation of its complex geometry. In contrast, the couch model generated using the PSML approach (Figure 10d) contains sufficient geometric features to represent a couch object. In the case of the table model generated by 3DGANs (Figure 10b), the second leg from the left lacks manifold geometry, resulting in a discontinuous surface and an unrealistic gap. Additionally, the legs lack the self-similarity and self-symmetry in size and length that are typically present in real-world manufactured tables. Conversely, the table model generated using PSML (Figure 10e) is a manifold polygon model and satisfies the geometric and physical constraints, attributed to its rule-based volumetric generation method. The car model generated by SJC (Figure 10c) lacks fine-grained details and fails to adhere to physical constraints, as one of the front wheels occupies the spatial location intended for the car's front. Its PSML-generated counterpart (Figure 10f), however, presents a high-quality and physically realistic model.
Although the objects in Figure 10d–f may be rigid and visually simplistic, they satisfy important common constraints for these commonplace objects. These constraints, for example, closed-shape geometry and free-standing capability, must be exhibited for objects to be classified into the correct category. The objects in Figure 10a–c, while visually sophisticated, would fail most realistic tests for symmetry and usability; for example, the couch cannot be sat on, the table does not stand, and the car wheels do not roll.
The PSML approach ensures the generation of 3D models that adhere to essential geometric constraints. Expanding upon this technology to integrate more realistic details holds the potential to produce visually captivating and functionally reliable 3D geometries. This advancement could bridge the gap between visually appealing designs and practical usability, offering a holistic solution for various applications.
4.2. Comparison with Other Data Representations
This experiment was conducted to demonstrate the efficiency of the compact parametric shape representation offered by the PSML approach.
Figure 11 shows a table generated using Algorithm 1, together with its polygonal mesh and point cloud representations. The shape grammar representation requires only six parameters (l, w, h, t, and two additional geometric parameters) to represent the geometry and three more parameters to describe the color appearance. In contrast, the polygonal mesh contains 674 vertices and 1328 triangular faces, and the point cloud sampled from the mesh representation contains 5000 3D points. Assuming the data are represented using single-precision floating-point values (4 bytes), the total memory usage is 24,024 bytes for the polygonal mesh, 60,000 bytes for the point cloud representation, and only 36 bytes for the PSML parametric representation. The PSML representation must be accompanied by the associated program. Assuming each character in Algorithm 1 is represented using 2 bytes, the program occupies 1662 bytes, bringing the total memory required for a PSML table object to 1698 bytes. Compared to the other two representations, this parametric representation reduces the data required to describe the geometry by ∼14 times relative to the polygonal mesh and ∼35 times relative to the point cloud.
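The arithmetic behind these figures can be reproduced directly. The following worked check uses only the quantities stated above; the 4-byte face indices and the implied 831-character program length are the stated assumptions:

```python
# Worked check of the memory figures quoted above (all sizes in bytes).
FLOAT = 4          # single-precision float
INT = 4            # 4-byte face index (assumed, consistent with the totals)
CHAR = 2           # 2 bytes per source-code character, as stated

mesh = 674 * 3 * FLOAT + 1328 * 3 * INT   # vertex (x,y,z) + triangle indices
cloud = 5000 * 3 * FLOAT                  # 5000 points, 3 coordinates each
psml_params = (6 + 3) * FLOAT             # 6 geometry + 3 appearance parameters
psml_total = psml_params + 831 * CHAR     # parameters + 831-character program

print(mesh, cloud, psml_params, psml_total)         # 24024 60000 36 1698
print(round(mesh / psml_total), round(cloud / psml_total))  # 14 35
```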
Table 1 illustrates the memory usage of three representations for various object instances. The memory usage for the point cloud representation is determined by sampling 2000 points per area unit of the mesh, and the PSML usage includes the memory required for the source code. The results indicate that PSML substantially reduces memory usage for most object instances, except for the chair, where polygon meshes achieved minimal usage. As the complexity of objects increases, such as in a room scene model containing various furniture pieces, the efficiency of PSML parametric representation becomes increasingly significant.
The results presented herein underscore the data efficiency offered by parametric representation in contrast to alternative methods. Through parameterization, the PSML approach retains considerable potential for achieving high data efficiency. This efficiency not only conserves memory but also streamlines the transmission and processing of object information, rendering it particularly advantageous for applications constrained by limited resources or bandwidth.
The compact parametric representation also sets the foundation for the benefits of utilizing deep learning techniques for parameter estimation, particularly in reducing the complexity of the solution space. By leveraging the compact representation, the fusion of PSML and deep learning methods promises the effective estimation of 3D shapes with greater efficiency.
4.3. PSML-DL Fusion for Shape Detection Task
In this experiment, a synthetic dataset is generated, and the 3DETR-P network is trained on this dataset to detect 3D objects in the scene and estimate the associated PSML parameters.
4.3.1. Synthetic Dataset Generation
An experiment is conducted to demonstrate that object models can be used to synthesize training data for DL systems, improving over current 3D model databases, which use static 3D models and therefore lack geometric diversity.
A PSML program of the indoor room scene is written that involves six other PSML programs of common indoor furniture: table, chair, couch, bookshelf, window, and door. The proposed human-in-the-loop approach fixes various attributes of this shape-generation process and allows other aspects to vary. Fixed aspects include the size of the room and some relative and physical constraints between objects, including that (1) all of the objects are on the ground, (2) bookshelves are always against the wall, and (3) solid objects do not overlap with each other. The variations include the occurrence, location, orientation, and structural characteristics of the furniture. This is achieved by controlling the PSML parameters for each object type. These parameters are set to follow uniform distributions and to ensure realism while adhering to relative constraints within and among objects. For example, the length and width of the table object are uniformly generated from 1.5 to 2.5 units. Similarly, the height of the chair seats ranges from 0.5 to 0.8 units, reflecting real-world proportions, where chairs typically sit lower than adjacent tables.
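The sampling described above can be sketched as follows. The ranges for the table size and chair seat height are taken from the text; the table height range, the clearance margin, and all function names are illustrative assumptions:

```python
import random

def sample_table():
    """Uniform sampling of table PSML parameters (ranges partly assumed)."""
    return {"length": random.uniform(1.5, 2.5),   # from the text
            "width": random.uniform(1.5, 2.5),    # from the text
            "height": random.uniform(0.7, 0.8)}   # assumed range

def sample_chair(table_height):
    """Seat height in [0.5, 0.8], kept below the adjacent table's height."""
    seat = random.uniform(0.5, min(0.8, table_height - 0.05))  # margin assumed
    return {"seat_height": seat}

table = sample_table()
chair = sample_chair(table["height"])  # relative constraint between objects
```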
An RGB-D image dataset of the room scene is generated using the method in Section 3.4. The dataset is then utilized in a deep learning task for detecting 3D objects within the room, where the objective is to predict the 3D bounding boxes for each object based on the input point cloud. The point cloud data are derived from the depth data using the cameras' intrinsic parameters. Ground truth data are generated for each sample, comprising a semantic label, a 3D bounding box (location, orientation, and size), and five PSML parameters. Specifically, for the bookshelf object, these PSML parameters are the length, width, and height (defining the object's 3D dimensions) and the numbers of horizontal and vertical panels (describing its structural characteristics). While the bookshelf requires all five parameters for PSML generation, not all objects require the same number of parameters. For instance, the door object in Algorithm 2 only requires three parameters as program arguments; in such cases, the two extra parameters are set to 0. In the generated dataset, non-zero PSML parameters always precede zero parameters within the parameter vector. Among the six object classes (table, chair, couch, bookshelf, window, and door), the respective counts of non-zero PSML parameters are 4, 3, 5, 5, 4, and 3. The DL estimation of PSML parameters can be adjusted by increasing or decreasing the number of parameters to predict; for example, limiting DL models to estimate only three parameters would impose less constraint within the DL solution space. A sketch of this zero-padded parameter encoding is shown below.
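The following minimal Python sketch illustrates the encoding convention just described; the function name and example values are hypothetical, while the per-class counts come from the text:

```python
import numpy as np

# Fixed 5-slot PSML parameter vectors: non-zero parameters first, the
# remaining slots set to 0. Counts per class are from the text.
NUM_PSML_PARAMS = 5
PARAM_COUNTS = {"table": 4, "chair": 3, "couch": 5,
                "bookshelf": 5, "window": 4, "door": 3}

def encode_params(class_name, values):
    """Pad a per-class parameter list to the fixed 5-slot vector."""
    assert len(values) == PARAM_COUNTS[class_name]
    vec = np.zeros(NUM_PSML_PARAMS, dtype=np.float32)
    vec[:len(values)] = values
    return vec

# A door needs only 3 arguments; slots 4 and 5 stay zero.
print(encode_params("door", [1.0, 0.1, 2.1]))  # [1.  0.1 2.1 0.  0. ]
```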
Figure 12 shows the RGB-D image pairs and associated point clouds of three samples from the 2000-sample dataset. It can be seen that the occurrence of the objects, their shapes (length, width, height, etc.), positions, and orientations differ across the room space while still obeying the physical constraints. Different data types serve distinct purposes across various tasks, depending on the inputs involved. For instance, point cloud data can serve as input for deep learning models engaged in tasks such as 3D object segmentation and scene completion. RGB and depth image data can be utilized either independently or collectively as inputs for deep learning models focusing on 2D image tasks, such as 2D object segmentation and scene reconstruction from images. The generated synthetic dataset demonstrates the applicability of PSML to deep learning systems.
Throughout our experiments, the point cloud data are exclusively used as input, aligning with the architectural design of the 3DETR-P network. The point clouds are reconstructed from the depth image using the intrinsics of the simulated cameras. Both RGB and depth images are rendered at a resolution of 640 × 480. The total processing time, including rendering and file writing, for each sample ranges from 1 to 2 s on an NVIDIA GeForce RTX 4090 GPU.
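For reference, the standard pinhole back-projection used to recover a point cloud from a depth image looks as follows. This is a minimal sketch; the intrinsic values shown are placeholders, not the simulated cameras' actual parameters:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H x W, meters) into camera-frame points
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Placeholder intrinsics for a 640 x 480 rendering; illustrative only.
pc = depth_to_point_cloud(np.random.rand(480, 640).astype(np.float32),
                          fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```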
4.3.2. Three-Dimensional Object Detection and Shape Estimation
An experiment is conducted to demonstrate the capability of the proposed DL fusion in computer vision tasks, specifically detecting objects and estimating their parametric representation. Key metrics for the model estimates are presented, highlighting the benefits of this approach.
The original 3DETR implementation [40] is adapted and modified to implement 3DETR-P in PyTorch. The standard "nn.MultiheadAttention" module is used to implement the Transformer architecture. To process the input point cloud data, a single set-aggregation operation reduces the number of points to 2048 and extracts 256-dimensional point features. Dropout [41] regularization is applied to all MLPs and self-attention modules in the model with a rate of 0.1, except in the decoder, where the rate is 0.3, to prevent overfitting. For optimization, the AdamW optimizer [42] is used with a learning rate decayed to a small final value by a cosine learning rate schedule [43], a weight decay of 0.1, and gradient clipping at an L2 norm of 0.1. The weight for the PSML parameter loss in Equation (1) is set to 3. Training is performed on an NVIDIA GeForce RTX 4090 GPU for 350 epochs with a batch size of 16. Other parameters are configured to be consistent with [36].
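The optimization setup described above maps onto standard PyTorch components roughly as follows. This is a sketch under stated assumptions: the stand-in model, the base and final learning rates, and the dummy loss are placeholders, not the actual 3DETR-P code:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in module; in practice this is the 3DETR-P network.
model = nn.Linear(256, 8 + 5)  # e.g., box parameters + 5 PSML parameters

EPOCHS, BASE_LR, FINAL_LR = 350, 5e-4, 1e-6  # base/final LRs are assumptions

optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.1)
# Cosine decay of the learning rate over training, as described above.
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=FINAL_LR)

for epoch in range(EPOCHS):
    # The real forward pass and loss go here; the total loss weights the
    # PSML-parameter term by 3, per Equation (1). Dummy loss for the sketch:
    loss = model(torch.randn(16, 256)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping at an L2 norm of 0.1, as stated above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    scheduler.step()
```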
The dataset generated in Section 4.3.1 is split into train, validation, and test sets with 1200, 400, and 400 samples, respectively (60%–20%–20%). The 3DETR-P network designed in Section 3.3 is trained to detect 3D objects in the scene and estimate the associated PSML parameters.
Figure 13 shows the MSE results on the validation set for the five estimated PSML parameters across training steps. The performance of the 3DETR-P network, as evaluated by the MSE on the validation set, demonstrates consistent convergence across the five estimated parameters. All parameters show a steady decline in MSE throughout training, ultimately converging to values close to 0.1. This indicates that the 3DETR-P network predicts the shape parameters well, achieving a low error rate on the validation data.
Table 2 shows the testing results of the trained network. Following the practice in [36], the detection performance is reported on the test set using the mean Average Precision (mAP) at two different IoU (Intersection over Union) thresholds of 0.25 and 0.5, denoted as mAP25 and mAP50, respectively. The PSML parameters are evaluated by calculating the mean absolute error (MAE) between the estimates and the ground truth. The row corresponding to 3DETR-P in the table presents its performance on the room dataset created within this study. Overall, it succeeds in detecting objects within the scene, although its performance on door detection is comparatively lower. This discrepancy may be attributed to the fact that doors in the scene often (1) lack sufficient thickness to be distinctly separated from the wall they are embedded within and (2) lack sufficient depth variation within the object to provide features from which the network can learn the structure. The MAE row reports the MAE of the PSML parameters, quantitatively showcasing the success of estimating the 3D shapes from the input point cloud.
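For clarity, the two evaluation quantities can be sketched as follows. This minimal illustration uses axis-aligned boxes; the full protocol of [36], including oriented boxes and precision-recall integration, is omitted here:

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda b: np.prod(b[1] - b[0])
    return inter / (vol(box_a) + vol(box_b) - inter)

def psml_mae(pred, gt):
    """Mean absolute error over the 5-slot PSML parameter vectors."""
    return np.abs(np.asarray(pred) - np.asarray(gt)).mean()

# A predicted box counts as correct at IoU >= 0.25 (mAP25) or 0.5 (mAP50).
a = (np.zeros(3), np.ones(3))
b = (np.array([0.1, 0.0, 0.0]), np.array([1.1, 1.0, 1.0]))
print(iou_3d(a, b) >= 0.5)  # True: IoU = 0.9 / 1.1 ~ 0.82
print(psml_mae([2.0, 1.0, 0.8, 0.1, 0.0], [2.1, 0.9, 0.8, 0.1, 0.0]))  # 0.04
```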
Table 2 also includes results of the original 3DETR on other datasets, as reported in [36]. The row corresponding to 3DETR-SUN reflects the 3DETR results from [36] on the SUN RGB-D dataset [44], and the 3DETR-SN row shows results on the ScanNetV2 dataset [45]. Although a direct comparison between the results in this article and theirs is not possible, it can be seen that 3DETR-P on the generated synthetic dataset achieves detection performance comparable to 3DETR on the ScanNetV2 dataset for classes like chair, couch, and door, and outperforms 3DETR for the other classes. The detection performance, together with the MAE results of the estimated PSML parameters, indicates the capability of the proposed PSML and DL fused system in detecting objects and estimating their parametric representation.
Figure 14 visualizes three examples, presenting (1) the RGB image of the scene (left), (2) the input point cloud to 3DETR-P (middle) with the ground truth (red) and predicted 3D bounding boxes (green), and (3) the 3D shapes reconstructed using the PSML parameters estimated by 3DETR-P (right). The appearance of the reconstructed shapes is omitted, as such information is not estimated by the network in this experiment. The reconstructed 3D shapes closely resemble those observed in the RGB images and the point cloud, qualitatively demonstrating the success of 3D shape estimation from the input point cloud.
Table 3 shows the MSE results of the PSML parameter estimation (5 parameters, denoted p1–p5) applied to the three instances depicted in Figure 14. In this analysis, the MSE values for objects belonging to the same class are aggregated for clarity. As indicated in Section 4.3.1, among the objects categorized as table, chair, bookshelf, window, and door, the counts of non-zero parameters are 4, 3, 5, 4, and 3, respectively. It is worth noting that within the parameter vector, non-zero parameters consistently appear before zero parameters. For instance, within the table class, parameters p1 through p4 are non-zero, while p5 is set to zero. The estimation errors for the zero parameters, as shown in the table, range from 0 to 0.02, while the MSE values for the non-zero parameters range from 0.06 to 0.17. These findings further corroborate the performance reported in Table 2, demonstrating the efficacy of the 3DETR-P network in accurately estimating shape grammar parameters from 3D point data.