1. Introduction
Freehand Design Sketching (FDS) in architecture and industry refers to a creative activity in which designers visually express and convey their ideas through two-dimensional media (e.g., paper and digital drawing boards) during the early stages of the design process. It provides an effective platform for exploration and communication within the design team or with the client [1]. FDS not only aids the design process but also captures the unique concepts envisioned by the designer, translating them into a tangible visual form. In architectural design, FDS is commonly used to explore architectural forms, spatial layouts, and material selections, providing essential references for initial design proposals [2]. In industrial design, FDS allows designers to experiment with different product appearances and functional structures, helping them find the most suitable design solution [3]. FDS therefore plays an irreplaceable role in the design process, connecting the initial creative idea with its subsequent concrete implementation and infusing the entire process with inspiration and vitality.
Figure 1 illustrates that converting FDS in architectural and industrial design into Digital 3D Models (D3DM) is crucial for verifying, refining, and further developing design proposals [4]. This method enhances the depth and breadth of the design process and also facilitates effective communication and understanding among the various parties involved in a project [5].
Traditional CAD systems are limited by issues such as manual input errors, design rigidity, scalability challenges, lack of predictive insights, and difficulties in collaboration [6]. In contrast, deep learning enhances design workflows by enabling automation, flexibility, scalability, predictive capabilities, and improved collaboration [7]. Compared to Computer-Aided Design (CAD) software, Artificial Intelligence (AI) based on deep learning techniques offers an innovative yet challenging technological approach for 3D shape reconstruction from FDS. Manually transforming FDS into D3DM faces at least three main challenges. First, FDS is often spontaneous and unstructured, lacking precise dimensions and detailed information, which makes the transformation a modeling task with high communication costs [8]. The related sketches may be incomplete, ambiguous, or even distorted, making it difficult for modelers to accurately capture the intended shapes and structures [9]. Second, the limited views and scale references that FDS offers constrain the exact replication of item proportions during 3D reconstruction [10]. Third, fidelity and precision may be negatively impacted by inaccuracies, stains, and flaws introduced during the designer's sketching [11].
To address the challenges in interpreting and reconstructing 3D models, sophisticated algorithms and techniques can be leveraged to effectively capture and express the design intent. Relevant methods that use general filtering [12] or consolidation [13] to transform rough sketches into clear line drawings often require manual intervention and lack precision, leading to discrepancies between the original sketches and the reconstructed models. These issues hinder the seamless transition from FDS to D3DM and call for more efficient and accurate technologies in this field. In recent years, advancements in machine learning, computer vision, and computational geometry have enabled greater automation and accuracy in 3D shape reconstruction. Technologies such as deep learning-based image recognition and reconstruction [14], CAD-based automated modeling tools [2], and virtual reality [15] have introduced new technical possibilities for transforming FDS into D3DM.
In architectural and industrial design, the transformation of FDS into D3DM must meet high product standards for (i) restoring the modeling characteristics, (ii) ensuring symmetry, and (iii) maintaining smooth surfaces. However, 3D shape reconstruction based on deep learning techniques must still address these challenges. These difficulties highlight the existing research gaps, namely the need for robust algorithms that effectively improve the quality of 3D shape reconstruction [16]. Early methods relied on predefined rules to infer local geometric features and generate 3D shapes; however, these rules often limited the diversity of the resulting reconstructions.
The advent of deep learning has revolutionized this field. Trained on large datasets of synthetic sketches, deep neural networks can predict 3D geometries from freehand sketches. For example, conditional Generative Adversarial Networks (cGANs) can process free-form sketches without strict structural requirements. Multi-view integration methods reduce ambiguity in single-view sketches, enhancing depth perception and structural coherence. Direct shape optimization methods reconstruct 3D shapes from multiple sketches, providing greater detail and alignment with artistic intent. In our work, we use the marching cubes algorithm to convert implicit functions into surface meshes and render any view of the reconstructed surface through rasterization. This method supports iterative model optimization based on user input, making it ideal for design and prototyping applications.
This study selected eyeglass frame design as the subject for 3D shape reconstruction from FDS due to the distinct modeling characteristics of its rims, bridges, and temples. These components define the product’s design and feature absolute symmetry and smooth surfaces, which are important criteria for the D3DM outcomes. The transformation from an eyeglass frame’s FDS to its D3DM requires 3D shape reconstruction to solve challenges such as surface fitting, edge detection, and geometric shape matching. Therefore, this research explores the essential technological application of 3D reconstruction algorithms to capture the modeling characteristics of the target product and ensure symmetry and smoothness standards. Innovative solutions in this field can help promote the broad application prospects of 3D shape reconstruction from FDS based on deep learning techniques. The research results of the essential technologies can directly drive the progress of computer-aided design and digital manufacturing, opening up new possibilities for intuitive design conceptualization and efficient scheme adjustment in architectural and industrial design. Overall, this research introduces a transformative approach to 3D shape reconstruction from FDS, leveraging deep learning techniques to overcome traditional challenges in fidelity, symmetry, and geometric precision, setting a new benchmark in the integration of AI and creative design workflows.
3. Methods
3.1. Problem Statement
As shown in Figure 3, our method takes an eyeglass frame sketch as input, and our goal is to reconstruct the corresponding D3DM of the eyeglass frame. We assume that the sketch represents a shape drawn in perspective. We define a binary sketch as $S \in \{0, 1\}^{H \times W}$, where zero denotes pixels covered by pen strokes and one denotes uncovered pixels.

An encoder $E$ and a decoder $D$ are trained, whose composite function, $D \circ E$, constructs a mesh $M = (V, F)$, where $V$ denotes vertex positions in $\mathbb{R}^3$ and $F$ denotes faces. In this formulation, $M = D(\mathbf{z})$, where $\mathbf{z} = E(S)$ is the latent vector that defines the 3D geometry of the mesh. We describe the encoder and the decoder in Section 3.2 and Section 3.3, respectively.
The main objective in obtaining the 3D eyeglass frame mesh model is to optimize the latent vector $\mathbf{z}$ so that the projection of the 3D mesh $M$ aligns with the 2D sketch $S$. This is achieved by minimizing the 2D Chamfer distance between the projected mesh and the input sketch (Section 3.4). Section 3.5 details an optimization based on the symmetry of the eyeglass frame, emphasizing that this symmetrical property improves the accuracy of 3D reconstructions derived from FDS. Finally, we present implementation details in Section 3.6.
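To make this formulation concrete, the following is a minimal PyTorch sketch of the resulting optimization loop. The helper names (`encode`, `decode_mesh`, `project_contours`, `chamfer_2d`) are hypothetical stand-ins for the components detailed in Sections 3.2 through 3.5, not a verbatim excerpt of our implementation, and the renderer is assumed to be differentiable.

```python
import torch

def reconstruct(sketch, encode, decode_mesh, project_contours, chamfer_2d,
                steps=200, lr=1e-2):
    """Illustrative latent-space optimization loop (helper names are hypothetical).

    sketch: binary image S, 0 = pen strokes, 1 = background.
    """
    z = encode(sketch).detach().requires_grad_(True)  # initial latent z = E(S)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        mesh = decode_mesh(z)                    # mesh M = D(z), assumed differentiable in z
        contours_2d = project_contours(mesh)     # projected 2D contour points
        loss = chamfer_2d(contours_2d, sketch)   # bidirectional 2D Chamfer distance
        optimizer.zero_grad()
        loss.backward()                          # gradients flow back to z
        optimizer.step()
    return decode_mesh(z)
```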
Figure 4 illustrates the architecture of the DINOv2 model, providing a comprehensive view of its layered structure and the flow of data through various network components. The model consists of multiple layers, including convolutional layers (Conv), normalization layers (Norm), and activation functions such as ReLU and Tanh, each playing a pivotal role in feature extraction and data processing. These components work collaboratively to progressively refine extracted features, enhancing the model’s ability to handle complex structures in image data. The network is divided into three primary layers, each with distinct functionality that contributes to the model’s performance. Layer 1 serves as the initial stage of feature extraction, where convolutional and normalization layers process raw input data to capture fundamental image features, with ReLU further enhancing the network’s capacity to learn nonlinear relationships. Layer 2 builds upon this foundation, incorporating additional convolutional and normalization layers to refine the features learned in Layer 1, enabling the model to discern more intricate patterns. Layer 3 continues this refinement process, using further convolutional and normalization layers to extract highly abstract and complex features. The data flows through the network in a clear progression, with each layer extracting progressively more sophisticated features essential for accurate predictions or specific tasks.
Regarding computational efficiency and resource usage, we recognize the importance of these factors in evaluating the practicality of the DINOv2 framework. While powerful, the model demands significant computational resources due to the use of multiple convolutional layers, normalization, and activation functions, which increase both memory and processing requirements. However, the modular design allows for a flexible trade-off between performance and resource consumption, depending on the application and available hardware. For example, increasing the number of convolution layers in Layers 2 and 3 enhances feature extraction but also raises the computational load. This trade-off is critical in real-time applications, where balancing accuracy and efficiency is essential. Additionally, techniques like batch normalization and activation functions improve convergence speed and model stability, enhancing training efficiency.
3.2. Feature Extraction Using DINO
DINOv2 [44] is an advanced self-supervised learning model that improves the training of robust visual features without supervision. In this study, we selected the DINOv2 self-supervised learning framework for feature extraction due to its distinct advantages. First, DINOv2 eliminates the need for complex preprocessing, such as extensive data augmentation or label processing, enabling the direct use of raw images for training, thus simplifying the workflow and reducing dataset preparation time. Second, DINOv2 excels in feature representation by maximizing consistency between homologous images, allowing the model to learn robust representations essential for tasks like sketch feature extraction. Third, its self-supervised approach enables feature extraction from large-scale image datasets without manual annotations, making it ideal for handling vast amounts of unlabeled data. Finally, DINOv2 demonstrates strong robustness in recognizing key elements in images, particularly for precise sketch feature extraction. Overall, its simplicity, robustness, and self-supervised capabilities make DINOv2 an optimal choice for our task, offering efficiency and strong performance without relying on extensive labeled datasets. It was trained on a diverse collection of data, including ImageNet-22k, the train split of ImageNet-1k, Google Landmarks, and an assortment of fine-grained datasets, comprising 1.2 billion unique images and providing a broad perspective. DINOv2 showcases significant advancements in handling various computer vision tasks without fine-tuning.
The DINO framework employs a dual-network architecture consisting of a student and a teacher network. The student network learns by attempting to replicate the output of the teacher network, which in turn is an exponential moving average of the student's parameters. The core process involves generating multiple augmented views ($\tilde{x}$) of a given input image ($I$), which these networks then process. The resultant feature vectors from the student ($f_s$) and teacher ($f_t$) networks are utilized to compute the distillation loss as follows:

$$\mathcal{L}_{\mathrm{distill}} = -\sum_{i} P_t(\tilde{x})^{(i)} \log P_s(\tilde{x})^{(i)}, \qquad P_*(\tilde{x}) = \operatorname{softmax}\big(f_*(\tilde{x}) / \tau\big),$$

where $\tau$ represents the temperature scaling parameter. The output function of DINOv2 can be denoted as $f_{\mathrm{DINO}} \colon \mathbb{R}^{H \times W} \to \mathbb{R}^{384}$, converting an input image $I$ of size $H \times W$ into a 384-dimensional feature vector.
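For illustration, a simplified PyTorch sketch of this distillation objective follows. It omits details of the full DINO recipe, such as centering of the teacher outputs and the multi-crop strategy, and the temperature values shown are assumptions.

```python
import torch.nn.functional as F

def dino_distillation_loss(student_out, teacher_out, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student softmax distributions.

    student_out, teacher_out: raw output logits of shape (batch, dim).
    The teacher branch is detached: it is updated only as an EMA of the student.
    """
    p_teacher = F.softmax(teacher_out.detach() / tau_t, dim=-1)
    log_p_student = F.log_softmax(student_out / tau_s, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()
```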
Since eyeglass frames often showcase intricate curves, sharp edges, and sophisticated geometric shapes, we utilize the DINOv2 model based on the Vision Transformer (ViT) architecture as our encoder to extract features from the FDS of the eyeglass frame, allowing us to effectively identify a diverse range of geometric features.
Given an input sketch $S$, we employ the DINOv2 model to extract features and project them to a fixed dimension using a linear layer,

$$\mathbf{z} = W f_{\mathrm{DINO}}(S) + b, \qquad \mathbf{z} \in \mathbb{R}^{256},$$

where $W$ and $b$ are the trainable weights and bias of the projection.
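As a concrete illustration, pretrained DINOv2 backbones are distributed via torch.hub; the sketch below loads a ViT-S/14 backbone, whose 384-dimensional output matches the projection from 384 to 256 dimensions described in Section 3.6. The preprocessing and wrapper names are illustrative rather than a verbatim excerpt of our code.

```python
import torch
import torch.nn as nn

# Load a pretrained DINOv2 ViT-S/14 backbone (384-dim features) and freeze it.
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Trainable projection from the 384-dim DINOv2 feature to the 256-dim latent z.
projection = nn.Linear(384, 256)

def encode_sketch(sketch_rgb):
    """sketch_rgb: (batch, 3, 224, 224); sides must be divisible by the 14-px patch."""
    with torch.no_grad():
        features = backbone(sketch_rgb)  # (batch, 384) CLS-token features
    return projection(features)          # (batch, 256) latent vector z
```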
3.3. Implicit Representation
We want to learn a generalized representation of 3D eyeglass frames of different shapes. In this section, we learn a Signed Distance Function (SDF), which is a continuous function that maps a 3D point $\mathbf{x} \in \mathbb{R}^3$ to a signed distance $s \in \mathbb{R}$,

$$SDF(\mathbf{x}) = s.$$

The sign represents whether the point is inside (negative) or outside (positive) the watertight surface, and the magnitude represents the distance to that surface. Therefore, we further express the SDF as follows:

$$SDF(\mathbf{x}) = \begin{cases} -\,d(\mathbf{x}, \partial\Omega), & \mathbf{x} \in \Omega, \\ \phantom{-}\,d(\mathbf{x}, \partial\Omega), & \mathbf{x} \notin \Omega, \end{cases}$$

where $\Omega$ denotes the interior of the object, $\partial\Omega$ its watertight surface, and $d(\mathbf{x}, \partial\Omega)$ the distance from $\mathbf{x}$ to that surface. We implicitly define the underlying surface of the object as the 0-level set of the neural function, denoted as $\{\mathbf{x} \in \mathbb{R}^3 \mid SDF(\mathbf{x}) = 0\}$. This implicit surface can be rendered through raycasting or through rasterization of a mesh obtained with, for example, marching cubes.
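For intuition, the sign convention can be checked against an analytic SDF; a minimal example for an origin-centered sphere is shown below.

```python
import torch

def sdf_sphere(points, radius=0.5):
    """Analytic SDF of an origin-centered sphere: negative inside, positive outside."""
    return points.norm(dim=-1) - radius

pts = torch.tensor([[0.0, 0.0, 0.0],   # center: SDF = -0.5 (inside)
                    [0.5, 0.0, 0.0],   # on the surface: SDF = 0.0
                    [1.0, 0.0, 0.0]])  # outside: SDF = +0.5
print(sdf_sphere(pts))  # tensor([-0.5000, 0.0000, 0.5000])
```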
Inspired by Park et al. [45], we represent the object geometry as an SDF field and directly regress the continuous SDF from point samples using deep neural networks. The resulting trained network can predict the SDF value of a given query position, from which we can extract the zero level-set surface by evaluating spatial samples. In practice, we use a Multi-Layer Perceptron (MLP) neural network $f_\theta$ as the decoder. It takes a 3D point $\mathbf{x}$ and the latent vector $\mathbf{z}$ as input and generates the point's distance to the closest surface. We train the parameters $\theta$ of $f_\theta$ to obtain a good approximator of the given SDF:

$$f_\theta(\mathbf{z}, \mathbf{x}) \approx SDF(\mathbf{x}).$$

The training is performed by minimizing the sum over losses between the predicted and real SDF values $s$ of points $\mathbf{x}$ under the following $L_1$ loss function:

$$\mathcal{L}\big(f_\theta(\mathbf{z}, \mathbf{x}), s\big) = \big|\operatorname{clamp}\big(f_\theta(\mathbf{z}, \mathbf{x}), \delta\big) - \operatorname{clamp}(s, \delta)\big|,$$

where $\operatorname{clamp}(x, \delta) := \min\big(\delta, \max(-\delta, x)\big)$ introduces the parameter $\delta$ to control the distance from the surface over which we expect to maintain a metric SDF. Larger values of $\delta$ allow for fast ray tracing, since each sample gives information about safe step sizes. Smaller values of $\delta$ can concentrate network capacity on details near the surface.
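A direct PyTorch transcription of this clamped $L_1$ objective might look as follows; the value of $\delta$ shown is a placeholder, not the setting used in our experiments.

```python
import torch

def clamped_sdf_l1(pred_sdf, gt_sdf, delta=0.1):  # delta: placeholder value
    """Clamped L1 loss between predicted and ground-truth SDF samples.

    Clamping both terms to [-delta, delta] focuses network capacity on the
    region near the surface, as in DeepSDF [45].
    """
    return torch.abs(pred_sdf.clamp(-delta, delta) -
                     gt_sdf.clamp(-delta, delta)).mean()
```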
Once trained, the network implicitly represents the surface as the zero iso-surface of $f_\theta(\mathbf{z}, \cdot)$, which we can visualize using raycasting or marching cubes.
3.4. Minimizing 2D Chamfer Distance
Our primary aim in this section is to enhance the precision of our 3D model of eyeglass frames through an optimization process that aligns it closely with the FDS. We use the 2D Chamfer distance as a metric to measure the alignment between the projection of the reconstructed 3D model and the corresponding FDS.
The optimization begins by identifying the essential points on our 3D mesh $M$ that should align with the contour of the FDS. These points are critical in minimizing the distance between the projected mesh contour and the sketch outline.
To execute this, we employ a mapping function that projects points from 3D space onto the 2D plane of the sketch. We first project the entire 3D mesh onto a binary image $R$. In this projection, pixels representing external contours have a value of zero, while all others have a value of one. We then identify the 3D points on the mesh that project onto zero-valued pixels of the contour image $R$, signifying the contour points $P$ of the mesh.
Similar to Guillard et al.'s [37] approach, we use PyTorch3D [46] to access both the facet IDs and the barycentric coordinates of $M$ that contribute to the contour regions indicated by the binary contour image $R$. The position of each contour point $p \in P$ is then interpolated using the vertices $v_1$, $v_2$, $v_3$ of the associated facet. Since the vertices of the facet are differentiable functions of the latent vector $\mathbf{z}$, the coordinates of $p$ are similarly differentiable.
Applying this calculation to all external contour points, we compile a set of 3D points characterized by

$$P_{3D} = \{\, p \in \mathbb{R}^3 \mid p = b_1 v_1 + b_2 v_2 + b_3 v_3 \,\},$$

where $(b_1, b_2, b_3)$ are the barycentric coordinates of $p$ within its facet. The corresponding 2D projections of $P_{3D}$ can be denoted as

$$P_{2D} = \pi(P_{3D}),$$

where $\pi$ is the perspective projection onto the sketch plane.
To enhance the alignment, we refine the outer contours of the target sketch $S$. This refinement employs a ray-shooting algorithm from all four borders of the image to precisely preserve the first encounter with black pixels, resulting in a refined sketch $S^*$. In this refined sketch, $S^*(p) = 0$ denotes pixels $p$ lying on the contour, while $S^*(p) = 1$ denotes background pixels. This process effectively isolates and highlights the outermost features of the sketch, disregarding interior details.
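One plausible realization of this border ray-shooting step, assuming the sketch is a NumPy array with 0 for strokes and 1 for background, is sketched below; it reflects our reading of the procedure rather than a verbatim excerpt of the implementation.

```python
import numpy as np

def refine_outer_contour(sketch):
    """Keep only the first stroke pixel hit by rays shot from each image border.

    sketch: 2D array, 0 = pen strokes, 1 = background.
    Returns S*: 0 on the outer contour, 1 elsewhere.
    """
    refined = np.ones_like(sketch)
    h, w = sketch.shape
    for i in range(h):                        # rays from the left and right borders
        cols = np.flatnonzero(sketch[i] == 0)
        if cols.size:
            refined[i, cols[0]] = 0           # first hit from the left
            refined[i, cols[-1]] = 0          # first hit from the right
    for j in range(w):                        # rays from the top and bottom borders
        rows = np.flatnonzero(sketch[:, j] == 0)
        if rows.size:
            refined[rows[0], j] = 0           # first hit from the top
            refined[rows[-1], j] = 0          # first hit from the bottom
    return refined
```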
Our objective is to align the filtered sketch $S^*$ and the external contours of the projected mesh, $P_{2D}$, as closely as possible. We define our objective function using a bidirectional 2D Chamfer loss,

$$\mathcal{L}_{\mathrm{CHD}} = \sum_{p \in P_{2D}} \min_{q \in C(S^*)} \lVert p - q \rVert_2^2 \;+\; \sum_{q \in C(S^*)} \min_{p \in P_{2D}} \lVert q - p \rVert_2^2,$$

where $C(S^*)$ denotes the set of contour pixels of $S^*$. Here, $P_{2D}$ is the projection of the 3D vertices in $P_{3D}$, and the coordinates of the 3D vertices in $P_{3D}$ are differentiable with respect to $\mathbf{z}$. Since $f_\theta$ is differentiable, so are their 2D projections in $P_{2D}$ and the loss $\mathcal{L}_{\mathrm{CHD}}$ as a whole.
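The loss reduces to nearest-neighbor squared distances in both directions. A dense pairwise-distance sketch is given below; averaging instead of summing the two terms is an illustrative normalization choice.

```python
import torch

def chamfer_2d(proj_points, contour_px):
    """Bidirectional 2D Chamfer loss.

    proj_points: (N, 2) differentiable projections P_2D of the mesh contour points.
    contour_px:  (M, 2) fixed coordinates of contour pixels extracted from S*.
    """
    d = torch.cdist(proj_points, contour_px)    # (N, M) pairwise distances
    return (d.min(dim=1).values ** 2).mean() + \
           (d.min(dim=0).values ** 2).mean()
```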
3.5. Optimizing Based on Symmetry
In this section, our goal is to refine the 3D eyeglass frame mesh model by leveraging the symmetrical properties of the object. To compensate for the missing information from unseen views, we assume the eyeglass frame exhibits perfect bilateral symmetry, where the left part mirrors the right. This assumption holds under the condition that deformations or asymmetrical designs are minimal. We exploit this symmetry by generating a mirrored version of the input sketch $S$, which we denote as $S'$, achieved by a horizontal flip transformation. We obtain the corresponding camera poses by applying a fixed transformation matrix, which reflects the original camera extrinsic parameters across the axis of symmetry; the intrinsic parameters are unchanged. The mirror image acts as a pseudo-projected image for a hypothetical viewpoint, enabling us to infer the appearance and geometry from angles not captured in the original view.
Using $S'$, we generate a filtered sketch $S'^*$, akin to the process described in Section 3.4 for the original sketch $S$. To optimize the model, we extend our bidirectional 2D Chamfer loss function to incorporate the symmetry,

$$\mathcal{L}_{\mathrm{sym}} = \mathcal{L}_{\mathrm{CHD}}\big(S^*, P_{2D}\big) + \mathcal{L}_{\mathrm{CHD}}\big(S'^*, P'_{2D}\big),$$

where $P'_{2D}$ denotes the external contour points of the mesh projected from the mirrored viewpoint.
This comprehensive approach helps minimize discrepancies between the 3D model and both the observed and the inferred 2D representations.
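Concretely, the symmetric term only requires flipping the sketch contour pixels, reflecting the camera extrinsics, and summing the two Chamfer terms. The sketch below assumes symmetry about the $x = 0$ plane and a hypothetical `render_contour_points` helper; it reuses the `chamfer_2d` sketch from Section 3.4.

```python
import torch

# Reflection across the x = 0 symmetry plane (homogeneous 4x4 matrix).
MIRROR = torch.diag(torch.tensor([-1.0, 1.0, 1.0, 1.0]))

def symmetric_chamfer_loss(mesh, contour_px, width, extrinsics, render_contour_points):
    """Bidirectional Chamfer loss on the original and mirrored views.

    contour_px: (M, 2) contour pixels of the filtered sketch S*.
    width: sketch width in pixels, used for the horizontal flip.
    render_contour_points: hypothetical helper returning projected contours P_2D.
    """
    # Original view.
    loss = chamfer_2d(render_contour_points(mesh, extrinsics), contour_px)

    # Mirrored view: flip the sketch horizontally and reflect the extrinsics.
    flipped_px = torch.stack([width - 1 - contour_px[:, 0], contour_px[:, 1]], dim=1)
    mirrored = render_contour_points(mesh, extrinsics @ MIRROR)
    return loss + chamfer_2d(mirrored, flipped_px)
```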
3.6. Implementation Details
Train. We implement our models using PyTorch Lightning [47]. For the encoder, we freeze the DINOv2 model and train only the final linear layer, which projects the feature dimension from 384 to 256. For the decoder, we set the clamping parameter $\delta$ (Section 3.3) and use a feed-forward network consisting of eight fully connected layers, each with dropout applied. All internal layers are 512-dimensional and utilize ReLU activations, while the output layer employs a tanh activation to regress the SDF values. We found batch normalization [48] unstable during training, so we applied weight normalization [49] instead. We use the Adam optimizer [50] with a fixed learning rate and train the network for 400,000 iterations across four NVIDIA RTX 3090 GPUs, with a batch size of 16 per GPU.
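A minimal decoder matching the stated configuration (eight weight-normalized, 512-dimensional fully connected layers with ReLU and dropout, and a tanh output) might look as follows; the dropout rate and the absence of skip connections are our assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class SDFDecoder(nn.Module):
    """Eight weight-normalized FC layers; input is [z, x], output is an SDF value."""

    def __init__(self, latent_dim=256, hidden_dim=512, num_layers=8, dropout=0.2):
        super().__init__()
        dims = [latent_dim + 3] + [hidden_dim] * (num_layers - 1) + [1]
        layers = []
        for i in range(num_layers):
            layers.append(weight_norm(nn.Linear(dims[i], dims[i + 1])))
            if i < num_layers - 1:              # hidden layers: ReLU + dropout
                layers += [nn.ReLU(), nn.Dropout(dropout)]
        self.net = nn.Sequential(*layers)

    def forward(self, z, points):
        """z: (B, 256) latent vectors; points: (B, 3) query positions."""
        return torch.tanh(self.net(torch.cat([z, points], dim=-1)))
```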
Figure 5 illustrates the training process of a model that takes an input sketch and employs an encoder–decoder network to predict SDF values. The training focuses on minimizing the discrepancy between the predicted and actual SDF values.
Inference. At inference time, we first retrieve the latent vector $\mathbf{z}$ from the encoder, as in the training process. Next, we voxelize the 3D space using a grid $G$ consisting of $N^3$ voxels and compute the 3D coordinates of each voxel $v$ based on the voxel size and origin. Finally, we process the entire voxel grid in batches, along with the latent vector $\mathbf{z}$, using the decoder to obtain the predicted SDF value of each voxel $v$.
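A sketch of this inference procedure, pairing the decoder above with `skimage.measure.marching_cubes` to extract the zero iso-surface, is given below; the grid resolution and spatial extent are placeholders rather than our experimental settings.

```python
import torch
from skimage import measure

@torch.no_grad()
def extract_mesh(decoder, z, n=128, bound=1.0, batch_size=65536):
    """Evaluate the decoder on an n^3 grid and run marching cubes on the SDF volume.

    z: (1, 256) latent vector from the encoder.
    """
    axis = torch.linspace(-bound, bound, n)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing='ij'), dim=-1)
    points = grid.reshape(-1, 3)                   # (n^3, 3) voxel centers

    sdf = []
    for chunk in points.split(batch_size):         # batch the voxel queries
        z_rep = z.expand(chunk.shape[0], -1)       # broadcast latent to the batch
        sdf.append(decoder(z_rep, chunk).squeeze(-1))
    volume = torch.cat(sdf).reshape(n, n, n).numpy()

    # Zero level set -> triangle mesh; spacing converts voxel units to world units.
    verts, faces, _, _ = measure.marching_cubes(volume, level=0.0,
                                                spacing=(2 * bound / (n - 1),) * 3)
    return verts - bound, faces                    # recenter vertices around the origin
```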
Figure 6 outlines the inference process, which begins with the encoder’s latent vector. The 3D space is then voxelized, and the decoder is used to predict SDF values for each voxel, enabling the reconstruction of the 3D model. The process is further refined by incorporating symmetry, where a mirrored sketch and a symmetrical loss function are applied to enhance the accuracy of the 3D eyeglass frame model.
5. Discussion
5.1. Reconstructing Non-Symmetrical Designs
This study focuses on the 3D reconstruction of symmetrical designs, exemplified by objects such as eyeglass frames, which are prevalent in industrial and architectural domains. Symmetry is a defining feature in designs ranging from automobiles and monitors to iconic architectural structures like the Louvre Pyramid. However, non-symmetrical designs are equally pervasive in these fields, characterized by their unique geometries and deliberate asymmetry. Such designs often embody a designer's intention to create dynamic, organic forms that deviate from traditional symmetrical patterns [52]. In industrial design, non-symmetrical examples include gaming mice designed for one-handed use, while architectural masterpieces like the Guggenheim Museum illustrate the aesthetic and structural appeal of asymmetry.
Building on our approach for the 3D reconstruction of symmetrical designs, we propose extending deep learning techniques to address the challenges associated with reconstructing non-symmetrical 3D shapes. Specifically, our method for symmetrical designs can be adapted to handle non-symmetrical ones by incorporating advanced computational solutions. Deep learning algorithms, which have demonstrated robust performance in reconstructing symmetrical 3D shapes, can be further refined to manage the irregularities and complexities of non-symmetrical geometries. For example, adding specialized layers to the neural network architecture could enhance its capacity to process the variations intrinsic to non-symmetrical designs. Additionally, leveraging Generative Adversarial Networks (GANs) can improve the model’s ability to infer and reconstruct missing or ambiguous components, facilitating the creation of accurate 3D models even from incomplete or imprecise sketches. By integrating these strategies, our approach becomes more versatile, enabling the effective reconstruction of both symmetrical and non-symmetrical 3D shapes while accommodating diverse design features.
5.2. Limitations and Future Work
While our method has demonstrated significant promise in interpreting FDS for 3D reconstruction, it is important to acknowledge its current limitations, particularly in handling incomplete, distorted, or ambiguous sketches. Our primary objective is to enhance the interpretation of sketches created by designers during the early stages of conceptualization. However, a key challenge remains: existing approaches, including ours, face difficulties when dealing with incomplete sketches that exhibit extreme distortions, ambiguous contours, overlapping strokes, or missing elements. This is especially problematic when sketches contain support lines or partial details, as current methods often merge these lines with adjacent curves, thereby losing critical contextual information required for accurate 3D reconstruction.
To address these challenges, we propose several strategies aimed at enhancing the robustness of our method in handling incomplete, distorted, or ambiguous sketches. Previous works have already tackled similar issues. For instance, Bessmeltsev and Solomon [53] introduced an image-processing technique to extract clean vector curves from incomplete sketches; by analyzing edge features and line continuity, their method can infer missing elements, which is essential for reconstructing 3D shapes from incomplete sketches. Similarly, Favreau et al. [54] employed deep learning algorithms to iteratively process sketch images, refining blurred or incomplete parts to generate more accurate 3D models. Additionally, Liu et al. [13] proposed a machine learning-based approach for reasoning over sketches, which predicts missing or ambiguous parts by leveraging design patterns learned from a large dataset, enabling their system to infer plausible continuations of partial sketches. We believe that integrating such machine learning methods into our pipeline could significantly enhance our model's ability to handle missing components, ambiguous details, and extreme distortions, leading to more reliable 3D reconstructions even when the input sketch is incomplete, unclear, or contains overlapping strokes.
In addition to these technical improvements, we propose incorporating a user feedback loop into the system. This would allow the system to interact with the designer when encountering ambiguous or incomplete sketches, prompting for clarifications or additional input. This iterative feedback process would help ensure that the final 3D model aligns more closely with the designer’s intent. By adopting this collaborative approach, our system would not only become more robust in handling incomplete, distorted, or ambiguous sketches but also improve the overall user experience, making the process of converting freehand sketches into 3D models more intuitive and efficient.
6. Conclusions
This study investigated the potential of deep learning techniques for the 3D reconstruction of eyeglass frames from FDS. By employing advanced neural network architectures and integrating symmetry optimization, the proposed method effectively extracts detailed 3D information from FDS, achieving enhanced accuracy and reliability in the reconstruction process. This approach demonstrates significant advancements in capturing fine details and accommodating shape variations, offering a practical solution for industrial applications. Notably, this research contributes a novel framework that combines self-supervised learning and implicit representations, setting a new standard for AI-driven 3D modeling workflows in design contexts.
Despite these achievements, challenges remain in addressing the diversity and size of datasets, as well as the computational demands of the reconstruction process. These factors currently limit the scalability of the method for real-time and high-complexity applications. Future research should prioritize the development of more efficient models to reduce computational complexity, alongside expanding dataset diversity to enhance the generalizability of the approach. Additionally, exploring novel neural network architectures and integrating user-driven feedback mechanisms could further improve reconstruction accuracy and usability. These enhancements would bridge existing gaps, enabling broader adoption in design workflows and extending the applicability of AI-driven 3D modeling technologies.