Article

A Hierarchical Neural Network for Point Cloud Segmentation and Geometric Primitive Fitting

1 College of Computer Science, South-Central Minzu University, No. 182 Minzu Avenue, Hongshan District, Wuhan 430074, China
2 Key Laboratory of Cyber-Physical Fusion Intelligent Computing (South-Central Minzu University), State Ethnic Affairs Commission, No. 182 Minzu Avenue, Hongshan District, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Entropy 2024, 26(9), 717; https://doi.org/10.3390/e26090717
Submission received: 27 May 2024 / Revised: 20 August 2024 / Accepted: 21 August 2024 / Published: 23 August 2024

Abstract

Automated generation of geometric models from point cloud data holds significant importance in the field of computer vision and has expansive applications, such as shape modeling and object recognition. However, prevalent methods exhibit accuracy issues. In this study, we introduce a novel hierarchical neural network that utilizes recursive PointConv operations on nested subdivisions of point sets. This network effectively extracts features, segments point clouds, and accurately identifies and computes parameters of regular geometric primitives with notable resilience to noise. On fine-grained primitive detection, our approach outperforms Supervised Primitive Fitting Network (SPFN) by 18.5% and Cascaded Primitive Fitting Network (CPFN) by 11.2%. Additionally, our approach consistently maintains low absolute errors in parameter prediction across varying noise levels in the point cloud data. Our experiments validate the robustness of our proposed method and establish its superiority relative to other methodologies in the extant literature.

1. Introduction

Recent advancements in three-dimensional (3D) scanning technology, coupled with increased data storage capacity, have significantly impacted the field of computer vision, offering unparalleled opportunities for the processing and analysis of 3D geometric data. Nevertheless, the scanned data often exists in rudimentary forms, such as point clouds or meshes, which lack direct expressions of structure and semantic information inherent to objects [1]. This poses a notable challenge for precise 3D object manipulation, as it hampers the understanding and reproduction of an object’s structural and semantic integrity. These challenges are acutely felt in advanced editing and model-fitting tasks, where the absence of well-defined geometric primitives and their associated parameters obstructs the achievement of desired outcomes. Consequently, researchers are assiduously converting this raw data into organized, semantically rich representations—a data enhancement process that is crucial for computer vision and serves as a foundational step towards achieving accurate digital twins of physical entities. The importance of this data enhancement process has been emphasized in the fields of Constructive Solid Geometry (CSG) [2] and optical sensor fusion [3], which contribute to 3D reconstruction. CSG is the foundation of Computer-Aided Design (CAD), further highlighting the need for such data enhancement.
In the domains of computer vision and machine learning, the RANSAC [4] algorithm is widely employed, particularly for the fitting of geometric primitives, including lines, circles, and planes. RANSAC excels at selecting the minimum subset of points from a dataset required to define a model’s parameters. Distinctly, RANSAC operates without initial knowledge of the outlier ratio within the data and demonstrates considerable resilience to outliers, often successfully estimating a reasonable model amidst substantial noise and outliers. Nevertheless, RANSAC-based fitting techniques can demand significant computational resources, particularly with vast datasets or when the ratio of inliers to outliers is low. Numerous iterations may be essential to ascertain the optimal model parameters. Furthermore, the algorithm is stochastic, resulting in varying outcomes across different executions due to the random selection of points in each iteration. It is also sensitive to the specification of parameters like the inlier threshold and iteration count, where improper configurations can yield suboptimal fitting results. In extreme cases where outliers overshadow inliers, RANSAC might converge to a local optimum without ensuring a global optimum, rendering it unsuitable for processing extensive point cloud data.
In contrast to conventional primitive-fitting methods, the Supervised Primitive Fitting Network (SPFN) framework introduced by Li et al. [5] leverages the PointNet++ [6] architecture as an encoder for processing input point clouds. This network can autonomously determine the relationships between points and primitives, discerning their types and deducing the parameters that minimize fitting error, which markedly diminishes reliance on manual parameter adjustment and initial processing stages. Thanks to the deep network's robust performance, SPFN excels at feature extraction from intricate datasets and managing complex environments that pose a challenge to traditional algorithms. Nevertheless, SPFN's efficacy is limited by PointNet++'s inherent processing capacity, leading to difficulties in fitting small-sized primitives and handling high-resolution point clouds. To address these constraints, researchers such as Lê et al. [7] have improved upon SPFN by introducing the Cascaded Primitive Fitting Network (CPFN) framework, enhancing the matching of diverse primitive sizes with high-resolution point clouds. Despite these advancements, fundamental issues persist, as the networks continue to utilize PointNet++ and exhibit limitations in point feature extraction and primitive fitting.
In this study, we propose a novel architecture, the details of which are depicted in Figure 1.
The initial step in our research involves de-noising and re-sampling the input point cloud data through our preprocessing module. This preprocessing not only improves the quality of the data but also lays the foundation for subsequent feature extraction. Following the preprocessing, the feature extraction network mines rich geometric features from these refined point clouds, thus providing support for in-depth geometric shape analysis and understanding. The effectiveness of this network is demonstrated in its ability to capture the essential geometric information in the point cloud data, which is an indispensable step in the entire primitive-fitting process. After feature extraction, the segmentation and classification module further processes these features, effectively partitioning the entire point cloud into multiple subsections and accurately categorizing each section into the corresponding geometric primitive types. This module employs advanced machine-learning techniques capable of discerning and categorizing various geometric structures within the point cloud, thereby ensuring accuracy and efficiency in subsequent processing steps. Following meticulous segmentation and accurate classification, the parameter prediction module is responsible for precise parameter estimation of each segmented primitive. This module utilizes a weighted least squares method to infer the geometric parameters of the primitives, thus achieving precise fitting of the geometric primitives to the point cloud data.
The primary contributions of our research can be summarized as follows:
  • This study introduces an innovative feature extraction network that is specifically designed to address the issue of imprecision that arises during the process of geometric primitive recognition.
  • We propose a parameter prediction module, the design of which is intended to predict the parameters of geometric primitives more accurately. This is crucial for enhancing the precision of shape representation and for the understanding of complex 3D structures, marking an important step toward fine-grained 3D modeling.
  • Extensive experimental validation demonstrates that our method shows significant advantages in terms of accuracy. These experimental results fully substantiate the effectiveness and practicality of our model in the processing of 3D geometric data, providing reliable technical support for further research and practical applications in the field of computer vision.

2. Related Work

Three-dimensional point cloud data inherently lack a direct representation of object structure and semantic information, a trait that poses a significant obstacle to precise three-dimensional object manipulation within the field of computer vision [1]. This limitation constrains the ability to discern and replicate the structural and semantic characteristics of objects from the data. The constraint is particularly pronounced during complex tasks such as advanced editing and model fitting. Due to the absence of clearly defined geometric primitives and their associated parameters, achieving the anticipated results becomes increasingly challenging.
In the field of unstructured data extraction, such as recovering or extracting CSG representations of constructed entities from point clouds, 3D point cloud primitive fitting has always been a focal point of interest for researchers. This paper provides a comprehensive review of cutting-edge methods for 3D point cloud primitive fitting, with detailed technical trends referenced in in-depth research literature [8,9].
In traditional methods, the processing primarily involves RANSAC [10,11,12], parameter space transformation [13], and the use of clustering and segmentation techniques [14,15,16]. Among them, RANSAC [10] and its variant methods [11,17,18,19,20,21] are particularly common in the field of computer vision for the detection and fitting of geometric primitives. Methods based on RANSAC are capable of approximating the estimation of model parameters iteratively, demonstrating exceptional performance. However, this approach requires tedious parameter tuning for each primitive, which is also its application limitation. For example, Schnabel et al. [11] presented a robust framework based on RANSAC in their seminal paper, which effectively detects various geometric primitives in dense point clouds. Building on this, Li [17] introduced a post-optimization step, refining the relationship between the extracted primitives. Meanwhile, Wu et al. [9] and Du et al. [22] explored the application of RANSAC methods for reverse engineering of CSG models using input point clouds or mesh structures. Romanengo et al. [23] proposed a solution based on the Hough Transform that can identify simple geometric primitives and exhibits a certain degree of robustness to noise. Although these methods yield significant results in their application domains, their performance is often constrained by intricate parameter tuning. Moreover, the normal information generated during the point cloud processing is crucial for primitive fitting, yet it is not easily attainable in 3D scanning. In contrast, our deep-learning network requires no additional information during the testing phase and solely relies on the coordinates of the point cloud data to function.
In terms of approaches based on neural networks, Tulsiani et al. [24] and Zou et al. [25] were among the first to introduce neural networks for deconstructing and decomposing shapes. They proposed a network architecture that assembles complex geometric shapes using simple 3D voxels. Subsequent researchers have extended this idea. For instance, Sun et al. [26] constructed a network architecture for predicting cuboid parameters to abstract the structure of 3D objects. Smirnov et al. [27] introduced a framework for predicting parametrized shape primitives through deep learning, which utilizes distance fields to perform transformations between pixel grids and input data, applicable to 2D and 3D tasks. Lin et al. [28] proposed a method for editing primitive adaptations to target shapes based on reinforcement learning. Subsequent research focused on fitting various types of primitives, such as solutions based on super quadric geometric primitives studied by Paschalidou et al. [29]. Gadelha et al. [30] developed a generative model that establishes compact 3D expressions by predicting parameters for cubical and spherical meshes. Additionally, networks for learning 3D representations of convex shape sets through recursive spatial subdivision have been proposed by Chen et al. [31] and Deng et al. [32]. Local shape elements expressed using implicit functions have been explored by Genova et al. [33,34]. However, the types of primitives used in these studies are fairly limited, with most work focusing on abstracting input shapes using coarse prototypes. Some research has considered fitting multiple primitive types, such as a network proposed by Gopal et al. [35] that predicts Constructive Solid Geometry (CSG) structures from complex primitive geometric shapes, though its accuracy is limited by the definition within low-resolution voxel grids. Li et al. [36] proposed a geometric primitive-fitting network based on surface and edge detection, with an emphasis on learning edge and geometric features. However, the network may fail when edges of adjacent primitives are missing. Li et al. [5] presented the SPFN framework for more precise fitting of various primitives including planes, spheres, cylinders, and cones; Sharma et al. [37] further extended this method to adapt to B-spline surface stitching; Saporta [38] undertook unsupervised recursive structural modification of this framework; while Lê et al. [7] applied a cascaded structure combining local and global features to SPFN. Although these methods are effective, most of them use the ready-made network architecture PointNet++ [6] to encode input point clouds and are thus subject to constraints of the point cloud scale (such as processing at most an 8 k magnitude of points) and hindered by the limitations of point cloud feature extraction capabilities. Our proposed method innovatively provides a new network architecture composed of a preprocessing module, feature extraction module, classification segmentation module, and parameter prediction module.

3. Method

Our network architecture predominantly consists of four parts: the preprocessing module (Section 3.1), feature extraction network (Section 3.2), classification and segmentation module (Section 3.3), and parameter prediction module (Section 3.4).

3.1. Preprocessing Module

The preprocessing module plays an indispensable role in the construction of deep-learning architectures. As shown in Figure 2, the core function of this module lies in optimizing the computational efficiency and overall performance of the network.
In the practical application of deep learning, particularly when dealing with complex 3D point cloud data, one common challenge we encounter is data noise. We first perform statistical filtering on the point set $\{x_1, x_2, \ldots, x_n\}$, calculating the average distance $\{d_1, d_2, \ldots, d_n\}$ of each point to its K nearest neighbors. Points satisfying $d_i > \mu + c\sigma$ are removed, resulting in the subset $\{x_1, x_2, \ldots, x_m\}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the average distances of the point set, respectively, and $c$ is the standard deviation multiplier. Then, for each point $P_i \in \{x_1, x_2, \ldots, x_m\}$, Gaussian filtering is applied to reduce positional perturbations caused by noise with the following equation:
$$P_i' = \frac{\sum_{j=1}^{n} w_{ij} P_j}{\sum_{j=1}^{n} w_{ij}}$$
where $w_{ij} = e^{-d_{ij}^2 / (2\sigma^2)}$ represents the Gaussian weight of point $P_j$ relative to point $P_i$, and $d_{ij}$ represents the distance between points $P_i$ and $P_j$.
High-resolution three-dimensional point cloud data typically exhibit an extremely high data density, containing hundreds of thousands or even millions of data points. The high density of data points exerts significant pressure on computational resources and increases the complexity of processing. Therefore, it is essential to significantly alleviate computational and storage burden by reducing the total number of data points during the preprocessing stage while maintaining the core structural characteristics of the data. Consequently, our network automatically performs down-sampling on point clouds with a number of points exceeding 10k within the preprocessing module. Figure 3 is the visualization of the point cloud after being processed by the preprocessing module.
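To make the preprocessing pipeline concrete, the following is a minimal NumPy/SciPy sketch of the three stages described above (statistical outlier removal, Gaussian smoothing, and down-sampling). The function name, neighborhood size, and smoothing bandwidth are illustrative assumptions rather than the exact settings used in our implementation.

```python
# Illustrative sketch of the preprocessing module; k, c, sigma, and max_points
# are assumed values, not the exact configuration used in the paper.
import numpy as np
from scipy.spatial import cKDTree

def preprocess(points, k=16, c=2.0, sigma=0.05, max_points=10_000):
    # 1. Statistical filtering: remove points whose mean distance to their
    #    K nearest neighbors exceeds mu + c * sigma of those mean distances.
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)          # column 0 is the point itself
    mean_d = dists[:, 1:].mean(axis=1)
    points = points[mean_d <= mean_d.mean() + c * mean_d.std()]

    # 2. Gaussian smoothing: P_i' = sum_j w_ij P_j / sum_j w_ij over the
    #    K nearest neighbors, with w_ij = exp(-d_ij^2 / (2 sigma^2)).
    tree = cKDTree(points)
    d, idx = tree.query(points, k=k)
    w = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    points = (w[..., None] * points[idx]).sum(axis=1) / w.sum(axis=1, keepdims=True)

    # 3. Down-sample clouds that exceed the 10k-point budget.
    if points.shape[0] > max_points:
        points = points[np.random.choice(points.shape[0], max_points, replace=False)]
    return points
```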

3.2. Feature Extraction Network (FEN)

Before delving into the role of the feature extraction network, it is necessary to clarify its core objectives and working principles. The primary task of the feature extraction network is to handle point cloud data containing multi-dimensional features, distilling key information that aids subsequent tasks. In this section, we will introduce the main components of the feature extraction network. Figure 4 depicts the network structure diagram, which adopts a hierarchical structure similar to that of PointNet++ [6].

3.2.1. Input and Output

The feature extraction network accepts a matrix of size $N \times (d + C)$ as input, which represents a point cloud composed of $N$ points, with each point containing not only $d$-dimensional spatial coordinates but also $C$-dimensional attribute features. After a series of operations performed by the network, it outputs another matrix of dimensions $N' \times (d + C')$. In the output matrix, the number of points is reduced to $N'$, and each point retains its spatial coordinates while also acquiring an updated feature vector of dimension $C'$. In this manner, the network summarizes the local contextual information of each point, culminating in a more advanced representation of features.

3.2.2. Sampling Layer

The input point cloud data are initially processed through a sampling layer, which extracts a representative subset $\{x_1, x_2, \ldots, x_m\}$ from the initial set of points $\{x_1, x_2, \ldots, x_n\}$. The traditional farthest point sampling (FPS) [39] algorithm experiences a linear increase in time complexity with the increase in the number of points, leading to significant time consumption when dealing with large-scale point cloud data. To address this issue, we applied the optimized farthest point sampling (OFPS) [40,41] algorithm, which accelerates the retrieval speed for the farthest points by incorporating a spatial data structure known as the KD-tree. It also significantly reduces the frequency of distance computations through a local search strategy. The core objective is to more effectively select key points from the point cloud that represent the overall distribution of the data, ensuring that the sampled points are evenly distributed in space to achieve high-quality coverage of the entire set of points.
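For reference, the sketch below shows the classical farthest point sampling loop in NumPy; the OFPS variant discussed above additionally maintains a KD-tree and a local search strategy to prune distance updates, which we omit here for brevity. Names and the random seed handling are assumptions.

```python
# Baseline farthest point sampling (FPS); OFPS accelerates the same selection
# rule with a KD-tree and local search, which is not shown in this sketch.
import numpy as np

def farthest_point_sampling(points, m, seed=0):
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(m, dtype=np.int64)
    selected[0] = rng.integers(n)
    # nearest[i] = distance from point i to the closest already-selected sample
    nearest = np.full(n, np.inf)
    for i in range(1, m):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        nearest = np.minimum(nearest, d)
        selected[i] = int(np.argmax(nearest))       # farthest remaining point
    return selected
```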

3.2.3. Grouping Layer

Following the sampling layer, the grouping layer utilizes the set of key points selected by the sampling layer as a set of centroids. Using a spherical neighborhood algorithm centered on these centroids, it interweaves the points from the original set that are relevant to them, thereby forming multiple local point sets. This layer operates on the input point set with dimensions $N \times (d + C)$, in combination with a set of centroid coordinates with dimensions $N' \times d$, ultimately producing a group of point sets with dimensions $N' \times K \times (d + C)$. Each set of data represents a local neighborhood of a centroid, where $K$ is the number of points within the neighborhood. Although the $K$ values can vary across different groups, subsequent network layers can unify these local neighborhood features, with a varying number of points, into fixed-length feature vectors. In convolutional neural networks, the local region of pixels includes those whose array indices are within a certain Manhattan distance. In point sets sampled from metric spaces, the neighborhood of points is defined by the metric distance. Considering that the density of real-world-scanned point cloud data is often uneven, an adaptive neighborhood algorithm is employed. Specifically, the local point cloud density is estimated based on the distances to the K nearest neighbors of each key point, where the distance can be represented as the distance from the point to its K-th nearest neighbor, and then this distance is used as the adaptive radius for that point. In areas of high density, the search radius is smaller, as a few neighbors are sufficient to represent the local features of the region; in sparser areas of the point cloud, the search radius is increased to ensure that a sufficient number of neighboring points are captured, allowing the point cloud to automatically adapt to varying densities.
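The following sketch illustrates the adaptive-radius grouping rule described above: each centroid's search radius is set to the distance of its K-th nearest neighbor, so dense regions use tight neighborhoods and sparse regions use wider ones. The neighborhood sizes are placeholder values.

```python
# Adaptive-radius grouping sketch; k_density and max_group are assumed values.
import numpy as np
from scipy.spatial import cKDTree

def adaptive_grouping(points, centroids, k_density=16, max_group=32):
    tree = cKDTree(points)
    kth_dist, _ = tree.query(centroids, k=k_density)
    radii = kth_dist[:, -1]                          # distance to the K-th neighbor
    groups = []
    for c, r in zip(centroids, radii):
        idx = tree.query_ball_point(c, r)            # ball query with adaptive radius
        groups.append(np.asarray(idx[:max_group], dtype=np.int64))
    return groups, radii
```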

3.2.4. PointConv Layer

Then, the network employs a PointConv [42] layer to emulate the effect of traditional two-dimensional convolution within a point cloud. Unlike two-dimensional convolution, which operates with a fixed kernel sliding over a pixel grid, PointConv functions within the local neighborhood of each point using a set of specific importance weight functions. These weight functions take into account the spatial position of points, existing features, and the local point density to calculate the contribution of each point in the feature-learning process. The following equation is used when calculating each point’s contribution to the feature learning process:
$$F_i = \sum_{x_j \in H(x_i)} W(x_i, x_j)\, F_j\, \frac{1}{\rho(x_j)}\, \Delta V_j$$
where $F_j$ represents the input features of point $x_j$, $F_i$ represents the output features, $W$ is the weight function of the continuous convolution, $\rho(x_j)$ is the density estimate at point $x_j$, $\Delta V_j$ represents the volume element, and $H(x_i)$ denotes the neighborhood set of $x_i$. Through this approach, the integration of PointConv kernels with sampling and grouping can accommodate the non-uniform distribution of point cloud data and learn more enriched feature representations.
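As a rough illustration of the formula above, the PyTorch snippet below aggregates one query point's neighborhood features with learned position-dependent weights and inverse-density scaling. The weight network `weight_mlp` and the final linear projection used in a full PointConv layer are treated as given; this is a simplified sketch, not the layer as implemented in our network.

```python
# Simplified per-point PointConv-style aggregation following
# F_i = sum_j W(x_i, x_j) * F_j * (1 / rho(x_j)); weight_mlp is assumed to map
# relative coordinates (K, 3) to per-channel weights (K, C_in).
import torch

def pointconv_point(query_xyz, neigh_xyz, neigh_feat, neigh_density, weight_mlp):
    rel = neigh_xyz - query_xyz                      # (K, 3) relative positions
    w = weight_mlp(rel)                              # (K, C_in) learned weights
    inv_density = 1.0 / (neigh_density + 1e-8)       # density compensation term
    return (w * neigh_feat * inv_density[:, None]).sum(dim=0)   # (C_in,)
```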

3.3. Segmentation and Classification Module (SCM)

As shown in Figure 5, the network performs segmentation on the point cloud after feature extraction, dividing it into several subsets. It performs classification on each segmented subset in order to determine its most compatible primitive geometric type (Figure 5: plane, sphere, cylinder, or cone). Inspired by the SPFN [5] approach, we compute the Relaxed Intersection over Union (RIoU) for all column pairs in the membership matrices $W$ and $\tilde{W}$. The definition of RIoU is as follows:
$$\mathrm{RIoU}(W, \tilde{W}) = \frac{W \odot \tilde{W}}{W + \tilde{W} - W \odot \tilde{W}}$$
In this context, $W \in \{0,1\}^{N \times K}$ is a predefined ground-truth matrix, and $\tilde{W} \in [0,1]^{N \times K}$ is the prediction matrix; $\odot$ denotes the Hadamard product of matrices; the plus and minus signs represent the element-wise addition and subtraction of matrices, respectively. $N$ is the total number of points, and $K$ is the number of primitives, whose type labels $\{0, 1, 2, 3\}$ represent planes, spheres, cylinders, and cones, respectively.
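A minimal sketch of the per-column RIoU computation is given below; `w` is one binary ground-truth membership column and `w_tilde` the corresponding soft prediction, with a small epsilon added only to guard against empty columns.

```python
# Relaxed IoU between one ground-truth column and one predicted column.
import numpy as np

def relaxed_iou(w, w_tilde, eps=1e-8):
    intersection = np.sum(w * w_tilde)                        # Hadamard product, summed
    union = np.sum(w) + np.sum(w_tilde) - intersection
    return float(intersection / (union + eps))
```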
In the segmentation task, the network assigns a category label to each individual point in the point cloud, which requires the utilization of both local and global features. Local features are passed to the decoder and fused with global features to generate the predicted classification for each point. Here, the decoder comprises two components: interpolation and PointConv [42]. Initially, we employ an interpolation method to propagate the coarse features from the previous layer, interpolating from the features of the three nearest neighboring points through linear interpolation. Subsequently, with the use of skip-links, the interpolated features are combined with features generated by a convolutional layer of matching resolution. The interpolation satisfies the formula:
$$v_q = \frac{\sum_{i=1}^{k} w_i v_i}{\sum_{i=1}^{k} w_i + \varepsilon}$$
where $w_i = \frac{1}{d_i^p}$, with $q$ being any query point, the initial value of $p$ set to 2, and $k$ set to 3; $\varepsilon$ is a small constant to avoid division by zero, $v_i$ denotes the feature vector of the $i$-th point, and $d_i$ denotes the distance from the $i$-th point to the query point $q$. Meanwhile, skip connections update the feature vector of each point through a fully connected layer and a ReLU layer, ensuring the effective transmission of features and non-linear enhancement. After combining the features, we apply the PointConv operation on the mixed features to obtain the final deconvolution output, a process similar to deconvolving layers in image processing. This procedure is repeated until the features of all the input points are propagated back to the original resolution.
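The inverse-distance interpolation step can be sketched as follows; here the three nearest coarse points are found with a KD-tree and their features are blended with weights $w_i = 1/d_i^p$, matching the formula above. Function and variable names are illustrative.

```python
# Feature propagation by inverse-distance interpolation over the 3 nearest
# coarse points (k = 3, p = 2), as in the interpolation formula above.
import numpy as np
from scipy.spatial import cKDTree

def interpolate_features(coarse_xyz, coarse_feat, dense_xyz, k=3, p=2, eps=1e-8):
    d, idx = cKDTree(coarse_xyz).query(dense_xyz, k=k)
    w = 1.0 / (d ** p + eps)                         # w_i = 1 / d_i^p (eps avoids 1/0)
    num = (w[..., None] * coarse_feat[idx]).sum(axis=1)
    return num / (w.sum(axis=1, keepdims=True) + eps)
```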
In classification tasks, the network predicts the category of the primitive type to which the input point cloud belongs. Local features are extracted from points within a neighborhood through multiple small PointConv networks. Then, a max-pooling operation is performed on all points to obtain a global feature representation $F_G$. By feature aggregation, the network is able to synthesize attributes from different spatial positions and thus consolidate them into a global feature representation via global max-pooling. This global feature representation is subsequently fed into a multi-layer perceptron (MLP) to capture the complex relationships between global features and to make classification judgments based on the extracted features.

3.4. Parameter Prediction Model

Building upon the identification of geometric primitives for each segment by the network, as described in the previous section (see Figure 6, where the labels 0, 1, 2, and 3 denote a plane, sphere, cylinder, and cone, respectively), the function of this network component is to predict the precise mathematical parameters corresponding to these primitives.

3.4.1. Plane

The fitted plane can be represented by the following equation:
$$N^T x + d = 0$$
where N is the normal vector of the plane, and d is the distance to the origin. Our goal is to find the vector N and the scalar d that minimize the weighted least squares error function:
$$L(N, d) = \sum_{i=1}^{N} w_i \left( N^T P_i + d \right)^2$$
Here, the error is the distance from a point $P_i$ to the plane, and $w_i$ represents the weight, which indicates greater consideration given to the point when the weight is larger. To solve this problem, we take the partial derivatives with respect to $N$ and $d$, set them to zero, and obtain the linear system of equations:
$$G^T W G H = b_1$$
where $W = \mathrm{diag}(w_1, \ldots, w_N)$ is the diagonal weight matrix, $G$ is the matrix whose $i$-th row is $[P_i^T, 1]$, $H = [N, d]^T$, and $b_1$ is a zero vector. The system of equations is solved by finding the solution $H$ through singular value decomposition (SVD).
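A compact way to realize this weighted plane fit is sketched below: the weighted centroid gives $d$, and the unit normal is the singular vector of the weighted, centered point matrix with the smallest singular value. This is a sketch of the standard weighted total least squares solution, not a verbatim excerpt of our implementation.

```python
# Weighted plane fit: minimize sum_i w_i (N^T P_i + d)^2 subject to ||N|| = 1.
import numpy as np

def fit_plane(points, weights):
    w = weights / weights.sum()
    centroid = (w[:, None] * points).sum(axis=0)     # weighted centroid lies on the plane
    A = (points - centroid) * np.sqrt(w)[:, None]    # weighted, centered coordinates
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    normal = vt[-1]                                  # direction of least weighted spread
    d = -float(normal @ centroid)
    return normal, d
```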

3.4.2. Sphere

The fitted sphere can be described by the following equation:
$$F_{\mathrm{sphere}} = (c_1, r)$$
where $c_1 = (c_x, c_y, c_z)$ is the center of the sphere, and $r$ is the radius of the sphere. Given a set of points $P$ and corresponding weights $w$, the optimal sphere is found by minimizing the sum of squared weighted distances, given by the following equation:
$$L_2(w, c_1, r) = \sum_{i=1}^{N} w_i \left( \lVert P_i - c_1 \rVert^2 - r^2 \right)^2$$
Setting $\frac{\partial L_2}{\partial (r^2)} = 0$ yields $r^2 = \frac{\sum_{i=1}^{N} w_i \lVert P_i - c_1 \rVert^2}{\sum_{i=1}^{N} w_i}$. Substituting $r^2$ into the equation above yields the following equation in terms of $c_1$:
$$F(w, c_1) = \left\lVert \mathrm{diag}(w)^{\frac{1}{2}} \left( X c_1 - y \right) \right\rVert^2$$
Let the weighted centroid of all points be $\bar{P} = \frac{\sum_{i=1}^{N} w_i P_i}{\sum_{i=1}^{N} w_i}$, where the $i$-th row of the matrix $X$ is $X_i = 2(P_i - \bar{P})^T$, and the $i$-th element of the vector $y$ is $y_i = P_i^T P_i - \frac{\sum_{j=1}^{N} w_j P_j^T P_j}{\sum_{j=1}^{N} w_j}$. Construct the coefficient matrix of the normal equation, $B = X^T X$, and the constant term, $b_2 = X^T y$. Use Cholesky decomposition to solve the normal equation $B c_1 = b_2$ to find the sphere's center $c_1$. Once we have the center $c_1$, we can calculate the square of the radius: $r^2 = \frac{\sum_{i=1}^{N} w_i \lVert P_i - c_1 \rVert^2}{\sum_{i=1}^{N} w_i}$.
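The sphere fit can be sketched as below; this version folds the per-point weights into the normal equations (i.e., $B = X^T \mathrm{diag}(w) X$ and $b_2 = X^T \mathrm{diag}(w) y$), solves for the center with a Cholesky factorization, and then recovers the radius from the weighted mean squared distance. It is an illustrative sketch under those assumptions.

```python
# Weighted sphere fit: eliminate r^2, solve B c = b for the center, recover r.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fit_sphere(points, weights):
    w = weights / weights.sum()
    p_bar = (w[:, None] * points).sum(axis=0)        # weighted centroid P_bar
    sq = (points ** 2).sum(axis=1)                   # |P_i|^2
    X = 2.0 * (points - p_bar)                       # rows X_i = 2 (P_i - P_bar)
    y = sq - (w * sq).sum()                          # y_i = |P_i|^2 - weighted mean
    Xw = X * w[:, None]                              # diag(w) X
    B, b = X.T @ Xw, Xw.T @ y                        # weighted normal equations
    center = cho_solve(cho_factor(B), b)             # Cholesky solve for the center
    r2 = (w * ((points - center) ** 2).sum(axis=1)).sum()
    return center, float(np.sqrt(r2))
```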

3.4.3. Cylinder

The parametric equation representing the fitted cylinder can be expressed as follows:
$$F_{\mathrm{cylinder}} = (\eta, c_2, r)$$
where the vector $\eta \in \mathbb{R}^3$ denotes the unit vector in the direction of the cylinder's axis, the coordinate $c_2 \in \mathbb{R}^3$ represents the center point of the cylinder, and $r \in \mathbb{R}$ is the radius of the cylinder. For a point $P$ in space, we can compute the square of its distance to the cylinder's surface, which is given by the following expression:
$$D_{\mathrm{cylinder}}^2 = \left( \sqrt{v_1^T v_1 - \left( \eta^T v_1 \right)^2} - r \right)^2$$
where $v_1 = P - c_2$ represents the position vector of point $P$ relative to the cylinder's center $c_2$.
However, it is not straightforward to directly calculate this distance, as it involves non-linear operations. Therefore, to simplify the problem, we usually address the cylinder-fitting problem in two steps. First, we determine the cylinder's axis vector. This can be achieved by finding a common direction among the normals of the data points, since the normal $N_i$ at each data point is perpendicular to the axis vector of the cylinder. We therefore look for a vector $\eta$ such that the dot products of the point normals with this vector are as small as possible:
$$\xi_{\mathrm{cylinder}} = \left\lVert \mathrm{diag}(w)^{\frac{1}{2}} N \eta \right\rVert^2$$
Then, once the axis vector $\eta$ has been identified, we can determine the cylinder's center point $c_2$ and radius $r$. We assume that all data points form a circular shape when projected onto the plane perpendicular to the axis of the cylinder. We fit a circle through these projected points, a step very similar to fitting a sphere, where the objective is to find the optimal center and radius that best approximate the data points.
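The two-step cylinder procedure can be sketched as follows: the axis is taken as the right singular vector of the weighted normal matrix with the smallest singular value, after which the points can be projected onto the plane perpendicular to the axis for the circle fit. The basis construction and names are illustrative assumptions.

```python
# Cylinder fitting, step 1: axis direction most orthogonal to the point normals;
# step 2 (not shown): fit a circle to the points projected onto the axis plane.
import numpy as np

def fit_cylinder_axis(normals, weights):
    A = normals * np.sqrt(weights)[:, None]          # diag(w)^(1/2) N
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    return vt[-1]                                     # eta minimizing ||diag(w)^(1/2) N eta||

def project_to_axis_plane(points, eta):
    # Orthonormal basis (u, v) of the plane perpendicular to the axis eta.
    u = np.cross(eta, [1.0, 0.0, 0.0])
    if np.linalg.norm(u) < 1e-6:                      # eta nearly parallel to the x-axis
        u = np.cross(eta, [0.0, 1.0, 0.0])
    u /= np.linalg.norm(u)
    v = np.cross(eta, u)
    return points @ np.stack([u, v], axis=1)          # (N, 2) projected coordinates
```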

3.4.4. Cone

The cone can be parameterized as follows:
$$F_{\mathrm{cone}} = (\eta, c_3, \theta)$$
where $\eta$ is a unit vector denoting the axis direction emanating from the vertex $c_3$ of the cone, and $\theta \in \left(0, \frac{\pi}{2}\right)$ is the semi-angle of the cone. The square of the distance to the cone's surface is given by the following equation:
$$D_{\mathrm{cone}}^2 = \left( \lVert v_2 \rVert \, \sin\!\left( \min\left( \lvert \gamma - \theta \rvert, \tfrac{\pi}{2} \right) \right) \right)^2$$
where $v_2 = P - c_3$ and $\gamma = \arccos\left( \frac{\eta^T v_2}{\lVert v_2 \rVert} \right)$ is the angle between the cone's axis and the vector from the vertex to $P$. When solving the computational problem of approximating the cone's vertex and axis, we first independently estimate $\eta$ and $c_3$ and then estimate the semi-angle $\theta$. Since the tangential planes on the cone's surface all intersect at the vertex $c_3$, the multi-plane intersection problem can be expressed as a least squares problem using the predicted normal vectors $N$:
$$\xi_{\mathrm{cone}}(c_3, N) = \left\lVert \mathrm{diag}(w)^{\frac{1}{2}} \left( N c_3 - y \right) \right\rVert^2$$
where $y_i = N_i^T P_i$. Finally, with the vertex $c_3$ and the axis vector $\eta$, the semi-angle $\theta$ can simply be calculated as a weighted average: $\theta = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \arccos\left( \frac{\eta^T (P_i - c_3)}{\lVert P_i - c_3 \rVert} \right)$.
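Putting the cone steps together, the sketch below estimates the apex as the weighted least squares intersection of the tangent planes $N_i^T x = N_i^T P_i$ and then averages the axis-to-point angles to obtain the half-angle. The absolute value on the cosine is a safeguard against inconsistently oriented normals and is an assumption of this sketch.

```python
# Cone fitting sketch: apex from tangent-plane intersection, half-angle as a
# weighted mean of angles between the axis and apex-to-point directions.
import numpy as np

def fit_cone_apex_and_angle(points, normals, weights, axis):
    sw = np.sqrt(weights)
    A = normals * sw[:, None]                         # diag(w)^(1/2) N
    b = (normals * points).sum(axis=1) * sw           # weighted y_i = N_i^T P_i
    apex, *_ = np.linalg.lstsq(A, b, rcond=None)      # least squares apex c_3
    v = points - apex
    v_norm = np.linalg.norm(v, axis=1) + 1e-8
    cos_a = np.clip(np.abs(v @ axis) / v_norm, 0.0, 1.0)
    theta = float((weights * np.arccos(cos_a)).sum() / weights.sum())
    return apex, theta
```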

3.5. Evaluation Metrics

We employ the following metrics [4] to evaluate our network:
1. Segmentation mean intersection over union (IoU)
This metric is used to measure the similarity between the predicted segmentation and the true segmentation. It is defined as follows:
$$S = \frac{1}{K} \sum_{i=1}^{K} \mathrm{IoU}\left( \tilde{W}_{:,i},\, h\left( W_{:,i} \right) \right)$$
where h denotes the conversion to a one-hot vector representation, K is the number of true segments, and for details on IoU, please refer to Section 3.3.
2. Mean primitive type accuracy
This metric is used to measure the average accuracy of predicted primitives. It is defined as follows:
$$T = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}\left( t_i = \tilde{t}_i \right)$$
where 1 is the indicator function, and t i and t ~ i are the true and predicted primitive types for the i-th segment of the point cloud, respectively.
3. Mean point normal difference
This metric is used to measure the discrepancy between predicted normal vectors and true normal vectors, defined as follows:
$$\mathcal{N} = \frac{1}{N} \sum_{i=1}^{N} \arccos\left( N_i^T \tilde{N}_i \right)$$
where N i and N ~ i are the true and predicted normal vectors for the i-th point, respectively.
4. Mean primitive axis difference
This metric is utilized to compare the difference between the predicted primitive axes and the true axes, with lower values indicating higher precision in orientation prediction. It is calculated as follows:
$$A = \frac{1}{\sum_{i=1}^{K} \mathbb{1}\left( t_i = \tilde{t}_i \right)} \sum_{i=1}^{K} \mathbb{1}\left( t_i = \tilde{t}_i \right) \arccos\left( \cos \Theta\left( \eta_i, \tilde{\eta}_i \right) \right)$$
where $\eta_i$ and $\tilde{\eta}_i$ are the true and predicted principal axes of the $i$-th primitive, respectively, and $\cos(\Theta)$ represents the cosine of the angle between the two axes.
5. Mean residual
The average distance residual of actual data points to the surface of the predicted primitive. It is computed as follows:
$$R = \frac{1}{K} \sum_{i=1}^{K} \mathbb{E}_{p \sim U(S_i)}\left[ D\left( p, \tilde{A}_i \right) \right]$$
where $\mathbb{E}_{p \sim U(S_i)}$ is the expectation over points $p$ sampled uniformly from the surface $S_i$ of the true primitive, and $D(p, \tilde{A}_i)$ is the distance from $p$ to the predicted primitive surface $\tilde{A}_i$.
6. P coverage
This metric represents the proportion of points in the point cloud that are covered by the predicted geometric primitives. It is defined as follows:
$$C = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left( \min_{j=1,\ldots,K} D\left( P_i, \tilde{A}_j \right) < \varepsilon \right)$$
where $P_i$ is the $i$-th point in the point cloud, $\tilde{A}_j$ denotes the surface of the $j$-th predicted primitive, $D(P_i, \tilde{A}_j)$ is the squared Euclidean distance from $P_i$ to $\tilde{A}_j$, $\varepsilon$ is a predefined threshold, and $\mathbb{I}$ is the indicator function, which takes the value 1 if the condition inside the parentheses is true and 0 otherwise.
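As an example of how these metrics can be evaluated, the sketch below computes the P-coverage score: the fraction of input points whose distance to the nearest predicted primitive is below the threshold $\varepsilon$. The `distance_to` callable is a placeholder for a per-primitive point-to-surface distance function.

```python
# P-coverage sketch; distance_to(points, primitive) -> (N,) distances is assumed.
import numpy as np

def p_coverage(points, primitives, distance_to, eps=0.01):
    # D[i, j]: distance from point i to predicted primitive j
    D = np.stack([distance_to(points, prim) for prim in primitives], axis=1)
    return float((D.min(axis=1) < eps).mean())
```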

4. Experiments

We utilized CAD models of mechanical parts that conform to the American National Standards Institute (ANSI) standards, provided by Traceparts [5], to train and assess the network we proposed. The training dataset contains 13,831 models, while the validation and test datasets each contain 3366 models, with each model consisting of 131,072 points. The training process was conducted on a computer equipped with an Intel Xeon(R) Silver 4210R CPU and an NVIDIA GeForce RTX 3090 GPU. During the training process, the initial learning rate is set to 0.001, with a decay rate of 0.7, and the learning rate is decayed every 200,000 steps. The batch size is 32, and the total number of training epochs is 300. The weights for the loss functions, including MIoU loss, normal loss, type loss, parameter loss, residue loss, and total loss, are all set to 1.0 to ensure that all loss components have an equal impact on the model. Meanwhile, the weight for the orthogonality loss is set to 0.1 to enhance the geometric properties of the model. Validation is performed every 2000 steps, and model snapshots are saved every 500 steps. The supported primitives include spheres, planes, cylinders, and cones. To ensure the rigor of our experiments, all models in the dataset underwent normalization such that their centroids were positioned at the origin of the coordinate system. In an effort to examine and enhance the performance of the network in noisy environments and improve the robustness of the algorithm, we simulated noise that might be encountered when receiving data from real sensors by adding Gaussian noise with standard deviations of 5%, 10%, and 15% to the original data, which is displayed in Figure 7 as an example. With this approach, our aim is to bring the development and testing environments as close as possible to real-world application scenarios, thereby increasing the practical value of our research.
We compared the performance of the method proposed in this study with the efficient RANSAC algorithm [11] in terms of primitive fitting, where RANSAC used the default adaptive algorithm parameters. Table 1 details the experimental results of our network versus efficient RANSAC.
Observations from the data reveal that compared to the RANSAC algorithm, our network achieved a significant improvement in mean intersection over union, increasing from 55.81% to 87.30%. This leap markedly demonstrates the advantage of our network in point cloud segmentation precision. In terms of geometric primitive type recognition, the accuracy of RANSAC was 63.03%, while our network significantly increased it to 95.62%, substantially enhancing the ability to recognize complex geometric shapes. For the estimation error of point normals and primitive axes, our method achieved 7.68 and 1.62, respectively, which were clearly superior to RANSAC’s 11.36 and 4.98, reflecting our algorithm’s greater precision in geometric feature estimation. Furthermore, the residual decreased to 0.002 and coverage increased to 91.76%, indicating that our method has a more advantageous understanding and representation of 3D point cloud data in terms of accuracy and comprehensiveness. Furthermore, Figure 8 demonstrates that our method generates segmentation results and models of exceptional quality that closely mirror the ground truth, outperforming all preceding techniques.
In comparison with the SPFN method, our network increased the Mean IoU from 75.66% to 87.30%, further validating our method's superiority in maintaining high segmentation quality. In terms of accuracy for primitive-type recognition, our method also improved from SPFN's 88.73% to 95.62%, significantly strengthening the classification performance for different 3D geometric primitives. Although SPFN had already achieved low primitive axis and residual errors (1.67 and 0.004, respectively), our network still accomplished further optimization in these metrics, reducing them to 1.62 and 0.002, showcasing an even more outstanding performance. The enhancement in point cloud coverage also highlights the comprehensiveness of our algorithm in integrated geometric information processing. When compared with the CPFN method, our network demonstrated a significant advantage in accurately segmenting point cloud data, with an increase in the mean intersection over union (IoU) from 78.15% to 87.30%. In terms of primitive-type recognition, the accuracy of our network improved from 93.56% to 95.62%, which underscores the network's precision in identifying a diverse range of geometric primitives. Even though the numerical results are close to those of CPFN in point normal and primitive axis estimation, our network still exhibited further optimization by reducing the error to a very low level. By decreasing the residual to only 0.002 and improving the point cloud coverage to 91.76%, our network has effectively proven its exceptional capability in capturing the complexity of point cloud data in great detail.
Figure 9 displays the fitting results of the neural network’s predictions for the geometric primitive parameters applied to CAD models. The input point cloud is displayed at the top of the image. The middle row shows the color-coded representation of the point cloud after our network predicts the point-to-primitive affiliations, where points of the same color represent classification to the same geometric primitive. The bottom line presents three-dimensional primitive graphic meshes assembled by a sequence of CSG [35] operations, which are based on parameters predicted by our network.
These CSG operations, such as union, intersection, and subtraction, are implemented with user assistance. In this study, we treat the CSG sequence as cases of multiple primitives overlapping, and it becomes necessary to determine the appropriate sequence of CSG operations. As the number of overlapping primitives increases, the number of potential CSG sequences escalates exponentially. To circumvent the exponential complexity of searching for the correct sequence of CSG operations, we allow users to manually determine the suitable operation between primitives, be they union, intersection, or subtraction. Therefore, in situations where multiple primitives intersect, users are required to manually select the primitives and assign the appropriate CSG operations, such as union, intersection, or subtraction.
Additionally, we conducted an experiment to compare the execution times of the four aforementioned methods using point cloud data with a size of 131,072. The traditional RANSAC algorithm, using default parameters, had an execution time of 0.542 s on a CPU. For machine-learning-based methods, we evaluated pre-trained models without including training time (which takes several days). The execution times on a GPU were as follows: 0.035 s for SPFN, 0.039 s for CPFN, and 0.058 s for our method. These results indicate that our proposed method also demonstrates competitive execution times, enabling the rapid generation of experimental results.
To corroborate the precision of our network in predicting geometric primitive parameters, we utilized the point cloud library (PCL) [43] to efficiently generate three-dimensional point cloud data for four typical regular surfaces commonly found in industrial products, such as planes, spheres, cylinders, and cones. The dataset for each type of surface contained 60,000 points, and we injected different levels of Gaussian noise into all of the three-dimensional point cloud data to mimic the characteristics of point cloud data obtained from laser scanning in the real world.
In the case of planar entities, we generated 1000 3D plane point cloud datasets that satisfy the normal vector $n = (1, 0, 0)$ and a distance $d = 1$ from the origin. Table 2 shows the predicted results of the primitive parameters after adding different noise levels (no noise, 5% noise, 10% noise, 15% noise) and the absolute error compared to the theoretical ground truth.
$$\text{Absolute Error} = \sum_{i=1}^{N} \left\lvert \lambda_i - \tilde{\lambda}_i \right\rvert$$
where $\lambda_i$ represents the ground-truth parameter, $\tilde{\lambda}_i$ represents the predicted parameter, and $N$ is the total number of parameters. From the table above, it can be observed that even after the introduction of noise, our network still retains the ability to predict the geometric parameters of the surfaces with a high degree of accuracy, demonstrating its robustness in handling noisy point cloud data.
For spherical entities, we created 1000 point cloud datasets of spheres centered at $(0, 0, 0)$, with a radius of 6, incorporating varying degrees of Gaussian noise. Table 3 presents the predicted spherical parameters at different noise levels (no noise, 5%, 10%, and 15%) and their absolute errors relative to the theoretical ground truth. The data indicate that, despite noise interference, our network retains proficiency in parameter estimation for spheres. Remarkably, with 15% noise, absolute errors remain well contained, validating our network's precision and resilience in handling noisy point cloud data.
For cylindrical entities, we generated 1000 cylindrical point cloud datasets with an axial vector $n = (0, 0, 1)$, a center point $c = (0, 0, 0)$, and a radius of 6. Table 4 presents the predicted cylindrical parameters with various levels of noise (no noise, 5%, 10%, 15%) and the corresponding absolute errors relative to the theoretical ground truth. The results, as evidenced by the table, show that our network proficiently predicts the parameters of cylinders, despite the presence of noise interference.
In the case of conical entities, we generated 1000 conical point cloud datasets with the apex $c = (0, 0, 0)$, axial vector $n = (0, 0, 1)$, and cone half-angle $\theta = \frac{\pi}{4}$. Table 5 shows the predicted results of the conical parameters after adding different levels of noise (no noise, 5% noise, 10% noise, 15% noise), along with the absolute error, compared to the theoretical ground truth. It can be observed that despite the presence of noise interference, our network still demonstrates good performance in predicting the parameters of conical surfaces.

5. Conclusions

In the field of computer vision, point cloud segmentation and geometric primitive recognition are fundamental and crucial upstream tasks. This study is dedicated to proposing an innovative neural network architecture that exhibits outstanding performance in the precise segmentation and robust recognition of geometric primitives for 3D point cloud data. A series of stringent experiments conducted have verified the excellence in accuracy of our proposed neural network over existing technologies when addressing issues of point cloud segmentation and geometric primitive recognition. Specifically, our network has achieved significant improvements in the accuracy of predicting parameters of geometric primitives, which will offer a new alternative approach in applications related to 3D geometric modeling and object recognition. In addition, given the high potential and versatility displayed by this network in complex computer vision tasks, it not only enhances the insight of computer vision systems into three-dimensional space but also offers powerful tools for various fields to more effectively parse and utilize 3D spatial data. Focusing on improving the precision of point cloud segmentation not only directly affects the accuracy of geometric primitive parameter estimation but is also a prerequisite for understanding complex 3D structures. Looking ahead, we plan to further evaluate our methods using real-world data collected by actual 3D sensors. In addition, beyond planes, cylinders, spheres, and cones, we will also explore extending our current approaches to more generic shapes and primitives.

Author Contributions

Conceptualization, H.W.; methodology, H.W.; software, H.W. and F.Z.; validation, H.W. and F.Z.; data curation, H.W. and F.Z.; writing—original draft, H.W.; writing—review and editing, H.W. and F.Z.; visualization, H.W. and F.Z.; supervision, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 52105291).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Morteza, D.; Ahmed, H.; Egils, A.; Fatemeh, N.; Fatih, A.; Hasan, S.A.; Jelena, G.; Rain, E.H.; Cagri, O.; Gholamreza, A. 3D scanning: A comprehensive survey. arXiv 2018, arXiv:1801.08863. [Google Scholar] [CrossRef]
  2. Fayolle, P.-A.; Friedrich, M. A Survey of Methods for Converting Unstructured Data to CSG Models. Comput. Aided Des. 2024, 168, 103655. [Google Scholar] [CrossRef]
  3. Sergiyenko, O.; Alaniz-Plata, R.; Flores-Fuentes, W.; Rodríguez-Quiñonez, J.C.; Miranda-Vega, J.E.; Sepulveda-Valdez, C.; Núñez-López, J.A.; Kolendovska, M.; Kartashov, V.; Tyrsa, V. Multi-view 3D data fusion and patching to reduce Shannon entropy in Robotic Vision. Opt. Lasers Eng. 2024, 177, 108132. [Google Scholar] [CrossRef]
  4. Chum, O.; Matas, J. Optimal randomized RANSAC. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1472–1482. [Google Scholar] [CrossRef] [PubMed]
  5. Li, L.; Sung, M.; Dubrovina, A.; Yi, L.; Guibas, L.J. Supervised fitting of geometric primitives to 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Patten Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2652–2660. [Google Scholar]
  6. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  7. Lê, E.-T.; Sung, M.; Ceylan, D.; Mech, R.; Boubekeur, T.; Mitra, N.J. CPFN: Cascaded primitive fitting networks for high-resolution point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2022; pp. 7457–7466. [Google Scholar]
  8. Kaiser, A.; Ybanez Zepeda, J.A.; Boubekeur, T. A survey of simple geometric primitives detection methods for captured 3D data. Comput. Graph. Forum 2019, 38, 167–196. [Google Scholar] [CrossRef]
  9. Wu, Q.; Xu, K.; Wang, J. Constructing 3D CSG models from 3D raw point clouds. Comput. Graph. Forum 2018, 37, 221–232. [Google Scholar] [CrossRef]
  10. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  11. Schnabel, R.; Wahl, R.; Klein, R. Efficient RANSAC for point-cloud shape detection. Comput. Graph. Forum 2007, 26, 214–226. [Google Scholar] [CrossRef]
  12. Carr, P.; Sheikh, Y.; Matthews, I. Monocular object detection using 3d geometric primitives. In Proceedings of the 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part I 12, 2012. Springer: Berlin/Heidelberg, Germany, 2012; pp. 864–878. [Google Scholar] [CrossRef]
  13. Rabbani, T.; Dijkman, S.; van den Heuvel, F.; Vosselman, G. An integrated approach for modelling and global registration of point clouds. ISPRS J. Photogramm. Remote Sens. 2007, 61, 355–370. [Google Scholar] [CrossRef]
  14. Yan, D.-M.; Wang, W.; Liu, Y.; Yang, Z. Variational mesh segmentation via quadric surface fitting. Comput. Aided Des. 2012, 44, 1072–1082. [Google Scholar] [CrossRef]
  15. Lafarge, F.; Mallet, C. Creating large-scale city models from 3D-point clouds: A robust approach with hybrid representation. Int. J. Comput. Vis. 2012, 99, 69–85. [Google Scholar] [CrossRef]
  16. Holz, D.; Holzer, S.; Rusu, R.B.; Behnke, S. Real-time plane segmentation using RGB-D cameras. In RoboCup 2011: Robot Soccer World Cup XV 15; Springer: Berlin/Heidelberg, Germany, 2012; pp. 306–317. [Google Scholar] [CrossRef]
  17. Li, Y.; Wu, X.; Chrysathou, Y.; Sharf, A.; Cohen-Or, D.; Mitra, N.J. Globfit: Consistently fitting primitives by discovering global relations. In ACM SIGGRAPH 2011 Papers; ACM: New York, NY, USA, 2011; pp. 1–12. [Google Scholar] [CrossRef]
  18. Matas, J.; Chum, O. Randomized RANSAC with Td, d test. Image Vis. Comput. 2004, 22, 837–842. [Google Scholar] [CrossRef]
  19. Kang, Z.; Li, Z. Primitive fitting based on the efficient multibaysac algorithm. PLoS ONE 2015, 10, e0117341. [Google Scholar] [CrossRef] [PubMed]
  20. Torr, P.H.; Zisserman, A. MLESAC: A new robust estimator with application to estimating image geometry. Comput. Vis. Image Underst. 2000, 78, 138–156. [Google Scholar] [CrossRef]
  21. Chum, O.; Matas, J. Matching with PROSAC-progressive sample consensus. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 220–226. [Google Scholar] [CrossRef]
  22. Du, T.; Inala, J.P.; Pu, Y.; Spielberg, A.; Schulz, A.; Rus, D.; Solar-Lezama, A.; Matusik, W. Inversecsg: Automatic conversion of 3D models to csg trees. ACM Trans. Graph. (TOG) 2018, 37, 1–16. [Google Scholar] [CrossRef]
  23. Romanengo, C.; Raffo, A.; Biasotti, S.; Falcidieno, B. Recognizing geometric primitives in 3D point clouds of mechanical CAD objects. Comput. Aided Des. 2023, 157, 103479. [Google Scholar] [CrossRef]
  24. Tulsiani, S.; Su, H.; Guibas, L.J.; Efros, A.A.; Malik, J. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2635–2643. [Google Scholar]
  25. Zou, C.; Yumer, E.; Yang, J.; Ceylan, D.; Hoiem, D. 3d-prnn: Generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 900–909. [Google Scholar]
  26. Sun, C.-Y.; Zou, Q.-F.; Tong, X.; Liu, Y. Learning adaptive hierarchical cuboid abstractions of 3d shape collections. ACM Trans. Graph. 2019, 38, 1–13. [Google Scholar] [CrossRef]
  27. Smirnov, D.; Fisher, M.; Kim, V.G.; Zhang, R.; Solomon, J. Deep parametric shape predictions using distance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 561–570. [Google Scholar]
  28. Lin, C.; Fan, T.; Wang, W.; Nießner, M. Modeling 3d shapes by reinforcement learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 545–561. [Google Scholar]
  29. Paschalidou, D.; Ulusoy, A.O.; Geiger, A. Superquadrics revisited: Learning 3d shape parsing beyond cuboids. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10344–10353. [Google Scholar]
  30. Gadelha, M.; Gori, G.; Ceylan, D.; Mech, R.; Carr, N.; Boubekeur, T.; Wang, R.; Maji, S. Learning generative models of shape handles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 402–411. [Google Scholar]
  31. Chen, Z.; Tagliasacchi, A.; Zhang, H. Bsp-net: Generating compact meshes via binary space partitioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 45–54. [Google Scholar]
  32. Deng, B.; Genova, K.; Yazdani, S.; Bouaziz, S.; Hinton, G.; Tagliasacchi, A. Cvxnet: Learnable convex decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 31–44. [Google Scholar]
  33. Genova, K.; Cole, F.; Vlasic, D.; Sarna, A.; Freeman, W.T.; Funkhouser, T. Learning shape templates with structured implicit functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7154–7164. [Google Scholar]
  34. Genova, K.; Cole, F.; Sud, A.; Sarna, A.; Funkhouser, T. Local deep implicit functions for 3d shape. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4857–4866. [Google Scholar]
  35. Sharma, G.; Goyal, R.; Liu, D.; Kalogerakis, E.; Maji, S. Neural Shape Parsers for Constructive Solid Geometry. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2628–2640. [Google Scholar] [CrossRef]
  36. Li, Y.; Liu, S.; Yang, X.; Guo, J.; Guo, J.; Guo, Y. Surface and Edge Detection for Primitive Fitting of Point Clouds. In Proceedings of the ACM SIGGRAPH 2023 Conference Proceedings, Los Angeles, CA, USA, 6–10 August 2023; Association for Computing Machinery: New York, NY, USA, 2023; p. 44. [Google Scholar] [CrossRef]
  37. Sharma, G.; Liu, D.; Maji, S.; Kalogerakis, E.; Chaudhuri, S.; Měch, R. Parsenet: A parametric surface fitting network for 3D point clouds. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 261–276. [Google Scholar] [CrossRef]
  38. Saporta, T.; Sharf, A. Unsupervised recursive deep fitting of 3D primitives to points. Comput. Graph. 2022, 102, 289–299. [Google Scholar] [CrossRef]
  39. Eldar, Y.; Lindenbaum, M.; Porat, M.; Zeevi, Y.Y. The farthest point strategy for progressive image sampling. IEEE Trans. Image Process. 1997, 6, 1305–1315. [Google Scholar] [CrossRef] [PubMed]
  40. Schlömer, T.; Heck, D.; Deussen, O. Farthest-point optimized point sets with maximized minimum distance. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, Vancouver, BC, Canada, 5–7 August 2011; pp. 135–142. [Google Scholar] [CrossRef]
  41. Li, J.; Zhou, J.; Xiong, Y.; Chen, X.; Chakrabarti, C. An adjustable farthest point sampling method for approximately-sorted point cloud data. In Proceedings of the 2022 IEEE Workshop on Signal Processing Systems (SiPS), Rennes, France, 2–4 November 2022; pp. 1–6. [Google Scholar] [CrossRef]
  42. Wu, W.; Qi, Z.; Fuxin, L. PointConv: Deep convolutional networks on 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9621–9630. [Google Scholar]
  43. Rusu, R.B.; Cousins, S. 3D is here: Point cloud library (PCL). In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1–4. [Google Scholar] [CrossRef]
Figure 1. Architecture diagram. It is mainly composed of four modules, namely the preprocessing module, the feature extraction module, the segmentation and classification module, and the parameter prediction module.
Figure 2. Flowchart of the preprocessing module. This module is utilized to filter out certain noise and to sample high-resolution point clouds.
Figure 3. Enhancing point cloud data through noise reduction, smoothing, and down-sampling.
Figure 4. Feature extraction network architecture diagram. This network implements a novel feature aggregation layer based on hierarchical PointConv kernels.
Figure 5. Segmentation and classification schematic. The point cloud is segmented by the network into individual constituent fragments, which are then classified according to geometric primitives.
Figure 6. Basic primitives and parameters.
Figure 7. Noise simulation by adding Gaussian noise with standard deviations: (a) original data, (b) 5% noise, (c) 10% noise, and (d) 15% noise, (eh) corresponding to the respective fitted results.
Figure 8. Primitive segmentation and surface reconstruction results.
Figure 9. Primitive fitting of CAD models. Top-to-bottom rows: initial points, point-to-primitive assignment (denoted by distinct colors), and primitive fitting.
Table 1. Results of all experiments. The bold values in the table represent the best results.
| Method | Seg. (Mean IoU) ↑ | Type Accuracy ↑ | Point Normal ↓ | Primitive Axis ↓ | Residual ↓ | P Coverage ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| RANSAC | 55.81 | 63.03 | 11.36 | 4.98 | 0.058 | 81.21 |
| SPFN | 75.66 | 88.73 | 9.43 | 1.67 | 0.004 | 88.13 |
| CPFN | 78.15 | 93.56 | 8.83 | 1.56 | 0.034 | 89.06 |
| Ours | 87.30 | 95.62 | 7.68 | 1.62 | 0.002 | 91.76 |
Table 2. Parameter of plane primitive. Prediction parameters and absolute errors of point clouds of planar primitives under different noises.
| Plane | Ground-Truth | Noise-Free | 5% Noisy | 10% Noisy | 15% Noisy |
| --- | --- | --- | --- | --- | --- |
| Normal_x | 1.000000 | 1.000000 | 0.999989 | 0.999998 | 0.999983 |
| Normal_y | 0 | 0 | −0.002726 | −0.002029 | −0.001594 |
| Normal_z | 0 | 0 | 0.001895 | 0.000851 | 0.005571 |
| d | 1 | 1 | 0.998976 | 1.014328 | 1.085687 |
| Absolute Error | - | 0 | 0.005656 | 0.017210 | 0.092869 |
Table 3. Parameter of sphere primitive. Prediction parameters and absolute errors of point clouds of sphere primitives under different noises.
| Sphere | Ground-Truth | Noise-Free | 5% Noisy | 10% Noisy | 15% Noisy |
| --- | --- | --- | --- | --- | --- |
| Center_x | 0.000000 | 0.000000 | 0.004954 | 0.029597 | 0.011205 |
| Center_y | 0.000000 | 0.000000 | 0.000071 | 0.009223 | 0.015344 |
| Center_z | 0.000000 | 0.000000 | 0.003963 | −0.017266 | −0.012764 |
| r | 6.000000 | 6.000000 | 6.002710 | 6.096920 | 6.217777 |
| Absolute Error | - | 0 | 0.011698 | 0.153006 | 0.257090 |
Table 4. Parameter of cylinder primitive. Prediction parameters and absolute errors of point clouds of cylinder primitives under different noises.
| Cylinder | Ground-Truth | Noise-Free | 5% Noisy | 10% Noisy | 15% Noisy |
| --- | --- | --- | --- | --- | --- |
| Axis_x | 0.000000 | 0.000000 | 0.001241 | 0.004835 | −0.001827 |
| Axis_y | 0.000000 | 0.000000 | 0.000197 | −0.002319 | 0.001084 |
| Axis_z | 1.000000 | 1.000000 | 0.993506 | 0.897432 | 0.932910 |
| Center_x | 0.000000 | 0.000000 | 0.000327 | 0.005059 | 0.008748 |
| Center_y | 0.000000 | 0.000000 | −0.000215 | −0.006206 | −0.000155 |
| Center_z | 0.000000 | 0.000000 | −0.002756 | 0.153001 | −0.344465 |
| r | 6.000000 | 6.000000 | 6.005907 | 6.010655 | 6.076211 |
| Absolute Error | - | 0 | 0.017137 | 0.284643 | 0.499580 |
Table 5. Parameter of cone primitive. Prediction parameters and absolute errors of point clouds of cone primitives under different noises.
| Cone | Ground-Truth | Noise-Free | 5% Noisy | 10% Noisy | 15% Noisy |
| --- | --- | --- | --- | --- | --- |
| Axis_x | 0.000000 | 0.000000 | 0.001528 | 0.001570 | 0.018011 |
| Axis_y | 0.000000 | 0.000000 | 0.000161 | 0.002301 | 0.000327 |
| Axis_z | 1.000000 | 1.000000 | 0.999618 | 1.000382 | 0.989207 |
| C_x | 0.000000 | 0.000000 | 0.000270 | 0.001648 | 0.011430 |
| C_y | 0.000000 | 0.000000 | 0.000261 | 0.000741 | 0.004834 |
| C_z | 0.000000 | 0.000000 | 0.000503 | 0.000653 | 0.001576 |
| θ | 0.785398 | 0.785398 | 0.753899 | 0.873067 | 0.633701 |
| Absolute Error | - | 0 | 0.034604 | 0.094964 | 0.198668 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wan, H.; Zhao, F. A Hierarchical Neural Network for Point Cloud Segmentation and Geometric Primitive Fitting. Entropy 2024, 26, 717. https://doi.org/10.3390/e26090717

AMA Style

Wan H, Zhao F. A Hierarchical Neural Network for Point Cloud Segmentation and Geometric Primitive Fitting. Entropy. 2024; 26(9):717. https://doi.org/10.3390/e26090717

Chicago/Turabian Style

Wan, Honghui, and Feiyu Zhao. 2024. "A Hierarchical Neural Network for Point Cloud Segmentation and Geometric Primitive Fitting" Entropy 26, no. 9: 717. https://doi.org/10.3390/e26090717

