1. Introduction
Point clouds are now widely available, owing to the continuing development of laser sensors and dense image matching techniques. The efficient classification of point clouds is one of the fundamental problems in scene understanding for three-dimensional (3D) digital cities, intelligent robots, and unmanned vehicles. However, classifying point clouds in complex urban environments is not a trivial task, since they are usually noisy, sparse, and unorganized [
1]. The density of point clouds varies with the sampling intervals and ranges of laser scanners. Moreover, severe occlusions between objects during scanning can lead to incomplete coverage of object surfaces. These problems present challenges for point cloud classification.
Point cloud classification can be accomplished via various approaches, such as region growing, energy minimization, and machine learning. A review of these approaches can be found in Nguyen and Le [
1]. Region growing is a segmentation method that partitions point clouds into disjoint homogeneous regions. Starting from curvature discontinuities and abrupt changes, the seed regions are grown according to the geometrical and topological continuity of the points. However, the process is sensitive to noise in the point cloud. One solution to this problem is to grow the regions according to the local geometrical characteristics [
2] so that the noise can be largely suppressed. Another way to solve this problem is to jointly use the edge and local region characters [
3] to reduce the computational complexity and improve the classification accuracy. However, the region growing approach is susceptible to the initial seeds [
4]; inaccurate seeds may lead to under- or over-segmentation, and the growing process is difficult to stop when the transition between two regions is smooth [
5].
The energy minimization approach is a global solution framework that formulates the classification as an optimization problem [
6]. The process starts by considering the point cloud as a graph. Each vertex corresponds to a point in the data, and the edges connect neighboring points [
7,
8]. By turning the segmentation into a min-cut/max-flow problem, the point cloud is segmented with graph cuts [
9,
10,
11]. In addition, many studies on the graph approach also cast it into a probabilistic inference model, such as the conditional random field (CRF) [
12,
13], which regards the classification as a multi-labeling optimization problem. By constraining the fidelity of the data, the continuity of feature values, and the compactness of the segment boundaries [
6], the energy function is minimized to ensure that the statistical differences are minimal within a class and maximal between different classes [
9,
10]. To efficiently combine other constraints, the weights in the graph also can be computed as a combination of Euclidean distances, pixel intensity differences, and the angles between the surface normals, among others [
14]. This global optimization approach is insensitive to noise, but its segmentation results are usually fragmented.
Machine learning aims to train an efficient classifier from enormous numbers of samples. To this end, a proper feature descriptor is important for the classification model. Generally, the currently available feature descriptors can be divided into three categories: (1) global, (2) local, and (3) regional [
15]. Global descriptors describe the holistic statistical characteristics of a class of objects, and are useful in object retrieval and recognition. However, they are sensitive to incomplete object extraction, which is a common problem in point cloud processing. Local descriptors represent the local properties of objects, such as the surface normals, surface curvatures, and eigenvalues of the covariance matrix; however, they are sensitive to noise. Regional descriptors further introduce some neighboring region information, including texture [
16], geometrical structure [
17], topology [
18], and context [
19]. In addition, as objects often present different properties at different spatial scales, the multi-scale or multi-resolution spatial feature descriptors [
20,
21,
22] can describe the objects across different scales. After the feature descriptors are extracted, the point cloud is classified with a machine learning classifier. The commonly used machine learning methods include support vector machine (SVM) [
23,
24], cascaded AdaBoost [
25], and random forest [
26,
27]. However, the accuracy of these classifiers is similar and sensitive to the selection of the feature descriptors [
25]. A good feature descriptor should be discriminative enough to adapt to complex situations. Nonetheless, the aforementioned feature extraction techniques are “knowledge-driven” and rely heavily on the human operator’s a priori knowledge, which can be an overwhelming requirement in complex urban environments.
Over the last several years, deep learning has led to substantial gains in many areas, owing to its powerful feature learning ability. It simulates the cognitive processes of the human brain to learn the discriminative features from enormous amounts of data. As one of the deep learning networks, the convolutional neural network (CNN) uses convolution kernels to simulate the receptive fields of the vision system. CNN has become one of the most efficient methods for image classification, document analysis, voice detection, and video analysis [
28,
29], among others. For 3D object recognition [
30,
31], which aims to assign a reasonable label to a cluster of points, CNN has achieved promising results for discriminative feature extraction and representation [
32,
33].
As for 3D point cloud classification, each point is labeled separately. Some researchers [
34] project the point cloud to a plane so that the standard two-dimensional (2D) CNN can be applied. Nevertheless, due to the occlusion among objects, directly projecting the point cloud to imagery inevitably loses the depth and 3D spatial structure information. Another way to apply CNN to a point cloud is by voxelizing the entire unorganized 3D scene into a 3D array of point clouds. This would allow using some classical image semantic labeling networks such as FCN [
35], SegNet [
36], Deeplab [
37], and DeconvNet [
38] for point cloud classification by extending 2D convolution kernels into 3D ones. However, a point cloud is not actually “3D data” of whole solid objects; rather, it records the objects’ surfaces, which form a manifold embedded in 3D space. Apart from these surfaces, the 3D space is largely empty. Simply voxelizing the entire 3D point cloud into a regular 3D array thus leads to huge unnecessary computation. Therefore, efficient networks constructed directly on points or voxels have increasingly become a topic of interest among research studies [
39,
40,
41].
One of our arguments is that objects of various sizes exist in a point cloud. For objects that are small in size, a fine scale within a small neighboring region is enough, while large objects require a coarse scale covering a large region. To adapt to the varying sizes of objects, multi-scale voxelization is proposed in this paper to “observe” small neighborhoods finely and large neighborhoods roughly. Instead of dividing the whole space into voxels of a fixed size, multi-scale voxelization divides a point cloud into voxels at multiple scales, thereby allowing the multi-scale features of objects of various sizes to be extracted based on those voxels. In addition, the spatial context information at different scales is integrated during multi-scale feature extraction.
Based on multi-scale voxelization, we further propose a multi-scale convolutional network (MSNet), with the aim of efficient feature learning and class prediction. In our proposed method, only the position information (x, y, and z coordinates) of a point cloud was considered, as the intensity or RGB information is not always available. Among the neighboring voxels around a point, MSNet is proposed for the discriminative feature learning of its local context. The multi-scale features of different spatial resolutions are learned with convolutional networks and fused across different scales. Meanwhile, as a result of multi-scale voxelization, the spatial context with different sizes is captured at different scales by the 3D convolution kernels of MSNet. With this strategy, the conventional pooling operation is not necessary in MSNet to robustly capture the multi-scale features, so the structure of MSNet becomes concise to implement. However, the classification of point clouds with MSNet is voxel-level, and inevitably influenced by noise. As such, a CRF that fully considers the spatial consistency of a point cloud is applied to achieve a global optimization of the predicted class probabilities. Therefore, our method incorporates both local and global constraints for a highly accurate point cloud classification.
The remainder of this paper is structured as follows.
Section 2 starts with multi-scale point cloud voxelization. MSNet is then established for the discriminative local feature learning, and global label optimization with CRF is employed.
Section 3 starts with the experimental data description, followed by a presentation of the individual and overall test results to demonstrate the solution procedure.
Section 4 is a series of discussions where we compare our proposed MSNet with some state-of-the-art methods, and analyze the generalization capability of the proposed approach. Both quantitative and qualitative evaluations are presented.
Section 5 consists of our concluding remarks on the properties of MSNet and our proposed future efforts.
2. Materials and Methods
As depicted in
Figure 1, the proposed method consists of two complementary parts. In Part I of
Figure 1, a point cloud is represented as multi-scale 3D voxels. Then, a corresponding MSNet is established for discriminative local feature learning to predict a class probability. In Part II of
Figure 1, the point cloud is regarded as an edge-weighted graph, and a CRF with spatial consistency constraints is constructed to obtain the global context. Finally, global label optimization is used to combine the local feature and the global context for accurate classification of the point cloud.
2.1. Multiscale Voxelization
Humans perceive the context of a scene at multiple scales: the scene as a whole at a coarse scale, and then objects, structures, edges, and points at progressively finer scales. It is a multi-scale observation process that combines information across scales to enable a comprehensive judgment. Similarly, to automatically understand point clouds across different scales for discriminative feature learning, the context of a point cloud is analyzed with multi-scale voxels centered at each point, which allows the network to observe closely at a fine scale and take a rough view at a coarse scale.
At each scale, for a given point $p$, a neighboring cube $C$ with side length $L$ is set up around it, as shown in Figure 2a. Then, the cube is subdivided into $n \times n \times n$ grid voxels [42] as a patch, and the side length of each voxel is $l = L/n$. $L$ corresponds to the size of the neighboring region: the smaller $L$ is, the finer the scale. For small objects, a fine scale within a small neighboring region is enough, whereas large objects require a coarse scale covering a large region. Conversely, observing small objects at a coarse scale omits details, while processing large objects at a fine scale may lead to high noise sensitivity and large unnecessary computation. Therefore, dividing the point cloud at multiple scales is necessary to accommodate the various sizes of objects.
According to the aforementioned analysis, we design a multi-scale voxelization frame in this paper. The patch size $n$ is fixed to the same, preferably small, number at all scales for computational efficiency. Then, a series of voxel side lengths $l_s$ with increasing values is applied. With a larger $l_s$, the cube side length $L_s = n\,l_s$ also increases and yields a coarse view of a large region, and vice versa. Instead of dividing the whole space into voxels of a fixed size, the multi-scale voxelization divides the individual context of each point into voxels at multiple scales, so that the spatial context information at different scales is well represented for each point. As shown in Figure 2b, with an equal patch size $n$ at all scales, spatial contexts of different sizes are obtained by changing the voxel length $l_s$ without compromising the computational efficiency.
Without loss of generality, the point density of each voxel, defined as the ratio of the point count within the voxel to its volume, is adopted as the representative value of the voxel. For fine voxels, the commonly used occupancy value (i.e., 1 if there is a point inside the voxel and 0 otherwise) is reasonable, as the voxel is small and contains only a few points or none at all. For coarse voxels, however, the number of points inside may vary greatly and cannot simply be approximated as 0 or 1. Compared with the occupancy value, the point density characterizes the degree of point occupancy within a voxel and is invariant to scale.
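As a concrete illustration of this voxelization step, the sketch below builds a single-scale density patch and stacks several scales by varying the voxel length while keeping the patch size fixed. It is a minimal NumPy sketch under our own assumptions (the function names, patch size `n`, and voxel lengths are illustrative, not taken from the paper):

```python
import numpy as np

def voxel_density_patch(points, center, n=10, voxel_len=0.2):
    """Compute an n x n x n point-density patch around `center`.

    Density = (points in voxel) / voxel volume, so the values stay
    comparable across scales, unlike a 0/1 occupancy code.
    """
    half = n * voxel_len / 2.0
    local = points - center                      # shift to the patch frame
    # keep only points inside the cube of side n * voxel_len
    mask = np.all(np.abs(local) < half, axis=1)
    idx = np.floor((local[mask] + half) / voxel_len).astype(int)
    idx = np.clip(idx, 0, n - 1)                 # guard boundary points
    grid = np.zeros((n, n, n))
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid / voxel_len ** 3                 # counts -> densities

def multiscale_patches(points, center, n=10, voxel_lens=(0.1, 0.2, 0.4)):
    """One fixed patch size n; increasing voxel lengths give coarser views."""
    return [voxel_density_patch(points, center, n, l) for l in voxel_lens]
```

Because the density divides the point count by the voxel volume, the representative values remain comparable when the voxel length changes across scales, which is the scale invariance argued for above.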
2.2. Multi-Scale Convolutional Network of a 3D Point Cloud
Based on the multi-scale voxelization, MSNet is proposed for discriminative local feature learning and class probability prediction, as shown in
Figure 3. With the multi-scale voxelization of point clouds, the multi-scale features of different spatial resolutions are learned with a series of convolutional networks (ConvNets) with shared weights, and are fused directly across scales. Owing to the multi-scale voxelization, the 3D convolution kernels of MSNet capture spatial contexts of different sizes at different scales. Thus, cascaded pooling operations are not necessary, and MSNet has a concise structure with fewer model parameters.
2.2.1. Multi-Scale Feature Extraction
Many excellent discriminative feature extraction methods have been proposed for point cloud classification [
16,
17,
18]. However, most of them are “knowledge-driven”, and are designed subjectively based on prior knowledge. Due to the influence of noise, occlusion, and various types and sizes of objects, these subjectively-designed features are difficult to use for characterizing the objects in a point cloud.
Owing to its convolution and pooling layer, CNN has recently been shown to have a powerful feature learning capability in the classification and semantic labeling of 2D images [
28,
29]. The kernels of the convolution layer simulate the receptive fields of human vision, while the pooling layer is applied for dimension reduction and to provide invariance to translation, rotation, and scale.
However, it is difficult to directly apply a conventional CNN to 3D point cloud classification. A CNN needs a regular 2D or 3D array as input, but when a 3D point cloud is simply projected into 2D imagery, it loses its 3D spatial context information. Dividing the point cloud into a regular 3D array of fixed resolution cannot adaptively reflect the different sizes of the objects in the point cloud, and also leads to large unnecessary computation on null values, even with subsequent pooling operations.
To address these problems, MSNet is proposed based on the 3D multi-scale voxelization of the point cloud. By simultaneously applying ConvNets at multiple scales, the multi-scale contextual features of objects of different sizes are extracted through discriminative feature learning. The ConvNets at different scales operate on spatial contexts of different region sizes, which act like the cascaded pooling operations of a normal CNN. Therefore, the pooling layer is not necessary in MSNet, and a shallower structure suffices owing to the simultaneous convolution at multiple scales.
At each point $p$ in the 3D scene, we first construct the corresponding multi-scale 3D voxels according to Section 2.1. Denote the patch of scale $s$ as $V_s \in \mathbb{R}^{n \times n \times n \times d}$, where the last dimension $d$ represents the number of features. In this paper, only the voxel density is considered, leading to $d = 1$. For each scale $s$, the 3D ConvNet $F_s$ can be described as a sequence of linear transforms and non-linear activations. For a 3D ConvNet with $M$ layers, the $j$-th output feature map of layer $m$ can be represented as:

$$ H_j^{(m)} = \sigma\Big( \sum_{i=1}^{d_{m-1}} W_{ij}^{(m)} * H_i^{(m-1)} + b_j^{(m)} \Big), \quad j = 1, \dots, d_m, \qquad (1) $$

for all $m = 1, \dots, M$, with $b_j^{(m)}$ as a bias term, where $H^{(0)} = V_s$. $W_{ij}^{(m)}$ is a convolution kernel of size $k \times k \times k$, $d_m$ is the number of feature maps of hidden layer $m$, and $*$ represents the 3D convolution operator. $\sigma(\cdot)$ is the activation function applied to each element of its input, which introduces the non-linearity of the network, reduces the vanishing of gradients, and speeds up training. In addition, the contribution of the neighboring voxels to the central one is similar across different scales and depends on their spatial relationship. To capture this characteristic and improve the generalization capability of our MSNet, the weights of the ConvNets are shared across different scales to reduce the number of model parameters, which also keeps MSNet concise.
The output of the 3D ConvNet at scale $s$ is obtained as:

$$ f_s = F_s(V_s) = H^{(M)}, \qquad (2) $$

which is regarded as the feature of point $p$ at scale $s$. A detailed convolution process at a single scale is provided in Figure 4.

Finally, the outputs of the ConvNets at all $S$ scales are flattened and fused to produce the final feature vector $f$, which can be seen as the multi-scale feature around point $p$:

$$ f = W_f \big[ \phi(f_1); \phi(f_2); \dots; \phi(f_S) \big] + b_f, \qquad (3) $$

where $\phi(\cdot)$ is the flatten function that stretches a matrix into a vector, $W_f$ represents the full-connection parameters, and $b_f$ is the corresponding bias.
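To make the shared-weight, pooling-free design concrete, the following sketch applies one 3D convolution layer with the *same* kernels to the patch of every scale, flattens the per-scale outputs, and fuses them with a single fully connected layer. This is our simplified illustration (one layer instead of M, a ReLU-style activation, and all shapes chosen for brevity), not the authors' implementation:

```python
import numpy as np

def conv3d(volume, kernels, bias):
    """Valid 3D convolution + ReLU.

    volume:  (n, n, n, c_in) density patch
    kernels: (k, k, k, c_in, c_out) shared across scales
    """
    n, k = volume.shape[0], kernels.shape[0]
    out = n - k + 1
    res = np.zeros((out, out, out, kernels.shape[-1]))
    for x in range(out):
        for y in range(out):
            for z in range(out):
                patch = volume[x:x + k, y:y + k, z:z + k, :]
                # contract over the k*k*k*c_in window for every output channel
                res[x, y, z, :] = np.tensordot(
                    patch, kernels, axes=([0, 1, 2, 3], [0, 1, 2, 3])) + bias
    return np.maximum(res, 0.0)   # non-linear activation (our assumption)

def msnet_features(patches, kernels, bias, W, b):
    """Shared-weight ConvNet per scale, flatten, then fuse with one FC layer."""
    feats = [conv3d(p, kernels, bias).ravel() for p in patches]  # same weights at every scale
    fused = np.concatenate(feats)                                # flatten-and-concatenate fusion
    return W @ fused + b                                         # fully connected fusion layer
```

Sharing `kernels` across the per-scale branches is what keeps the parameter count independent of the number of scales, mirroring the weight-sharing argument above.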
2.2.2. Discriminative Feature Learning
With the fused multi-scale feature $f_i$ of point $p_i$, our goal is to predict its class probability. To this end, we apply softmax regression to predict the probability distribution over the classes as:

$$ P(y_i = c \mid f_i) = \frac{\exp(w_c^{\top} f_i)}{\sum_{c' \in \mathcal{C}} \exp(w_{c'}^{\top} f_i)}, \qquad (4) $$

where $P(y_i = c \mid f_i)$ is the predicted probability of the $i$-th point belonging to class $c$, $\mathcal{C}$ denotes the set of classes, and $|\mathcal{C}|$ represents the number of classes.

Next, we construct the loss function using cross entropy, which measures the difference between the predicted probability distribution $P$ and the true probability distribution $Q$ of the classes:

$$ \mathcal{L} = - \sum_{i} \sum_{c \in \mathcal{C}} Q(y_i = c) \log P(y_i = c \mid f_i), \qquad (5) $$

where the network parameters can be learned by minimizing $\mathcal{L}$ with a mini-batch stochastic gradient descent algorithm. Once the network is trained, the loss function is no longer needed, and the predicted probabilities are used for further class label inference.
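A minimal sketch of this prediction and loss computation, assuming the per-point features have already been mapped to class logits and the true distribution is one-hot (represented by integer labels):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over class logits, Equation-(4)-style."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numeric stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, true_labels):
    """Mean negative log-likelihood of the true class (one-hot targets)."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), true_labels] + 1e-12).mean()
```

With one-hot targets the double sum over points and classes collapses to the log-probability of each point's true class, which is what the second function computes.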
2.3. Global Label Optimization with CRF
Point cloud classification must assign each point a unique label indicating its class. The simplest strategy is to give each voxel the label with the argmax of the predicted probabilities (Equation (4)), and then assign that label to the points in the corresponding voxel. However, such classification results are at the voxel level; they are inevitably influenced by noise and suffer from spatial inconsistency of the label prediction.
To address this issue, we use a CRF model with spatial consistency to globally optimize the class labels of the point cloud. For this purpose, we construct a graph $G = (V, E)$ with vertex set $V$ and edge set $E$. Each vertex is associated with a point, and edges are added between each point and its K-nearest points in the point cloud.
Let random variable $x_i$ be the label of point $p_i$, and let the random field $X$ consist of $\{x_1, x_2, \dots, x_N\}$, where $N$ is the total number of points. We regard vertex $v_i$ of the graph $G$ as the random variable of label $x_i$. We can then constitute the CRF model $(X, I)$ based on the graph $G$ of the point cloud, where $I$ is the global observation of $X$; $I$ corresponds to the predicted class probabilities of the point cloud, which are obtained with MSNet. The posterior probability of the point cloud being assigned the labeling $X$ under the global observation $I$ is then represented as:

$$ P(X \mid I) = \frac{1}{Z(I)} \exp\big( -E(X \mid I) \big), \qquad (6) $$

where $Z(I)$ is the normalization factor, and the energy of labeling $X$ can be represented as:

$$ E(X \mid I) = \sum_{i} \psi_u(x_i) + \sum_{(i,j) \in E} \psi_p(x_i, x_j). \qquad (7) $$

The labeling that maximizes the posterior probability is the most appropriate labeling of the point cloud; maximizing the posterior probability in Equation (6) is equivalent to minimizing the energy in Equation (7), which leads to a global optimization of the point cloud labels.
The data cost term $\psi_u(x_i)$ penalizes the disagreement between a point and its assigned label. In this paper, the initial data cost of each point is calculated from its predicted probability in Section 2.2 as the unary term:

$$ \psi_u(x_i) = - \sum_{c \in \mathcal{C}} \mathbb{1}(x_i = c) \log P(y_i = c \mid f_i), \qquad (8) $$

where $\mathbb{1}(\cdot)$ is an indicator function. The data cost enforces the label $x_i$ to be consistent with the predicted probability.

The smooth cost term $\psi_p(x_i, x_j)$ penalizes label inconsistency between neighboring points, so that neighboring points are encouraged to take similar labels. In this work, the K-nearest neighboring points are connected with the central point. The smooth cost is calculated according to the Euclidean distance between two points:

$$ \psi_p(x_i, x_j) = \mathbb{1}(x_i \neq x_j) \exp\big( -d(p_i, p_j) \big), \qquad (9) $$

where $d(p_i, p_j)$ is the 3D Euclidean distance between points $p_i$ and $p_j$. The smooth cost constrains the regularity and consistency of the labeling $X$.
Finally, the energy function $E(X \mid I)$ is minimized with the algorithm of [43,44,45]. A simple diagram of the optimization process is provided in Figure 5. The initial probabilities of each point are pre-predicted with MSNet, as described in Section 2.2. After several iterations, all of the class labels of the point cloud are globally optimized.
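The unary-plus-pairwise energy and its iterative minimization can be illustrated with a small sketch. Since the exact solver of [43,44,45] does not fit in a few lines, we use iterated conditional modes (ICM) as a lightweight stand-in that greedily minimizes the same kind of energy; the neighbor structure, the exp(-distance) smoothness weight, and all parameter choices here are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def crf_energy(labels, unary, neighbors, coords):
    """E = sum_i unary(i, x_i) + sum_(i,j) 1[x_i != x_j] * exp(-||p_i - p_j||)."""
    e = unary[np.arange(len(labels)), labels].sum()
    for i, js in enumerate(neighbors):
        for j in js:
            if labels[i] != labels[j]:
                e += np.exp(-np.linalg.norm(coords[i] - coords[j]))
    return e

def icm(probs, neighbors, coords, iters=5):
    """Greedy per-point relabeling that lowers the CRF energy each step."""
    unary = -np.log(probs + 1e-12)          # data cost from predicted probabilities
    labels = probs.argmax(axis=1)           # start from the argmax labels
    n_classes = probs.shape[1]
    for _ in range(iters):
        for i in range(len(labels)):
            costs = unary[i].copy()
            for j in neighbors[i]:          # add smooth cost against each neighbor
                w = np.exp(-np.linalg.norm(coords[i] - coords[j]))
                costs += w * (np.arange(n_classes) != labels[j])
            labels[i] = costs.argmin()
    return labels
```

On a toy example, a single noisy point surrounded by confidently labeled neighbors is flipped to the neighbors' class, which is exactly the spatial-consistency behavior the smooth cost is meant to enforce.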
5. Conclusions
The method proposed in this paper provides an efficient point cloud classification approach, which consists of two complementary parts. In the first part, the point cloud is represented as multi-scale 3D voxels, and a corresponding MSNet is proposed to learn the multi-scale discriminative local features and predict the class label of each point. In the second part, the coarse classification results of MSNet are globally optimized using CRF with a spatial consistency constraint on the point cloud.
Compared with the existing point cloud feature extraction methods, which mainly focus on designing and extracting features subjectively, the feature extraction in our method is adaptive and learning-based. With the proposed multi-scale voxelization of MSNet, the multi-scale discriminative feature of a point cloud is adaptively extracted and fused to comprehensively characterize the local spatial context of each point in a concise way.
To address the MSNet classification inconsistency of one object cluster, which is caused by the point-wise class prediction, CRF with spatial consistency is constructed based on the graph of the point cloud to achieve a global optimization for all of the predicted class labels.
The experimental results show that the proposed method not only works well for MLS point clouds, but also achieves much higher classification accuracies on ALS and TLS point clouds than the state-of-the-art methods, at 97.02% and 98.24%, respectively, thereby demonstrating the strong generalization capability of the proposed network for point cloud classification in complex urban environments.
However, the proposed solution also has its limitations. Although the multi-scale voxelization of point clouds substantially reduces the computational expense compared with a traditional CNN, further improvement of the point-wise classification is possible. Therefore, a new convolution kernel with angle parameters, which can adapt to the manifold structure and handle the point cloud efficiently with linear computation, will be considered in our future work. Additional experiments on larger data sets are also planned.