Three-dimensional point clouds obtained from mobile laser scanning (MLS) in road environments have received considerable attention due to the increasing demand for their accurate understanding [
1]. Point clouds can provide completeness and a rich level of detail for the objects located on roads. On the other hand, the massive volume of points collected by an MLS system often contains locally redundant data that inflates the data volume. Such data sets may also feature a variable point density and a high number of incomplete structures due to the presence of occlusions [
2]. These problems prevent, for example, the direct exploitation of the point clouds for three-dimensional high-precision mapping and autonomous vehicle navigation, as described in [
3]. Consequently, the classification of road infrastructure from such dense point clouds needs to be investigated both theoretically and practically.
The following subsections review relevant works in the literature on the detection and classification of objects from geospatial data, using either images or point clouds as data sources. Broadly speaking, the works are grouped into rule-feature-based and deep-learning-based methods.
1.1. Rule-Feature-Based Classification Methods
In early research works, a set of predefined discriminant rules was used to extract a single object (e.g., [
4,
5]). These rules are effective but show limitations when adopted in complex environments, which often contain considerable uncertainty and outliers [
6,
7]. To classify multiple object classes, machine learning techniques that leverage a priori knowledge have been proposed [
8], such as random forest (RF) [
9], support vector machine (SVM) [
10], decision tree [
11] and Euclidean cluster extraction (ECE) [
12].
Interesting applications of RF are briefly introduced here. Becker et al. trained RF and gradient boosted tree classifiers on a multi-scale pyramid with decreasing point densities, combined with HSV colour values from aerial photogrammetry data [
13]. Road curbs and markings in MLS data are detected by a binary kernel descriptor and RF classifiers [
2]. Niemeyer et al. integrated an RF classifier with the conditional random fields method, and demonstrated a 2% increase in the overall classification accuracy with contextual features considered [
14]. The limitation of this classification method is the over-smoothing problem, wherein both small and large objects can easily be misclassified [
15]. Other applications have also demonstrated the effectiveness of RF in various scenarios (e.g., [
16,
17]).
An SVM approach with geometrical and contextual features was proposed to extract 3D objects in urban scenes [
18]. For image classification, an SVM-based edge-preservation multi-classifier relearning framework was developed to interpret high-resolution images with high accuracy [
19]. Xiang et al. segmented the initial point clouds and then classified the extracted features with three popular classifiers: SVM, RF and extreme learning machine (ELM) [
20]. On average, the SVM and RF classifiers reached similar precision and recall rates in classifying ground, trees and buildings. Other similar applications also reported desirable performances of SVM [
21,
22]. However, these classifiers often label each point independently on the basis of its local features, without considering the semantic labels of the neighbouring points, which often leads to noisy results, especially in complex scenes [
23].
Given a set of points, the points within each cluster are similar to each other, whereas the points from different clusters are dissimilar. On the basis of this concept, Euclidean cluster extraction (ECE) adopts a 3D grid subdivision of the space that is fast to build and useful in situations where either a volumetric representation of the occupied space is needed, or the data in each resultant 3D grid can be approximated with a different structure [
12]. This strategy is well suited to road infrastructure, which may be segmented into clusters based on the Euclidean distance.
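As a minimal illustration of this idea (a sketch written for this review, not the implementation of the cited works; the function name and parameters are hypothetical), points can be grouped into a cluster whenever they lie within a Euclidean distance tolerance of a point already in that cluster:

```python
import numpy as np

def euclidean_clusters(points, tolerance=0.5, min_size=1):
    """Group points so that any two points closer than `tolerance`
    (Euclidean distance) end up in the same cluster."""
    n = len(points)
    labels = -np.ones(n, dtype=int)   # -1 means "not yet assigned"
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        queue = [seed]                # grow a new cluster from this seed
        labels[seed] = current
        while queue:
            i = queue.pop()
            # neighbours within the distance tolerance (brute force)
            d = np.linalg.norm(points - points[i], axis=1)
            for j in np.where((d < tolerance) & (labels == -1))[0]:
                labels[j] = current
                queue.append(j)
        current += 1
    sizes = np.bincount(labels)
    return [np.where(labels == c)[0] for c in range(current) if sizes[c] >= min_size]

# two groups of points well separated along x
pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [5.2, 0, 0]])
clusters = euclidean_clusters(pts, tolerance=0.5)
print(len(clusters))  # 2
```

A production implementation would use a spatial index (e.g., a KD-tree or the 3D grid mentioned above) instead of the brute-force neighbour search.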
1.2. Deep-Learning-Based Classification
Recently, deep learning techniques have been successfully applied to data segmentation and classification. Deep learning networks are composed of multiple processing layers that learn representations of the data at multiple levels of abstraction. Convolutional neural networks (CNNs) [
24] are the primary architecture that has been used in deep learning methods for segmenting and classifying objects [
25,
26]. In applications such as the classification of individual tree species, depth images are learned by a CNN to describe the characteristics of each species [
27]. For detecting multi-class geospatial objects, a weakly supervised deep learning method was proposed by leveraging pair-wise scene-level similarity to learn discriminative convolutional weights, and by using pairwise scene-level tags to learn class-specific activation weights [
28]. In [
29], an automated framework combining CNN and three-dimensional point-cloud features is applied to aerial imagery for the detection of severe building damages caused by disasters. These methods focused on the classification/extraction of objects in 2D aerial or satellite images.
Based on CNN, a fully convolutional network (FCN) takes inputs of arbitrary size and produces outputs of the corresponding size. It introduces skip connections as a way of fusing information from different depths, which correspond to different image scales [
30]. The U-net [
31] concatenates feature maps from the contracting path. It combines low-level details and high-level semantic information, and achieved good performance on biomedical image segmentation. The SegNet [
32] consists of an encoder network and a corresponding decoder network, which maps the low-resolution encoder features to all input-resolution features for a better segmentation accuracy. The DeconvNet [
33] fuses detail and semantic features for segmentation purposes; its up-sampling scheme is similar to that of SegNet.
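The skip-connection fusion used by these encoder-decoder architectures can be illustrated with a toy NumPy sketch, assuming nearest-neighbour up-sampling followed by channel-wise concatenation (the shapes and function names here are illustrative assumptions, not any network's actual implementation):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of an (H, W, C) feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def skip_fuse(encoder_feat, decoder_feat):
    """U-net-style fusion: up-sample the coarse decoder features and
    concatenate them channel-wise with the matching encoder features."""
    up = upsample2x(decoder_feat)
    assert up.shape[:2] == encoder_feat.shape[:2]
    return np.concatenate([encoder_feat, up], axis=-1)

enc = np.random.rand(8, 8, 16)   # high resolution, low-level detail
dec = np.random.rand(4, 4, 32)   # low resolution, high-level semantics
fused = skip_fuse(enc, dec)
print(fused.shape)  # (8, 8, 48)
```

The fused map keeps both the fine spatial detail of the encoder features and the semantic content of the decoder features, which is the motivation for skip connections described above.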
In two-dimensional images, the elementary radiometric information is organised in a regular grid of pixels, where the spatial relationships among pixels can be captured by using moving filtering windows. However, three-dimensional point clouds are unorganised point structures in which the density may be uneven [
34]. To overcome this drawback, point clouds are often transformed into regular three-dimensional voxels or two-dimensional raster structures before being fed to a deep learning network. Voxel-based (e.g., ShapeNet [
35]), multi-view-based (e.g., Multi-view CNN [
36]) and point-based CNN (e.g., PointNet [
37]) techniques are popular networks to process 3D data and to extract the features/characteristics of objects based on the CNN techniques.
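The voxelisation step mentioned above can be sketched as follows. This is a simplified NumPy illustration, not the preprocessing pipeline of any cited network; the function name and parameters are hypothetical:

```python
import numpy as np

def voxelize(points, voxel_size=1.0):
    """Map each 3D point to an integer voxel index and return the set of
    occupied voxels together with the per-voxel point counts."""
    idx = np.floor(points / voxel_size).astype(int)          # (N, 3) voxel coords
    voxels, counts = np.unique(idx, axis=0, return_counts=True)
    return voxels, counts

pts = np.array([[0.2, 0.3, 0.1],
                [0.4, 0.6, 0.9],   # same voxel as the first point
                [1.5, 0.2, 0.0]])  # a different voxel along x
voxels, counts = voxelize(pts, voxel_size=1.0)
print(len(voxels))   # 2 occupied voxels
print(counts.max())  # 2 points fall in the densest voxel
```

The resulting regular occupancy grid can then be processed with standard 3D convolutions, at the cost of the memory and quantisation issues that motivated point-based networks such as PointNet.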
Some interesting investigations on 3D data segmentation and classification are briefly introduced here. By projecting point clouds into raster data sets, road markings are extracted, classified and completed based on the popular U-net, CNN and generative adversarial network (GAN) networks, respectively [
38]. By generating a CNN to leverage a spatially local correlation, PointCNN [
39] is proposed to classify multiple benchmark data sets using an X-Conv operator, which weights and permutes the input point clouds. Instead of the sigmoid as the activation function, Zhang et al. [
40] used a rectified linear unit neural network (ReLU-NN) to speed up convergence and reduced the number of neurons to avoid over-fitting on airborne laser scanning data. KD-networks are designed for three-dimensional data recognition and were evaluated on open indoor data [
41]. For high-resolution three-dimensional data, OctNet hierarchically partitions the space with unbalanced octrees [
42].
A multi-layer perceptron (MLP) can be viewed as a logistic regression in which the input is first transformed using a learnt non-linear transformation, which projects the input onto a space where it becomes linearly separable. This intermediate layer is referred to as a “hidden” layer. A single hidden layer is sufficient to make the MLP a universal approximator. For very deep networks with hundreds of layers, ResNet [
43] was proposed to solve the gradient vanishing problem by using residual blocks. However, the direct application of an MLP to unsorted point clouds does not perform well [
37].
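A minimal forward pass of such an MLP, with one ReLU hidden layer and a sigmoid output acting as the logistic-regression read-out, might look as follows (an illustrative sketch; the weights and names are hypothetical):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: a learnt non-linear transform (ReLU) followed
    by a logistic-regression-style output layer."""
    h = np.maximum(0.0, x @ W1 + b1)        # hidden layer: ReLU(x W1 + b1)
    logits = h @ W2 + b2                    # linear read-out
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid -> probability in (0, 1)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))             # 5 samples, 3 input features
W1, b1 = rng.standard_normal((3, 8)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)), np.zeros(1)
probs = mlp_forward(x, W1, b1, W2, b2)
print(probs.shape)  # (5, 1)
```

The hidden layer is what makes the model more than a plain logistic regression: it maps the inputs into a space where the classes can become linearly separable.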
Instead of transforming irregular point clouds to voxel grids, Qi et al. [
37] directly exploited the point clouds for segmentation and classification by designing PointNet, which is permutation and transformation invariant. Evaluated on ModelNet40 [
36], the PointNet is robust and performs at the same level as, or, in some cases, even better than, other state-of-the-art solutions. Interesting applications are demonstrated, for example, in learning local normal and curvatures [
44] and in segmentation based on sections along the road [
45]. Later, Qi et al. [
46] introduced a PointNet++ network to cope with the uneven point cloud density. This network has been applied to the classification of coniferous and deciduous trees [
47]. VoteNet demonstrates significant improvements in object detection, and its authors suggest applying it to downstream point cloud segmentation [
48].
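PointNet's permutation invariance comes from applying a shared per-point transformation and then aggregating with a symmetric function (max pooling). The following toy NumPy sketch illustrates this idea only; it is not the actual PointNet architecture, and all names are hypothetical:

```python
import numpy as np

def pointnet_global_feature(points, W, b):
    """Shared per-point MLP followed by a symmetric function (max pooling),
    so the result does not depend on the ordering of the input points."""
    per_point = np.maximum(0.0, points @ W + b)  # same weights for every point
    return per_point.max(axis=0)                 # symmetric aggregation

rng = np.random.default_rng(1)
pts = rng.standard_normal((100, 3))              # unordered point cloud
W, b = rng.standard_normal((3, 64)), np.zeros(64)

f1 = pointnet_global_feature(pts, W, b)
f2 = pointnet_global_feature(pts[rng.permutation(100)], W, b)  # shuffled order
print(np.allclose(f1, f2))  # True: output is permutation invariant
```

Because the max is taken over the point dimension, shuffling the input rows leaves the global feature unchanged, which is why the network can consume unordered point sets directly.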
The PointNet and its variants were tested in indoor environments and provided reliable results, offering a new option of being transferred to other domains [
37]. However, PointNet and PointNet++ process each point in the local point set individually and do not extract the relationships, such as distance and edges, between a point and its neighbours [
49]. This may result in problems when classifying small objects and neighbouring objects that lie within a short distance from one another.