3.2. Handcrafted Feature
Following the success of general DNN-based point cloud semantic segmentation methods, several recent works have used DNN models for forest point cloud segmentation [7,8,9]. However, general point cloud semantic segmentation and tree point cloud semantic segmentation are essentially different tasks, since forest point clouds are much more difficult to annotate and the labelled forest point cloud datasets are much smaller. Therefore, using DNN-based methods alone may not be enough to learn effective representations from forest data. To obtain a better representation for the tree point cloud semantic segmentation task, we propose a handcrafted feature that explicitly characterises the local interaction of points. Motivated by our observation that stem and foliage points differ in the relative positions of their neighbours, we design a histogram-based local feature [70,71] that encodes, for each individual point, the directions of its neighbouring points.
More specifically, given a point coordinate $\mathbf{p} = (x, y, z)$ and the set $\mathcal{N}$ of its surrounding points selected by k-NN search in the 3D space, we first compute the orientation of the vector $\mathbf{p}_j - \mathbf{p}$ for each neighbour point $\mathbf{p}_j = (x_j, y_j, z_j) \in \mathcal{N}$ by projecting $\mathbf{p}_j - \mathbf{p}$ onto each of the three 2D planes (i.e., $X$-$Y$, $X$-$Z$, and $Y$-$Z$) and computing the three orientation angles corresponding to the three 2D components, i.e.,

        $$\theta^{xy}_j = g(x_j - x,\ y_j - y), \quad \theta^{xz}_j = g(x_j - x,\ z_j - z), \quad \theta^{yz}_j = g(y_j - y,\ z_j - z), \tag{1}$$

where the function $g(u, v)$ computes the angle of the 2D vector given by its two input variables, i.e., when using the $\theta^{xy}_j$ in (1) as an example,

        $$g(u, v) = \arctan\!\left(\frac{v}{u + \epsilon\,\mathbb{1}[u = 0]}\right) + \pi\,\mathbb{1}[u < 0] + 2\pi\,\mathbb{1}[u \geq 0]\,\mathbb{1}[v < 0], \tag{2}$$

where $\epsilon$ is a small positive value used to avoid a zero denominator when $u = 0$, and $\mathbb{1}[\cdot]$ is the binary indicator function.
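Concretely, an angle function of this form can be sketched in Python as follows; the function name `angle_2d` is our own, and the quadrant-shift terms are one standard way to map the arctangent into the $[0, 2\pi)$ range:

```python
import math

def angle_2d(u: float, v: float, eps: float = 1e-8) -> float:
    """Orientation of the 2D vector (u, v) in [0, 2*pi).

    eps is the small positive guard against a zero denominator
    when u == 0; the indicator-style shifts below place the
    arctangent value into the correct quadrant.
    """
    theta = math.atan(v / (u + eps * (u == 0)))
    if u < 0:
        theta += math.pi          # second and third quadrants
    elif v < 0:
        theta += 2.0 * math.pi    # fourth quadrant
    return theta
```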
For each of the three angles $\theta^{xy}_j$, $\theta^{xz}_j$, and $\theta^{yz}_j$ in (1), we evenly divide the $[0, 2\pi)$ interval into B bins and assign the angle to one of the bins. Using the $\theta^{xy}_j$ in (1) as an example, the index of the bin to assign the angle is computed as follows,

        $$b^{xy}_j = \left\lfloor \frac{\theta^{xy}_j}{2\pi / B} \right\rfloor, \tag{3}$$

where $\lfloor \cdot \rfloor$ is the floor operator for numbers. Then we compute the number of points assigned to each bin based on the orientations computed for all $\mathbf{p}_j \in \mathcal{N}$, i.e.,

        $$h^{xy}[b] = \sum_{\mathbf{p}_j \in \mathcal{N}} \mathbb{1}\big[b^{xy}_j = b\big], \tag{4}$$

where $b \in \{0, 1, \ldots, B-1\}$ is the bin index. By the same token, we compute $h^{xz}[b]$ and $h^{yz}[b]$ for the other two angles $\theta^{xz}_j$ and $\theta^{yz}_j$ in (1).
Finally, our histogram feature $\mathbf{h}$ is given by the vector of point numbers across all bins, i.e.,

        $$\mathbf{h} = \big[h^{xy}[0], \ldots, h^{xy}[B-1],\ h^{xz}[0], \ldots, h^{xz}[B-1],\ h^{yz}[0], \ldots, h^{yz}[B-1]\big]. \tag{5}$$
We illustrate our method for computing $\mathbf{h}$ in Figure 2.
The histogram feature $\mathbf{h}$ only contains the relative position information of points and does not utilise the original 3D point coordinates, while we empirically found it beneficial to incorporate the point coordinates into our local feature. More specifically, we improve the descriptiveness of our local feature simply by concatenating the coordinate $\mathbf{p}$ with the corresponding local feature $\mathbf{h}$, resulting in a feature dimension of $3B + 3$ for each point. For clarity, we denote the improved feature as $\tilde{\mathbf{h}}$, i.e.,

        $$\tilde{\mathbf{h}} = \big[x,\ y,\ z,\ \mathbf{h}\big]. \tag{6}$$
When using the handcrafted feature alone for segmentation, without integrating it with a DNN-based method, we train a simple multi-layer perceptron (MLP) classifier on top of it to predict the segmentation results. In this way, we achieve results on par with popular DNN-based segmentation methods. In Section 3.3, we combine the advantages of both our handcrafted feature and DNN-based methods for improved performance.
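As a sketch of this subsection's pipeline, the following NumPy function computes the per-point histogram feature concatenated with the coordinates. The function name and parameter defaults are illustrative, and for brevity we use NumPy's quadrant-aware `arctan2` in place of the ε-guarded angle formula:

```python
import numpy as np

def handcrafted_feature(points: np.ndarray, k: int = 16, n_bins: int = 8) -> np.ndarray:
    """Per-point histogram feature: for each point, bin the orientations
    of the vectors to its k nearest neighbours on the X-Y, X-Z, and Y-Z
    planes, then append the point's own coordinates.
    Returns an array of shape (n_points, 3 * n_bins + 3)."""
    n = len(points)
    feats = np.zeros((n, 3 * n_bins + 3))
    for i in range(n):
        # brute-force k-NN in 3D, excluding the point itself
        d = np.linalg.norm(points - points[i], axis=1)
        nbr = points[np.argsort(d)[1:k + 1]] - points[i]
        hist = []
        for (a, b) in [(0, 1), (0, 2), (1, 2)]:          # the three 2D planes
            theta = np.arctan2(nbr[:, b], nbr[:, a]) % (2 * np.pi)
            idx = np.minimum((theta / (2 * np.pi / n_bins)).astype(int),
                             n_bins - 1)                  # guard the 2*pi edge
            hist.append(np.bincount(idx, minlength=n_bins))
        feats[i] = np.concatenate(hist + [points[i]])
    return feats
```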
  3.3. Point Cloud Segmentation Model
In this subsection, we introduce our backbone semantic segmentation model, which we use for all the learning settings involved in this work, i.e., supervised learning, semi-supervised learning, and domain adaptation. Inspired by the success of existing works [23,24,25,26] that integrate a DNN with a handcrafted feature for image recognition and point cloud recognition tasks, we integrate our handcrafted local feature with a DNN model for the point cloud semantic segmentation task. In this way, our backbone point cloud semantic segmentation model combines the benefits of both types of methods.
For the DNN component in our point cloud semantic segmentation model, we use the popular PointNet++ [18] model. In the encoder network of PointNet++, the Farthest Point Sampling algorithm [72] is used to divide the points into local groups, and several set abstraction modules are used to gradually capture semantics from a larger spatial extent. Then, in the decoder, an interpolation strategy based on the spatial distance between point coordinates is used to propagate features from the semantically rich sampled points to all the original points.
To integrate our handcrafted feature with the PointNet++ model, for each individual point, we concatenate our handcrafted feature vector with the output feature vector of the last feature propagation layer in the decoder component of PointNet++. We illustrate our backbone point cloud semantic segmentation model in Figure 3.
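The integration step amounts to a per-point concatenation of the two feature sources before the final classification layers. A minimal sketch, with illustrative names and feature dimensions:

```python
import numpy as np

def fuse_features(dnn_feats: np.ndarray, hand_feats: np.ndarray) -> np.ndarray:
    """Concatenate, per point, the decoder output of the DNN branch
    (n_points x d_dnn) with the handcrafted feature (n_points x d_hand).
    The fused (n_points x (d_dnn + d_hand)) matrix then feeds the
    per-point classifier head."""
    assert dnn_feats.shape[0] == hand_feats.shape[0], "one row per point"
    return np.concatenate([dnn_feats, hand_feats], axis=1)
```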
  3.4. Learning Framework for Semi-Supervised Point Cloud Semantic Segmentation
To address the issue of limited labelled data for training the semantic segmentation model, we utilise unlabelled examples for training the model by employing semi-supervised learning. We propose a learning framework based on pseudo-labelling and model ensembling to utilise the unlabelled training data.
Formally, we denote the labelled point cloud dataset as $\mathcal{D}^l = \{(X_i, Y_i)\}_{i=1}^{N_l}$, where $(X_i, Y_i)$ is a labelled example with $X_i \in \mathbb{R}^{n_i \times 3}$, $n_i$ being the number of points in $X_i$, and $Y_i \in \{1, \ldots, K\}^{n_i}$ is the set of semantic labels corresponding to each point in $X_i$, with K being the total number of semantic classes. Similarly, we denote the unlabelled point cloud dataset as $\mathcal{D}^u = \{X^u_i\}_{i=1}^{N_u}$.
Inspired by the model ensembling and distillation strategy [52], in our learning framework, we employ two point cloud semantic segmentation models (i.e., one student model and one teacher model) with an identical architecture. During the training process, we first use the teacher model to produce pseudo-labels $\hat{Y}^u_i$ for each unlabelled example $X^u_i$ and train the student model on both the labelled examples and the pseudo-labelled examples, then update the variables in the teacher model by taking the Exponential Moving Average (EMA) of the variables in the student model. More specifically, we denote the student model function as $f_S(\cdot\,; \Theta_S)$, where the output is the prediction probability matrix over all points and semantic classes and $\Theta_S = \{\theta^j_S\}$ is the set of model variables, with j being the layer index of the DNN model. Similarly, we use $f_T(\cdot\,; \Theta_T)$ to denote the teacher model function. We also use the indices p and k to indicate the p-th point and the k-th semantic class in the prediction probability matrix, respectively. Then we formulate our pseudo-labelling strategy, overall training loss function, and EMA update strategy for the teacher model as Equations (7)–(9) as follows,

        $$\hat{y}^u_{i,p} = \mathop{\arg\max}_{k \in \{1, \ldots, K\}} \big[f_T(X^u_i; \Theta_T)\big]_{p,k}, \tag{7}$$

        $$\mathcal{L} = \sum_{(X_i, Y_i) \in \mathcal{D}^l} \ell_{CE}\big(f_S(X_i; \Theta_S), Y_i\big) + w \sum_{X^u_i \in \mathcal{D}^u} \ell_{CE}\big(f_S(X^u_i; \Theta_S), \hat{Y}^u_i\big), \tag{8}$$

        $$\theta^j_T \leftarrow \alpha\, \theta^j_T + (1 - \alpha)\, \theta^j_S, \quad \forall j, \tag{9}$$

where $\hat{y}^u_{i,p}$ is the p-th element in the pseudo-label vector $\hat{Y}^u_i$ (i.e., the pseudo-label predicted by the teacher model for the p-th point of $X^u_i$), $\ell_{CE}$ denotes the Cross-Entropy loss function, w is a weight factor for balancing the two loss terms, and $\alpha$ is a constant factor called momentum whose value we fix in our experiments.
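The three update steps above can be sketched with pure-NumPy stand-ins; the function names are our own, and `alpha=0.99` is an illustrative momentum value rather than the one used in the experiments:

```python
import numpy as np

def pseudo_labels(teacher_probs: np.ndarray) -> np.ndarray:
    """Teacher prediction probabilities (n_points x K) -> hard per-point
    pseudo-labels via the argmax over classes."""
    return teacher_probs.argmax(axis=1)

def ema_update(teacher_params, student_params, alpha=0.99):
    """Layer-wise EMA update of the teacher variables:
    theta_T <- alpha * theta_T + (1 - alpha) * theta_S."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]
```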
The main issue with the pseudo-labelling strategy is the quality of the pseudo-labelled examples, as a lack of variety can harm model performance. Inspired by [73], we propose a pseudo-label selection strategy based on the entropy ranking of individual points, which selects the pseudo-labelled points in each point cloud that are beneficial for training the segmentation model. More specifically, for each $X^u_i$, we compute the entropy of the prediction probabilities of each individual point and select the top-ranked points with the highest entropy as the pseudo-labelled points. We then denote the updated $X^u_i$, containing only the selected points and their corresponding pseudo-labels, as $\tilde{X}^u_i$. In this way, we select the hard examples, which are better for the variety of the training dataset. We illustrate our complete learning framework for semi-supervised and cross-dataset point cloud semantic segmentation in Figure 4.
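The entropy-based selection can be sketched as follows; the number of selected points `n_select` is left as a free parameter, as in the text:

```python
import numpy as np

def select_high_entropy_points(probs: np.ndarray, n_select: int) -> np.ndarray:
    """Return the indices of the n_select points whose prediction
    distributions (rows of an n_points x K probability matrix) have
    the highest entropy, i.e. the 'hard' points."""
    eps = 1e-12                                   # numerical guard for log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:n_select]   # descending entropy order
```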
  3.5. Extending Semi-Supervised Learning Framework to Domain Adaptation
In practice, aside from the shortage of labelled training data, we often face situations where the training data and test data are collected from different scenes and do not follow the same data distribution. For example, in the forest inventory scenario, the training and test data can be of different tree species, or collected using different types of LiDAR devices or from different sites. In learning theory, this is known as the “domain adaptation” or cross-dataset generalisation problem, and the distributions of the training and test data are called the source domain and the target domain, respectively. The domain adaptation problem is usually tackled by exploiting training data from the target domain, while we tackle the more challenging setting of unsupervised domain adaptation, where the target training data are unlabelled. Using unlabelled target training data is more advantageous than using labelled training data in real-world applications since it saves the time and cost of annotation.
For the cross-dataset point cloud semantic segmentation task, we can extend our semi-supervised learning framework in Section 3.4 to the domain adaptation setting by simply replacing the training datasets, i.e., we replace the labelled dataset $\mathcal{D}^l$ and the unlabelled dataset $\mathcal{D}^u$ with the labelled source dataset $\mathcal{D}^s$ and the unlabelled target dataset $\mathcal{D}^t$, respectively. Different from $\mathcal{D}^l$ and $\mathcal{D}^u$ in semi-supervised learning, which are assumed to be drawn from the same data distribution, $\mathcal{D}^s$ and $\mathcal{D}^t$ in domain adaptation are drawn from different data distributions.
  3.6. Tree Parameter Estimation Model
Existing works in tree parameter estimation mainly fit a parametric model characterising the tree stem geometry to each tree point cloud individually. In contrast to the existing works, we propose a data-driven, DNN-based method for tree parameter estimation that predicts the cylindrical parameters of the tree stem [10] while learning from the variety of the data for improved robustness and adaptability to geometric variations across different individual trees.
In particular, our DNN tree parameter estimation method consists of three components: splitting the point cloud into sub-clouds based on height, extracting features from each sub-cloud, and feature processing. Each individual tree point cloud $X_i$ is first passed through the point cloud division component, where we divide the height range from 0 to 50 m into M segments and group the subset of points (both stem and foliage points) falling into the m-th height segment as the tree point cloud segment $X^{(m)}_i$. Here, we overload the notation by using the superscript of $X^{(m)}_i$ to indicate the tree segment index instead of the dataset as in Section 3.4 and Section 3.5. Then, in the segment-wise feature extraction component, we use a PointNet [17] feature extractor on each tree point cloud segment individually to extract segment-wise semantics, where the PointNet feature extractors for all the individual tree segments share the same set of model parameters. The features extracted across the segments of an individual tree naturally form a sequence. Finally, in the feature processing component, we employ a Long Short-Term Memory (LSTM) [74] model to process the sequential features across all the tree segments $X^{(1)}_i, \ldots, X^{(M)}_i$ and predict the tree parameters for each segment, i.e., the planar centre coordinates $c^{x,(m)}_i$ and $c^{y,(m)}_i$ of the stem segments along the X- and Y-axes and the stem segment radii $r^{(m)}_i$. For ease of notation, we write the tree parameters $c^{x,(m)}_i$, $c^{y,(m)}_i$, and $r^{(m)}_i$ collectively as $\mathbf{t}^{(m)}_i = (c^{x,(m)}_i, c^{y,(m)}_i, r^{(m)}_i)$.
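The height-based division step can be sketched as follows; the 0 to 50 m range follows the description above, while the function name and default segment count are our own:

```python
import numpy as np

def split_by_height(points: np.ndarray, n_segments: int = 10,
                    h_min: float = 0.0, h_max: float = 50.0):
    """Divide the [h_min, h_max] height range into n_segments equal
    slices and group the points (rows of an n x 3 array) by the slice
    their Z-coordinate falls into. Returns a list of n_segments arrays,
    one per tree point cloud segment (possibly empty)."""
    edges = np.linspace(h_min, h_max, n_segments + 1)
    idx = np.clip(np.searchsorted(edges, points[:, 2], side="right") - 1,
                  0, n_segments - 1)
    return [points[idx == m] for m in range(n_segments)]
```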
Inspired by the recent works in point cloud object detection [75,76,77], for estimating the coordinates $c^{x,(m)}_i$ and $c^{y,(m)}_i$, we let our tree parameter estimation model predict the residual between these coordinates and the point centroid $\bar{\mathbf{c}}^{(m)}_i$ of each point cloud segment instead of predicting them directly. When computing each centroid $\bar{\mathbf{c}}^{(m)}_i$, we first compute the initial centroid $\bar{\mathbf{c}}^{(m),0}_i$ by simply averaging the X- and Y-coordinates of the points within the corresponding point cloud segment; then we evenly divide the $X$-$Y$ plane centred on $\bar{\mathbf{c}}^{(m),0}_i$ into 16 directional bins. Finally, we compute $\bar{\mathbf{c}}^{(m)}_i$ by randomly sampling a maximum of 4 points from each directional bin and averaging the X- and Y-coordinates of the sampled points. By refining $\bar{\mathbf{c}}^{(m),0}_i$ into $\bar{\mathbf{c}}^{(m)}_i$ in this way, we can reduce the errors of the initial centroids caused by sampling bias during LiDAR scanning. Therefore, we formulate our total loss function across all the tree segments as follows,

        $$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{M} \ell_H\Big(f^{(m)}_{LSTM}\big(f_{PN}(X^{(1)}_i), \ldots, f_{PN}(X^{(M)}_i)\big) + \bar{\mathbf{t}}^{(m)}_i,\ \mathbf{t}^{(m)}_i\Big), \tag{10}$$

where $\bar{\mathbf{t}}^{(m)}_i = (\bar{\mathbf{c}}^{(m)}_i, 0)$ is the centroid coordinate vector extended by an additional zero to match the dimension of $\mathbf{t}^{(m)}_i$, N is the number of tree point clouds in the dataset, $\ell_H$ is the Huber loss function with the $\delta$ coefficient set to 0.01, and $f_{PN}$ and $f^{(m)}_{LSTM}$ are the model functions of the PointNet feature extraction component and the LSTM feature processing component, respectively. $f_{PN}$ maps each point cloud segment into a D-dimensional feature vector, while each $f^{(m)}_{LSTM}$ maps its input features into the tree parameters of the m-th segment. We illustrate our tree parameter estimation model in Figure 5.
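The two-stage centroid computation described above can be sketched as follows; the 16 directional bins and the maximum of 4 samples per bin follow the text, while the remaining implementation details are our own:

```python
import numpy as np

def refined_centroid(points_xy: np.ndarray, n_dirs: int = 16,
                     max_per_bin: int = 4, rng=None) -> np.ndarray:
    """Refine the plain X-Y centroid of a stem segment: divide the plane
    around the initial centroid into n_dirs directional bins, randomly
    sample at most max_per_bin points from each bin, and average the
    samples. This counteracts directions over-represented by the scanner."""
    rng = np.random.default_rng(rng)
    c0 = points_xy.mean(axis=0)                   # initial centroid
    rel = points_xy - c0
    ang = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    bins = (ang / (2 * np.pi / n_dirs)).astype(int) % n_dirs
    picked = []
    for b in range(n_dirs):
        members = np.where(bins == b)[0]
        if len(members) > max_per_bin:
            members = rng.choice(members, max_per_bin, replace=False)
        picked.extend(members)
    return points_xy[picked].mean(axis=0)         # refined centroid
```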
In addition, we also use a data augmentation strategy during the training process to improve the robustness of our tree parameter estimation method. For each $X_i$, we apply a rotation with a random angle along the vertical Z-axis to the whole point cloud $X_i$, while we also apply the same rotation to the $c^{x,(m)}_i$ and $c^{y,(m)}_i$ coordinates in each $\mathbf{t}^{(m)}_i$ across all m. We denote this example-specific rotation operator as $R_i(\cdot)$. Therefore, by incorporating our data augmentation strategy into the training process, our loss function in (10) is updated as,

        $$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{M} \ell_H\Big(f^{(m)}_{LSTM}\big(f_{PN}(R_i(X^{(1)}_i)), \ldots, f_{PN}(R_i(X^{(M)}_i))\big) + R_i(\bar{\mathbf{t}}^{(m)}_i),\ R_i(\mathbf{t}^{(m)}_i)\Big). \tag{11}$$
Note in (11) that our data transform $R_i$ works differently for $X_i$ and $\mathbf{t}^{(m)}_i$, since the Z-coordinate in $\mathbf{t}^{(m)}_i$ is replaced by the radius $r^{(m)}_i$; for ease of notation, however, we write $R_i$ as the same type of transform for both $X_i$ and $\mathbf{t}^{(m)}_i$. We only use random rotation for $R_i$ during the training process, and we use the identity operator for $R_i$ during testing.
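The rotation augmentation can be sketched as follows: a single random angle is shared between the point cloud and the planar centre targets, while the radius targets are rotation-invariant and stay unchanged. The function name and interface are our own:

```python
import numpy as np

def random_z_rotation(points: np.ndarray, centers_xy: np.ndarray, rng=None):
    """Apply one shared random rotation about the vertical Z-axis to a
    tree point cloud (n x 3) and to the planar stem centres (m x 2).
    Z-coordinates of the points are unaffected by the rotation."""
    rng = np.random.default_rng(rng)
    a = rng.uniform(0.0, 2.0 * np.pi)
    rot2 = np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])
    pts = points.copy()
    pts[:, :2] = points[:, :2] @ rot2.T   # rotate X-Y, keep Z
    return pts, centers_xy @ rot2.T
```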
In addition, we also propose a simple tree point cloud semantic segmentation model that is induced from our tree parameter estimation model. Intuitively, for each cylinder segment characterised by the parametric tree model, we classify a point within the corresponding point cloud segment as stem if the point falls within the interior of the cylinder segment; otherwise, we classify the point as foliage. More specifically, given a tree stem segment with parameters $\mathbf{t}^{(m)} = (c^x, c^y, r)$ and a point $\mathbf{q} = (q^x, q^y, q^z)$ from the corresponding point cloud segment, we classify $\mathbf{q}$ as stem if it falls within a distance threshold of the stem centre, i.e.,

        $$\sqrt{(q^x - c^x)^2 + (q^y - c^y)^2} \leq \gamma\, r, \tag{12}$$

where $\gamma$ is a positive coefficient we use to improve the robustness of the model. We name this induced tree point cloud semantic segmentation model the cylinder segmentation model. Following the procedure of our tree parameter estimation method, we use the PCA-transformed points in our cylinder segmentation.
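The induced cylinder classification rule can be sketched as follows; `gamma` stands for the positive robustness coefficient, and its default value here is our illustrative choice:

```python
import numpy as np

def cylinder_segmentation(points: np.ndarray, cx: float, cy: float,
                          r: float, gamma: float = 1.1) -> np.ndarray:
    """Classify each point of a segment as stem (True) or foliage (False):
    a point is stem if its horizontal distance to the stem axis (cx, cy)
    is within gamma * r; gamma > 1 relaxes the cylinder boundary to
    tolerate estimation error in the fitted radius."""
    dist = np.hypot(points[:, 0] - cx, points[:, 1] - cy)
    return dist <= gamma * r
```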